Using awk to isolate specific rows

10-07-2011

Registered User

10, 0

Join Date: Oct 2011

Last Activity: 2 December 2012, 3:58 AM EST

Posts: 10

Thanks Given: 9

Thanked 0 Times in 0 Posts

Hi all!

Let's say I have obtained this dataset from the ISI Web of Knowledge

Code:

...
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

PT J
AU Ooi, C. H. Raymond
TI Conversion of heat to light using Townes' maser-laser engine: Quantum
   optics and thermodynamic analysis
SO PHYSICAL REVIEW A
VL 83
IS 4
AR 043838
DI 10.1103/PhysRevA.83.043838
PD APR 29 2010
PY 2010
TC 0
Z9 0
SN 1050-2947
UT WOS:000290107500018
ER
...

This is just a snippet. I would like to place each entry (everything from PT J to ER counts as one entry) into a separate file according to its year. Therefore

Code:

PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

will be in the file named 2011, and so forth. I tried to do it with this:

Code:

i=1990
while [ "$i" -lt 2011 ]
        do
                gawk '/AU/ , /'PY "$i"'/' savedrecs.txt savedrecs2.txt > "$i"
                ((i+=1))
        done

where savedrecs.txt and savedrecs2.txt are the files containing the raw data from the database. It doesn't work, because gawk just keeps dumping lines to $i until it meets the matching PY "$i".

Any ideas? Thanks in advance..

Moderator's Comments:

Video tutorial on how to use code tags in The UNIX and Linux Forums.

Last edited by radoulov; 10-07-2011 at 06:57 AM..

sidiqmk

10-07-2011

Registered User

5,690, 630

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

Code:

awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r >> y
  r = x; close(y)
  }' infile

Let me know if you want to discard data outside PT J and ER.
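
If data outside PT J ... ER should be discarded, one way it might look is sketched below. This is not part of the original reply: the sample file, the temporary directory and the variable names (workdir, inrec) are made up for illustration, and it assumes every entry starts with a line beginning with "PT " and ends with a line beginning with "ER".

```shell
# Hypothetical sample: two entries plus stray lines outside PT..ER
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > sample.txt <<'EOF'
FN stray header line
PT J
AU Yousefi, Ramin
PY 2011
ER

stray line between records
PT J
AU Ooi, C. H. Raymond
PY 2010
ER
EF
EOF

awk '
/^PT / { inrec = 1; r = ""; y = "" }   # a new entry starts here
inrec  { r = r $0 "\n" }               # collect lines only while inside an entry
/^PY / { y = $2 }                      # remember the publication year
/^ER/ && inrec {
  if (y != "") { printf "%s", r >> y; close(y) }   # append the entry to its year file
  inrec = 0                            # ignore everything until the next PT line
}' sample.txt
```

Because the rules run in order for each line, the PT line switches collection on before the second rule stores it, and the ER line is stored before the last rule writes the entry out.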

Last edited by radoulov; 10-07-2011 at 07:10 AM..

radoulov

10-07-2011

Hi, thanks for replying.

I don't really understand what's going on, but I'll pick it up later. However, when I ran the commands, only one entry seems to end up in each file. There should be more, as there are more than 900 entries.

I'm sorry for inconveniencing you, and thanks again for the help.

sidiqmk

10-07-2011

OK,
I edited my post and modified the script to handle multiple entries (you should remove the previously created files before rerunning the script, though).

This version will be more efficient, but it keeps one output file open per year, so some awk implementations may hit their limit on the number of concurrently open files with it:

Code:

awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r > y
  r = x
  }' infile

Last edited by radoulov; 10-07-2011 at 07:19 AM..

radoulov

10-07-2011

Thanks a lot. Works like a charm now. I will now go find out what these commands mean.

sidiqmk

10-07-2011

I'll try to explain

I'll take the first one, because it's more complicated.

This is the first action block:

Code:

{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }

While reading every[1] input record (most awk code outside of the BEGIN/END/BEGINFILE/ENDFILE special patterns is wrapped in an implicit loop):

Code:

/^PY/ && y = $2

When the current record matches the regular expression above (i.e. begins with the string PY), set the content of the variable y to the value of the second field (the year).

Code:

r = r ? r RS $0 : $0

Store all the records in the variable r. This statement uses the ternary operator,
in pseudo-code:

Code:

expression ? if_true_return_this : otherwise_return_this

It simply appends every record to the variable r, using RS (the record separator, a newline by default) as the glue between lines. This approach is rather fragile: in some situations you will lose data (when the first line is empty (NULL), contains only whitespace characters, or contains only the digit 0 (zero)). Let me know if you need a more robust solution.
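
A more robust accumulation might look like the sketch below, which is an illustration and not the author's solution. Instead of testing whether r is "true", it counts the lines collected so far in a hypothetical counter n, so a first line that is empty or contains only 0 is kept. The sample input and the workdir directory are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical input: the entry starts with a line containing only 0
printf '0\nPT J\nPY 2011\nER\n' > sample.txt

awk '
/^PY/ { y = $2 }
{ r = (n++ ? r RS : "") $0 }   # n counts lines collected so far; RS is a newline by default
/^ER/ {
  print r >> y                 # print adds the final newline (ORS)
  close(y)
  r = ""; n = 0                # start the next entry from scratch
}' sample.txt

cat 2011
```

The only change to the logic is the condition of the ternary: n++ is 0 (false) for the first line of an entry and non-zero afterwards, whatever the lines themselves contain.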

The next pattern/action pair:

Code:

/^ER/ {
  print r >> y
  r = x; close(y)
  }

If the current input record matches ^ER, append (>>) the content of the r variable to the file named y (the year of the current logical record, saved earlier).
Reset r (x is an uninitialized variable; in awk those are NULL when used as strings or 0 when used as numbers).
Close the output file y, not the input file (needed with some awk implementations on certain systems, which limit how many files can be open at once).

[1] I say every record because in that rule the pattern part is missing, so the action is executed for every record.

Hope this helps.

radoulov

10-08-2011

Hi, thanks for the tutorial. I really got the picture, except for the r RS $0 part, but it's OK, I'll look it up some other time. Now I've tried to use your script to collect articles by author. Let's say I'd like to collect all of Mr Yousefi's papers into one file; I'd type

Code:

gawk '{
  /"Yousefi.*$"/ && y = yousefi
  r = r ? r RS $0 : $0
  }
/^ER/ {
  print r > y
  r = x
  }' savedrecs.txt savedrecs2.txt

I'm pretty sure I'm doing it wrong, because the error I get is
gawk: cmd. line:6: (FILENAME=savedrecs.txt FNR=21) fatal: expression for `>' redirection has null string value
which, I think, means y is not initialized.

Please help. Thanks again.

Last edited by radoulov; 10-08-2011 at 05:23 AM.. Reason: Code tags!

sidiqmk
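
The thread ends without a reply to this last question. As a hedged sketch only: in the attempt above, the double quotes inside /.../ are literal characters of the regular expression, so it does not match the author lines, and the bare word yousefi is an uninitialized awk variable (an empty string), not a file name, which leaves y empty at the redirection. One possible fix is shown below. The sample file, the second (made-up) record, and the names out, tag and hit are assumptions for illustration, and it assumes every field starts with a two-character tag while continuation lines start with spaces.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical sample; the second record is invented to show a title-only match
cat > sample.txt <<'EOF'
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
PY 2011
ER

PT J
AU Doe, Jane
TI A reply to Yousefi
PY 2010
ER
EOF

awk -v out=yousefi '
NF { r = r $0 "\n" }                         # collect non-empty lines of the entry
/^[A-Z0-9][A-Z0-9]( |$)/ { tag = $1 }        # continuation lines keep the previous tag
tag == "AU" && /Yousefi/ { hit = 1 }         # match author lines only, no quotes in the regex
/^ER/ {
  if (hit) printf "%s", r > out              # out is a string passed with -v
  r = ""; hit = 0                            # reset for the next entry
}' sample.txt
```

Tracking the current tag keeps an entry whose title merely mentions the name out of the result, and it also lets the match succeed when the name appears on a continuation line of the AU field.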
