Using awk to isolate specific rows


 
# 1  
Old 10-07-2011

Hi all!

Let's say I have obtained this dataset from the ISI Web of Knowledge

Code:
...
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

PT J
AU Ooi, C. H. Raymond
TI Conversion of heat to light using Townes' maser-laser engine: Quantum
   optics and thermodynamic analysis
SO PHYSICAL REVIEW A
VL 83
IS 4
AR 043838
DI 10.1103/PhysRevA.83.043838
PD APR 29 2010
PY 2010
TC 0
Z9 0
SN 1050-2947
UT WOS:000290107500018
ER
...

This is just a snippet. I would like to write each entry (everything from PT J to ER counts as one entry) to a separate file named after its year (the PY field). For example, this entry

Code:
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
   Zak, Ali Khorsand
TI The effect of source temperature on morphological and optical properties
   of ZnO nanowires grown using a modified thermal evaporation set-up
SO CURRENT APPLIED PHYSICS
VL 11
IS 3
BP 767
EP 770
DI 10.1016/j.cap.2010.11.061
PD MAY 2011
PY 2011
TC 2
Z9 2
SN 1567-1739
UT WOS:000288183300097
ER

will be in the file named 2011 and so forth. I tried to do it with this

Code:
i=1990
while [ "$i" -lt 2011 ]
        do
                gawk '/AU/ , /'PY "$i"'/' savedrecs.txt savedrecs2.txt > "$i"
                ((i+=1))
        done


where savedrecs.txt and savedrecs2.txt are the files containing the raw data from the database. It doesn't work, though, because gawk just keeps dumping lines into $i until it meets the matching PY "$i"..

Any ideas? Thanks in advance..



Last edited by radoulov; 10-07-2011 at 06:57 AM..
# 2  
Old 10-07-2011
Code:
awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r >> y
  r = x; close(y)
  }' infile

Let me know if you want to discard data outside PT J and ER.

Last edited by radoulov; 10-07-2011 at 07:10 AM..
# 3  
Old 10-07-2011
Hi thanks for replying..

I don't really understand what's going on but I'll pick it up later.. However when I ran the commands, it seems like only 1 occurrence will be placed in a file.. There should be more as there are more than 900 entries..

I'm sorry for inconveniencing you and thanks for the help again..
# 4  
Old 10-07-2011
OK,
I edited my post and modified the script to handle multiple entries (remove the previously created files before rerunning the script, though).

This version is more efficient, but with it some awk implementations may hit the limit on concurrently open files:
Code:
awk '{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }
/^ER/ {
  print r > y
  r = x
  }' infile


Last edited by radoulov; 10-07-2011 at 07:19 AM..
# 5  
Old 10-07-2011
Thanks a lot.. Works like a charm now.. I will now go find out what these commands mean..
# 6  
Old 10-07-2011
I'll try to explain :)

I'll take the first one, because it's more complicated.

This is the first action block:
Code:
{ 
  /^PY/ && y = $2
  r = r ? r RS $0 : $0 
  }

While reading every[1] input record (most awk code outside of the BEGIN/END/BEGINFILE/ENDFILE special patterns is wrapped in an implicit loop):

Code:
/^PY/ && y = $2

When the current record matches the regular expression above (i.e. begins with the string PY), set the content of the variable y to the value of the second field (the year).
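
The same step can also be written as an ordinary pattern/action rule, which may be easier to read. A small sketch that compares the two forms; the directory demo_py and the three-line sample input are made up for illustration:

```shell
mkdir -p demo_py
printf 'PT J\nPY 2011\nER\n' > demo_py/sample.txt

# Short-circuit form from the thread: the assignment runs only when /^PY/ matches.
awk '{ /^PY/ && y = $2 } END { print y }' demo_py/sample.txt > demo_py/short.out

# Equivalent explicit rule: pattern first, action in braces.
awk '/^PY/ { y = $2 } END { print y }' demo_py/sample.txt > demo_py/explicit.out
```

Both commands should leave the same year in their output file.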

Code:
r = r ? r RS $0 : $0

Store all the records in the variable r. This statement uses the ternary operator,
in pseudo-code:

Code:
expression ? if_true_return_this : otherwise_return_this

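It simply appends each record to the variable r: if r already holds something, the new value is r, then RS (the record separator, a newline by default), then the current record; otherwise it is just the current record. This approach is rather fragile: in some situations you will lose data, namely when the first line is empty or contains only the digit 0 (zero), because r then evaluates as false and is overwritten by the next record. Let me know if you need a more robust solution.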
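
One possible sturdier variant, as a sketch rather than the script used in this thread: decide with a line counter instead of testing r itself. The counter name n, the directory demo_append and the sample input are assumptions made for illustration.

```shell
mkdir -p demo_append
# Hypothetical three-line input whose first line is just "0".
printf '0\nsecond line\nER\n' > demo_append/sample.txt

# Ternary form from the thread: "0" tests as false, so it is overwritten.
awk '{ r = r ? r RS $0 : $0 } END { print r }' \
  demo_append/sample.txt > demo_append/fragile.out

# Counter form: prepend "r RS" for every line except the first one.
awk '{ r = (n++ ? r RS : "") $0 } END { print r }' \
  demo_append/sample.txt > demo_append/robust.out
```

When this is used per entry, n has to be reset together with r in the /^ER/ rule.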

The next pattern/action pair:

Code:
/^ER/ {
  print r >> y
  r = x; close(y)
  }

If the current input record matches ^ER, append (>>) the content of the r variable to the file named y (the year saved earlier for this logical record).
Reset r (x is an uninitialized variable; in awk those are the empty string when used as strings and 0 when used as numbers).
Close the output file y (needed with some awk implementations on certain systems).

[1] I say every record because in that rule the pattern part is missing, so the action is executed for every record.
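
For data with stray lines between entries, a more defensive variant may help. The following is a minimal sketch, not the script from this thread: it collects only the lines from PT to ER, skips entries without a PY line, and uses a line counter instead of the ternary test on r. The directory demo_years, the names inside, n, out and outdir, and the shortened sample records are assumptions made for illustration.

```shell
mkdir -p demo_years
rm -f demo_years/2010 demo_years/2011   # start clean, because >> appends

# Hypothetical sample: stray lines around two entries from different years.
cat > demo_years/sample.txt <<'EOF'
...
PT J
AU Yousefi, Ramin
PY 2011
ER

PT J
AU Ooi, C. H. Raymond
PY 2010
ER
...
EOF

awk -v outdir=demo_years '
/^PT /  { inside = 1; r = ""; y = ""; n = 0 }   # a new entry starts
inside  { r = (n++ ? r RS : "") $0 }            # collect only lines inside an entry
/^PY /  { y = $2 }                              # remember the publication year
/^ER/ && inside {
  if (y != "") {                                # skip entries that carry no PY line
    out = outdir "/" y
    print r RS >> out                           # entry plus a blank separator line
    close(out)
  }
  inside = 0
}' demo_years/sample.txt
```

After the run, demo_years/2011 and demo_years/2010 should each hold one entry, without the "..." lines.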

Hope this helps.
This User Gave Thanks to radoulov For This Post:
# 7  
Old 10-08-2011
Hi thanks for the tutorial.. I really got the picture, save the r RS $0 part but it's ok I'll look it up some other time.. Now I've tried to use your script to collect articles by author so let's say I'd like to collect all of Mr Yousefi's papers into one file, I'd type

Code:
gawk '{
  /"Yousefi.*$"/ && y = yousefi
  r = r ? r RS $0 : $0
  }
/^ER/ {
  print r > y
  r = x
  }' savedrecs.txt savedrecs2.txt

I'm pretty sure I'm doing it wrong because the error I get is
gawk: cmd. line:6: (FILENAME=savedrecs.txt FNR=21) fatal: expression for `>' redirection has null string value
which means y is not initialized I think..

Please helppp.. Thanks again..

Last edited by radoulov; 10-08-2011 at 05:23 AM.. Reason: Code tags!
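
A note on the error in the attempt above: the regular expression /"Yousefi.*$"/ contains literal double quotes, so it never matches the data, and yousefi without quotes is an uninitialized awk variable, so y would be empty even on a match. A hedged sketch of one way to collect entries by author follows; the flag name hit, the counter n, the output file yousefi.txt and the shortened sample records are assumptions made for illustration.

```shell
mkdir -p demo_author
# Hypothetical sample with one matching and one non-matching entry.
cat > demo_author/sample.txt <<'EOF'
PT J
AU Yousefi, Ramin
   Muhamad, Muhamad Rasat
PY 2011
ER

PT J
AU Ooi, C. H. Raymond
PY 2010
ER
EOF

awk -v out=demo_author/yousefi.txt '
{ r = (n++ ? r RS : "") $0 }      # collect every line of the current entry
/Yousefi/ { hit = 1 }             # the name appears somewhere in this entry
/^ER/ {
  if (hit) print r > out          # keep the entry only when the name was seen
  r = ""; n = 0; hit = 0          # reset for the next entry
}' demo_author/sample.txt
```

Matching /Yousefi/ on every line also catches the name when it appears in a title; restricting the match to the AU field would need extra state to follow the indented continuation lines.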