Search for a pattern, extract value(s) from the next line, extract lines having those extracted value(s)
I have hundreds of files to process. In each file I need to:
1) look for a pattern,
2) extract value(s) from the next line, and then
3) search for the value(s) from point (2) in the same file at a specific position.
Code:
HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V
TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A, B
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;
SOURCE 3 ORGANISM_TAXID: 1076
KEYWDS ELECTRON TRANSPORT
EXPDTA X-RAY DIFFRACTION
AUTHOR N.SHIBATA,S.IBA,S.MISAKI,T.E.MEYER,R.G.BARTSCH,
AUTHOR 2 M.A.CUSANOVICH,Y.HIGUCHI,N.YASUOKA
REVDAT 2 24-FEB-09 1A7V 1 VERSN
REVDAT 1 17-JUN-98 1A7V 0
...........................................................................
Many lines in between
..........................................................................
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C
ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C
ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O
ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C
ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C
ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C
ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O
.........................................................................................................................
ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O
ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O
TER 922 SER A 125
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C
ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C
ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O
ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C
ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C
..........................................................................................................................
..........................................................................................................................
In this example, I need to:
1) look for CYTOCHROME C,
2) extract A and B from the very next line,
3) print all lines having A or B at field number 5.
So the output should be:
Code:
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C
ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C
ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O
ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C
ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C
ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C
ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O
.........................................................................................................................
ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O
ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C
ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C
ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O
ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C
ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C
.............................................................................................................................
.............................................................................................................................
Now the problem is that the search pattern can appear in many forms, like:
Code:
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A;
OR
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A, B
OR
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A, B , C, D;
OR
COMPND 2 MOLECULE: CYTOCHROME C;
COMPND 3 CHAIN: A;
COMPND 4 SYNONYM: SOXA;
COMPND 5 MOL_ID: 2;
COMPND 6 MOLECULE: CYTOCHROME C;
COMPND 7 CHAIN: B;
Sorry for sounding complicated. Any help is highly appreciated. I respect your time.
I'm not sure about everything you want to do, but I think this does most of it:
Code:
sed -n '/^[\ \t]*COMPND.*CYTOCHROME\ C.*/{n;p;}' out.test | awk -F":" '{print $2}' | sed 's/[\ \,]//g'
sed - prints the line after the one that matches the regex.
awk - prints only the text after the colon; this is simple to change if needed.
sed - removes spaces and commas, so the result reads A, or AB, or AC, etc.
Here's the list of lines where the 5th field contains A or B:
Code:
$ awk '$5 ~ "[AB]"' out.test
HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V
TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS
COMPND 3 CHAIN: A, B
SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C
ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C
ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O
ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C
ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C
ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C
ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O
ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O
ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C
ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C
ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O
ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C
ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C
Adding a grep above for ^[\ \t]*ATOM will give us just the atom lines, so now we just combine it all:
Code:
$ awk "\$5 ~ \"[$(sed -n '/^[\ \t]*COMPND.*CYTOCHROME\ C.*/{n;p;}' out.test | awk -F':' '{print $2}' | sed 's/[\ \,]//g')]\"" out.test | grep '^[\ \t]*ATOM'
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C
ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C
ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O
ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C
ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C
ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C
ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O
ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O
ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C
ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C
ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O
ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C
ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C
Edit: There's probably a cleaner way to do this using just awk, but I never do things that way, so I'm not sure of the exact changes you'd need to make.
Last edited by Vryali; 07-18-2012 at 02:13 PM..
Reason: Cleaned a bit.
Whenever you have sed | awk | grep | kitchen | sink, it can probably be done all in one awk. It's a lot more than a glorified 'cut'.
1) Search for a line containing CYTOCHROME C where there are two fields (as delimited by : )
2) Get the next line, clean it up with gsub(strip out " " ";" ","), turn the second field into a regex like [AB]
3) Set field separator to space.
4) For every line thereafter, if the line contains ATOM and the fifth field matches the regex, print the line.
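The four steps above can be sketched as a single awk command. This is a minimal sketch under assumptions, not necessarily the script that was originally posted: the sample file is a made-up miniature of the data in this thread (the chain C line is invented to show a non-match), and the names `chains` and `rgx` are illustrative. Instead of switching the field separator (step 3), it leaves FS at its default and handles the colon with `split` and `sub`.

```shell
# Hypothetical miniature of the data in this thread; the chain C line is invented.
tmp=$(mktemp -d)
cat > "$tmp/sample.pdb" <<'EOF'
TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A, B
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
TER 922 SER A 125
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 999 N GLN C 1 11.868 35.655 8.087 1.00 91.68 N
EOF

awk '
# Step 1: a line mentioning the molecule that splits into two parts on ":".
/CYTOCHROME C/ && split($0, parts, ":") == 2 {
    # Step 2: read the next line, keep the text after the colon,
    # strip spaces, ";" and ",", and build a regex such as ^[AB]$.
    if ((getline line) > 0 && line ~ /:/) {
        sub(/^[^:]*:/, "", line)
        gsub(/[ ;,]/, "", line)
        chains = chains line
        if (chains != "") rgx = "^[" chains "]$"
    }
    next
}
# Step 4: print ATOM lines whose fifth field matches.
rgx != "" && $1 == "ATOM" && $5 ~ rgx
' "$tmp/sample.pdb" > "$tmp/matched.txt"

cat "$tmp/matched.txt"
```

Because `chains` accumulates, a file with several matching MOLECULE entries ends up with one combined character class. To process many files, the same program can be wrapped in a shell loop over `*.pdb`.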
But I don't know how to run it, either directly on the command line or by saving it as an AWK script such as temp.awk, although I use AWK a little. Once again, thank you very much for the help.
I can't tell. CYTOCHROME C isn't anywhere in that file, so I have no idea what it's supposed to match. It must be picking up regex-like characters from the string it's trying to catch, which foul up the RGX variable when it's created.
You can put the script in a file easily enough like this:
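A minimal sketch of that, under stated assumptions. The file name `temp.awk` comes from the question above; the script body, the `pat` variable, the `want` array, and the sample input are illustrative and are not the script from the original post. It stores the chain IDs in an array and matches the search text literally with `index`, so no regex is ever built from the data.

```shell
tmp=$(mktemp -d)

# Save the awk program in its own file (quoted heredoc, so the shell leaves $1 etc. alone).
cat > "$tmp/temp.awk" <<'EOF'
# pat is passed in with -v and matched literally, not as a regex.
$1 == "COMPND" && index($0, "MOLECULE:") && index($0, pat) {
    # The CHAIN line is expected right after the MOLECULE line.
    if ((getline line) > 0 && line ~ /CHAIN:/) {
        sub(/^.*CHAIN:/, "", line)
        gsub(/[ ;]/, "", line)
        n = split(line, ids, ",")
        for (i = 1; i <= n; i++)
            if (ids[i] != "") want[ids[i]] = 1
    }
    next
}
# Print ATOM lines whose fifth field is one of the collected chain IDs.
$1 == "ATOM" && ($5 in want)
EOF

# Made-up sample in the two-molecule layout from the question; chain C is invented.
cat > "$tmp/sample.pdb" <<'EOF'
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CYTOCHROME C;
COMPND 3 CHAIN: A;
COMPND 4 SYNONYM: SOXA;
COMPND 5 MOL_ID: 2;
COMPND 6 MOLECULE: CYTOCHROME C;
COMPND 7 CHAIN: B;
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 999 N GLN C 1 11.868 35.655 8.087 1.00 91.68 N
EOF

# Run the saved script: -f names the program file, -v sets the search text.
awk -v pat='CYTOCHROME C' -f "$tmp/temp.awk" "$tmp/sample.pdb" > "$tmp/out.txt"
cat "$tmp/out.txt"
```

With the script saved in the current directory, a call for another molecule would look like `awk -v pat='LYSOZYME' -f temp.awk 132L.pdb`.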
Thanks Corona688 for the reply. Your script works very well on files containing CYTOCHROME C. The examples above are taken from the files that throw errors. The only difference here is that I am searching for the pattern "LYSOZYME" instead of CYTOCHROME C. The script on a file (with extension .pdb) goes like this:
As far as I understand, your guess is right: something is corrupting the RGX variable. I suspect the ":" character on the line just above the one mentioned in the error. For example, in file 132L.pdb the error mentions line 4, and I can see a ":" on line 3 with the matching pattern to its left. Just to be clear, I don't want to extract line 3 or 4 here.
Thanks and Regards,
Ashwani
I'm working on some code. So far I haven't encountered the error you did, so I'm a bit puzzled.
It's better to look for COMPND than to reject ANTIBODY, more specific and less special cases.
But then, I don't think your data is the same as the stuff you posted, since your data contains no ATOM lines at all, the things necessary to find any results. Can you post something more complete?
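To illustrate the point about looking for COMPND: in the sample header posted at the top of this thread, the molecule name also appears on the TITLE line, so a bare pattern match picks up more than intended. A small sketch, assuming a scratch file that holds the first header lines of that sample (the variable names `loose` and `anchored` are illustrative):

```shell
tmp=$(mktemp -d)
cat > "$tmp/header.txt" <<'EOF'
TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CYTOCHROME C';
COMPND 3 CHAIN: A, B
EOF

# Loose match: any line mentioning the name, so the TITLE line is caught too.
loose=$(awk '/CYTOCHROME C/' "$tmp/header.txt")

# Anchored match: the first field must be COMPND and the line must carry MOLECULE:.
anchored=$(awk '$1 == "COMPND" && /MOLECULE:.*CYTOCHROME C/' "$tmp/header.txt")

printf 'loose:\n%s\n\nanchored:\n%s\n' "$loose" "$anchored"
```

Requiring the record name in field 1 means other record types never need to be excluded one by one.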