Search for a pattern, extract value(s) from next line, extract lines having those extracted value(s)

07-18-2012


I have hundreds of files to process. In each file I need to:

1) look for a pattern,
2) extract the value(s) from the next line, and
3) search the same file for lines that have the value(s) from step (2) at a specific field position.

Code:

 HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V  
 TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS  
 COMPND MOL_ID: 1;  
 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B  
 SOURCE MOL_ID: 1;  
 SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;  
 SOURCE 3 ORGANISM_TAXID: 1076  
 KEYWDS ELECTRON TRANSPORT  
 EXPDTA X-RAY DIFFRACTION  
 AUTHOR N.SHIBATA,S.IBA,S.MISAKI,T.E.MEYER,R.G.BARTSCH,  
 AUTHOR 2 M.A.CUSANOVICH,Y.HIGUCHI,N.YASUOKA  
 REVDAT 2 24-FEB-09 1A7V 1 VERSN  
 REVDAT 1 17-JUN-98 1A7V 0  
 ...........................................................................
 Many lines in between
 ..........................................................................
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 .........................................................................................................................
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 TER 922 SER A 125
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C  
 ..........................................................................................................................
 ..........................................................................................................................

In this example, I need to:

1) look for CYTOCHROME C,
2) extract A and B from the very next line,
3) print all lines that have A or B in field number 5.

So the output should be:

Code:

 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 .........................................................................................................................
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C  
 .............................................................................................................................
 .............................................................................................................................

The complication is that the lines to match can take several forms, such as:

Code:

  
 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A;
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B  
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B , C, D;
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C;  
 COMPND 3 CHAIN: A;  
 COMPND 4 SYNONYM: SOXA;  
 COMPND 5 MOL_ID: 2;  
 COMPND 6 MOLECULE: CYTOCHROME C;  
 COMPND 7 CHAIN: B;

Sorry for sounding complicated. Any help is highly appreciated. I respect your time.

Thanks and Regards,
Ashwani

AshwaniSharma09


07-18-2012


I'm not sure about everything you want to do, but I think this does most of it:

Code:

sed -n '/^[[:space:]]*COMPND.*CYTOCHROME C/{n;p;}' out.test | awk -F":" '{print $2}' | sed 's/[ ,]//g'

sed - print the line after it finds the line with the matching regex.
awk - print only the text after the colon, could change this if needed pretty simply.
sed - remove spaces & commas so now it'll just read: A, or AB, or AC, etc.
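
As a quick check of what this pipeline yields for the multi-molecule form from the question, the sketch below runs it on a throwaway sample. The `mktemp` file, the extra `;` in the last sed and the final `tr -d '\n'` are additions for illustration, not part of the original command. With two matching molecules the pipeline prints one chain per line, so joining the lines gives a single string that can be dropped into a bracket expression.

```shell
# Sample COMPND block: the fourth form from the question (two CYTOCHROME C molecules).
sample=$(mktemp)
cat > "$sample" <<'EOF'
COMPND 2 MOLECULE: CYTOCHROME C;
COMPND 3 CHAIN: A;
COMPND 4 SYNONYM: SOXA;
COMPND 5 MOL_ID: 2;
COMPND 6 MOLECULE: CYTOCHROME C;
COMPND 7 CHAIN: B;
EOF

# The same three stages, with ';' also stripped; tr joins the per-molecule results.
chains=$(sed -n '/^[[:space:]]*COMPND.*CYTOCHROME C/{n;p;}' "$sample" |
         awk -F: '{print $2}' |
         sed 's/[ ,;]//g' |
         tr -d '\n')

echo "$chains"    # prints: AB
rm -f "$sample"
```

Without the `tr` step the two chains arrive on separate lines, which would put a newline inside the regex when the result is substituted into another command.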

Here's the list of lines where the 5th field matches your AB:

Code:

$ awk '$5 ~ "[AB]"' out.test
HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V  
 TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS  
 COMPND 3 CHAIN: A, B  
 SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;  
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C

Adding a grep for ^[[:space:]]*ATOM to the above will give us just the ATOM lines, so now we just combine it all:

Code:

$ awk "\$5 ~ \"[$(sed -n '/^[[:space:]]*COMPND.*CYTOCHROME C/{n;p;}' out.test | awk -F':' '{print $2}' | sed 's/[ ,]//g')]\"" out.test | grep '^[[:space:]]*ATOM'
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C

Edit: There is probably a cleaner way to do this using just awk, but I never do things that way, so I'm not sure of the exact changes you'd need to make.

Last edited by Vryali; 07-18-2012 at 03:13 PM.. Reason: Cleaned a bit.

Vryali


07-18-2012


Whenever you have sed | awk | grep | kitchen | sink, it can probably be done all in one awk. It's a lot more than a glorified 'cut'.

1) Search for a line containing CYTOCHROME C where there's two fields (as delimited by : )
2) Get the next line, clean it up with gsub(strip out " " ";" ","), turn the second field into a regex like [AB]
3) Set field separator to space.
4) For every line thereafter, if the line contains ATOM and the fifth field matches the regex, print the line.

Code:

awk -F":" '(!RGX) && /CYTOCHROME C/ && (NF==2) {
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/' inputfile

Corona688


07-19-2012


Thank you very much Corona688, you made my day.

The script works fine when I put it in a shell script like this:

Code:

$ cat temp.sh
awk -F":" '(!RGX) && /CYTOCHROME C/ && (NF==2) {
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /^ATOM/' 1A3R.pdb

But I don't know how to run it directly on the command line, or how to save it as an AWK script such as temp.awk, although I use AWK a little. Once again, thank you very much for the help.
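
For hundreds of files, one way to run this straight from the command line is to wrap the awk call in a shell loop. This is only a sketch under assumptions: the `*.pdb` glob, the `extract_chain_atoms` function name and the `.atoms` output suffix are made up for illustration, and the awk program is the one from temp.sh above.

```shell
# Hypothetical helper: print the ATOM lines of the chains listed on the
# line after the first line that matches "CYTOCHROME C".
extract_chain_atoms() {
    awk -F":" '(!RGX) && /CYTOCHROME C/ && (NF==2) {
        getline
        gsub(/[;, ]+/, "")
        RGX="[" $2 "]"
        FS=" "
    } RGX && ($5 ~ RGX) && /^ATOM/' "$1"
}

# One output file per input file; the .atoms suffix is an assumption.
for f in *.pdb; do
    [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
    extract_chain_atoms "$f" > "${f%.pdb}.atoms"
done
```

Keeping the awk program in a function means the loop body stays short, and the same function can be called on a single file while testing.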

---------- Post updated at 11:24 AM ---------- Previous update was at 11:21 AM ----------

Thanks Vryali for your reply.

---------- Post updated at 07:57 PM ---------- Previous update was at 11:24 AM ----------

When I run the script, some files give an error. Here are the first few lines of two such files and the corresponding errors:

Code:

 $ cat 132L.pdb
 

 HEADER    HYDROLASE(O-GLYCOSYL)                   02-JUN-93   132L 
 TITLE     STRUCTURAL CONSEQUENCES OF REDUCTIVE METHYLATION OF LYSINE 
 TITLE    2 RESIDUES IN HEN EGG WHITE LYSOZYME: AN X-RAY ANALYSIS AT 
 TITLE    3 1.8 ANGSTROMS RESOLUTION 
 COMPND    MOL_ID: 1; 
 COMPND   2 MOLECULE: HEN EGG WHITE LYSOZYME; 
 COMPND   3 CHAIN: A; 
 COMPND   4 EC: 3.2.1.17; 
 COMPND   5 ENGINEERED: YES 
 SOURCE    MOL_ID: 1; 
 SOURCE   2 ORGANISM_SCIENTIFIC: GALLUS GALLUS; 
 SOURCE   3 ORGANISM_COMMON: CHICKEN; 
 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 

 $ cat 1G7H.pdb
 

 HEADER    HYDROLASE INHIBITOR/HYDROLASE           10-NOV-00   1G7H 
 TITLE     CRYSTAL STRUCTURE OF HEN EGG WHITE LYSOZYME (HEL) COMPLEXED 
 TITLE    2 WITH THE MUTANT ANTI-HEL MONOCLONAL ANTIBODY D1.3(VLW92A) 
 COMPND    MOL_ID: 1; 
 COMPND   2 MOLECULE: ANTI-HEN EGG WHITE LYSOZYME MONOCLONAL ANTIBODY 
 COMPND   3 D1.3; 
 COMPND   4 CHAIN: A; 
 COMPND   5 FRAGMENT: LIGHT CHAIN; 
 COMPND   6 ENGINEERED: YES; 
 COMPND   7 MUTATION: YES; 
 COMPND   8 MOL_ID: 2; 
 COMPND   9 MOLECULE: ANTI-HEN EGG WHITE LYSOZYME MONOCLONAL ANTIBODY 
 COMPND  10 D1.3; 
 COMPND  11 CHAIN: B; 
 COMPND  12 FRAGMENT: HEAVY CHAIN; 
 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 

 errors:
 

 awk: cmd. line:6: (FILENAME=132L.pdb FNR=4) fatal: Unmatched [ or [^: /[]/ 
 

 awk: cmd. line:6: (FILENAME=1G7H.pdb FNR=6) fatal: Unmatched [ or [^: /[]/

Is the problem in the gsub function or in the expression that follows it? Thanks & Regards

AshwaniSharma09


07-19-2012


I can't tell. CYTOCHROME C isn't anywhere in that file, so I have no idea what it's supposed to match. It must be picking up regex-like characters from the string it's trying to catch, which foul up the RGX variable when it's created.

You can put the script in a file easily enough like this:

Code:

$ cat script.awk

(!RGX) && ($0 ~ RMATCH) && (NF==2) {
         getline
         gsub(/[;, ]*/, "");
         RGX="[" $2 "]"
         FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/

$ awk -F":" -v RMATCH="CYTOCHROME C" -f script.awk file

...

$

I'm probably out for the rest of the day unfortunately. I'll check this evening if I can, for further details on your difficulty.

Corona688


07-24-2012


Thanks Corona688 for the reply. Your script works well on files containing CYTOCHROME C. The examples above are from files that throw errors. The only difference there is that I am searching for the pattern "LYSOZYME" instead of CYTOCHROME C. The script, run on a file with the .pdb extension, looks like this:

Code:

$ cat test.sh
awk -F":" '(!RGX) && /LYSOZYME/ && !/ANTIBODY/ && (NF == 2){
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/' 1G7H.pdb

My limited understanding tells me that your speculation is right: something is messing up the RGX variable. I suspect the ":" symbol one line above the line mentioned in the error. In file 132L.pdb, the error mentions line 4, and I can see a ":" on line 3 with the matching pattern to the left of it. Just to be clear, I don't want to extract line 3 or 4 here.
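
The regex shown in the error message is `[]`, which suggests `$2` was empty when `RGX` was built, i.e. the line fetched by `getline` had no ":" in it. That fits 132L.pdb: the `TITLE 2` line contains LYSOZYME and exactly one colon, and the `TITLE 3` line after it has none. The sketch below is one possible reading rather than a confirmed diagnosis. It feeds awk the header lines from the 132L.pdb excerpt above and adds a guard (an addition, not part of the original script) that reports the situation instead of building the invalid bracket expression.

```shell
# Header lines taken from the 132L.pdb excerpt above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HEADER    HYDROLASE(O-GLYCOSYL)                   02-JUN-93   132L
TITLE     STRUCTURAL CONSEQUENCES OF REDUCTIVE METHYLATION OF LYSINE
TITLE    2 RESIDUES IN HEN EGG WHITE LYSOZYME: AN X-RAY ANALYSIS AT
TITLE    3 1.8 ANGSTROMS RESOLUTION
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: HEN EGG WHITE LYSOZYME;
COMPND   3 CHAIN: A;
EOF

report=$(awk -F":" '(!RGX) && /LYSOZYME/ && (NF==2) {
    matched = FNR                 # line that triggered the pattern
    getline                       # fetch the line after it
    gsub(/[;, ]+/, "")
    if ($2 == "") {               # guard: "[" "" "]" would be the invalid regex []
        printf "line %d matched, line %d has no chain list\n", matched, FNR
        next
    }
    RGX = "[" $2 "]"
    FS = " "
}' "$sample")

echo "$report"    # prints: line 3 matched, line 4 has no chain list
rm -f "$sample"
```

With the guard in place the script carries on after the TITLE lines and picks up the chain from the `COMPND 3 CHAIN: A;` line further down.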

Thanks and Regards,
Ashwani


AshwaniSharma09


07-24-2012


I'm working on some code. So far I haven't encountered the error you did, so I'm a bit puzzled.

It's better to look for COMPND than to reject ANTIBODY, more specific and less special cases.
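
A sketch of what keying on COMPND could look like, offered as an assumption rather than the code being worked on here. It only considers lines whose first field is `COMPND`, remembers whether the most recent `MOLECULE:` line matched, and takes the chain list from the next `CHAIN:` line, so TITLE lines and continuation lines such as `COMPND 3 D1.3;` are skipped. The variable names (`want`, `list`, `chains`) are hypothetical, and the last ATOM line of the sample (chain C) is made up to show the filtering; the other lines come from the examples in the first post.

```shell
# Sample built from the question; the final chain C ATOM line is invented.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: CYTOCHROME C;
COMPND 3 CHAIN: A;
COMPND 4 SYNONYM: SOXA;
COMPND 5 MOL_ID: 2;
COMPND 6 MOLECULE: CYTOCHROME C;
COMPND 7 CHAIN: B;
ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N
TER 922 SER A 125
ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N
ATOM 929 N GLN C 1 10.000 10.000 10.000 1.00 50.00 N
EOF

atoms=$(awk -v RMATCH="CYTOCHROME C" '
    # Each MOLECULE line decides whether the next CHAIN line is wanted.
    $1 == "COMPND" && /MOLECULE:/ { want = ($0 ~ RMATCH); next }
    # Take the chain list from the CHAIN line of a wanted molecule.
    $1 == "COMPND" && /CHAIN:/ && want {
        list = $0
        sub(/.*CHAIN:/, "", list)       # keep only what follows "CHAIN:"
        gsub(/[;, \t]+/, "", list)      # "A, B ;" becomes "AB"
        chains = chains list
        want = 0
        next
    }
    # Print ATOM records whose 5th field is one of the collected chain IDs.
    $1 == "ATOM" && length($5) == 1 && index(chains, $5)
' "$sample")

echo "$atoms"
rm -f "$sample"
```

Because the chains are accumulated rather than captured once, the form with two CYTOCHROME C molecules yields both A and B, and `RMATCH` can be set to LYSOZYME without the TITLE lines getting in the way.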

But then, I don't think your data is the same as the stuff you posted, since what you posted contains no ATOM lines at all, and those are necessary to get any results. Can you post something more complete?

Corona688

