Search for a pattern, extract value(s) from next line, extract lines having those extracted value(s)

07-18-2012


I have hundreds of files to process. In each file I need to:

1) look for a pattern, then
2) extract value(s) from the next line, and then
3) search for the value(s) selected in point (2) in the same file at a specific position.

Code:

 HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V  
 TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS  
 COMPND MOL_ID: 1;  
 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B  
 SOURCE MOL_ID: 1;  
 SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;  
 SOURCE 3 ORGANISM_TAXID: 1076  
 KEYWDS ELECTRON TRANSPORT  
 EXPDTA X-RAY DIFFRACTION  
 AUTHOR N.SHIBATA,S.IBA,S.MISAKI,T.E.MEYER,R.G.BARTSCH,  
 AUTHOR 2 M.A.CUSANOVICH,Y.HIGUCHI,N.YASUOKA  
 REVDAT 2 24-FEB-09 1A7V 1 VERSN  
 REVDAT 1 17-JUN-98 1A7V 0  
 ...........................................................................
 Many lines in between
 ..........................................................................
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 .........................................................................................................................
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 TER 922 SER A 125
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C  
 ..........................................................................................................................
 ..........................................................................................................................

In this example, I need to:

1) look for CYTOCHROME C,
2) extract A and B from the line immediately after it,
3) print all lines that have A or B in field number 5.

So the output should be:

Code:

 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 .........................................................................................................................
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C  
 .............................................................................................................................
 .............................................................................................................................

The problem is that the search pattern and the line after it can take several forms, for example:

Code:

  
 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A;
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B  
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C';  
 COMPND 3 CHAIN: A, B , C, D;
 

 OR
 

 COMPND 2 MOLECULE: CYTOCHROME C;  
 COMPND 3 CHAIN: A;  
 COMPND 4 SYNONYM: SOXA;  
 COMPND 5 MOL_ID: 2;  
 COMPND 6 MOLECULE: CYTOCHROME C;  
 COMPND 7 CHAIN: B;

Sorry for sounding complicated. Any help is highly appreciated. I respect your time.

Thanks and Regards,
Ashwani

AshwaniSharma09


07-18-2012


I'm not sure about everything you want to do, but I think this does most of it:

Code:

sed -n '/^[\ \t]*COMPND.*CYTOCHROME\ C.*/{n;p;}' out.test | awk -F":" '{print $2}' | sed 's/[\ \,]//g'

sed - print the line after it finds the line with the matching regex.
awk - print only the text after the colon, could change this if needed pretty simply.
sed - remove spaces & commas so now it'll just read: A, or AB, or AC, etc.

Here's the list of lines where the 5th field matches your A or B:

Code:

awk '$5 ~ "[AB]"' out.test             
HEADER ELECTRON TRANSPORT 18-MAR-98 1A7V  
 TITLE CYTOCHROME C' FROM RHODOPSEUDOMONAS PALUSTRIS  
 COMPND 3 CHAIN: A, B  
 SOURCE 2 ORGANISM_SCIENTIFIC: RHODOPSEUDOMONAS PALUSTRIS;  
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C

Adding a grep above for ^[\ \t]*ATOM will give us just the atom lines, so now we just combine it all:

Code:

$ awk "\$5 ~ \"[$(sed -n '/^[\ \t]*COMPND.*CYTOCHROME\ C.*/{n;p;}' out.test | awk -F':' '{print $2}' | sed 's/[\ \,]//g')]\"" out.test | grep '^[\ \t]*ATOM'
 ATOM 1 N GLN A 1 45.346 45.040 5.004 1.00 90.15 N  
 ATOM 2 CA GLN A 1 45.068 43.614 4.669 1.00 89.25 C  
 ATOM 3 C GLN A 1 45.626 42.698 5.751 1.00 89.26 C  
 ATOM 4 O GLN A 1 46.326 43.158 6.652 1.00 89.60 O  
 ATOM 5 CB GLN A 1 45.662 43.254 3.302 0.20 89.81 C  
 ATOM 6 CG GLN A 1 45.062 44.027 2.134 0.20 89.99 C  
 ATOM 7 CD GLN A 1 43.546 43.995 2.137 0.20 89.88 C  
 ATOM 8 OE1 GLN A 1 42.909 44.738 2.883 0.20 89.97 O  
 ATOM 920 OG SER A 125 44.804 18.922 -1.607 1.00 91.77 O  
 ATOM 921 OXT SER A 125 43.350 14.761 -1.403 1.00 94.70 O  
 ATOM 923 N GLN B 1 11.868 35.655 8.087 1.00 91.68 N  
 ATOM 924 CA GLN B 1 13.224 35.969 8.625 1.00 90.25 C  
 ATOM 925 C GLN B 1 13.335 37.449 8.982 1.00 89.59 C  
 ATOM 926 O GLN B 1 12.346 38.180 8.909 1.00 89.38 O  
 ATOM 927 CB GLN B 1 14.309 35.585 7.611 0.20 91.63 C  
 ATOM 928 CG GLN B 1 15.059 34.291 7.944 0.20 89.78 C

Edit: Probably a cleaner way to do this using just awk, but I never do things that way, so not sure on the exact changes you'd need to make.

Last edited by Vryali; 07-18-2012 at 02:13 PM.. Reason: Cleaned a bit.


Vryali


07-18-2012


Whenever you have sed | awk | grep | kitchen | sink, it can probably be done all in one awk. It's a lot more than a glorified 'cut'.

1) Search for a line containing CYTOCHROME C where there's two fields (as delimited by : )
2) Get the next line, clean it up with gsub(strip out " " ";" ","), turn the second field into a regex like [AB]
3) Set field separator to space.
4) For every line thereafter, if the line contains ATOM and the fifth field matches the regex, print the line.

Code:

awk -F":" '(!RGX) && /CYTOCHROME C/ && (NF==2) {
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/' inputfile
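
A side note on step 3, offered as a small sketch rather than as part of the original answer: assigning FS inside an action does not re-split the record that has already been split. Per POSIX awk the new value applies to records read afterwards, which is why the later ATOM lines end up split on whitespace. The two-line input and the name fs_demo are made up for illustration.

```shell
# Record 1 is split on ":"; FS is then switched, so record 2 is split on whitespace.
fs_demo() {
    printf 'a:b c\nd:e f\n' | awk -F":" '{ print $1; FS = " " }'
}
fs_demo
```

With gawk or mawk this should print `a` followed by `d:e`.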


Corona688


07-19-2012


Thank you very much Corona688, you made my day.

The script works fine when I put it in a shell script like this:

Code:

cat temp.sh 
awk -F":" '(!RGX) && /CYTOCHROME C/ && (NF==2) {
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /^ATOM/' 1A3R.pdb

However, I don't know how to run it directly on the command line, or how to save it as an AWK script such as temp.awk, although I do use AWK a little. Once again, thank you very much for the help.


---------- Post updated at 11:24 AM ---------- Previous update was at 11:21 AM ----------

Thanks, Vryali, for your reply.

---------- Post updated at 07:57 PM ---------- Previous update was at 11:24 AM ----------

When I run the script, some files give an error. Here are the first few lines of two such files and their respective errors:

Code:

 cat 132L.pdb
 

 HEADER    HYDROLASE(O-GLYCOSYL)                   02-JUN-93   132L 
 TITLE     STRUCTURAL CONSEQUENCES OF REDUCTIVE METHYLATION OF LYSINE 
 TITLE    2 RESIDUES IN HEN EGG WHITE LYSOZYME: AN X-RAY ANALYSIS AT 
 TITLE    3 1.8 ANGSTROMS RESOLUTION 
 COMPND    MOL_ID: 1; 
 COMPND   2 MOLECULE: HEN EGG WHITE LYSOZYME; 
 COMPND   3 CHAIN: A; 
 COMPND   4 EC: 3.2.1.17; 
 COMPND   5 ENGINEERED: YES 
 SOURCE    MOL_ID: 1; 
 SOURCE   2 ORGANISM_SCIENTIFIC: GALLUS GALLUS; 
 SOURCE   3 ORGANISM_COMMON: CHICKEN; 
 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 

 cat 1G7H.pdb  
 

 HEADER    HYDROLASE INHIBITOR/HYDROLASE           10-NOV-00   1G7H 
 TITLE     CRYSTAL STRUCTURE OF HEN EGG WHITE LYSOZYME (HEL) COMPLEXED 
 TITLE    2 WITH THE MUTANT ANTI-HEL MONOCLONAL ANTIBODY D1.3(VLW92A) 
 COMPND    MOL_ID: 1; 
 COMPND   2 MOLECULE: ANTI-HEN EGG WHITE LYSOZYME MONOCLONAL ANTIBODY 
 COMPND   3 D1.3; 
 COMPND   4 CHAIN: A; 
 COMPND   5 FRAGMENT: LIGHT CHAIN; 
 COMPND   6 ENGINEERED: YES; 
 COMPND   7 MUTATION: YES; 
 COMPND   8 MOL_ID: 2; 
 COMPND   9 MOLECULE: ANTI-HEN EGG WHITE LYSOZYME MONOCLONAL ANTIBODY 
 COMPND  10 D1.3; 
 COMPND  11 CHAIN: B; 
 COMPND  12 FRAGMENT: HEAVY CHAIN; 
 ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
 

 errors:
 

 awk: cmd. line:6: (FILENAME=132L.pdb FNR=4) fatal: Unmatched [ or [^: /[]/ 
 

 awk: cmd. line:6: (FILENAME=1G7H.pdb FNR=6) fatal: Unmatched [ or [^: /[]/

Is the problem in the gsub function or in the expression that follows it? Thanks & Regards

AshwaniSharma09


07-19-2012


I can't tell. CYTOCHROME C isn't anywhere in that file, so I have no idea what it's supposed to match. It must be picking up regex-like characters from the string it's trying to catch, which foul up the RGX variable when it's created.
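
One plausible reading of the error, assuming the script was run with /LYSOZYME/ in place of /CYTOCHROME C/ (as a later post in this thread says): in 132L.pdb the first line that contains LYSOZYME and exactly one colon is a TITLE record, and the line after it has no colon at all. $2 is then empty, so RGX becomes "[]", which is not a valid bracket expression. The sketch below only prints the string that would be assigned to RGX; the helper name show_rgx and the temporary sample file are made up for illustration.

```shell
# show_rgx PATTERN FILE: print the bracket expression the script would build.
show_rgx() {
    awk -F":" -v pat="$1" '$0 ~ pat && NF == 2 {
        getline                 # read the line after the match
        gsub(/[;, ]+/, "")      # strip separators, as the original script does
        print "[" $2 "]"        # $2 is empty when this line has no colon
        exit
    }' "$2"
}

# A few header lines copied from the 132L.pdb excerpt above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TITLE    2 RESIDUES IN HEN EGG WHITE LYSOZYME: AN X-RAY ANALYSIS AT
TITLE    3 1.8 ANGSTROMS RESOLUTION
COMPND   2 MOLECULE: HEN EGG WHITE LYSOZYME;
COMPND   3 CHAIN: A;
EOF

show_rgx LYSOZYME "$sample"     # the TITLE line matches first, so this prints []
```

The same thing would happen in 1G7H.pdb if the MOLECULE line is matched, because the molecule name wraps onto the next line ("D1.3;"), which again has no colon.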

You can put the script in a file easily enough like this:

Code:

$ cat script.awk

(!RGX) && ($0 ~ RMATCH) && (NF==2) {
         getline
         gsub(/[;, ]*/, "");
         RGX="[" $2 "]"
         FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/

$ awk -F":" -v RMATCH="CYTOCHROME C" -f script.awk file

...

$

I'm probably out for the rest of the day unfortunately. I'll check this evening if I can, for further details on your difficulty.
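
Since the original question mentions hundreds of files, here is a minimal sketch of running a script file like the one above over every .pdb file in a directory. The function name run_all and the .atoms output suffix are assumptions made for this example, not something from the thread. Note that -F":" is passed on the command line, since the first rule relies on colon-separated fields.

```shell
# run_all SCRIPT PATTERN DIR: run the awk script on every .pdb file in DIR,
# writing the selected lines to NAME.atoms next to each input file.
run_all() {
    for f in "$3"/*.pdb; do
        [ -e "$f" ] || continue      # no .pdb files: the glob stayed literal
        awk -F":" -v RMATCH="$2" -f "$1" "$f" > "${f%.pdb}.atoms"
    done
}
```

It could be called as `run_all script.awk "CYTOCHROME C" /path/to/pdbs`, where the directory is whatever holds your files.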

Corona688


07-24-2012


Thanks, Corona688, for the reply. Your script works very well on files containing CYTOCHROME C. The examples above are from files that throw errors. The only difference is that I am searching for the pattern "LYSOZYME" instead of CYTOCHROME C. The script, run on a file with the extension .pdb, looks like this:

Code:

cat test.sh 
awk -F":" '(!RGX) && /LYSOZYME/ && !/ANTIBODY/ && (NF == 2){
        getline
        gsub(/[;, ]*/, "");
        RGX="[" $2 "]"
        FS=" ";
} RGX && ($5 ~ RGX) && /ATOM/' 1G7H.pdb

My limited understanding tells me that your speculation is right: something is corrupting the RGX variable. I suspect the ":" symbol on the line just above the one mentioned in the error. In file 132L.pdb the error mentions line 4, and I can see a ":" in line 3 with the matching pattern to the left of it. Just to be clear, I don't want to extract line 3 or 4 here.
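
One way to avoid picking up TITLE lines or wrapped molecule names, sketched under some assumptions: only COMPND records are inspected, the pattern is tested on the line that carries "MOLECULE:", and the chain list is taken from the next "CHAIN:" record of that molecule rather than blindly from the following line. The function name chain_atoms and its optional third argument are invented for this example. It keeps the thread's convention that the chain ID is the 5th whitespace-separated field, which may not hold in real fixed-column PDB files where columns can run together. It also does not handle a molecule name or chain list that wraps across several lines.

```shell
# chain_atoms PATTERN FILE [SKIP]: print ATOM lines whose 5th field is a chain ID
# of a molecule whose MOLECULE record matches PATTERN (and does not match SKIP).
chain_atoms() {
    awk -v pat="$1" -v skip="${3:-}" '
    $1 == "COMPND" {
        if ($0 ~ /MOLECULE:/) {
            # A new molecule starts: decide whether its chains are wanted.
            want = ($0 ~ pat) && (skip == "" || $0 !~ skip)
        } else if (want && $0 ~ /CHAIN:/) {
            list = $0
            sub(/.*CHAIN:/, "", list)      # keep only the chain list
            gsub(/[;, ]+/, "", list)       # "A, B;" becomes "AB"
            for (i = 1; i <= length(list); i++)
                chains[substr(list, i, 1)] = 1
            want = 0
        }
        next
    }
    $1 == "ATOM" && ($5 in chains)
    ' "$2"
}
```

For the case described here it might be called as `chain_atoms LYSOZYME 1G7H.pdb ANTIBODY`. Because chain IDs are accumulated in an array, a file with several matching molecules (such as the last CYTOCHROME C example in the first post, with chains A and B under different MOL_IDs) would have all of their chains selected.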

Thanks and Regards,
Ashwani


AshwaniSharma09
