awk to extract and print first occurrence of pattern in each line


 
# 1  
Old 09-26-2017

I am trying to use awk to extract and print the first occurrence of NM_ and NP_ in each line, with a : before the NP_ portion. The input file is tab-delimited, but the output does not need to be. The awk below does execute, but it prints all the lines in the file, not just the patterns. Thank you.

file (tab-delimited)
Code:
Input Variant	HGVS description(s)	Errors and warnings
rs41302905	NC_000009.11:g.136131316C>T|NC_000009.12:g.133255929C>T|NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A|XM_005276851.1:c.379G>A|XM_005276850.1:c.379G>A|XM_005276849.1:c.745G>A|XM_005276852.1:c.379G>A|XP_005276908.1:p.Gly127Arg|XP_005276907.1:p.Gly127Arg|XP_005276909.1:p.Gly127Arg|XP_005276906.1:p.Gly249Arg|XP_005276905.1:p.Gly267Arg		
rs8176745	NC_000009.11:g.136131347G>A|NC_000009.12:g.133255960G>A|NG_006669.1:g.21708C>T|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A|XM_005276852.1:c.348C>T|XM_005276848.1:c.768C>T|XM_005276851.1:c.348C>T|XM_005276850.1:c.348C>T|XM_005276849.1:c.714C>T|XP_005276909.1:p.Pro116=|XP_005276908.1:p.Pro116=|XP_005276907.1:p.Pro116=|XP_005276906.1:p.Pro238=|XP_005276905.1:p.Pro256=

desired output
Code:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk
Code:
awk -F'\t' '/NM_/{f=1} && /NP_/{f=2} f{ if(/{/){count++}; print":"; if(/}/){count--; if(count==0) exit}}' file

maybe
Code:
awk -F'\t' 'NR > 1 && /NM_/{     # skip header and find NM_ pattern
            match($2,/NM_*]/);   # match value for NM
            NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM
            match($2,/NP_*]/);  # match value for NP
            NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP
                       for(i=1;i<=NM;i++){  # start loop and iterate over each line in file
                          print $1, $NM":"$NP  # print output with : in between each 
                       }  # close block
}1' input > out  # define output


Last edited by cmccabe; 09-26-2017 at 10:33 AM.. Reason: added awk, fixed format
# 2  
Old 09-26-2017
Consider the use of match(), which (in gawk, with the optional third argument) will put your selected text into an array that you can print. From the gawk manual:
Quote:
match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is the opposite of most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the ‘~’ operator: ‘string ~ regexp’.

The match() function sets the predefined variable RSTART to the index. It also sets the predefined variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.

For example:

{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where != 0)
            print "Match of", regex, "found at", where, "in", $0
    }
}

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is ‘FIND’, regex is changed to be the second word on that line. Therefore, if given:

FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.

awk prints:

Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.

If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2] }'
-| foooo barrrrr

In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2]
> print arr[1, "start"], arr[1, "length"]
> print arr[2, "start"], arr[2, "length"]
> }'
-| foooo barrrrr
-| 1 5
-| 9 7

There may not be subscripts for the start and index for every parenthesized subexpression, because they may not all have matched text; thus, they should be tested for with the in operator (see Reference to Elements).

The array argument to match() is a gawk extension. In compatibility mode (see Options), using a third argument is a fatal error.
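
Applied to the input from post #1, here is a minimal sketch that relies only on the portable two-argument match() with RSTART and RLENGTH, so it should not need the gawk-only array argument. The function names sample and extract_with_match are made up for illustration, and the sample rows are abbreviated copies of the two data rows, not the full lines.

```shell
# Abbreviated, tab-delimited copies of the rows in post #1 (illustrative only).
sample() {
  printf 'Input Variant\tHGVS description(s)\tErrors and warnings\n'
  printf 'rs41302905\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XP_005276908.1:p.Gly127Arg\t\n'
  printf 'rs8176745\tNC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|XP_005276909.1:p.Pro116=\n'
}

# Hypothetical helper: print "<id> <first NM_ entry><:p.part of first NP_ entry>".
extract_with_match() {
  awk -F'\t' 'NR > 1 {                        # skip the header line
      nm = np = ""
      if (match($2, /NM_[^|]+/))              # leftmost NM_ entry, up to the next |
          nm = substr($2, RSTART, RLENGTH)
      if (match($2, /NP_[^|]+/)) {            # leftmost NP_ entry, up to the next |
          np = substr($2, RSTART, RLENGTH)
          sub(/^[^:]*/, "", np)               # drop the accession, keep ":p.Gly268Arg"
      }
      if (nm != "" && np != "")
          print $1, nm np
  }' "$@"
}

sample | extract_with_match
```

Because match() returns the leftmost match, each call picks up the first NM_ or NP_ entry in the field; the bracket expression [^|]+ stops the match at the next | so it cannot run into the following entry.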
This User Gave Thanks to jim mcnamara For This Post:
# 3  
Old 09-27-2017
The awk below uses match() as suggested, but it prints multiple lines when it runs. I am not sure what I am doing wrong. Thank you.

EDIT: The awk below seems to address the duplicates, however the entire line still prints. Do I need to split $2 by the | and loop through? Thank you.

awk
Code:
awk -F'\t' 'NR > 1 && ($2 ~ /NM_/ && match($2,/NP_/)) {  # search for NM_ and NP_
    match($2,/NM_.*:/);   # match value for NM_
    NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM, starting with the N (in purple) - this is RSTART and ending at the : - this is RLENGTH, so NM_020469.2
    # Get its length
    lenNM=length(NM)
           match($2,/NP_.:*|/);   # match value for NP_
           NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP, starting with the p (portion in purple) -this is RSTART and ending at the | -  this is RLENGTH, so NP_065202.2:p.Gly268Arg
       # Get its length
         lenNP=length(NP)
           # Cycle through each line
             for (i=1; i<=$lenNM; i++) {
             print $1, $NM":"$NP  # print output with : in between each 
        }  # close block
}1' input > out

I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 of the input as an example:

The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I also updated the awk with comments.
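
On RSTART and RLENGTH: after a successful match(), RSTART is the 1-based position of the first matched character and RLENGTH is the number of characters matched, so substr(s, RSTART, RLENGTH) is exactly the matched text. Using RSTART+1 and RLENGTH-2 instead drops the first and last matched characters. A small sketch on a shortened piece of line 1 (the function name show_match is made up for illustration):

```shell
# Hypothetical demo: print RSTART, RLENGTH and the matched text for two regexes.
show_match() {
  echo 'NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg' |
  awk '{
      # [^:]+ stops at the first colon after NM_
      if (match($0, /NM_[^:]+/))
          print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
      # .* is greedy, so this one runs on to the LAST colon in the string
      if (match($0, /NM_.*:/))
          print substr($0, RSTART, RLENGTH)
  }'
}

show_match
# 24 11 NM_020469.2
# NM_020469.2:c.802G>A|NP_065202.2:
```

Note also that inside awk a variable is written NM, while $NM means "the field whose number is the value of NM", which is a different thing.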

Last edited by cmccabe; 09-29-2017 at 08:46 AM.. Reason: updated awk and added more comment lines
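
On the two open questions: the trailing 1 in the script above is an always-true pattern whose default action prints the whole record, which is why every input line appears in the output. Splitting $2 on | and looping is one workable approach. A minimal sketch, assuming the header is always line 1 and that only lines containing both an NM_ and an NP_ entry should be printed (the function name extract_with_split is made up for illustration):

```shell
# Hypothetical helper: split field 2 on "|" and keep the first NM_ and NP_ entries.
extract_with_split() {
  awk -F'\t' 'NR > 1 {                        # skip the header line
      nm = np = ""
      n = split($2, part, "|")                # one array element per HGVS description
      for (i = 1; i <= n; i++) {
          if (nm == "" && part[i] ~ /^NM_/)
              nm = part[i]                    # e.g. NM_020469.2:c.802G>A
          else if (np == "" && part[i] ~ /^NP_/)
              np = substr(part[i], index(part[i], ":"))   # e.g. :p.Gly268Arg
      }
      if (nm != "" && np != "")
          print $1, nm np                     # no trailing 1, so nothing else prints
  }' "$@"
}
```

With the file names used in the thread it would be called as extract_with_split input > out.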