awk to extract and print first occurrence of pattern in each line

09-26-2017


I am trying to use awk to extract and print the first occurrence of NM_ and NP_ in each line, with a : before the NP_ part. The input file is tab-delimited, but the output does not need to be. The awk below does execute, but it prints all the lines in the file, not just the patterns. Thank you.

Input file (tab-delimited)

Code:

Input Variant	HGVS description(s)	Errors and warnings
rs41302905	NC_000009.11:g.136131316C>T|NC_000009.12:g.133255929C>T|NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A|XM_005276851.1:c.379G>A|XM_005276850.1:c.379G>A|XM_005276849.1:c.745G>A|XM_005276852.1:c.379G>A|XP_005276908.1:p.Gly127Arg|XP_005276907.1:p.Gly127Arg|XP_005276909.1:p.Gly127Arg|XP_005276906.1:p.Gly249Arg|XP_005276905.1:p.Gly267Arg		
rs8176745	NC_000009.11:g.136131347G>A|NC_000009.12:g.133255960G>A|NG_006669.1:g.21708C>T|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|NW_009646201.1:g.82053G>A|XM_005276852.1:c.348C>T|XM_005276848.1:c.768C>T|XM_005276851.1:c.348C>T|XM_005276850.1:c.348C>T|XM_005276849.1:c.714C>T|XP_005276909.1:p.Pro116=|XP_005276908.1:p.Pro116=|XP_005276907.1:p.Pro116=|XP_005276906.1:p.Pro238=|XP_005276905.1:p.Pro256=

desired output

Code:

rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

awk

Code:

awk -F'\t' '/NM_/{f=1} && /NP_/{f=2} f{ if(/{/){count++}; print":"; if(/}/){count--; if(count==0) exit}}' file

maybe

Code:

awk -F'\t' 'NR > 1 && /NM_/{     # skip header and find NM_ pattern
            match($2,/NM_*]/);   # match value for NM
            NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM
            match($2,/NP_*]/);  # match value for NP
            NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP
                       for(i=1;i<=NM;i++){  # start loop and iterate over each line in file
                          print $1, $NM":"$NP  # print output with : in between each 
                       }  # close block
}1' input > out  # define output

Last edited by cmccabe; 09-26-2017 at 10:33 AM.. Reason: added awk, fixed format

cmccabe


09-26-2017


Consider the use of match(), which can put your selected text into an array that you can print. From the GNU awk (gawk) manual:

Quote:

match(string, regexp [, array])

Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins (one, if it starts at the beginning of string). If no match is found, return zero.

The regexp argument may be either a regexp constant (/…/) or a string constant ("…"). In the latter case, the string is treated as a regexp to be matched. See Computed Regexps for a discussion of the difference between the two forms, and the implications for writing your program correctly.

The order of the first two arguments is the opposite of most other string functions that work with regular expressions, such as sub() and gsub(). It might help to remember that for match(), the order is the same as for the ‘~’ operator: ‘string ~ regexp’.

The match() function sets the predefined variable RSTART to the index. It also sets the predefined variable RLENGTH to the length in characters of the matched substring. If no match is found, RSTART is set to zero, and RLENGTH to -1.

For example:

{
    if ($1 == "FIND")
        regex = $2
    else {
        where = match($0, regex)
        if (where != 0)
            print "Match of", regex, "found at", where, "in", $0
    }
}

This program looks for lines that match the regular expression stored in the variable regex. This regular expression can be changed. If the first word on a line is ‘FIND’, regex is changed to be the second word on that line. Therefore, if given:

FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.

awk prints:

Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.

If array is present, it is cleared, and then the zeroth element of array is set to the entire portion of string matched by regexp. If regexp contains parentheses, the integer-indexed elements of array are set to contain the portion of string matching the corresponding parenthesized subexpression. For example:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2] }'
-| foooo barrrrr

In addition, multidimensional subscripts are available providing the start index and length of each matched subexpression:

$ echo foooobazbarrrrr |
> gawk '{ match($0, /(fo+).+(bar*)/, arr)
> print arr[1], arr[2]
> print arr[1, "start"], arr[1, "length"]
> print arr[2, "start"], arr[2, "length"]
> }'
-| foooo barrrrr
-| 1 5
-| 9 7

There may not be subscripts for the start and index for every parenthesized subexpression, because they may not all have matched text; thus, they should be tested for with the in operator (see Reference to Elements).

The array argument to match() is a gawk extension. In compatibility mode (see Options), using a third argument is a fatal error.
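
Applied to the data in this thread, a minimal sketch that relies only on RSTART and RLENGTH (so it should not need the gawk-only array argument) might look like the following. The string in s is a shortened copy of the second field from the first data line, and the variable names s and out are just placeholders for illustration.

```shell
# Shortened copy of field 2 from the first data line in the thread.
s='NG_006669.1:g.21739G>A|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XM_005276848.1:c.799G>A'

out=$(printf '%s\n' "$s" | awk '{
    # [^|]* stops the match at the next "|", so only one entry is captured.
    if (match($0, /NM_[^|]*/)) {
        print "RSTART=" RSTART, "RLENGTH=" RLENGTH
        print substr($0, RSTART, RLENGTH)
    }
}')
printf '%s\n' "$out"
```

This prints RSTART=24 RLENGTH=20 followed by NM_020469.2:c.802G>A. RSTART is the position of the N in NM_ and RLENGTH is the length of the matched text, so substr($0, RSTART, RLENGTH) returns the match itself with no +1 or -2 adjustment.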

jim mcnamara


09-27-2017


The awk below uses match() as suggested, but it returns multiple lines when it executes. I am not sure what I am doing wrong. Thank you.

EDIT: The awk below seems to address the duplicates, however the entire line still prints. Do I need to split $2 by the | and loop through? Thank you.

awk

Code:

awk -F'\t' 'NR > 1 && ($2 ~ /NM_/ && match($2,/NP_/)) {  # search for NM_ and NP_
    match($2,/NM_.*:/);   # match value for NM_
    NM=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NM, starting with the N (in purple) - this is RSTART and ending at the : - this is RLENGTH, so NM_020469.2
    # Get its length
    lenNM=length(NM)
           match($2,/NP_.:*|/);   # match value for NP_
           NP=substr($2,RSTART+1,RLENGTH-2);  # extract value and read into NP, starting with the p (portion in purple) -this is RSTART and ending at the | -  this is RLENGTH, so NP_065202.2:p.Gly268Arg
       # Get its length
         lenNP=length(NP)
           # Cycle through each line
             for (i=1; i<=$lenNM; i++) {
             print $1, $NM":"$NP  # print output with : in between each 
        }  # close block
}1' input > out

I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 of the input as an example:

The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I also updated the awk with comments.

Last edited by cmccabe; 09-29-2017 at 08:46 AM.. Reason: updated awk and added more comment lines

cmccabe
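
On the question of splitting $2 by | and looping: one possible approach, sketched here with plain awk functions only (split, index, substr), is to keep the first entry that starts with NM_ and the first that starts with NP_. The file name input.tsv and the variable names are placeholders, the sample rows are shortened copies of the input above, and the sketch assumes every entry has the accession:change form seen in that data.

```shell
# Recreate a shortened, tab-delimited copy of the sample input (input.tsv is a placeholder name).
printf 'Input Variant\tHGVS description(s)\tErrors and warnings\n' > input.tsv
printf 'rs41302905\tNC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NW_009646201.1:g.82022C>T|NP_065202.2:p.Gly268Arg|XP_005276908.1:p.Gly127Arg\t\t\n' >> input.tsv
printf 'rs8176745\tNC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=|XP_005276909.1:p.Pro116=\n' >> input.tsv

out=$(awk -F'\t' 'NR > 1 {                       # skip the header line
    nm = np = ""
    n = split($2, entry, "|")                    # one array element per HGVS entry
    for (i = 1; i <= n; i++) {
        if (nm == "" && entry[i] ~ /^NM_/)
            nm = entry[i]                        # first NM_ entry, kept whole
        else if (np == "" && entry[i] ~ /^NP_/)
            np = substr(entry[i], index(entry[i], ":"))   # keep ":p...." only
    }
    if (nm != "" && np != "")
        print $1, nm np
}' input.tsv)
printf '%s\n' "$out"
```

Two details in the earlier attempts explain the unwanted output. The trailing 1 after the closing brace is a pattern that is always true, so awk prints every input line unchanged; leaving it out stops the whole line from printing. Also, $NM and $NP refer to field numbers, not to the variables NM and NP, so the variables should be printed without the $.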
