awk to update specific value in file with match and add +1 to specific digit

12-16-2016


I am trying to use awk to match each NM_ accession in file with $1 of id, which is tab-delimited. The NM_ accession is always on a line of file that starts with > and comes after the second _. When an NM_ accession matches $1 of id, it should be replaced with the value of $2 from id. An NM_ accession may appear more than once in file, as in the example below, but it will always have a match in id.

After the NM_ accession there is a _ followed by a digit (0, 1, 2, etc.). I am trying to prefix that digit with the word exon and add 1 to it. I am not sure whether my awk attempt below helps with the first part at all. Thank you.

file

Code:

>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_NM_001195684_1 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_NM_001195684_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_3 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

id

Code:

NM_001195684    TGFBR3
NM_001206389    FGF8
NM_001197220    PDE4D
NM_001195683   TGFBR3

desired output: each NM_ accession is replaced with $2 from id because it matched $1 of id,
and the digit after it is increased by one and prefixed with the word exon

Code:

>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_TGFBR3_exon2 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon4 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

awk

Code:

awk 'NR==FNR{a[$1];next} {k=$2; sub(/_.*/,"",k)} k in a' file id

Last edited by cmccabe; 12-17-2016 at 11:33 PM.. Reason: fixed format, added details, fixed typo

cmccabe

12-17-2016

Hello cmccabe,

Could you please try the following and let me know if it helps?

Code:

awk 'FNR==NR{A[$1]=$NF;next} {match($0,/NM_[0-9]+/);Q=substr($0,RSTART,RLENGTH);match($0,/NM_[0-9]+_[0-9]+/);W=substr($0,RSTART,RLENGTH);sub(/.*_/,X,W);if(Q && A[Q]){sub(Q"_",A[Q]"_exon",$0);sub(/exon[0-9]+/,"exon" ++W,$0);print;next};print}'  id  Input_file

Output will be as follows.

Code:

>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_TGFBR3_exon2 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_3 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

EDIT: Looking at your desired output again, I am not sure how the last two header lines got replaced. I could not see the string NM_001195683 in your id file, so my code leaves those lines unchanged. Please explain how they should be matched so that we can help you further.
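A quick way to find accessions that have no entry in id is sketched below. The two small input files are made up for illustration (id.demo and file.demo are hypothetical names), and the demo id file deliberately leaves out NM_001195683.

```shell
# Hypothetical cut-down inputs; the names id.demo / file.demo are for illustration only.
tmp=$(mktemp -d)
printf 'NM_001195684\tTGFBR3\nNM_001206389\tFGF8\n' > "$tmp/id.demo"
printf '%s\n' \
  '>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098' \
  'agaaataaaa' \
  '>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317' \
  'tgcttcctag' > "$tmp/file.demo"

# List every NM_ accession seen in a header line that has no entry in the id file.
missing=$(awk '
  FNR==NR { known[$1]; next }                 # first file: remember each accession
  /^>/ && match($0, /NM_[0-9]+/) {
    acc = substr($0, RSTART, RLENGTH)
    if (!(acc in known) && !seen[acc]++) print acc   # report each one once
  }
' "$tmp/id.demo" "$tmp/file.demo")
echo "$missing"
rm -rf "$tmp"
```

With these demo files the only accession reported is NM_001195683; an empty result would mean every header can be mapped.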

EDIT2: Adding a multi-line form of the solution as well.

Code:

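awk 'FNR==NR {                           # first file (id): map accession to gene name
        A[$1] = $NF
        next
     }
     {
        match($0, /NM_[0-9]+/)            # accession only, e.g. NM_001195684
        Q = substr($0, RSTART, RLENGTH)
        match($0, /NM_[0-9]+_[0-9]+/)     # accession plus exon index
        W = substr($0, RSTART, RLENGTH)
        sub(/.*_/, "", W)                 # keep only the exon index
        if (Q && A[Q]) {
            sub(Q "_", A[Q] "_exon", $0)           # swap accession for gene name
            sub(/exon[0-9]+/, "exon" (W + 1), $0)  # add 1 to the exon index
        }
        print
     }
    ' id  Input_file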
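To see what match() and the RSTART/RLENGTH variables do on a single header line, here is a minimal sketch. It is a variation rather than the exact code above: it splices the new text into the line instead of calling sub() twice, and the gene name is passed in by hand through a hypothetical gene variable instead of being looked up in id.

```shell
header='>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098'
result=$(printf '%s\n' "$header" | awk -v gene=TGFBR3 '
  match($0, /NM_[0-9]+_[0-9]+/) {
    hit = substr($0, RSTART, RLENGTH)     # the matched text: NM_001195684_0
    n = hit
    sub(/.*_/, "", n)                     # strip up to the last _ to leave the index
    # rebuild the line: text before the match, new name, text after the match
    $0 = substr($0, 1, RSTART - 1) gene "_exon" (n + 1) substr($0, RSTART + RLENGTH)
  }
  { print }
')
echo "$result"
```

This prints `>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098`. Lines without an NM_ match, such as the sequence lines, pass through unchanged.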

Thanks,
R. Singh

Last edited by RavinderSingh13; 12-17-2016 at 03:39 AM.. Reason: Added a comment now to ask OP a question about OP's output which is not clear.

RavinderSingh13

12-17-2016

Another way:

Code:

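awk '
  {
    split($1,F,/_/)
  }
  NR==FNR {                 # first file (id): key on both parts of the accession
    A[F[1],F[2]]=$2
    next
  } 
  (F[3],F[4]) in A {
    sub(F[3] "_" F[4] "_" F[5], A[F[3],F[4]] "_exon" F[5]+1)
  }
  NF {                      # skip the empty record before the first >
    print RS $0
  }
' id RS=\> ORS= file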

If you only want to print the ones that matched:

Code:

awk '
  {
    split($1,F,/_/)
  }
  NR==FNR {                 # first file (id): key on both parts of the accession
    A[F[1],F[2]]=$2
    next
  } 
  (F[3],F[4]) in A {
    sub(F[3] "_" F[4] "_" F[5], A[F[3],F[4]] "_exon" F[5]+1)
    print RS $0
  }
' id RS=\> ORS= file
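Setting RS to > makes awk read one whole FASTA entry (header plus sequence lines) as a single record, so $1 is the header name. A small sketch of that splitting, using a made-up two-entry input:

```shell
# Made-up two-entry FASTA input; prints the record number and $1 of each record.
records=$(printf '>a_1 x\nACGT\n>b_2 y\nTTTT\n' | awk -v RS='>' '
  { printf "%d:%s|", NR, $1 }
')
echo "$records"
```

This prints `1:|2:a_1|3:b_2|`. Because the input starts with the separator, record 1 is empty, which is worth keeping in mind when printing every record back with a leading >.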

Scrutinizer

12-17-2016

Thank you both for your help. I fixed the typo in the id file, so all the NM_ accessions should now be found. Thank you.

cmccabe
