awk to update specific value in file with match and add +1 to specific digit


 
# 1  
Old 12-16-2016

I am trying to use awk to match the NM_ value in file against $1 of id, which is tab-delimited. The NM_ value is always on a line of file that starts with > and comes after the second _. When an NM_ value matches $1 of id, the value of $2 in id replaces it. An NM_ value may appear more than once in file, as in the example below, but it will always have a match in id.

After the _ that follows the NM_ value there is a digit (0, 1, 2, etc.). I am trying to put the word exon in front of that digit and add 1 to it. I am not sure whether my awk attempt helps with the first part at all. Thank you :).


file
Code:
>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_NM_001195684_1 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_NM_001195684_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_3 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

id
Code:
NM_001195684    TGFBR3
NM_001206389    FGF8
NM_001197220    PDE4D
NM_001195683   TGFBR3

desired output: the NM_ value is replaced with $2 from id wherever it matches $1 of id,
and the trailing digit is increased by one and prefixed with the word exon (so _0 becomes _exon1)
Code:
>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_TGFBR3_exon2 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_TGFBR3_exon4 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

awk
Code:
awk 'NR==FNR{a[$1];next} {k=$2; sub(/_.*/,"",k)} k in a' file id


Last edited by cmccabe; 12-17-2016 at 11:33 PM.. Reason: fixed format, added details, fixed typo
# 2  
Old 12-17-2016
Hello cmccabe,

Could you please try the following and let me know if this helps.
Code:
awk 'FNR==NR{A[$1]=$NF;next} {match($0,/NM_[0-9]+/);Q=substr($0,RSTART,RLENGTH);match($0,/NM_[0-9]+_[0-9]+/);W=substr($0,RSTART,RLENGTH);sub(/.*_/,X,W);if(Q && A[Q]){sub(Q"_",A[Q]"_exon",$0);sub(/exon[0-9]+/,"exon" ++W,$0);print;next};print}'  id  Input_file

Output will be as follows.
Code:
>hg19_refGene_TGFBR3_exon1 range=chr1:92327018-92327098 5'pad=10 3'pad=10 strand=- repeatMasking=none
agaaataaaaATGACTTCCCATTATGTGATTGCCATCTTTGCCCTGATGA
GCTCCTGTTTAGCCACTGCAGgtaagttgca
>hg19_refGene_TGFBR3_exon2 range=chr1:92262834-92263038 5'pad=10 3'pad=10 strand=- repeatMasking=none
cccttggcagGTCCAGAGCCTGGTGCACTGTGTGAACTGTCACCTGTCAG
TGCCTCCCATCCTGTCCAGGCCTTGATGGAGAGCTTCACTGTTTTGTCAG
GCTGTGCCAGCAGAGGCACAACTGGGCTGCCACAGGAGGTGCATGTCCTG
AATCTCCGCACTGCAGGCCAGGGGCCTGGCCAGCTACAGAGAGAGgtagg
tgcag
>hg19_refGene_TGFBR3_exon3 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_2 range=chr1:92224160-92224317 5'pad=10 3'pad=10 strand=- repeatMasking=none
tgcttcctagGTCACACTTCACCTGAATCCCATCTCCTCAGTCCACATCC
ACCACAAGTCTGTTGTGTTCCTGCTCAACTCCCCACACCCCCTGGTGTGG
CATCTGAAGACAGAGAGACTTGCCACTGGGGTCTCCAGACTGTTTTTGgt
aagtgctt
>hg19_refGene_NM_001195683_3 range=chr1:92200323-92200526 5'pad=10 3'pad=10 strand=- repeatMasking=none
tttcctctagGTGTCTGAGGGTTCTGTGGTCCAGTTTTCATCAGCAAACT
TCTCCTTGACAGCAGAAACAGAAGAAAGGAACTTCCCCCATGGAAATGAA
CATCTGTTAAATTGGGCCCGAAAAGAGTATGGAGCAGTTACTTCATTCAC
CGAACTCAAGATAGCAAGAAACATTTATATTAAAGTGGGGGAAGgtaaat
ttta

EDIT: Looking at your desired output again, I am not sure how the last two header lines got their replacement. I can't see the string NM_001195683 in your id file, so my code leaves those lines unchanged. Could you explain where that replacement comes from so that we can help you further?

EDIT2: Adding a multi-line form of the solution too.
Code:
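awk '
  FNR==NR {                              # first file (id): map accession -> gene
    A[$1]=$NF
    next
  }
  {
    match($0,/NM_[0-9]+/)                # accession, e.g. NM_001195684
    Q=substr($0,RSTART,RLENGTH)
    match($0,/NM_[0-9]+_[0-9]+/)         # accession plus trailing index
    W=substr($0,RSTART,RLENGTH)
    sub(/.*_/,X,W)                       # keep only the index (X is unset, so empty)
    if(Q && A[Q]){
      sub(Q"_",A[Q]"_exon",$0)           # swap accession for gene, add "exon"
      sub(/exon[0-9]+/,"exon" ++W,$0)    # bump the index by one
    }
    print                                # every line is printed, changed or not
  }
' id  Input_file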
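
To see what the two match() calls pull out of a header, here is a minimal sketch run on a single header line from the sample file (shortened for readability). The names acc and idx are only illustrative; the solution above calls them Q and W.

```shell
header='>hg19_refGene_NM_001195684_0 range=chr1:92327018-92327098'

out=$(printf '%s\n' "$header" | awk '{
    # Accession: the first run of NM_ followed by digits.
    match($0, /NM_[0-9]+/)
    acc = substr($0, RSTART, RLENGTH)

    # Accession plus index, then strip everything up to the last underscore.
    match($0, /NM_[0-9]+_[0-9]+/)
    idx = substr($0, RSTART, RLENGTH)
    sub(/.*_/, "", idx)

    # Print the accession, the original index, and the index plus one.
    print acc, idx, idx + 1
}')

printf '%s\n' "$out"
```

The accession is what gets looked up in the id array, and the incremented index is what ends up after the word exon.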

Thanks,
R. Singh

Last edited by RavinderSingh13; 12-17-2016 at 03:39 AM.. Reason: Added a comment now to ask OP a question about OP's output which is not clear.
# 3  
Old 12-17-2016
Another way:
Code:
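awk '
  {
    split($1,F,/_/)
  }
  NR==FNR {                # first file (id): key on both parts of the NM_ value
    A[F[1],F[2]]=$2
    next
  }
  (F[3],F[4]) in A {
    sub(F[3] "_" F[4] "_" F[5], A[F[3],F[4]] "_exon" F[5]+1)
  }
  FNR>1 {                  # skip the empty record before the first >
    print RS $0
  }
' id RS=\> ORS= file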
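
A note on the RS=\> assignment used above: it makes awk treat each entry (the header plus its sequence lines) as one record, so $1 is the header token. Here is a minimal sketch with made-up two-entry input; the ids id_1 and id_2 are placeholders, not taken from the sample data.

```shell
out=$(printf '>id_1 desc\nACGT\n>id_2 desc\nTTGA\n' | awk -v RS='>' '{
    # $1 is the first whitespace-delimited token of the whole record.
    printf "%d:[%s]\n", NR, $1
}')

printf '%s\n' "$out"
```

Because the input starts with >, the first record comes out empty, which is worth keeping in mind when every record is printed.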

If you only want to print the records that matched:
Code:
awk '
  {
    split($1,F,/_/)
  }
  NR==FNR {                # first file (id): key on both parts of the NM_ value
    A[F[1],F[2]]=$2
    next
  }
  (F[3],F[4]) in A {
    sub(F[3] "_" F[4] "_" F[5], A[F[3],F[4]] "_exon" F[5]+1)
    print RS $0
  }
' id RS=\> ORS= file
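
The lookup above relies on awk's multi-part array subscripts: A[F[1],F[2]] stores the gene under a key built from both pieces of the NM_ value, and (F[3],F[4]) in A tests for that key without creating it. A small sketch of just that mechanism, using one line of the id data as input:

```shell
out=$(printf 'NM_001195684\tTGFBR3\n' | awk '{
    # Split the first field on "_" and store the gene under a two-part key.
    split($1, F, /_/)
    A[F[1], F[2]] = $2
}
END {
    # A two-part key is tested with the parenthesised (i, j) in A form.
    if (("NM", "001195684") in A) print "hit:", A["NM", "001195684"]
    if (!(("NM", "001195683") in A)) print "miss"
}')

printf '%s\n' "$out"
```

The second test reports a miss because only NM_001195684 was loaded, which is the same situation as a header whose NM_ value is absent from id.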

# 4  
Old 12-17-2016
Thank you both for your help. I fixed the typo in the id file, so all of the NM_ values should now be found. Thank you :).