awk to update value based on pattern match in another file


 
# 1  
Old 09-05-2017

With thanks to @RavinderSingh13 for the earlier help, the awk below is my attempt, hopefully close, to update the one-letter codes in $12 of the tab-delimited file2 with the matching three-letter codes from the space-delimited file1 ($1 is the one-letter code, $2 the three-letter code). I have added comments for each line as well. Thank you.

awk
Code:
awk '$12 == /NM_/{                                              # search $12 for pattern NM_
            match($12,/p..*/);                                  # using match the regex will match in $12 from the p. to the end
            VAL=substr($12,RSTART+1,RLENGTH-2);                 # Putting the values of subtring whose value starts from RSTART+1 to RSTART-2
            for(i=1;i<=num;i++){                                # Starting a loop which will start from value 1 of variable i to till value of variable num
            awk 'FNR==NR {a[$1]=$2; next} a[$i]{VAL=a[$i]}      # Store $2 value from file1 in array a and update array i matching VAL
                }                                               # close block
                    next                                        # process next line
                    }1' OFS="\t" file1 FS'\t' file2             # define file1 as space delimited and file2 as tab-delimited with the output also being tab-delimited

file1
Code:
C Cys
D Asp
V Val
W Trp
Y Tyr

file2
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147
1	chr1	948846	948846	-	A	upstream	ISG15	dist=1	.	.	.	rs3841266
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101:c.-84C>G	.	.	.	rs4615788
3	chr1	948921	948921	T	C	UTR5	ISG15	NM_005101:c.-33T>C	.	.	.	rs15842
4	chr1	949597	949597	C	T	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.237C>T:p.D79D	rs61766284
5	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.V98V	rs8997
6	chr1	1269554	1269554	T	C	exonic	TAS1R3	.	.	nonsynonymous SNV	TAS1R3:NM_152228:exon6:c.2269T>C:p.C757R	rs307377

Desired output: the lines with R_Index 4, 5, and 6 are updated because the one-letter codes before and after the digits match $1 in file1, so each is replaced with the three-letter code from $2.

Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147
1	chr1	948846	948846	-	A	upstream	ISG15	dist=1	.	.	.	rs3841266
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101:c.-84C>G	.	.	.	rs4615788
3	chr1	948921	948921	T	C	UTR5	ISG15	NM_005101:c.-33T>C	.	.	.	rs15842
4	chr1	949597	949597	C	T	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.237C>T:p.Asp79Asp	rs61766284
5	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.Val98Val	rs8997
6	chr1	1269554	1269554	T	C	exonic	TAS1R3	.	.	nonsynonymous SNV	TAS1R3:NM_152228:exon6:c.2269T>C:p.Cys757Arg	rs307377


Last edited by cmccabe; 09-05-2017 at 10:02 AM.. Reason: fixed format
# 2  
Old 09-05-2017
To get you started
Code:
awk '
  # BEGIN runs before any of the input files is opened
  BEGIN { FS=OFS="\t" }
  # The input files are processed one by one and the following code runs for each line
  # FNR is equal to NR when processing file1
  # a[ ] is indexed by the one letter code, its value is the three letter code
  FNR==NR { a[$1]=$2; next }
  # The next goes to the next input cycle
  # The following code runs for file2 (and further files)
  $12 ~ /:NM_/{                                                 # search $12 for pattern :NM_
            match($12,/p..*/)                                   # using match the regex will match in $12 from the p. to the end
            VAL=substr($12,RSTART+1,RLENGTH-2)                  # Put in VAL the substring of $12 that starts at RSTART+1 and is RLENGTH-2 characters long
  # more to come
  }
  { print }
' file1 file2
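
To see what the match()/substr() pair in the skeleton above actually extracts, here is a minimal sketch run on one AAChange value taken from file2. The shell variables s and out are names chosen only for this illustration.

```shell
s='ISG15:NM_005101:exon2:c.237C>T:p.D79D'
out=$(printf '%s\n' "$s" | awk '{
  match($0, /p..*/)                        # sets RSTART (start of match) and RLENGTH (its length)
  print RSTART, RLENGTH                    # position of "p" and length from there to the end
  print substr($0, RSTART, RLENGTH)        # the whole match, "p." included
  print substr($0, RSTART+1, RLENGTH-2)    # what the skeleton stores in VAL
  print substr($0, RSTART+2)               # everything after "p."
}')
printf '%s\n' "$out"
```

With RSTART+1 and RLENGTH-2 the substring keeps the dot and loses the last letter, so substr($12, RSTART+2) is the form that returns only the part after p.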

# 3  
Old 09-06-2017
Both awk commands below execute, but neither gives the intended result.

I added some more comments to both as well. Thank you.

awk 1
Code:
awk '
  # BEGIN runs before any of the input files is opened
  BEGIN { FS=OFS="\t" }
  # The input files are processed one by one and the following code runs for each line
  # FNR is equal to NR when processing file1
  # a[ ] is indexed by the one letter code, its value is the three letter code
  FNR==NR { a[$1]=$2; next }
  # The next goes to the next input cycle
  # The following code runs for file2 (and further files)
  $12 ~ /:NM_/{                                                 # search $12 for pattern :NM_
            match($12,/p..*/)                                   # using match the regex will match in $12 from the p. to the end
            VAL=substr($12,RSTART+1,RLENGTH-2)                  # Put in VAL the substring of $12 that starts at RSTART+1 and is RLENGTH-2 characters long
  # update one letter code to three letter by storing the value of $12 in array then updating if it matches file1
  { if(a[$12]){$12=a[$12] };
  }
  print}' file1 file2 > out

out
Code:
4	chr1	949597	949597	C	T	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.237C>T:p.D79D	rs61766284
5	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.V98V	rs8997

awk 2
Code:
# store value of $2 in file1 in array A and update each sub-pattern of p. matching $12 in file2
awk 'FNR==NR {A[$1]=$2; next}  $12 in A {sub ($12, $12 "p." A[$12]) }1' OFS="\t" file1 FS='\t' file2 > out

out
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147
1	chr1	948846	948846	-	A	upstream	ISG15	dist=1	.	.	.	rs3841266
2	chr1	948870	948870	C	G	UTR5	ISG15	NM_005101:c.-84C>G	.	.	.	rs4615788
3	chr1	948921	948921	T	C	UTR5	ISG15	NM_005101:c.-33T>C	.	.	.	rs15842
4	chr1	949597	949597	C	T	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.237C>T:p.D79D	rs61766284
5	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.V98V	rs8997

# 4  
Old 09-06-2017
You need to loop over each character, as you may have intended in post #1, and use substr() to cut out a single character.
Code:
awk '
  BEGIN { OFS="\t" }
  # The input files are processed one by one and the following code runs for each line
  # FNR is equal to NR when processing file1
  # a[ ] is indexed by the one letter code, its value is the three letter code
  FNR==NR { a[$1]=$2; next }
  # The next goes to the next input cycle
  # The following code runs for file2 (and further files)
  ($12 ~ /:NM_/ && match($12,/p..*/)) {  # search for :NM_ and p..*
    # Get the substring after p.
    VAL=substr($12,RSTART+2)
    # Get its length
    lenVAL=length(VAL)
    ostring=""
    # Cycle through each character, append to ostring, if in a[ ] replace by its value
    for (i=1; i<=lenVAL; i++) {
      c=substr(VAL,i,1)
      ostring=(ostring ((c in a) ? a[c] : c))
    }
    # copy ostring back to $12 (unconditionally), retaining the part up to p.
    $12=(substr($12,1,RSTART+1) ostring)
  }
  # always print
  { print }
' file1 FS="\t" file2

I have gone back to setting FS="\t" on the command line after file1, so that it only takes effect for file2; file1 might not be TAB-separated.
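
The effect of placing the FS assignment between the two file names can be checked with a small sketch. The temporary files f1 and f2 are made up for this demonstration. awk applies a var=value operand when it reaches it in the argument list, so the first file is still split on whitespace and only the second on TABs.

```shell
tmp=$(mktemp -d)
printf 'C Cys\n' > "$tmp/f1"        # space-delimited, like file1
printf 'a b\tc d\n' > "$tmp/f2"     # TAB-delimited, with spaces inside the fields
# Print the field count and first field for each line, labelled with the file name
out=$(cd "$tmp" && awk '{ print FILENAME ": NF=" NF ", $1=[" $1 "]" }' f1 FS='\t' f2)
printf '%s\n' "$out"
rm -rf "$tmp"
```

Both lines report NF=2, but for f2 the first field is "a b" with its embedded space, which is why a value such as "synonymous SNV" stays in one field of file2.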
# 5  
Old 09-12-2017
Thank you very much.

---------- Post updated 09-12-17 at 07:45 AM ---------- Previous update was 09-11-17 at 05:19 PM ----------

A line with multiple NM_ values in $12, separated by a ;, seems to change a matching letter in c. as well. I tried adding ; as an FS, but that splits each entry into multiple tab-separated fields in the output. I also tried adding a break after the ostring=(ostring ((c in a) ? a[c] : c)) line, thinking that would process each, then break, and loop to the next, but that only processed one and then stopped. Maybe I added it in the wrong place, or is there a better way? Thank you.

line
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147
1	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.V98V;ISG15:NM_005101:exon2:c.237C>T:p.D79D	rs8997

current output
Code:
R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147
1	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous SNV	ISG15:NM_005101:exon2:c.294A>G:p.Val98Val;ISG15:NM_005101:exon2:c.237Cys>T:p.Asp79Asp	rs8997


Last edited by cmccabe; 09-12-2017 at 09:47 AM.. Reason: fixed format
# 6  
Old 09-12-2017
Hello cmccabe,

Could you please try the following and let me know if it helps.
Code:
awk 'FNR==NR{b[$1]=$2;next} {if($13 ~ /NM_/){va=$13;match($13,/.*\./);sub(/.*\./,"",va);num1=split(va,c,"");for(i=1;i<=num1;i++){val=c[i] in b?val b[c[i]]:val c[i]};$13=substr($13,RSTART,RLENGTH) val;val=""}} 1' OFS="\t" FILE1 FILE2

EDIT: Adding a multi-line form of the solution as well.
Code:
awk '
FNR==NR{
  b[$1]=$2;
  next
}
{
  if($13 ~ /NM_/){
    va=$13;
    match($13,/.*\./);
    sub(/.*\./,"",va);
    num1=split(va,c,"");
    for(i=1;i<=num1;i++){
      val=c[i] in b?val b[c[i]]:val c[i]
    };
    $13=substr($13,RSTART,RLENGTH) val;
    val=""
  }
}
1' OFS="\t" FILE1 FILE2

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-12-2017 at 10:52 AM.. Reason: Adding a non-one liner form of solution too successfully now.
# 7  
Old 09-12-2017
I'm not sure I follow completely, but it is close. Line 3 has the multiple NM_ entries, but only the second p. looks to be updated. Can you add comments if possible? Thank you.

Code:
1	chr1	949597	949597	C	T	exonic	ISG15	.	.	synonymous	SNV	ISG15:NM_005101:exon2:c.237C>T:p.Asp79Asp	rs61766284
2	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous	SNV	ISG15:NM_005101:exon2:c.294A>G:p.Val98Val	rs8997
3	chr1	949654	949654	A	G	exonic	ISG15	.	.	synonymous	SNV	ISG15:NM_005101:exon2:c.294A>G:p.V98V;ISG15:NM_005101:exon2:c.237C>T:p.Asp79Asp	rs8997
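
The two behaviours seen so far have the same cause. The script in post #4 translates everything after the first p., so the C in the second entry's c.237C>T also becomes Cys. In post #6, /.*\./ is greedy and matches up to the last dot, so only the text after the final p. is translated. That script also uses $13 rather than $12 because, with the default whitespace splitting, "synonymous SNV" counts as two fields. One possible approach, sketched below with commentary, is to split the field on ; and translate only what follows :p. in each entry. The function name three and the variables e, n, j, and new are hypothetical, and the sample file1 is cut down to the codes needed here.

```shell
tmp=$(mktemp -d)
printf 'C Cys\nD Asp\nV Val\n' > "$tmp/file1"
printf '1\tchr1\t949654\t949654\tA\tG\texonic\tISG15\t.\t.\tsynonymous SNV\tISG15:NM_005101:exon2:c.294A>G:p.V98V;ISG15:NM_005101:exon2:c.237C>T:p.D79D\trs8997\n' > "$tmp/file2"

out=$(awk '
  # three() is a hypothetical helper: replace each character of s that is an index of a[ ]
  function three(s,    i, c, r) {
    r = ""
    for (i = 1; i <= length(s); i++) {
      c = substr(s, i, 1)
      r = r ((c in a) ? a[c] : c)
    }
    return r
  }
  BEGIN { OFS = "\t" }
  FNR == NR { a[$1] = $2; next }     # file1: one-letter code -> three-letter code
  $12 ~ /NM_/ {
    n = split($12, e, ";")           # one array element per ;-separated entry
    new = ""
    for (j = 1; j <= n; j++) {
      if (match(e[j], /:p\./))       # translate only the part after ":p." in this entry
        e[j] = substr(e[j], 1, RSTART + 2) three(substr(e[j], RSTART + 3))
      new = new (j > 1 ? ";" : "") e[j]
    }
    $12 = new                        # rebuild the field with the original ; separators
  }
  { print }
' "$tmp/file1" FS='\t' "$tmp/file2")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Because each entry is handled on its own, the c. part is never passed to the translation, and entries without a :p. part are copied through unchanged.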
