awk to update file based on match in 3 fields


 
# 1  
Old 03-22-2018
awk to update file based on match in 3 fields using two files

I am trying to use awk to store the value of $5 from file1 in array x. x[2] (index 2 skips the header in file1) is then used to search $4 of file1 for a match. Since $4 can hold multiple strings separated by a , (comma), I split it and iterate through each piece looking for a match against x[2]. From each matching piece, the c. pattern is extracted and stored as VAL.
The awk below will hopefully do that, but where I am struggling is the update: for $6 in file1 to be updated, $2, $3, and VAL must match $5, $6, and $9 in file2. If they do, $6 in file1 is updated with the values of $1 and $3 from file2. Row 4 (R_Index 4) is an example: NM_000138.4 matches x[2], and one of its c. values (the text from c. up to the : (colon)), c.5546-1G>T, matches $9 in file2, so all the conditions are met to update $6 in file1. Row 2 satisfies everything except the c. value: its VAL (c.5546-1G>A) does not match $9 in file2, so $6 is not updated. I hope this is a good start and that I didn't over-complicate things (though I may have, and there is probably a better way). Thank you :).


file1 tab-delimited
Code:
R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	.	.

file2 tab-delimited
Code:
HGMD ID	Disease	Variant Class	Gene Symbol	chromosome	coordinate start	coordinate end	strand	hgvs
CS057006	Marfan syndrome	DM	FBN1	chr15	48802240	48802240	-	c.1714+1G>T
CS057007	Marfan syndrome	DM	FBN1	chr15	48797346	48797346	-	c.1838-2A>G
CS057008	Marfan syndrome	DM	FBN1	chr15	48741091	48741091	-	c.5546-1G>T

desired output tab-delimited
Code:
R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	CS057008 DM	.

Code:
awk '
  BEGIN { FS=OFS="\t" }
        FNR==NR {x[NR]=$5}  # store $5 of each line in array x
        #{print X[2]} 
         $4 ~ x[2] {      # if $4 matches x[2]
         match($4,"NM"].*],);  # regex match from NM up to the , in the 4th field
         val=substr($4,RSTART+1,RLENGTH-2); # store the substring of $4 starting at RSTART+1 with length RLENGTH-2 in val
         NM=split($4, array,",");   # split $4 on "," into array and store the number of pieces in NM
             for(i=1;i<=NM;i++){ # loop i from 1 to NM
              if(array[i] ~  x[2]){  # check whether this piece matches x[2] (index 2 skips the header)
                if (match(NM[i],/c[.].:/)) {  # look for the c. pattern in each piece, from c. to :
                    VAL=substr(NM[n],RSTART+2) # store the c. value from the piece in VAL
                              }
                             }
                            }
                           }
                              {a[$5,$6,$9]=$1,$3; next} a[$2,$3]{$6=a[$2,$3]}1' file1 file2  # update $6 in file1 if condition is met


Last edited by cmccabe; 03-23-2018 at 12:36 PM.. Reason: added details, updated thread title
# 2  
Old 03-25-2018
A few things here.

1. Process file2 first so that a[$5,$6,$9] is already populated when file1 is processed.
2. Why store x[NR] when you are only interested in x[2]? I replaced it with FNR==2 {x=$5}, which stores $5 from the 2nd line of file1 only.
3. If $4 matches x, split $4 into array[] and then try to match each "c." value against the a[] keys stored from file2.
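
To see point 1 in isolation, here is a minimal sketch, assuming a one-line stand-in for file2 fed in through printf (the inline data and the shell variable `hit` are illustrative only, not part of the original files). It builds the lookup keyed on chromosome, start coordinate, and hgvs, then tests a single composite key for membership:

```shell
# Build a lookup keyed on (chromosome, coordinate start, hgvs) from a
# one-line stand-in for file2, then test one composite key in END.
hit=$(printf 'CS057008\tMarfan syndrome\tDM\tFBN1\tchr15\t48741091\t48741091\t-\tc.5546-1G>T\n' |
awk -F'\t' '
    { a[$5,$6,$9] = $1 " " $3 }            # value: HGMD ID and variant class
    END {
        if (("chr15","48741091","c.5546-1G>T") in a)
            print a["chr15","48741091","c.5546-1G>T"]
        else
            print "no match"
    }')
printf '%s\n' "$hit"
```

Using `(i,j,k) in a` for the test avoids creating an empty element as a side effect, which a plain `a[i,j,k]` reference would do.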


Code:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR {a[$5,$6,$9]=$1" "$3; next}   # file2: key on chromosome, start, hgvs
FNR==2 {x=$5}  # file1: store value of $5 from line 2 (first data line)

$4 ~ x {      # if $4 matches x
    NM=split($4, array,",");   # split $4 on "," and store the pieces in array[]
    for(i=1;i<=NM;i++){ # loop through each piece from the split above
        if(index(array[i], x) > 0) {  # if x occurs in this element
            if (match(array[i], "c[.].*:")) {  # match from c. up to the last :
                VAL=substr(array[i], RSTART, RLENGTH-1) # extract the c. value, dropping the trailing :
                if(($2,$3,VAL) in a) $6=a[$2,$3,VAL]  # if this key is in a[], update $6
            }
        }
    }
}
1' file2 file1  # print every line of file1, with $6 updated where the condition is met
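
The extraction in point 3 can also be checked on its own. This is a minimal sketch, assuming the $4 value from row 4 of file1 is passed in as an awk variable (the names `field`, `part`, and `vals` are illustrative only). It uses `c[.][^:]*` rather than `c[.].*:`, a slightly stricter pattern that stops at the first colon, so there is no trailing `:` to trim:

```shell
field='FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?'
vals=$(awk -v field="$field" 'BEGIN {
    n = split(field, part, ",")             # one element per annotation
    for (i = 1; i <= n; i++)
        if (match(part[i], /c[.][^:]*/))    # from "c." up to the next ":"
            print substr(part[i], RSTART, RLENGTH)
}')
printf '%s\n' "$vals"
```

Each printed value is what would be compared against $9 of file2; for this row only the second one (c.5546-1G>T) is present there.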

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 03-28-2018
Thank you very much :).