awk to update file based on match in 3 fields

Tags
solved

    #1  
Old Unix and Linux 4 Weeks Ago   -   Original Discussion by cmccabe
cmccabe
awk to update file based on match in 3 fields using two files

I am trying to use awk to store the value of $5 in file1 in array x. Array x is then used to search $4 of file1 for a match (I use x[2] to skip the header in file1). Since $4 can hold multiple strings separated by a , (comma), I split them and iterate through each split looking for a match. Each split is stored in array and compared to x[2]. The c. pattern is then extracted and stored as VAL.
The awk below will hopefully do that, but where I am struggling is the update: for $6 in file1 to be updated, $2, $3, and VAL from file1 must match $5, $6, and $9 in file2. If they do, $6 in file1 is updated with the values of $1 and $3 from file2. Line 4 is an example: NM_000138.4 matches x[2], and the c. value up to the : (colon) in the second split matches $9 in file2, so all the conditions are met and $6 in file1 is updated. Line 2 satisfies everything except the c. value: VAL does not match $9 in file2, so $6 is not updated. I hope this is a good start and that I didn't over-complicate things (though I may have, and there is probably a better way). Thank you.
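As a small illustration of the split-and-extract step described above, here is a sketch that prints the transcript and c. value of every comma-separated annotation in a $4-style string. The function name extract_c is made up for this example, and it assumes each annotation is colon-separated in the order gene:transcript:exon:c.change:p.change, as in the sample rows.

```shell
# Hypothetical helper: print "transcript c.value" for each annotation in $1.
extract_c() {
  printf '%s\n' "$1" | awk '{
    n = split($0, part, ",")                  # one element per annotation
    for (i = 1; i <= n; i++) {
      m = split(part[i], f, ":")              # assumed layout: gene:transcript:exon:c.:p.
      for (j = 1; j <= m; j++)
        if (f[j] ~ /^c\./) print f[2], f[j]   # transcript and its c. value
    }
  }'
}

extract_c 'FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?'
```

Splitting on : instead of using match() avoids the RSTART/RLENGTH arithmetic, at the cost of relying on the field layout staying the same.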


file1 tab-delimited


Code:
R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	.	.

file2 tab-delimited


Code:
HGMD ID	Disease	Variant Class	Gene Symbol	chromosome	coordinate start	coordinate end	strand	hgvs
CS057006	Marfan syndrome	DM	FBN1	chr15	48802240	48802240	-	c.1714+1G>T
CS057007	Marfan syndrome	DM	FBN1	chr15	48797346	48797346	-	c.1838-2A>G
CS057008	Marfan syndrome	DM	FBN1	chr15	48741091	48741091	-	c.5546-1G>T

desired output tab-delimited


Code:
R_Index	Chr	Start	AAChange.refGeneWithVer	MajorTranscript	HGMD	Sanger
1	chr15	48720526	FBN1:NM_000138.4:exon57:c.6997+17C>G:p.?	NM_000138.4	.	.
2	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?	NM_000138.4	.	.
3	chr15	48807637	FBN1:NM_000138.4:exon12:c.1415G>A:p.Cys472Tyr	NM_000138.4	.	.
4	chr15	48741091	FBN1:NM_000138.4:exon46:c.5546-1G>A:p.?,FBN1:NM_000138.4:exon46:c.5546-1G>T:p.?	NM_000138.4	CS057008 DM	.



Code:
awk '
  BEGIN { FS=OFS="\t" }
        FNR==NR {x[NR]=$5}  # store value in $5 in array x
        #{print X[2]} 
         $4 ~ x[2] {      # if $4 matches x[2]
         match($4,"NM"].*],);  # regex match from NM to till , in 4rd field
         val=substr($4,RSTART+1,RLENGTH-2); # store substring value that starts from RSTART+1 to RLENGTH-2 using $4 in val
         NM=split($4, array,",");   # Split $4 on "," and storing it's length(array's length) to variable named num. 
             for(i=1;i<=NM;i++){ # Starting a loop which will start from value 1 of variable i to till value of variable num
              if(array[i] ~  x[2]){  # Check condition if any array's value is equal to array x[2] skipping header
                if (match(NM[i],/c[.].:/)) {  # extract pattern c. in each split from c. to :
                    VAL=substr(NM[n],RSTART+2) # store each c. from split in VAL
                              }
                             }
                            }
                           }
                              {a[$5,$6,$9]=$1,$3; next} a[$2,$3]{$6=a[$2,$3]}1' file1 file2  # update $6 in file1 if condition is met


Last edited by cmccabe; 4 Weeks Ago at 11:36 AM.. Reason: added details, updated thread title
    #2  
Old Unix and Linux 4 Weeks Ago   -   Original Discussion by cmccabe
Chubler_XL (Forum Staff, Moderator)
A few things here.

1. Process file2 first so that a[$5,$6,$9] is already populated when processing file1.
2. Why store x[NR] when you are only interested in x[2]? I replaced it with FNR==2 {x=$5} to store only $5 from the 2nd line of file1.
3. If $4 matches x, split $4 into array and then attempt to match the "c." value against the a[] values stored from file2.
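To see point 1 in isolation, here is a minimal sketch of a three-part key built from one file2 row and then tested with in. The function name lookup_demo is only for illustration, and the row is copied from the sample file2 above.

```shell
# Hypothetical demo: build a[$5,$6,$9] from one file2 row, then probe it.
lookup_demo() {
  printf 'CS057008\tMarfan syndrome\tDM\tFBN1\tchr15\t48741091\t48741091\t-\tc.5546-1G>T\n' |
  awk -F'\t' '
    { a[$5,$6,$9] = $1 " " $3 }    # key: chromosome, start, hgvs
    END {
      if (("chr15","48741091","c.5546-1G>T") in a)
        print "hit:", a["chr15","48741091","c.5546-1G>T"]
      if (!(("chr15","48741091","c.5546-1G>A") in a))
        print "miss: c.5546-1G>A"
    }'
}

lookup_demo
```

Testing with in does not create the element, whereas referencing a[...] directly in a condition would add an empty entry for every key that is looked up.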




Code:
awk '
BEGIN { FS=OFS="\t" }
FNR==NR {a[$5,$6,$9]=$1" "$3; next}  # file2: key on chromosome, start, hgvs
FNR==2 {x=$5}  # store value of $5 from line 2 of file1

$4 ~ x {      # if $4 matches x
    NM=split($4, array,",");   # split $4 on "," into array[]; NM is the number of elements
    for(i=1;i<=NM;i++){ # loop through each split value from above
        if(index(array[i], x) > 0) {  # if x is in this element
            if (match(array[i], "c[.].*:")) {  # match from c. through the last :
                VAL=substr(array[i], RSTART, RLENGTH-1) # extract the c. value, dropping the trailing :
                if(($2,$3,VAL) in a) $6=a[$2,$3,VAL]  # if this key is in a[], update $6
            }
        }
    }
}
1' file2 file1  # update $6 in file1 if condition is met
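The script above compares every row against the $5 taken from line 2 of file1. If the transcript in $5 can differ from row to row, a possible variant is to compare each row against its own $5. This is only a sketch under two assumptions not stated in the thread: that per-row comparison is what is wanted, and that the c. value is always the 4th colon-separated piece of each annotation. The wrapper name update_hgmd is made up for the example.

```shell
# Hypothetical wrapper; usage: update_hgmd file2 file1
update_hgmd() {
  awk '
  BEGIN { FS = OFS = "\t" }
  FNR == NR { a[$5,$6,$9] = $1 " " $3; next }   # file2: key on chromosome, start, hgvs
  FNR > 1 {                                     # file1 data rows; the header passes through
    n = split($4, part, ",")
    for (i = 1; i <= n; i++) {
      m = split(part[i], f, ":")                # assumed: f[2] = transcript, f[4] = c. value
      if (m >= 4 && f[2] == $5 && (($2, $3, f[4]) in a))
        $6 = a[$2, $3, f[4]]                    # update HGMD column on a full match
    }
  }
  1' "$1" "$2"
}
```

It would be called as update_hgmd file2 file1, with file2 first for the same reason as in the script above.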

    #3  
Old Unix and Linux 3 Weeks Ago   -   Original Discussion by cmccabe
cmccabe
Thank you very much.