Perl to update field in file based on match to another file

    #1  
07-07-2017
cmccabe cmccabe is offline
Registered User
 
Join Date: Nov 2013
Last Activity: 13 September 2017, 7:30 PM EDT
Location: Chicago
Posts: 1,176
Thanks: 705
Thanked 15 Times in 14 Posts

In the perl below I am trying to set/update the value of $14 (the last field) in file2, by matching the NM_ accession in $12 or $9 of file2 against the NM_ accession in $2 of file1.
The lengths of $9 and $12 vary, but each entry of interest always starts with NM_ and always ends with a ; (semicolon) or the end of the field (if it is the last entry).
What is extracted into $14 is all the text from the NM_ up to, but not including, the ; or the end of the field. The value in $7 determines which field to extract from: if $7 is exonic then $12 is used, otherwise $9 is used.
There will always be a value in $7. It is exonic the majority of the time, but not always.
This is what seems to be happening in the code:
The NM_ value in $2 of file1, after the version is split off at the ., is matched as a substring of $12 (the majority of the time) or $9 (in some cases). The entry that matches is extracted starting from the NM_ up to the ; or the end of the field (if it is the last entry, as in line 2 of the example).
The text in $7 of file2 determines the field to extract from: if $7 is exonic then $12 is used, otherwise $9 is used. The extracted value replaces the . in $14 (the last field).
My questions are:
Why does the Sanger column header in $14 (last field) get removed? Does the header row need to be skipped?
Why does the rs3841266 after the . in line 1 get removed?
Since the last field in line 1 is empty, a . (dot) should result.
I cannot seem to add these 3 things to the script to get the desired output. Thank you.
file1 (space-delimited)

Code:
ATP13A2 NM_022089.3
PPT1 NM_000310.3
ISG15 NM_005101.3

file2 (tab-delimited)

Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 .
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .

current file2 after the perl script is executed (tab-delimited): the rs3841266 after the . in line 1 is removed, Sanger is removed from the last field of the header row,
and since the last field in line 1 is empty, a . should result

Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 
1 chr1 948846 948846 - A upstream ISG15 . . . . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

desired output of file2 after the script is executed (tab-delimited)

Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266 . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

perl

Code:
perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt

    #2  
07-07-2017
durden_tyler durden_tyler is offline Forum Advisor  
Registered User
 
Join Date: Apr 2009
Last Activity: 9 September 2017, 1:30 PM EDT
Posts: 2,083
Thanks: 21
Thanked 383 Times in 346 Posts
Quote:
Originally Posted by cmccabe View Post
...
...
My question is why does the Sanger column header in $14 (last field) get removed
...
...
Because of the "pop @F" in your code. It is the second-to-last statement in the command below.


Code:
perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt

Here's the documentation of the "pop" function: pop - perldoc.perl.org
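
To make the effect concrete, here is a minimal sketch of what happens to each line. The pop_last_field helper and the shortened row tails are only for illustration and are not part of the original command:

```perl
use strict;
use warnings;

# Mimic what the one-liner does to each line: split on tabs, pop, re-join.
sub pop_last_field {
    my ($line) = @_;
    my @F = split /\t/, $line;   # the same split that -a with -F/\t/ performs
    pop @F;                      # removes (and returns) the last element of @F
    return join "\t", @F;
}

# Tail of the header row: "Sanger" is the last field, so it is dropped.
print pop_last_field("AAChange.refGene\tavsnp147\tSanger"), "\n";
# Tail of data row 1: it has only 13 fields, so "rs3841266" is last and is dropped.
print pop_last_field(".\t.\trs3841266"), "\n";
# Tail of a 14-field row: only the "." placeholder is dropped.
print pop_last_field(".\trs3738815\t."), "\n";
```

pop always removes the last element of the array, whatever it happens to be, which is why a row with 13 fields loses real data while a row with 14 fields only loses its placeholder.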


Quote:
Originally Posted by cmccabe View Post
...
---- does the header row need to be skipped ----
...
Skipping the header will retain the "Sanger" column header.
And the "pop" will then remove the last column from the remaining rows.

Quote:
Originally Posted by cmccabe View Post
...
why does the rs3841266 after the . in line 1 get removed
...
...
For the same reason the "Sanger" column header gets removed - the "pop" function.

Quote:
Originally Posted by cmccabe View Post
...
...
since the last field in line 1 is empty, a . (dot) should result

...
...
I did not understand this statement.
The last field in line 1 of "file2.txt" is "Sanger". It is not empty.
    #3  
07-07-2017
cmccabe cmccabe is offline
I apologize, I meant line 1 after the header: if the last field is blank then a . (dot) results.
R_Index 1 will always be the first line with data in it and has an index, as the header row does not get an index. Thank you very much, that helps with questions 1 and 2.

Last edited by cmccabe; 07-07-2017 at 06:22 PM..
    #4  
07-07-2017
durden_tyler durden_tyler is offline Forum Advisor  
Quote:
Originally Posted by cmccabe View Post
...
...
if the last field is blank then a .(dot) results.
...
For line 2 of "file2.txt" (the first data row, R_Index 1), this:


Code:
$F[$F[6]=~/exonic/?11:8]

returns $F[8] which is "."

However, this:


Code:
grep {$x=~/$_/} keys %m

does not return anything because none of the keys of hash %m (shown below)


Code:
'NM_005101'
'NM_000310'
'NM_022089'

exist in the string "."
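
A minimal sketch of that lookup, assuming the three file1 lines from the first post. The matching_entries helper is a hypothetical name used only to isolate the grep logic:

```perl
use strict;
use warnings;

# The three lines of file1 from the first post.
my @file1 = ("ATP13A2 NM_022089.3", "PPT1 NM_000310.3", "ISG15 NM_005101.3");

# Same idea as the BEGIN block: split on whitespace or ".", keep accession => gene.
my %m = map { my @p = split /[\s.]/; ($p[1] => $p[0]) } @file1;
print join(", ", sort keys %m), "\n";

# Return the ";"-separated entries of $field that contain any key of %m.
sub matching_entries {
    my ($field) = @_;
    return grep { my $x = $_; grep { $x =~ /\Q$_\E/ } keys %m } split /;/, $field;
}

# A field that contains a known accession, and the "." placeholder that does not.
my ($hit)  = matching_entries("NM_001142604:c.-83A>G;NM_000310:c.-83A>G");
my ($miss) = matching_entries(".");
print "hit:  $hit\n";
print "miss: ", (defined $miss ? $miss : "undef"), "\n";
```

When grep returns an empty list, the list assignment leaves the variable undefined, so there is nothing to print in the last column.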

Therefore the variable $r is undefined, which join treats as an empty string.
And hence, this:


Code:
print join("\t",@F,$r);

appends only an empty field to @F for line 2 of "file2.txt", so the line ends in a tab rather than a . (dot).
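
Putting the points from this thread together, one way to get the desired output is to leave the header row untouched, keep the first 13 fields of every data row instead of calling pop, and fall back to a . when nothing matches. The following is only a sketch under those assumptions: update_line is a hypothetical helper, the sample rows are typed inline instead of being read from file2.txt, every data row is assumed to have at least 13 fields, and the header is assumed to start with R_Index. It is written as a script rather than a one-liner for readability.

```perl
use strict;
use warnings;

# Accession => gene lookup built from file1 (version suffix such as ".3" dropped).
my %m = map { my @p = split /[\s.]/; ($p[1] => $p[0]) }
        ("ATP13A2 NM_022089.3", "PPT1 NM_000310.3", "ISG15 NM_005101.3");

# Hypothetical helper: return one line of file2 with the Sanger field filled in.
sub update_line {
    my ($line) = @_;
    my @F = split /\t/, $line, -1;        # -1 keeps trailing empty fields
    return $line if $F[0] eq 'R_Index';   # header row: return it unchanged

    # Field 12 (AAChange.refGene) for exonic rows, field 9 (GeneDetail.refGene) otherwise.
    my $source = $F[ $F[6] =~ /exonic/ ? 11 : 8 ];

    # First ";"-separated entry that contains an accession from file1, if any.
    my ($r) = grep { my $x = $_; grep { $x =~ /\Q$_\E/ } keys %m } split /;/, $source;
    if (defined $r) { $r =~ s/.*?(NM_.*)$/$1/ }   # keep the text from NM_ onward
    else            { $r = '.' }                  # no match: fall back to "."

    splice @F, 13 if @F > 13;   # keep the first 13 fields, drop any old Sanger value
    return join "\t", @F, $r;
}

# Header, data row 1 (13 fields) and data row 3 (14 fields) from the first post.
my @file2 = (
    join("\t", qw(R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene
                  Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger)),
    join("\t", qw(1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266)),
    join("\t", qw(3 chr1 40562993 40562993 T C UTR5 PPT1),
               'NM_001142604:c.-83A>G;NM_000310:c.-83A>G', qw(. . . rs6600313 .)),
);
print update_line($_), "\n" for @file2;
```

Truncating to a fixed number of fields instead of popping means a row that is missing its 14th field keeps its avsnp147 value. For real use, %m would be filled from file1.txt and update_line called on each line read from file2.txt.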
    #5  
07-13-2017
cmccabe cmccabe is offline
Thank you very much.