Perl to update field in file based of match to another file


 
# 1  
Old 07-07-2017

In the perl below I am trying to set/update the value of $14 (the last field) in file2, by matching the NM_ in $12 or $9 of file2 against the NM_ in $2 of file1.
The lengths of $9 and $12 vary, but the start pattern is always NM_ and the end pattern is always a ; (semi-colon) or the end of the field (if it is the last entry).
What is extracted into $14 is all the text from the NM_ up to the ; or the end of the field. The value in $7 determines which field to use: if $7 is exonic then $12 is used, and if $7 is not exonic then $9 is used.
There will always be a value in $7. It is exonic the majority of the time, but not always.
This is what seems to be happening in the code:
The NM_ value of $2 in file1, after splitting on the ., matches a substring in $12 (the majority of the time) or in $9 (in some cases). The matching substring is extracted from the NM_ up to the ; or the end of the field (if it is the last value, as in line 2 of the example).
The extracted value then replaces the . in $14 (last field).
My questions are:
1. Why does the Sanger column header in $14 (last field) get removed? Does the header row need to be skipped?
2. Why does the rs3841266 after the . in line 1 get removed?
3. Since the last field in line 1 is empty, a . (dot) should result there.
I cannot seem to add these 3 things to the script to get the desired output. Thank you :).
file1 (space-delimited)
Code:
ATP13A2 NM_022089.3
PPT1 NM_000310.3
ISG15 NM_005101.3

file2 (tab-delimited)
Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 .
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 .

current file2 after the perl script is executed (tab-delimited): the rs3841266 after the . in line 1 is removed, Sanger is removed from the last field of the header,
and since the last field in line 1 is empty, a . should result there
Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 
1 chr1 948846 948846 - A upstream ISG15 . . . . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

desired output of file2 after the script is executed (tab-delimited)
Code:
R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 Sanger 
1 chr1 948846 948846 - A upstream ISG15 . . . . rs3841266 . 
2 chr1 17314702 17314702 C T exonic ATP13A2 . . synonymous SNV ATP13A2:NM_001141974:exon24:c.2658G>A:p.S886S;ATP13A2:NM_001141973:exon25:c.2775G>A:p.S925S;ATP13A2:NM_022089:exon25:c.2790G>A:p.S930S rs3738815 NM_022089:exon25:c.2790G>A:p.S930S
3 chr1 40562993 40562993 T C UTR5 PPT1 NM_001142604:c.-83A>G;NM_000310:c.-83A>G . . . rs6600313 NM_000310:c.-83A>G

perl
Code:
perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt

# 2  
Old 07-07-2017
Quote:
Originally Posted by cmccabe
...
...
My question is why does the Sanger column header in $14 (last field) get removed
...
...
Because of the "pop @F" in your code. It is the second-to-last statement in the one-liner below, just before the print.

Code:
perl -i.bak -aF/\\t/ -lne 'BEGIN{%m=map {chomp;(split/[\s\.]/)[1,0]} <STDIN>};($r)=grep {$x=$_;grep {$x=~/$_/} keys %m} (split/\;/,$F[$F[6]=~/exonic/?11:8]);$r=~s/.*?(NM_.*)$/$1/;pop @F;print join("\t",@F,$r)' file2.txt < file1.txt

Here's the documentation of the "pop" function: pop - perldoc.perl.org


Quote:
Originally Posted by cmccabe
...
---- does the header row need to be skipped ----
...
Skipping the header will retain the "Sanger" column header.
And the "pop" will then remove the last column from the remaining rows.

Quote:
Originally Posted by cmccabe
...
why does the rs3841266 after the . in line get removed
...
...
For the same reason the "Sanger" column header gets removed - the "pop" function.

Quote:
Originally Posted by cmccabe
...
...
since the last field in line 1 is empty, a . (dot) should result

...
...
I did not understand this statement.
The last field in line 1 of "file2.txt" is "Sanger". It is not empty.
This User Gave Thanks to durden_tyler For This Post:
# 3  
Old 07-07-2017
I apologize, I meant line 1 after the header: if the last field is blank then a . (dot) should result.
R_Index 1 will always be the first line with data in it, as the header row does not get an index. Thank you very much, that helps with questions 1 and 2 :).

Last edited by cmccabe; 07-07-2017 at 07:22 PM..
# 4  
Old 07-07-2017
Quote:
Originally Posted by cmccabe
...
...
if the last field is blank then a .(dot) results.
...
For line # 2 of "file2.txt", this:

Code:
$F[$F[6]=~/exonic/?11:8]

returns $F[8] which is "."

However, this:

Code:
grep {$x=~/$_/} keys %m

does not return anything because none of the keys of hash %m (shown below)

Code:
'NM_005101'
'NM_000310'
'NM_022089'

exist in the string "."

Therefore the variable $r is left undefined, which Perl prints as an empty string.
And hence, this:

Code:
print join("\t",@F,$r);

appends only an empty field (a trailing tab) to the array @F for line # 2 of "file2.txt", after the "pop" has already removed "rs3841266".
This User Gave Thanks to durden_tyler For This Post:
# 5  
Old 07-13-2017
Thank you very much :).