Perl to update field based on a specific set of rules


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Perl to update field based on a specific set of rules
# 1  
Old 07-21-2017
Perl to update field based on a specific set of rules

In the perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f{8} is extracted and used to update f[55] accordinly along with the value in f[13].
There can be either - * or + before the number that is extracted but the same logic applies, that is if the value is greater than 10 and f[13] is greater than 0.01 f[55] is Likely Benign, if the value is less than 10
and f[13] is less than 0.01 f[55] is VUS in. It is possible for f[13] to be . (dot) but that is the same as zero. However, currently the value does not seem to be extracted by the perl below and I
get different output then the desired. If I comment this rule out it fixes some of the lines but causes other lines to be incorrect. I am not sure what is causing the issue or how to fix it. Thank you Smilie.

I added a description to explain each line of the output and what should be happening, which currently is not Smilie.

file
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    .    .    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    .    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    .    .    .

desired output tab-delimeted

Code:
R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    Likely Benign    .    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    VUS    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    VUS    .    .

description
Code:
line 1 is Likely Benign in f[55] because in NM_001134408:exon3:c.415-43A>G in f[8] the 43 after the minus is extracted and since that value is greater than 10 f[55] is Likely Benign
line 2 is VUS in f[55] because in NM_003239:exon6:c.927-1G>C in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS
line 3 is VUS in f[55] because in NM_001879:exon12:c.1442-5C>G in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS

perl
Code:
# disables certain Perl expressions that could behave unexpectedly or are difficult to debug, turning them into errors
use strict;
use warnings;

# Read and display the first line of the file passed at command line.
print scalar <>;

# Read line by line the file given at the command line, could be the stdin if no file is give as argument.
while (<>)
{  # start conditional block

# Make tokens out of the line, using the tab as separator.
    my @f = split /\t/;

    # Select 4 tokens and define @f.
    my ($FuncrefGene,$PopFreqMax,$GeneDetailrefGene,$ClinSig) = @f[6,13,8,46];

    # Rule1: Default classification VUS.
    my $classification = 'VUS';

    # map to 0 if it doesn't have numeric meaning.
    $GeneDetailrefGene = 0 if $GeneDetailrefGene eq '.';
    $PopFreqMax = 0 if $PopFreqMax eq '.';

    # Rule2: Change to Likely Benign if conditions occurs.
    if ($PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }

     #Rule3: GeneDetail condition
    if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/^\D(\d+)$/) {   # capture the digits after any non-digit into $1
                if ($1 > 10) {   # 
                      $1 //= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
   else {
                 if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
                 }
           }

    # Rule4: ClinVar condition for ACMG
           if ($ClinSig =~ /Pathogenic|Likely pathogenic/i) {
               $classification = 'VUS';
           }

# token 55 is classification.
    $f[55] = $classification;

    # display results and update @f.
    print join "\t", @f;
}   # end conditional block

current output
Code:
R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    Likely Benign    n    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    Likely Benign    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    Likely Benign    .    .


Last edited by cmccabe; 07-21-2017 at 09:50 AM.. Reason: fixed format, added current output
# 2  
Old 07-21-2017
The problem lies in the regular expression you are using with "$GeneDetailrefGene".

See below:
Quote:
Originally Posted by cmccabe
...
...
Code:
 ...
 ...
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
...
 ...

...
...
What you are essentially saying is that in the string $GeneDetailrefGene, look for one of the following:

a) a dot character (".") followed by 1 or more digits [0-9] followed by any one of the characters "+", "*" or "-"
Code:
\.\d+[+*-]

== OR ==

b) a non-digit character
Code:
\D

whichever comes first.

If either a) or b) is followed by 1 or more digits [0-9], then capture those digits and set them to $transcript.

The "whichever comes first" part is crucial here.
If you have a regex like this: "A|B", Perl will match either A or B, whichever comes first - reading the string from left to right.

So, when your program sees this value for the $GeneDetailrefGene:

Code:
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G

it matches the "_001134408" to "\D(\d+)" and therefore sets $transcript to "001134408".

For a clearer explanation, the red part of the regex below matches the red part of the string $GeneDetailrefGene. Ditto with the blue part, which is eventually assigned to $transcript.

Line 1:
Code:
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G
/(?:\.\d+[+*-]|\D)(\d+)/

Line 2:
Code:
NM_003239:exon6:c.927-1G>C
/(?:\.\d+[+*-]|\D)(\d+)/

Line 3:
Code:
NM_001879:exon12:c.1442-5C>G
/(?:\.\d+[+*-]|\D)(\d+)/

This User Gave Thanks to durden_tyler For This Post:
# 3  
Old 07-21-2017
Code:
if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

changed to:
Code:
if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript? I am not sure how to also use f[13} in this rule. In the cases that have multiple f[8] values, like in line 1, the first can be used.

In line 1 f[8] is

NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G
aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004, f[55] is VUS. Thank you Smilie.
# 4  
Old 07-21-2017
Quote:
Originally Posted by cmccabe
...
...
Code:
if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript?
...
...
No, it will not match because the pattern "[+*-]d=" is not present in $GeneDetailrefGene. The characters "d" and "=" do not follow any one of ("+", "*", "-").
Note that in a regex, "d" matches the character "d", but "\d" matches a single digit in the range [0-9].

Also, the stream of 1 or more digits is to be matched before [+*-].
So, you may want to use this regex:

Code:
/(?:\.\d+[+*-])(\d+)/

Code:
 ...
 ...
 I am not sure how to also use f[13} in this rule.
 ...
 ...

Just use it the way you laid down the rules in your first post.
Here's the relevant excerpt from your first post:


Quote:
Originally Posted by cmccabe
...
...
but the same logic applies, that is if the value is greater than 10 and f[13] is greater than 0.01 f[55] is Likely Benign, if the value is less than 10
and f[13] is less than 0.01 f[55] is VUS in. It is possible for f[13] to be . (dot) but that is the same as zero.
...
...
...
So, since the "value" you talk about is $transcript and f[13] is $PopFreqMax, your logic would be something like:

Code:
 if ($transcript > 10 and $PopFreqMax > 0.01) {
     $classification = "Likely Benign";
 } elsif ($transcript < 10 and $PopFreqMax < 0.01) {
     $classification = "VUS";
 }

Quote:
Originally Posted by cmccabe
...
...
It is possible for f[13] to be . (dot) but that is the same as zero.
...
...
Which you have already taken care of in line # 23 of your Perl code in your post # 1, so no worries.

I don't know why you run "Rule 2" (regarding f[13] or PopFreqMax) on its own in line # 25 through 28 of your Perl code in post # 1.

Since you have to use it in conjunction with $transcript as per your logic above, use it after you have determined the value of $transcript.


Quote:
Originally Posted by cmccabe
...
...
In the cases that have multiple f[8] values, like in line 1, the first can be used.

In line 1 f[8] is

NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G
aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004, f[55] is VUS. ...
And the first one will be used, as the regex reads from the left of the string and tries to match as early as possible.

See below:
Code:
$
$ echo "NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-44A>G;NM_000833:exon4:c.415-45A>G" | perl -lne '/(?:\.\d+[+*-])(\d+)/ and print $1'
43
$
$

However, if your first "value" within $GeneDetailrefGene (where "values" are delimited by ";") does not match the pattern, then that pattern will be attempted in the second "value" within $GeneDetailrefGene.

In the example below, I have replaced the first "-" character by "#", so the pattern will not match anything in the first "value".
Code:
$
$ echo "NM_001134408:exon3:c.415#43A>G;NM_001134407:exon3:c.415-44A>G;NM_000833:exon4:c.415-45A>G" | perl -lne '/(?:\.\d+[+*-])(\d+)/ and print $1'
44
$
$

Perl keeps looking forward and extracts the first substring it encounters that matches the pattern.
This substring happens to be in the second "value" inside $GeneDetailrefGene.
This User Gave Thanks to durden_tyler For This Post:
# 5  
Old 07-26-2017
Thank you very much Smilie.
# 6  
Old 09-11-2017
wrong thread, please ignore. I apologize.

I think the below will capture lines 2-6, but not line 1 (looks like 018328) is being captured by the regex. Is the syntax correct or is there a better way? Thank you Smilie.

perl
Code:
    if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/\D(\d+)/) {   # capture the digits into $1
              if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) {   # capture the digits after any non-digit into $1
                 if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i) {
                 my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])/;   # Get a numeric value if exists using (.) and (+/-) and capture digits into $transcript.
                           $transcript ||= 0;  # Give it a value of zero if no numeric value was found.
                             $classification = 'Likely Benign' if $transcript > 11; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
         }
     }

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Problem with getting awk to multiply a field by a value set based on condition of another field

Hi, So awk is driving me crazy on this one. I have searched everywhere and read man, docs and every related post Google can find and still no luck. The actual files I need to run this on are sensitive in nature, but it is the same thing as if I needed to calculate weighted grades for multiple... (15 Replies)
Discussion started by: cotilloe
15 Replies

2. Shell Programming and Scripting

Update a specific field in file with Variable value based on other Key Word

I have an input file with A=xyz B=pqr I would want the value in Second Field (xyz or pqr) updated with a value present in Shell Variable based on the value passed in the first field. (A or B ) while read line do NEW_VALUE = `some functionality done on $line` If $line=First Field-... (1 Reply)
Discussion started by: infernalhell
1 Replies

3. Shell Programming and Scripting

awk to assign points to variables based on conditions and update specific field

I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited input file to assign point to variables that are used to update a specific field, Rank. I really couldn't find too much in the way of assigning points to variable, but made an attempt at an awk... (4 Replies)
Discussion started by: cmccabe
4 Replies

4. Shell Programming and Scripting

Perl to change value based on set of rules

In the perl there is a default rule that sets f to VUS, and then a seris of rules that will change f based on the result that is obtained from the rule. The code below is a rule that is supposed to be applicable to lines 2-4 because this rule just looks at the digit in f. So in line 2 f is 27... (4 Replies)
Discussion started by: cmccabe
4 Replies

5. Shell Programming and Scripting

Perl to update field in file based of match to another file

In the perl below I am trying to set/update the value of $14 (last field) in file2, using the matching NM_ in $12 or $9 in file2 with the NM_ in $2 of file1. The lengths of $9 and $12 can be variable but what is consistent is the start pattern will always be NM_ and the end pattern is always ;... (4 Replies)
Discussion started by: cmccabe
4 Replies

6. Shell Programming and Scripting

awk to update value in field based on another field

In the tab-delimeted input file below I am trying to use awk to update the value in $2 if TYPE=ins in bold, by adding the value of HRUN= in italics. In the below since in line 1 TYPE=ins the 117282541 value in $2 has 6 added because that is the value of HRUN=. Hopefully the awk is a start but I... (2 Replies)
Discussion started by: cmccabe
2 Replies

7. Shell Programming and Scripting

Add specific string to last field of each line in perl based on value

I am trying to add a condition to the below perl that will capture the GTtag and place a specific string in the last field of each line. The problem is that the GT value used is not right after the tag rather it is a few fields away. The values should always be 0/1 or 1/2 and are in bold in the... (12 Replies)
Discussion started by: cmccabe
12 Replies

8. Shell Programming and Scripting

Update specific field in a line of text file

I have a text file like this: subject1:LecturerA:10 subject2:LecturerA:40 if I was given string in column 1 and 2 (which are subject 1 and LecturerA) , i need to update 3rd field of that line containing that given string , which is, number 10 need to be updated to 100 ,for example. The... (6 Replies)
Discussion started by: bmtoan
6 Replies

9. Shell Programming and Scripting

Help with allocated text content based on specific rules...

Input file format: /tag="ABL" /note="abl homolog 2 /tag="ABLIM1" /note="actin binding LIM 1 /tag="ABP1" /note="amiloride binding protein 1 (amine oxidase (copper- containing)) /tag="ABR" /note="active BCR-related /tag="AC003042.1" /note="SDR family member 11 precursor . . . (4 Replies)
Discussion started by: perl_beginner
4 Replies

10. Shell Programming and Scripting

Update a field in a file based on condition

Hi i am new to scripting. i have a file file.dat with content as : CONTENT_STORAGE PERCENTAGE FLAG: /storage_01 64% 0 /storage_02 17% 1 I need to update the value of FLAG for a particular CONTENT_STORAGE value I have written the following code #!/bin/sh threshold=20... (1 Reply)
Discussion started by: kichu
1 Replies
Login or Register to Ask a Question