Perl to update field based on a specific set of rules
In the perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f{8} is extracted and used to update f[55] accordinly along with the value in f[13].
There can be either - * or + before the number that is extracted but the same logic applies, that is if the value is greater than 10 and f[13] is greater than 0.01 f[55] is Likely Benign, if the value is less than 10
and f[13] is less than 0.01 f[55] is VUS in. It is possible for f[13] to be . (dot) but that is the same as zero. However, currently the value does not seem to be extracted by the perl below and I
get different output then the desired. If I comment this rule out it fixes some of the lines but causes other lines to be incorrect. I am not sure what is causing the issue or how to fix it. Thank you .
I added a description to explain each line of the output and what should be happening, which currently is not .
line 1 is Likely Benign in f[55] because in NM_001134408:exon3:c.415-43A>G in f[8] the 43 after the minus is extracted and since that value is greater than 10 f[55] is Likely Benign
line 2 is VUS in f[55] because in NM_003239:exon6:c.927-1G>C in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS
line 3 is VUS in f[55] because in NM_001879:exon12:c.1442-5C>G in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS
perl
Code:
# disables certain Perl expressions that could behave unexpectedly or are difficult to debug, turning them into errors
use strict;
use warnings;
# Read and display the first line of the file passed at command line.
print scalar <>;
# Read line by line the file given at the command line, could be the stdin if no file is give as argument.
while (<>)
{ # start conditional block
# Make tokens out of the line, using the tab as separator.
my @f = split /\t/;
# Select 4 tokens and define @f.
my ($FuncrefGene,$PopFreqMax,$GeneDetailrefGene,$ClinSig) = @f[6,13,8,46];
# Rule1: Default classification VUS.
my $classification = 'VUS';
# map to 0 if it doesn't have numeric meaning.
$GeneDetailrefGene = 0 if $GeneDetailrefGene eq '.';
$PopFreqMax = 0 if $PopFreqMax eq '.';
# Rule2: Change to Likely Benign if conditions occurs.
if ($PopFreqMax > 0.011) {
$classification = 'Likely Benign';
}
#Rule3: GeneDetail condition
if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/^\D(\d+)$/) { # capture the digits after any non-digit into $1
if ($1 > 10) { #
$1 //= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign'; # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
}
}
else {
if ($FuncrefGene !~ /exonic/i) {
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/; # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
$transcript //= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
}
}
# Rule4: ClinVar condition for ACMG
if ($ClinSig =~ /Pathogenic|Likely pathogenic/i) {
$classification = 'VUS';
}
# token 55 is classification.
$f[55] = $classification;
# display results and update @f.
print join "\t", @f;
} # end conditional block
The problem lies in the regular expression you are using with "$GeneDetailrefGene".
See below:
Quote:
Originally Posted by cmccabe
...
...
Code:
...
...
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/; # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
...
...
...
...
What you are essentially saying is that in the string $GeneDetailrefGene, look for one of the following:
a) a dot character (".") followed by 1 or more digits [0-9] followed by any one of the characters "+", "*" or "-"
Code:
\.\d+[+*-]
== OR ==
b) a non-digit character
Code:
\D
whichever comes first.
If either a) or b) is followed by 1 or more digits [0-9], then capture those digits and set them to $transcript.
The "whichever comes first" part is crucial here.
If you have a regex like this: "A|B", Perl will match either A or B, whichever comes first - reading the string from left to right.
So, when your program sees this value for the $GeneDetailrefGene:
it matches the "_001134408" to "\D(\d+)" and therefore sets $transcript to "001134408".
For a clearer explanation, the red part of the regex below matches the red part of the string $GeneDetailrefGene. Ditto with the blue part, which is eventually assigned to $transcript.
if ($FuncrefGene !~ /exonic/i) {
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/; # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
$transcript //= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
changed to:
Code:
if ($FuncrefGene !~ /exonic/i) {
my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/; # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
$transcript //= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript? I am not sure how to also use f[13} in this rule. In the cases that have multiple f[8] values, like in line 1, the first can be used.
In line 1 f[8] is
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004, f[55] is VUS. Thank you .
if ($FuncrefGene !~ /exonic/i) {
my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/; # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
$transcript //= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript?
...
...
No, it will not match because the pattern "[+*-]d=" is not present in $GeneDetailrefGene. The characters "d" and "=" do not follow any one of ("+", "*", "-"). Note that in a regex, "d" matches the character "d", but "\d" matches a single digit in the range [0-9].
Also, the stream of 1 or more digits is to be matched before [+*-].
So, you may want to use this regex:
Code:
/(?:\.\d+[+*-])(\d+)/
Code:
...
...
I am not sure how to also use f[13} in this rule.
...
...
Just use it the way you laid down the rules in your first post.
Here's the relevant excerpt from your first post:
Quote:
Originally Posted by cmccabe
...
...
but the same logic applies, that is if the value is greater than 10 and f[13] is greater than 0.01 f[55] is Likely Benign, if the value is less than 10
and f[13] is less than 0.01 f[55] is VUS in. It is possible for f[13] to be . (dot) but that is the same as zero.
...
...
...
So, since the "value" you talk about is $transcript and f[13] is $PopFreqMax, your logic would be something like:
Code:
if ($transcript > 10 and $PopFreqMax > 0.01) {
$classification = "Likely Benign";
} elsif ($transcript < 10 and $PopFreqMax < 0.01) {
$classification = "VUS";
}
Quote:
Originally Posted by cmccabe
...
...
It is possible for f[13] to be . (dot) but that is the same as zero.
...
...
Which you have already taken care of in line # 23 of your Perl code in your post # 1, so no worries.
I don't know why you run "Rule 2" (regarding f[13] or PopFreqMax) on its own in line # 25 through 28 of your Perl code in post # 1.
Since you have to use it in conjunction with $transcript as per your logic above, use it after you have determined the value of $transcript.
Quote:
Originally Posted by cmccabe
...
...
In the cases that have multiple f[8] values, like in line 1, the first can be used.
In line 1 f[8] is
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004, f[55] is VUS. ...
And the first one will be used, as the regex reads from the left of the string and tries to match as early as possible.
However, if your first "value" within $GeneDetailrefGene (where "values" are delimited by ";") does not match the pattern, then that pattern will be attempted in the second "value" within $GeneDetailrefGene.
In the example below, I have replaced the first "-" character by "#", so the pattern will not match anything in the first "value".
Perl keeps looking forward and extracts the first substring it encounters that matches the pattern.
This substring happens to be in the second "value" inside $GeneDetailrefGene.
This User Gave Thanks to durden_tyler For This Post:
I think the below will capture lines 2-6, but not line 1 (looks like 018328) is being captured by the regex. Is the syntax correct or is there a better way? Thank you .
perl
Code:
if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/\D(\d+)/) { # capture the digits into $1
if ($1 > 11) { #
$1 ||= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign'; # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
}
}
else {
if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) { # capture the digits after any non-digit into $1
if ($1 > 11) { #
$1 ||= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign'; # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
}
}
else {
if ($FuncrefGene !~ /exonic/i) {
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])/; # Get a numeric value if exists using (.) and (+/-) and capture digits into $transcript.
$transcript ||= 0; # Give it a value of zero if no numeric value was found.
$classification = 'Likely Benign' if $transcript > 11; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
}
}
Hi,
So awk is driving me crazy on this one. I have searched everywhere and read man, docs and every related post Google can find and still no luck. The actual files I need to run this on are sensitive in nature, but it is the same thing as if I needed to calculate weighted grades for multiple... (15 Replies)
I have an input file with
A=xyz
B=pqr
I would want the value in Second Field (xyz or pqr) updated with a value present in Shell Variable based on the value passed in the first field. (A or B )
while read line
do
NEW_VALUE = `some functionality done on $line`
If $line=First Field-... (1 Reply)
I have been reading old posts and trying to come up with a solution for the below: Use a tab-delimited input file to assign
point to variables that are used to update a specific field, Rank. I really couldn't find too much in the way of assigning points
to variable, but made an attempt at an awk... (4 Replies)
In the perl there is a default rule that sets f to VUS, and then a seris of rules that will change f based on the result that is
obtained from the rule. The code below is a rule that is supposed to be applicable to lines 2-4 because this rule just looks at the digit in f. So in line 2 f is 27... (4 Replies)
In the perl below I am trying to set/update the value of $14 (last field) in file2, using the matching NM_ in $12
or $9 in file2 with the NM_ in $2 of file1.
The lengths of $9 and $12 can be variable but what is consistent is the start pattern will always be NM_ and the end pattern is always
;... (4 Replies)
In the tab-delimeted input file below I am trying to use awk to update the value in $2 if TYPE=ins in bold, by adding the value of
HRUN= in italics. In the below since in line 1 TYPE=ins the 117282541 value in $2 has 6 added because that is the value of HRUN=.
Hopefully the awk is a start but I... (2 Replies)
I am trying to add a condition to the below perl that will capture the GTtag and place a specific string in the last field of each line. The problem is that the GT value used is not right after the tag rather it is a few fields away. The values should always be 0/1 or 1/2 and are in bold in the... (12 Replies)
I have a text file like this:
subject1:LecturerA:10
subject2:LecturerA:40
if I was given string in column 1 and 2 (which are subject 1 and LecturerA) , i need to update 3rd field of that line containing that given string , which is, number 10 need to be updated to 100 ,for example.
The... (6 Replies)
Hi
i am new to scripting. i have a file file.dat with content as :
CONTENT_STORAGE PERCENTAGE FLAG:
/storage_01 64% 0
/storage_02 17% 1
I need to update the value of FLAG for a particular CONTENT_STORAGE value
I have written the following code
#!/bin/sh
threshold=20... (1 Reply)