Perl to update field based on a specific set of rules
In the Perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f[8] is extracted and used, along with the value in f[13], to update f[55] accordingly.
There can be a -, * or + before the number that is extracted, but the same logic applies: if the value is greater than 10 and f[13] is greater than 0.01, f[55] is Likely Benign; if the value is less than 10
and f[13] is less than 0.01, f[55] is VUS. It is possible for f[13] to be . (dot), but that is treated the same as zero. However, the value does not currently seem to be extracted by the Perl below, and I
get output that differs from the desired output. If I comment this rule out, it fixes some of the lines but causes other lines to be incorrect. I am not sure what is causing the issue or how to fix it. Thank you.
I added a description to explain each line of the output and what should be happening but currently is not.
file
desired output (tab-delimited)
description
perl
current output
Last edited by cmccabe; 07-21-2017 at 09:50 AM..
Reason: fixed format, added current output
The problem lies in the regular expression you are using with "$GeneDetailrefGene".
See below:
Quote:
Originally Posted by cmccabe
...
...
...
...
What you are essentially saying is that in the string $GeneDetailrefGene, look for one of the following:
a) a dot character (".") followed by 1 or more digits [0-9] followed by any one of the characters "+", "*" or "-"
== OR ==
b) a non-digit character
whichever comes first.
If either a) or b) is followed by 1 or more digits [0-9], then capture those digits and set them to $transcript.
The "whichever comes first" part is crucial here.
If you have a regex like this: "A|B", Perl will match either A or B, whichever comes first - reading the string from left to right.
So, when your program sees this value for the $GeneDetailrefGene:
it matches the "_001134408" to "\D(\d+)" and therefore sets $transcript to "001134408".
For a clearer explanation, the red part of the regex below matches the red part of the string $GeneDetailrefGene. Ditto with the blue part, which is eventually assigned to $transcript.
Line 1:
Line 2:
Line 3:
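The leftmost-match behaviour described above can be reproduced with a small sketch. The pattern here is only a reconstruction from the a)/b) description (the regex from the original post is not shown), and `extract_digits` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Reconstructed from the description above; the exact original regex is an assumption.
#   alternative a): a dot, one or more digits, then one of + * -
#   alternative b): any single non-digit character
# Whichever alternative allows a match at the leftmost position wins.
my $pattern = qr/(?:\.\d+[+*-]|\D)(\d+)/;

# Hypothetical helper: return the captured digits, or undef if nothing matches.
sub extract_digits {
    my ($text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

my $digits = extract_digits('NM_001134408:exon3:c.415-43A>G');
print "$digits\n";    # prints 001134408, not 43
```

The "_" after "NM" is a non-digit followed by digits, so alternative b) succeeds there long before the engine ever reaches ".415-43".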
changed to:
should capture the 43 in NM_001134408:exon3:c.415-43A>G, and that will be the value of $transcript? I am not sure how to also use f[13] in this rule. In the cases that have multiple f[8] values, like in line 1, the first can be used.
In line 1 f[8] is
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G and the ; (semicolon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and, since f[13] is 0.0004, f[55] is VUS. Thank you.
...
...
should capture the 43 in NM_001134408:exon3:c.415-43A>G, and that will be the value of $transcript?
...
...
No, it will not match because the pattern "[+*-]d=" is not present in $GeneDetailrefGene. The characters "d" and "=" do not follow any one of ("+", "*", "-"). Note that in a regex, "d" matches the character "d", but "\d" matches a single digit in the range [0-9].
Also, the stream of 1 or more digits is to be matched before [+*-].
So, you may want to use this regex:
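As a rough sketch of that idea (an assumed pattern, not necessarily the one from the original post), the digits before the sign are matched but not captured, and only the digits after the sign are captured. The helper name `extract_offset` and the second input string are made up for illustration.

```perl
use strict;
use warnings;

# Assumed pattern: one or more digits, then one of + * -, then the digits
# to capture. Note the backslash: \d is a digit, a bare d is the letter "d".
my $offset_re = qr/\d+[+*-](\d+)/;

# Hypothetical helper: return the digits after the sign, or undef if no match.
sub extract_offset {
    my ($detail) = @_;
    return $detail =~ $offset_re ? $1 : undef;
}

# The second string is a made-up example showing the "+" case.
for my $detail ('NM_001134408:exon3:c.415-43A>G', 'NM_000833:exon4:c.100+7C>T') {
    my $offset = extract_offset($detail);
    my $shown  = defined $offset ? $offset : 'no match';
    print "$detail => $shown\n";
}
```

Because "001134408" is followed by ":" rather than a sign, the accession number is skipped and the match starts at "415-43".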
Just use it the way you laid down the rules in your first post.
Here's the relevant excerpt from your first post:
Quote:
Originally Posted by cmccabe
...
...
but the same logic applies: if the value is greater than 10 and f[13] is greater than 0.01, f[55] is Likely Benign; if the value is less than 10
and f[13] is less than 0.01, f[55] is VUS. It is possible for f[13] to be . (dot), but that is treated the same as zero.
...
...
...
So, since the "value" you talk about is $transcript and f[13] is $PopFreqMax, your logic would be something like:
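A minimal sketch of that logic, assuming the variable names used in this thread ($transcript, $PopFreqMax) and a hypothetical helper `classify`. The rules as stated do not cover mixed cases, so this sketch assumes anything that is not Likely Benign stays VUS, which agrees with the example of 43 and 0.0004 giving VUS.

```perl
use strict;
use warnings;

# Hypothetical helper; the argument names mirror the variables in the thread.
sub classify {
    my ($transcript, $PopFreqMax) = @_;

    # A "." (or a missing value) in f[13] is treated the same as zero.
    $PopFreqMax = 0 if !defined $PopFreqMax || $PopFreqMax eq '.';

    # Both conditions must hold for Likely Benign.
    return 'Likely Benign' if $transcript > 10 && $PopFreqMax > 0.01;

    # Assumption: every other combination falls back to VUS.
    return 'VUS';
}

my $result = classify(43, 0.0004);
print "$result\n";    # prints VUS
```

Calling this only after $transcript has been extracted keeps the frequency check and the digit check in one place.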
Quote:
Originally Posted by cmccabe
...
...
It is possible for f[13] to be . (dot) but that is the same as zero.
...
...
Which you have already taken care of in line # 23 of your Perl code in your post # 1, so no worries.
I don't know why you run "Rule 2" (regarding f[13], or PopFreqMax) on its own in lines 25 through 28 of your Perl code in post # 1.
Since you have to use it in conjunction with $transcript as per your logic above, use it after you have determined the value of $transcript.
Quote:
Originally Posted by cmccabe
...
...
In the cases that have multiple f[8] values, like in line 1, the first can be used.
In line 1 f[8] is
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G and the ; (semicolon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and, since f[13] is 0.0004, f[55] is VUS. ...
And the first one will be used, as the regex reads from the left of the string and tries to match as early as possible.
See below:
However, if your first "value" within $GeneDetailrefGene (where "values" are delimited by ";") does not match the pattern, then that pattern will be attempted in the second "value" within $GeneDetailrefGene.
In the example below, I have replaced the first "-" character by "#", so the pattern will not match anything in the first "value".
Perl keeps looking forward and extracts the first substring it encounters that matches the pattern.
This substring happens to be in the second "value" inside $GeneDetailrefGene.
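Both cases can be checked with a short sketch. It uses an assumed pattern (digits, a sign, then the captured digits) and a hypothetical helper `first_offset` that also reports which ";"-delimited value the match fell in, counted from zero.

```perl
use strict;
use warnings;

my $offset_re = qr/\d+[+*-](\d+)/;    # assumed pattern, see lead-in

# Hypothetical helper: returns (captured digits, zero-based index of the
# ";"-delimited value containing the match), or an empty list if no match.
sub first_offset {
    my ($detail) = @_;
    return unless $detail =~ $offset_re;
    my $digits = $1;
    # $-[0] is the offset where the whole match starts; count the
    # semicolons before it to find which value the match belongs to.
    my $before = substr($detail, 0, $-[0]);
    my $index  = ($before =~ tr/;//);
    return ($digits, $index);
}

my $line1 = 'NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G';
(my $altered = $line1) =~ s/-/#/;     # replace only the first "-" with "#"

for my $detail ($line1, $altered) {
    my ($digits, $index) = first_offset($detail);
    print "captured $digits from value $index\n";
}
# prints "captured 43 from value 0", then "captured 43 from value 1"
```

If only the first value should ever be considered, one option is to split on ";" first and apply the pattern to the first element alone.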
I think the below will capture lines 2-6, but not line 1 (it looks like 018328 is being captured by the regex). Is the syntax correct or is there a better way? Thank you.