Classify lines in file using perl

04-08-2017

The Perl script below runs and classifies each of the 3 lines in file.txt. Lines 2 and 3 are correct, as they fit the criteria for Rule 2.
The problem is that line 1 should be classified VUS: it does not meet the criteria for Rule 1, so Rule 3 should apply.
Currently, however, Rule 2 changes the classification to Likely Benign; if I comment that rule out, I get the expected result. I am not sure why that rule is even executed on that line, since its first criterion is $FuncIDPrefGene !~ "exonic" (the field is not exonic), and in line 1 that field is exonic.
I have included comments in the code; each rule is designed to follow a specific set of criteria. I have tried changing the order of the rules, but the result is the same. Thank you.

perl

Code:

#!/usr/bin/perl
use strict;

while (<>)
{
    $.<2 and print and next;
    my @f=split/\t/;
    #my @f=split/\s+/;
    my ($FuncIDPrefGene,$AAChangeIDPrefGene,$PopFreqMax,$GeneDetailIDPrefGene,$ClinSig,$Score)=@f[6,11,13,8,46,54];

    # Check score for exonic set to 5
    $FuncIDPrefGene eq "exonic" && abs($Score) < 5 and &pj(\@f,"Likely Benign") and next; # Rule 1. Set classification to Likely benign based on score less than 5 for exons

    # Check score for everything else set to 5 with GeneDetail following c. nomenclature
    $FuncIDPrefGene !~ "exonic" and abs($Score) < 5 and $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/; # this will capture the digits after +/- into $1
    $1 < 10 and &pj(\@f,"Likely Benign") and next; # Rule 2. Reclassify intronic variants (with c.) less than 10 based on score

    # PopFreqMax VUS
    &pj(\@f,"VUS"); # Rule 3.  If none of the above tests succeeded, and the PopFreqMax < 0.011 set the Classification field to the string VUS.
}
sub pj
{
    my $fr=shift;
    $fr->[55]=shift;
    print join("\t",@{$fr}); # add separator ,"\n"
}

Desired result in field [55] (Classification):

Code:

VUS
Likely Benign
Likely Benign

file.txt (1.3 KB)

Last edited by cmccabe; 04-08-2017 at 11:29 PM..

cmccabe

04-08-2017

Perhaps this might help you:

Code:

#!/usr/bin/perl
use strict;
use warnings;

my $header = scalar <>;
while (<>)
{
    my @f = split /\t/;
    my ( $FuncIDPrefGene,
         $AAChangeIDPrefGene,
         $PopFreqMax,
         $GeneDetailIDPrefGene,
         $ClinSig,
         $Score ) = @f[6,11,13,8,46,54];

     print "\$FuncIDPrefGene = $FuncIDPrefGene and you're trying to abs($Score)\n";

}

Running it against the example you posted outputs:

Code:

perl showme.pl file.txt

Code:

$FuncIDPrefGene = exonic and you're trying to abs(12)
$FuncIDPrefGene = splicing and you're trying to abs(2)
$FuncIDPrefGene = intronic and you're trying to abs(.)

You also have precedence issues with the _and_. I suggest you use if/else.
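
To make that concrete: besides operator precedence, Rule 2 in the original script is written as two separate statements. When the first statement short-circuits (the field is exonic, or the score is not below 5), the regex never runs, yet the second statement is still executed. An undefined $1 compares as 0, which is less than 10, so the line is labelled Likely Benign. The sketch below reproduces this with made-up field values; the GeneDetail strings and the helper names rule2_split and rule2_joined are hypothetical and not taken from file.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rule 2 as written in the original script: two independent statements.
# Returns 1 when the record would be labelled "Likely Benign".
sub rule2_split {
    my ($func, $score, $detail) = @_;
    no warnings 'uninitialized';    # hide the warning that would expose the bug
    $func !~ "exonic" and abs($score) < 5 and $detail =~ /\.\d+[+-](\d+)/;
    return $1 < 10 ? 1 : 0;         # evaluated even if the regex above never ran
}

# Same tests folded into one condition, so $1 is only read after a match.
sub rule2_joined {
    my ($func, $score, $detail) = @_;
    if (   $func !~ /exonic/
        && abs($score) < 5
        && $detail =~ /\.\d+[+-](\d+)/
        && $1 < 10 )
    {
        return 1;
    }
    return 0;
}

# Made-up records: Func, Score, GeneDetail
my @records = (
    [ 'exonic',   12, '.'       ],
    [ 'intronic', 2,  'c.123+4' ],
);

for my $r (@records) {
    printf "%-9s split=%d joined=%d\n",
        $r->[0], rule2_split(@$r), rule2_joined(@$r);
}
```

With these sample values the exonic record is flagged by the split version but not by the joined one, which matches the symptom described in the first post.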

Aia

04-09-2017

I am not sure I follow completely. Is the logic not right? Thank you.

cmccabe

04-09-2017

Quote:

Originally Posted by cmccabe

I am not sure I follow completely. Is the logic not right? Thank you.

abs() is a function for numeric values, and a dot is not numeric. Turning on the warnings pragma would have shown you that at some point.
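
A minimal sketch of guarding the numeric test before calling abs(). It assumes, purely for illustration, that any non-numeric placeholder such as a dot should count as 0. looks_like_number comes from the core Scalar::Util module; score_value is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return a usable number for a score field.
# Assumption: anything non-numeric (such as the "." placeholder) counts as 0.
sub score_value {
    my ($raw) = @_;
    return 0 unless defined $raw;
    $raw =~ s/^\s+|\s+$//g;    # trim whitespace, e.g. a trailing newline
    return looks_like_number($raw) ? abs($raw) : 0;
}

for my $raw ('12', '2', '.', "7\n") {
    (my $shown = $raw) =~ s/\n/\\n/;    # make the newline visible when printing
    printf "%-4s -> %s\n", $shown, score_value($raw);
}
```

Checking the value first keeps the script quiet under use warnings and makes the treatment of the placeholder explicit instead of relying on Perl's string-to-number conversion.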

If the code runs but does not produce the desired result, then the logic must not be correct.
This appears to be the flow you are following, but it is flawed because you call abs() regardless of whether the field holds a numeric value. I cannot tell what $f[54] means when it does not contain a numeric value.

Code:

#!/usr/bin/perl
use strict;
use warnings;

print scalar <>;
while (<>)
{
    my @f = split /\t/;
    my ( $FuncIDPrefGene,
         $AAChangeIDPrefGene,
         $PopFreqMax,
         $GeneDetailIDPrefGene,
         $ClinSig,
         $Score ) = @f[6,11,13,8,46,54];

    if (abs($Score) < 5) {
        if($FuncIDPrefGene eq 'exonic') {
            pj(\@f,'Likely Benign');
        }
        else {
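            # List context returns the captured digits; scalar context would only give 1 or "".
            my ($scored) = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;
            if (defined $scored && $scored < 10) {
                pj(\@f, 'Likely Benign');
            }
            else {
                pj(\@f, 'VUS');
            }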
        }
    }
    else {
        pj(\@f, 'VUS');
    }
}
sub pj
{
    my $fr = shift;
    $fr->[55] = shift;
    print join "\t", @{$fr};
}

Test:

Code:

perl test.pl file.txt 2>/dev/null | perl -naF'\t' -le 'print $F[55]'

Code:

Classification
VUS
Likely Benign
Likely Benign
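
One detail worth checking when pulling the offset out of a c. notation string: a match assigned to a plain scalar yields only success or failure (1 or an empty string), while a match assigned to a list, as in my ($n) = ..., yields the captured digits. A short sketch, using a made-up GeneDetail value and a hypothetical helper named intron_offset:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: digits following +/- in a c. notation string,
# or undef when the pattern does not match.
sub intron_offset {
    my ($detail) = @_;
    my ($offset) = $detail =~ /\.\d+[+-](\d+)/;    # list context: captured group
    return $offset;
}

my $detail  = 'NM_000000:c.123+45A>G';             # made-up example value
my $matched = $detail =~ /\.\d+[+-](\d+)/;         # scalar context: 1 or ""

print "scalar context: $matched\n";
print "list context:   ", intron_offset($detail), "\n";
print "no match:       ", (defined(intron_offset('.')) ? 'defined' : 'undef'), "\n";
```

Comparing the scalar-context result against 10 is always true, so a test such as "offset less than 10" only means something when the captured value itself is compared.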

Last edited by Aia; 04-09-2017 at 01:02 AM..

Aia

04-09-2017

If f[54] has a . in it, the value associated with it is zero. To prevent column shifting due to null values, I use a . in these fields.
So, I think I follow, but just to make sure: abs($Score) is only used if f[54] is not a ., is that right? Also, could you please comment the code so I can try to learn more from it, if possible? Thank you very much.

Code:

#!/usr/bin/perl    # run the script with perl
use strict;     # require declared variables and disallow unsafe constructs
use warnings;   # display warning messages

print scalar <>;  # read the header line and print it unchanged
while (<>)    # loop over the remaining lines, one per iteration
{
    my @f = split /\t/;      # split the line on tabs
    my ( $FuncIDPrefGene,    # from $f[6]
         $AAChangeIDPrefGene, # from $f[11]
         $PopFreqMax,         # from $f[13]
         $GeneDetailIDPrefGene, # from $f[8]
         $ClinSig,              # from $f[46]
         $Score ) = @f[6,11,13,8,46,54];   # from $f[54]; the slice picks fields by 0-based index

    if (abs($Score) < 5) {      # score is less than 5 (a "." is treated as 0, with a warning)
        if($FuncIDPrefGene eq 'exonic') {   # exonic and the score condition above is met
            pj(\@f,'Likely Benign');    # set field 55 to Likely Benign and print the line
        } # end exonic block
        else {
            my ($scored) = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;  # capture the digits after +/- in the c. notation (list context)
            if (defined $scored && $scored < 10) {    # digits were found and are less than 10
                pj(\@f, 'Likely Benign');   # set field 55 to Likely Benign and print the line
            }
            else {
                pj(\@f, 'VUS');   # otherwise set field 55 to VUS and print the line
            }
        }  # end non-exonic block
    }
    else {
        pj(\@f, 'VUS');  # score is 5 or more: set field 55 to VUS and print the line
    }
}  # end while block
sub pj     # define subroutine: takes an array reference and a classification string
{    # start sub block
    my $fr = shift;  # first argument: reference to the array of fields
    $fr->[55] = shift;  # second argument: store the classification in field 55
    print join "\t", @{$fr};   # print all fields joined by tabs
}   # end sub block

Last edited by cmccabe; 04-09-2017 at 11:18 AM.. Reason: added comment question

cmccabe

04-09-2017

# Rule 1. Set classification to Likely benign based on score less than 5 for exons
What would you like to happen if it is an exon but the score is more than 5?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Would they be disregarded otherwise?

# Rule 2. Reclassify intronic variants (with c.) less than 10 based on score
What would you like to happen if it is intronic but with a score more than 10?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Do you disregard them otherwise?

# Rule 3. If none of the above tests succeeded, and the PopFreqMax < 0.011
What if the PopFreqMax is more than 0.011? Where would those go?

Can $FuncIDPrefGene be anything other than exonic, splicing, or intronic?

Would $Score ever contain a value with a plus (+12) or minus (-12) sign?
Would $Score ever contain a value besides a dot (.) that would not have a numeric interpretation?

Aia

04-09-2017

Quote:

# Rule 1. Set classification to Likely benign based on score less than 5 for exons
What would you like to happen if it is an exon but the score is more than 5?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Would they be disregarded otherwise?

Rule 3 was meant to be a catch-all rule, but maybe it is better not to have that. If the line is exonic and the score is more than 5, then the classification is VUS. So is it better to add an else statement to Rule 1, or just remove the PopFreqMax condition from Rule 3?

Quote:

# Rule 2. Reclassify intronic variants (with c.) less than 10 based on score
What would you like to happen if it is intronic but with a score more than 10?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Do you disregard them otherwise?

I think this follows the same logic as Rule 1, in that I need an else to capture the other condition, or I need to redo Rule 3.

Quote:

# Rule 3. If none of the above tests succeeded, and the PopFreqMax < 0.011
What if the PopFreqMax is more than 0.011? Where would those go?

If PopFreqMax is greater than 0.011, the classification is Likely Benign.

Quote:

Can $FuncIDPrefGene be anything other than exonic, splicing, or intronic?

Yes, these are just three of the more common values, but there are several others. However, even though there are many possible values, they can all be grouped into exonic (for exons) or not exonic (for everything else).

Quote:

Would $Score ever contain a value with a plus (+12) or minus (-12) sign?

The number in $Score should always be something like 1, 2, 15, or 20 (some positive number). I used abs() just in case the format ever changed to include a + or some other symbol.

Quote:

Would $Score ever contain a value besides a dot (.) that would not have a numeric interpretation?

No, a dot is only used for a null value and is always zero.

Thank you very much
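
Taken together, the answers above suggest one possible rule set, sketched below. This is only one reading of the thread: the fall-through order, treating a PopFreqMax of exactly 0.011 as VUS, and the helper names num and classify are assumptions, and the sample records are made up rather than taken from file.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Treat the "." placeholder (or an empty field) as 0, as described above.
sub num {
    my ($v) = @_;
    return 0 if !defined $v || $v eq '' || $v eq '.';
    return $v + 0;
}

# Hypothetical helper implementing one reading of the three rules.
sub classify {
    my ($func, $score, $detail, $pop_freq) = @_;
    my $s = abs(num($score));

    if ($s < 5) {
        # Rule 1: exonic with a score below 5
        return 'Likely Benign' if $func eq 'exonic';

        # Rule 2: not exonic, digits after +/- in the c. notation below 10
        my ($offset) = $detail =~ /\.\d+[+-](\d+)/;
        return 'Likely Benign' if defined $offset && $offset < 10;
    }

    # Rule 3 (assumed reading): frequent variants are Likely Benign, the rest VUS
    return num($pop_freq) > 0.011 ? 'Likely Benign' : 'VUS';
}

# Made-up records: Func, Score, GeneDetail, PopFreqMax
my @records = (
    [ 'exonic',   12,  '.',       0.001 ],
    [ 'splicing', 2,   'c.100+2', '.'   ],
    [ 'intronic', '.', 'c.200-5', 0.002 ],
    [ 'exonic',   12,  '.',       0.2   ],
);

print join("\t", @$_, classify(@$_)), "\n" for @records;
```

Returning the label from a single sub keeps every line on exactly one path, so no record is classified twice or skipped, and the printing code can stay separate from the rules.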

cmccabe
