Perl to parse

03-25-2015

Registered User

2,100, 402

Join Date: Apr 2009

Last Activity: 11 February 2020, 10:24 AM EST

Posts: 2,100

Thanks Given: 26

Thanked 402 Times in 360 Posts

Code:

$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);
            }
           ' out_position.txt
13      20763145        20763146        TG      -
$
$

durden_tyler


03-25-2015

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

The code below returns only 1 line (13 20763145 20763146 TG -), even though there are 3 lines to parse. What did I do wrong? The input file is attached. Thank you.

I'm trying to have the input file parsed (for any of those conditions) no matter the input. It was working great, but then I changed something. Thanks.

Code:

 
parse() {
    printf "\n\n"
	cd 'C:\Users\cmccabe\Desktop\annovar'
    perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # condtional parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {                    # deletion
                    ($ch1, $ch2) = ($1, "-")
		} elsif ($common =~ /ins([A-Z])/) {                    # insertion
                    ($ch1, $ch2) = ("-", $1)
                } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
				}
	           ' out_position.txt > out_parse.txt
	             annovar
}

out_position.txt (526 Bytes)

Last edited by cmccabe; 03-25-2015 at 05:16 PM..

cmccabe


03-25-2015


Quote:

Originally Posted by cmccabe

...what did I do wrong? ...
Trying to have the input file parsed (for any of those conditions) no matter the input, it was working great but I did something. ...

Code:

 
...
    perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # condtional parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {                    # deletion
                    ($ch1, $ch2) = ($1, "-")
        } elsif ($common =~ /ins([A-Z])/) {                    # insertion
                    ($ch1, $ch2) = ("-", $1)
                } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
                }
               ' out_position.txt > out_parse.txt
...
}

The first step towards fixing something is understanding how that thing works. The deeper your understanding, the easier it is for you to fix it.
And understanding comes with practice - lots of practice.
I'll point out the problematic parts of your code further down. But before that, you have to understand what that Perl one-liner does.

It reads each line of your file and runs the code within the single quotes against it. The EOL (end-of-line) character is not stripped, because the -l switch is not used. The same code is run against each line.

The "next if ..." statement skips the first line of your file.

Then there is this loop:
"while (/<blah>/g) { <do_something> }"
It matches the regular expression <blah> against the line and, for each part of the line that matches that regular expression (regex), it runs the part within the braces i.e. <do_something>.
And it does this repeatedly (due to the "g"/global at the end) as long as there is another match in the line.

In effect, the "while(/<blah>/g)" tokenizes the line i.e. it splits the line into tokens. We could have used "split" on the tab delimiter as well (e.g. "split(/\t+/)") and then matched each field; that would've worked too.
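As a rough illustration of the two approaches just described, here is a minimal sketch. The sample line is made up for the example (it is not taken from the attached file), and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample line: three tab-separated tokens, two of them NC_ tokens.
my $line = "NC_000013.10:g.20763642C>T\tNM_004004.5:c.79G>A\tNC_000013.10:g.20763686delC\n";

# With /g in a while condition, each pass resumes scanning where the
# previous match ended, so the body runs once per matching token.
my @found;
while ($line =~ /NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
    push @found, "$1 $2 $3";
}
print "$_\n" for @found;

# Alternative: split on tabs first, then test each field separately.
chomp(my $copy = $line);
my @fields = split /\t/, $copy;
my @nc = grep { /^NC_/ } @fields;
print scalar(@nc), " NC_ fields\n";
```

Both approaches find the same two NC_ tokens here; the global match simply skips the NM_ token because the regex never matches it.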

The regex <blah> is the most important part of the code. It has to be constructed in such a way so that you're able to pick up the most generic token in each line.

So if you have the following 5 tokens in your file:

Code:

NC_000013.10:g.20763642C>T
NC_000013.10:g.20763686delC
NC_000013.10:g.20763686insG
NC_000013.10:g.20763145_20763146delTG
NC_000013.10:g.20763145_20763146delAC

and you want a common regex to match as many common parts in each token,
then you'd want to match them according to the color code below:

Code:

NC_000013.10:g.20763642C>T
NC_000013.10:g.20763686delC
NC_000013.10:g.20763686insG
NC_000013.10:g.20763145_20763146delTG
NC_000013.10:g.20763145_20763146delAC

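The part in red (the digits right after "NC_", e.g. 000013) is all numbers, so that's \d+
The part in orange (the digits right after "g.", e.g. 20763642) is all numbers again, so that's \d+
The parts in blue (the text between the first "." and "g.", e.g. "10:", and everything after the second number, e.g. "C>T") are some non-whitespace text, so we can use \S+
The part in black ("NC_", "." and "g.") is common in all the tokens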

With this knowledge, we could construct the regex as follows:

Code:

NC_(\d+)\.\S+g\.(\d+)(\S+)

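Each part of the regex corresponds to one part of the token: "NC_", "\." and "g\." are the literal text, the first (\d+) is the number after "NC_", \S+ is the text before "g.", the second (\d+) is the position after "g.", and (\S+) is everything that follows.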

A token may be preceded by 0 or more tabs, so we need \t* at the beginning.
Note that the first token at the beginning of the line has 0 tabs before it. Every other token has 1 or more tabs in front of it. So now the regex becomes:

Code:

\t*NC_(\d+)\.\S+g\.(\d+)(\S+)

and that is what we should use in our "while" loop:

Code:

while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {

The stuff between the first parentheses goes into $1 and we assign it to $num1.
The stuff between the second parentheses goes into $2 and we assign it to $num2.
The stuff between the third parentheses goes into $3 and we assign it to $common.

Code:

while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
    ($num1, $num2, $common) = ($1, $2, $3);
    ...
}
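To see what ends up in $num1, $num2 and $common for each kind of token, here is a small sketch that runs the generic regex against the five sample tokens listed earlier. The @rows array is only there for illustration.

```perl
use strict;
use warnings;

# The five sample tokens from the discussion above.
my @tokens = qw(
    NC_000013.10:g.20763642C>T
    NC_000013.10:g.20763686delC
    NC_000013.10:g.20763686insG
    NC_000013.10:g.20763145_20763146delTG
    NC_000013.10:g.20763145_20763146delAC
);

my @rows;
for my $token (@tokens) {
    # Same generic regex as in the while loop, applied to one token at a time.
    if ($token =~ /NC_(\d+)\.\S+g\.(\d+)(\S+)/) {
        my ($num1, $num2, $common) = ($1, $2, $3);
        push @rows, [$num1, $num2, $common];
        printf "%-40s num1=%s num2=%s common=%s\n", $token, $num1, $num2, $common;
    }
}
```

All five tokens match. For the last two, $common starts with the underscore and the second position (for example "_20763146delTG"), which is why that case needs its own handling later on.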

Now have a look at your code and especially the tail of the regex, "_(\d+)del([A-Z]+)":

Code:

while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {

The regex won't match this token:

Code:

NC_000013.10:g.20763642C>T

because the token does not have the "del" text in it. It does not have two numbers separated by underscore.

The regex won't match this token either:

Code:

NC_000013.10:g.20763686delC

since there are no two numbers separated by underscore.

The regex will match only this token in line 4 of your input file:

Code:

NC_000013.10:g.20763145_20763146delTG

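Here is how the parts of the regex line up with the parts of the token: the first (\d+) matches 000013, \S+ matches "10:", the second (\d+) matches 20763145, _(\d+) matches _20763146, and del([A-Z]+) matches delTG.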

Code:

while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {

Code:

NC_000013.10:g.20763145_20763146delTG

So that was the issue.
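A quick way to check this kind of problem yourself is to run both regexes over a list of sample tokens and count the matches. The sketch below uses the five sample tokens from earlier; note that this list includes a second underscore-style deletion (delAC) that is not in your input file, so the strict regex matches two tokens here rather than one.

```perl
use strict;
use warnings;

my @tokens = qw(
    NC_000013.10:g.20763642C>T
    NC_000013.10:g.20763686delC
    NC_000013.10:g.20763686insG
    NC_000013.10:g.20763145_20763146delTG
    NC_000013.10:g.20763145_20763146delAC
);

# The regex from the question and the generic regex, precompiled with qr//.
my $strict  = qr/NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/;
my $generic = qr/NC_(\d+)\.\S+g\.(\d+)(\S+)/;

my @strict_hits  = grep { $_ =~ $strict }  @tokens;
my @generic_hits = grep { $_ =~ $generic } @tokens;

printf "strict regex matches %d of %d tokens\n",  scalar @strict_hits,  scalar @tokens;
printf "generic regex matches %d of %d tokens\n", scalar @generic_hits, scalar @tokens;
print "  strict hit: $_\n" for @strict_hits;

# With the strict regex, the third capture group is the second position.
my $third;
if ($tokens[3] =~ $strict) {
    $third = $3;
    print "\$3 is $third\n";
}
```

The last few lines also show what the third capture group holds with the strict regex: it is the number after the underscore, not the variant text.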
Once your main regex is wrong, most of your regexes inside the "while" loop become redundant.
For example, the one with the ">" will never be true:

Code:

if ($common =~ /([A-Z])>([A-Z])/) {

because $common will never have the ">" character. In your code $common is assigned $3, which in your regex is the second number (e.g. 20763146), so it holds only digits.
And so on....

The second issue was in your "printf" statement.
Since we have assigned variables inside the while loop that contain our information, we should be printing the variables. Not $1, $2, $3, ... etc.
That is, we should print $num1, $num2, $ch1, ... etc.

Back to the correct code.
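Once you understand that $common can contain the following different cases (the part of each token after the first position number, i.e. "C>T", "delC", "insG", "_20763146delTG" and "_20763146delAC" below):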

Code:

NC_000013.10:g.20763642C>T
NC_000013.10:g.20763686delC
NC_000013.10:g.20763686insG
NC_000013.10:g.20763145_20763146delTG
NC_000013.10:g.20763145_20763146delAC

you can then work with each of them individually to obtain the information you want.

Another point is about the $num2. You print it twice for the first three cases above. But in cases 4 and 5 above, you need the number after the underscore ("_") and before "del".

What I've done is, I've defined a new variable called $num3.
- By default, $num3 equals $num2. And it is set as soon as we know the value of $num2.
- In the cases 4 and 5, we extract the value of $num3 and overwrite the default value.

We can then print $num1, $num2, $num3, $ch1, $ch2.

All the ideas above are incorporated in the code below:

Code:

$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.79G>A             NC_000013.10:g.20763642C>T      NM_004004.5:c.79G>A     XM_005266354.1:c.79G>A  XM_005266355.1:c.79G>A  XM_005266356.1:c.79G>A
NM_004004.5:c.35delG            NC_000013.10:g.20763686delC     NM_004004.5:c.35delG    XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {                                            # conditional parse
                ($num1, $num2, $common) = ($1, $2, $3);
                $num3 = $num2;
                if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
                elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
                elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
                elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
                elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, "-", $2) }  # multi insertion
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);                 # output
                undef $_ for ($num1, $num2, $num3, $common, $ch1, $ch2);                          # reset for the next token
            }
           ' out_position.txt
13      20763642        20763642        C       T
13      20763686        20763686        C       -
13      20763145        20763146        TG      -
$
$
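If the one-liner grows any further, it may be easier to maintain as a small script. The following is only a sketch of the same logic under "use strict" and "use warnings"; the helper name parse_token is made up for this example, and it returns an empty list for anything it does not recognize instead of printing stale values.

```perl
use strict;
use warnings;

# parse_token is a hypothetical helper name. It returns the five output
# fields for one NC_ token, or an empty list if the token is not recognized.
sub parse_token {
    my ($token) = @_;
    return unless $token =~ /NC_(\d+)\.\S+g\.(\d+)(\S+)/;
    my ($num1, $num2, $common) = ($1, $2, $3);
    my $num3 = $num2;                        # default: end position equals start
    my ($ch1, $ch2);
    if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
    elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
    elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
    elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
    elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, "-", $2) }  # multi insertion
    else                                     { return }                               # unknown form
    return ($num1 + 0, $num2, $num3, $ch1, $ch2);   # "+ 0" turns 000013 into 13
}

for my $token (qw(
    NC_000013.10:g.20763642C>T
    NC_000013.10:g.20763686delC
    NC_000013.10:g.20763145_20763146delTG
)) {
    my @fields = parse_token($token) or next;   # skip unrecognized tokens
    print join("\t", @fields), "\n";
}
```

Because every variable is declared with "my" inside the function, nothing carries over from one token to the next, so no explicit reset is needed.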

Make sure you understand it thoroughly. If in doubt, ask.
Cheers.

durden_tyler


03-26-2015


Thank you for the explanations and color coding, that helps a lot. It's a lot to take in, but it definitely makes sense. I really appreciate your help and efforts.

cmccabe

