Perl to parse


 
# 8  
Old 03-25-2015
Code:
$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);
            }
           ' out_position.txt
13      20763145        20763146        TG      -
$
$

# 9  
Old 03-25-2015
The code below returns only 1 line (13 20763145 20763146 TG -), even though there are 3 lines to parse. What did I do wrong? The input file is attached. Thank you.

I am trying to have the input file parsed (for any of those conditions) no matter the input. It was working great, but then I changed something. Thanks.

Code:
 
parse() {
    printf "\n\n"
	cd 'C:\Users\cmccabe\Desktop\annovar'
    perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # conditional parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {                    # deletion
                    ($ch1, $ch2) = ($1, "-")
		} elsif ($common =~ /ins([A-Z])/) {                    # insertion
                    ($ch1, $ch2) = ("-", $1)
                } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
				}
	           ' out_position.txt > out_parse.txt
	             annovar
}


Last edited by cmccabe; 03-25-2015 at 05:16 PM..
# 10  
Old 03-25-2015
Quote:
Originally Posted by cmccabe
...what did I do wrong? ...
Trying to have the input file parsed (for any of those conditions) no matter the input, it was working great but I did something. ...
Code:
 
...
    perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # conditional parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {                    # deletion
                    ($ch1, $ch2) = ($1, "-")
        } elsif ($common =~ /ins([A-Z])/) {                    # insertion
                    ($ch1, $ch2) = ("-", $1)
                } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
                }
               ' out_position.txt > out_parse.txt
...
}

The first step towards fixing something is understanding how that thing works. The deeper your understanding, the easier it is for you to fix it.
And understanding comes with practice - lots of practice.
A few parts of your code are problematic: the regex in the "while" loop and the "printf" statement. But before that, you have to understand what that Perl one-liner does.

It reads each line of your file and runs the code within the single quotes. The same code is run against each line. Note that -n on its own does not strip the EOL (end-of-line) character; that would need the -l switch or an explicit chomp.

The "next if ..." statement skips the first line of your file.

Then there is this loop:
"while (/<blah>/g) { <do_something> }"
It matches the regular expression <blah> against the line and, for each part of the line that matches that regular expression (regex), it runs the part within the braces, i.e. <do_something>.
And it does this repeatedly (due to the "g"/global at the end) as long as there is another match left in the line.

In effect, the "while(/<blah>/g)" tokenizes the line, i.e. it picks the tokens out of the line one by one. We could have used the "split" function as well over there (for example, splitting the line on tabs and then testing each field) and it would've worked.

The regex <blah> is the most important part of the code. It has to be constructed so that it matches every form a token can take in your file.

So if you have the following 5 tokens in your file:

Code:
NC_000013.10:g.20763642C>T
NC_000013.10:g.20763686delC
NC_000013.10:g.20763686insG
NC_000013.10:g.20763145_20763146delTG
NC_000013.10:g.20763145_20763146delAC

and you want one common regex that matches all of them,
then you'd want to break each token into the same five parts:

Code:
NC_  000013  .10:g.  20763642  C>T
NC_  000013  .10:g.  20763686  delC
NC_  000013  .10:g.  20763686  insG
NC_  000013  .10:g.  20763145  _20763146delTG
NC_  000013  .10:g.  20763145  _20763146delAC

The second part (000013) is all numbers, so that's \d+
The fourth part (the position after "g.") is all numbers again, so that's \d+
The fifth part (everything after that position) is some non-whitespace text, so we can use \S+
The first and third parts ("NC_" and ".10:g.") are common to all the tokens

With this knowledge, we could construct the regex as follows:

Code:
NC_(\d+)\.\S+g\.(\d+)(\S+)

Here NC_ and \.\S+g\. match the fixed text, the first (\d+) captures the digits after "NC_", the second (\d+) captures the digits after "g.", and (\S+) captures whatever follows.

A token may be preceded by 0 or more tabs, so we need \t* at the beginning.
Note that the first token at the beginning of the line has 0 tabs before it. Every other token has 1 or more tabs in front of it. So now the regex becomes:

Code:
\t*NC_(\d+)\.\S+g\.(\d+)(\S+)

and that is what we should use in our "while" loop:

Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {

The stuff between the first parentheses goes into $1 and we assign it to $num1.
The stuff between the second parentheses goes into $2 and we assign it to $num2.
The stuff between the third parentheses goes into $3 and we assign it to $common.

Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
    ($num1, $num2, $common) = ($1, $2, $3);
    ...
}

Now have a look at the regex in your code:

Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {

The regex won't match this token:

Code:
NC_000013.10:g.20763642C>T

because the token has neither the "del" text nor two numbers separated by an underscore.

The regex won't match this token either:

Code:
NC_000013.10:g.20763686delC

since it does not have two numbers separated by an underscore.

The regex will match only this token in line 4 of your input file:

Code:
NC_000013.10:g.20763145_20763146delTG

Here is what each part of the regex matches in that token:

Code:
NC_        matches  NC_
(\d+)      matches  000013      ($1)
\.\S+g\.   matches  .10:g.
(\d+)      matches  20763145    ($2)
_          matches  _
(\d+)      matches  20763146    ($3)
del        matches  del
([A-Z]+)   matches  TG          ($4)

So that was the issue.
Because your main regex is too narrow, most of your regexes inside the "while" loop can never match.
For example, the one with the ">" will never be true:

Code:
if ($common =~ /([A-Z])>([A-Z])/) {

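because $common is assigned from $3, which is (\d+) in your regex, so it only ever holds digits (20763146 here). It can never contain the ">" character.
The same goes for the "del" and "ins" tests.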

The second issue was in your "printf" statement.
Since we have assigned variables inside the while loop that contain our information, we should be printing the variables, not $1, $2, $3, etc. The numbered variables are overwritten by every later successful match, including the matches against $common.
That is, we should print $num1, $num2, $ch1, etc.

Back to the correct code.
Once you understand that $common can contain the following different cases (shown after each token below):

Code:
NC_000013.10:g.20763642C>T                 $common is C>T
NC_000013.10:g.20763686delC                $common is delC
NC_000013.10:g.20763686insG                $common is insG
NC_000013.10:g.20763145_20763146delTG      $common is _20763146delTG
NC_000013.10:g.20763145_20763146delAC      $common is _20763146delAC

you can then work with each of them individually to obtain the information you want.

Another point is about $num2. For the first three cases above, you print it twice (as the second and third columns). But in cases 4 and 5, the third column has to be the number after the underscore ("_") and before "del".

What I've done is, I've defined a new variable called $num3.
- By default, $num3 equals $num2. And it is set as soon as we know the value of $num2.
- In the cases 4 and 5, we extract the value of $num3 and overwrite the default value.

We can then print $num1, $num2, $num3, $ch1, $ch2.

All the ideas above are incorporated in the code below:

Code:
$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.79G>A             NC_000013.10:g.20763642C>T      NM_004004.5:c.79G>A     XM_005266354.1:c.79G>A  XM_005266355.1:c.79G>A  XM_005266356.1:c.79G>A
NM_004004.5:c.35delG            NC_000013.10:g.20763686delC     NM_004004.5:c.35delG    XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {                                            # conditional parse
                ($num1, $num2, $common) = ($1, $2, $3);
                $num3 = $num2;
                if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
                elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
                elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
                elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
                elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, "-", $2) }  # multi insertion
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);                 # output
                undef $_ for ($num1, $num2, $num3, $common, $ch1, $ch2);                          # reset for the next token
            }
           ' out_position.txt
13      20763642        20763642        C       T
13      20763686        20763686        C       -
13      20763145        20763146        TG      -
$
$

Make sure you understand it thoroughly. If in doubt, ask.
Cheers.
This User Gave Thanks to durden_tyler For This Post:
# 11  
Old 03-26-2015
Thank you for the explanations and color coding, that helps a lot. It's a lot to take in, but it definitely makes sense. I really appreciate your help and efforts.