The code below returns only 1 line, even though there are 3 lines to parse:
13 20763145 20763146 TG -
What did I do wrong? The input file is attached. Thank you.
I'm trying to have the input file parsed (for any of those conditions) no matter the input. It was working great, but then I did something. Thanks.
The first step towards fixing something is understanding how that thing works. The deeper your understanding, the easier it is for you to fix it.
And understanding comes with practice - lots of practice.
I've highlighted a few problematic parts of your code in red. But before that, you have to understand what that Perl one-liner does.
It reads each line of your file, strips off the EOL (end-of-line) character and runs the code within the single quotes. The same code is run against each line.
The "next if ..." statement skips the first line of your file.
Then there is this loop:
"while (/<blah>/g) { <do_something> }"
It matches the regular expression <blah> against the line and, for each part of the line that matches that regular expression (regex), it runs the part within the braces i.e. <do_something>.
And it does this repeatedly (due to the "g"/global at the end): each pass resumes where the previous match ended, and the loop stops when no further match is found in the line.
In effect, the "while(/<blah>/g)" tokenizes the line i.e. it splits the line into tokens. We could have used the "split(/<blah>/)" function there as well, with a regex that matches the separators rather than the tokens, and it would've worked.
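To make the comparison concrete, here is a minimal sketch of both approaches. The sample line and the variable names are made up for illustration; they are not from your input file.

```perl
use strict;
use warnings;

# Hypothetical sample line: three tab-separated words.
my $line = "alpha\tbeta\tgamma";

# Global match in a while loop: one iteration per match,
# each one resuming where the previous match ended.
my @from_match;
while ($line =~ /(\S+)/g) {
    push @from_match, $1;
}

# split takes the separator pattern instead of the token pattern.
my @from_split = split /\t+/, $line;

print join(",", @from_match), "\n";
print join(",", @from_split), "\n";
```

Both lists hold the same three tokens for this input. The while loop is more convenient when you also want to capture pieces of each token, which is what we do next.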
The regex <blah> is the most important part of the code. It has to be constructed in such a way so that you're able to pick up the most generic token in each line.
So if you have the following 4 tokens in your file:
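The part in red (the digits right after "NC_") is all numbers, so that's \d+
The part in orange (the digits right after "g.") is all numbers again, so that's \d+
The part in blue (everything after those digits) is some non-whitespace text, so we can use \S+
The part in black (the literal "NC_", "." and "g." pieces) is common to all the tokens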
With this knowledge, we could construct the regex as follows:
Code:
NC_(\d+)\.\S+g\.(\d+)(\S+)
I've added the color codes so you can understand what part of the regex matches what part of the token.
A token may be preceded by 0 or more tabs, so we need \t* at the beginning.
Note that the first token at the beginning of the line has 0 tabs before it. Every other token has 1 or more tabs in front of it. So now the regex becomes:
Code:
\t*NC_(\d+)\.\S+g\.(\d+)(\S+)
and that is what we should use in our "while" loop:
Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
The stuff between the first parentheses goes into $1 and we assign it to $num1.
The stuff between the second parentheses goes into $2 and we assign it to $num2.
The stuff between the third parentheses goes into $3 and we assign it to $common.
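As a quick check, here is a small sketch that runs this loop over a single line built from the three tokens discussed in this thread. Joining them with tabs into one line is an assumption about the file layout, and @rows is just a helper for inspecting the result.

```perl
use strict;
use warnings;

# Assumed layout: the three tokens from this thread, tab-separated on one line.
my $line = join "\t",
    'NC_000013.10:g.20763642C>T',
    'NC_000013.10:g.20763686delC',
    'NC_000013.10:g.20763145_20763146delTG';

my @rows;
while ($line =~ /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
    # Copy the captures into named variables right away.
    my ($num1, $num2, $common) = ($1, $2, $3);
    push @rows, [$num1, $num2, $common];
    print "$num1 | $num2 | $common\n";
}
```

All three tokens are picked up, and $common holds "C>T", "delC" and "_20763146delTG" in turn. Those are the pieces the code inside the loop has to take apart.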
Now have a look at your code and especially the part in red:
Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {
The regex won't match this token:
Code:
NC_000013.10:g.20763642C>T
because the token does not have the "del" text in it, nor does it have two numbers separated by an underscore.
The regex won't match this token either:
Code:
NC_000013.10:g.20763686delC
since it does not have two numbers separated by an underscore.
The regex will match only this token in line 4 of your input file:
Code:
NC_000013.10:g.20763145_20763146delTG
I'll color code the parts of the regex and the parts they match in the token so it's clear:
Code:
while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {
Code:
NC_000013.10:g.20763145_20763146delTG
So that was the issue.
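You can verify this yourself with a short test. This is only a sketch: it puts the three tokens from above into an array (the array and variable names are mine) and checks each one against your regex.

```perl
use strict;
use warnings;

my @tokens = (
    'NC_000013.10:g.20763642C>T',
    'NC_000013.10:g.20763686delC',
    'NC_000013.10:g.20763145_20763146delTG',
);

# The regex from your while loop, precompiled with qr// for reuse.
my $strict = qr/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/;

my @matched = grep { $_ =~ $strict } @tokens;

print "matched: $_\n" for @matched;
```

Only the token with both the underscore and "del" gets through, which is why you saw just one line of output.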
Once your main regex is too strict, most of your regexes inside the "while" loop can never match.
For example, the one with the ">" will never be true:
Code:
if ($common =~ /([A-Z])>([A-Z])/) {
because $common will never have the ">" character. It has "del" instead.
And so on....
The second issue was in your "printf" statement.
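Since we have assigned variables inside the while loop that contain our information, we should be printing those variables, not $1, $2, $3, ... etc. Every successful regex match inside the loop overwrites $1, $2, ... so by the time you reach the "printf" they may no longer hold the values from the main regex.
That is, we should print $num1, $num2, $ch1, ... etc.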
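Here is a small sketch of why that matters, using one token from above. $dollar1_now is a helper variable I added only to record what $1 holds after the inner match.

```perl
use strict;
use warnings;

my $token = 'NC_000013.10:g.20763642C>T';
my ($num1, $num2, $common, $ch1, $ch2, $dollar1_now);

if ($token =~ /NC_(\d+)\.\S+g\.(\d+)(\S+)/) {
    # Save the captures of the main regex immediately.
    ($num1, $num2, $common) = ($1, $2, $3);

    if ($common =~ /([A-Z])>([A-Z])/) {
        # This successful match has replaced $1 and $2.
        ($ch1, $ch2) = ($1, $2);
        $dollar1_now = $1;
    }
}

print "\$1 is now '$dollar1_now', but \$num1 is still '$num1'\n";
printf "%s\t%s\t%s\t%s\n", $num1, $num2, $ch1, $ch2;
```

After the inner match, $1 holds the letter before the ">" and no longer the digits after "NC_". The named variables are not affected.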
Back to the correct code.
Once you understand that $common can contain the following different cases in blue color below:
you can then work with each of them individually to obtain the information you want.
Another point is about the $num2. You print it twice for the first three cases above. But in cases 4 and 5 above, you need the number after the underscore ("_") and before "del".
What I've done is, I've defined a new variable called $num3.
- By default, $num3 equals $num2. And it is set as soon as we know the value of $num2.
- In the cases 4 and 5, we extract the value of $num3 and overwrite the default value.
We can then print $num1, $num2, $num3, $ch1, $ch2.
All the ideas above are incorporated in the code below:
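A minimal sketch of how these ideas might fit together follows. It is not necessarily the exact code from my post: it handles only the three token forms shown earlier in this thread, and the sub name parse_line, the "-" placeholder for a missing letter, and the "+ 0" trick to turn "000013" into 13 are my assumptions, based on the one output line you posted.

```perl
use strict;
use warnings;

# Parse every NC_ token on one line into [chrom, start, end, ch1, ch2] rows.
sub parse_line {
    my ($line) = @_;
    my @rows;
    while ($line =~ /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
        my ($num1, $num2, $common) = ($1, $2, $3);
        my $num3 = $num2;                  # default: same as $num2
        my ($ch1, $ch2) = ('-', '-');      # assumed placeholder

        if ($common =~ /^([A-Z])>([A-Z])$/) {         # e.g. C>T
            ($ch1, $ch2) = ($1, $2);
        }
        elsif ($common =~ /^_(\d+)del([A-Z]+)$/) {    # e.g. _20763146delTG
            ($num3, $ch1) = ($1, $2);                 # overwrite the default
        }
        elsif ($common =~ /^del([A-Z]+)$/) {          # e.g. delC
            $ch1 = $1;
        }
        else {
            next;   # any other form is skipped in this sketch
        }
        # "+ 0" drops the leading zeros (assumed from the posted output).
        push @rows, [$num1 + 0, $num2, $num3, $ch1, $ch2];
    }
    return @rows;
}

my $line = join "\t",
    'NC_000013.10:g.20763642C>T',
    'NC_000013.10:g.20763686delC',
    'NC_000013.10:g.20763145_20763146delTG';

print join("\t", @$_), "\n" for parse_line($line);
```

The remaining cases of $common would each need their own "elsif" branch in the same style.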
Thank you for the explanations and color coding, that helps a lot. It's a lot to take in, but it definitely makes sense. I really appreciate your help and efforts.