Perl to parse


 
# 1  
Old 03-24-2015

The code below works well for parsing a file when the input is in the SNP format (marked by ">"), as in the attached file.

Code:
 perl -ne 'next if $.==1; while(/\t*NC_(\d+)\.\S+g\.(\d+)([A-Z])>([A-Z])/g){printf("%d\t%d\t%d\t%s\t%s\n",$1,$2,$2,$3,$4)}' out_position.txt > out_parse.txt

My question: if the input also contains another format, such as "del", can both be parsed in the same pass?

Code:
SNP parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter (C) before the > 
4. letter (G) after the > 
Desired Output from parse:  13     20763438     20763438     C     G
Desired Output from parse: 13     20763642     20763642     C     G

DEL parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter after "del" (C) 
4.  hyphen "-" used in this spot   
      
Desired Output from parse:  13     20763438     20763438     C     G


Last edited by cmccabe; 03-24-2015 at 05:53 PM..
# 2  
Old 03-24-2015
Quote:
Originally Posted by cmccabe
...
Code:
...
DEL parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter after "del" (C) 
4.  hyphen "-" used in this spot   
      
Desired Output from parse:  13     20763438     20763438     C     G

(a)
Rule # 4 - there is no hyphen ("-") either in your attached file or in your output. So what exactly is that rule? Can you post input data on which that rule can be applied?

(b)
Desired output:
The token "20763438" is not seen in the last line of your attached file (the one where the "DEL" parse rule is to be applied apparently), but it is seen in your desired output.

The characters "C" and "G" are not seen together in any token in the last line of your attached file, but are seen together in the desired output.

Can you post the appropriate input data and its corresponding desired output that is obtained after applying all the rules?
# 3  
Old 03-24-2015
In the attached file, the last line is:
NM_004004.5:c.35delG NC_000013.10:g.20763686delC NM_004004.5:c.35delG XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
The third column (the header row is skipped), NC_000013.10:g.20763686delC, is the field to be parsed into the desired output: 13 20763686 20763686 C -

There is no "-" in the file. The "-" appears only in the output, in $5, to signify a deletion.

The code in the post seems to work for the first and second case (where there is a >), but not for the third (del).

Thank you.
# 4  
Old 03-25-2015
Code:
$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.283G>C            NC_000013.10:g.20763438C>G      NM_004004.5:c.283G>C    XM_005266354.1:c.283G>C XM_005266355.1:c.283G>C XM_005266356.1:c.283G>C
NM_004004.5:c.79G>C             NC_000013.10:g.20763642C>G      NM_004004.5:c.79G>C     XM_005266354.1:c.79G>C  XM_005266355.1:c.79G>C  XM_005266356.1:c.79G>C
NM_004004.5:c.35delG            NC_000013.10:g.20763686delC     NM_004004.5:c.35delG    XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
$
$
$ # Method 1 : Using a non-capturing grouping in Perl regular expression
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(?:del)*([A-Z])>*([A-Z]*)/g) {
                printf ("%d\t%d\t%d\t%s\t%s\n", $1, $2, $2, $3, $4 || "-");
            }
           ' out_position.txt
13      20763438        20763438        C       G
13      20763642        20763642        C       G
13      20763686        20763686        C       -
$
$
$ # Method 2 : Using more elaborate but plain-vanilla regular expressions
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+[A-Z])/g) {
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {
                    ($ch1, $ch2) = ($1, "-")
                }
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num2, $ch1, $ch2);
                ($num1, $num2, $common, $ch1, $ch2) = ();    # reset for the next match
            }
           ' out_position.txt
13      20763438        20763438        C       G
13      20763642        20763642        C       G
13      20763686        20763686        C       -
$
$

This User Gave Thanks to durden_tyler For This Post:
# 5  
Old 03-25-2015
Thank you, it works perfectly. I went with method 2. I am reading up on capturing groups and regular expressions, and they seem most useful in replacement operations on a named value. Is this correct? Thanks again.
# 6  
Old 03-25-2015
I have added a block of code to the script (in bold) to parse the attached file, but am getting a syntax error. Thank you.

The field to parse is NC_000013.10:g.20763145_20763146delTG

Code:
Parse rules:
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ### (before underscore)  _### (# after the _)
3. TG (all letters after del)
4. -  (hyphen used in this spot)    
Desired Output: 13     20763145     20763146     TG     -

Code:
 perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+[A-Z])/g) {     # 3 condition parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {             # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {             # deletion
                    ($ch1, $ch2) = ($1, "-")
                } elsif ($common =~ /ins([A-Z])/) {             # insertion
                    ($ch1, $ch2) = ("-", $1)
            while (/\t*NC_(\d+)\.\S+g\.\S+g\.(\d+)(\S+[A-Z])/g) {      # 2 condition parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /del([A-Z])/) {                        # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                   # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num2, $ch1, $ch2);    # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
            }
           ' out_position.txt > out_parse.txt


Code:
 
syntax error at -e line 10, near ") {"
Missing right curly or square bracket at -e line 20, at end of line
syntax error at -e line 20, at EOF
Execution of -e aborted due to compilation errors.

# 7  
Old 03-25-2015
I figured out the syntax error, but the parse is still not working correctly: it follows the same set of rules as for a single-base deletion, not the new set in post 6.

I attached the file to be parsed as well. Thank you.

Code:
Result as of now:
13	20763145	20763145	T	-


Should be:
13     20763145     20763146     TG     -
