Perl to parse a variety of formats


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Perl to parse a variety of formats
# 1  
Perl to parse a variety of formats

The below perl script parses a variety of formats. If I use the numeric text file as input the script works correctly. However using the alpha text file as input there is a black output file. The portion in bold splits the field to parse f[2] or NC_000023.10:g.153297761C>A into a variable $common but since the portion after the NC_ can be alpha or numeric I think that is the issue. Currently it is set to only numeric (\d+), so maybe (\d+ || [Aa-Zz]) would solve this. I am still learning so I wanted to check and make sure it wasn't something else I over-looked. Thank you Smilie.

numeric tab-delimited

Code:
Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_000023.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T

Code:
perl -ne 'next if $. == 1;
    if(/.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/)   # indel
    {
            print join("\t", $3, $4, $5, $1, $2), "\n";
            }
              else
{
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {                                            # conditional parse
                ($num1, $num2, $common) = ($1, $2, $3);
                $num3 = $num2;
                if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
                elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
                elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
                elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
                elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ("-", $1, $2) }  # multi insertion
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);                 # output
                map {undef} ($num1, $num2, $num3, $common, $ch1, $ch2);
            }
}' numeric

Code:
23    153297761    153297761    C    A tab-delimeted

alpha tab-delimited

Code:
Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_0000X.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T
Same script produces a blank output file after it executes.

desired output tab-delimeted
Code:
X    153297761    153297761    C    A


Last edited by cmccabe; 04-30-2017 at 11:39 AM.. Reason: fixed format and added desired output
# 2  
I have changed:

Code:
 while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {

to

Code:
 while (/\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g) {

and

Code:
printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);

to

Code:
printf ("%s\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);

and getting

using numeric as input:
Code:
000023    153297761    153297761    C    A

using alpha as input:
Code:
0000X    153297761    153297761    C    A

there are multiple lines of input in each file but each line is following the same format.

Is the printf not correct or should I not use \w+? Thank you Smilie.

Last edited by cmccabe; 05-01-2017 at 12:45 PM.. Reason: fixed format
# 3  
While in Perl you could execute a lot of code at the command line, I would recommend that you use one-liners and executable for just testing concepts and quick throw-away.

For example, I want to test that I can split the line by tabs and work with the second field.
I want to extract information from the second field. I might do the following:

Code:
cat example.txt
Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_000023.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>
NM_004992.3:c.274G>T        NC_0000X.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T

Code:
perl -nale '@f = $F[1] =~ /NC_0+(\w+)\.\d+:g\.(\d+)(\w)>(\w)/; print join "\t", @f[0,1],$f[1],@f[2,3]' example.txt

23      153297761       153297761       C       A
X       153297761       153297761       C       A

Now, that I know that my regex is extracting what I want, let's implement it to keep:
Code:
#!/usr/bin/perl
# extract.pl
use strict;
use warnings;

{
    local $, = "\t";
    while(<>) {
        my @fields = split /\t+/;
        my @u = $fields[1] =~ /NC_0+(\w+)\.\d+:g\.(\d+)(\w)>(\w)/;
        print @u[0,1],$u[1],@u[2,3] . "\n" if @u;
    }
}


Code:
perl extract.pl example.txt
23      153297761       153297761       C       A
X       153297761       153297761       C       A


Last edited by Aia; 05-02-2017 at 02:47 AM.. Reason: remove scalar
This User Gave Thanks to Aia For This Post:
# 4  
Since my input could contain multiple format types, I use multiple regex to capture them. I see your point but need to redo for each condition. They aren't in every file, but the format is often different.

For example,

A SNP regex would not work for an insertion (comments on each line). Thank you very much for the suggestion and help, it is working ).

Last edited by cmccabe; 05-02-2017 at 09:40 AM..
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #534
Difficulty: Medium
All programming languages have an explicit Boolean type.
True or False?

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Parse apache log file with three different time formats

Hi, I want to parse below file and Write a function to extract the logs between two given timestamp. Apache (Unix) Log Samples - MonitorWare The challenge here is there are three date and time format. First :- 07/Mar/2004:16:05:49 Second :- Sun Mar 7 16:02:00 2004 Third :- 29-Mar... (6 Replies)
Discussion started by: sahil_shine
6 Replies

2. Shell Programming and Scripting

Help to parse syslog with perl

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205189 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined"... (3 Replies)
Discussion started by: arm
3 Replies

3. Shell Programming and Scripting

Perl:: mass replacement of converting C code formats to tgmath.h

hello, i have a lot of C old code I'm updating to C11 with tgmath.h for generic math. the old code has very specific types, real and complex, like cabsl, csinhl, etc usually for simple bulk replacements i would do something simple like this perl -pi -e 's/cosl/cos/g' *.c the reference... (0 Replies)
Discussion started by: f77hack
0 Replies

4. Shell Programming and Scripting

Perl to parse

The below code works great to parse out a file if the input is in the attached SNP format ">". perl -ne 'next if $.==1; while(/\t*NC_(\d+)\.\S+g\.(\d+)()>()/g){printf("%d\t%d\t%d\t%s\t%s\n",$1,$2,$2,$3,$4,$5)}' out_position.txt > out_parse.txt My question is if there is another format in... (10 Replies)
Discussion started by: cmccabe
10 Replies

5. Shell Programming and Scripting

Using awk to parse a file with mixed formats in columns

Greetings I have a file formatted like this: rhino grey weight=1003;height=231;class=heaviest;histology=9,0,0,8 bird white weight=23;height=88;class=light;histology=7,5,1,0,0 turtle green weight=40;height=9;class=light;histology=6,0,2,0... (2 Replies)
Discussion started by: Twinklefingers
2 Replies

6. Programming

Perl parse string

Hi Perl Guys I have another perl question I have the following code that i have written Getopt::Long::config(qw( permute bundling )); my $OPT = {}; GetOptions($OPT, qw( ver=s help|h )) or die "options parsing failed"; This will allow the user to do something like... (4 Replies)
Discussion started by: ab52
4 Replies

7. Shell Programming and Scripting

Perl Parse

Hi I'm writing simple perl script to parse the ftp log as below: Local directory now /home/user/testing 227 Entering Passive Mode (192,254,19,34,8,228). 125 Data connection already open; Transfer starting. 09-25-09 02:33PM 25333629 abc.tar 09-14-09 12:50PM 18015752... (1 Reply)
Discussion started by: netxus
1 Replies

8. Shell Programming and Scripting

Breaking if-else loop and variety of comparisions

Hello Friends, Im trying to write a script to invoke nagios. In order to do this I grep some words that comes from output of some backup scripts. When there is "End-of-tape detected" in directed output logs it should give alarm. First I would like to know if there is any better way to write... (5 Replies)
Discussion started by: EAGL€
5 Replies

9. Shell Programming and Scripting

Passing date formats in Perl: i.e. Jul/10/2007 -> 20070710 (yyyymmdd) - Perl

Hi , This script working for fine if pass script-name.sh Jul/10/2007 ,I want to pass 20070710(yyyymmdd) .Please any help it should be appereciated. use Time::Local; my $d = $ARGV; my $t = $ARGV; my $m = ""; @d = split /\//, $d; @t = split /:/, $t; if ( $d eq "Jan" ) { $m = 0 }... (7 Replies)
Discussion started by: akil
7 Replies

10. Windows & DOS: Issues & Discussions

print image files to variety printer models

Hi, I am currently working on a windows platform (2000 and XP) and was wondering if there are today solutions for the task I have. I need to print image files onto a variety of inkjet printer models, most epson non-postscript. Some of the models I know but new models are added almost every... (0 Replies)
Discussion started by: jokofix007
0 Replies

Featured Tech Videos