Need help comparing Base Pairs within PERL

06-06-2012

OK, so the reference will always be static?

As in this is ALWAYS the reference:

Code:

Reference:
1)T
2)A
3)C
4)G
5)C

ddreggors

06-06-2012

Yes, but the numbers continue into the millions, so I believe it'd be easier to set the reference as the entire filename instead.

drossy

06-06-2012

OK getting closer, but I am still unsure of *where* the references come from.

If they were static like you posted that is easy. Now you mention a filename:

Quote:

Yes, but the numbers continue into the millions, so I believe it'd be easier to set the reference as the entire filename instead.

What file name?

I cannot find anywhere in this post where you have mentioned getting the references from filenames...

If there is more than one file involved, please say so. If the *names* of the files are important, point that out as well and specify the names.

We need a sample of *ALL* data you are working with (even if bogus but representative), and a description of what columns, fields, or characters you want to match on.

Beyond that we can only work with what we are given and your results may not be helpful.

EDIT:
Going on the static reference you gave I come up with this:

Code:

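#!/usr/bin/perl

use strict;
use warnings;

# Static reference bases; index N in the file maps to $ref[N - 1]
my @ref = ('T', 'A', 'C', 'G', 'C');

open(my $fh, "<", "file.txt") or die "Cannot open file.txt: $!";
while (my $line = <$fh>) {
        # Skip the header line
        next if $line =~ /Index/;
        # Columns: index, chromosome, genotype, mutation
        next unless $line =~ /^(\d+)\s+([a-z]*\d+)\s+([a-z]+)\s+([a-z]+)/i;
        my ($idx, $geno) = ($1 - 1, $3);
        # Skip indexes that have no reference base
        next if $idx < 0 or $idx > $#ref;
        # Print only lines whose genotype does NOT contain the reference base
        next if $geno =~ /$ref[$idx]/;
        print $line;
}
close $fh;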

RESULTS:

Code:

> ./test.pl   
3           Chr1           AG            Exonic
4           Chr1           CC            Frameshift

Maybe that will help you get further in your solution...

Last edited by ddreggors; 06-06-2012 at 12:08 PM..

ddreggors

06-06-2012

Okay, I'll start over and try to be more exact.
I am looking at a file that has about 3 million rows and about 100 columns. The row number is given by the index number in the 1st column, as previously shown. The genotypes that I need to look at, 'TT' or 'GG' for example, are located in the 88th column. The last column of importance is the 6th, which gives the name of the mutation, such as 'intronic'.
In a separate file, there are two columns. The first is the index number, which ranges from 1 to about 3 million, and the second has static values for the references, e.g. 'T' or 'G'.
I would like to compare the letters at index 1 of the first file with the letter at index 1 of the second file, and so on.
Now I am changing one aspect of this, so I apologize.
If either one of the letters from the first file at index 1 matches the letter from the second file at index 1, I would like the 1st column and 6th column of the first file to be printed out.

For example:

Code:

 
File #1
Column 1     ....        Column 6     .....   Column 88
(Index)                   (mutation)            (Genotype)
1                             Intronic                 TT
2                             Frameshift             GT
3                             Exonic                   AT
4                             Exonic                   AA
5                             Intronic                 GC
 
File #2
Column 1      Column 2
(index)          (reference letter)
1                      A
2                      C
3                      C
4                      A
5                      G

Output: Since at indexes 4 and 5 one of the letters in the genotype in file 1 matches the letter from file 2, I would like the following to be displayed:

Code:

 
4 Exonic 
5 Intronic

I hope this is more helpful

drossy

06-06-2012

OK, now I can do that. I understand now and will give you an example in a minute...

---------- Post updated at 01:05 PM ---------- Previous update was at 11:34 AM ----------

OK given the following files:

FILE1

Code:

Column 1     ....        Column 6     .....   Column 88
(Index)                   (mutation)            (Genotype)
1                             Intronic                 TT
2                             Frameshift             GT
3                             Exonic                   AT
4                             Exonic                   AA
5                             Intronic                 GC

FILE2

Code:

Column 1      Column 2
(index)          (reference letter)
1                      A
2                      C
3                      C
4                      A
5                      G

I have written this:

Code:

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);


# file1.txt is the actual data file
# We want to match data lines to reference lines
# So first we place the data lines (not columns) into an array.

open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        # If the line does not start with a number
        # we skip this line
        next unless $_ =~ /^\d/;
        # split the line into index (idx), and all else is placed in tmp
        ($idx, $tmp) = split(/\s+/,$_,2);
        # Populate the data array with the line minus the index column
        $data[$idx] = $tmp;
}


# file2.txt is the reference file with only 2 columns.
# We parse the file and split it into index and value pairs.
# Then we can use the index to look up the matching data line,
# break that line down into its column components,
# and match as needed.
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        # Capture the index ($1) and the reference letter ($2)
        /^(\d*)\s*([a-z]*)/i;
        # Split the data line columns by spaces and
        # place these columns into a new temp array
        my @tmparr = split(/\s+/,$data[$1]);
        # Now we can look directly at one column for testing:
        # column 1 in my case, but column 86 in yours.
        # Column 88 becomes column 86 because we subtract 1 for the index removed in the first loop above and 1 more for the 0-based array.
        # My file1.txt had 3 columns; removing the index leaves 2 columns, which a zero-based array numbers 0 and 1.
        # Skip the line if the genotype contains the reference letter
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}

and the result is:

Code:

> ./test.pl   
1       Intronic                 TT
2       Frameshift             GT
3       Exonic                   AT
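
The column arithmetic described in the comments (column 88 becoming element 86) can be checked with a small sketch. The helper name field_index is made up for illustration, and it assumes the index column has already been split off the line, as the first loop of the script does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert a 1-based column number of the original
# line into a 0-based element of the array that is left after the index
# column has been split off.
sub field_index {
        my ($column) = @_;
        return $column - 2;   # -1 for the removed index, -1 for 0-based arrays
}

# A made-up 3-column line in the style of the sample file
my $line = "4 Exonic AA";
my ($idx, $rest) = split(/\s+/, $line, 2);
my @fields = split(/\s+/, $rest);
my $geno = $fields[field_index(3)];

print "genotype at index $idx: $geno\n";
print "column 88 -> element ", field_index(88), "\n";
print "column 6  -> element ", field_index(6), "\n";
```

So in an 88-column file the genotype would be read from element 86 and the mutation name (column 6) from element 4.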

---------- Post updated at 01:16 PM ---------- Previous update was at 01:05 PM ----------

Just a quick note, without all my comments this is not a large script either:

Code:

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);
open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}

Last edited by ddreggors; 06-06-2012 at 02:25 PM..

ddreggors

06-06-2012

Thank you so much, it's running flawlessly for me.
Is there any quick alteration that could be made so that instead of reporting back the cases in which neither letter is the same as the reference, it would instead report all cases where at least one of the letters DOES match the reference?
Sorry for changing things up, and I appreciate your help a ton.

drossy

06-06-2012

Quote:

Originally Posted by drossy

Is there any quick alteration that could be made so that instead of reporting back the cases in which neither letter is the same as the reference, it would instead report all cases where at least one of the letters DOES match the reference?

As in the opposite of what it is doing now?

If so then the next to last line in the second loop can be changed as follows...

Original:

Code:

next unless $tmparr[1] !~ /$2/;

Changed:

Code:

next unless $tmparr[1] =~ /$2/;
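
To see what each operator selects, here is a minimal sketch that runs both tests over the sample data from the posts above. The data is hard-coded, and the helper name select_indexes is invented for this illustration; it is not part of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the posts above: index => genotype, index => reference
my %geno = (1 => 'TT', 2 => 'GT', 3 => 'AT', 4 => 'AA', 5 => 'GC');
my %ref  = (1 => 'A',  2 => 'C',  3 => 'C',  4 => 'A',  5 => 'G');

# Hypothetical helper: return the indexes whose genotype contains the
# reference letter ($want_match true) or does not contain it (false).
sub select_indexes {
        my ($want_match) = @_;
        my @hits;
        for my $i (sort { $a <=> $b } keys %geno) {
                my $found = $geno{$i} =~ /\Q$ref{$i}\E/ ? 1 : 0;
                push @hits, $i if $found == ($want_match ? 1 : 0);
        }
        return @hits;
}

print "!~ keeps: @{[ select_indexes(0) ]}\n";   # 1 2 3
print "=~ keeps: @{[ select_indexes(1) ]}\n";   # 4 5
```

With "!~" the script keeps the rows where neither letter matches (1, 2, 3); with "=~" it keeps the rows where at least one letter matches (4, 5), which is the output asked for.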

BTW my original code was only really good for 3 columns in the data file.
Here is the same code working for 88 columns as you stated...

Code:

#!/usr/bin/perl

use strict;
use warnings;
my ($idx, $tmp, @data);
open(FILE1,"<","file3.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
close FILE1;

open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[86] !~ /$2/;
        print $1 . "\t" . $tmparr[4] . "\t" . $tmparr[86] . "\n";
}
close FILE2;

Again, change the next-to-last line of the second loop from "!~" to "=~" to get the opposite behavior.

NOTE:
I cleaned up the code a bit: I removed the unused variables ($geno and @ref), and I now close the file handles (FILE1 and FILE2).
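
With about 3 million rows and 100 columns, holding every data line in @data may use a lot of memory. One possible variation, sketched here under assumptions, loads only the two-column reference file into a hash and then streams the data file one line at a time. The sub names load_reference and report_matches, the whitespace-separated layout, the upper-casing of letters, and the callback argument are all assumptions made for illustration; none of them come from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read the reference file (index, letter) into a hash: index => letter.
sub load_reference {
        my ($path) = @_;
        my %ref;
        open(my $fh, '<', $path) or die "Cannot open $path: $!";
        while (my $line = <$fh>) {
                # Header lines do not start with a digit and are skipped
                next unless $line =~ /^(\d+)\s+([a-z])\s*$/i;
                $ref{$1} = uc $2;
        }
        close $fh;
        return \%ref;
}

# Stream the data file and call $emit->(index, mutation) for every row
# whose genotype contains the reference letter. Columns are 1-based.
sub report_matches {
        my ($path, $ref, $mut_col, $geno_col, $emit) = @_;
        open(my $fh, '<', $path) or die "Cannot open $path: $!";
        while (my $line = <$fh>) {
                chomp $line;
                next unless $line =~ /^\d/;
                my @f = split(/\s+/, $line);
                next if @f < $geno_col;        # short or malformed row
                my $base = $ref->{ $f[0] };
                next unless defined $base;     # no reference for this index
                # index() is a plain substring test, so no regex is needed
                next if index(uc($f[$geno_col - 1]), $base) < 0;
                $emit->($f[0], $f[$mut_col - 1]);
        }
        close $fh;
        return;
}

# Usage (assumed): perl script.pl data_file reference_file
if (@ARGV == 2) {
        my ($data_file, $ref_file) = @ARGV;
        report_matches($data_file, load_reference($ref_file), 6, 88,
                sub { print join("\t", @_), "\n" });
}
```

Because only one data line is in memory at a time, the memory cost is dominated by the reference hash. This version prints the matching rows (the "=~" behavior); reversing the index() test would give the non-matching rows instead.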

Last edited by ddreggors; 06-06-2012 at 03:56 PM..

ddreggors
