Need help comparing Base Pairs within PERL


 
# 8  
Old 06-06-2012
OK, so the reference will always be static?

As in this is ALWAYS the reference:
Code:
Reference:
1)T
2)A
3)C
4)G
5)C

# 9  
Old 06-06-2012
Yes, but the numbers continue into the millions, so I believe it'd be easier to set the reference as the entire filename instead.
# 10  
Old 06-06-2012
OK getting closer, but I am still unsure of *where* the references come from.

If they were static like you posted that is easy. Now you mention a filename:

Quote:
Yes, but the numbers continue into the millions, so I believe it'd be easier to set the reference as the entire filename instead.
What file name?

I cannot find anywhere in this post where you have mentioned getting the references from filenames...

If there is more than one file involved please specify that, if the *names* of the file are important then also point that out and specify the names.

We need a sample of *ALL* data you are working with (even if bogus but representative), and a description of what columns, fields, or characters you want to match on.

Beyond that we can only work with what we are given and your results may not be helpful.

EDIT:
Going on the static reference you gave I come up with this:
Code:
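#!/usr/bin/perl

use strict;
use warnings;

# Static reference letters; index 1 in the data maps to $ref[0]
my @ref = ('T', 'A', 'C', 'G', 'C');

open(my $fh, "<", "file.txt") or die "Cannot open file.txt: $!";
while (<$fh>) {
        # Skip the header line
        next if /Index/;
        # Capture the index, chromosome, genotype and mutation columns
        next unless /^(\d+)\s*([a-z]*\d)\s*([a-z]*)\s*([a-z]*)/i;
        my $idx  = $1 - 1;
        my $geno = $3;
        # Skip rows that have no reference letter for this index
        next if $idx < 0 or !defined $ref[$idx];
        # Print only rows whose genotype does not contain the reference letter
        next if $geno =~ /$ref[$idx]/;
        print $_;
}
close $fh;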

RESULTS:
Code:
> ./test.pl   
3           Chr1           AG            Exonic
4           Chr1           CC            Frameshift

Maybe that will help you get further in your solution...

Last edited by ddreggors; 06-06-2012 at 12:08 PM..
# 11  
Old 06-06-2012
Okay I'll start over and try to be more exact.
I am looking at a file that has about 3 million rows and about 100 columns. The row number is given by the index number in the 1st column, as previously shown. The genotypes that I need to look at, 'TT' or 'GG' for example, are located in the 88th column. The last column of importance is the 6th, which gives the name of the mutation, such as 'intronic'.
In a separate file, there are two columns. The first is the index number, which ranges from 1 to about 3 million, and the second holds the static reference values, e.g. 'T' or 'G'.
I would like to compare the letters at index 1 of the first file with the letter at index 1 of the second file, and so on.
I am now changing one aspect of this, so I apologize.
If either of the letters from the first file at index 1 matches the letter from the second file at index 1, I would like the 1st and 6th columns of the first file to be printed out.


For example:
Code:
 
File #1
Column 1   ....   Column 6      ....   Column 88
(Index)           (mutation)           (Genotype)
1                 Intronic             TT
2                 Frameshift           GT
3                 Exonic               AT
4                 Exonic               AA
5                 Intronic             GC
 
File #2
Column 1   Column 2
(index)    (reference letter)
1          A
2          C
3          C
4          A
5          G

Output: Since at indexes 4 and 5 one of the letters in the genotype in file 1 matches the letter from file 2, I would like the following to be displayed:
Code:
 
4 Exonic 
5 Intronic

I hope this is more helpful
# 12  
Old 06-06-2012
OK, now I can do that. I understand now and will give you an example in a minute...

---------- Post updated at 01:05 PM ---------- Previous update was at 11:34 AM ----------

OK given the following files:

FILE1
Code:
Column 1     ....        Column 6     .....   Column 88
(Index)                   (mutation)            (Genotype)
1                             Intronic                 TT
2                             Frameshift             GT
3                             Exonic                   AT
4                             Exonic                   AA
5                             Intronic                 GC

FILE2
Code:
Column 1      Column 2
(index)          (reference letter)
1                      A
2                      C
3                      C
4                      A
5                      G


I have written this:
Code:
#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);


# file1.txt is the actual data file.
# We want to match data lines to reference lines,
# so first we place the data lines (not columns) into an array.

open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        # If the line does not start with a number
        # we skip this line
        next unless $_ =~ /^\d/;
        # split the line into index (idx), and all else is placed in tmp
        ($idx, $tmp) = split(/\s+/,$_,2);
        # Populate the data array with the line minus the index column
        $data[$idx] = $tmp;
}


# file2.txt is the reference file with only 2 columns.
# We parse the file and split it into index and value pairs.
# Then we can use the index to look up the matching data line and,
# once we have that data, break it down into
# its column components and match as needed.
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
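        # Split the index and reference letter; skip malformed lines
        next unless /^(\d+)\s+([a-z]+)/i;
        # Skip indexes that have no data line in file1.txt
        next unless defined $data[$1];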
        # Split the data line columns by spaces and 
        # place these columns into a new temp array  
        my @tmparr = split(/\s+/,$data[$1]);
        # Now we can look directly at one column for testing:
        # column 1 in my case but column 86 in yours.
        # Column 88 becomes column 86 because we subtract 1 for the index removed in the first loop above and 1 more for the 0-based array.
        # My file1.txt had 3 columns; removing the index leaves 2 columns, and a zero-based array means we have columns 0 and 1.
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}


and the result is:
Code:
> ./test.pl   
1       Intronic                 TT
2       Frameshift             GT
3       Exonic                   AT

---------- Post updated at 01:16 PM ---------- Previous update was at 01:05 PM ----------

Just a quick note, without all my comments this is not a large script either:

Code:
#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);
open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        next unless /^(\d+)\s+([a-z]+)/i;
        next unless defined $data[$1];
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}
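
With about 3 million rows of roughly 100 columns each, keeping every data line in @data may use a lot of memory. One possible alternative, which is a swapped-in approach and not what the script above does, is to load only the small reference file into an array and then stream the data file line by line. The sketch below is a minimal version under these assumptions: fields are whitespace-separated with no empty fields, the sample data from this thread stands in for the real files through in-memory filehandles, and the variable names ($geno_col, @unmatched) are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from this thread, read through in-memory filehandles so the
# sketch runs on its own. For real use, open file1.txt and file2.txt instead.
my $file1 = <<'END';
(Index)   (mutation)   (Genotype)
1   Intronic     TT
2   Frameshift   GT
3   Exonic       AT
4   Exonic       AA
5   Intronic     GC
END
my $file2 = <<'END';
(index)   (reference letter)
1   A
2   C
3   C
4   A
5   G
END

# Pass 1: keep only the reference letters, indexed by row number.
my @ref;
open(my $ref_fh, '<', \$file2) or die $!;
while (my $line = <$ref_fh>) {
    next unless $line =~ /^(\d+)\s+([A-Za-z])/;
    $ref[$1] = uc $2;
}
close $ref_fh;

# Pass 2: stream the data file one line at a time.
# The index column is NOT removed here, so this is a plain 0-based field
# number: 2 for the 3-column sample, 87 for column 88 of the real file.
my $geno_col = 2;
my @unmatched;
open(my $data_fh, '<', \$file1) or die $!;
while (my $line = <$data_fh>) {
    next unless $line =~ /^\d/;          # skip header lines
    my @f = split ' ', $line;
    my $letter = $ref[ $f[0] ];
    next unless defined $letter and defined $f[$geno_col];
    # Same test as the script above: keep rows where the reference
    # letter does NOT occur in the genotype.
    push @unmatched, $f[0] if index(uc($f[$geno_col]), $letter) < 0;
}
close $data_fh;

print "$_\n" for @unmatched;
```

On the sample data this prints the indexes 1, 2 and 3, the same rows the script above reports. If the real file is tab-delimited or has empty fields, the split would need to change accordingly.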


Last edited by ddreggors; 06-06-2012 at 02:25 PM..
# 13  
Old 06-06-2012
Thank you so much, it's running flawlessly for me.
Is there any quick alteration that could be made so that instead of reporting back the cases in which neither letter is the same as the reference, it would instead report all cases where at least one of the letters DOES match the reference?
Sorry for changing things up, and I appreciate your help a ton.
# 14  
Old 06-06-2012
Quote:
Originally Posted by drossy
Is there any quick alteration that could be made so that instead of reporting back the cases in which neither letter is the same as the reference, it would instead report all cases where at least one of the letters DOES match the reference?
As in the opposite of what it is doing now?

If so then the next to last line in the second loop can be changed as follows...

Original:
Code:
next unless $tmparr[1] !~ /$2/;

Changed:
Code:
next unless $tmparr[1] =~ /$2/;
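
To see both behaviors side by side, here is a minimal sketch that runs the two tests over the sample genotypes and reference letters from this thread. The helper name has_ref_letter is made up for illustration, and it uses index() rather than a regex match, which is a substitution on my part and not what the script itself does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample genotypes and reference letters from the example files, keyed by index.
my %geno = (1 => 'TT', 2 => 'GT', 3 => 'AT', 4 => 'AA', 5 => 'GC');
my %ref  = (1 => 'A',  2 => 'C',  3 => 'C',  4 => 'A',  5 => 'G');

# Hypothetical helper: true when the reference letter occurs
# anywhere in the genotype string.
sub has_ref_letter {
    my ($genotype, $letter) = @_;
    return index($genotype, $letter) >= 0;
}

my @indexes = sort { $a <=> $b } keys %geno;

# Equivalent of the "=~" version: at least one letter matches the reference.
my @match    = grep {  has_ref_letter($geno{$_}, $ref{$_}) } @indexes;
# Equivalent of the "!~" version: neither letter matches the reference.
my @no_match = grep { !has_ref_letter($geno{$_}, $ref{$_}) } @indexes;

print "match:    @match\n";
print "no match: @no_match\n";
```

For the sample data the matching indexes are 4 and 5, and the non-matching ones are 1, 2 and 3.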

BTW my original code was only really good for 3 columns in the data file.
Here is the same code working for 88 columns as you stated...

Code:
#!/usr/bin/perl

use strict;
use warnings;
my ($idx, $tmp, @data);
open(FILE1,"<","file3.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
close FILE1;

open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        next unless /^(\d+)\s+([a-z]+)/i;
        next unless defined $data[$1];
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[86] !~ /$2/;
        print $1 . "\t" . $tmparr[4] . "\t" . $tmparr[86] . "\n";
}
close FILE2;

Again, change the next-to-last line of the second loop from "!~" to "=~" to get the opposite behavior.

NOTE:
I cleaned up the code a bit: I removed the unused variables ($geno & @ref), and I now close the file handles (FILE1 & FILE2).
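
In case the column numbers in the script look odd, the offset arithmetic can be sketched as below. The helper field_index is hypothetical and only illustrates the calculation: subtract the number of leading columns that were removed (1, the index column), then subtract 1 more because Perl arrays are 0-based.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert a 1-based column number in the original file
# into a 0-based position in @tmparr, given how many leading columns were
# stripped before the split.
sub field_index {
    my ($column, $removed_leading) = @_;
    return $column - $removed_leading - 1;
}

printf "column 88 -> \$tmparr[%d]\n", field_index(88, 1);   # genotype
printf "column 6  -> \$tmparr[%d]\n", field_index(6, 1);    # mutation
```

This gives 86 for the genotype column and 4 for the mutation column, which are the two positions used in the script.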

Last edited by ddreggors; 06-06-2012 at 03:56 PM..