Perl join two files by "common" column

02-04-2011

Hello,
I am looking for help with some code I have been struggling with for some time. The task is to join two files, each with 80k~180k rows, on a shared column. The difficulty is that the shared column matches only partially, not exactly.
File1:

Code:

GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCCAAAGGACAAGATCTTGCCTTCCTTGTTGGTAGAAAAAATGCCGAGAGCAG
GT_Specific_CTGSIN-SS-EX055483|266|60    TTCTACCTATCGTTTCGGCTCAAGTTAGTGTCAGCAAATGATCCGAACGGTCTGGAAATG
GT_Specific_CTGSIN-SS-CL15294Contig1|1386|60_New    TTTTCTTTATAAAGAACAGTCTGTGTGTTAATAATTCTCATCTCCTGTCCGGACATAGAC
GT_Xhyb_SUPCTG-SS-SuperContig_CL53Contig7|737|60    CGTTTGAATGTATGACATATGAACATCGTTGCTCTCCTTCATCTTTTATGTGTTTTGGTT
GT_Specific_CTGSIN-SS-CL11320Contig1|392|60    TACTCTTGTAAAACCTTATACATACTTGCACATAAGAGAAAGATGGGATGTATTTCACAA
.......

File2:

Code:

mira_assembly_rep_c5    AT4G25140.1    OLEO1 (OLEOSIN 1) 
mira_assembly_rep_c8    AT4G27140.1    2S seed storage protein 1 / 2S albumin storage protein / NWMU1-2S albumin 1 
mira_assembly_rep_c24    AT5G38195.1    protease inhibitor/seed storage/lipid transfer protein (LTP) family protein 
mira_assembly_rep_c29    AT5G39850.1    40S ribosomal protein S9 (RPS9C) 
mira_assembly_rep_c36    AT4G32100.1    galactosyltransferase 
......

I want to merge the two files to get something like:

Code:

GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCCAAAGGACAAGATCTTGCCTTCCTTGTTGGTAGAAAAAATGCCGAGAGCAG    mira_assembly_rep_c3140 AT4G25140.1    OLEO1 (OLEOSIN 1) 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c5|266|60    TTCTACCTATCGTTTCGGCTCAAGTTAGTGTCAGCAAATGATCCGAACGGTCTGGAAATG   mira_assembly_rep_c5 AT4G10270.1    wound-responsive family protein 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|1386|60_New    TTTTCTTTATAAAGAACAGTCTGTGTGTTAATAATTCTCATCTCCTGTCCGGACATAGAC   -mira_assembly_rep_c8 AT2G33830.2    dormancy/auxin associated family protein 
GT_Xhyb_SUPCTG-SS-SuperContig_mira_assembly_rep_c29|737|60    CGTTTGAATGTATGACATATGAACATCGTTGCTCTCCTTCATCTTTTATGTGTTTTGGTT mira_assembly_rep_c29   AT3G49910.1    60S ribosomal protein L26 (RPL26A)
......

Here is my code:

Code:

#!/usr/bin/perl -w
use strict;


my %line2;
my $merged;
my $count2;
my $col1=0;                 #The common column in file1
my $col2=0;                 #The common column in file2
my ($f1,$f2)=@ARGV;             #The two files to be merged
open(F2,$f2) or die $!; 
while (<F2>) { 
    s/\r?\n//;                 #remove return of carriage at the end of each line;
    my @F=split /\t/, $_;             #split the line by tab
    $line2{$F[$col2]} .= "$_\n"; }         #create a hash to store the line

$count2 = $.;                     #input line number
    
open(F1,$f1) or die $!; 
    while (<F1>) { 
        s/\r?\n//; 
        my @F=split /\t/, $_; 
        my $x = $line2{$F[$col1]}; 
            if ($x =~ m/$F[$col2]\|/) { 
            my $num_changes = ($x =~ s/^/$_\t/gm);     #substitute the beginning of the line with  
                                #the current line plus TAB
            print $x; 
             $merged += $num_changes;
            } 
    } 
    
warn "Joining $f1 column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2 lines\nMerged file: $merged lines\n";

# usage: match_script.pl file1 file2 > merged_file.tab

Note that the first column of File2 contains only part of the first column of File1: the part before the first vertical bar "|". Not every row of File1 has a match in File2, maybe 80k out of 180k. Both files are big.

The script runs, but it does not append the matched part of File2 to File1. Could anyone give me a clue?
I find this join/merge problem quite common in my work, and I do not have a database like MySQL. It would be great to grasp the general approach to coding this.
Thanks a lot!
Yifang

Last edited by yifangt; 02-05-2011 at 11:33 AM..

yifangt

02-04-2011

The two files do not seem to match even partially!
The string "mira_assembly_rep_c" is present in line 1 of File 1, and it is present at the beginning of all lines of File 2.

Other than that, there is nothing in common.

What's the logic for the merged file then?

Why is the 2nd line of merged file as follows?

Code:

GT_Specific_CTGSIN-SS-mira_assembly_rep_c5|266|60    TTCTACCTATCGTTTCGGCTCAAGTTAGTGTCAGCAAATGATCCGAACGGTCTGGAAATG   mira_assembly_rep_c5 AT4G10270.1    wound-responsive family protein

And why is the 3rd line of the merged file like so ?

Code:

GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|1386|60_New    TTTTCTTTATAAAGAACAGTCTGTGTGTTAATAATTCTCATCTCCTGTCCGGACATAGAC   -mira_assembly_rep_c8 AT2G33830.2    dormancy/auxin associated family protein

Maybe you could explain how you derived the merged file.

tyler_durden

durden_tyler

02-05-2011

Re: Perl join two files by "common" column

Thanks durden_tyler!

I cannot show either file in full as they are big (180000 rows and 90000 rows). That's why I need some help here. What I posted were just parts of the files, and they do not have common columns at first sight. But if I search through the rest of the files, there are certainly matches between them. I should have been more careful with the example. Here I have rearranged the first several rows of each file.

File1:

Code:

ID Sequence
GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCC...
GT_Specific_CTGSIN-SS-mira_assembly_rep_c24|266|60    TTCTACCTATC...
GT_Specific_CTGSIN-SS-mira_assembly_rep_c3|1386|60_New    TTTTCTT...
GT_Xhyb_SUPCTG-SS-SuperContig_mira_assembly_rep_c29|737|60  CGTTTGA...
GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|392|60    TACTCTTGT...
.......

Note that strings like "mira_assembly_rep_c3140" are the shared part in column 1 of File1.
In File2:

Code:

ID_Part Gene_Symbol Description
mira_assembly_rep_c5    AT4G25140.1    OLEO1 (OLEOSIN 1) 
mira_assembly_rep_c8    AT4G27140.1    2S seed storage protein 1 / 2S albumin storage protein / NWMU1-2S albumin 1 
mira_assembly_rep_c24    AT5G38195.1    protease inhibitor/seed storage/lipid transfer protein (LTP) family protein 
mira_assembly_rep_c29    AT5G39850.1    40S ribosomal protein S9 (RPS9C) 
mira_assembly_rep_c36    AT4G32100.1    galactosyltransferase

In File2, the ID_Part column partially matches column 1 (ID) of File1. In my code, I used this condition to test the match:

Code:

if ($x =~ m/$F[$col2]\|/)

The merged file contains the full ID, sequence and gene symbol and function description of the gene.

Code:

GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCC...   mira_assembly_rep_c3140 AT4G25140.1    OLEO1 (OLEOSIN 1) 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c5|266|60    TTCTACCTATC...  mira_assembly_rep_c5 AT4G10270.1    wound-responsive family protein 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|1386|60_New    TTTTCTTTATAAAGAACA...  -mira_assembly_rep_c8 AT2G33830.2    dormancy/auxin associated family protein 
GT_Xhyb_SUPCTG-SS-SuperContig_mira_assembly_rep_c29|737|60    CGTTTGAATGTATGAC...mira_assembly_rep_c29   AT3G49910.1    60S ribosomal protein L26 (RPL26A)
......

The merged file will be appended to another database with the ID as identifier. That's why I need the two files merged.

Quote:

Actually, "mira_assembly_rep_c3" rather than "mira_assembly_rep_c" was used for the matching test; that's why I appended a vertical bar "|" to the end of the string

Code:

$x=~ m/$F[$col2]\|/

Quote:

What's the logic for the merged file then?

The logic is tricky as the match is the middle part of the string.

Quote:

Why is the 2nd line of merged file as follows?

Maybe it is a typo because the row became too long to see, and was wrapped. Anyway, there are five fields:
ID Sequence Part_ID gene_symbol Description.

yifangt

02-05-2011

Hello!

A common bioinformatics problem, joining two tables :P I wonder why you posted your question to the "Web Development" section but it happens to be the only forum I subscribe to

Some comments about your code:

Code:

#!/usr/bin/perl -w
use strict;

is better written as:

Code:

#!/usr/bin/perl
use strict;
use warnings;

The -w switch has been superseded by use warnings.

Code:

while (<F2>) { 
    s/\r?\n//;                 #remove return of carriage at the end of each line;

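can be written with chomp:

Code:

chomp;

The chomp() function removes the trailing newline (strictly, whatever $/ is set to). It does not remove a carriage return, so if your files may have Windows line endings, keep the \r? in your substitution.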
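
To see what chomp actually strips from a line with Windows (CRLF) endings, here is a minimal sketch. It assumes $/ is left at its default of "\n"; strip_eol is a hypothetical helper name, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the line without a trailing LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# A File2-style row that ends in CRLF, as a Windows tool might write it.
my $crlf_line = "mira_assembly_rep_c5\tAT4G25140.1\r\n";

# chomp removes only the "\n"; the "\r" stays on the end of the last field.
my $chomped = $crlf_line;
chomp $chomped;

print "after chomp:     ", ($chomped =~ /\r\z/ ? "CR still there" : "clean"), "\n";
print "after strip_eol: ", (strip_eol($crlf_line) =~ /\r\z/ ? "CR still there" : "clean"), "\n";
```

A stray carriage return matters for a join because it becomes part of the last field, and of the key as well when the line has only one column.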

The strategy I would use is to capture just the assembly information from the ID and use it as the key under which the line is stored:

Code:

#note untested code
my @F=split /\t/, $_;
#use informative name for first column
my $id_seq = $F[0];
#remove the first pipe and everything after it
$id_seq =~ s/\|.*//;
#declare new variable for assembly information
my $assembly = '';
#store only assembly information
if ($id_seq =~ /.*(mira_.*)/){
   $assembly = $1;
} else {
   die "Unexpected notation on line $. for $id_seq\n";
}
#store the line information into a hash using $assembly as the key
$line2{$assembly} = $_;

Then read file2 like you did before and get the required information from your %line2 hash.
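
The key extraction above also points at why the original script printed nothing, as far as can be told from the posted code: it looks up the full File1 ID in a hash keyed by the short File2 ID, so the lookup never succeeds and $x stays undefined. A minimal sketch using rows from the thread (assembly_key is a hypothetical helper, and like the snippet above it only handles the "mira_" notation):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce a full File1 ID to the part File2 uses as its key.
sub assembly_key {
    my ($full_id) = @_;
    (my $id_seq = $full_id) =~ s/\|.*//;      # drop the first pipe and the rest
    my ($assembly) = $id_seq =~ /(mira_.*)/;  # undef when there is no "mira_"
    return $assembly;
}

# File2 rows keyed by their first column.
my %line2 = (
    'mira_assembly_rep_c24' => "mira_assembly_rep_c24\tAT5G38195.1\tprotease inhibitor/seed storage/lipid transfer protein (LTP) family protein",
    'mira_assembly_rep_c29' => "mira_assembly_rep_c29\tAT5G39850.1\t40S ribosomal protein S9 (RPS9C)",
);

my $full_id = 'GT_Specific_CTGSIN-SS-mira_assembly_rep_c24|266|60';
my $key     = assembly_key($full_id);

# The full ID is not a key of %line2; the extracted part is.
print "full ID:       ", (exists $line2{$full_id} ? "found" : "not found"), "\n";
print "extracted key: ", (defined $key && exists $line2{$key} ? "found" : "not found"), "\n";
```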

Hope that works and helps,

Dave

z1dane

02-06-2011

Hi Dave:
Thanks for your comments! Actually I prefer your style too, e.g. use warnings etc. I wrote another script:

Code:

#!/usr/bin/perl
use strict;
#use warnings;
my %Probe_n_Seq = ();
open(FILE1, "029931_D_SequenceList_20100827.txt") || die "Can't find file $!";
    while(<FILE1>){
          chomp $_;
          my @AAA=split(/\t/, $_);
           $Probe_n_Seq{$AAA[0]}=$AAA[1];
      }
close(FILE1);

open(FILE2, "CTG_n_SCTG_AGI_Entries.txtb")        || die "Can't find file $!";
while(<FILE2>) {
   chomp $_;
    my @BBB =split (/\t/, $_);
   foreach my $key (keys (%Probe_n_Seq)) {
    if ($key =~ m/$BBB[0]\|/) {
     print $key, "\t", $Probe_n_Seq{$key},"\t",$BBB[0]."\t".$BBB[1]."\t".$BBB[2],"\n";
         } 
    } 
 }
close(FILE2);

I used the first column $AAA[0] of file1 as the key of the hash, and then compared it with the first column $BBB[0] of file2. If $AAA[0] contains the string $BBB[0], that counts as a match, since "mira_" is not the only assembly marker.

Code:

if ($key =~ m/$BBB[0]\|/)

It seems to run, except for a small problem with

Code:

my %Probe_n_Seq = ();

which caused a warning and stopped the program, so I had to comment out use warnings.
The code takes ~6 hours to run on my 2.3 GHz dual CPU + 3GB RAM (Compaq) machine. I am not sure whether this can be improved, given that file1 has 147478 rows (15.2MB) and file2 has 86837 rows (7.2MB).
Actually I have another idea to reduce the workload, because the loop runs 147478x86837 times. If a match is found in file1, the matched row can be deleted, so that the next $BBB[0] from file2 does not need to search that row again ... so that the last search is 86838 instead of 147478 loops (when the match is in the last row, the worst scenario!). This is possible because each row is unique in both files. I could not figure out how to do this by myself. Any clue is highly appreciated!
Yifang

yifangt

02-07-2011

Hi yifangt,

No problems. chomp by the way operates on the default input ($_), so you can just specify chomp instead of chomp $_

The computation will take a while as you pointed out. I think it might be worth fixing up the first file so that everything is systematic e.g. having a standardised assembly notation, so you don't need to use a regular expression. Once that is fixed up, you can just use a hash to see if the key exists.
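
For scale, the nested loop performs roughly 147478 x 86837, or about 1.3 x 10^10, regex tests, which is likely where the hours go. One way to get plain hash lookups without first rewriting file1 is sketched below. It rests on assumptions that go beyond the thread: both files are tab-separated, the File2 ID always ends exactly at the first "|" of the File1 ID, and when several File2 IDs fit, the longest one is the wanted match. The names index_annotations, find_annotation and join_by_suffix.pl are made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index File2 rows by their first column.
# Returns a hash reference: ID_Part => full annotation row.
sub index_annotations {
    my ($fh) = @_;
    my %row_for;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;               # strip LF or CRLF
        next unless length $line;
        my ($id_part) = split /\t/, $line;
        $row_for{$id_part} = $line;
    }
    return \%row_for;
}

# Return the annotation row whose ID_Part sits directly before the first "|"
# of $full_id, or undef if there is none. Each candidate is one hash lookup,
# tried from the longest suffix to the shortest.
sub find_annotation {
    my ($row_for, $full_id) = @_;
    my ($prefix) = split /\|/, $full_id;
    return unless defined $prefix;
    for my $start (0 .. length($prefix) - 1) {
        my $candidate = substr $prefix, $start;
        return $row_for->{$candidate} if exists $row_for->{$candidate};
    }
    return;
}

# usage: join_by_suffix.pl file1 file2 > merged_file.tab
if (@ARGV == 2) {
    my ($f1, $f2) = @ARGV;
    open my $fh2, '<', $f2 or die "Cannot open $f2: $!";
    my $row_for = index_annotations($fh2);
    close $fh2;

    open my $fh1, '<', $f1 or die "Cannot open $f1: $!";
    while (my $line = <$fh1>) {
        $line =~ s/\r?\n\z//;
        my ($full_id) = split /\t/, $line;
        next unless defined $full_id;
        my $annotation = find_annotation($row_for, $full_id);
        print "$line\t$annotation\n" if defined $annotation;
    }
    close $fh1;
}
```

Each File1 row now costs at most one lookup per character of its ID prefix (a few dozen) instead of one regex test per File2 row. Unlike the regex loop, this prints at most one File2 row per File1 row.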

As for your second approach of deleting elements in the hash, look up the delete() function.
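
A minimal sketch of that idea, assuming (as stated earlier in the thread) that each row is unique, so the scan can stop at the first hit. take_match is a hypothetical helper, and the two-entry hash stands in for the real %Probe_n_Seq:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Miniature stand-in for %Probe_n_Seq: full ID => sequence.
my %Probe_n_Seq = (
    'GT_Specific_CTGSIN-SS-mira_assembly_rep_c24|266|60' => 'TTCTACCTATC...',
    'GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|392|60'  => 'TACTCTTGT...',
);

# Hypothetical helper: find the key that contains $id_part followed by "|",
# remove it from the hash, and return (key, sequence). Returns an empty list
# when nothing matches.
sub take_match {
    my ($probes, $id_part) = @_;
    for my $key (keys %$probes) {
        # \Q...\E keeps any regex metacharacters in the ID literal.
        if ($key =~ /\Q$id_part\E\|/) {
            my $seq = delete $probes->{$key};   # delete returns the removed value
            return ($key, $seq);
        }
    }
    return;
}

my ($key, $seq) = take_match(\%Probe_n_Seq, 'mira_assembly_rep_c8');
print "$key\t$seq\n" if defined $key;
print scalar(keys %Probe_n_Seq), " probe(s) left\n";
```

Deleting matched rows shrinks later scans, but each File2 row still scans many keys, so the gain is a constant factor rather than a change in how the run time grows.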

Good luck and happy coding!

Dave

z1dane
