Sponsored Content
Top Forums Web Development Perl join two files by "common" column Post 302494078 by yifangt on Friday 4th of February 2011 08:01:54 PM
Old 02-04-2011
Perl join two files by "common" column

Hello;
I am posting to get any help on my code that I have been struggling for some time. The project is to join two files each with 80k~180k rows. I want to merge them together by the shared common column. The problem of the shared column is partially matching, not exactly the same.
File1:
Code:
GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCCAAAGGACAAGATCTTGCCTTCCTTGTTGGTAGAAAAAATGCCGAGAGCAG
GT_Specific_CTGSIN-SS-EX055483|266|60    TTCTACCTATCGTTTCGGCTCAAGTTAGTGTCAGCAAATGATCCGAACGGTCTGGAAATG
GT_Specific_CTGSIN-SS-CL15294Contig1|1386|60_New    TTTTCTTTATAAAGAACAGTCTGTGTGTTAATAATTCTCATCTCCTGTCCGGACATAGAC
GT_Xhyb_SUPCTG-SS-SuperContig_CL53Contig7|737|60    CGTTTGAATGTATGACATATGAACATCGTTGCTCTCCTTCATCTTTTATGTGTTTTGGTT
GT_Specific_CTGSIN-SS-CL11320Contig1|392|60    TACTCTTGTAAAACCTTATACATACTTGCACATAAGAGAAAGATGGGATGTATTTCACAA
.......

File2:
Code:
mira_assembly_rep_c5    AT4G25140.1    OLEO1 (OLEOSIN 1) 
mira_assembly_rep_c8    AT4G27140.1    2S seed storage protein 1 / 2S albumin storage protein / NWMU1-2S albumin 1 
mira_assembly_rep_c24    AT5G38195.1    protease inhibitor/seed storage/lipid transfer protein (LTP) family protein 
mira_assembly_rep_c29    AT5G39850.1    40S ribosomal protein S9 (RPS9C) 
mira_assembly_rep_c36    AT4G32100.1    galactosyltransferase 
......

I want to merge the two file to get like:
Code:
GT_Xhyb_CTGSIN-SS-mira_assembly_rep_c3140|919|60    TACATCCTCCAAAGGACAAGATCTTGCCTTCCTTGTTGGTAGAAAAAATGCCGAGAGCAG    mira_assembly_rep_c3140 AT4G25140.1    OLEO1 (OLEOSIN 1) 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c5|266|60    TTCTACCTATCGTTTCGGCTCAAGTTAGTGTCAGCAAATGATCCGAACGGTCTGGAAATG   mira_assembly_rep_c5 AT4G10270.1    wound-responsive family protein 
GT_Specific_CTGSIN-SS-mira_assembly_rep_c8|1386|60_New    TTTTCTTTATAAAGAACAGTCTGTGTGTTAATAATTCTCATCTCCTGTCCGGACATAGAC   -mira_assembly_rep_c8 AT2G33830.2    dormancy/auxin associated family protein 
GT_Xhyb_SUPCTG-SS-SuperContig_mira_assembly_rep_c29|737|60    CGTTTGAATGTATGACATATGAACATCGTTGCTCTCCTTCATCTTTTATGTGTTTTGGTT mira_assembly_rep_c29   AT3G49910.1    60S ribosomal protein L26 (RPL26A)
......

Here is my code:
Code:
#!/usr/bin/perl -w
use strict;


my %line2;
my $merged;
my $count2;
my $col1=0;                 #The common column in file1
my $col2=0;                 #The common column in file2
my ($f1,$f2)=@ARGV;             #The two files to be merged
open(F2,$f2) or die $!; 
while (<F2>) { 
    s/\r?\n//;                 #remove return of carriage at the end of each line;
    my @F=split /\t/, $_;             #split the line by tab
    $line2{$F[$col2]} .= "$_\n"; }         #create a hash to store the line

$count2 = $.;                     #input line number
    
open(F1,$f1) or die $!; 
    while (<F1>) { 
        s/\r?\n//; 
        my @F=split /\t/, $_; 
        my $x = $line2{$F[$col1]}; 
            if ($x =~ m/$F[$col2]\|/) { 
            my $num_changes = ($x =~ s/^/$_\t/gm);     #substitute the beginning of the line with  
                                #the current line plus TAB
            print $x; 
             $merged += $num_changes;
            } 
    } 
    
warn "Joining $f1 column $col1 with $f2 column $col2\n$f1: $. lines\n$f2: $count2 lines\nMerged file: $merged lines\n";

# usage: match_script.pl file1 file2 > merged_file.tab

Note the first column of the File2 contains only part of the first column of File1 before the first vertical bar "|". And not all of the rows of File1 has a match in File2, may be 80k out of 180k. They are big files.

It was running, but did not append the matched part of File2 to File1. Could anyone give me some clue?
I found this join/merge problem is quite common in my work, and I do not have database like MySQL. It would be great for me to catch the spirit of the coding for this.
Thanks a lot!
Yifang

Last edited by yifangt; 02-05-2011 at 11:33 AM..
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

"Join" or "Merge" more than 2 files into single output based on common key (column)

Hi All, I have working (Perl) code to combine 2 input files into a single output file using the join function that works to a point, but has the following limitations: 1. I am restrained to 2 input files only. 2. Only the "matched" fields are written out to the "matched" output file and... (1 Reply)
Discussion started by: Katabatic
1 Replies

2. Shell Programming and Scripting

Join multiple files based on 1 common column

I have n files (for ex:64 files) with one similar column. Is it possible to combine them all based on that column ? file1 ax100 20 30 40 ax200 22 33 44 file2 ax100 10 20 40 ax200 12 13 44 file2 ax100 0 0 4 ax200 2 3 4 (9 Replies)
Discussion started by: quincyjones
9 Replies

3. Shell Programming and Scripting

awk command to replace ";" with "|" and ""|" at diferent places in line of file

Hi, I have line in input file as below: 3G_CENTRAL;INDONESIA_(M)_TELKOMSEL;SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL My expected output for line in the file must be : "1-Radon1-cMOC_deg"|"LDIndex"|"3G_CENTRAL|INDONESIA_(M)_TELKOMSEL"|LAST|"SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL" Can someone... (7 Replies)
Discussion started by: shis100
7 Replies

4. UNIX for Dummies Questions & Answers

how to join two files using "Join" command with one common field in this problem?

file1: Toronto:12439755:1076359:July 1, 1867:6 Quebec City:7560592:1542056:July 1, 1867:5 Halifax:938134:55284:July 1, 1867:4 Fredericton:751400:72908:July 1, 1867:3 Winnipeg:1170300:647797:July 15, 1870:7 Victoria:4168123:944735:July 20, 1871:10 Charlottetown:137900:5660:July 1, 1873:2... (2 Replies)
Discussion started by: mindfreak
2 Replies

5. Shell Programming and Scripting

Substituting comma "," for dot "." in a specific column when comma"," is a delimiter

Hi, I'm dealing with an issue and losing a lot of hours figuring out how i would solve this. I have an input file which looks like this: ('BLABLA +200-GRS','Serviço ','TarifaçãoServiço','wap.bla.us.0000000121',2985,0,55,' de conversão em escada','Dia','Domingos') ('BLABLA +200-GRR','Serviço... (6 Replies)
Discussion started by: poliver
6 Replies

6. UNIX for Dummies Questions & Answers

How to use the the join command to join multiple files by a common column

Hi, I have 20 tab delimited text files that have a common column (column 1). The files are named GSM1.txt through GSM20.txt. Each file has 3 columns (2 other columns in addition to the first common column). I want to write a script to join the files by the first common column so that in the... (5 Replies)
Discussion started by: evelibertine
5 Replies

7. Shell Programming and Scripting

Problem of Perl's "join" function

$ perl -e '@f=("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","1","911"); print join("\t",@f)."\n";' aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa ... (5 Replies)
Discussion started by: carloszhang
5 Replies

8. UNIX for Dummies Questions & Answers

How to join 2 .txt files based on a common column?

Hi all, I'm trying to join two .txt file tab delimitated based on a common column. File 1 transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct comp1000201_c0_seq1 comp1000201_c0 337 183.51 0.00 0.00 0.00 0.00 comp1000297_c0_seq1 ... (1 Reply)
Discussion started by: alisrpp
1 Replies

9. Shell Programming and Scripting

Delete all log files older than 10 day and whose first string of the first line is "MSH" or "<?xml"

Dear Ladies & Gents, I have a requirement to delete all the log files in /var/log/test directory that are older than 10 days and their first line begin with "MSH" or "<?xml" or "FHS". I've put together the following BASH script, but it's erroring out: for filename in $(find /var/log/test... (2 Replies)
Discussion started by: Hiroshi
2 Replies

10. Shell Programming and Scripting

Join, merge, fill NULL the void columns of multiples files like sql "LEFT JOIN" by using awk

Hello, This post is already here but want to do this with another way Merge multiples files with multiples duplicates keys by filling "NULL" the void columns for anothers joinning files file1.csv: 1|abc 1|def 2|ghi 2|jkl 3|mno 3|pqr file2.csv: 1|123|jojo 1|NULL|bibi... (2 Replies)
Discussion started by: yjacknewton
2 Replies
All times are GMT -4. The time now is 05:54 PM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy