Perl join two files by "common" column Post: 302494164

Post 302494164 by yifangt on Saturday 5th of February 2011 11:17:22 PM

02-06-2011

Hi Dave:
Thanks for your comments! Actually, I prefer your style too, e.g. use warnings etc. I wrote another script:

Code:

#!/usr/bin/perl
use strict;
#use warnings;

# Map each probe ID (column 1 of file1) to its sequence (column 2).
my %Probe_n_Seq = ();
open(my $file1, '<', "029931_D_SequenceList_20100827.txt") or die "Can't open file: $!";
while (<$file1>) {
    chomp;
    my @AAA = split(/\t/, $_);
    $Probe_n_Seq{$AAA[0]} = $AAA[1];
}
close($file1);

# For every row of file2, scan all hash keys for one that contains
# the first column of file2 followed by "|".
open(my $file2, '<', "CTG_n_SCTG_AGI_Entries.txtb") or die "Can't open file: $!";
while (<$file2>) {
    chomp;
    my @BBB = split(/\t/, $_);
    foreach my $key (keys %Probe_n_Seq) {
        if ($key =~ m/$BBB[0]\|/) {
            print join("\t", $key, $Probe_n_Seq{$key}, @BBB[0 .. 2]), "\n";
        }
    }
}
close($file2);

I used the first column of file1 ($AAA[0]) as the hash key and then compared it with the first column of file2 ($BBB[0]). If $AAA[0] contains the string $BBB[0] followed by "|", it counts as a match. I match this way because "mira_" is not the only assembly marker.
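To illustrate the matching rule, here is a minimal sketch. The helper name key_matches and the sample IDs are made up for illustration and are not taken from the real files. Unlike the script above, it wraps the ID in \Q...\E so that any regex metacharacters in the ID are matched literally, which is an added precaution rather than something the original code does.

```perl
use strict;
use warnings;

# Return 1 when $key contains $id immediately followed by a literal "|",
# otherwise 0. \Q...\E quotes regex metacharacters in $id.
# (key_matches is a hypothetical helper name.)
sub key_matches {
    my ($key, $id) = @_;
    return $key =~ m/\Q$id\E\|/ ? 1 : 0;
}

# Made-up sample keys, not taken from the real files.
my @sample_keys = ('mira_c100|probe1', 'mira_c1000|probe2', 'cap3_c100|probe3');
for my $key (@sample_keys) {
    printf "%-20s %s\n", $key,
        key_matches($key, 'mira_c100') ? 'match' : 'no match';
}
```

Only the first sample key matches: the second has "0" rather than "|" after the ID, and the third does not contain the ID at all.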

Code:

if ($key =~ m/$BBB[0]\|/)

It seems to run, except for a small bug at

Code:

my %Probe_n_Seq = ();

which triggered a warning and stopped the program, so I had to comment out use warnings.
The code takes ~6 hours to run on my Compaq machine (2.3 GHz dual CPU, 3 GB RAM). I am not sure whether this can be improved, given that file1 has 147478 rows (15.2 MB) and file2 has 86837 rows (7.2 MB).
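One possible way to cut the run time is to replace the scan over all keys with a single hash lookup per file2 row. The sketch below is only valid under a strong assumption that the post does not confirm: every file1 key has the form "ID|rest", with the file2 ID being exactly the text before the first "|". The helper name build_index and the sample data are hypothetical. Note that this is stricter than the regex test, which also matches an ID appearing in the middle of a key.

```perl
use strict;
use warnings;

# Build an index from the text before the first "|" in each probe key to the
# list of full keys sharing that prefix.
# ASSUMPTION: keys look like "ID|rest"; keys without "|" are skipped.
sub build_index {
    my ($probes) = @_;
    my %index;
    for my $key (keys %$probes) {
        my $pos = index($key, '|');
        next if $pos < 0;
        push @{ $index{ substr($key, 0, $pos) } }, $key;
    }
    return \%index;
}

# Made-up sample data standing in for file1 (hash) and file2 (rows).
my %sample_probes = ('mira_c1|p1' => 'ACGT', 'cap3_c7|p2' => 'GGCC');
my @sample_rows   = (['cap3_c7', 'x', 'y'], ['mira_c9', 'u', 'v']);

my $index = build_index(\%sample_probes);
for my $row (@sample_rows) {
    # One hash lookup per file2 row replaces the loop over every key.
    for my $key (sort @{ $index->{ $row->[0] } || [] }) {
        print join("\t", $key, $sample_probes{$key}, @$row), "\n";
    }
}
```

With this layout the work grows with the sum of the two file sizes rather than their product. If the real keys do not follow the assumed format, the index would miss matches that the regex finds.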
I have another idea to reduce the workload, because the nested loop currently runs 147478 x 86837 times (about 12.8 billion regex tests). If a match is found in file1, the matched row can be deleted from file1, so that the next $BBB[0] from file2 does not need to search that row again. If every file2 row finds one match, the last search would scan only about 60642 rows (147478 - 86836) instead of 147478, even in the worst case where the match is always in the last row examined. This works because each row is unique in both files. I could not figure out how to do this by myself. Any clue is highly appreciated!
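The deletion idea could be sketched roughly as follows. The function name join_with_deletion and the sample data are made up for illustration; the sketch assumes, as stated above, that each file2 row matches at most one file1 key, so it stops scanning after the first hit. It also quotes the ID with \Q...\E, which the original regex does not do.

```perl
use strict;
use warnings;

# Join file2 rows against the probe hash, deleting each probe once matched
# so that later rows have fewer keys to scan.
#   $probes : hash ref, probe key => sequence (modified in place)
#   $rows   : array ref of array refs, one per file2 row (split on tabs)
# Returns an array ref of tab-joined output lines.
# (join_with_deletion is a hypothetical name.)
sub join_with_deletion {
    my ($probes, $rows) = @_;
    my @out;
    for my $row (@$rows) {
        my $re = qr/\Q$row->[0]\E\|/;    # compile the pattern once per row
        for my $key (keys %$probes) {    # keys() returns a copy, so delete is safe
            next unless $key =~ $re;
            push @out, join("\t", $key, $probes->{$key}, @$row);
            delete $probes->{$key};      # never scan this probe again
            last;                        # ASSUMPTION: at most one match per row
        }
    }
    return \@out;
}

# Made-up sample data, not taken from the real files.
my %demo_probes = ('mira_c1|p1' => 'ACGT', 'mira_c2|p2' => 'GGCC', 'mira_c3|p3' => 'TTAA');
my @demo_rows   = (['mira_c2', 'x', 'y'], ['mira_c1', 'u', 'v']);

print "$_\n" for @{ join_with_deletion(\%demo_probes, \@demo_rows) };
print scalar(keys %demo_probes), " probe(s) left unmatched\n";
```

This is still a nested scan, so it shrinks the work without changing its quadratic shape; most of the saving comes from stopping at the first match.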
Yifang
