Perl join two files by "common" column
Post 302494164 by yifangt on Saturday 5th of February 2011 11:17:22 PM
Hi Dave:
Thanks for your comments! Actually, I prefer your style too, e.g. use warnings etc. I wrote another version of the code:
Code:
#!/usr/bin/perl
use strict;
#use warnings;

# Load file1 into a hash: column 1 (probe ID) => column 2 (sequence).
my %Probe_n_Seq = ();
open(my $file1, '<', "029931_D_SequenceList_20100827.txt") || die "Can't open file: $!";
while (<$file1>) {
    chomp;
    my @AAA = split(/\t/, $_);
    $Probe_n_Seq{$AAA[0]} = $AAA[1];
}
close($file1);

# For each row of file2, scan every file1 key for "<column 1>|".
open(my $file2, '<', "CTG_n_SCTG_AGI_Entries.txtb") || die "Can't open file: $!";
while (<$file2>) {
    chomp;
    my @BBB = split(/\t/, $_);
    foreach my $key (keys(%Probe_n_Seq)) {
        if ($key =~ m/$BBB[0]\|/) {
            print $key, "\t", $Probe_n_Seq{$key}, "\t", $BBB[0], "\t", $BBB[1], "\t", $BBB[2], "\n";
        }
    }
}
close($file2);

I used the first column of file1, $AAA[0], as the hash key, and then compared each key with the first column of file2, $BBB[0]. If $AAA[0] contains the string $BBB[0] followed by "|", it counts as a match. I match on a substring because "mira_" is not the only assembly marker.
Code:
if ($key =~ m/$BBB[0]\|/)
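
One caveat with this test is that $BBB[0] is interpolated into the pattern as a regular expression, so any metacharacter in an ID (a dot, a plus sign, brackets) changes what matches. A minimal sketch of the difference, using hypothetical sample IDs that are not taken from the real files, wraps the ID in \Q...\E so it is matched literally:

```perl
use strict;
use warnings;

# Hypothetical sample values; the real IDs come from the two input files.
my $key = 'mira_c12x3|probe_A';   # a first-column value from file1
my $id  = 'c12.3';                # a first-column value from file2

# Unquoted: '.' is a regex wildcard, so 'c12.3' also matches 'c12x3'.
my $loose = ($key =~ m/$id\|/) ? 1 : 0;

# Quoted: \Q...\E escapes metacharacters, so only a literal 'c12.3|' matches.
my $strict = ($key =~ m/\Q$id\E\|/) ? 1 : 0;

print "unquoted: $loose, quoted: $strict\n";
```

If the IDs contain only letters, digits and underscores, both forms behave the same and the quoting is simply a safeguard.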

It seems to run, apart from a small problem with
Code:
my %Probe_n_Seq = ();

which caused a warning and stopped the program, so I had to comment out use warnings.
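
For what it is worth, initializing a hash with an empty list does not by itself produce a warning under use warnings, and ordinary warnings are printed to STDERR without stopping the program. A more likely source, assuming some rows in the input files have fewer tab-separated columns than the script expects, is "Use of uninitialized value" messages from the print statement. The sketch below guards against short rows. parse_row is a hypothetical helper and the sample lines are made up:

```perl
use strict;
use warnings;

# Split a tab-delimited line and return its fields, or an empty list
# when the line has fewer than $min columns.
# (parse_row is a hypothetical helper, not part of the original script.)
sub parse_row {
    my ($line, $min) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    return @fields >= $min ? @fields : ();
}

# Hypothetical sample lines: one complete row, one short row, one blank line.
for my $line ("ctg1\tfoo\tbar\n", "ctg2\n", "\n") {
    my @f = parse_row($line, 3) or next;   # skip rows that would yield undef
    print join("\t", @f), "\n";
}
```

With such a guard in place, use warnings can stay enabled and any remaining message points at a real data problem.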
The code takes ~6 hours to run on my machine (Compaq, 2.3 GHz dual CPU, 3 GB RAM). I am not sure whether this can be improved: file1 has 147478 rows (15.2 MB) and file2 has 86837 rows (7.2 MB).
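
On whether this can be improved: the nested loop performs about 147478 x 86837, roughly 1.28 x 10^10, regex tests. A common alternative, sketched here under a loudly stated assumption, is to index file1 once by the ID part of its key and then do a single hash lookup per file2 row. The assumption is that the text before the first "|" in a file1 key is exactly the ID in column 1 of file2. If the key carries an assembly prefix such as "mira_", that prefix would have to be stripped when building the index. The sample data and the names %rows_by_id and @joined are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the two real files.
my $file1 = "ctg1|p1\tACGT\nctg2|p2\tGGCC\nctg1|p3\tTTAA\n";
my $file2 = "ctg1\tx\ty\nctg9\tx\ty\n";

# Pass 1: index file1 rows by the text before the first '|' in column 1.
# ASSUMPTION: that text is exactly the ID found in column 1 of file2.
my %rows_by_id;
open(my $in1, '<', \$file1) or die "Can't open file1 data: $!";
while (my $line = <$in1>) {
    chomp $line;
    my ($probe, $seq) = split /\t/, $line;
    next unless $probe =~ /^([^|]+)\|/;    # keys without '|' never matched
    push @{ $rows_by_id{$1} }, [$probe, $seq];
}
close($in1);

# Pass 2: one hash lookup per file2 row instead of a scan over every key.
my @joined;
open(my $in2, '<', \$file2) or die "Can't open file2 data: $!";
while (my $line = <$in2>) {
    chomp $line;
    my @cols = split /\t/, $line;
    for my $hit (@{ $rows_by_id{$cols[0]} || [] }) {
        push @joined, join("\t", @$hit, @cols[0 .. 2]);
    }
}
close($in2);

print "$_\n" for @joined;
```

Each file is read once, so the work grows with the sum of the row counts rather than their product. With the real files, the two in-memory handles would be replaced by ordinary opens on the file names.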
I also have another idea for reducing the workload, because the nested loops run 147478 x 86837 iterations. If a match is found in file1, the matched row could be deleted from file1, so that the search for the next $BBB[0] from file2 does not scan that row again. The scans then get shorter with every match, and the last one covers fewer than the full 147478 rows (even in the worst case, where each match is in the last row examined). This works because each row is unique in both files. I could not figure out how to do this by myself. Any clue is highly appreciated!
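
The delete-on-match idea can be sketched as follows. It keeps the original substring test (with the ID quoted), removes a key from the hash as soon as it matches, and leaves the inner loop with last. Leaving the loop early relies on the assumption that each file2 ID matches at most one file1 key. The sample hash contents and IDs are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical sample data: probe key => sequence (file1), and file2 IDs.
my %Probe_n_Seq = (
    'mira_c1|p1' => 'ACGT',
    'mira_c2|p2' => 'GGCC',
    'x_c3|p3'    => 'TTAA',
);
my @ids = ('c2', 'c1');

my @matched;
for my $id (@ids) {
    # keys() returns a snapshot list, so deleting inside this loop is safe.
    for my $key (keys %Probe_n_Seq) {
        next unless $key =~ m/\Q$id\E\|/;
        push @matched, "$key\t$Probe_n_Seq{$key}\t$id";
        delete $Probe_n_Seq{$key};   # this row can never match again
        last;                        # ASSUMPTION: at most one match per ID
    }
}

print "$_\n" for @matched;
print scalar(keys %Probe_n_Seq), " unmatched probe(s) left\n";
```

This shrinks the hash over time and stops each scan at the first hit, but every ID that has no match still scans all remaining keys, so the saving is a constant factor and not a change in the overall growth rate.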
Yifang
 
