Perl join two files by "common" column
Post 302494164 by yifangt on Saturday 5th of February 2011 11:17:22 PM
Hi Dave:
Thanks for your comments! Actually, I prefer your style too, e.g. use warnings etc. I wrote another version of the script:
Code:
#!/usr/bin/perl
use strict;
#use warnings;

# Hash of file1: column 1 (probe ID) => column 2 (sequence)
my %Probe_n_Seq = ();
open(FILE1, "029931_D_SequenceList_20100827.txt") || die "Can't find file $!";
while (<FILE1>) {
    chomp $_;
    my @AAA = split(/\t/, $_);
    $Probe_n_Seq{$AAA[0]} = $AAA[1];
}
close(FILE1);

open(FILE2, "CTG_n_SCTG_AGI_Entries.txtb") || die "Can't find file $!";
while (<FILE2>) {
    chomp $_;
    my @BBB = split(/\t/, $_);
    # Compare every file1 key against column 1 of this file2 row
    foreach my $key (keys(%Probe_n_Seq)) {
        if ($key =~ m/$BBB[0]\|/) {
            print $key, "\t", $Probe_n_Seq{$key}, "\t", $BBB[0]."\t".$BBB[1]."\t".$BBB[2], "\n";
        }
    }
}
close(FILE2);

I used the first column of file1, $AAA[0], as the hash key and then compared it with the first column of file2, $BBB[0]. If $AAA[0] contains the string $BBB[0] followed by "|", it counts as a match. I match this way because "mira_" is not the only assembly marker.
Code:
if ($key =~ m/$BBB[0]\|/)
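
One caveat with this test: $BBB[0] is interpolated into the pattern as a regex, so a metacharacter inside an ID (for example "." or "+") changes what matches. A minimal sketch, assuming the IDs are meant to be compared literally; key_has_id is a hypothetical helper name and the sample IDs are invented for illustration:

```perl
use strict;
use warnings;

# True (1) if $key contains $id immediately followed by "|".
# \Q...\E makes Perl treat $id as literal text, not as a pattern.
sub key_has_id {
    my ($key, $id) = @_;
    return $key =~ m/\Q$id\E\|/ ? 1 : 0;
}

print key_has_id("mira_c1|probe", "mira_c1"), "\n";   # prints 1
print key_has_id("abc|probe",     "a.c"),     "\n";   # prints 0, "." is literal here
```

Note that the pattern is still unanchored at the start, so an ID also matches a key in which it appears only as the tail of a longer field, as long as "|" follows it.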

It seems to run, except for a small bug at
Code:
my %Probe_n_Seq = ();

which caused a warning and stopped the program, so I had to comment out use warnings.
The code takes ~6 hours to run on my machine (Compaq, 2.3 GHz dual CPU, 3 GB RAM). I am not sure whether this can be improved, as file1 has 147478 rows (15.2 MB) and file2 has 86837 rows (7.2 MB).
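
The nested loop is the main cost here: every file2 row is tested against every file1 key. One possible way around it, sketched under assumptions, is to index file1 once by the "|"-terminated fields of its keys and then look up each file2 ID directly in that index. The sub names (read_rows, build_index, join_rows) are made up for this sketch, and it matches whole fields only, which is stricter than the unanchored regex (there, an ID also matches when it is only the tail of a field).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a tab-delimited file into an array ref of rows (array refs of columns).
sub read_rows {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\t/, $line, -1 ];
    }
    close($fh);
    return \@rows;
}

# Map every field of column 1 that is followed by "|" to the rows containing it.
sub build_index {
    my ($rows) = @_;
    my %index;
    for my $row (@$rows) {
        next unless defined $row->[0];
        my @fields = split /\|/, $row->[0], -1;
        pop @fields;                      # the last field has no "|" after it
        my %seen;
        for my $field (grep { !$seen{$_}++ } @fields) {
            push @{ $index{$field} }, $row;
        }
    }
    return \%index;
}

# For each file2 row, emit key, sequence and the first three file2 columns.
sub join_rows {
    my ($index, $rows2) = @_;
    my @out;
    for my $r (@$rows2) {
        my ($id, $c2, $c3) = @$r;
        next unless defined $id;
        for my $hit (@{ $index->{$id} || [] }) {
            my ($key, $seq) = @$hit;
            push @out, join("\t", map { defined $_ ? $_ : '' } $key, $seq, $id, $c2, $c3);
        }
    }
    return \@out;
}

# Run only when two file names are given: file1 (probes), then file2.
if (@ARGV == 2) {
    my $index = build_index(read_rows($ARGV[0]));
    print "$_\n" for @{ join_rows($index, read_rows($ARGV[1])) };
}
```

With this layout each file is read once and each file2 row costs one hash lookup instead of a pass over all file1 keys. Missing columns are printed as empty strings, so the script can keep use warnings enabled.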
Actually, I have another idea to reduce the workload, because the loop currently runs 147478 x 86837 times. If a match is found in file1, the matched row can be deleted from file1, so that the search for the next $BBB[0] from file2 does not need to check that row again ... so that the last search takes 86838 instead of 147478 loops (when the match is in the last row, the worst scenario!). This works because each row is unique in both files. I could not figure out how to do this by myself. Any clue is highly appreciated!
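
The delete-on-match idea can be sketched roughly as follows, assuming (as stated) that each file2 ID matches at most one file1 row. take_match is a hypothetical helper name and the sample data is invented; the ID is quoted with \Q...\E so it is compared literally.

```perl
use strict;
use warnings;

# Find the first key in %$probes that contains "$id|", remove that entry
# from the hash and return (key, sequence). Returns an empty list if no
# key matches.
sub take_match {
    my ($probes, $id) = @_;
    for my $key (keys %$probes) {         # keys() returns a copy, so delete is safe here
        if ($key =~ m/\Q$id\E\|/) {
            my $seq = delete $probes->{$key};
            return ($key, $seq);          # stop scanning after the first match
        }
    }
    return;
}

my %probes = ("mira_c1|p1" => "ACGT", "ctg_c2|p2" => "TTTT");
my ($key, $seq) = take_match(\%probes, "ctg_c2");
print "$key\t$seq\n";                          # prints ctg_c2|p2<TAB>TTTT
print scalar(keys %probes), " rows left\n";    # prints 1 rows left
```

Stopping at the first match and shrinking the hash reduces the number of comparisons, but an ID with no match still scans every remaining key, so the total work stays roughly quadratic.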
Yifang
 
