Perl: Need help comparing huge files

07-12-2012

What do I need to do to have the Perl program below load files of 205 million records into the hash? It currently works on smaller files, but not on huge ones. Any idea what I need to modify to make it work with huge files?

Code:

#!/usr/bin/perl
$ot1=$ARGV[2];
$ot2=$ARGV[3];
open(mfileot1, ">$ot1");
open(mfileot2, ">$ot2");
use strict;
#----------------
# Hash Definition
#----------------
my %HashArray;
my @file1Line;
my @file2Line;
#--------------------
# Subroutine
#--------------------
sub comp_file{
  my ($FILE1, $FILE2) = @_;
  open (R, $FILE1) or die ("Can't open file $FILE1");
  foreach my $FP1(<R>){
    chomp($FP1);
    my ($k, $l) = split(/\s+/,$FP1);
    push @{$HashArray{'$FP1'}{$k}},$l;
  }
  close (R);
  open (P, $FILE2) or die ("Can't open file $FILE2");
  foreach my $FP2(<P>){
    chomp($FP2);
    my ($k, $l) = split(/\s+/,$FP2);
    push @{$HashArray{'$FP2'}{$k}},$l;
  }
  close (P);
  foreach my $key(keys %{$HashArray{'$FP1'}}){
    if (!exists $HashArray{'$FP2'}{$key}){
      foreach my $last(@{$HashArray{'$FP1'}{$key}}){
        push (@file1Line,"$key$last");
      }
    }
  }
  print mfileot1 "$_\n" for (sort @file1Line);
  close(mfileot1);
  foreach my $key(keys %{$HashArray{'$FP2'}}){
    if (!exists $HashArray{'$FP1'}{$key}){
      foreach my $last(@{$HashArray{'$FP2'}{$key}}){
        push (@file2Line,"$key$last");
      }
    }
  }
  print mfileot2 "$_\n" for (sort @file2Line);
  close(mfileot2);
}
############MAIN MENU####################################
# Pre-check Condition
# if the input doesn't contain two(2) files, return help
# USAGE: hash2files.pl FILE1 FILE2 FILE3 FILE4
#########################################################

if ($#ARGV != 3){
  print "USAGE: $0 <FILE1> <FILE2> <FILE3> <FILE4>\n";
  exit;
}
else {
  my ($FILE1, $FILE2, $OT1, $OT2)= @ARGV;
  &comp_file($FILE1, $FILE2);
}

Last edited by Scott; 07-12-2012 at 02:13 PM.. Reason: Please use code tags and indent code. Thanks.

mrn6430

07-12-2012

What exactly does your program do? Show a sample of input and output.

Corona688

07-12-2012

Basically to run it: hash2files.pl inputfile1 inputfile2 outputfile1 outputfile2

Inputfile1 contains numeric IDs:

Code:

These are compared against inputfile2, which also contains IDs:

Code:

Outputfile1 will contain all the IDs in inputfile1 that are not found in inputfile2.
In this case the result would be:

Code:

1233
4444
7777

Outputfile2 will have all the IDs in inputfile2 that are not found in inputfile1. In this case:

Code:

9898
9999

It works really well with average-size files. But it cannot handle loading two huge files (inputfile1 and inputfile2) into the hash in memory: it stops after a while without any error messages, and it does not produce the results. It basically just terminates.

How can I make this work for huge files? Inputfile1 is about 204 million records, and inputfile2 has almost the same number. I know it needs to be modified to load only one of them, such as inputfile2, into the hash rather than both, then read inputfile1 one line at a time and compare on the ID: if the ID is found in the hash, delete it from the hash, since we do not care about the matched ones at this point. What should remain in the hash is all the IDs that were not found, which then get written to a file. But I do not know how to do that!

I hope this helps explain my issue.

Last edited by Scott; 07-13-2012 at 02:00 PM.. Reason: Blah blah blah blah and blah blah. Thanks.

mrn6430

07-12-2012

Hi mrn6430,

Value 1233 isn't found in inputfile2, and similar issue for 1244. Did you forget it or did I miss anything?

birei

07-13-2012

Quote:

Originally Posted by birei

Hi mrn6430,

Value 1233 isn't found in inputfile2, and similar issue for 1244. Did you forget it or did I miss anything?

Yes, I updated my reply to include it. That is beside the point, though: I need a way to deal with such huge files. That is the main issue. Thanks.

mrn6430

07-13-2012

Try:

Code:

$ cat inputfile1
1233
2345
3456
4444
7777
$ cat inputfile2
1244
2345
3456
9898
9999
$ cat script.pl
use warnings;
use strict;

my (%hash);

die qq|Usage: $0 <inputfile-1> <inputfile-2> <outputfile-1> <outputfile-2>\n| 
        unless @ARGV == 4;

open my $ifh1, q|<|, shift or die;
open my $ifh2, q|<|, shift or die;
open my $ofh1, q|>|, shift or die;
open my $ofh2, q|>|, shift or die;

while ( <$ifh1> ) {
        chomp;
        $hash{ $_ } = 1;
}

while ( <$ifh2> ) {
        chomp;
        if ( exists $hash{ $_ } ) {
                delete $hash{ $_ };
                next;
        }

        printf $ofh2 qq|%d\n|, $_;
}

for ( sort { $a <=> $b } keys %hash ) {
        printf $ofh1 qq|%d\n|, $_;
}
$ perl script.pl inputfile1 inputfile2 outputfile1 outputfile2
$ cat outputfile1
1233
4444
7777
$ cat outputfile2
1244
9898
9999
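
Note that the script above still keeps every ID from inputfile1 in the hash until inputfile2 has been read, so memory use grows with the size of inputfile1. If that turns out to be too much, one possible alternative (a different technique from the hash approach, shown only as a sketch) is to sort both files first and then walk them side by side, holding just one line from each file in memory. The sub name `merge_diff` is made up for illustration. The sketch assumes one numeric ID per line, both inputs already sorted numerically (for example with `sort -n`), and no ID repeated within a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch only. Assumes both inputs hold one numeric id per line, are
# ALREADY sorted numerically, and contain no repeated ids.
# merge_diff is a hypothetical name, not part of the original script.
sub merge_diff {
    my ($in1, $in2, $out1, $out2) = @_;

    open my $ifh1, '<', $in1  or die "Can't open $in1: $!";
    open my $ifh2, '<', $in2  or die "Can't open $in2: $!";
    open my $ofh1, '>', $out1 or die "Can't open $out1: $!";
    open my $ofh2, '>', $out2 or die "Can't open $out2: $!";

    # Hold only the current line of each file.
    my $x = <$ifh1>; chomp $x if defined $x;
    my $y = <$ifh2>; chomp $y if defined $y;

    while (defined $x or defined $y) {
        # < 0: $x is only in file 1; > 0: $y is only in file 2; 0: in both.
        my $cmp = !defined $y ? -1
                : !defined $x ?  1
                :               $x <=> $y;

        if ($cmp < 0) {
            print {$ofh1} "$x\n";
            $x = <$ifh1>; chomp $x if defined $x;
        }
        elsif ($cmp > 0) {
            print {$ofh2} "$y\n";
            $y = <$ifh2>; chomp $y if defined $y;
        }
        else {
            # Same id in both files: skip it on both sides.
            $x = <$ifh1>; chomp $x if defined $x;
            $y = <$ifh2>; chomp $y if defined $y;
        }
    }

    close $_ or die "close failed: $!" for $ofh1, $ofh2;
}

# Run only when called with the four file names.
merge_diff(@ARGV) if @ARGV == 4;
```

It would be run the same way as the script above, but on sorted copies of the two input files. The trade-off is the cost of sorting up front, in exchange for memory use that no longer depends on the number of records.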

birei

07-13-2012

Thank you so much. I will test it. Do you know if there is any limit on the maximum number of records that can be loaded into a hash in Perl? I have 205 million records to load.

Thanks

mrn6430
