best method to compare 2 big files in unix


 
# 1  
Old 01-15-2011

Hi,

I have a requirement to compare 2 files, each of which can contain 40 million or more records, with more than 20 fields to compare.
Currently I am using an awk script, but it runs into memory problems, so I am not able to process files of more than 10 million records.

Any suggestions or pointers for changing my logic would help. I thought of splitting the files into chunks of 1 million records each, but then I would miss a few records in doing so.

Thanks in advance.
Rashmi
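
On the worry about missing records when splitting: a split by line count can put the same key into different chunks of the two files. One possible way around that, sketched here in Python and not taken from the thread, is to assign each record to a chunk by a hash of its key, so a given key always lands in the same chunk number for both files. The sketch assumes pipe-delimited records with the key in the first field; `bucket_of`, `partition_by_key` and the bucket count of 4 are hypothetical names and values chosen for illustration.

```python
import zlib
from collections import defaultdict

def bucket_of(key, n_buckets):
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % n_buckets

def partition_by_key(lines, n_buckets):
    """Group records by a hash of their first field.

    Held in memory here for illustration only; a real run would append
    each record to one of n_buckets files on disk instead.
    """
    buckets = defaultdict(list)
    for line in lines:
        key = line.split("|", 1)[0]
        buckets[bucket_of(key, n_buckets)].append(line)
    return buckets

file1 = ["abc|123|234", "def|345|456", "hij|567|678"]
file2 = ["abc|890|234", "hij|567|658"]
parts1 = partition_by_key(file1, 4)
parts2 = partition_by_key(file2, 4)
```

Chunk i of file 1 then only ever needs to be compared with chunk i of file 2.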
# 2  
Old 01-15-2011
Would fgrep help you?

fgrep -v -f file2 file1 >file3

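This will write to file3 all lines from file1 that do not contain any line of file2 as a substring. To drop only the lines that match a line of file2 exactly, add the -x option.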
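
The difference between substring matching and whole-line matching can be seen in a small Python sketch. It only models the matching rule, not fgrep itself; the function names and the sample records are made up for illustration.

```python
def fgrep_v_f(patterns, lines):
    """Model of `fgrep -v -f`: keep lines that contain none of the patterns."""
    return [ln for ln in lines if not any(p in ln for p in patterns)]

def whole_line_difference(patterns, lines):
    """Keep lines that are not exactly equal to any pattern line."""
    seen = set(patterns)
    return [ln for ln in lines if ln not in seen]

file1 = ["abc|123|234", "xabc|123|2345", "def|345|456"]
file2 = ["abc|123|234"]

substring_result = fgrep_v_f(file2, file1)         # also drops "xabc|123|2345"
exact_result = whole_line_difference(file2, file1) # drops only the exact match
```

Neither variant says which fields differ, and both need all of file2 available as patterns, which matters at 40 million records.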
# 3  
Old 01-15-2011
What Operating System and version do you have?
What Shell do you use?

Does "wc -l filename" give an accurate answer for the number of lines? This tells me whether there is at least a chance of processing the file in unix Shell.
Are the files sorted to a simple order, or are they in random order?

Exactly how big are these files according to "ls -la" ? The size is probably more important than the number of records, but the maximum size of any record is also important.
If any file is larger than 2 GB it may be impossible to process with basic unix Shell commands (this depends on whether the Shell and the utilities support large files).

Are the files definitely unix text files suitable for processing in unix Shell?

When comparing these files, are you just interested in whether they are different?

If you are trying to do more sophisticated processing on the differences, do you have a high-level programming language such as Oracle PL/SQL and the ability to write applications to process the data?


Don't forget to post sample input, expected processing, and sample expected output.




Ps. "fgrep" is not even vaguely suited to this task.
Pps. If "awk" fails, please post the "awk" script along with the environmental and numerical facts.

Last edited by methyl; 01-15-2011 at 07:27 PM.. Reason: paste errors
# 4  
Old 01-16-2011
Hi,
I am working on Sun Microsystems Inc. SunOS 5.10.
Each file is the output of a SQL query that fetches more than 40 million records from an Oracle database; the two files come from 2 different systems. I need to compare these 2 files, finding which fields do not match and which records are missing from file 1 and file 2. The 1st column would be displayed as it is, and for each remaining field the output file should contain Y if the fields match, else N.

For example:
file 1
--------
Code:
field1|field2|field3|
abc|123|234
def|345|456
hij|567|678

file2
---------
Code:
field1|field2|field3|
abc|890|234
hij|567|658

output file
Code:
field1|field2|field3|
abc|N|Y
def|345|456
hij|Y|N

The code I am using right now: I sort the files before I start processing, and the control_file tells me which fields I need to compare. If the fields match, it should put Y, else N, in the output file.

Thanks in advance
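
The per-record rule described in this post can be sketched in Python as follows. The format of the control_file is not shown in the thread, so this assumes it boils down to a list of 1-based field positions to compare; `compare_fields` and `positions_to_compare` are hypothetical names.

```python
def compare_fields(rec1, rec2, positions):
    """Return the key followed by a Y/N flag for each chosen field position."""
    f1 = rec1.split("|")
    f2 = rec2.split("|")
    flags = []
    for pos in positions:
        # A field missing from either record counts as a mismatch.
        a = f1[pos - 1] if pos <= len(f1) else None
        b = f2[pos - 1] if pos <= len(f2) else None
        flags.append("Y" if a is not None and a == b else "N")
    return "|".join([f1[0]] + flags)

positions_to_compare = [2, 3]  # assumed content of control_file
line_abc = compare_fields("abc|123|234", "abc|890|234", positions_to_compare)
line_hij = compare_fields("hij|567|678", "hij|567|658", positions_to_compare)
```

With the sample rows above, `line_abc` is `abc|N|Y` and `line_hij` is `hij|Y|N`, matching the expected output file.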

Moderator's Comments:
Please use code tags when posting data and code samples!

Last edited by vgersh99; 01-16-2011 at 11:33 AM.. Reason: code tags, please!
# 5  
Old 01-16-2011
Sorting is probably going to be the biggest overhead by far, and I don't see a way to avoid it...
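
If the files are expected to arrive already sorted, it may be worth confirming that cheaply before relying on it. A minimal Python sketch of such a check, assuming the key is the first pipe-delimited field and that plain string order is the intended sort order (`first_out_of_order` is a hypothetical helper name):

```python
def first_out_of_order(lines):
    """Return the 1-based number of the first line whose key is smaller
    than the key on the line before it, or None if keys never decrease."""
    prev = None
    for n, line in enumerate(lines, start=1):
        key = line.rstrip("\n").split("|", 1)[0]
        if prev is not None and key < prev:
            return n
        prev = key
    return None

sorted_ok = first_out_of_order(["abc|1", "def|2", "hij|3"])
sorted_bad = first_out_of_order(["abc|1", "hij|3", "def|2"])
```

The check reads each line once and keeps only the previous key, so it costs far less than a sort.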
# 6  
Old 01-16-2011
Assuming the two files are sorted by the first field (a SQL query can do that), my proposal in Perl:
Code:
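use strict;
use warnings;

# Output settings: print "\n" after every print and "|" between list items.
$\ = "\n";
$, = '|';

if (@ARGV < 2) {
    print STDERR "USAGE: $0 <file1> <file2>";
    exit 1;
}

my $inputfile1 = shift @ARGV;
open my $fh1, '<', $inputfile1 or die "$inputfile1: $!";

my $inputfile2 = shift @ARGV;
open my $fh2, '<', $inputfile2 or die "$inputfile2: $!";

# The header lines must be identical; the header is printed once.
my $h1 = <$fh1>; chomp $h1;
my $h2 = <$fh2>; chomp $h2;

if ($h1 eq $h2) {
    print $h1;
}
else {
    print STDERR "$0: different headers";
    exit 1;
}

# $k1/$k2 hold the key (first field) of the current record of each file,
# @F1/@F2 the remaining fields. undef means "read the next record".
my $k1 = undef; my @F1 = ();
my $k2 = undef; my @F2 = ();

while (1) {
    # Refill whichever side was consumed; stop when either file ends.
    # The -1 limit keeps trailing empty fields.
    unless (defined $k1) { $_ = <$fh1>; last unless defined $_; chomp; ($k1, @F1) = split /\|/, $_, -1; }
    unless (defined $k2) { $_ = <$fh2>; last unless defined $_; chomp; ($k2, @F2) = split /\|/, $_, -1; }

    # Key present in only one file: print that record unchanged.
    # "lt" compares strings, so both files must be sorted as strings.
    if ($k1 lt $k2) { print $k1, @F1; $k1 = undef; next; }
    if ($k2 lt $k1) { print $k2, @F2; $k2 = undef; next; }

    # Same key in both files: print Y or N for each remaining field.
    print $k1, map { my $v = shift @F2; defined $v && $_ eq $v ? 'Y' : 'N' } @F1;

    $k1 = undef;
    $k2 = undef;
}

# One file has ended: print the pending record, if any, then the rest of each file.
if (defined $k1) { print $k1, @F1; }
while (<$fh1>) { chomp; print; }
if (defined $k2) { print $k2, @F2; }
while (<$fh2>) { chomp; print; }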

# 7  
Old 01-17-2011
Thanks. Can you please explain the code to me, as I don't know Perl scripting? Is there a way to do it in unix? Would reading the files using fopen, doing the comparison, and then writing to the output file consume less memory?

Please advise.
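
On the memory question: the Perl script in post #6 already reads both files one record at a time, which is what keeps its memory use flat. The same merge logic is sketched below in Python as a reading aid; it is not a translation checked against the real data. It assumes both inputs are sorted by the first field in plain string order and that the header lines have been handled separately; `compare_sorted` is a hypothetical name.

```python
def compare_sorted(recs1, recs2):
    """Merge two key-sorted record streams, yielding output lines.

    Only one record from each input is held at a time, so memory use
    does not grow with the size of the files.
    """
    it1, it2 = iter(recs1), iter(recs2)
    r1, r2 = next(it1, None), next(it2, None)
    while r1 is not None and r2 is not None:
        f1, f2 = r1.split("|"), r2.split("|")
        if f1[0] < f2[0]:      # key only in file 1: copy the record through
            yield r1
            r1 = next(it1, None)
        elif f2[0] < f1[0]:    # key only in file 2: copy the record through
            yield r2
            r2 = next(it2, None)
        else:                  # same key: flag every other field Y or N
            flags = ["Y" if i < len(f2) and v == f2[i] else "N"
                     for i, v in enumerate(f1) if i > 0]
            yield "|".join([f1[0]] + flags)
            r1, r2 = next(it1, None), next(it2, None)
    # One input is exhausted; copy whatever remains of the other.
    while r1 is not None:
        yield r1
        r1 = next(it1, None)
    while r2 is not None:
        yield r2
        r2 = next(it2, None)

file1 = ["abc|123|234", "def|345|456", "hij|567|678"]
file2 = ["abc|890|234", "hij|567|658"]
result = list(compare_sorted(file1, file2))
```

For the sample data from post #4, `result` holds `abc|N|Y`, `def|345|456` and `hij|Y|N`. With real files, the two arguments would be the open files with the trailing newline stripped from each line, and the output lines would be written out as they are produced rather than collected in a list.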