

How to compare two files using UNIX?


 
# 1  
Old 10-19-2010

I have two files, and each row in both files is identified by one or more primary key columns.

I need to compare the two files and produce output in the following format.

Code:
 
Primary key(s),file1 value,file2 value.

Both the input files will be comma separated files.

I have accomplished this using Perl, but it takes around two hours to generate the output for files with 50000 lines.

Can it be done using shell or awk in a much faster way?
# 2  
Old 10-19-2010
The answer is that you probably have unnecessarily complex Perl code. Yes, awk, Perl, and several other languages all handle this well; we run comparisons like that on 10M rows in less than 2 minutes.

Please give us a few sample rows, sample output, and point out the field that is the primary key.
# 3  
Old 10-19-2010
My sample input files:

File1
Code:
 
10,807,3616114150,1,RVRSL,X,NA,2615574
11,199,607440419,0,ORGNL,1,NA,157844
12,807,25004011647,1,RVRSL,X,NA,1411925

File2
Code:
 
10,807,3616114150,1,RVRSL,X,NA2,2615574
11,199,607440419,0,ORGNL,1,NA,157344
12,807,25004011647,1,RVR2SL,X,NA,1411925

Here, the first two columns are primary keys.

Sample output :

For the first row alone (primary key combination 10,807):

Code:
 
10,807,3616114150,3616114150,Y
10,807,1,1,Y
10,807,RVRSL,RVRSL,Y
10,807,X,X,Y
10,807,NA,NA2,N
10,807,2615574,2615574,Y

The trailing Y or N indicates whether the two values match for that column (Y = match, N = mismatch).

---------- Post updated at 07:59 PM ---------- Previous update was at 06:57 PM ----------

This is my Perl code.

It would be great if anyone could tune it to perform better, or provide a faster solution using awk/shell.

Code:
 
#!/usr/bin/perl
use warnings;
my $input_file1 = shift;
my $input_file2 = shift;
my $primary_col;
my $starttime = shift;
#my $delim = shift;
open FIRFILE, $input_file1 or die "Unable to open file: [$!]";
open SECFILE, $input_file2 or die "Unable to open file: [$!]";
open(OUTFILE,"+>compareoutput_$starttime.txt") or die "Can't Create File!!";
$primary_col = `head -1 pk_count.txt`; # To get the number of primary keys
# Start with empty hashes
my %firHash = ();
my %secHash = ();
#print "First File:\n";
# Fill the first hash
while (<FIRFILE>)
{
  @fileColumns1 = split(/,/);
  #@fileColumns1 = split(/$delim/);
  my $size1 = @fileColumns1;
  #my $ident = $fileColumns1[0];
  my $ident2 = join(",", @fileColumns1[0..$primary_col-1]);
  #my $ident2 = join("$delim", @fileColumns1[0..$primary_col-1]);
  $firHash{$ident2} = $_;
}
# Fill the second hash
while (<SECFILE>)
{
  @fileColumns2 = split(/,/);
  #@fileColumns2 = split(/$delim/);
  my $size2 = @fileColumns2;
  #my $ident = $fileColumns2[0];
  my $ident2 = join(",", @fileColumns2[0..$primary_col-1]);
  #my $ident2 = join("$delim", @fileColumns2[0..$primary_col-1]);
  $secHash{$ident2} = $_;
}
foreach $key1 (sort keys %firHash)
{
  foreach $key2 (sort keys %secHash)
  {
    if ($key1 eq $key2)
    {
      my @file2 = split(/,/, $secHash{$key2});
      #my @file2 = split(/$delim/, $secHash{$key2});
      delete $file2[0];
      my @file1 = split(/,/, $firHash{$key1});
      #my @file1 = split(/$delim/, $firHash{$key1});
      delete $file1[0];
      my $len = @file1;
      for ($count=$primary_col; $count<$len; $count++)
      {
        chomp $file1[$count];
        chomp $file2[$count];
        my $column_num = $count+$primary_col-1;
        if ( $file1[$count] ne $file2[$count])
        {
          print OUTFILE $key1,",",$column_num,",",$file1[$count],",",$file2[$count],",N","\n";
          #print OUTFILE $key1,$delim,$column_num,$delim,$file1[$count],$delim,$file2[$count],$delim,"N","\n";
        }
        else
        {
          print OUTFILE $key1,",",$column_num,",",$file1[$count],",",$file2[$count],",Y","\n";
          #print OUTFILE $key1,$delim,$column_num,$delim,$file1[$count],$delim,$file2[$count],$delim,"Y","\n";
        }
      }
    }
  }
}

# 4  
Old 10-19-2010
nawk -f gp.awk file1 file2

gp.awk:
Code:
BEGIN {
  FS=OFS=","
}
{ idx = $1 OFS $2 }
FNR==NR { f1[idx]=$0;next}
idx in f1 {
  n=split(f1[idx],a,FS)
  for(i=3;i<=NF;i++)
    print idx,a[i], $i, (a[i]==$i)?"Y":"N"
}
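
A note on why this should be much faster than the Perl version: the Perl script walks every key of the second file for every key of the first (and re-sorts the key list on each pass), while the script above stores file1 once and then does a single `in` lookup per line of file2. Below is a minimal sketch that runs the same logic on two of the sample rows from post #3. The function name `compare_by_key` and the temp file names are made up for the demo, and plain `awk` is assumed to be POSIX-compatible (substitute nawk where needed).

```shell
#!/bin/sh
# compare_by_key FILE1 FILE2
# Hypothetical wrapper around the gp.awk logic (first two columns = key).
compare_by_key() {
  awk '
    BEGIN { FS = OFS = "," }
    { idx = $1 OFS $2 }                  # key for the current line
    FNR == NR { f1[idx] = $0; next }     # first file: remember the row by key
    idx in f1 {                          # second file: one hash lookup per row
      split(f1[idx], a, FS)
      for (i = 3; i <= NF; i++)
        print idx, a[i], $i, (a[i] == $i) ? "Y" : "N"
    }
  ' "$1" "$2"
}

# Demo on sample rows from this thread (temp files, removed afterwards).
tmp=$(mktemp -d)
printf '%s\n' '10,807,3616114150,1,RVRSL,X,NA,2615574' \
              '11,199,607440419,0,ORGNL,1,NA,157844' > "$tmp/file1"
printf '%s\n' '10,807,3616114150,1,RVRSL,X,NA2,2615574' \
              '11,199,607440419,0,ORGNL,1,NA,157344' > "$tmp/file2"
compare_by_key "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

On those rows it should print the six 10,807 lines from the sample output in post #3, followed by six lines for 11,199 where only the last column (157844 vs 157344) is flagged N. One caveat: in POSIX awk, fields and split() elements that look like numbers are compared numerically, so 1 and 1.0 would be flagged Y. If you need the strict string comparison that Perl's ne does, compare (a[i] "") == ($i "") instead.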

# 5  
Old 10-21-2010
@vgersh99:
Your code worked perfectly and ran in just a few seconds for a 100k file. That was great. Thanks a lot.

I do have to make two changes to this code, which I tried but could not complete.

First: the number of primary keys is not constant. It will be stored in a file.

Reference to my perl code:
Code:
 
$primary_col = `head -1 pk_count.txt`; # To get the number of primary keys

I'm not sure how to change this in your code.

Second:
I also need a value in the output denoting the position of each compared column.
Taking my previous example, the output will look like:

Code:
10,807,1,3616114150,3616114150,Y
10,807,2,1,1,Y
10,807,3,RVRSL,RVRSL,Y
10,807,4,X,X,Y
10,807,5,NA,NA2,N
10,807,6,2615574,2615574,Y

Here 10,807 is the primary key and the number next to it is just a sequence. It has to increment from 1 to n for each primary key.
# 6  
Old 10-21-2010
Something along these lines:
# default 2 keys
nawk -f gp.awk file1 file2
#
# with 3 keys
nawk -v keys=3 -f gp.awk file1 file2

gp.awk:
Code:
BEGIN {
  FS=OFS=","
  if (!keys) keys=2
}
{ for(i=1;i<=keys;i++) idx=(i==1)?$i:idx OFS $i }
FNR==NR { f1[idx]=$0;next}
idx in f1 {
  n=split(f1[idx],a,FS)
  seq=0
  for(i=keys+1;i<=NF;i++)
    print idx,++seq,a[i], $i, (a[i]==$i)?"Y":"N"
}
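
Since the key count lives in a file in your setup, you can read it in the shell and pass it in with -v. A rough sketch, assuming pk_count.txt holds the number on its first line (as in your Perl `head -1 pk_count.txt`). The wrapper name `compare_with_keyfile` is only for illustration, and the awk body repeats gp.awk so the snippet runs on its own.

```shell
#!/bin/sh
# compare_with_keyfile KEYFILE FILE1 FILE2
# Hypothetical helper: reads the number of key columns from the first line of
# KEYFILE and passes it to awk with -v. The awk body repeats gp.awk from above
# so the snippet is self-contained; with the script saved, use -f gp.awk.
compare_with_keyfile() {
  # First line only, with any surrounding whitespace or CR stripped.
  keys=$(head -n 1 "$1" | tr -d '[:space:]')
  case $keys in
    '' | 0 | *[!0-9]*) echo "bad key count in $1: $keys" >&2; return 1 ;;
  esac
  awk -v keys="$keys" '
    BEGIN { FS = OFS = "," }
    { idx = $1; for (i = 2; i <= keys; i++) idx = idx OFS $i }  # build the key
    FNR == NR { f1[idx] = $0; next }     # first file: store rows by key
    idx in f1 {                          # second file: compare matching rows
      split(f1[idx], a, FS)
      seq = 0                            # sequence restarts for every key
      for (i = keys + 1; i <= NF; i++)
        print idx, ++seq, a[i], $i, (a[i] == $i) ? "Y" : "N"
    }
  ' "$2" "$3"
}
```

Usage would be along the lines of `compare_with_keyfile pk_count.txt file1 file2 > compareoutput.txt`. If you would rather keep gp.awk as a separate file, the one-liner equivalent is `nawk -v keys="$(head -1 pk_count.txt)" -f gp.awk file1 file2`.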

# 7  
Old 10-22-2010
Thanks a lot, vgersh99. Your code worked perfectly fine. It produced the output in under one minute for a 100k file.
