Difference of two data files & writing to an outfile.

06-13-2011

I am guessing each file has only one record.

Code:

awk -F "|" 'NR==FNR&&/\|/ {A=$49 FS $66 FS $119 FS $188;B=$0;C=FILENAME}
            NR>FNR &&/\|/ {if (A==$49 FS $66 FS $119 FS $188) 
                             { print "4 primary keys are same in "  C " and " FILENAME }
                          else {print "4 primary keys are NOT same in "  C " and " FILENAME}
                          print ""
                          print "Record in file " C " : \n" B
                          print ""
                          print "Record in file " FILENAME " : \n" $0
                          } ' file1 file2

4 primary keys are NOT same in file1 and file2

Record in file file1 :
  231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE      POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | |       |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin      ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y      |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | |      N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| |       | |N.A.|2| | | | |N|N|BBG00048KQJ7|

Record in file file2 :
 231 9999X01M9 Govt|-1|198|XIB|0|20110707| |WI TSY BILL|WI TSY BILL|USGN|Govt| |NONE|NORMAL|DISCOUNT|2|US GOVT|US|USD| |31782000000|31782000000|100|100| | | |     100.00000| | |20110106| |20110609|0.005000| | | | | | | | | | | | | | | | | | | | | |349057|13714872|9127952X8| | | | | | | | | | | | | | | | | | | | | |      |N|N|N|218252|United States of America|3352Z|US| |Government|Sovereign|Sovereign| |US|XIB 07/07/11| | |US GOVERNMENT| | | | | | |N.A.|GV9999X01M9|United      States Treasury Bill - WI Post Auction|USD|USD| | |Y|2|N| |USD|N.A.|N.A.|N| |WI TSY BILL| | | | | |N|Y|US|US| |20110707|N.A.|5| | | | | | | | |Y| | | |      |N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | | | |N.A.|N| |Non-Grandfathered|9999X01M9|459|Bill| | | | | |N| | | | |N| | |20110707| | | | | | | |N|1| | | | | | |      | | | |N|BBG001CSH9Y7|

rdcwayx

06-14-2011

Hi rdcwayx,

Really appreciate your time in writing the awk script.

But each of these files has 500K records.

I have, however, managed to write a Perl script that produces the diff file:

Code:

#!/usr/local/bin/perl
$self = $0;
$self =~ s!^.*/!!;
#
$[ = 1; # first index into arrays and strings is 1 (deprecated; a fatal error in Perl 5.30 and later)
#
$FIELD_SEPARATOR = '\t';
$FIELD_NUMBER_LIST = '38,82'; # comma-separated key field numbers (split on "," below)
$field_separator = $FIELD_SEPARATOR;
$field_number_list = $FIELD_NUMBER_LIST;
#
while (@ARGV)
{
    $_ = shift;
    if    (/^-F$/)    { $field_separator = shift; }
    elsif (/^-L$/)    { $field_number_list = shift; }
    elsif (/^-F.+$/)  { $field_separator = substr($_,$[+2); }
    elsif (/^-L.+$/)  { $field_number_list = substr($_,$[+2); }
    #else              { push(@filename, $_); }
}
#
$file_a = 'file1';
$file_b = 'file2';
#
unless (($file_a ne "") && (-f $file_a))
{
    die "Error: Can't find file '$file_a'!\n";
}
unless (($file_b ne "") && (-f $file_b))
{
    die "Error: Can't find file '$file_b'!\n";
}
#
@index_list = split(/,/, $field_number_list);
#
# Scan first file, Pass 1:
open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    $intersection{$key} = 1;
}
#
close(FILE_A);
# Scan second file, Pass 1:
#
$empty_intersection = 1;
#
open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    $code = $intersection{$key};
    if ($code == 1)
    {
        $intersection{$key} = 3;
        $empty_intersection = 0;
    }
    else
    {
        if ($code != 3) { $intersection{$key} = 2; }
    }
}
#
close(FILE_B);
#
# Prepare output file names:
$file_a_1 = $file_a . '.1';
#
# Scan first file, Pass 2:
#
open(FILE_A, "<$file_a")     || die "Can't open '$file_a': $!\n";
open(FILE_A_1, ">$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
       # 
    }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_A);
close(FILE_A_1);
#
# Scan second file, Pass 2:
#
open(FILE_B, "<$file_b")     || die "Can't open '$file_b': $!\n";
open(FILE_A_1, ">>$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
        # key exists in both files: write nothing
    }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_B);
close(FILE_A_1);
#
# Display results:
#
printf("The Diff file created '%s'\n\n", $file_a_1);
#

The above code works perfectly for generating the diff file: based on the primary keys (two in this case), the outfile contains the records that exist in file1 but not in file2, and the records that exist in file2 but not in file1.

Now, when the primary keys in file1 match the primary keys in file2, I need to compare the whole record (line). If both lines are equal, the record should be discarded; otherwise it should be written to the outfile.

Could someone please help me with this step?

I would really appreciate your thoughts on this.

Last edited by filter; 06-14-2011 at 07:04 PM..

filter

06-14-2011

With your new request, the awk code can be much shorter.

Code:

awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file1 file2

awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file2 file1

After running the awk commands, you will get three files:

Code:

file1.diff                          # exists in file1, but not in file2
file2.diff                          # exists in file2, but not in file1
same.txt                            # exists in both files
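
To make the three-way split above easier to check, here is a minimal sketch on tiny sample data. The file names (demo1.txt, demo2.txt, demo_same.txt), the three-field layout, the choice of fields 1 and 2 as the key, and the classify helper are all assumptions made for illustration; the real files in this thread would use $49, $66, $119 and $188 instead. It tests key membership with awk's `in` operator rather than comparing against an empty string.

```shell
# Hypothetical sample data: three '|'-separated fields, key = fields 1 and 2.
printf '%s\n' 'k1|x|apple' 'k2|y|pear' > demo1.txt
printf '%s\n' 'k1|x|apple' 'k3|z|plum' > demo2.txt

# classify FIRST SECOND: split SECOND's records by whether their key occurs in FIRST.
classify() {
    awk -F'|' '
        NR == FNR { seen[$1 FS $2]; next }                    # first file: remember each key
        ($1 FS $2) in seen { print > "demo_same.txt"; next }  # key present in both files
        { print > (FILENAME ".diff") }                        # key absent from the first file
    ' "$1" "$2"
}

classify demo1.txt demo2.txt    # writes demo2.txt.diff (only in demo2.txt)
classify demo2.txt demo1.txt    # writes demo1.txt.diff (only in demo1.txt)
```

As in the original commands, the second run overwrites the "same" file, so it ends up holding the matching records as they appear in the file named last on the command line.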

rdcwayx

06-15-2011

Hi rdcwayx,

Thanks a lot for providing the awk script.
As you said, the awk version is simpler and shorter.

I ran the two awk commands you provided against my two files and generated same.txt and same1.txt.

Code:

awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same.txt"}}' File2 File1

File1.diff --> Records that exist in File1 but not in File2
same.txt --> Records that exist in both files

Code:

awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same1.txt"}}' File1 File2

File2.diff --> Records that exist in File2 but not in File1
same1.txt --> Records that exist in both files

Here, the file sizes of same.txt and same1.txt are different:

--> 506108009 Jun 15 01:50 same.txt

--> 505878904 Jun 15 01:52 same1.txt

So, is there any way to capture the records where the primary keys ($38 and $82) match in both files but the data in the other columns is not the same?

To solve this, I thought that once we have same.txt and same1.txt, we could compare the number of characters in a line from File1 with the number of characters in the matching line from File2; if they differ, write the record to a file, otherwise discard it.

Could you please share any ideas for solving this? I would be really grateful.
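
One possible approach, sketched here under assumptions rather than taken from the thread: store each record of the first file under its key, then compare the whole stored line with the matching line of the second file. Comparing the full strings catches every difference, whereas comparing only the number of characters would miss a change that leaves the line length the same. The file names (demo_old.txt, demo_new.txt, changed.txt) and the use of fields 1 and 2 as the key are hypothetical; for the files discussed here the key would be $38 and $82. The sketch also assumes that the first file fits in memory and that keys are unique within each file.

```shell
# Hypothetical sample data: tab-separated, key = fields 1 and 2.
printf 'k1\tx\tapple\nk2\ty\tpear\n'               > demo_old.txt
printf 'k1\tx\tapple\nk2\ty\tgrape\nk3\tz\tplum\n' > demo_new.txt

awk -F'\t' '
    NR == FNR { rec[$1 FS $2] = $0; next }       # first file: store the whole record by key
    ($1 FS $2) in rec && rec[$1 FS $2] != $0 {   # same key, but the records differ
        print rec[$1 FS $2] > "changed.txt"      # version from the first file
        print $0            > "changed.txt"      # version from the second file
    }
' demo_old.txt demo_new.txt
```

With this sample data, changed.txt receives the two versions of the k2 record; the identical k1 record and the unmatched k3 record are skipped.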

filter
