Difference of two data files & writing to an outfile.


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Difference of two data files & writing to an outfile.
# 8  
Old 06-13-2011
Guess one file has only one record.
Code:
awk -F "|" 'NR==FNR&&/\|/ {A=$49 FS $66 FS $119 FS $188;B=$0;C=FILENAME}
            NR>FNR &&/\|/ {if (A==$49 FS $66 FS $119 FS $188) 
                             { print "4 primary keys are same in "  C " and " FILENAME }
                          else {print "4 primary keys are NOT same in "  C " and " FILENAME}
                          print ""
                          print "Record in file " C " : \n" B
                          print ""
                          print "Record in file " FILENAME " : \n" $0
                          } ' file1 file2

4 primary keys are NOT same in file1 and file2

Record in file file1 :
  231 CP5101987 Corp|-1|198|C|7.300000|20110607|EAB|EUROPEAN AMERICAN BANK|LNB-CALL12/97|BANK|Corp|2|STEP CPN|CALLABLE|MULTI-STEP BOND|1|US DOMESTIC|US|USD|DE      POSIT NOTES|760000.00|.00|10000.0000|1000.0000|1000.00|LAS-sole|NOT LISTED|100.00000|19960607|19960607|19961207|19960607|19960607|100.000000| | | | | |       |US29874AZZ55| | | | | | | | | | | | | | |225433|500231|29874AZZ5| | | | | | | | | |N.A.| | | | | | | | | | | | |Y|N|N|197879|Citibank NA|8156Z|US| |Fin      ancial|Banks|Money Center Banks|N.A.|US|C 7.3 06/07/11|N| |US DOMESTIC| |N.A.| | |Y| |N|COCP5101987|European American Bank|USD|USD|Y|Y|Y|1|N|N|USD|N|N|Y      |19971207|EUROPEAN AMERICAN BANK|Semi-Annual| |19970607| | |N|N|US|US|Does Not Apply|20110607|N|421|MATURED|N|N|100.000000|N| |.000000000| |N|DTC| | | |      N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | |N|N|N|N| |Grandfathered|29874AZZ5|145| | |N.A.|N|FULL (ONLY)| |N|19970607| | | |N| | |20110607| | |N|N|N| | |N|3| |       | |N.A.|2| | | | |N|N|BBG00048KQJ7|

Record in file file2 :
 231 9999X01M9 Govt|-1|198|XIB|0|20110707| |WI TSY BILL|WI TSY BILL|USGN|Govt| |NONE|NORMAL|DISCOUNT|2|US GOVT|US|USD| |31782000000|31782000000|100|100| | | |     100.00000| | |20110106| |20110609|0.005000| | | | | | | | | | | | | | | | | | | | | |349057|13714872|9127952X8| | | | | | | | | | | | | | | | | | | | | |      |N|N|N|218252|United States of America|3352Z|US| |Government|Sovereign|Sovereign| |US|XIB 07/07/11| | |US GOVERNMENT| | | | | | |N.A.|GV9999X01M9|United      States Treasury Bill - WI Post Auction|USD|USD| | |Y|2|N| |USD|N.A.|N.A.|N| |WI TSY BILL| | | | | |N|Y|US|US| |20110707|N.A.|5| | | | | | | | |Y| | | |      |N.A.|N.A.|N.A.|N.A.|N.A.| | | | | | | | |N.A.|N| |Non-Grandfathered|9999X01M9|459|Bill| | | | | |N| | | | |N| | |20110707| | | | | | | |N|1| | | | | | |      | | | |N|BBG001CSH9Y7|

# 9  
Old 06-14-2011
Hi rdcwayx,

Really appreciate your time in writing the awk script.

But each of these files have 500K records.

But I have managed to write a perl script for the diff file. i.e.

Code:
#!/usr/local/bin/perl
$self = $0;
$self =~ s!^.*/!!;
#
$[ = 1; # = number of first index into arrays and strings
#
$FIELD_SEPARATOR = '\t';
$FIELD_NUMBER_LIST =('38','82');
$field_separator = $FIELD_SEPARATOR;
$field_number_list = $FIELD_NUMBER_LIST;
#
while (@ARGV)
{
    $_ = shift;
    if    (/^-F$/)    { $field_separator = shift; }
    elsif (/^-L$/)    { $field_number_list = shift; }
    elsif (/^-F.+$/)  { $field_separator = substr($_,$[+2); }
    elsif (/^-L.+$/)  { $field_number_list = substr($_,$[+2); }
    #else              { push(@filename, $_); }
}
#
$file_a = 'file1';
$file_b = 'file2';
#
unless (($file_a ne "") && (-f $file_a))
{
    die "Error: Can't find file '$file_a'!\n";
}
unless (($file_b ne "") && (-f $file_b))
{
    die "Error: Can't find file '$file_b'!\n";
}
#
@index_list = split(/,/, $field_number_list);
#
# Scan first file, Pass 1:
open(FILE_A, "<$file_a") || die "Can't open '$file_a': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
     $intersection{$key} = 1;
}
#
close(FILE_A);
# Scan second file, Pass 1:
#
$empty_intersection = 1;
#
open(FILE_B, "<$file_b") || die "Can't open '$file_b': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
 $code = $intersection{$key};
if ($code == 1)
    {
        $intersection{$key} = 3;
        $empty_intersection = 0;
    }
    else
    {
        if ($code != 3) { $intersection{$key} = 2; }
    }
}
#
close(FILE_B);
#
# Prepare output file names:
$file_a_1 = $file_a . '.1';
#
# Scan first file, Pass 2:
#
open(FILE_A, "<$file_a")     || die "Can't open '$file_a': $!\n";
open(FILE_A_1, ">$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_A>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
       # 
    }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_A);
close(FILE_A_1);
#
# Scan second file, Pass 2:
#
open(FILE_B, "<$file_b")     || die "Can't open '$file_b': $!\n";
open(FILE_A_1, ">>$file_a_1") || die "Can't write '$file_a_1': $!\n";
#
while (<FILE_B>)
{
    chop if /\n$/;
    undef $key;
    undef @field;
    @field = split(/$field_separator/o);
    foreach $index (@index_list)
    {
        if (defined $key)
        {
            $key .= "\n" . $field[$index];
        }
        else
        {
            $key = $field[$index];
        }
    }
    if ($intersection{$key} == 3)
    {
      #      }
    else
    {
        print FILE_A_1 $_, "\n";
    }
}
#
close(FILE_B);
close(FILE_A_1);
#
# Display results:
#
printf("The Diff file created '%s'\n\n", $file_a_1);
#


The above code works perfectly for generating the diff file i.e. depending upon the primary keys (here taken 2) the outfile contains the records that exists in file1 but not in file2 and the records that exists in file2 but not in file1.

Now,

I need to compare the whole record(line) if the primary keys in file1 matches with the primary keys in file2. If both the lines are equal then discard else write to the outfile.

Could someone please help me out in order to the above step.

Really appreciate your thoughts on this.

Last edited by filter; 06-14-2011 at 07:04 PM..
# 10  
Old 06-14-2011
With your new request, awk code can be more shorter.

Code:
awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file1 file2

awk -F \| 'NR==FNR && /\|/ {a[$49 FS $66 FS $119 FS $188]=$0} 
    NR>FNR && /\|/ {if (a[$49 FS $66 FS $119 FS $188]=="") {print > FILENAME ".diff"} else {print > "same.txt"}}' file2 file1

After run the awk commands, you will get three files:

Code:
file1.diff                          # exist in file1, but not in file2
file2.diff                          # exist in file2, but not in file2
same.txt                         # exist in both files.

This User Gave Thanks to rdcwayx For This Post:
# 11  
Old 06-15-2011
Hi rdcwayx,

Thanks a lot for providing the awk script.
As you said the awk looks simpler and shorter.

I have ran the two awk scripts that you have provided to me for the two different files and have generated the two same.txt and same1.txt files.

Code:
awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same.txt"}}' File1 File2

File1.diff --> Records that exists in File1 but not in File2
same.txt --> Records that exists in both the files


Code:
awk '-F\t' 'NR==FNR && /\t/ {a[$38  FS $82]=$0} NR>FNR && /\t/ {if (a[$38 FS $82]=="") {print  > FILENAME ".diff"} else {print > "same1.txt"}}' File1 File2


File2.diff --> Records that exists in File2 but not in File1
same1.txt --> Records that exists in both the files'

Here,
The file size of the same.txt and same1.txt is different.

--> 506108009 Jun 15 01:50 same.txt


--> 505878904 Jun 15 01:52 same1.txt

So, is there anyway that we can capture the records if the Primary Keys ($38 and $82) of both the files matches but the data in the other columns is not the same.

To solve the above issue, I thought once we have the files same.txt and same1.txt we should compare the number of characters in the a line in File1 with number of characters in a line in file2 , if there is a change then write to a file else discard (do nothing or leave it)

Could you please share any ideas to solve the above issue. It would be really grateful.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Difference between & and nohup &

Hi All, Can anyone please help me understanding what the difference between the below two? 1. script.sh & 2. nohup script.sh & (2 Replies)
Discussion started by: Anupam_Halder
2 Replies

2. Shell Programming and Scripting

Help on writing data from 2 different files to one based on a common factor

Hello all, I have 2 text files. For example: File1.txt contains data A B C D ****NEXT**** X Y Z ****NEXT**** L M N and File2.txt contains data (13 Replies)
Discussion started by: vat1kor
13 Replies

3. Shell Programming and Scripting

awk help: Match data fields from 2 files & output results from both into 1 file

I need to take 2 input files and create 1 output based on matches from each file. I am looking to match field #1 in both files (Userid) and create an output file that will be a combination of fields from both file1 and file2 if there are any differences in the fields 2,3,4,5,or 6. Below is an... (5 Replies)
Discussion started by: ambroze
5 Replies

4. Shell Programming and Scripting

search & merg data from 3 files

i have 3 files which contains as below (example): yy-mm-dd hh:mm:sec lat lon depth mag 2006-01-01 23:17:26.80 39.8405 41.8795 2.0 3.3 2006-01-06 00:10:26.80 39.9570 41.2130 5.0 3.3 2006-01-06 06:59:02.10 39.4099 44.6065 10.0 3.7 2006-01-06 13:49:52.70... (4 Replies)
Discussion started by: oreka18
4 Replies

5. Shell Programming and Scripting

Copying the Header & footer Information to the Outfile.

Hi I am writing a perl script which checks for the specific column values from a file and writes to the OUT file. So the feed file has a header information and footer information. I header information isaround107 lines i.e. Starts with START-OF-FILE ....... so on .... ... (11 Replies)
Discussion started by: filter
11 Replies

6. Shell Programming and Scripting

Sort a the file & refine data column & row format

cat file1.txt field1 "user1": field2:"data-cde" field3:"data-pqr" field4:"data-mno" field1 "user1": field2:"data-dcb" field3:"data-mxz" field4:"data-zul" field1 "user2": field2:"data-cqz" field3:"data-xoq" field4:"data-pos" Now i need to have the date like below. i have just... (7 Replies)
Discussion started by: ckaramsetty
7 Replies

7. Shell Programming and Scripting

How to combine 2 files and output the unique & difference?

Hi Guys, I have two input files and I want to combine them and get the unique values and differences and put them into one file. See below desired output file. Inputfile1: 1111111 2222222 3333333 7860068 7860069 7860071 7860072 Inputfile2: 4444444 (4 Replies)
Discussion started by: pinpe
4 Replies

8. UNIX for Dummies Questions & Answers

Reading and writing data to and from multiple files

Hi, I have several text files. One main file contains the detail data, other have some information to extract data from the main file, and some are empty files. Examples are shown below: The main file look like: MainFile.txt >Header1 data1...data1... >Header2 data2...data2... ... ...... (2 Replies)
Discussion started by: Fahmida
2 Replies

9. Shell Programming and Scripting

Need help in writing a script to create a new text file with specific data from existing two files

Hi, I have two text files. Need to create a third text file extracting specific data from first two existing files.. Text File 1: Format contains: SQL*Loader: Release 10.2.0.1.0 - Production on Wed Aug 4 21:06:34 2010 some text ............so on...and somwhere text like: Record 1:... (1 Reply)
Discussion started by: shashi143ibm
1 Replies

10. UNIX Desktop Questions & Answers

what is the difference between Unix & linux, what are the advantages & disadvantages

ehe may i know what are the difference between Unix & Linux, and what are the advantages of having Unix as well as disadvantages of having Unix or if u dun mind i am dumb do pls tell me what are the advantages as well as the disadvantages of having linux as well. thanks (1 Reply)
Discussion started by: cybertechmkteo
1 Replies
Login or Register to Ask a Question