Compare large file and identify difference in separate file


 
# 1  
Old 03-15-2012

I have a very large system-generated file, around 500K rows and 100 MB in size, like the following:
Code:
  
  HOME|ALICE STREET|3||NEW LISTING
  HOME|NEWPORT STREET|1||NEW LISTING
  HOME|KING STREET|5||NEW LISTING
  HOME|WINSOME AVENUE|4||MODIFICATION
  CAR|TOYOTA|4||NEW LISTING
  CAR|FORD|4||NEW LISTING
  COMPUTER|HP|1||NEW LISTING
  COMPUTER|APPLE|1||NEW LISTING

The file is generated once a day. Every day some rows are deleted, some are modified, and some are added, as in the following (line 2 of the first file is deleted, line 5 of the first file is modified, and line 6 of the second file is added):
Code:
   
  HOME|ALICE STREET|3||NEW LISTING
  HOME|KING STREET|5||NEW LISTING
  HOME|WINSOME AVENUE|4||MODIFICATION
  CAR|TOYOTA|5||NEW LISTING
  CAR|FORD|4||NEW LISTING
  CAR|HONDA|4||NEW LISTING
  COMPUTER|HP|1||NEW LISTING
  COMPUTER|APPLE|1||NEW LISTING

I want to write the deleted rows to one file, and the modified and added rows to another file.

Diff File 1 should be
Code:
 
  HOME|NEWPORT STREET|1||NEW LISTING

And Diff File 2 should be
Code:
   
  CAR|TOYOTA|5||NEW LISTING
  CAR|HONDA|4||NEW LISTING

I am very new to shell scripting. Any help is very much appreciated.

Moderator's Comments:
Please use code tags for your code and data next time.
# 2  
Old 03-15-2012
Take a look at the output of
Code:
diff file1 file2

# 3  
Old 03-15-2012
I don't think that the output format from "diff" is suitable and it tends to get lost on large unsorted files.


The file must be a proper unix text file. In your sample, the data has no particular sorted order.

You'll need to "sort" both files to produce two new sorted output files in a work area. Then compare the two sorted files using two different unix "comm" commands. The differences will not be in the same order as the original file.
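
The sort-then-comm approach above can be sketched as follows. This is only a minimal illustration: the function name comm_diff and the output file names are assumptions, not something from the thread. Note that a modified row shows up twice, once with its old content in only_in_old.txt and once with its new content in only_in_new.txt.

```shell
#!/bin/sh
# Sketch of the sort + comm approach. The function name and the output
# file names are illustrative assumptions.
# usage: comm_diff OLD_FILE NEW_FILE WORK_DIR
comm_diff() {
    old=$1
    new=$2
    work=$3
    # comm needs both inputs sorted in the same collation order;
    # LC_ALL=C makes sort and comm agree on plain byte order.
    LC_ALL=C sort "$old" > "$work/old.sorted"
    LC_ALL=C sort "$new" > "$work/new.sorted"
    # -23 suppresses columns 2 and 3: lines found only in the old file.
    LC_ALL=C comm -23 "$work/old.sorted" "$work/new.sorted" > "$work/only_in_old.txt"
    # -13 suppresses columns 1 and 3: lines found only in the new file.
    LC_ALL=C comm -13 "$work/old.sorted" "$work/new.sorted" > "$work/only_in_new.txt"
}
```

It could be called as comm_diff file1 file2 /some/work/area, where the work area is any writable directory with room for two sorted copies of the data.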
# 4  
Old 03-15-2012
Hi.

I think I would still try GNU diff first. If it works, you are done; if not, something else can be tried, such as the suggestion from methyl.

If your files are described as:
Code:
   When the files you are comparing are large and have small groups of
changes scattered throughout them, you can use the
`--speed-large-files' option to make a different modification to the
algorithm that `diff' uses.  If the input files have a constant small
density of changes, this option speeds up the comparisons without
changing the output.  If not, `diff' might produce a larger set of
differences; however, the output will still be correct.

-- excerpt from info diff, q.v.

then you may want to experiment with that.
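
As a rough sketch of such an experiment, the default diff output can be split on its "< " and "> " line prefixes. This assumes GNU diff is installed; the function name diff_split and its output arguments are made up for illustration. Lines removed from the first file (including the old version of any modified row) go to one file, and lines added in the second file (including the new version of any modified row) go to the other.

```shell
#!/bin/sh
# Sketch only: assumes GNU diff; the function name and output
# arguments are illustrative assumptions.
# usage: diff_split OLD_FILE NEW_FILE REMOVED_OUT ADDED_OUT
diff_split() {
    raw=$(mktemp) || return 1
    status=0
    # diff exits with 1 when the files differ, so only a status
    # above 1 is treated as a real error.
    diff --speed-large-files "$1" "$2" > "$raw" || status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$raw"
        return "$status"
    fi
    # In the default output format, "< " marks lines from the old file
    # and "> " marks lines from the new file.
    sed -n 's/^< //p' "$raw" > "$3"
    sed -n 's/^> //p' "$raw" > "$4"
    rm -f "$raw"
}
```

Because diff compares by position, rows that merely moved between the two files would be reported as both removed and added.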

Best wishes ... cheers, drl
# 5  
Old 03-20-2012
I am basically trying to identify the deleted rows from first file.
Code:
$ cat file1
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
$ cat file2
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING
$ diff file1 file2
2d1
< HOME|NEWPORT STREET|1||NEW LISTING
5c4
< CAR|TOYOTA|4||NEW LISTING
---
> CAR|TOYOTA|5||NEW LISTING
6a6
> CAR|HONDA|4||NEW LISTING
$


I am trying to get an output file containing only the lines deleted from the first file, in this instance the following:
Code:
HOME|NEWPORT STREET|1||NEW LISTING


Last edited by Franklin52; 03-20-2012 at 04:54 AM.. Reason: Please use code tags for code and data samples, thank you
# 6  
Old 03-20-2012
Try this:
Code:
 
awk 'NR==FNR{a[$0]++;next} !a[$0]' file2 file1

# 7  
Old 03-20-2012
awk

Hi,

Try this one. It treats the first two columns as the key. It is just a small modification of pravin's post.

Code:
awk 'BEGIN{FS="|"} NR==FNR{a[$1 FS $2]++; next} !a[$1 FS $2]' file2 file1

Output:
Code:
HOME|NEWPORT STREET|1||NEW LISTING
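
Both files asked for in the first post could be produced with two awk passes, sketched below. The function name key_diff and the output arguments are assumptions for illustration, and the sketch assumes the first two columns identify a row. The key fields are joined with the field separator so that two different field pairs cannot produce the same key.

```shell
#!/bin/sh
# Sketch only: the function name and output arguments are illustrative
# assumptions; fields 1 and 2 are assumed to identify a row.
# usage: key_diff OLD_FILE NEW_FILE DELETED_OUT CHANGED_OUT
key_diff() {
    # Deleted rows: rows of the old file whose key (fields 1 and 2)
    # does not occur anywhere in the new file.
    awk -F'|' 'NR==FNR { seen[$1 FS $2]; next }
               !(($1 FS $2) in seen)' "$2" "$1" > "$3"
    # Added or modified rows: rows of the new file that do not occur
    # verbatim in the old file.
    awk 'NR==FNR { seen[$0]; next }
         !($0 in seen)' "$1" "$2" > "$4"
}
```

For the sample data, key_diff file1 file2 deleted.txt changed.txt should leave the NEWPORT STREET row in deleted.txt and the TOYOTA and HONDA rows in changed.txt. Each pass holds one file's keys or lines in memory, and the NR==FNR test misbehaves if the first file named on the awk command line is empty.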

Cheers,
Ranga