Compare large file and identify difference in separate file


 
# 8  
Old 03-20-2012
Hi rangarasan/pravin27

Thanks for your post. I am getting the following error when I run the code.
Code:
$ awk 'BEGIN{FS="|";}NR==FNR{a[$1$2]++;next}!a[$1$2]' file2 file1
awk: syntax error near line 1
awk: bailing out near line 1
$


Last edited by Franklin52; 03-21-2012 at 04:21 AM.. Reason: Please use code tags for code and data samples, thank you
# 9  
Old 03-21-2012
jubaier,
Using the output of the diff command, you can select the records that are in file 1 but not in file 2, and vice versa, and write each set to its own file. As others have mentioned, you should sort both files before running diff. For example:


Code:
diff file1.txt file2.txt 

2d1 
< HOME|NEWPORT STREET|1||NEW LISTING 
5c4 
< CAR|TOYOTA|4||NEW LISTING 
--- 
> CAR|TOYOTA|5||NEW LISTING 
6a6 
> CAR|HONDA|4||NEW LISTING

Output the records that are in file2.txt but not in file1.txt (the pattern is anchored with ^ so that a ">" inside the data cannot match):
Code:
diff file1.txt file2.txt | grep '^>' | cut -b 3- > add.txt



Output the records that are in file1.txt but not in file2.txt:

Code:
diff file1.txt file2.txt | grep '^<' | cut -b 3- > drop.txt



Code:
cat add.txt 
CAR|TOYOTA|5||NEW LISTING 
CAR|HONDA|4||NEW LISTING 

cat drop.txt 
HOME|NEWPORT STREET|1||NEW LISTING 
CAR|TOYOTA|4||NEW LISTING

Note that this compares entire records, which may not match your requirement: the original record "CAR|TOYOTA|4||NEW LISTING" was changed rather than dropped, yet it is still written to drop.txt. To tell a dropped record from a changed one, look at the join command, which matches records on specific fields/keys in your files.
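As a rough illustration of matching on keys instead of whole records, here is a minimal sketch that uses awk rather than join. It assumes the key is fields 1 and 2 (as in the awk one-liner quoted earlier in the thread), builds small sample files from records shown in this thread, and reuses the add.txt and drop.txt names from above. Treat it as a starting point, not a tested solution for 100 MB inputs.

```shell
#!/bin/sh
# Sketch: classify records by key (fields 1 and 2), not by whole line.
# The sample data and the choice of key fields are assumptions.

cat > file1.txt <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
EOF

cat > file2.txt <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
EOF

: > add.txt ; : > drop.txt   # start with empty output files

awk -F'|' '
    # First file: remember each whole record under its key.
    NR == FNR { old[$1 FS $2] = $0 ; next }
    # Second file: a new key, or a known key with a different record,
    # is written to add.txt (new and changed records).
    {
        key = $1 FS $2
        if (!(key in old) || old[key] != $0) print > "add.txt"
        seen[key] = 1
    }
    # Keys that never appeared in the second file are true drops.
    END { for (key in old) if (!(key in seen)) print old[key] > "drop.txt" }
' file1.txt file2.txt

# The for-in traversal order is unspecified, so sort the drops.
sort -o drop.txt drop.txt
```

With this sample data, the changed TOYOTA record lands only in add.txt, and drop.txt holds just the NEWPORT STREET record. The script keeps all of the first file in memory, which may or may not be acceptable for your file sizes. On Solaris you may need nawk or /usr/xpg4/bin/awk instead of the default awk.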

mjf
# 10  
Old 03-21-2012
Hi.

Here is a complex script that attempts to satisfy the original requirements:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate comparison and extraction of small differences.

# Section 1, setup, pre-solution.
# Infrastructure details, environment, debug commands for forum posts. 
# Uncomment export command to run script as external user.
# export PATH="/usr/local/bin:/usr/bin:/bin"
set +o nounset
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
LC_ALL=C ; LANG=C ; export LC_ALL LANG
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
  head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
C=$HOME/bin/context && [ -f $C ] && $C specimen diff awk
set -o nounset

FILE1=${1-data1}
shift
FILE2=${1-data2}

# Display sample data files.
pe
specimen $FILE1 $FILE2 
# edges 3 $FILE1
# edges 3 $FILE2

# Section 2, solution.
pl " Preparation and pipeline:"
db " Section 2: solution."
diff -u $FILE1 $FILE2 |
tee f1 |
awk '
/^-[^-]/    {
            # print "debug for - working on",NR,$0
            # A removed line: hold it until the next line shows whether it was changed.
            previous = NR ; action = "deleted" ; line = $0 ; next
            }
/^[+][^+]/  {
            # print "debug for + working on",NR,$0
            # An added line directly after a removed line counts as a change.
            if ( previous == NR-1 ) { action = "changed" } else { action = "inserted" }
            print action, $0
            previous = 0
            next
            }
previous != 0   {
            # print "debug for not +-",NR,$0
            # Any other line ends a pending removal: report it as deleted.
            print action, line ; previous = 0
            }
' |
tee f2 |
awk '
/^deleted/	{ sub(/^deleted [-]/, "") ; print > "f.deleted" ; next }
/^(changed|inserted)/	{ sub(/^(changed|inserted) [+]/,"") ; print > "f.changed" ; next }
'

pl " Results, deletions file:"
cat f.deleted
pl " Results, insertions and changes file:"
cat f.changed

exit 0

producing:
Code:
% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
specimen (local) 1.17
diff (GNU diffutils) 2.8.1
awk GNU Awk 3.1.5

Whole: 5:0:5 of 8 lines in file "data1"
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING

Whole: 5:0:5 of 8 lines in file "data2"
HOME|ALICE STREET|3||NEW LISTING
HOME|KING STREET|5||NEW LISTING
HOME|WINSOME AVENUE|4||MODIFICATION
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
COMPUTER|HP|1||NEW LISTING
COMPUTER|APPLE|1||NEW LISTING

-----
 Preparation and pipeline:

-----
 Results, deletions file:
HOME|NEWPORT STREET|1||NEW LISTING

-----
 Results, insertions and changes file:
CAR|TOYOTA|5||NEW LISTING
CAR|HONDA|4||NEW LISTING

This uses diff's unified format. It works for the sample files, but I don't know whether it will work on far larger files. You can look at files f1 and f2 to see the intermediate data.

If it does not work, then perhaps a sort followed by a diff would be the best approach. I just dislike making extra passes over files when I don't have to, especially if they are large. However, these days 100 MB is not overwhelming.
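For reference, a sort-based fallback could look roughly like the sketch below. Note that it swaps in comm for diff, since comm reports the lines unique to each sorted file directly. The sample data is a subset of the records shown above, and the data1.sorted, data2.sorted, only_in_1.txt and only_in_2.txt names are made up for this example.

```shell
#!/bin/sh
# Sketch: sort both files, then let comm split out the unique lines.
# File names other than data1 and data2 are hypothetical.

LC_ALL=C ; export LC_ALL   # sort and comm must agree on collation

cat > data1 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
HOME|NEWPORT STREET|1||NEW LISTING
CAR|TOYOTA|4||NEW LISTING
CAR|FORD|4||NEW LISTING
EOF

cat > data2 <<'EOF'
HOME|ALICE STREET|3||NEW LISTING
CAR|TOYOTA|5||NEW LISTING
CAR|FORD|4||NEW LISTING
CAR|HONDA|4||NEW LISTING
EOF

sort data1 > data1.sorted
sort data2 > data2.sorted

# -23 suppresses columns 2 and 3, leaving lines found only in the first file.
comm -23 data1.sorted data2.sorted > only_in_1.txt
# -13 suppresses columns 1 and 3, leaving lines found only in the second file.
comm -13 data1.sorted data2.sorted > only_in_2.txt
```

This compares whole lines, so a changed record shows up in both output files (the old version in only_in_1.txt, the new one in only_in_2.txt); separating changes from deletions still needs a key-based step.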

Best wishes ... cheers, drl