Difference between two huge files

09-10-2008

Hi,

I need to compute the difference between two big files (around 6.5 GB) and write it to an output file without any line numbers or '<' or '>' markers in front of each line.

As the diff command won't work on big files, I tried to use bdiff instead.

But I am getting an incorrect number of records.

I ran the following test:

I have a .dat file with a few million records in it, and to generate another file I used sed '1,100d' oldfile > newfile

Then I ran bdiff oldfile newfile | sed -n '/^</p' > DIFF.DAT

The output (DIFF.DAT) should have 100 records in it, but I am getting an output with several records in it.
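
A scaled-down version of this test might look like the sketch below. It assumes plain diff rather than bdiff (which is not installed on every system) and a 1000-line sample file, both chosen only for illustration. The sed expression prints only the lines that diff marks with "< " and strips that marker, so the line-number headers never reach the output.

```shell
# Build a 1000-line sample file (hypothetical data) and a copy without the first 100 lines.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "record " i }' > oldfile
sed '1,100d' oldfile > newfile

# diff prefixes lines found only in oldfile with "< ".
# sed -n prints nothing by default; s/^< //p prints only those lines, minus the marker,
# so change headers such as "1,100d0" are dropped.
diff oldfile newfile | sed -n 's/^< //p' > DIFF.DAT

# Count the recovered records.
wc -l < DIFF.DAT
```

Replacing '/^</p' with 's/^< //p' is the only change to the original filter; whether the same pipeline behaves on multi-gigabyte input still depends on the diff or bdiff implementation in use.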

Could anyone help me out with this?

Thanks

Sue

pyaranoid

09-12-2008

Hi,

You can try this command:

printf "%s\n" $(comm -3 file1 file2) > newfile

Hope it works.
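
Two caveats apply to that command: comm expects both inputs to be sorted, and the unquoted $(...) splits records on whitespace. Below is a sketch of a streaming variant, assuming the goal is the records present only in file1. The sample data and the .sorted file names are hypothetical.

```shell
# Hypothetical sample data: file1 has one record that file2 lacks.
printf '%s\n' 'delta 4' 'alpha 1' 'beta 2' > file1
printf '%s\n' 'delta 4' 'alpha 1' > file2

# comm needs sorted input; LC_ALL=C makes sort and comm agree on byte order.
LC_ALL=C sort file1 > file1.sorted
LC_ALL=C sort file2 > file2.sorted

# -2 hides records unique to file2, -3 hides common records,
# leaving only records unique to file1, with no leading tabs.
LC_ALL=C comm -23 file1.sorted file2.sorted > newfile

cat newfile
```

Sorting discards the original record order, so this only fits cases where the order of the reported records does not matter.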

subhendu81

09-12-2008

Hi,

Thanks for the reply. The command works for small files.

When I try it with my file (6.5 GB), I get the following error:

comm: poel.dat: Value too large for defined data type

Please advise.

Thanks

Sue

pyaranoid

09-13-2008

Any suggestions, please?

Thanks

Sue

pyaranoid

09-13-2008

Do you know about bdiff? It is meant for large files: it does what diff does, except that it works on large files.

jim mcnamara

09-14-2008

Yes, I know that bdiff is used for large files.

Do we need to sort the files before using bdiff?

How can I get rid of the line numbers and the < and > symbols in front of each record in the output file?

Thanks

Sue

pyaranoid

09-14-2008

bdiff does not work very well, as you have discovered, although it should have behaved better than you reported. Even if you get bdiff working "right", it will report far too many differences: it splits both files into segments and runs diff on corresponding segments, so once lines are deleted the later segments no longer line up. This is probably not what you want. If you are sure that one file is a superset of the other, a custom script to scan the files would probably be the best approach, especially with such large files. What OS are you using?
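
One possible shape for such a scan is sketched below. It assumes that the second file is the first with some lines deleted and the remaining lines in the same order, and that the local awk can open files of this size. The helper name superset_diff and the sample data are hypothetical.

```shell
# superset_diff OLD NEW (hypothetical helper): print the lines of OLD missing from NEW.
# Assumes NEW is OLD with some lines deleted and the remaining lines in the same order.
superset_diff() {
    awk -v new="$2" '
        # Prime the comparison with the first line of NEW.
        BEGIN { have = ((getline nline < new) > 0) }
        {
            # Appending "" forces a string comparison, so "010" and "10" stay distinct.
            if (have && ($0 "") == (nline "")) {
                # Line is present in both files: advance NEW by one line.
                have = ((getline nline < new) > 0)
            } else {
                # Line exists only in OLD: report it without any marker.
                print
            }
        }
    ' "$1"
}

# Small demonstration with hypothetical data.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "record " i }' > oldfile
sed '1,100d' oldfile > newfile
superset_diff oldfile newfile > DIFF.DAT
wc -l < DIFF.DAT
```

The scan holds only one line from each file in memory at a time and needs no sorting, but it gives meaningless results if the superset assumption does not hold.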

Perderabo
