Hi.
Quote:
Originally Posted by net_shree
I have to compare two text files; very few of the lines in these files will have some difference in some column. The file sizes are in GB.
By chance I am working with a text file of about this size: it is just over 1 GB and has 15 M (15,000,000) lines. The real time to count the lines with wc is 15-20 seconds (AMD-64/3000, SATA disk).
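For reference, the timing was essentially just a single sequential pass over the file (the file name below is only a placeholder):

Code:
# One pass over the whole file with wc; this is roughly the baseline cost
# of simply reading ~1 GB from disk.
time wc -l bigfile.txt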
If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and uses it with a program to look through file2 at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention actually reading the file. For example, running grep against /dev/null 15,000 times takes about 10 seconds (10.2 actually) of real time. For 1,000 times that many runs, I'd be looking at about 2.8 hours just to load grep from the disk and read an immediate EOF. A single grep of the whole 1 GB file for a non-existent string takes about 18 seconds.
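The overhead test I mean is roughly the loop below (a sketch: the pattern is arbitrary, and /dev/null makes each grep see EOF immediately, so what is being timed is mostly process start-up):

Code:
# Start grep 15,000 times against /dev/null; each run does no real work,
# so the elapsed time is almost entirely the cost of loading and starting grep.
time for i in $(seq 1 15000); do grep pattern /dev/null; done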
I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15 M passes over 1 file.
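In script form, the suggestion amounts to something like this (file names are placeholders):

Code:
# Sort each file once into a sorted copy, then let diff make a single pass
# over the two sorted copies.
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted

If disk space for the temporary files is a concern, GNU sort's -T option lets you point them at a filesystem with enough room.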
If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back: tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- and perhaps we can suggest some other approach ... cheers, drl