The UNIX and Linux Forums  

  #15  
01-05-2008
Registered User
Join Date: Dec 2007
Posts: 8
I am trying the stateful method, but I am not getting any output.
I saved your code as a script file and ran it in the directory where the files reside. It returns without printing any output or error. I am testing on small files to verify.

#16
01-05-2008
drl
Registered User
Join Date: Apr 2007
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 556
Hi.
Quote:
Originally Posted by net_shree
I have to compare two text files, very few of the lines in these files will have some difference in some column.
The files size is in GB.
By chance I am working with a text file of about this size. It is just over 1 GB and has about 15 M (15,000,000) lines. Counting the lines with wc takes 15-20 seconds of real time (AMD-64/3000, SATA disk).

If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and hands it to a program that searches file2 will not finish quickly: it needs 15 M loads of that program, before any time is spent actually reading the file. For example, running grep against /dev/null 15,000 times takes about 10 seconds (10.2, actually) of real time. For 1,000 times that, I'd be looking at roughly 2.8 hours (10.2 s x 1,000 = 10,200 s) just to load grep from the disk and read an immediate EOF. A grep for a non-existent string takes about 18 seconds for a single search.
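A rough way to reproduce the launch-overhead measurement on your own machine is sketched below. It is not the exact test from the post: N, the variable names, and the use of `date +%s` for whole-second timing are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: time N launches of grep that read an immediate EOF from /dev/null.
# N is a hypothetical iteration count; the post used 15,000.
N=${N:-200}

start=$(date +%s)
i=0
while [ "$i" -lt "$N" ]; do
    # grep exits 1 when nothing matches, which is expected here.
    grep x /dev/null || true
    i=$((i + 1))
done
end=$(date +%s)

launches=$i
elapsed=$((end - start))
# Scale the measured time linearly to 15,000,000 launches (one per line).
projected=$((elapsed * 15000000 / N))
echo "launches=$launches elapsed=${elapsed}s projected=${projected}s"
```

With a small N the whole-second clock will often report 0, so raise N (for example `N=15000 sh script.sh`) to get a number worth scaling.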

I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
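The sort-then-diff idea can be sketched on two tiny stand-in files. The file names, the sample rows, and the choice of `LC_ALL=C` for a plain byte-order sort are assumptions for illustration, and whether diff copes with GB-sized inputs is a separate question.

```shell
#!/bin/sh
# Sketch: sort each file once, then run diff once over the sorted copies.
tmp=$(mktemp -d)

# Hypothetical sample data: same rows in a different order, one column changed.
printf 'b 2\na 1\nc 3\n' > "$tmp/file1"
printf 'c 3\na 1\nb 9\n' > "$tmp/file2"

# One pass per file to sort; LC_ALL=C gives a simple byte ordering.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort "$tmp/file2" > "$tmp/file2.sorted"

# One diff run; diff exits 1 when the files differ, so that status is ignored.
diff_out=$(diff "$tmp/file1.sorted" "$tmp/file2.sorted" || true)
printf '%s\n' "$diff_out"

rm -rf "$tmp"
```

Only the row whose second column changed shows up in the diff output; rows that merely moved are lined up by the sort.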

If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back and you tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- perhaps we can suggest some other approach ... cheers, drl

#17
01-06-2008
rikxik
Registered User
Join Date: Dec 2007
Posts: 105
Quote:
I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
Hi drl - I was wondering whether sorting the files gives diff any performance gain. Is it essential? Just thinking aloud.

#18
01-06-2008
drl
Registered User
Join Date: Apr 2007
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 556
Hi, rikxik.

I was thinking that with sorted input, the window diff has to search for matching sequences would not be so large. However, if the files were very similar, then the sort could perhaps be skipped -- I hope for the best, but expect the worst.

It would be interesting to try it both ways, of course ... cheers, drl

#19
01-10-2008
Registered User
Join Date: Dec 2007
Posts: 8
I sorted both files and then tried diff as well as grep -v -f file1 file2. Same problem:
it runs for too long.

#20
01-10-2008
drl
Registered User
Join Date: Apr 2007
Location: Saint Paul, MN USA / BSD, CentOS, Debian, OS X, Solaris
Posts: 556
Hi.

Perhaps I had more luck -- I didn't have to wait so long for a definitive answer. On 2 different machines, I had 2 large, similar, but different files of size about 1 GB. One machine had 2.5 GB memory, the other 1 GB. When I used diff, I got the message:
Code:
diff: memory exhausted
 Exit status: 2
So I sorted the files and ran:
Code:
comm -3 file1 file2
On one machine the elapsed time for comm was 3 minutes (2.8 GHz Xeon, RHEL 4), and on the other, 2.5 minutes (AMD-64, 3000+, Debian sarge).

You may need to glance at man comm to see what it is doing -- it requires sorted input files. With -3 it suppresses the lines common to both files, leaving the lines unique to file1 and, indented by a tab, the lines unique to file2.
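A small sketch of what comm reports, using two tiny stand-in files. The sample rows and variable names are made up for illustration, and `LC_ALL=C` is an assumption so that sort and comm agree on the same ordering.

```shell
#!/bin/sh
# Sketch: comm needs both inputs sorted in the same collation order.
tmp=$(mktemp -d)

# Hypothetical sample data: one row differs in its second column.
printf 'b 2\na 1\nc 3\n' | LC_ALL=C sort > "$tmp/file1"
printf 'c 3\na 1\nb 9\n' | LC_ALL=C sort > "$tmp/file2"

# -3 drops lines common to both; file2-only lines are prefixed with a tab.
both=$(LC_ALL=C comm -3 "$tmp/file1" "$tmp/file2")
# -23 keeps only lines unique to file1, -13 only lines unique to file2.
only1=$(LC_ALL=C comm -23 "$tmp/file1" "$tmp/file2")
only2=$(LC_ALL=C comm -13 "$tmp/file1" "$tmp/file2")

printf 'comm -3:\n%s\n' "$both"
printf 'only in file1: %s\n' "$only1"
printf 'only in file2: %s\n' "$only2"

rm -rf "$tmp"
```

Because comm walks both sorted files in step, it does not need to hold either file in memory, which may be why it finished where diff gave up.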

Best wishes ... cheers, drl