Hi.
Quote:
Originally Posted by net_shree
I have to compare two text files; very few of the lines in these files will have some difference in some column. The file sizes are in GB.
By chance I am working with a text file of about this size: it is just over 1 GB and has 15 M (15,000,000) lines. The real time to count the lines with wc is 15-20 seconds (AMD-64/3000, SATA disk).
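For reference, the timing was essentially just a single sequential pass over the file (the file name below is only a placeholder):

Code:
# One pass over the whole file with wc; this is roughly the baseline cost
# of simply reading ~1 GB from disk.
time wc -l bigfile.txt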
If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and uses it with a program to look through file2 at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention actually reading the file. For example, running grep against /dev/null 15,000 times takes about 10 seconds (10.2 actually) of real time. For 1,000 times that many runs, I'd be looking at about 2.8 hours just to load grep from the disk and read an immediate EOF. A single grep of the whole 1 GB file for a non-existent string takes about 18 seconds.
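The overhead test I mean is roughly the loop below (a sketch: the pattern is arbitrary, and /dev/null makes each grep see EOF immediately, so what is being timed is mostly process start-up):

Code:
# Start grep 15,000 times against /dev/null; each run does no real work,
# so the elapsed time is almost entirely the cost of loading and starting grep.
time for i in $(seq 1 15000); do grep pattern /dev/null; done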
I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15 M passes over 1 file.
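In script form, the suggestion amounts to something like this (file names are placeholders):

Code:
# Sort each file once into a sorted copy, then let diff make a single pass
# over the two sorted copies.
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted

If disk space for the temporary files is a concern, GNU sort's -T option lets you point them at a filesystem with enough room.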
If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back: tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- and perhaps we can suggest some other approach ... cheers, drl