File Comparison

# 15  
Old 01-05-2008
I am trying the stateful method, but I am not getting any output.
I saved your code as a script file and ran it in the directory where the files reside, but I see nothing: it returns without any output or error. I am testing on small files to verify.
# 16  
Old 01-05-2008
Originally Posted by net_shree
I have to compare two text files, very few of the lines in these files will have some difference in some column.
The files size is in GB.
By chance I am working with a text file of about this size. It is just over 1 GB and has 15 M (15,000,000) lines. Counting the lines with wc takes 15-20 seconds of real time (AMD-64/3000, SATA disk).

If this is correct, and you have 2 such files, then I think any method that reads a line from file1 and runs a program to search file2 for it at each step will not end quickly, because there will be 15 M loads of that program involved, not to mention actually reading the file. For example, running grep on /dev/null 15,000 times takes about 10 seconds (10.2 actually) of real time. For 1,000 times that, I'd be looking at roughly 2.8 hours (1,000 x 10.2 s = 10,200 s) just to load grep from the disk and read an immediate EOF. A grep for a non-existent string takes about 18 seconds for a single search.
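The estimate above can be reproduced with a rough sketch like the one below. It is an assumed reconstruction, not the script used in the post: the variable names, the default of 200 runs (kept small so it finishes quickly), and the search string are all illustrative. It starts grep repeatedly on /dev/null, extrapolates the elapsed time to 15,000,000 runs, and also redoes the arithmetic with the post's own figure of 10.2 seconds per 15,000 runs.

```shell
#!/bin/sh
# Assumed reconstruction of the measurement: start grep many times on
# /dev/null so that only process start-up cost is measured.
runs=${RUNS:-200}          # the post used 15,000; kept small here
total_lines=15000000       # one grep per line of a 15 M line file

start=$(date +%s)
i=0
while [ "$i" -lt "$runs" ]; do
    # grep exits 1 when nothing matches; ignore that status.
    grep -q 'nosuchstring' /dev/null || true
    i=$((i + 1))
done
end=$(date +%s)
elapsed=$((end - start))   # whole seconds only, so this is coarse

# Extrapolate the measured time to one grep start per line.
est_hours=$(LC_ALL=C awk -v e="$elapsed" -v r="$runs" -v n="$total_lines" \
    'BEGIN { printf "%.2f", e * n / r / 3600 }')
echo "measured: ${elapsed}s for $runs runs, about $est_hours hours for $total_lines runs"

# The same arithmetic with the figures quoted in the post.
post_hours=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 10.2 * 15000000 / 15000 / 3600 }')
echo "from the post's figures: about $post_hours hours"
```

Set RUNS=15000 to repeat the measurement at the scale described in the post; the result will of course depend on the machine.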

I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
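A minimal sketch of the sort-then-diff suggestion follows. The sample data, the temporary directory, and the file names are assumptions for illustration only; the real files in this thread are around 1 GB each, and diff may still need a lot of memory at that size.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the two large files.
workdir=$(mktemp -d)
printf '3|c\n1|a\n2|b\n' > "$workdir/file1"
printf '2|b\n3|X\n1|a\n' > "$workdir/file2"

# One sort pass per file. LC_ALL=C gives a plain byte-wise ordering.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

# One diff pass over both files. diff exits 1 when the files differ,
# so capture the status instead of treating it as a failure.
diff_status=0
diff_out=$(diff "$workdir/file1.sorted" "$workdir/file2.sorted") || diff_status=$?
printf '%s\n' "$diff_out"

rm -rf "$workdir"
```

With this sample data, only the third line differs after sorting, so diff reports a single changed line.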

If my facts are wrong, then tell me where I missed something of importance or made a mistake. Otherwise, perhaps we should take a step back and you tell us what the higher purpose of the problem is -- what problem you are really trying to solve -- perhaps we can suggest some other approach ... cheers, drl
# 17  
Old 01-06-2008
Originally Posted by drl
I suggest that the files be sorted and diff be run once on the two files (post #8, rikxik). That will be 2 passes across each file, a decrease of close to 100% from 15M passes over 1 file.
Hi drl - I was wondering whether there is any reason/performance gain (for diff) if we sort the files? Is it essential/necessary? Just thinking aloud.
# 18  
Old 01-06-2008
Hi, rikxik.

I was thinking that, with sorted input, the window diff has to search for matching sequences would not be so large. However, if the files were very similar, then the sort could perhaps be skipped -- I hope for the best, but expect the worst.

It would be interesting to try it both ways, of course ... cheers, drl
# 19  
Old 01-10-2008
I sorted both files and then tried diff as well as grep -v -f file1 file2; same problem.
It runs for too long.
# 20  
Old 01-10-2008

Perhaps I had more luck -- I didn't have to wait so long for a definitive answer. On 2 different machines, I had 2 large, similar, but different files of size about 1 GB. One machine had 2.5 GB memory, the other 1 GB. When I used diff, I got the message:
diff: memory exhausted
 Exit status: 2

So I sorted the files and ran:
comm -3 file1 file2

On one machine the elapsed time for comm was 3 minutes (2.8 GHz Xeon, RHEL 4), and on the other, 2.5 minutes (AMD-64, 3000+, Debian sarge).

You may need to glance at man comm to see what it is doing -- it requires sorted input files, and with -3 it suppresses the lines common to both files, leaving only the lines unique to file1 and the lines unique to file2.
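For anyone who wants to see what comm -3 prints before running it on the big files, here is a small sketch. The sample contents and the temporary directory are made up for illustration; only the sort and comm -3 steps come from the post above.

```shell
#!/bin/sh
# Hypothetical sample data; the files in this thread are about 1 GB each.
workdir=$(mktemp -d)
printf 'apple\ncherry\nbanana\n' > "$workdir/file1"
printf 'banana\ndate\napple\n' > "$workdir/file2"

# comm needs both inputs sorted in the same collation order,
# so use the same locale for sort and for comm.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2" > "$workdir/file2.sorted"

# -3 suppresses column 3 (lines common to both files).
# Lines only in file1 start at the left margin;
# lines only in file2 are indented by one tab.
only_diffs=$(LC_ALL=C comm -3 "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$only_diffs"

rm -rf "$workdir"
```

Here "cherry" appears at the left margin (only in file1) and "date" appears after a tab (only in file2). Because comm reads both sorted files in a single merge-style pass, its memory use stays small, which would fit with it finishing where diff ran out of memory.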

Best wishes ... cheers, drl