omitting lines from file A that are in file B


 
# 8  
Old 02-19-2008
Quote:
Originally Posted by vino
Code:
grep -v -f fileA fileB > output.txt

In theory, this will work (especially with -F), but I haven't tried it on very large files, and I'm unsure of grep's algorithm here. If for every line in fileB it loops through every line in fileA, this will take a while. Or, if it tries to build a regexp tree for 1M entries, this will take huge resources!
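
For reference, here is a minimal sketch of the fixed-string variant on two tiny made-up files (the file contents are only for illustration, and I have not timed this on million-line inputs). -F makes grep treat every pattern line as a literal string instead of a regexp, and -x accepts only whole-line matches, so a short line in fileB cannot knock out a longer line in fileA that merely contains it. Note the argument order: the patterns are read from fileB and fileA is the file being filtered, since the goal is to drop lines of A that occur in B.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' 'a b c' 'd e f' 'g h i' 'a b' > fileA
printf '%s\n' 'd e f' 'a b' 'x y z' > fileB

# -F: fixed strings, -x: whole-line match, -v: invert the match,
# -f fileB: read the list of patterns from fileB
grep -F -x -v -f fileB fileA > output.txt
cat output.txt
```

On the sample files this leaves "a b c" and "g h i" in output.txt.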
# 9  
Old 02-19-2008
Quote:
Originally Posted by gneen
Heh - the grep inside the read loop would "work" ... but I'd have to come back in a year to see the results!

For tiny files this would clearly be the way to go - but for files the size I'm dealing with this would mean one million greps into a file that was ten million lines long ... can you spell "Rip Van Winkle"?

=Gneen
So, I made a 1M-line file and a 10M-line file to test on. Running the grep loop on a pretty beefy server (dual 2.8 GHz Intel Xeons) took 170 minutes ... so, not forever, but definitely not as fast as you'd want.
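
The exact loop isn't shown in this part of the thread, so the following is only a guess at its general shape, on made-up sample files. It shows why the approach is slow: every line of fileA starts a fresh grep process that scans fileB from the top.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' 'a b c' 'd e f' 'g h i' > fileA
printf '%s\n' 'd e f' 'x y z' > fileB

# One grep per line of fileA; -q stops at the first hit, -F -x is a literal
# whole-line match, -e protects lines that begin with a dash
while IFS= read -r line; do
    grep -q -F -x -e "$line" fileB || printf '%s\n' "$line"
done < fileA > loop_output.txt
cat loop_output.txt
```

With 1M lines in fileA that is 1M process launches, each reading up to the whole of fileB.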
# 10  
Old 02-19-2008
Hi.
Quote:
Originally Posted by gneen
I've got file A with (say) 1M lines in it ... ascii text, space delimited ...

I've got file B with (say) 10M lines in it ... same structure.

I want to remove any lines from A that appear (identically) in B and print the remaining (say) 900K lines ...
If the original order is not important you could sort both files, and use comm, where you can choose a list that contains lines unique to one file, unique to the other, common to both, or any combination of those ... cheers, drl
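
A minimal sketch of the sort-and-comm idea, assuming the goal is the lines that appear only in fileA (sample contents are made up). comm expects both inputs sorted under the same collation, which is why the locale is pinned to C for both sort and comm.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' 'g h i' 'a b c' 'd e f' > fileA
printf '%s\n' 'x y z' 'd e f' > fileB

# Same collation order for sort and comm
export LC_ALL=C
sort fileA > fileA.sorted
sort fileB > fileB.sorted

# -2 suppresses lines unique to fileB, -3 suppresses lines common to both,
# which leaves column 1: lines found only in fileA
comm -23 fileA.sorted fileB.sorted > comm_output.txt
cat comm_output.txt
```

Other combinations work the same way, e.g. -12 for the lines common to both files or -13 for the lines unique to fileB.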
# 11  
Old 02-19-2008
joining

Sort both files on the same field, the one on which the output must not
contain matching records. If you have sorted both files on field 1, use the
following to remove from the second file the records whose key also appears
in the first (add -t followed by your delimiter if the fields are not
separated by blanks):

join -1 1 -2 1 -v2 -o 2.1 2.2 2.3 2.4 file1 file2 > output

and your job is done in no time; this should take the least time.
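
A small worked sketch of the join approach, using made-up records keyed on field 1 with blank-separated fields. Keep in mind that join compares only the key field, not the whole line, so this fits the "identical lines" problem only if the key identifies the record.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' '1003 d e f' '1001 a b c' > file1
printf '%s\n' '1004 j k l' '1001 a b c' '1002 g h i' '1003 d e f' > file2

# join needs both inputs sorted on the join field, with the same collation
export LC_ALL=C
sort -k1,1 file1 > file1.sorted
sort -k1,1 file2 > file2.sorted

# -v 2: print only those lines of the second file that have no partner in the first
join -1 1 -2 1 -v 2 file1.sorted file2.sorted > join_output.txt
cat join_output.txt
```

On the sample data only the records with keys 1002 and 1004 survive.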
# 12  
Old 02-19-2008
You can give this a try:

Code:
# Load B into memory, then print the lines of A that are not in B
awk 'NR==FNR{arr[$0];next} !($0 in arr){print}' B A

Regards
# 13  
Old 02-19-2008
The solution I went with ... 23 seconds for 100K into 1M

I can't thank you all enough!

I ended up going with the awk script suggested by Otheus above ... I was so amazed at the speed I felt obligated to check the result with an alternative method - and the result was indeed verified.

My box has 6GB of memory BTW - but it would appear that my gawk has a 1.5GB limit (either compiled in or part of the OS - but in either event I don't think I can change it). The limit is approached when the size of FILE1 approaches 1.5GB ... for larger files I split the input file and ran it against each of the parts. The size of FILE2 does not play into the amount of memory required by the awk program. Your awk may have a different limit which you'll discover if it is an issue.
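
In case it helps anyone hitting the same limit, here is a rough sketch of the split-and-repeat workaround on made-up sample files. The file names, the piece prefix and the piece size of 2 lines are only for the demo; in real use you would pick a line count that keeps each piece under your awk's limit.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' 'a b c' 'd e f' 'g h i' 'j k l' 'm n o' > FILE1.sample
printf '%s\n' 'd e f' 'm n o' 'x y z' > FILE2.sample

# Cut the first file into pieces of at most 2 lines each:
# FILE1.sample.part.aa, FILE1.sample.part.ab, ...
split -l 2 FILE1.sample FILE1.sample.part.

# Run the same awk filter once per piece and collect the survivors
: > split_output.txt
for piece in FILE1.sample.part.*; do
    awk 'NR==FNR { seen[$0]=1; next }   # piece: remember every line
         $0 in seen { seen[$0]=0 }      # reference file: flag lines found there
         END { for (k in seen) if (seen[k]==1) print k }' \
        "$piece" FILE2.sample >> split_output.txt
done
sort split_output.txt
```

Only one piece is held in memory at a time, at the cost of reading the second file once per piece. A line that occurs in two different pieces would be printed twice, so run the result through sort -u if that matters.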

FILE1: START.100K
66,831,529 bytes with 100K lines (yes - my data is actually 600+ bytes/record)

FILE2: REF.1M
648,903,713 bytes with 1M lines - obviously similar data

Quote:
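time (awk ' NR==FNR { A[$0]=1; next }     # first file: remember every line
$0 in A { A[$0]=0 }                        # second file: flag lines that also occur here
END { for (k in A) if (A[k]==1) print k }  # print unflagged lines (once each, arbitrary order)
' "$FILE1" "$FILE2" > "$FILE3" )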
real 0m23.323s !!!!
user 0m11.484s
sys 0m8.233s


AND THE OUTPUT IS:
48,295,948 bytes with 71,836 lines - i.e. 71,836 of the 100K lines did NOT appear in the 1M-line file.
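
For anyone who wants to run a similar cross-check: this is not necessarily the alternative method used above, just one possible sketch on made-up sample files. It compares the awk result against sort plus comm. Because the awk version prints each surviving line once and in arbitrary order, both results are sorted and de-duplicated before comparing.

```shell
# Illustration only: tiny sample files created in the current directory
printf '%s\n' 'g h i' 'a b c' 'd e f' 'a b c' > START.sample
printf '%s\n' 'd e f' 'x y z' > REF.sample
export LC_ALL=C

# Method 1: the awk approach (each surviving line once, arbitrary order), then sorted
awk 'NR==FNR { A[$0]=1; next }
     $0 in A { A[$0]=0 }
     END { for (k in A) if (A[k]==1) print k }' START.sample REF.sample | sort > via_awk.txt

# Method 2: sort -u plus comm as an independent cross-check
sort -u START.sample > START.sorted
sort -u REF.sample > REF.sorted
comm -23 START.sorted REF.sorted > via_comm.txt

# cmp -s is silent and succeeds only if the two files are byte-identical
if cmp -s via_awk.txt via_comm.txt; then
    echo "results match"
else
    echo "results differ"
fi
```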

Last edited by gneen; 02-19-2008 at 09:13 PM..
# 14  
Old 02-19-2008
Thanks for the benchmark, very interesting thread here. I have to work with big files too sometimes.