omitting lines from file A that are in file B

02-18-2008

I've got file A with (say) 1M lines in it ... ascii text, space delimited ...

I've got file B with (say) 10M lines in it ... same structure.

I want to remove any lines from A that appear (identically) in B and print the remaining (say) 900K lines. (And I want to do it in zero time of course!)

Best I've come up with so far is somehow marking the lines in A, then doing a sort and applying an awk script to the result so that the marked lines are only printed if the following (or previous) line isn't "identical" except for the mark.
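A sort-based route in the same spirit can skip the marking step by letting comm do the comparison. The following is only a sketch under stated assumptions: the function name lines_only_in_a is made up for illustration, mktemp is assumed to be available (it normally is under cygwin), and the result comes out sorted with duplicate lines collapsed rather than in file A's original order.

```shell
# Hypothetical helper: print the lines of file $1 that do not occur in file $2.
# Output is sorted and de-duplicated (an assumption of this sketch).
lines_only_in_a() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }

    # comm needs both inputs sorted in the same collation order,
    # so force the C locale for sort and comm alike.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"

    # -2 drops lines found only in B, -3 drops lines found in both,
    # leaving the lines found only in A.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"

    rm -f "$tmp_a" "$tmp_b"
}

# Usage: lines_only_in_a fileA fileB > fileC
```

The cost is two sorts plus one linear pass, so it should scale far better than one lookup per line, though the sorts themselves are not free on files this size.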

But after 1000 years of shell programming I've GOT to believe I'm missing an easier/faster solution ... I'm using bash and cygwin tools - and compiling is not an option.

ADVthanksANCE for your help!
=Gneen

gneen

02-18-2008

Code:

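while IFS= read -r line
do
    # -q quiet, -x match the whole line, -F treat the line as a fixed string
    if ! grep -qxF -- "$line" fileB; then
        printf '%s\n' "$line"
    fi
done < fileA > fileC    # redirect once, so fileC is not overwritten per line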

Not sure how fast that would be, but fileC will end up with all the lines that were in fileA and not in fileB.

earnstaf

02-18-2008

but ...

Heh - the grep inside the read loop would "work" ... but I'd have to come back in a year to see the results!

For tiny files this would clearly be the way to go - but for files the size I'm dealing with this would mean one million greps into a file that was ten million lines long ... can you spell "Rip Van Winkle"?

=Gneen

gneen

02-18-2008

not knowing what the real data looks like, but...

How about this?
It breaks file B up into 26 smaller files based on the first character of each line, assuming that character is a lowercase letter. (Or, depending on the format of your data, it could be ten numeric groups, etc...)

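for outch in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
    # one bucket file per leading letter
    grep "^$outch" fileb > "fileb_$outch"
done

while IFS= read -r zf
do
    leadch=$(printf '%s' "$zf" | cut -c1)
    # look the line up only in the bucket for its leading character
    # (-x whole line, -F fixed string) and print it when it is not found;
    # this relies on every line starting with a lowercase letter
    grep -qxF -- "$zf" "fileb_$leadch" 2>/dev/null || printf '%s\n' "$zf"
done < filea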

joeyg

02-18-2008

Use awk/perl hashes/assoc arrays

Assuming awk is fairly memory efficient and you have at least 1M x length-of-line bytes in virtual mem, this should work:

Code:

awk 'NR==FNR { A[$0]=1; next; } { if ($0 in A) { A[$0]=0; } } END { for (k in A) { if (A[k]==1) { print k; } } }'  A   B

otheus

02-18-2008

Very promising awk script ...

Thanks otheus!
Nothing quite like a one-line cryptic awk script from a guru ... with a few minor typo corrections it shows excellent promise ... trying it with the giant files and the real data is going to need to wait for tomorrow. SWEET! (I'll post back here with some timing results.)

And thanks to the other folks who replied - this is indeed an incredible resource!

Quote:

# FNR is the record number within the current input file - it is reset
# when the next FILE is started. NR is the total number of records read
# so far and it is not reset ... so NR==FNR holds only while the first
# file is being read. The first line effectively creates an associative
# array keyed by the lines in the first input file and marks them with a
# value of "1". Then the second line effectively examines the lines in
# the second file and sets the value to zero for any line that is there.
# Thus - by the time it finishes, only those lines in file A but NOT in
# file B will have a value of "1". And then we print those lines (the keys).

awk ' NR==FNR { A[$0]=1; next; }
{ if ($0 in A) { A[$0]=0; } }
END { for (k in A) { if (A[k]==1) { print k; } } } ' $FILE1 $FILE2

-----------------------------------------------------------

The output from a test run follows:

FILE1:
1
2
3
4
5

FILE2:
5
3
1

AND THE OUTPUT IS:
4
2
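
The output above arrives as 4 then 2 because awk visits the keys of an array in an unspecified order in a "for (k in A)" loop. If the original order of file A matters, one possible variant (a sketch only, with the made-up function name lines_only_in_a_ordered) also remembers each line by its line number and walks those numbers in the END block. It assumes the two arguments are different paths, and it holds file A in memory roughly twice.

```shell
# Hypothetical helper: print the lines of file $1 that are not in file $2,
# keeping the order (and any duplicates) of file $1.
lines_only_in_a_ordered() {
    awk '
        # First file: remember each line by number and mark it as a keeper.
        # FILENAME == ARGV[1] is used instead of NR==FNR so that an empty
        # first file does not cause the second file to be mistaken for it.
        FILENAME == ARGV[1] { line[FNR] = $0; n = FNR; keep[$0] = 1; next }

        # Second file: unmark any line that also occurs in the first file.
        ($0 in keep) { keep[$0] = 0 }

        # Walk the first file in its original order, printing the keepers.
        END { for (i = 1; i <= n; i++) if (keep[line[i]]) print line[i] }
    ' "$1" "$2"
}

# Usage: lines_only_in_a_ordered "$FILE1" "$FILE2"
```

Only file A is hashed, as in the script above, so memory still scales with the smaller file while file B is streamed once.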

gneen

02-19-2008

Code:

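# patterns come from fileB (-f), as fixed strings (-F) matching whole lines (-x); -v keeps the lines of fileA that match none
grep -v -x -F -f fileB fileA > output.txt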

vino
