Quote:
Originally Posted by uiop44
I have two very large datasets (>100MB) in a simple vertical list format. They are of different sizes, with different ordering and formatting (e.g. whitespace and other minor cruft that would thwart an easy regex).
Let's call them set1 and set2.
I want to check set2 to see if it contains any of the data entries in set1. I think of this as individual greps of set2 using each line of set1.
(NB- I could, with some work, manipulate the sets to make the order and formatting the same.)
In your opinion, what is the best tool to use for this search of set2 using the data in set1?
- comm?
- a looping shell script, or xargs, that calls grep?
- grep -f?
- diff?
- combine the sets (after making the formatting the same), then sort and print only duplicate lines? (uniq -d, sed or awk)
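If you can normalize the formatting first, the last two options on your list collapse into one-liners. A minimal sketch, assuming trimming surrounding whitespace is the only cleanup needed (the normalize function and the set1.clean/set2.clean filenames are just placeholders for whatever your cleanup step produces):

    # strip leading/trailing whitespace, then sort and de-duplicate
    normalize() { sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$1" | sort -u; }

    # lines common to both sets (comm requires sorted input); uses bash process substitution
    comm -12 <(normalize set1) <(normalize set2)

    # or, without sorting: treat each line of set1 as a fixed (-F), whole-line (-x) pattern
    grep -Fxf set1.clean set2.clean

Note that grep -Fxf holds all of set1 in memory as patterns, which gets heavy with a >100MB file; the sort-then-comm route scales better.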
That said, if you want to implement the comparison yourself, I would do it like this:
1. Sort both datasets. (Presorting always helps performance, especially if the datasets need to be updated frequently and the comparison is run many times.)
2. Let's say set1 is read with index1 and set2 with index2.
3. Append a sentinel element to set1 that sorts after every entry in either set (say 10000000, if the data are numeric). See the P.S. for why.
4. while not(end of set2)                              // set2 drives the loop
   {
       if (set1[index1] == set2[index2]) { print set2[index2]; index2++; }
       else if (set2[index2] < set1[index1])  index2++;
       else                                   index1++;  // set2[index2] > set1[index1]
   }
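Here is the same merge as runnable bash, a sketch only: it uses explicit end-of-file flags instead of the sentinel from step 3, and assumes set1.sorted/set2.sorted were produced under the same collation the comparison uses. (In practice a read loop in bash is far slower than comm on files this size; this is just to make step 4 concrete.)

    export LC_ALL=C            # make [[ < ]] agree with the order "sort" used
    exec 3<set1.sorted 4<set2.sorted
    read -r a <&3 || a_eof=1
    read -r b <&4 || b_eof=1
    while [ -z "$b_eof" ]; do
        if [ -n "$a_eof" ]; then          # set1 exhausted: nothing left can match
            break
        elif [ "$a" = "$b" ]; then        # match: print the set2 line, advance set2
            printf '%s\n' "$b"
            read -r b <&4 || b_eof=1
        elif [[ "$b" < "$a" ]]; then      # set2 line sorts earlier: advance set2
            read -r b <&4 || b_eof=1
        else                              # set1 line sorts earlier: advance set1
            read -r a <&3 || a_eof=1
        fi
    done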
P.S. The sentinel from step 3 is what lets the loop run until set2 is exhausted, which is the only condition the while loop checks: without it, index1 could run off the end of set1 before set2 is finished. You could instead write while not(end of set1) and not(end of set2) {}, but since you only want to check set2, the sentinel keeps the loop condition simpler.
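For completeness: on sorted text files this merge is exactly what comm -12 (shown above) performs, so you rarely need to hand-code it. And if set1 fits in memory, you can skip sorting entirely with a hash lookup in awk, a different technique from the merge, sketched here assuming one cleaned-up entry per line:

    # read set1 into an array, then print each set2 line that appears in it
    awk 'NR==FNR { seen[$0]; next } $0 in seen' set1.clean set2.clean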
enjoy,
Gaurav