compare 2 very large lists of different length (UNIX for Dummies Questions & Answers). Post 302385817 by gaurav1086 on Sunday 10th of January 2010 06:08:59 AM
Quote:
Originally Posted by uiop44
I have two very large datasets (>100MB) in a simple vertical list format. They are of different size and with different order and formatting (e.g. whitespace and some other minor cruft that would thwart easy regex).

Let's call them set1 and set2.

I want to check set2 to see if it contains any of the data entries in set1. I think of this as individual greps of set2 using each line of set1.

(NB- I could, with some work, manipulate the sets to make the order and formatting the same.)

In your opinion, what is the best tool to use for this search of set2 using the data in set1?

- comm?
- a looping shell script, or xargs, that calls grep?
- grep -f?
- diff?
- combine the sets (after making format the same) then sort and print only duplicate lines? uniq -d, sed or awk

I would implement it like this:
1. Sort both datasets. (Presorting always helps performance, especially if the datasets need to be updated frequently and the comparison is run many times.)

2. Let's say set1 has index1 and set2 has index2, both starting at the first element.

3. Add a very large element, say 10000000 (anything larger than the largest element in set1 and set2), to the end of set1. See below why.

4. while not(end of set2) // set2 is the bottleneck
{ if (set1[index1] == set2[index2]) { print set2[index2]; index2++; } // match: report it, move on in set2
else if (set2[index2] < set1[index1]) index2++; // this set2 entry is not in set1
else index1++; // set2[index2] > set1[index1]: set1 has to catch up
}

P.S. We added the large element so that index1 can never run past the end of set1 while set2 is being exhausted, which is the condition of the while loop. You could instead have written while not(end of set1) and not(end of set2) {}, but since you only want to check set2, we stick to the above so the loop needs just one end-of-list test.
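
For anyone who wants to try the idea, here is a rough Python sketch of the sorted walk described above. It is only an illustration under some assumptions: the entries are treated as numbers that already fit in memory as lists, `math.inf` stands in for the "very large element", and the function name `common_entries` is made up for this example. Real data with whitespace and other cruft would need cleaning up first.

```python
import math


def common_entries(set1, set2):
    """Return the entries of set2 that also occur in set1.

    Assumes numeric, finite entries (hypothetical helper, not from the thread).
    """
    a = sorted(set1)
    b = sorted(set2)
    # Sentinel: larger than any real entry, so index i never runs off the end of a.
    a.append(math.inf)

    matches = []
    i = j = 0
    while j < len(b):              # only set2 has to be exhausted
        if a[i] == b[j]:
            matches.append(b[j])   # match: report it and move on in set2
            j += 1
        elif b[j] < a[i]:
            j += 1                 # this set2 entry is not in set1
        else:
            i += 1                 # set1 has to catch up


    return matches


if __name__ == "__main__":
    print(common_entries([5, 1, 9, 3], [3, 4, 9, 9, 10]))
```

After the two sorts, each list is walked once, so the comparison itself is linear in the combined length. Duplicates in set2 are reported every time they occur, because only index j advances on a match.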


enjoy,
Gaurav

Last edited by gaurav1086; 01-10-2010 at 07:17 AM..
 
