01-10-2010
Quote: Originally posted by uiop44
I have two very large datasets (>100MB) in a simple vertical list format. They are of different size and with different order and formatting (e.g. whitespace and some other minor cruft that would thwart easy regex).
Let's call them set1 and set2.
I want to check set2 to see if it contains any of the data entries in set1. I think of this as individual greps of set2 using each line of set1.
(NB- I could, with some work, manipulate the sets to make the order and formatting the same.)
In your opinion, what is the best tool to use for this search of set2 using the data in set1?
- comm?
- a looping shell script, or xargs, that calls grep?
- grep -f?
- diff?
- combine the sets (after making format the same) then sort and print only duplicate lines? uniq -d, sed or awk
I would implement it like this:
1. Sort both datasets. (Presorting always helps performance, especially if the datasets are updated frequently and the comparison is run many times.)
2. Let's say set1 is walked with index1 and set2 with index2.
3. Append a very large sentinel element, say 10000000 (anything larger than the largest element in set1 and set2), to the end of set1. See below for why.
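4. while not(end of set2) // set2 is the bottleneck
{
if (set1[index1] == set2[index2]) { print set2[index2]; index2++ } // match: report it and move on in set2
else if (set2[index2] < set1[index1]) index2++ // set2 entry is too small to be in set1
else index1++ // set2[index2] > set1[index1]: catch up in set1
}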
P.S. We added the large element to set1 so that index1 never runs off the end of set1 before set2 is exhausted, which is the only condition the while loop tests. You could instead write while not(end of set1) and not(end of set2) {}, but since you only want to check set2, the single test saves one comparison per iteration.
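The steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the names normalize and common_entries are made up for the example, the cleanup just collapses whitespace (real "cruft" would need its own rules), and bounds checks on both indexes are used in place of the large sentinel element.

```python
def normalize(line):
    """Hypothetical cleanup: trim the line and collapse runs of whitespace."""
    return " ".join(line.split())


def common_entries(set1_lines, set2_lines):
    """Return the entries of set2 that also occur in set1 (sorted-merge walk)."""
    # Step 1: sort both datasets (after normalizing the formatting).
    set1 = sorted(normalize(x) for x in set1_lines)
    set2 = sorted(normalize(x) for x in set2_lines)
    matches = []
    index1 = index2 = 0
    # Checking both bounds here replaces the sentinel from step 3.
    while index1 < len(set1) and index2 < len(set2):
        if set2[index2] == set1[index1]:
            matches.append(set2[index2])
            index2 += 1  # keep index1 so repeated set2 entries still match
        elif set2[index2] < set1[index1]:
            index2 += 1  # set2 entry is too small to be in set1
        else:
            index1 += 1  # catch up in set1
    return matches


if __name__ == "__main__":
    set1 = ["  foo bar", "baz", "qux "]
    set2 = ["qux", "nope", "foo   bar", "foo bar"]
    print(common_entries(set1, set2))
```

Both inputs can be any iterables of lines, for example open file objects. Sorting dominates the cost; the walk itself is a single pass over each list.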
enjoy,
Gaurav
Last edited by gaurav1086; 01-10-2010 at 07:17 AM..