Determining number of overlaps between two files using Hashes?


 
# 8  
Old 09-15-2008
For 10 million records your output will be 31000 * 10 million lines of YES or NO, which is 310 billion lines. Is this what you really want? And does your filesystem support file sizes over 4GB (so-called large files)?
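The size of that output can be checked with a few lines of shell arithmetic. This is only a rough sketch: the two counts are the figures quoted above, and the 4 bytes per line is an assumption ("YES" plus a newline is 4 bytes, "NO" plus a newline is 3), so the result is an upper estimate.

```shell
factor=31000             # the 31000 multiplier quoted in the post
records=10000000         # 10 million records, as quoted in the post
bytes_per_line=4         # assumption: "YES\n" is 4 bytes, "NO\n" is 3

# Total number of YES/NO lines the comparison would write.
lines=$((factor * records))

# Upper estimate of the output size in bytes.
bytes=$((lines * bytes_per_line))

echo "output lines: $lines"
echo "approx size:  $((bytes / 1000000000)) GB"
```

On a shell with 64-bit arithmetic this reports 310000000000 lines and roughly 1240 GB, which is far beyond the 4GB large-file threshold.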
# 9  
Old 09-15-2008
That's a good point, Jim... it would clobber a small server and take a long time to run.
My requirements were for about 1 million yes/no lines, which is a much smaller footprint.

It is one way to do the task, but I'd be happy to learn a more elegant (and less memory-intensive) method, too!
# 10  
Old 09-15-2008
Quote:
Originally Posted by jim mcnamara
For 10 million records your output will be 31000 * 10 million lines of YES or NO, which is 310 billion lines. Is this what you really want? And does your filesystem support file sizes over 4GB (so-called large files)?
Hi Jim,

Yes, unfortunately this is what I need. I'm analyzing genetic data, and working with large amounts of data is quite common. I'm not sure what you mean by filesystem, but the hardware I have should be sufficient (RAM is 8GB). In the worst case, I can break the files up into 24 sections (one per human chromosome), which would leave each section with anywhere from 20,000 to 1,000,000 entries.
# 11  
Old 09-15-2008
Would you be breaking BOTH files into 24 files each?
If so, this would reduce the overhead by a large amount, as each section only needs to be compared against 1/24th of the entire set, rather than each section being compared against the entire set.
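Splitting a file by chromosome could be sketched as below. This is not taken from the original script: it assumes the fields are whitespace-separated and that the first field is the chromosome label, and the function name and file names are hypothetical.

```shell
# split_by_chrom INPUT PREFIX
# Writes each line of INPUT to PREFIX.<chromosome>.txt.
# Assumption: field 1 of every line is the chromosome label (e.g. chr1).
split_by_chrom() {
    awk -v prefix="$2" '{ print > (prefix "." $1 ".txt") }' "$1"
}

# Hypothetical usage (file names are placeholders):
#   split_by_chrom fileA.txt A
#   split_by_chrom fileB.txt B
#   scriptname.pl A.chr1.txt B.chr1.txt
```

With both files split this way, each of the 24 runs compares one section against the matching section only, so the total number of comparisons drops to about 1/24th of the all-against-all case when the sections are similar in size.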
# 12  
Old 09-15-2008
Quote:
Originally Posted by avronius
Would you be breaking BOTH files into 24 files each?
If so, this would reduce the overhead by a large amount, as each section only needs to be compared against 1/24th of the entire set, rather than each section being compared against the entire set.
Yes, I can. Currently each file contains all the chromosomes, but I can extract each chromosome from each file into a separate file and use that when running the script.
# 13  
Old 09-15-2008
When you run your script, prefix it with time so you can see how long it takes to complete:
Code:
time scriptname.pl variable1 variable2

This *should* show you how long it's taken to run the comparison.

I'd test it first with the smallest pair, then again with the largest pair. Plan to start the largest pair just before you go home at the end of the day, so it can run overnight...

If the smallest pair takes FAR too long to run, then I wouldn't recommend running the largest pair!
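Before even the smallest pair, a trial run on a cut-down sample gives a quick first timing. A minimal sketch, assuming the inputs are plain line-oriented text files; the function name and file names are hypothetical, and the first lines of a file may not be representative (for example, if the file is sorted by chromosome).

```shell
# make_sample FILE N
# Copies the first N lines of FILE to FILE.sample for a quick trial run.
make_sample() {
    head -n "$2" "$1" > "$1.sample"
}

# Hypothetical usage (file names are placeholders):
#   make_sample fileA.txt 1000
#   make_sample fileB.txt 1000
#   time scriptname.pl fileA.txt.sample fileB.txt.sample
```

Doubling the sample size and timing again gives a rough idea of how the run time grows before committing to the full files.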
# 14  
Old 09-15-2008
Quote:
Originally Posted by avronius
When you run your script, prefix it with time so you can see how long it takes to complete:
Code:
time scriptname.pl variable1 variable2

This *should* show you how long it's taken to run the comparison.

I'd test it first with the smallest pair, then again with the largest pair. Plan to start the largest pair just before you go home at the end of the day, so it can run overnight...

If the smallest pair takes FAR too long to run, then I wouldn't recommend running the largest pair!
Thanks for the tip. I'm still not sure how to organize the script to test it.