Determining number of overlaps between two files using Hashes?


 
# 1  
Old 09-14-2008

Hi there,

I have a question about how to set this up. Here is the situation.

I have two files. The first is ~31,000 lines long and has the following information (7 fields):
file1
Code:
1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

The 5th field is important: it gives the number of segments represented in fields 6 and 7. For example, the first line shows 6, so the first number in field 6 is the start of the first segment and the first number in field 7 is the end of the first segment, and so on until you have all 6 segments. The second line shows only 1 in field 5, so there is only one segment, starting at 100217082 and ending at 100217185.
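A minimal sketch of how one file1 line could be unpacked into start/end pairs as described above. It assumes the fields are tab-separated (the sample spacing may have been altered when the post was rendered), and `parse_segments` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Turn one file1 line into a list of [start, end] pairs.
# parse_segments is a hypothetical name; tab-separated input is assumed.
sub parse_segments {
    my ($line) = @_;
    chomp $line;
    my @field  = split /\t/, $line;
    my @starts = split /,/, $field[5];   # split drops the empty field after the trailing comma
    my @ends   = split /,/, $field[6];
    die "segment count mismatch in: $line\n"
        unless @starts == $field[4] && @ends == $field[4];
    return map { [ $starts[$_], $ends[$_] ] } 0 .. $#starts;
}

# The second sample line from file1, rebuilt with tabs.
my $line = join "\t", 1, '+', 100217082, 100217185, 1, '100217082,', '100217185,';
my @segments = parse_segments($line);
print "$_->[0]-$_->[1]\n" for @segments;   # prints 100217082-100217185
```

Checking the pair count against field 5 is optional, but it catches malformed lines early instead of producing half-empty segments.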

The second file varies in length, from 3,000,000 to 10,000,000 lines. It contains 4 fields:
file2
Code:
1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, fields 2 and 3 always differ by 36. I want to know how many lines in file2 are contained within file1, specifically within its segments (remember that each line in file1 has a different number of segments, e.g. 6, 1, and 12, as given in field 5).
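The post uses both "contained within" and "overlaps", which are not the same test. The sketch below shows the two comparisons side by side, assuming closed intervals (both end points belong to the range); the function names and the third example read are made up for illustration.

```perl
use strict;
use warnings;

# Closed intervals are assumed: both end points belong to the range.
# A read is contained when it lies entirely inside the segment.
sub is_contained {
    my ($s, $e, $seg_start, $seg_end) = @_;
    return ($s >= $seg_start && $e <= $seg_end) ? 1 : 0;
}

# A read overlaps when it shares at least one position with the segment.
sub overlaps {
    my ($s, $e, $seg_start, $seg_end) = @_;
    return ($s <= $seg_end && $e >= $seg_start) ? 1 : 0;
}

my @segment = (100208127, 100208306);   # first segment of the first file1 line

# file2 line 1 lies fully inside the segment.
print is_contained(100208130, 100208166, @segment), "\n";   # 1
# file2 line 2 starts after the segment ends: no overlap at all.
print overlaps(100208310, 100208346, @segment), "\n";       # 0
# A made-up read straddling the segment end: overlap without containment.
print overlaps(100208290, 100208326, @segment), "\n";       # 1
print is_contained(100208290, 100208326, @segment), "\n";   # 0
```

For the four sample lines in file2 both tests give the same answers, so the expected output does not settle which one is wanted.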

So if I use these two files to generate my output, my output would tell me:

There are 3 lines from file2 that match or overlap segments in file1 and 1 line from file2 that does NOT match or overlap segments in file1.

Code:
YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

To get this kind of computation, do you think it's important to use hashes for the first file or the second file, and if so, how would I set this up? Can someone assist? Thanks!
# 2  
Old 09-15-2008
Does anyone know if this is doable? I'm really at a loss here. This is what I have so far:

Code:
#!/usr/bin/perl -w
use strict;


my %hash;


my $file = 'file1';
open my $fh, "<", $file or die "Can't open $file: $!";
while (<$fh>) {
	chomp;
	my @field = split /\t/;
	my @start = split /,/, $field[5];
	my @end = split /,/, $field[6];
	
}

Once I have these starts/ends stored, how would you compare file2 against them?

Last edited by labrazil; 09-15-2008 at 10:52 AM..
# 3  
Old 09-15-2008
Are there any other comparisons or output expectations you have not mentioned, perhaps because you thought they did not matter?

A hash is meant for exact-match lookups; it does not help with finding a number that falls within a range, i.e. between two values.
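A hash can still do the exact-match part of the job. The sketch below keys a hash of arrays on field 1 and then does an ordinary range comparison over the segments stored under that key. That field 1 must agree between the two files is an assumption, the names `%segments_by_group` and `in_any_segment` are hypothetical, and the entry under key 2 is made-up data.

```perl
use strict;
use warnings;

# Hash of arrays: exact-match lookup on field 1, then a plain range
# comparison over the segments stored under that key.
my %segments_by_group;
push @{ $segments_by_group{1} }, [100208127, 100208306], [100217082, 100217185];
push @{ $segments_by_group{2} }, [5000, 6000];   # made-up second group

sub in_any_segment {
    my ($group, $start, $end) = @_;
    # An unknown key yields an empty list, so the loop simply does not run.
    for my $seg (@{ $segments_by_group{$group} || [] }) {
        return 1 if $start >= $seg->[0] && $end <= $seg->[1];
    }
    return 0;
}

print in_any_segment(1, 100217090, 100217126) ? "YES\n" : "NO\n";   # YES
print in_any_segment(1, 100208310, 100208346) ? "YES\n" : "NO\n";   # NO
print in_any_segment(3, 100217090, 100217126) ? "YES\n" : "NO\n";   # NO: no such key
```

The hash narrows the search to one group; the range test inside it is still a linear scan.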


The first thing you must do is flatten the line structure, one segment (start and end) per record, and then sort the records if you want the process to complete in a reasonable time.
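One way sorting could pay off, sketched under the assumption that the test is containment: sort the flattened segments by start, keep a running maximum of the end values, and binary-search each file2 line against the sorted starts. The name `contained_in_any` is hypothetical, and the segment list is three pairs taken from the sample data.

```perl
use strict;
use warnings;

# Segments flattened out of file1, deliberately unsorted here.
my @segments = ([100231680, 100231885], [100208127, 100208306], [100217082, 100217185]);

# Sort by start, then record the largest end seen so far at each position.
my @sorted = sort { $a->[0] <=> $b->[0] } @segments;
my @starts = map { $_->[0] } @sorted;
my @max_end;
my $running = -1;
for my $seg (@sorted) {
    $running = $seg->[1] if $seg->[1] > $running;
    push @max_end, $running;
}

# Binary search for the number of segments whose start is <= $s. The read is
# contained in one of them exactly when the running maximum end is >= $e.
sub contained_in_any {
    my ($s, $e) = @_;
    my ($lo, $hi) = (0, scalar @starts);   # search window is [lo, hi)
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($starts[$mid] <= $s) { $lo = $mid + 1 } else { $hi = $mid }
    }
    return ($lo > 0 && $max_end[$lo - 1] >= $e) ? 1 : 0;
}

print contained_in_any(100208130, 100208166) ? "YES\n" : "NO\n";   # YES
print contained_in_any(100208310, 100208346) ? "YES\n" : "NO\n";   # NO
```

Each file2 line then costs a logarithmic number of comparisons instead of a pass over every segment, which matters with millions of lines in file2.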

I don't want to try anything until I'm sure we won't get into a feedback loop: 'Now I need this...'
# 4  
Old 09-15-2008
I once wrote a script that would do a comparison of two files for reporting on changes to a file (like diff, but with output for phb's)
It would open each file into an array.
As it stepped through the first array, it would do any specific line analysis required, then it would go through the second array line by line - looking for matches (or !matches).
It would write that data to an output array.

Next, it would step through the second array, do any specific line analysis required, and then step through the first array line by line, again looking for matches (or !matches). It would write that data to a different output array.

These two arrays had the smaller subset of results that I required, and could be cross matched via a third function.

It's not the nicest way to do this sort of thing, but it did allow me to resolve my requirement in the shortest amount of time, without killing the system.

In this case, each line of file1 gives you an array composed of 6, 1, or 12 number pairs.

Code:
# Assumes @file1Array and @file2Array already hold the lines of file1 and file2
foreach my $line1 (@file1Array)
{
   chomp $line1;
   my @lineFile1Array = split (/\t/, $line1);   #split the line into temporary array elements
   my @tempStart = split (/,/, $lineFile1Array[5]);
   my @tempEnd = split (/,/, $lineFile1Array[6]);

   #create a new array composed of "start:end" pairs, one per segment
   my @newLineArray;
   my $count = 0;
   while ($count < $lineFile1Array[4])   #field 5 holds the number of segments
   {
      push @newLineArray, "$tempStart[$count]:$tempEnd[$count]";
      $count++;
   }

   foreach my $pair (@newLineArray)
   {
      my ($start, $end) = split (/:/, $pair);
      foreach my $line (@file2Array)
      {
         my @lineFile2Array = split (/\t/, $line);
         if (($lineFile2Array[1] >= $start) && ($lineFile2Array[2] <= $end))
         {
            #Match found = write to yourfile ($line still carries its newline)
            print "YES $line";
         }
      # If no match found (or when done evaluating that line), move on to the next line in file2
      }
   # When done with every line in file2, move on to the next segment of the file1 line
   }
# When done with every segment, move on to the next line in file1
}


This is a rough idea - not a complete code snippet (obviously!)

I didn't include code for opening either file - open both files before entering the loop

Last edited by avronius; 09-15-2008 at 12:13 PM.. Reason: added some small amout of clarity
# 5  
Old 09-15-2008
Thank you for the post, jim mcnamara. I can't think of anything else. Knowing whether each line falls within a range or not is what matters for my analysis, as I indicated in the expected output.
# 6  
Old 09-15-2008
Thank you for your post, avronius. I think I understand the logic here, but I'm not sure where $count comes from or how this loop is meant to work:
Code:
while $count >= $lineArray[4]

# 7  
Old 09-15-2008
You'll need to create a counter, $count, starting at 0.
While $count is less than $lineArray[4] (your number of elements), you loop, adding one pair to the new array on each pass (element0start:element0end, element1start:element1end, etc.)

Increment the counter on each pass so that you don't continue trying to create new elements in the new array after you've mapped all the sets.