Determining number of overlaps between two files using Hashes?

09-15-2008

Registered User

317, 0

Join Date: Apr 2008

Last Activity: 22 May 2013, 8:38 AM EDT

Location: Calgary

Posts: 317

Thanks Given: 0

Thanked 0 Times in 0 Posts

I took another look at the block, and here's what I've come up with....

Code:

while (@fileA)  #open the file
{
   chomp;
   @lineFile1Array = split (/\t/,$_);   #split the line into temporary array elements
   @tempStart = split (/,/,$lineArray[5]);
   @tempEnd = split (/,/,$lineArray[6]);
   my @newLineArray;
   while ($count > $lineArray[4])   # changed this to ">" - the number in the file might be 1-6
                                    # but we'd have array elements 0-5
   {
      # - we grab the first element from [5] and the first element from [6]
      ###############Array construction removed - not needed
      while $line(@fileB)
      {
         @lineFile2Array = split (/\t/,$line);
          if (($lineFile2Array[1] >= $tempStart[$count]) && ($lineFile2Array[2] <= $tempEnd[$count]))
         {
            #This is where you will write the code to create your new file
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that element)
      $count++;     # We increment afterward, so that the next time that the evaluation
                    # of $count > $lineArray[4], we should stop if we've reached the
                    # number of pairs for this line.
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   }
# If no match found, move on to the next line in file2
}

Last edited by avronius; 09-15-2008 at 03:52 PM.. Reason: Changed that darned @_ back to $_...

avronius

View Public Profile for avronius

Find all posts by avronius

09-15-2008

Registered User

317, 0

Join Date: Apr 2008

Last Activity: 22 May 2013, 8:38 AM EDT

Location: Calgary

Posts: 317

Thanks Given: 0

Thanked 0 Times in 0 Posts

Oh, and delete this line:

Code:

   my @newLineArray;

It's not required

avronius

View Public Profile for avronius

Find all posts by avronius

09-15-2008

Registered User

60, 0

Join Date: Sep 2008

Last Activity: 14 November 2008, 3:30 AM EST

Location: San Diego, CA

Posts: 60

Thanks Given: 0

Thanked 0 Times in 0 Posts

thanks for sorting it out for me

.

i'm getting this error

Quote:

syntax error at overlap.pl line 122, near "while $line"
syntax error at overlap.pl line 139, near "}"
Execution of overlap.pl aborted due to compilation errors.

any idea why ?

labrazil

View Public Profile for labrazil

Find all posts by labrazil

09-15-2008

Registered User

317, 0

Join Date: Apr 2008

Last Activity: 22 May 2013, 8:38 AM EDT

Location: Calgary

Posts: 317

Thanks Given: 0

Thanked 0 Times in 0 Posts

ok - here's what I have so far.....

(it seems to work - needs some error checking for when there's no value found...)

Good luck - let me know if it comes close to doing what you need!

Code:

#!/usr/bin/perl -w

################################################################################
################################################################################
# What this script does:                                                       #
# This script can be used to compare data overlap                              #
#                                                                              #
# How this script works:                                                       #
# It parses two external files and identifies which line entries fall within   #
# the match statement.                                                         #
#                                                                              #
# Where this script is run (and by whom):                                      #
# This script should be run from a host where the files can be accessed for    #
# comparison                                                                   #
#                                                                              #
# Revision history:                                                            #
# September 15, 2008                                                           #
#    AKG - File creation                                                       #
################################################################################
################################################################################


################################################################################
################################################################################
# Define Pragma                                                                #
################################################################################
################################################################################
use strict;
use Getopt::Std;
use vars qw/ %opt /;
use Time::HiRes qw(gettimeofday);

################################################################################
# Define Variables                                                             #
################################################################################
#my $dir        = "/usr/local/overlap"; # base directory
my $dir        = "/opt/home/agray/scripting/overlap"; # base directory
my $DEBUG;
my @tm;
my $logfile;
my $timeStamp;
my @fileA;
my @fileB;
my $count;
my @lineFile1Array;
my @lineFile2Array;
my @tempStart;
my @tempEnd;
my $line;

################################################################################
# Define Prerequisites  (Require / Include statements go here)                 #
################################################################################

################################################################################
# Forward declaration of subroutines                                           #
################################################################################
sub do_init();              # This manages the command line options
sub do_usage();             # This is the usage message for do_init()

################################################################################
################################################################################
# MAIN                                                                         #
################################################################################
################################################################################

# Parse command line variables                                                 #
do_init();

# set DEBUG flag according to command line options                             #
if ( $opt{d} )
{
   $DEBUG = 1;
}
else
{
   $DEBUG = 0;
}

# Next, create a timestamp and logfile to store information                    #
# getTimestamp
@tm = (localtime($^T))[0..5];
++$tm[4];
$tm[5] += 1900;
$timeStamp = sprintf("%04d%02d%02d.%02d%02d%02d", reverse @tm);
$logfile = "$dir/log/$0.$timeStamp.log";

open LOG, ">> $logfile" or die "Can't open $logfile for write: $!";
print LOG "$timeStamp: running $0\n";


################################################################################
# Open files into array                                                        #
################################################################################

open(FILEA, "<$opt{a}") or die "Cannot open $opt{a} for read :$!";
@fileA = <FILEA>;
close( FILEA );

open(FILEB, "<$opt{b}") or die "Cannot open $opt{b} for read :$!";
@fileB = <FILEB>;
close( FILEB );


# Open the first file
foreach (@fileA)
{
   chomp;
   $count = 0;
   @lineFile1Array = split /\t/,$_;   #split the line into temporary array elements
   my $numberOfElements = $lineFile1Array[4];
   @tempStart = split /,/,$lineFile1Array[5];
   @tempEnd = split /,/,$lineFile1Array[6];
   while ($count < $numberOfElements)
   {
      # changed this to lessthan - the number in the file might be 6 but we have array elements 0-5
      # We grab the first element from [5] and the first element from [6]
      foreach (@fileB)
      {
         @lineFile2Array = split /\t/,$_;
         if (($lineFile2Array[1] >= $tempStart[$count]) && ($lineFile2Array[2] <= $tempEnd[$count]))
         {
            print LOG "Match!   $lineFile2Array[1] ~ $tempStart[$count]\n";
            print LOG "Match!   $lineFile2Array[2] ~ $tempEnd[$count]\n";
         }
      # when done evaluating that element
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   $count++;     # We increment afterward, so that the next time that the evaluation
                 # of $count > $lineArray[4], we should stop if we've reached the
                 # number of pairs for this line.
   }
# If no match found, move on to the next line in file2
}

close( LOG );




################################################################################
################################################################################
# Subroutines                                                                  #
################################################################################
################################################################################
sub do_init()
{
   my $opt_string = 'hda:b:';
   getopts( "$opt_string", \%opt ) or do_usage();
   do_usage() if $opt{h};
   do_usage() unless ($opt{a} && $opt{b});
}


sub do_usage()
{
   print "\nusage: $0 [-h] [-d] [-a file1] [-b file2]\n\n";
   #############################################################################
   print "\n\n";
   exit;
}

Here are the results based on the two files that you provided (You'll likely want to include more information - this is just a sample)

Code:

20080915.134301: running ./overlap.pl
Match!   100208130 ~ 100208127
Match!   100208166 ~ 100208306
Match!   100231689 ~ 100231680
Match!   100231725 ~ 100231885

avronius

View Public Profile for avronius

Find all posts by avronius

09-15-2008

Registered User

1,714, 63

Join Date: Apr 2004

Last Activity: 15 May 2020, 11:27 AM EDT

Location: Bordeaux, France

Posts: 1,714

Thanks Given: 2

Thanked 63 Times in 59 Posts

Quote:

Originally Posted by labrazil

Hi there,

I have a doubt about how to set this up. This is the situation.

I have two files, one that is ~31,000 in length and has the following information (7 fields):
file1

Code:

1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

the 5th field is important and it explains the number of segments represented in fields 6 and 7. So for example, the first line shows 6, so if you took the first number of field 6 this would represent the start of the first segment and the first number of field 7 would represent the end of the first segment, and so on till you have the total 6 segments. The second line for example shows only 1 in field 5 and hence there's only one segment starting at 100217082 and ending at 100217185.

the second file I have is variable in length and can be from 3,000,000 to 10,000,000 lines. The format contains 4 fields:
file2

Code:

1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, field 2 and 3 is just a difference of 36 numbers and I want to know how many times each line in file2 is contained within file1 specifically when looking at the segments (remember each line in file1 has different numbers of segments above, e.g. 6, 1, and 12 as represented in field 5).

So if I use these two files to generate my output, my output would tell me:

There are 3 lines from file2 that matches or overlaps segments in file1 and 1 line from file2 that DOESNOT match or overlap segments in file1.

Code:

YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

To get this kind of computation, do you think it's important to use hashes for the first file or second file and if so, how would I set this up? Can someone assist here? Thanks!

Just for the fun, a solution with awk :

Code:

awk '
NR==FNR {
   seg_count++;
   Segs[seg_count, "min"  ] = $3;
   Segs[seg_count, "max"  ] = $4;
   Segs[seg_count, "count"] = $5;

   split($6, smin, /,/);
   split($7, smax, /,/);

   for (i=1; i<=$5; i++) {
      Segs[seg_count, i, "min"] = smin[i];
      Segs[seg_count, i, "max"] = smax[i]
   }

   next;
}
{
   for (s=1; s<=seg_count; s++) {
      if ($2 >= Segs[s, "min"] && $3 <= Segs[s, "max"]) {
         for (i=1; i<=Segs[s, "count"]; i++) {
            if ($2 >= Segs[s, i, "min"] && $3 <= Segs[s, i, "max"]) {
               printf("YES %s => Match line #%d segment #%d : %d,%d\n",
                      $0, s, i, Segs[s, i, "min"], Segs[s, i, "max"]);
               next;
            }
         }
      }
   }
   printf("NO  %s\n", $0);
}

' labrazil_1.dat labrazil_2.dat

Output:

Code:

YES 1    100208130       100208166       + => Match line #1 segment #1 : 100208127,100208306
NO  1    100208310       100208346       +
YES 1    100217090       100217126       + => Match line #2 segment #1 : 100217082,100217185
YES 1    100231689       100231725       + => Match line #1 segment #2 : 100231680,100231885

Jean-Pierre.

aigles

View Public Profile for aigles

Find all posts by aigles

09-15-2008

Registered User

60, 0

Join Date: Sep 2008

Last Activity: 14 November 2008, 3:30 AM EST

Location: San Diego, CA

Posts: 60

Thanks Given: 0

Thanked 0 Times in 0 Posts

that was fun, thanks Jean-Pierre!

I was working on the perl by avronius and when using file1 with 4000 lines and file2 with 276,000 lines, it seems to be running on overdrive. the log file is populating at about 4Kb every 5 min or so. It hasn't finished yet but it's taking a long time already, going on 30 min now. i wonder if there's a more efficient way to handle this?

labrazil

View Public Profile for labrazil

Find all posts by labrazil

09-16-2008

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

shell script

Code:

#!/bin/sh
# file1 = smaller file2 = large file
set -x
tr -s ',' ' ' < file1 | nl |\
awk '{ for(i=7; i<NF; i++) {j=i+1; print $1, $i, $j} }' | \
 sort -n -k1.1,1.6 -k2.1,2.10 > filea
 
 
nl file2 | awk '{print $1, $3, $4 }' | \
  sort -n -k2.1,2.10 > fileb

wc -l filea | read maxrow1 dummy
wc -l fileb | read maxrow2 dummy

if [ -x ./overlap ]; then
  overlap filea $maxrow1 fileb $maxrow2  > results.txt
else
   # gcc -o overlap overlap.c
   cc -o overlap overlap.c
   overlap filea $maxrow1 fileb $maxrow2 > results.txt
fi   
echo "results.txt created"
exit 0

C code:

Code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define chk(x) if ( (x)==NULL){ perror("memory error"); exit(1);}
/*******************
*      overlap calculation - 
* usage: overlap <small_filename> <linecount> <big_filename> <linecount>
* run in overlap.sh
* output to stdout
* assume at least 32 bit int
********************/

typedef
struct 
{
	unsigned long row;
	unsigned long val1;
	unsigned long val2;
} row_t;


/* return NULL if not overlap 
* overlap == any number in the range of sm->val1 .. sm->val2  is 
*                equal to or inside any number in 
*                the range of big->val1 ..big->val2 or
*            there exists at least one common number in each range.
*/
const row_t *show_overlap(const row_t *sm, const row_t *big)
{
    
    int retval=0;
    if(sm->row && big->row) 
    {   /* greatest sm number is less than least big number
    	*  or smallest sm number is bigger than greatest big number        
    	*/
		retval=(sm->val2 <  big->val1 || sm->val1 > big->val2)? 0: 1;
		if(retval)
	   		printf("%d: %d  %d\n", sm->row, big->val1, big->val2);
	}
	return (retval)? sm : NULL;
}
char **whack( char **arr, char *src, const char *delimiter)
{   
	int i=0;
	char *p=strtok(src, delimiter);
	while(p!=NULL  && i < 4)
	{
		arr[i++]=p;
		arr[i]=NULL;
		p=strtok('\0', delimiter);
	}                                                             
    return arr;                                              
}

void load(row_t *b,  FILE *fp)
{
	char tmp[256]={0x0};
	while(fgets(tmp, sizeof(tmp), fp)!=NULL)
	{
	    char *p[8]={NULL};  
	    whack(p, tmp, " ");
		b->row= strtoul(p[0], NULL, 10);
		b->val1=strtoul(p[1], NULL, 10);
		b->val2=strtoul(p[2], NULL, 10);
		b++;
	}
	b->row=b->val1=b->val2=0;
	fclose(fp);
}

void overlap(row_t *sm, row_t *bg)
{
	
	register row_t *p=bg;
	while(sm->row)
	{   
	
		while(sm->val2 < p->val1 && p->row) p++;
		while(show_overlap(sm, p) !=NULL) p++;		    			
		sm++;
		if(sm->row) /* back further down big as needed */
		   while(p->val2 >= sm->val1 && p>bg) p--;
	}
}

void disp(row_t *p, char *a)
{    int i=0;
	while(p->row)
	{
		printf("%s i=%d %d %d %d \n", 
		    a, i, p->row, p->val1, p->val2);
	    p++;	i++;
	}	
	
}

int process(FILE *sm, const int count1, 
            FILE *bg, const int count2)
{

    row_t *big=malloc(sizeof(row_t) * count2);
	row_t *small=malloc(sizeof(row_t) * count1);
	
	chk(small); /* memory check */
	chk(big);	
	
	load(big, bg);  /* create array of structs */
	load(small, sm);
/*	disp(big, "big"); turn on to see what is being loaded */
/*	disp(small, "small"); */
	
	overlap(small, big); /* print overlaps */
	free(small);
	free(big);
	return 0;
}

int main(int argc, char **argv)
{
	if(argc == 5)
	{
		int count1=0;
		int count2=0;
	    FILE *sm=fopen(argv[1], "r");
	    FILE *bg=fopen(argv[3], "r");		
		if(sm == NULL || bg == NULL)
		{		    
		    perror("Cannot open input files");
		    exit(1);
		}
		count1=atoi(argv[2]);
		count2=atoi(argv[4]);
		count1++; count2++;
		process(sm, count1, bg, count2);		  
	}
	else
	{
		printf("Invalid number of arguments\n");
		exit(1);
	}
	return 0;
}

The C search completes in less than O(2n) time. With 500 records in filea and 24000 records in fileb on HPUX PA-RISC 11.23 that has load:

Code:

time overlap filea $maxrow1 fileb $maxrow2  > /dev/null
real    0m1.51s
user    0m0.44s
sys     0m0.30s

You need to scutinize this line of code in show_overlap()

Code:

retval=(sm->val2 <  big->val1 || sm->val1 > big->val2)? 0: 1;

It decides what an overlap is and if the current values meet the criteria. Otherwise the code works by almost reading thru the "big" array only one time, a little futzing back and forth included. Unless you have tons of RAM I would only run this against fileb's that are less than 5 million records or so. You also need to get the output of ulimit to see if the sysadmin put memory limits on your process. You don't want that.

If you want changes you get to make 'em. If you're on linux use gcc instead of cc in the shell script. The shell script and the C code go in the same directory.

The reason it is in C is because our perl is v 5.1 - really old. C is generally faster than perl, FWIW.

jim mcnamara

View Public Profile for jim mcnamara

Find all posts by jim mcnamara

Shell Programming and Scripting

Determining number of overlaps between two files using Hashes?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Base64 conversion in awk overlaps

Discussion started by: busyboy

2. Solaris

Determining number of hard disks in the system

Discussion started by: Mick

3. Shell Programming and Scripting

How to count number of files in directory and write to new file with number of files and their name?

Discussion started by: Akshay Hegde

4. Red Hat

Crontab: overlaps

Discussion started by: ionral

5. Shell Programming and Scripting

Compare values of hashes of hash for n number of hash in perl without sorting.

Discussion started by: asak

6. Shell Programming and Scripting

awk? create similarity matrix by calculating overlaps between sets comprising of individual parts

Discussion started by: stonemonkey

7. UNIX for Dummies Questions & Answers

Determining file size for a list of files with paths

Discussion started by: joe8mofo

8. Shell Programming and Scripting

Creating Hashes of Hashes of Array

Discussion started by: ckv84

9. Shell Programming and Scripting

Perl Hashes, reading and hashing 2 files

Discussion started by: silkiechicken

10. Programming

determining the object files...

Discussion started by: xyzt