Determining number of overlaps between two files using Hashes?


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Determining number of overlaps between two files using Hashes?
# 22  
Old 09-15-2008
I took another look at the block, and here's what I've come up with....
Code:
while (@fileA)  #open the file
{
   chomp;
   @lineFile1Array = split (/\t/,$_);   #split the line into temporary array elements
   @tempStart = split (/,/,$lineArray[5]);
   @tempEnd = split (/,/,$lineArray[6]);
   my @newLineArray;
   while ($count > $lineArray[4])   # changed this to ">" - the number in the file might be 1-6
                                    # but we'd have array elements 0-5
   {
      # - we grab the first element from [5] and the first element from [6]
      ###############Array construction removed - not needed
      while $line(@fileB)
      {
         @lineFile2Array = split (/\t/,$line);
          if (($lineFile2Array[1] >= $tempStart[$count]) && ($lineFile2Array[2] <= $tempEnd[$count]))
         {
            #This is where you will write the code to create your new file
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that element)
      $count++;     # We increment afterward, so that the next time that the evaluation
                    # of $count > $lineArray[4], we should stop if we've reached the
                    # number of pairs for this line.
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   }
# If no match found, move on to the next line in file2
}


Last edited by avronius; 09-15-2008 at 03:52 PM.. Reason: Changed that darned @_ back to $_...
# 23  
Old 09-15-2008
Oh, and delete this line:
Code:
   my @newLineArray;

It's not required
# 24  
Old 09-15-2008
thanks for sorting it out for me Smilie.

i'm getting this error
Quote:
syntax error at overlap.pl line 122, near "while $line"
syntax error at overlap.pl line 139, near "}"
Execution of overlap.pl aborted due to compilation errors.
any idea why ?
# 25  
Old 09-15-2008
ok - here's what I have so far.....

(it seems to work - needs some error checking for when there's no value found...)

Good luck - let me know if it comes close to doing what you need!

Code:
#!/usr/bin/perl -w

################################################################################
################################################################################
# What this script does:                                                       #
# This script can be used to compare data overlap                              #
#                                                                              #
# How this script works:                                                       #
# It parses two external files and identifies which line entries fall within   #
# the match statement.                                                         #
#                                                                              #
# Where this script is run (and by whom):                                      #
# This script should be run from a host where the files can be accessed for    #
# comparison                                                                   #
#                                                                              #
# Revision history:                                                            #
# September 15, 2008                                                           #
#    AKG - File creation                                                       #
################################################################################
################################################################################


################################################################################
################################################################################
# Define Pragma                                                                #
################################################################################
################################################################################
use strict;
use Getopt::Std;
use vars qw/ %opt /;
use Time::HiRes qw(gettimeofday);

################################################################################
# Define Variables                                                             #
################################################################################
#my $dir        = "/usr/local/overlap"; # base directory
my $dir        = "/opt/home/agray/scripting/overlap"; # base directory
my $DEBUG;
my @tm;
my $logfile;
my $timeStamp;
my @fileA;
my @fileB;
my $count;
my @lineFile1Array;
my @lineFile2Array;
my @tempStart;
my @tempEnd;
my $line;

################################################################################
# Define Prerequisites  (Require / Include statements go here)                 #
################################################################################

################################################################################
# Forward declaration of subroutines                                           #
################################################################################
sub do_init();              # This manages the command line options
sub do_usage();             # This is the usage message for do_init()

################################################################################
################################################################################
# MAIN                                                                         #
################################################################################
################################################################################

# Parse command line variables                                                 #
do_init();

# set DEBUG flag according to command line options                             #
if ( $opt{d} )
{
   $DEBUG = 1;
}
else
{
   $DEBUG = 0;
}

# Next, create a timestamp and logfile to store information                    #
# getTimestamp
@tm = (localtime($^T))[0..5];
++$tm[4];
$tm[5] += 1900;
$timeStamp = sprintf("%04d%02d%02d.%02d%02d%02d", reverse @tm);
$logfile = "$dir/log/$0.$timeStamp.log";

open LOG, ">> $logfile" or die "Can't open $logfile for write: $!";
print LOG "$timeStamp: running $0\n";


################################################################################
# Open files into array                                                        #
################################################################################

open(FILEA, "<$opt{a}") or die "Cannot open $opt{a} for read :$!";
@fileA = <FILEA>;
close( FILEA );

open(FILEB, "<$opt{b}") or die "Cannot open $opt{b} for read :$!";
@fileB = <FILEB>;
close( FILEB );


# Open the first file
foreach (@fileA)
{
   chomp;
   $count = 0;
   @lineFile1Array = split /\t/,$_;   #split the line into temporary array elements
   my $numberOfElements = $lineFile1Array[4];
   @tempStart = split /,/,$lineFile1Array[5];
   @tempEnd = split /,/,$lineFile1Array[6];
   while ($count < $numberOfElements)
   {
      # changed this to lessthan - the number in the file might be 6 but we have array elements 0-5
      # We grab the first element from [5] and the first element from [6]
      foreach (@fileB)
      {
         @lineFile2Array = split /\t/,$_;
         if (($lineFile2Array[1] >= $tempStart[$count]) && ($lineFile2Array[2] <= $tempEnd[$count]))
         {
            print LOG "Match!   $lineFile2Array[1] ~ $tempStart[$count]\n";
            print LOG "Match!   $lineFile2Array[2] ~ $tempEnd[$count]\n";
         }
      # when done evaluating that element
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   $count++;     # We increment afterward, so that the next time that the evaluation
                 # of $count > $lineArray[4], we should stop if we've reached the
                 # number of pairs for this line.
   }
# If no match found, move on to the next line in file2
}

close( LOG );




################################################################################
################################################################################
# Subroutines                                                                  #
################################################################################
################################################################################
sub do_init()
{
   my $opt_string = 'hda:b:';
   getopts( "$opt_string", \%opt ) or do_usage();
   do_usage() if $opt{h};
   do_usage() unless ($opt{a} && $opt{b});
}


sub do_usage()
{
   print "\nusage: $0 [-h] [-d] [-a file1] [-b file2]\n\n";
   #############################################################################
   print "\n\n";
   exit;
}

Here are the results based on the two files that you provided (You'll likely want to include more information - this is just a sample)
Code:
20080915.134301: running ./overlap.pl
Match!   100208130 ~ 100208127
Match!   100208166 ~ 100208306
Match!   100231689 ~ 100231680
Match!   100231725 ~ 100231885

# 26  
Old 09-15-2008
Quote:
Originally Posted by labrazil
Hi there,

I have a doubt about how to set this up. This is the situation.

I have two files, one that is ~31,000 in length and has the following information (7 fields):
file1
Code:
1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

the 5th field is important and it explains the number of segments represented in fields 6 and 7. So for example, the first line shows 6, so if you took the first number of field 6 this would represent the start of the first segment and the first number of field 7 would represent the end of the first segment, and so on till you have the total 6 segments. The second line for example shows only 1 in field 5 and hence there's only one segment starting at 100217082 and ending at 100217185.

the second file I have is variable in length and can be from 3,000,000 to 10,000,000 lines. The format contains 4 fields:
file2
Code:
1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, field 2 and 3 is just a difference of 36 numbers and I want to know how many times each line in file2 is contained within file1 specifically when looking at the segments (remember each line in file1 has different numbers of segments above, e.g. 6, 1, and 12 as represented in field 5).

So if I use these two files to generate my output, my output would tell me:

There are 3 lines from file2 that matches or overlaps segments in file1 and 1 line from file2 that DOESNOT match or overlap segments in file1.

Code:
YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

To get this kind of computation, do you think it's important to use hashes for the first file or second file and if so, how would I set this up? Can someone assist here? Thanks!
Just for the fun, a solution with awk :
Code:
awk '
NR==FNR {
   seg_count++;
   Segs[seg_count, "min"  ] = $3;
   Segs[seg_count, "max"  ] = $4;
   Segs[seg_count, "count"] = $5;

   split($6, smin, /,/);
   split($7, smax, /,/);

   for (i=1; i<=$5; i++) {
      Segs[seg_count, i, "min"] = smin[i];
      Segs[seg_count, i, "max"] = smax[i]
   }

   next;
}
{
   for (s=1; s<=seg_count; s++) {
      if ($2 >= Segs[s, "min"] && $3 <= Segs[s, "max"]) {
         for (i=1; i<=Segs[s, "count"]; i++) {
            if ($2 >= Segs[s, i, "min"] && $3 <= Segs[s, i, "max"]) {
               printf("YES %s => Match line #%d segment #%d : %d,%d\n",
                      $0, s, i, Segs[s, i, "min"], Segs[s, i, "max"]);
               next;
            }
         }
      }
   }
   printf("NO  %s\n", $0);
}

' labrazil_1.dat labrazil_2.dat

Output:
Code:
YES 1    100208130       100208166       + => Match line #1 segment #1 : 100208127,100208306
NO  1    100208310       100208346       +
YES 1    100217090       100217126       + => Match line #2 segment #1 : 100217082,100217185
YES 1    100231689       100231725       + => Match line #1 segment #2 : 100231680,100231885

Jean-Pierre.
# 27  
Old 09-15-2008
that was fun, thanks Jean-Pierre!

I was working on the perl by avronius and when using file1 with 4000 lines and file2 with 276,000 lines, it seems to be running on overdrive. the log file is populating at about 4Kb every 5 min or so. It hasn't finished yet but it's taking a long time already, going on 30 min now. i wonder if there's a more efficient way to handle this?
# 28  
Old 09-16-2008
shell script
Code:
#!/bin/sh
# file1 = smaller file2 = large file
set -x
tr -s ',' ' ' < file1 | nl |\
awk '{ for(i=7; i<NF; i++) {j=i+1; print $1, $i, $j} }' | \
 sort -n -k1.1,1.6 -k2.1,2.10 > filea
 
 
nl file2 | awk '{print $1, $3, $4 }' | \
  sort -n -k2.1,2.10 > fileb

wc -l filea | read maxrow1 dummy
wc -l fileb | read maxrow2 dummy

if [ -x ./overlap ]; then
  overlap filea $maxrow1 fileb $maxrow2  > results.txt
else
   # gcc -o overlap overlap.c
   cc -o overlap overlap.c
   overlap filea $maxrow1 fileb $maxrow2 > results.txt
fi   
echo "results.txt created"
exit 0

C code:
Code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#define chk(x) if ( (x)==NULL){ perror("memory error"); exit(1);}
/*******************
*      overlap calculation - 
* usage: overlap <small_filename> <linecount> <big_filename> <linecount>
* run in overlap.sh
* output to stdout
* assume at least 32 bit int
********************/

typedef
struct 
{
	unsigned long row;
	unsigned long val1;
	unsigned long val2;
} row_t;


/* return NULL if not overlap 
* overlap == any number in the range of sm->val1 .. sm->val2  is 
*                equal to or inside any number in 
*                the range of big->val1 ..big->val2 or
*            there exists at least one common number in each range.
*/
const row_t *show_overlap(const row_t *sm, const row_t *big)
{
    
    int retval=0;
    if(sm->row && big->row) 
    {   /* greatest sm number is less than least big number
    	*  or smallest sm number is bigger than greatest big number        
    	*/
		retval=(sm->val2 <  big->val1 || sm->val1 > big->val2)? 0: 1;
		if(retval)
	   		printf("%d: %d  %d\n", sm->row, big->val1, big->val2);
	}
	return (retval)? sm : NULL;
}
char **whack( char **arr, char *src, const char *delimiter)
{   
	int i=0;
	char *p=strtok(src, delimiter);
	while(p!=NULL  && i < 4)
	{
		arr[i++]=p;
		arr[i]=NULL;
		p=strtok('\0', delimiter);
	}                                                             
    return arr;                                              
}

void load(row_t *b,  FILE *fp)
{
	char tmp[256]={0x0};
	while(fgets(tmp, sizeof(tmp), fp)!=NULL)
	{
	    char *p[8]={NULL};  
	    whack(p, tmp, " ");
		b->row= strtoul(p[0], NULL, 10);
		b->val1=strtoul(p[1], NULL, 10);
		b->val2=strtoul(p[2], NULL, 10);
		b++;
	}
	b->row=b->val1=b->val2=0;
	fclose(fp);
}

void overlap(row_t *sm, row_t *bg)
{
	
	register row_t *p=bg;
	while(sm->row)
	{   
	
		while(sm->val2 < p->val1 && p->row) p++;
		while(show_overlap(sm, p) !=NULL) p++;		    			
		sm++;
		if(sm->row) /* back further down big as needed */
		   while(p->val2 >= sm->val1 && p>bg) p--;
	}
}

void disp(row_t *p, char *a)
{    int i=0;
	while(p->row)
	{
		printf("%s i=%d %d %d %d \n", 
		    a, i, p->row, p->val1, p->val2);
	    p++;	i++;
	}	
	
}

int process(FILE *sm, const int count1, 
            FILE *bg, const int count2)
{

    row_t *big=malloc(sizeof(row_t) * count2);
	row_t *small=malloc(sizeof(row_t) * count1);
	
	chk(small); /* memory check */
	chk(big);	
	
	load(big, bg);  /* create array of structs */
	load(small, sm);
/*	disp(big, "big"); turn on to see what is being loaded */
/*	disp(small, "small"); */
	
	overlap(small, big); /* print overlaps */
	free(small);
	free(big);
	return 0;
}

int main(int argc, char **argv)
{
	if(argc == 5)
	{
		int count1=0;
		int count2=0;
	    FILE *sm=fopen(argv[1], "r");
	    FILE *bg=fopen(argv[3], "r");		
		if(sm == NULL || bg == NULL)
		{		    
		    perror("Cannot open input files");
		    exit(1);
		}
		count1=atoi(argv[2]);
		count2=atoi(argv[4]);
		count1++; count2++;
		process(sm, count1, bg, count2);		  
	}
	else
	{
		printf("Invalid number of arguments\n");
		exit(1);
	}
	return 0;
}

The C search completes in less than O(2n) time. With 500 records in filea and 24000 records in fileb on HPUX PA-RISC 11.23 that has load:
Code:
time overlap filea $maxrow1 fileb $maxrow2  > /dev/null
real    0m1.51s
user    0m0.44s
sys     0m0.30s

You need to scutinize this line of code in show_overlap()
Code:
retval=(sm->val2 <  big->val1 || sm->val1 > big->val2)? 0: 1;

It decides what an overlap is and if the current values meet the criteria. Otherwise the code works by almost reading thru the "big" array only one time, a little futzing back and forth included. Unless you have tons of RAM I would only run this against fileb's that are less than 5 million records or so. You also need to get the output of ulimit to see if the sysadmin put memory limits on your process. You don't want that.

If you want changes you get to make 'em. If you're on linux use gcc instead of cc in the shell script. The shell script and the C code go in the same directory.

The reason it is in C is because our perl is v 5.1 - really old. C is generally faster than perl, FWIW.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Base64 conversion in awk overlaps

hi, problem: output is not consistent as expected using external command in AWK description: I'm trying to convert $2 into a base64 string for later decoding, and for this when I use awk , I'm getting overlapped results , or say it results are not 100% correct. my code is: gawk... (9 Replies)
Discussion started by: busyboy
9 Replies

2. Solaris

Determining number of hard disks in the system

Hello to all, what is the command in Solaris/Unix which I can use to determine how many hard disks exist in the system? I have tried with different command such as df -lk and similar but cannot know for sure how many actual disks are installed. Commands like # fdisk -l | grep Disk and #... (14 Replies)
Discussion started by: Mick
14 Replies

3. Shell Programming and Scripting

How to count number of files in directory and write to new file with number of files and their name?

Hi! I just want to count number of files in a directory, and write to new text file, with number of files and their name output should look like this,, assume that below one is a new file created by script Number of files in directory = 25 1. a.txt 2. abc.txt 3. asd.dat... (20 Replies)
Discussion started by: Akshay Hegde
20 Replies

4. Red Hat

Crontab: overlaps

I'm using CentOS 6.3 and I use a crontab entries like this: 0 23 2-31 * 1-6 root weekdayscript 0 23 1 * 7 root weekendscript this 2 entries always overlaps... but I don't know how... :wall: thanks (10 Replies)
Discussion started by: ionral
10 Replies

5. Shell Programming and Scripting

Compare values of hashes of hash for n number of hash in perl without sorting.

Hi, I have an hashes of hash, where hash is dynamic, it can be n number of hash. i need to compare data_count values of all . my %result ( $abc => { 'data_count' => '10', 'ID' => 'ABC122', } $def => { 'data_count' => '20', 'ID' => 'defASe', ... (1 Reply)
Discussion started by: asak
1 Replies

6. Shell Programming and Scripting

awk? create similarity matrix by calculating overlaps between sets comprising of individual parts

Hi everyone I am very new at awk and to me the task I need to get done is very very challenging... Nevertheless, after admiring how fast and elegant issues are being solved here I am sure this is my best chance. I have a 2D data file (input file is a plain tab-delimited text file). The first... (1 Reply)
Discussion started by: stonemonkey
1 Replies

7. UNIX for Dummies Questions & Answers

Determining file size for a list of files with paths

Hello, I have a flat file with a list of files with the path to the file and I am attempting to calculate the filesize for each one; however xargs isn't playing nicely and I am sure there is probably a better way of doing this. What I envisioned is this: cat filename|xargs -i ls -l {} |awk... (4 Replies)
Discussion started by: joe8mofo
4 Replies

8. Shell Programming and Scripting

Creating Hashes of Hashes of Array

Hi folks, I have a structure as mentioned below in a configuration file. <Component> Comp1: { item1:data,someUniqueAttribute; item2:data,someUniqueAttribute, } Comp2: { item3:data,someUniqueAttribute; ... (1 Reply)
Discussion started by: ckv84
1 Replies

9. Shell Programming and Scripting

Perl Hashes, reading and hashing 2 files

So I have two files that I want to put together via hashes and am having a terrible time with syntax. For example: File1 A apple B banana C citrusFile2 A red B yellow C orangeWhat I want to enter on the command line is: program.pl File1 File2And have the result... (11 Replies)
Discussion started by: silkiechicken
11 Replies

10. Programming

determining the object files...

hello, is there a utility to determine which object files are used to create a binary executable file?let me explain, please: for ex. there are three files: a.o b.o c.o and these files are used to create a binary called: prg namely, a.o b.o c.o -> prg so, how can i determine these three... (1 Reply)
Discussion started by: xyzt
1 Replies
Login or Register to Ask a Question