Determining number of overlaps between two files using Hashes?


 
# 15  
Old 09-15-2008
Well, here's the start - I don't have time to work on it at the moment, but it should certainly get you started...

Code:
#!/usr/bin/perl -w

################################################################################
################################################################################
# What this script does:                                                       #
# This script can be used to compare data overlap                              #
#                                                                              #
# How this script works:                                                       #
# It parses two external files and identifies which line entries fall within   #
# the match statement.                                                         #
#                                                                              #
# Where this script is run (and by whom):                                      #
# This script should be run from a host where the files can be accessed for    #
# comparison                                                                   #
#                                                                              #
# Revision history:                                                            #
# September 15, 2008                                                           #
#    AKG - File creation                                                       #
################################################################################
################################################################################


################################################################################
################################################################################
# Define Pragma                                                                #
################################################################################
################################################################################
use strict;
use Getopt::Std;
use vars qw/ %opt /;
use Time::HiRes qw(gettimeofday);

################################################################################
# Define Variables                                                             #
################################################################################
#my $dir        = "/usr/local/overlap"; # base directory
my $dir        = "/opt/home/agray/scripting/overlap"; # base directory
my $DEBUG;
my @tm;
my $logfile;
my $timeStamp;
my @fileA;
my @fileB;

################################################################################
# Define Prerequisites  (Require / Include statements go here)                 #
################################################################################

################################################################################
# Forward declaration of subroutines                                           #
################################################################################
sub do_init();              # This manages the command line options
sub do_usage();             # This is the usage message for do_init()

################################################################################
################################################################################
# MAIN                                                                         #
################################################################################
################################################################################

# Parse command line variables                                                 #
do_init();

# set DEBUG flag according to command line options                             #
if ( $opt{d} )
{
   $DEBUG = 1;
}
else
{
   $DEBUG = 0;
}

# Next, create a timestamp and logfile to store information                    #
# getTimestamp
@tm = (localtime($^T))[0..5];
++$tm[4];
$tm[5] += 1900;
$timeStamp = sprintf("%04d%02d%02d.%02d%02d%02d", reverse @tm);
$logfile = "$dir/log/$0.$timeStamp.log";

open LOG, ">> $logfile" or die "Can't open $logfile for write: $!";
print LOG "$timeStamp: running $0\n";


################################################################################
# Open files into array                                                        #
################################################################################

open(FILEA, "<$opt{a}") or die "Cannot open $opt{a} for read :$!";
@fileA = <FILEA>;
close( FILEA );

open(FILEB, "<$opt{b}") or die "Cannot open $opt{b} for read :$!";
@fileB = <FILEB>;
close( FILEB );

my @lineArray;       # fields of the current line of file A
my @tempStart;       # start positions from field 5 (comma separated)
my @tempEnd;         # end positions from field 6 (comma separated)
my @newLineArray;    # "start:end" pairs built from the two lists
my @lineFile2Array;  # fields of the current line of file B
my $start;
my $end;

foreach my $lineA (@fileA)  # walk through each line of file A
{
   chomp $lineA;
   @lineArray = split (/\t/, $lineA);   #split the line into temporary array elements
   @tempStart = split (/,/, $lineArray[5]);
   @tempEnd = split (/,/, $lineArray[6]);
   @newLineArray = ();
   for (my $count = 0; $count < $lineArray[4]; $count++)
   {
      #create a new array composed of start:end pairs taken from fields 5 and 6
      push @newLineArray, "$tempStart[$count]:$tempEnd[$count]";
   }
   foreach my $pair (@newLineArray)
   {
      ($start, $end) = split (/:/, $pair);
      foreach my $line (@fileB)
      {
         chomp $line;
         @lineFile2Array = split (/\t/, $line);
         if (($lineFile2Array[1] >= $start) && ($lineFile2Array[2] <= $end))
         {
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that line), move on to the next line in file B
      }
   # When done evaluating that element, move on to the next element of the line
   }
# When done evaluating that line, move on to the next line in file A
}

################################################################################
################################################################################
# Subroutines                                                                  #
################################################################################
################################################################################
sub do_init()
{
   my $opt_string = 'hda:b:';
   getopts( "$opt_string", \%opt ) or do_usage();
   do_usage() if $opt{h};
   do_usage() unless ($opt{a} && $opt{b});
}


sub do_usage()
{
   print "\nusage: $0 [-h] [-d] [-a file1] [-b file2]\n\n";
   #############################################################################
   print "\n\n";
   exit;
}

# 16  
Old 09-15-2008
It's called a Cartesian product. All rows in file1 * all rows in file2.
My suggestion would be to use a database.

Failing that - can you not just record positives (YES in your example)? Based on your information I would guess that at a minimum - 23/24 of the time or 95.83% of the tests will result in NO. Why record a NO count when you can infallibly infer it?
# 17  
Old 09-15-2008
Quote:
Originally Posted by jim mcnamara
It's called a Cartesian product. All rows in file1 * all rows in file2.
My suggestion would be to use a database.

Failing that - can you not just record positives (YES in your example)? Based on your information I would guess that at a minimum - 23/24 of the time or 95.83% of the tests will result in NO. Why record a NO count when you can infallibly infer it?
Yes, that makes sense. If I know how many lines I start with and count only the YESes, then I can simply do a subtraction and get the answers for both.
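
The subtraction idea can be sketched in Perl roughly as follows. This is only an illustration, not code from the thread: the sub name `count_overlaps` and the sample numbers are made up, and it assumes an entry counts as YES when it falls inside at least one range, using the same containment test as the script posted earlier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count how many [start, end] entries fall inside
# at least one of the given ranges. Only positives (YES) are counted.
sub count_overlaps
{
   my ($ranges, $entries) = @_;
   my $yes = 0;
   foreach my $entry (@$entries)
   {
      my ($s, $e) = @$entry;
      foreach my $range (@$ranges)
      {
         if ($s >= $range->[0] && $e <= $range->[1])
         {
            $yes++;
            last;   # one hit is enough; stop scanning ranges
         }
      }
   }
   return $yes;
}

# Made-up sample data, for illustration only
my @ranges  = ([100, 200], [500, 900]);
my @entries = ([120, 150], [210, 220], [600, 950], [500, 900]);

my $yes = count_overlaps(\@ranges, \@entries);
my $no  = scalar(@entries) - $yes;   # inferred by subtraction, never counted directly
print "YES: $yes\nNO: $no\n";
```

The NO count never needs its own counter or output file; it falls out of the total number of entries minus the YES count.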
# 18  
Old 09-15-2008
Quote:
Originally Posted by avronius
Well, here's the start - I don't have time to work on it at the moment, but it should certainly get you started...

Code:
sub do_usage()
{
   print "\nusage: $0 [-h] [-d] [-a file1] [-b file2]\n\n";
   #############################################################################
   print "\n\n";
   exit;
}

Thank you for this. I've been working on it since you posted it. However, I get compilation errors which I tried to correct, but I can't seem to get past this:
Quote:
Use of implicit split to @_ is deprecated at overlap.pl line 126.
syntax error at overlap.pl line 126, near "$line("
syntax error at overlap.pl line 138, near "}"
Execution of overlap.pl aborted due to compilation errors.
# 19  
Old 09-15-2008
Whoops - that should be $_ not @_ (hadn't had my coffee...)
# 20  
Old 09-15-2008
For the record, when a script is fairly simple, using $_ is fine:
Code:
foreach (@array)
{
   print "$_\n";   # do something with $_
}

However, when it gets complicated, I'd recommend assigning a variable that makes sense to you such that you can follow it through the loop:
Code:
foreach my $element (@array)
{
   print "$element\n";   # do something with $element
}
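
To make the difference between the loop forms concrete, here is a small runnable sketch (the array contents and the helper `upper_all` are made up for illustration). Note that `while (@array)` only tests whether the array is non-empty and never sets `$_`, so walking the elements takes `foreach`, or a `while` that shifts items off as it goes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return upper-cased copies of the elements.
sub upper_all
{
   my @in = @_;
   my @out;

   # foreach with a named loop variable: easy to follow in nested loops
   foreach my $element (@in)
   {
      push @out, uc $element;
   }
   return @out;
}

my @array = ('alpha', 'beta', 'gamma');

# Simple case: foreach aliases each element to $_
foreach (@array)
{
   print "$_\n";
}

# while (@queue) only checks that the array is non-empty, so the body
# has to remove an element each time or the loop never ends
my @queue = @array;
while (@queue)
{
   my $element = shift @queue;
   print "shifted: $element\n";
}

print join(',', upper_all(@array)), "\n";
```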

# 21  
Old 09-15-2008
Quote:
Originally Posted by avronius
Whoops - that should be $_ not @_ (hadn't had my coffee...)
Hehe, me too. But where are you referring to?
Code:
while (@fileA)  #open the file
{
   chomp;
   @lineFile1Array = split (/\t/,@_);   #split the line into temporary array elements
   @tempStart = split (/,/,$lineArray[5]);
   @tempEnd = split (/,/,$lineArray[6]);
   while ($count >= $lineArray[4])
   {
      #create a new array composed of $lineArray[5]:$lineArray[6]
      # I didn't put this code in, as the syntax escapes me this early in the morning...
      $count++;
   }
   while (@newLineArray)
   {
      $start,$end = split (/:/, $_)
      while $line(@fileB)
      {
         @lineFile2Array = split (/\t/,$line);
          if (($lineFile2Array[1] >= $start) && $lineFile2Array[2] <= $end)
         {
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that element), move on to the next element of the line
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   }
# If no match found, move on to the next line in file2
}

This one?
Code:
      $start,$end = split (/:/, $_)

And why is it splitting by ':'?
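
On the ':' question: the placeholder comment in the script says the new array is composed of `$lineArray[5]:$lineArray[6]`, so each element is presumably a `start:end` string, and the split on ':' takes it apart again. A minimal sketch under that assumption follows (the helper name `split_pair` and the sample value are made up). The list on the left needs parentheses: without them Perl parses the statement as `$start, ($end = split ...)`, which calls `split` in scalar context and is what produces the "implicit split to @_" warning on older Perls. The statement also needs its closing semicolon, which is the likely cause of the syntax error reported near `$line(`, together with `while $line(...)` needing to be `foreach my $line (...)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: take a "start:end" string apart.
sub split_pair
{
   my ($pair) = @_;
   # Parentheses make this a list assignment, so split runs in list context
   my ($start, $end) = split(/:/, $pair);
   return ($start, $end);
}

my ($start, $end) = split_pair('100:200');   # made-up sample value
print "start=$start end=$end\n";
```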