Determining number of overlaps between two files using Hashes?
Hi there,
I have a question about how to set this up. Here is the situation.
I have two files. The first is ~31,000 lines long and has the following information (7 fields):
file1
The 5th field is important: it gives the number of segments represented in fields 6 and 7. For example, the first line shows 6, so the first number in field 6 is the start of the first segment and the first number in field 7 is the end of the first segment, and so on for all 6 segments. The second line shows only 1 in field 5, so it has a single segment starting at 100217082 and ending at 100217185.
The second file is variable in length, anywhere from 3,000,000 to 10,000,000 lines. It contains 4 fields:
file2
As you can see, fields 2 and 3 differ by just 36, and I want to know how many lines in file2 fall within file1, specifically within its segments (remember that each line in file1 has a different number of segments, e.g. 6, 1, and 12, as given in field 5).
So if I use these two files to generate my output, my output would tell me:
There are 3 lines from file2 that match or overlap segments in file1 and 1 line from file2 that does NOT match or overlap segments in file1.
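To make the expected output concrete, here is a minimal Perl sketch of the overlap test itself. Apart from the segment 100217082-100217185 mentioned above, every coordinate is a made-up illustration, and treating both interval ends as inclusive is an assumption.

```perl
use strict;
use warnings;

# Two closed intervals overlap when each one starts before the other ends.
sub overlaps {
    my ($s1, $e1, $s2, $e2) = @_;
    return $s1 <= $e2 && $s2 <= $e1;
}

# Hypothetical segments from file1 as [start, end] pairs.
my @segments = ([100217082, 100217185], [100220000, 100220150]);

# Hypothetical reads from file2; start and end differ by 36.
my @reads = ([100217100, 100217136], [100217170, 100217206],
             [100220140, 100220176], [100230000, 100230036]);

my ($hit, $miss) = (0, 0);
for my $read (@reads) {
    # A read counts once, no matter how many segments it touches.
    if (grep { overlaps(@$read, @$_) } @segments) { $hit++ } else { $miss++ }
}
print "$hit overlap, $miss do not\n";   # 3 overlap, 1 do not
```

This checks every read against every segment, which is fine for a handful of lines but would be slow for millions of reads.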
To do this kind of computation, should I use hashes for the first file or the second file, and if so, how would I set this up? Can someone assist here? Thanks!
Are there any other comparisons or output expectations you have not mentioned - because you maybe thought they did not matter?
Hashing is meant for exact-match lookups, not for finding whether a number falls within a range or between two values.
The first thing you must do is to undo the complexity of the line structures and then sort them if you want the process to complete in reasonable time.
I don't want to try anything until I'm sure we won't get into a feedback loop: 'Now I need this...'
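One way to read the advice above, sketched under assumptions: flatten every segment into a single list of [start, end] pairs, sort by start, merge segments that overlap so the list is disjoint, then binary-search each read against it. The sample coordinates and the helper names `merge_segments` and `read_overlaps` are hypothetical, not taken from the original files.

```perl
use strict;
use warnings;

# Sort segments by start and merge any that overlap, so the result is a
# disjoint list in which the ends are sorted as well as the starts.
sub merge_segments {
    my @sorted = sort { $a->[0] <=> $b->[0] } @_;
    my @merged;
    for my $seg (@sorted) {
        if (@merged && $seg->[0] <= $merged[-1][1]) {
            $merged[-1][1] = $seg->[1] if $seg->[1] > $merged[-1][1];
        } else {
            push @merged, [@$seg];
        }
    }
    return \@merged;
}

# Binary search for the first segment whose end is >= the read start; the
# read overlaps only if that segment starts at or before the read end.
sub read_overlaps {
    my ($merged, $start, $end) = @_;
    my ($lo, $hi) = (0, scalar(@$merged));
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($merged->[$mid][1] < $start) { $lo = $mid + 1 } else { $hi = $mid }
    }
    return $lo < scalar(@$merged) && $merged->[$lo][0] <= $end;
}

my $merged = merge_segments([300, 400], [100, 200], [150, 250]);
# $merged now holds [100, 250] and [300, 400]
print read_overlaps($merged, 240, 276) ? "hit\n" : "miss\n";   # hit
print read_overlaps($merged, 260, 296) ? "hit\n" : "miss\n";   # miss
```

Merging loses the information about which file1 line a segment came from, so this sketch only answers whether a read overlaps anything, not which line it overlaps.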
I once wrote a script that compared two files to report on changes to a file (like diff, but with output for phb's).
It would open each file into an array.
As it stepped through the first array, it would do any specific line analysis required, then it would go through the second array line by line - looking for matches (or !matches).
It would write that data to an output array.
Next, it would step through the second array, doing any specific line analysis required, then go through the first array line by line - again looking for matches (or !matches). It would write that data to a different output array.
These two arrays had the smaller subset of results that I required, and could be cross matched via a third function.
It's not the nicest way to do this sort of thing, but it did allow me to resolve my requirement in the shortest amount of time, without killing the system.
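For small inputs, the two-pass comparison described above might look like the following sketch. The array contents and the helper name `unmatched` are hypothetical, and it only reports lines without an exact match. Each line triggers a scan of the other array, so the cost grows with the product of the two file sizes.

```perl
use strict;
use warnings;

# Return the lines of @$from that have no exact match in @$against,
# using the nested line-by-line scan described above.
sub unmatched {
    my ($from, $against) = @_;
    my @out;
    for my $line (@$from) {
        my $found = 0;
        for my $other (@$against) {
            if ($line eq $other) { $found = 1; last }
        }
        push @out, $line unless $found;
    }
    return @out;
}

# Hypothetical file contents, already read into arrays.
my @old = ('alpha', 'beta', 'gamma');
my @new = ('beta', 'gamma', 'delta');

my @only_old = unmatched(\@old, \@new);   # first pass:  ('alpha')
my @only_new = unmatched(\@new, \@old);   # second pass: ('delta')
print "removed: @only_old\nadded: @only_new\n";
```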
In this case, you have an array that is composed of 6 or 1 or 12 number pairs.
This is a rough idea - not a complete code snippet (obviously!)
I didn't include code for opening either file - open both files before entering the loop
Last edited by avronius; 09-15-2008 at 12:13 PM.
Reason: added some small amount of clarity
Thank you for the post. I can't think of anything else. Knowing whether or not each line falls within a range is what I need for my analysis, as I indicated in the expected output.
Thank you for your post. I think I understand the logic here, but I'm not sure where you get
You'll need to create a counter, $count.
While $count is less than $lineArray[4] (your number of elements), you'll loop, creating the new array of (element0start:element0end element1start:element1end etc.)
Increment the counter on each pass so that you don't keep trying to create new elements in the new array after you've mapped all the sets.
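The counter loop could be sketched like this. The sample line is hypothetical: its first four fields are placeholders, and it is an assumption that fields are whitespace-separated and that fields 6 and 7 hold comma-separated lists.

```perl
use strict;
use warnings;

# Hypothetical file1 line: field 5 is the segment count, fields 6 and 7 are
# assumed to be comma-separated lists of starts and ends.
my $line = "idA x y z 2 100,300 200,400";

my @lineArray = split ' ', $line;
my @starts = split /,/, $lineArray[5];
my @ends   = split /,/, $lineArray[6];

my @pairs;
my $count = 0;
# Loop while the counter is below the number of segments in field 5.
while ($count < $lineArray[4]) {
    push @pairs, "$starts[$count]:$ends[$count]";
    $count++;   # advance so the loop stops after the last pair
}
print "@pairs\n";   # 100:200 300:400
```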