Post 302236095 by labrazil on Sunday 14th of September 2008 08:03:43 PM
Determining number of overlaps between two files using Hashes?

Hi there,

I have a question about how to set this up. Here is the situation.

I have two files. The first is ~31,000 lines long and has 7 fields per line:
file1
Code:
1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

The 5th field is important: it gives the number of segments listed in fields 6 and 7. For example, the first line shows 6, so the first number in field 6 is the start of the first segment and the first number in field 7 is the end of the first segment, and so on for all 6 segments. The second line shows only 1 in field 5, so it has a single segment starting at 100217082 and ending at 100217185.
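To make the field layout concrete, here is a minimal Python sketch of how one file1 line could be split into (start, end) pairs. The function name `parse_segments` is made up for illustration, and the sketch assumes whitespace-separated fields with a trailing comma on fields 6 and 7, as in the sample lines.

```python
def parse_segments(line):
    """Split one file1 line into a list of (start, end) segment pairs."""
    fields = line.split()
    count = int(fields[4])  # field 5: number of segments
    # Fields 6 and 7 are comma-separated lists ending in a trailing comma,
    # so empty pieces are dropped.
    starts = [int(x) for x in fields[5].split(",") if x]
    ends = [int(x) for x in fields[6].split(",") if x]
    if len(starts) != count or len(ends) != count:
        raise ValueError("field 5 does not match the number of segments")
    return list(zip(starts, ends))


line = "1    +       100217082       100217185       1       100217082,      100217185,"
print(parse_segments(line))  # [(100217082, 100217185)]
```

The length check is optional, but it catches lines where field 5 disagrees with the lists that follow it.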

The second file varies in length, from 3,000,000 to 10,000,000 lines. It has 4 fields per line:
file2
Code:
1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, fields 2 and 3 always differ by 36. I want to know how many times each line in file2 is contained within the segments of file1 (remember that each line in file1 has a different number of segments, e.g. 6, 1, and 12, as given in field 5).
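"Contained within" and "overlaps" are two different tests, so it helps to pin both down. The Python sketch below shows the two predicates side by side. The function names are hypothetical, and it assumes both ends of a range are inclusive, which the post does not state.

```python
def overlaps(start, end, seg_start, seg_end):
    """True if [start, end] shares at least one position with the segment."""
    return start <= seg_end and seg_start <= end


def contained(start, end, seg_start, seg_end):
    """True if [start, end] lies entirely inside the segment."""
    return seg_start <= start and end <= seg_end


# First segment of the first file1 sample line.
seg = (100208127, 100208306)

print(overlaps(100208130, 100208166, *seg))   # True
print(contained(100208130, 100208166, *seg))  # True
print(overlaps(100208310, 100208346, *seg))   # False
# A made-up range that straddles the segment end: overlaps, not contained.
print(overlaps(100208300, 100208336, *seg))   # True
print(contained(100208300, 100208336, *seg))  # False
```

For the four sample file2 lines both tests give the same answer, so the sample data alone does not settle which one is wanted.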

So if I use these two files as input, the output would tell me:

There are 3 lines from file2 that match or overlap segments in file1 and 1 line from file2 that does NOT match or overlap segments in file1.

Code:
YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

For this kind of computation, do you think it's important to use hashes for the first file or the second file, and if so, how would I set this up? Can someone assist? Thanks!
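One possible setup, sketched in Python rather than as a definitive answer: hash the smaller file (file1) and stream the larger one (file2) line by line. The sketch keys a dict on field 1, stores the segments as sorted lists, and uses a binary search per file2 line. All function names are hypothetical. It assumes inclusive range ends, treats any overlap as a YES, and ignores the strand field, which could be added to the key if strands must match. Segments that overlap each other are merged, so it answers YES or NO only and does not count hits.

```python
import bisect
from collections import defaultdict


def build_index(file1_lines):
    """Map field 1 of file1 to sorted, merged segment starts and ends."""
    raw = defaultdict(list)
    for line in file1_lines:
        f = line.split()
        if len(f) < 7:
            continue  # skip blank or malformed lines
        starts = [int(x) for x in f[5].split(",") if x]
        ends = [int(x) for x in f[6].split(",") if x]
        raw[f[0]].extend(zip(starts, ends))
    index = {}
    for key, segs in raw.items():
        segs.sort()
        merged = []
        for s, e in segs:
            if merged and s <= merged[-1][1]:
                # Overlaps the previous segment: extend it.
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        index[key] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def hits(index, key, start, end):
    """True if [start, end] overlaps any segment stored under key."""
    if key not in index:
        return False
    starts, ends = index[key]
    # Last merged segment that begins at or before `end`.
    i = bisect.bisect_right(starts, end) - 1
    return i >= 0 and ends[i] >= start


def classify(file1_lines, file2_lines):
    """Yield each file2 line prefixed with YES or NO."""
    index = build_index(file1_lines)
    for line in file2_lines:
        f = line.split()
        if len(f) < 3:
            continue
        flag = "YES" if hits(index, f[0], int(f[1]), int(f[2])) else "NO"
        yield f"{flag} {line.rstrip()}"


# First two sample lines of file1 and all four sample lines of file2.
file1 = [
    "1 + 100208127 100261594 6 100208127,100231680,100237404,100245177,"
    "100249508,100260529, 100208306,100231885,100237559,100245300,"
    "100249677,100261594,",
    "1 + 100217082 100217185 1 100217082, 100217185,",
]
file2 = [
    "1 100208130 100208166 +",
    "1 100208310 100208346 +",
    "1 100217090 100217126 +",
    "1 100231689 100231725 +",
]
for row in classify(file1, file2):
    print(row)
```

On the sample data this prints YES, NO, YES, YES, matching the expected output above. For the real files, pass open file handles instead of lists, so the 3,000,000 to 10,000,000 lines of file2 are never held in memory at once.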
 
