Post 302236095 by labrazil on Sunday 14th of September 2008 08:03:43 PM
Determining number of overlaps between two files using Hashes?

Hi there,

I have a question about how to set this up. Here is the situation.

I have two files. The first is ~31,000 lines long and has 7 fields per line:
file1
Code:
1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

The 5th field is important: it gives the number of segments described by fields 6 and 7. For example, the first line shows 6, so the first number in field 6 is the start of the first segment and the first number in field 7 is the end of the first segment, and so on for all 6 segments. The second line shows only 1 in field 5, so it has a single segment starting at 100217082 and ending at 100217185.
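To make the segment layout concrete, here is a minimal Python sketch of how one file1 line could be split into (start, end) pairs. The function name `parse_segments` is made up for illustration, and the sketch assumes the fields are whitespace-separated and that fields 6 and 7 always end with a trailing comma, as in the sample lines.

```python
def parse_segments(line):
    """Return (chrom, strand, [(start, end), ...]) for one file1 line."""
    chrom, strand, _span_start, _span_end, count, starts, ends = line.split()
    # Fields 6 and 7 are comma-separated lists with a trailing comma,
    # so empty pieces are dropped after splitting.
    start_list = [int(x) for x in starts.split(",") if x]
    end_list = [int(x) for x in ends.split(",") if x]
    n = int(count)
    # Field 5 should agree with the number of entries in fields 6 and 7.
    if len(start_list) != n or len(end_list) != n:
        raise ValueError(
            f"expected {n} segments, got {len(start_list)} starts "
            f"and {len(end_list)} ends"
        )
    return chrom, strand, list(zip(start_list, end_list))


line = "1    +       100217082       100217185       1       100217082,      100217185,"
print(parse_segments(line))
# prints ('1', '+', [(100217082, 100217185)])
```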

The second file varies in length, from 3,000,000 to 10,000,000 lines. Each line has 4 fields:
file2
Code:
1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, fields 2 and 3 always differ by 36. I want to know how many lines in file2 are contained within file1, looking specifically at the segments (remember that each line in file1 has a different number of segments, e.g. 6, 1, and 12, as given in field 5).

So if I use these two files to generate my output, it would tell me that 3 lines from file2 match or overlap segments in file1 and 1 line from file2 does NOT match or overlap any segment in file1:

Code:
YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

To do this kind of computation, should I use hashes for the first file or for the second file, and if so, how would I set this up? Can someone assist here? Thanks!
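One possible setup, sketched in Python (where a dict plays the role of a hash), is to hash the smaller file1 by chromosome and stream the much larger file2 past it one line at a time. This is a sketch under assumptions, not a definitive answer: it treats a match as full containment, ignores the strand, and all function names are made up for illustration. Each chromosome's segments are sorted by start, and a running maximum of segment ends lets a binary search decide containment even when segments from different file1 lines overlap.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


def build_index(file1_lines):
    """Map chromosome -> (sorted starts, running max of ends)."""
    by_chrom = defaultdict(list)
    for line in file1_lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        starts = (int(x) for x in fields[5].split(",") if x)
        ends = (int(x) for x in fields[6].split(",") if x)
        by_chrom[fields[0]].extend(zip(starts, ends))
    index = {}
    for chrom, segs in by_chrom.items():
        segs.sort()
        starts = [s for s, _ in segs]
        # max_end[i] is the largest end among segments 0..i, i.e. among
        # all segments that start at or before starts[i].
        max_end = list(accumulate((e for _, e in segs), max))
        index[chrom] = (starts, max_end)
    return index


def is_contained(index, chrom, start, end):
    """True if some segment on chrom has seg_start <= start and end <= seg_end."""
    if chrom not in index:
        return False
    starts, max_end = index[chrom]
    i = bisect_right(starts, start) - 1  # last segment starting at or before start
    return i >= 0 and max_end[i] >= end


def classify(file1_lines, file2_lines):
    """Yield each file2 line prefixed with YES or NO."""
    index = build_index(file1_lines)
    for line in file2_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        hit = is_contained(index, fields[0], int(fields[1]), int(fields[2]))
        yield ("YES " if hit else "NO ") + line.rstrip()


if __name__ == "__main__":
    file1 = ["1 + 100217082 100217185 1 100217082, 100217185,"]
    file2 = ["1 100217090 100217126 +", "1 100208310 100208346 +"]
    for out in classify(file1, file2):
        print(out)
```

Because `classify` accepts any iterable of lines, open file handles can be passed in directly, so file2 never has to be held in memory; only the roughly 31,000 lines of file1 are indexed.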
 
