awk? Create a similarity matrix by calculating overlaps between sets of individual parts


 
# 1  
Old 10-02-2011

Hi everyone,
I am very new to awk, and the task I need to get done is very challenging for me. Nevertheless, after admiring how quickly and elegantly problems get solved here, I am sure this is my best chance.
I have a 2D data file (the input is a plain tab-delimited text file). The first field contains a unique ID that identifies a set of individual parts, which are located in fields 3-n. These parts are listed in the adjacent fields of the line as partIDs (ID numbers). A partID can occur only once per line.

Field 2 contains the count of partIDs in the set.

What I need is a matrix showing the similarity of every set to all other sets. The strategy might be as follows (it works fine on a small scale in Excel, but is inadequate for actual datasets with 8000 lines, hence 64,000,000 overlaps to be calculated):

The Excel solution is attached as .xls

1) Search all lines for partIDs that occur in line1.
2) Print a table with the same dimensions that contains only the overlaps with line1 (line1 itself should be identical to the input file), and redirect it to a file named e.g. OverlapLine00001.txt (the suffix indicates the line number that every other line is compared to).
a. In “OverlapLine00001”, count the non-empty partID fields per line and calculate the overlap between line1 and line1 as follows: “count of non-empty fields in line1” / ((“count of partIDs comprised in setID1” [copied from input file] + “count of partIDs comprised in setID1” [copied from input file]) - “count of non-empty fields in line1”)
b. Calculate the overlap between line1 and line2 in the same way: “count of non-empty fields in line2” / ((“count of partIDs comprised in setID2” [copied from input file] + “count of partIDs comprised in setID1” [copied from input file]) - “count of non-empty fields in line2”)
c. Do the same for the overlap between line1 and line3, line1 and line4, etc. (the results of the overlap calculations are best inserted as field 3, before the partIDs, because the line lengths vary).
d. Redirect this column to a file (e.g. MatrixOverlaps.txt) where the results of all overlap calculations are collected.
3) Repeat the procedure described in 1) and 2) a. to d. for the remaining lines.
(Keeping the original row order is crucial, because the redirect in 2) d. should join the overlap columns for line1, line2, line3, etc. to create a matrix.)
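
A minimal sketch of the formula in steps a. to c., assuming Python is an acceptable alternative to awk. The overlap is the number of shared partIDs divided by the combined size of both sets minus the shared count, which is the Jaccard index. The function name `overlap` is hypothetical; the two example sets are line1 and line2 of the input file in this post.

```python
def overlap(set_a, set_b):
    """Shared partIDs / (count in A + count in B - shared partIDs)."""
    shared = len(set_a & set_b)
    union_size = len(set_a) + len(set_b) - shared
    # Two empty sets have no defined overlap; report 0.0 (an assumption).
    return shared / union_size if union_size else 0.0


line1 = {"partID1", "partID2", "partID3", "partID4", "partID5"}
line2 = {"partID3", "partID4", "partID100", "partID101"}

print(round(overlap(line1, line2), 9))  # 0.285714286
print(overlap(line1, line1))            # 1.0
```

Because the formula only needs set sizes and the intersection, the intermediate OverlapLine files are not strictly required to get the matrix values.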

Input file:
setID   number of partIDs in the set   partIDs in the set (fields 3-n)
setID1  5                              partID1   partID2   partID3    partID4     partID5
setID2  4                              partID3   partID4   partID100  partID101
setID3  4                              partID2   partID3   partID104  partID1001
setID4  1                              partID35
setID5  5                              partID50  partID51  partID5    partID3     partID1
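
A minimal sketch of reading such a file, assuming Python rather than awk, tab-delimited fields as described, and no header row in the actual file. The function name `read_sets` is hypothetical. Field 2 is skipped here because the count can be derived from the partIDs themselves.

```python
import io


def read_sets(handle):
    """Map each setID to its list of partIDs, keeping the row order of the file."""
    sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        # fields[1] is the declared count; fields 3-n hold the partIDs.
        sets[fields[0]] = [f for f in fields[2:] if f]
    return sets


# Two lines of the example input, with explicit tabs.
sample = (
    "setID1\t5\tpartID1\tpartID2\tpartID3\tpartID4\tpartID5\n"
    "setID4\t1\tpartID35\n"
)
sets = read_sets(io.StringIO(sample))
print(list(sets))           # ['setID1', 'setID4']
print(len(sets["setID1"]))  # 5
```

For a real file, the same function can be given an open file handle instead of the `io.StringIO` object. A Python dict keeps insertion order, so the original row order is preserved.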

Output for overlaps with line1 > OverlapLine00001.txt
setID   number of partIDs in the set   overlap       number of partIDs overlapping with line1
setID1  5                              1             5   partID1  partID2  partID3  partID4  partID5
setID2  4                              0.285714286   2   partID3  partID4
setID3  4                              0.285714286   2   partID2  partID3
setID4  1                              0             0
setID5  5                              0.428571429   3   partID5  partID3  partID1

Output for overlaps with line2 > OverlapLine00002.txt
setID   number of partIDs in the set   overlap       number of partIDs overlapping with line2
setID1  5                              0.285714286   2   partID3  partID4
setID2  4                              1             4   partID3  partID4  partID100  partID101
setID3  4                              0.142857143   1   partID3
setID4  1                              0             0
setID5  5                              0.125         1   partID3
Similar for all other lines
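
The rows of such an OverlapLine table could be produced along these lines. This is only a sketch, assuming Python rather than awk and the example data from this post; the function name `overlap_rows` is hypothetical.

```python
def overlap_rows(sets, ref_id):
    """For each set: (setID, part count, overlap, shared count, shared partIDs)."""
    ref = set(sets[ref_id])
    rows = []
    for set_id, parts in sets.items():
        shared = [p for p in parts if p in ref]  # keeps the field order of the line
        union_size = len(parts) + len(ref) - len(shared)
        score = len(shared) / union_size if union_size else 0.0
        rows.append((set_id, len(parts), round(score, 9), len(shared), shared))
    return rows


sets = {
    "setID1": ["partID1", "partID2", "partID3", "partID4", "partID5"],
    "setID2": ["partID3", "partID4", "partID100", "partID101"],
    "setID3": ["partID2", "partID3", "partID104", "partID1001"],
    "setID4": ["partID35"],
    "setID5": ["partID50", "partID51", "partID5", "partID3", "partID1"],
}

# Print the table for comparisons against line2, tab-delimited.
for row in overlap_rows(sets, "setID2"):
    print(*row[:4], *row[4], sep="\t")
```

Calling `overlap_rows(sets, "setID1")` instead gives the rows for the comparison against line1.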
Output MatrixOverlaps.txt (final size ca. 8000x8000)
        setID1       setID2       setID3           setID4           setID5
setID1  1            0.285714286  OverlapLine0003  OverlapLine0003  OverlapLine0003
setID2  0.285714286  1            OverlapLine0003  OverlapLine0003  OverlapLine0003
setID3  0.285714286  0.142857143  OverlapLine0003  OverlapLine0003  OverlapLine0003
setID4  0            0            OverlapLine0003  OverlapLine0003  OverlapLine0003
setID5  0.428571429  0.125        OverlapLine0003  OverlapLine0003  OverlapLine0003

Thanks a million for your efforts… I would be so grateful if this works and I am sure this will provide the basis for some amazing networks☺.

Last edited by stonemonkey; 10-02-2011 at 03:35 PM.. Reason: found mistake
# 2  
Old 10-02-2011
Sorry, I found a mistake in the last table.
Here is the corrected one, along with the corresponding .xls.

Output MatrixOverlaps.txt (final size ca. 8000x8000)
        setID1       setID2       setID3           setID4           setID5
setID1  1            0.285714286  OverlapLine0003  OverlapLine0004  OverlapLine0005
setID2  0.285714286  1            OverlapLine0003  OverlapLine0004  OverlapLine0005
setID3  0.285714286  0.142857143  OverlapLine0003  OverlapLine0004  OverlapLine0005
setID4  0            0            OverlapLine0003  OverlapLine0004  OverlapLine0005
setID5  0.428571429  0.125        OverlapLine0003  OverlapLine0004  OverlapLine0005
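
A sketch of writing the whole matrix directly, without the intermediate OverlapLine files, assuming Python rather than awk. The function name `write_matrix` is hypothetical, and the data is the example input from the first post. Rows are written one at a time, so the full 8000x8000 matrix never has to be held in memory; whether plain Python is fast enough at that size is untested here.

```python
import io


def write_matrix(sets, handle):
    """Write a tab-delimited similarity matrix, one row per set, in input order."""
    ids = list(sets)
    parts = [set(sets[i]) for i in ids]
    handle.write("\t" + "\t".join(ids) + "\n")
    for set_id, a in zip(ids, parts):
        cells = []
        for b in parts:
            shared = len(a & b)
            union_size = len(a) + len(b) - shared
            score = shared / union_size if union_size else 0.0
            cells.append(f"{score:.9g}")  # 9 significant digits, as in the tables
        handle.write(set_id + "\t" + "\t".join(cells) + "\n")


sets = {
    "setID1": ["partID1", "partID2", "partID3", "partID4", "partID5"],
    "setID2": ["partID3", "partID4", "partID100", "partID101"],
    "setID3": ["partID2", "partID3", "partID104", "partID1001"],
    "setID4": ["partID35"],
    "setID5": ["partID50", "partID51", "partID5", "partID3", "partID1"],
}

buf = io.StringIO()
write_matrix(sets, buf)
print(buf.getvalue(), end="")
```

To write MatrixOverlaps.txt instead of printing, pass a file opened for writing as `handle`.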

Thanks again