Hi everyone
I am very new at awk and to me the task I need to get done is very very challenging... Nevertheless, after admiring how fast and elegant issues are being solved here I am sure this is my best chance.
I have a 2D data file (input file is a plain tab-delimited text file). The first field contains a unique ID that serves as an identifier for set of individual parts located in fields 3-n. These individual parts are listed within the adjacent fields of the line and are listed as partIDs (ID number). The partIDs can only occur once per line.
In field 2 there is the count of partIDs comprised in the set.
What I would need is to get a matrix displaying the similarity of every set compared to all other sets. The strategy to get this done might be (works fine on a small scale in excel but inadequate for actual datasets with 8000 lines hence 64’000’000 overlaps to be calculated):
The Excel solution is attached as .xls
1) Search all lines for partIDs that occur in line1
2) Print a table with the same dimensions that, however, only contains the overlaps with line 1 (line one should be identical to the input file) > redirect to a file named e.g. OverlapLine00001.txt (suffix indicating the line number every other line is compared to).
a. In “OverlapLine00001” count the number of fields that are not empty per line and calculate the overlap between line1 and line1 using the following: “count of notemptyfields in line1” / ((“count of partIDs comprised in setID1” [copied from input file] + “count of partIDs comprised in setID1” [copied from input file]) - “count of notemptyfields in line1”)
b. Calculate similar for overlap between line1 and line2 “count of notemptyfields in line2” / ((“count of partIDs comprised in setID2” [copied from input file] + “count of partIDs comprised in setID1” [copied from input file]) - “count of notemptyfields in line2”)
c. Do similar to compute the overlap between line1 and line3, line1 and line4 etc. (the results of the overlap calculations may best inserted as field3 before the partIDs as the length of the lines vary).
d. Redirect this column to a file (e.g. MatrixOverlaps.txt) where all results from the overlap calculations would be collected.
3) Repeat the same procedure described in 1) and 2) and a. to d. for remaining lines.
(To keep the original row order is crucial as the redirect in 2) c. should join the column with the overlaps with line1 with line2 with line3 etc. in order to create a matrix).
Input file:
setID | number of the partIDs comprised in the set. | fields 3-n comprise the partIDs comprised in the set | | | | |
setID1 | 5 | partID1 | partID2 | partID3 | partID4 | partID5 |
setID2 | 4 | partID3 | partID4 | partID100 | partID101 | |
setID3 | 4 | partID2 | partID3 | partID104 | partID1001 | |
setID4 | 1 | partID35 | | | | |
setID5 | 5 | partID50 | partID51 | partID5 | partID3 | partID1 |
Output for overlaps with line1 > OverlapLine00001.txt
setID | number of the partIDs comprised in the set. | overlap | number of partIDs overlapping with line1 | | | | | |
setID1 | 5 | 1 | 5 | partID1 | partID2 | partID3 | partID4 | partID5 |
setID2 | 4 | 0.285714286 | 2 | partID3 | partID4 | | | |
setID3 | 4 | 0.285714286 | 2 | partID2 | partID3 | | | |
setID4 | 1 | 0 | 0 | | | | | |
setID5 | 5 | 0.428571429 | 3 | | | partID5 | partID3 | partID1 |
Output for overlaps with line2 > OverlapLine00002.txt
setID | number of the partIDs comprised in the set. | overlap | number of partIDs overlapping with line2 | | | | | |
setID1 | 5 | 0.285714286 | 2 | | | partID3 | partID4 | |
setID2 | 4 | 1 | 4 | partID3 | partID4 | partID100 | partID101 | |
setID3 | 4 | 0.142857143 | 1 | | partID3 | | | |
setID4 | 1 | 0 | 0 | | | | | |
setID5 | 5 | 0.125 | 1 | | | | partID3 | |
Similar for all other lines
Output MatrixOverlaps.txt (final size ca. 8000x8000)
| setID1 | setID2 | setID3 | setID4 | setID5 |
setID1 | 1 | 0.285714286 | OverlapLine0003 | OverlapLine0003 | OverlapLine0003 |
setID2 | 0.285714286 | 1 | OverlapLine0003 | OverlapLine0003 | OverlapLine0003 |
setID3 | 0.285714286 | 0.142857143 | OverlapLine0003 | OverlapLine0003 | OverlapLine0003 |
setID4 | 0 | 0 | OverlapLine0003 | OverlapLine0003 | OverlapLine0003 |
setID5 | 0.428571429 | 0.125 | OverlapLine0003 | OverlapLine0003 | OverlapLine0003 |
Thanks a million for your efforts… I would be so grateful if this works and I am sure this will provide the basis for some amazing networks☺.