I have a score matrix like the one below, where the third column (1 = max, 0 = min) says how close the variable in the second column is to the variable in the first column.
Given a set of variables, I need to expand the selection to other combinations that are close (score > 0.7) to the given variables.
So, for the row, I need to find all combinations that have closeness (score > 0.7) to the column variables.
For example: variable a can be expanded to include variables c and d, as they have scores > 0.7 with a.
Variable b can be expanded to include y, and c can include m and n.
So the above row can be expanded as
Similarly for the row
I want an expansion of
So my example input is
and desired output is
This is what I tried, unsuccessfully.
Another potential issue is that there are 345 million rows in the score file. I realize that my code might be close, but three for loops for each input row might run forever.
Here is the cluster memory and OS I have access to
Hi RudiC, thank you. Will this help deal with the 345 million records in the score table? Is there a way to avoid the three for loops? Thanks a lot for your help.
Can't tell. Even though scores of 0.7 or below won't be collected into awk arrays, 345 E6 records may still be too many to hold in memory. And I'm afraid you can't avoid three loops if you want to permute three columns.
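
A minimal sketch of that idea, written in Python rather than awk and under stated assumptions: the score file is whitespace-separated with three fields per line and no header, the input row is `a b c`, each variable stays in its own expansion set, and the score values are made up for illustration. The function names `load_neighbors` and `expand_row` are hypothetical. The score records are streamed one at a time and only pairs scoring above 0.7 are kept, so memory use depends on how many pairs pass the filter, not on the total record count.

```python
import io
from collections import defaultdict
from itertools import product

THRESHOLD = 0.7  # closeness cutoff taken from the question


def load_neighbors(lines, threshold=THRESHOLD):
    """Stream score records; keep only pairs scoring above the threshold."""
    neighbors = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        first, second, score = fields
        if float(score) > threshold:
            neighbors[first].append(second)
    return neighbors


def expand_row(row, neighbors):
    """Return every combination of each variable or one of its close neighbors."""
    # Assumption: the original variable is kept as one of its own choices.
    choices = [[var] + neighbors.get(var, []) for var in row]
    return list(product(*choices))


# Hypothetical scores, chosen to match the a/b/c example in the question.
SAMPLE = """\
a c 0.9
a d 0.8
a e 0.3
b y 0.75
c m 0.95
c n 0.71
c z 0.7
"""

neighbors = load_neighbors(io.StringIO(SAMPLE))
combos = expand_row(["a", "b", "c"], neighbors)
print(len(combos))  # 18 = 3 choices for a * 2 for b * 3 for c
```

`itertools.product` only hides the nested loops; the number of combinations generated per row is still the product of the set sizes, so the per-row cost is the same as with three explicit loops. For the real file, `load_neighbors` could be passed an open file handle instead of the `io.StringIO` sample.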