Create combinations based on scores


 
# 1  
Old 10-27-2015

Hi experts,

I have a score matrix like the one below, where the 3rd column (1 = max, 0 = min) says how close the variable in the 2nd column is to the variable in the 1st column:

Code:
a       b       0.3
a       c       0.87
a       d       0.75
b       x       0.67
b       y       0.98
b       z       0.24
c       m       0.9
c       n       0.76
d       p       0.87


Given a row of variables, I need to expand the selection to other combinations whose members are close (score > 0.7) to the given variables.

So, for the row below, I need to find all combinations of variables that are close (score > 0.7) to the variable in each column.

Code:
a	b	c

For example, variable a can be expanded to include variables c and d, as they have scores > 0.7 with a.

Variable b can be expanded to include y, and c can be expanded to include m and n.

So the above row can be expanded as

Code:
a	b	c
c	b	c
d	b	c
a	y	c
c	y	c
d	y	c
a	b	m
c	b	m
d	b	m
a	y	m
c	y	m
d	y	m
a	b	n
c	b	n
d	b	n
a	y	n
c	y	n
d	y	n

Similarly for the row

Code:
d	b	a

I want an expansion of

Code:
d	b	a
p	b	a
d	y	a
p	y	a
d	b	c
p	b	c
d	y	c
p	y	c
d	b	d
p	b	d
d	y	d
p	y	d

So my example input is

Code:
a	b	c
d	b	a

and desired output is

Code:
a	b	c
c	b	c
d	b	c
a	y	c
c	y	c
d	y	c
a	b	m
c	b	m
d	b	m
a	y	m
c	y	m
d	y	m
a	b	n
c	b	n
d	b	n
a	y	n
c	y	n
d	y	n
d	b	a
p	b	a
d	y	a
p	y	a
d	b	c
p	b	c
d	y	c
p	y	c
d	b	d
p	b	d
d	y	d
p	y	d

This is what I tried unsuccessfully.

Code:
awk 'NR==FNR{scr[$1,$2]=$3;var[$1];  next} { for (col1 in var ) { for (col2 in var) { for (col3 in var)   { if  ( scr[$1,col1]>0.7 && scr[$2,col2] > 0.7 && scr[$3,col3]>0.7 ) { print col1,col2, col3 }}}}}' scr input

Another potential issue is that there are 345 million rows in the score file. My code might be close, but three nested for loops for each input row might run forever.


Here are the memory and OS of the cluster I have access to:

Code:
 free -m
             total       used       free     shared    buffers     cached
Mem:        387591     299120      88471          2        481     292698


cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6

# 2  
Old 10-27-2015
How about
Code:
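awk '
# First file (scr): for each variable, collect its close partners (score > 0.7)
FNR==NR         {if ($3 > 0.7) EXP[$1] = EXP[$1] "," $2
                 next
                }
# Second file (input): each column expands to itself plus its partners
                {split ($1 EXP[$1], U, ",")
                 split ($2 EXP[$2], V, ",")
                 split ($3 EXP[$3], W, ",")
                 # print every combination of the three expanded columns
                 for (u in U)
                        for (v in V)
                                for (w in W) print U[u], V[v], W[w]
                }
' OFS="\t" scr input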
a    b    c
a    b    m
a    b    n
a    y    c
a    y    m
a    y    n
c    b    c
c    b    m
c    b    n
c    y    c
c    y    m
c    y    n
d    b    c
d    b    m
d    b    n
d    y    c
d    y    m
d    y    n
d    b    a
d    b    c
d    b    d
d    y    a
d    y    c
d    y    d
p    b    a
p    b    c
p    b    d
p    y    a
p    y    c
p    y    d
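
A note on row order: awk leaves the traversal order of `for (u in U)` unspecified, so the order of the rows above can differ between awk implementations. Below is a minimal sketch, assuming the row order requested in post #1 matters (first column varying fastest), that walks each array by the numeric index returned by `split`. The temporary directory and the `out` variable are only there to make the sketch self-contained; they are not part of the thread.

```shell
# Recreate the sample files from post #1 in a scratch directory.
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' a b 0.3 a c 0.87 a d 0.75 b x 0.67 b y 0.98 \
    b z 0.24 c m 0.9 c n 0.76 d p 0.87 > "$tmp/scr"
printf '%s\t%s\t%s\n' a b c d b a > "$tmp/input"

out=$(awk '
# Score file: remember the partners with score > 0.7 for each variable.
FNR == NR { if ($3 > 0.7) EXP[$1] = EXP[$1] "," $2; next }
# Input file: split() returns the element count, so indices 1..n are ordered.
{
    nu = split($1 EXP[$1], U, ",")
    nv = split($2 EXP[$2], V, ",")
    nw = split($3 EXP[$3], W, ",")
    # Third column outermost, first column innermost (varies fastest).
    for (w = 1; w <= nw; w++)
        for (v = 1; v <= nv; v++)
            for (u = 1; u <= nu; u++)
                print U[u], V[v], W[w]
}
' OFS="\t" "$tmp/scr" "$tmp/input")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Swapping the nesting of the three loops changes only the order of the rows, not the set of rows produced.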

This User Gave Thanks to RudiC For This Post:
# 3  
Old 10-27-2015
Hi RudiC, thank you. Will this cope with the 345 million records in the score table? Is there a way to avoid the three for loops? Thanks a lot for your help.
# 4  
Old 10-27-2015
Can't tell. Scores of 0.7 or below won't be collected into awk arrays, but 345E6 records may still be too many. And I'm afraid you can't avoid three nested loops if you want every combination of three columns.
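
One possible way to reduce memory use, sketched under the assumption that the input file is much smaller than the score file: read the input first to learn which variables actually occur, then keep only the partners of those variables while streaming the score file. The `need` array and the `pass` counter are illustrative names, not something from the thread, and the three nested loops are still there.

```shell
# Recreate the sample files from post #1 in a scratch directory.
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' a b 0.3 a c 0.87 a d 0.75 b x 0.67 b y 0.98 \
    b z 0.24 c m 0.9 c n 0.76 d p 0.87 > "$tmp/scr"
printf '%s\t%s\t%s\n' a b c d b a > "$tmp/input"

out=$(awk '
FNR == 1  { pass++ }    # a new file argument starts (assumes no empty files)
# Pass 1 (input): note every variable that needs an expansion.
pass == 1 { for (i = 1; i <= NF; i++) need[$i]; next }
# Pass 2 (scr): store a partner only if its key is needed and the score is high.
pass == 2 { if ($3 > 0.7 && ($1 in need)) EXP[$1] = EXP[$1] "," $2; next }
# Pass 3 (input again): expand each column and print all combinations.
{
    nu = split($1 EXP[$1], U, ",")
    nv = split($2 EXP[$2], V, ",")
    nw = split($3 EXP[$3], W, ",")
    for (u = 1; u <= nu; u++)
        for (v = 1; v <= nv; v++)
            for (w = 1; w <= nw; w++)
                print U[u], V[v], W[w]
}
' OFS="\t" "$tmp/input" "$tmp/scr" "$tmp/input")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The input file is named twice on the command line so awk can scan it before and after the score file. Whether this is enough for 345 million score rows depends on how many distinct variables the input contains, which the thread does not say.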
This User Gave Thanks to RudiC For This Post:
# 5  
Old 10-27-2015
Thank you. Would you also point out what is wrong with my code? It would help a lot in my learning.
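
A likely diagnosis of the code in post #1, offered as a reading of the script rather than as a reply from the thread: `var` only receives names from column 1 of the score file, so partners that appear only in column 2 (`y`, `m`, `n`, `p`) are never tried; a variable never matches itself because there is no `scr[a,a]` entry; and merely referencing `scr[$1,col1]` creates an empty array element for every pair tested. For the row `a b c`, no `col2` passes the test, so nothing is printed. The sketch below repairs those three points while staying close to the original; it still tests every triple of names for each input row, so it is meant for learning, not for the 345-million-row file.

```shell
# Recreate the sample files from post #1 in a scratch directory.
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' a b 0.3 a c 0.87 a d 0.75 b x 0.67 b y 0.98 \
    b z 0.24 c m 0.9 c n 0.76 d p 0.87 > "$tmp/scr"
printf '%s\t%s\t%s\n' a b c d b a > "$tmp/input"

out=$(awk '
NR == FNR {
    scr[$1,$2] = $3
    var[$1]; var[$2]                  # candidates come from both columns
    scr[$1,$1] = 1; scr[$2,$2] = 1    # treat a variable as close to itself
    next
}
{
    for (col1 in var) {
        # "in" tests membership without creating a new array element
        if (!(($1,col1) in scr) || scr[$1,col1] <= 0.7) continue
        for (col2 in var) {
            if (!(($2,col2) in scr) || scr[$2,col2] <= 0.7) continue
            for (col3 in var)
                if ((($3,col3) in scr) && scr[$3,col3] > 0.7)
                    print col1, col2, col3
        }
    }
}
' OFS="\t" "$tmp/scr" "$tmp/input" | sort)

printf '%s\n' "$out"
rm -rf "$tmp"
```

The output is piped through `sort` only because `for (x in var)` visits keys in an unspecified order. An input variable that never appears in the score file gets no self entry here, so a row containing one would produce no output.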