Create combinations based on scores

10-27-2015

Hi experts,

I have a score matrix like the one below, where the 3rd column (1 = max, 0 = min) says how close the variable in the 2nd column is to the variable in the 1st column.

Code:

a       b       0.3
a       c       0.87
a       d       0.75
b       x       0.67
b       y       0.98
b       z       0.24
c       m       0.9
c       n       0.76
d       p       0.87

Given a row of variables, I need to expand it into all the other combinations whose members are close (score > 0.7) to the given variables.

So, for the row below, I need to find every combination in which each column holds either the original variable or one that is close (score > 0.7) to it.

Code:

a	b	c

For example, variable a can be expanded to include variables c and d, as they have scores > 0.7 with a.

Variable b can be expanded to include y, and c can include m and n.

So the above row can be expanded as

Code:

a	b	c
c	b	c
d	b	c
a	y	c
c	y	c
d	y	c
a	b	m
c	b	m
d	b	m
a	y	m
c	y	m
d	y	m
a	b	n
c	b	n
d	b	n
a	y	n
c	y	n
d	y	n

Similarly for the row

Code:

d	b	a

I want an expansion of

Code:

d	b	a
p	b	a
d	y	a
p	y	a
d	b	c
p	b	c
d	y	c
p	y	c
d	b	d
p	b	d
d	y	d
p	y	d

So my example input is

Code:

a	b	c
d	b	a

and desired output is

Code:

a	b	c
c	b	c
d	b	c
a	y	c
c	y	c
d	y	c
a	b	m
c	b	m
d	b	m
a	y	m
c	y	m
d	y	m
a	b	n
c	b	n
d	b	n
a	y	n
c	y	n
d	y	n
d	b	a
p	b	a
d	y	a
p	y	a
d	b	c
p	b	c
d	y	c
p	y	c
d	b	d
p	b	d
d	y	d
p	y	d

This is what I tried unsuccessfully.

Code:

awk 'NR==FNR{scr[$1,$2]=$3;var[$1];  next} { for (col1 in var ) { for (col2 in var) { for (col3 in var)   { if  ( scr[$1,col1]>0.7 && scr[$2,col2] > 0.7 && scr[$3,col3]>0.7 ) { print col1,col2, col3 }}}}}' scr input

Another potential issue is that the score file has 345 million rows. I think my code might be close, but three nested for loops for each input row might run forever.

Here are the memory and OS of the cluster I have access to:

Code:

 free -m
             total       used       free     shared    buffers     cached
Mem:        387591     299120      88471          2        481     292698


cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.6

jianp83

10-27-2015

How about

Code:

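awk '
# First file (scr): for each name, collect the neighbours scoring above 0.7
FNR==NR         {if ($3 > 0.7) EXP[$1] = EXP[$1] "," $2
                 next
                }
# Second file (input): candidates per column are the name itself plus its neighbours
                {split ($1 EXP[$1], U, ",")
                 split ($2 EXP[$2], V, ",")
                 split ($3 EXP[$3], W, ",")
# Print every combination of the three candidate lists
                 for (u in U)
                        for (v in V)
                                for (w in W) print U[u], V[v], W[w]
                }
' OFS="\t" scr input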
a    b    c
a    b    m
a    b    n
a    y    c
a    y    m
a    y    n
c    b    c
c    b    m
c    b    n
c    y    c
c    y    m
c    y    n
d    b    c
d    b    m
d    b    n
d    y    c
d    y    m
d    y    n
d    b    a
d    b    c
d    b    d
d    y    a
d    y    c
d    y    d
p    b    a
p    b    c
p    b    d
p    y    a
p    y    c
p    y    d

RudiC
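
A note on ordering: awk leaves the traversal order of `for (u in U)` unspecified, which is presumably why the output above is ordered differently from the listing requested in the question. As a minimal sketch (not RudiC's code; the function name `expand_in_order` and the array name `close_to` are made up for illustration), the count returned by `split` can drive indexed loops, with the third column outermost so that the first column varies fastest:

```shell
# Hypothetical helper: $1 = score file, $2 = input rows.
expand_in_order() {
  awk '
    # Score file: keep only neighbours scoring above 0.7, in file order.
    NR == FNR { if ($3 > 0.7) close_to[$1] = close_to[$1] "," $2; next }
    # Input rows: each candidate list starts with the original name.
    {
      nu = split($1 close_to[$1], U, ",")
      nv = split($2 close_to[$2], V, ",")
      nw = split($3 close_to[$3], W, ",")
      # Indexed loops give a fixed order: column 3 slowest, column 1 fastest.
      for (k = 1; k <= nw; k++)
        for (j = 1; j <= nv; j++)
          for (i = 1; i <= nu; i++)
            print U[i], V[j], W[k]
    }
  ' OFS="\t" "$1" "$2"
}
```

Called as `expand_in_order scr input`, this should list the combinations for the sample data in the same order as the question does. It still uses three nested loops per row; only the order is pinned down.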

10-27-2015

Hi RudiC, thank you. Will this cope with the 345 million records in the score table? And is there a way to avoid the three for loops? Thanks a lot for your help.

jianp83

10-27-2015

Can't tell. Even though scores of 0.7 or below won't be collected into awk arrays, 345 million records may still be too many. And I'm afraid you can't avoid the three loops if you want to permute three columns.

RudiC
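
One way the arrays might be shrunk further, sketched under assumptions and untested at the 345-million-row scale: read the input rows first, remember which names occur in them, and keep a score record only if its first column is one of those names. The function name `expand_low_memory` and the arrays `wanted` and `near` are hypothetical, and the sketch assumes that the input is a regular file that can be read twice and that neither file is empty (the pass counter relies on `FNR == 1`).

```shell
# Hypothetical helper: $1 = score file, $2 = input rows.
expand_low_memory() {
  awk '
    FNR == 1 { pass++ }                       # a new file starts a new pass
    # Pass 1 (input rows): note every name that will need expanding.
    pass == 1 { for (i = 1; i <= NF; i++) wanted[$i]; next }
    # Pass 2 (score file): store a neighbour only for wanted names.
    pass == 2 {
      if (($1 in wanted) && $3 > 0.7) near[$1] = near[$1] "," $2
      next
    }
    # Pass 3 (input rows again): expand each row from the stored lists.
    {
      nu = split($1 near[$1], U, ",")
      nv = split($2 near[$2], V, ",")
      nw = split($3 near[$3], W, ",")
      for (k = 1; k <= nw; k++)
        for (j = 1; j <= nv; j++)
          for (i = 1; i <= nu; i++)
            print U[i], V[j], W[k]
    }
  ' OFS="\t" "$2" "$1" "$2"
}
```

The score file is still scanned once from start to end, so the run time stays proportional to its size; what shrinks is the memory, which now depends on the names in the input rather than on every name in the score file.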

10-27-2015

Thank you. Would you also point out what is wrong with my code? It would help me a lot in learning.

jianp83
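
A possible reading of why the first attempt prints nothing for the sample data, offered as a sketch rather than as an answer from the thread. First, `var` only holds names from the 1st column of the score file (a, b, c, d), so y, m, n and p can never become candidates. Second, there is no pair such as (a, a) in `scr`, so the original name itself is never kept. Third, the three tests are joined with `&&`, so a row produces output only if every column finds a candidate; for column b none of a, b, c, d scores above 0.7, so both sample rows print nothing. In addition, looking up a missing key such as `scr[$1,col1]` creates an empty element in awk, so the array grows with every test. The version below keeps the pair lookup but fixes those points; the names `expand_with_lookup`, `names` and `ok` are made up for illustration.

```shell
# Hypothetical helper: $1 = score file, $2 = input rows.
expand_with_lookup() {
  awk '
    # A candidate is acceptable if it is the original name itself
    # or if the stored pair score is above 0.7.
    function ok(orig, cand) {
      return cand == orig || (((orig, cand) in scr) && scr[orig, cand] > 0.7)
    }
    # Score file: remember names from BOTH columns, not only the first.
    NR == FNR { scr[$1, $2] = $3; names[$1]; names[$2]; next }
    {
      names[$1]; names[$2]; names[$3]   # make sure the originals are present
      for (c1 in names) if (ok($1, c1)) {
        for (c2 in names) if (ok($2, c2)) {
          for (c3 in names) if (ok($3, c3)) {
            print c1, c2, c3
          }
        }
      }
    }
  ' OFS="\t" "$1" "$2"
}
```

This only illustrates the fixes: it scans every known name for each column of each row and prints in an unspecified order, so precomputed neighbour lists as in RudiC's reply are likely the better fit for a large score file.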
