Find all possible intersecting sets of cardinality 3 or more


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Find all possible intersecting sets of cardinality 3 or more
# 1  
Old 03-26-2015
Find all possible intersecting sets of cardinality 3 or more

Hi Gurus.

Please consider the following problem where col1 is the set name and col2 is comma delimited elements in the corresponding set.

I have about 2600 such rows.

In:
Code:
A v1,v2,v3,v4
B v6,v7,v1,v2
C v7,v1,v2,v5,v6
D v3,v2,v5,v4
E v8,v4,v6,v10,v11
F v3,v2,v1,v4,v6,v8
G v2,v5,v8,v4
H v1,v8,v4,v6
I v6,v10,v11
J v10,v11,v12
K v1,v12,v2,v3,v4


I want to find intersection of all possible sets, where cardinality of the intersection is at least 3 (at least 3 elements of the sets are in common).
A set once assigned , can only be assigned more than once if its maximum intersection cardinality is same with more than one bunch.

For example A has 4 elements in common with F, and K but it has 3 elements common with D. So D doesnt appear with A.

E has a maximal 3 elements common with both H and I. So it appears with both. J doesnt meet the minimum intersection cardinality of 3 with any. So it remains solo.

Out:
Code:
4 A v1,v2,v3,v4
4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4


4 B v6,v7,v1,v2
4 C v7,v1,v2,v5,v6


3 D v3,v2,v5,v4
3 G v2,v5,v8,v4

3 E v8,v4,v6,v10,v11
3 H v1,v8,v4,v6

3 E v8,v4,v6,v10,v11
3 I v6,v10,v11

1 J v10,v11,v12



I tried

Code:
awk '{n=split($2,a,","); 
      for (i=1;i<=n;i++) b[$1,i]=a[i]}' ;
      next;
      }
      $1 in b
      {for (i in n)
      	{if (b[$1,i]==$b[$1,i-1])
      	cnt[$1]++
      }
      }
      END { for (i in n)
          if (cnt[$1] > 3)
              print cnt,$0
            
      }' file

# 2  
Old 03-30-2015
How about this:

Code:
awk '
function cardinality(a,b,i,j,x,y) {
    x=split(a,av,",")
    y=split(b,bv,",")

    ret=0
    for(i=1; i<=x;i++)
       for(j=1;j<=y;j++)
          if(av[i]==bv[j]) ret++
    return ret
}
FNR==NR { sets=NR; S[NR]=$1; V[NR]=$2; next }
FNR in S {   max_card=0
    for(i=FNR+1; i<=sets; i++) {
       if(i in S) {
           c = cardinality(V[FNR], V[i])
           if(c == max_card) card_list=card_list","i
           if( c > max_card) {
               card_list=i
               max_card=c
           }
        }
    }
    if(max_card) {
        h=split(card_list,out,",")
        printf "%d %s %s\n", max_card, S[FNR], V[FNR]
        for(i=1;i<=h;i++) {
           printf "%d %s %s\n", max_card, S[out[i]], V[out[i]]
           delete S[out[i]]
        }
        printf "\n"
    } else printf "1 %s %s\n", S[FNR], V[FNR]
}' infile infile

output:

Code:
4 A v1,v2,v3,v4
4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4

4 B v6,v7,v1,v2
4 C v7,v1,v2,v5,v6

3 D v3,v2,v5,v4
3 G v2,v5,v8,v4

3 E v8,v4,v6,v10,v11
3 H v1,v8,v4,v6
3 I v6,v10,v11

1 J v10,v11,v12

This User Gave Thanks to Chubler_XL For This Post:
# 3  
Old 03-31-2015
Thanks, I`ll run this now
# 4  
Old 03-31-2015
I'm still a little confused about this output.

In your example H has a 4 cardinality with F so why isn't the required output:

Code:
4 A v1,v2,v3,v4
4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4

4 B v6,v7,v1,v2
4 C v7,v1,v2,v5,v6

3 D v3,v2,v5,v4
3 G v2,v5,v8,v4

3 E v8,v4,v6,v10,v11
3 I v6,v10,v11

4 F v3,v2,v1,v4,v6,v8
4 H v1,v8,v4,v6
4 K v1,v12,v2,v3,v4

1 J v10,v11,v12

This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 04-01-2015
My oversight, you are right... except the red row, because K doesnt have 4 common with that bunch..

Code:
4 A v1,v2,v3,v4
4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4

4 B v6,v7,v1,v2
4 C v7,v1,v2,v5,v6

3 D v3,v2,v5,v4
3 G v2,v5,v8,v4

3 E v8,v4,v6,v10,v11
3 I v6,v10,v11

4 F v3,v2,v1,v4,v6,v8
4 H v1,v8,v4,v6
4 K v1,v12,v2,v3,v4

1 J v10,v11,v12

# 6  
Old 04-01-2015
OK, I think I see your full requirement now. This turned out to be a little more complex than I originally thought.

I a little concerned about performance but don't want to add any further complexity to this so please try with you big set and report if it need some further enhancements.

Code:
awk '
function cardinality(a,b,found,i,j,x,y) {
    x=split(a,av,",")
    y=split(b,bv,",")

    found=""
    card=0
    for(i=1; i<=x;i++)
       for(j=1;j<=y;j++)
          if(av[i]==bv[j]) {
              found = found ","  av[i]
              card++
          }
    return substr(found,2)
}
FNR==NR { sets=NR; S[NR]=$1; V[NR]=$2; next }
END {
  # Calculate cardinality for each possible set link
    for(i=1; i<=sets; i++) {
       for(j=i+1; j<=sets; j++) {
           cardinality(V[i], V[j])
           if(card>2) {
             CL[i,j]=card
             if(card > highC) highC=card
            }
        }
    }

  # Only keep highest cardinality link(s) for each set
    for(card=highC; card>2; card--)
        for(p in CL)
           if(p in CL && CL[p]==card) {
              split(p,vals,SUBSEP)
              if (!(vals[1] in MaxC)) MaxC[vals[1]]=card
              for(p=1;p<3;p++) {
                  j=vals[p]
                  for(i=1; i<j; i++)
                     if ((i SUBSEP j) in CL && CL[i,j]<card)
                         delete CL[i,j]
              }
           }

  # Print results
    for(i=1; i<=sets; i++) {
       if(i in MaxC) {
           more=1
           while(more) {
               # print each branch for a set
               printf "%d %s %s\n", MaxC[i], S[i], V[i]
               union=""
               for(j=i+1; j<=sets; j++) {
                   if(((i SUBSEP j) in CL) && MaxC[i]==CL[i,j]) {
                       if(!length(union)) {
                           more=0
                           union=cardinality(V[i],V[j])
                       }
                       if (cardinality(union, V[j]) == union) {
                           printf "%d %s %s\n", CL[i,j], S[j], V[j]
                           CL[i, j] = 0
                       }
                       else more++
                   }
               }
               printf "\n"
            }
       } else {
           used=0
           for(j=1; j<i; j++)
              if(j SUBSEP i in CL) used=1
           if (!used) printf "%d %s %s\n\n", 1, S[i], V[i]
       }
    }

}' infile

Example data produces:

Code:
4 A v1,v2,v3,v4
4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4

4 B v6,v7,v1,v2
4 C v7,v1,v2,v5,v6

3 D v3,v2,v5,v4
3 G v2,v5,v8,v4

3 E v8,v4,v6,v10,v11
3 I v6,v10,v11

4 F v3,v2,v1,v4,v6,v8
4 H v1,v8,v4,v6

4 F v3,v2,v1,v4,v6,v8
4 K v1,v12,v2,v3,v4

1 J v10,v11,v12

This User Gave Thanks to Chubler_XL For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

8 More Discussions You Might Find Interesting

1. Solaris

FSS and processor sets

I read somewhere which says """FSS can be assigned to processor sets, resulting in more sensitive control of priorities on a server than raw processor sets"" can any one tell me how we can assign FSS to processor set and how it works ? Thanx (2 Replies)
Discussion started by: fugitive
2 Replies

2. Shell Programming and Scripting

differentiating two sets for filenames????

set 1 ./abc@@/main/61 ./def.cpp@@/main/13 ./fgh.cpp@@/main/16 ./ijk.cpp@@/main/12 ./mln.cpp@@/main/9 ./uvw.cpp@@/main/30 set2 ./eww@@/main/61 ./def.cpp@@/main/13 ./xxx.cpp@@/main/26 ./kkk.cpp@@/main/72 ./qqq.cpp@@/main/19 ./fgh.cpp@@/main/16 I have two sets with filenames in... (13 Replies)
Discussion started by: skyineyes
13 Replies

3. Shell Programming and Scripting

differentiating two sets

Hi Suppose i have a set of files like this set1 a.cpp@@main/5 b.cpp@@main/6 set 2 m.cpp@@main/51 n.hpp@@main/51 a.cpp@@main/15 b.cpp@@main/2 there may be files with same name in 2 sets. i need to list the files in set1 which have last numeric field less than the same file in... (15 Replies)
Discussion started by: skyineyes
15 Replies

4. Shell Programming and Scripting

Files common in two sets ??? How to find ??

Suppose we have 2 set of files set 1 set 2 ------ ------ abc hgb def ppp mgh vvv nmk sdf hgb ... (1 Reply)
Discussion started by: skyineyes
1 Replies

5. Programming

How An Application Sets The Ip Options???

Hello Friends, I'm involved in test the UDP/IP source code. As you might be knowing, IPv4 provides several options: like Loose Source and Record Route (LSRR), Strict Source and Record Route (SSRR) etc. I wanted to test the above mentioned IP options. My strategy is to write a test application... (3 Replies)
Discussion started by: aamirglb
3 Replies

6. Virtualization and Cloud Computing

Clouds (Partially Order Sets) - Streams (Linearly Ordered Sets) - Part 2

timbass Sat, 28 Jul 2007 10:07:53 +0000 Originally posted in Yahoo! CEP-Interest Here is my follow-up note on posets (partially ordered sets) and tosets (totally or linearly ordered sets) as background set theory for event processing, and in particular CEP and ESP. In my last note, we... (0 Replies)
Discussion started by: Linux Bot
0 Replies

7. Shell Programming and Scripting

Character Sets

Hi I was just wondering if there was a way in which i could find out the character set used in a file in HP-UX. ie Whether it is Unicode, UTF-8,ascii etc. Regards (3 Replies)
Discussion started by: PradeepRed
3 Replies

8. UNIX for Advanced & Expert Users

FILE SETS in unix

Hi all, Pls. let me know whether there is any concept called "FILE SETS" in unix? Because, I am using ETL tool DataStage which creates FILE SETS. While I am able to view the data of such a file set in the tool, the "cat" command on this FILESET lists only the Metadata and not the data content... (2 Replies)
Discussion started by: Aparna_A
2 Replies
Login or Register to Ask a Question