Venn Data Maker


# 1  

Hi,

My input is like this

Code:
head input.txt
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8


My output file should be

Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Logic
1. First collect all unique gene names (g1, g2, ..., g8).
2. Then check, for each gene, whether it is present in each column of the input file.
3. If it is present in that column, print 1. If it is absent, print 0.

Special Notes
1. Please note that any of the columns (Set1, Set2, Set3) in input.txt can have missing values (see the last two records in input.txt).
2. There are not always three columns. My actual input file has 7, so I want the column count to be dynamic.

Thanks
This User Gave Thanks to jacobs.smith For This Post:
# 2  
An approach using gawk:-
Code:
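gawk -F, '
        NR == 1  {
                print "Name," $0
                nf = NF                 # column count comes from the header
        }
        NR > 1 {
                for ( i = 1; i <= nf; i++ )
                {
                        if ( $i != "" )
                        {
                                T[$i]           # set of unique gene names
                                R[$i FS i]      # gene $i was seen in column i
                        }
                }
        }
        END {
                n = asorti(T)           # T[1..n] now holds the sorted names
                for ( i = 1; i <= n; i++ )
                {
                        # use nf, not NF: a trailing comma on the last
                        # input line would otherwise add a bogus column
                        for ( j = 1; j <= nf; j++ )
                        {
                                if ( ( T[i] FS j ) in R )
                                        S = S ? S FS 1 : T[i] FS 1
                                else
                                        S = S ? S FS 0 : T[i] FS 0
                        }
                        print S
                        S = ""
                }
        }
' file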

This User Gave Thanks to Yoda For This Post:
# 3  
Hello jacobs.smith,

If the order of the names in the first field does not matter to you, the following may help.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}'  Input_file

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0

If you need the output in sorted order, as in your expected output, pipe the result through sort.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}' Input_file  | sort -k1

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

EDIT: Adding multi-line forms of both solutions here.
Solution1:
Code:
awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                        A[$i,i]++;
                        if($i){
                                C[$i]
                        }
                }
              }
         END  {
                for(i in C){
                        for(j=1;j<=R;j++){
                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)
                        }
                        print Q;
                        Q=""
                }
              }
         ' Input_file

Solution2:
Code:
awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                        A[$i,i]++;
                        if($i){
                                C[$i]
                        }
                }
              }
         END  {
                for(i in C){
                        for(j=1;j<=R;j++){
                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)
                        }
                        print Q;
                        Q=""
                }
              }
         ' Input_file  | sort -k1

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-18-2016 at 02:59 PM.. Reason: Added non-one liner form of solutions now.
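
A note on the sort step above: sort -k1 sorts the header line together with the data, so whether the Name,... line stays on top depends on the locale's collation rules. A minimal sketch that keeps the header in place regardless, assuming the unsorted awk output has been saved to a file (unsorted.csv and sorted.csv are hypothetical names used only for this illustration):

```shell
#!/bin/sh
# Unsorted result as shown earlier in this post (hypothetical file name).
cat > unsorted.csv <<'EOF'
Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
EOF

# Print the first line untouched, then sort only the remaining lines
# on the first comma-separated field.
{
        IFS= read -r header
        printf '%s\n' "$header"
        sort -t, -k1,1
} < unsorted.csv > sorted.csv

cat sorted.csv
```

In a pipeline, the same braces can be placed directly after the awk command instead of reading from a file.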
This User Gave Thanks to RavinderSingh13 For This Post:
# 4  
Try also
Code:
awk '
NR==1   {print "Name", $0
         n = NF                         # column count comes from the header
         next
        }
        {for (i=1; i<=n; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {printf "%s", t
                         for (i=1; i<=n; i++) printf "%s%d", OFS, R[t,i]+0
                         print ""
                        }
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

# 5  
Thank you folks - @Yoda, @RavinderSingh13 and @RudiC.

Can I also get the intersection list into another file?

Code:
Intersectionlist.txt
Set1_unique=2
Set2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

Thanks
# 6  
Do we need to guess what an "intersection" is? Any attempt from your side?
# 7  
Quote:
Originally Posted by RudiC
Do we need to guess what an "intersection" is? Any attempt from your side?

My apologies, RudiC.

A "1" in a set's column means the gene is present in that set and should be counted.

A "0" in a set's column means the gene is absent from that set and should not be counted.

Ex:

Code:
Name, set1, set2, set3
g1,0,0,1
g2,0,0,1
g3,1,1,0

g1 and g2 are present only in set3, so set3_unique=2.

g3 is present in both set1 and set2, so set12_common=1.

Please ask me more questions and I will be glad to reply.

Also, the number of lines in Intersectionlist.txt should equal 2^(number of sets) - 1.

Thanks.
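
The counting described in this post can be sketched in plain awk. This is only one possible reading of the request, not a solution posted in the thread. It assumes the 0/1 matrix from the first post has been saved as matrix.csv (a hypothetical file name), that each gene is counted under exactly the combination of sets it belongs to, and that the set numbers are single digits, so that a label like Set12 stays unambiguous.

```shell
#!/bin/sh
# Presence/absence matrix from the first post (hypothetical file name).
cat > matrix.csv <<'EOF'
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
EOF

awk -F, '
NR == 1 { n = NF - 1; next }            # number of sets, taken from the header
{
        key = ""
        for (j = 1; j <= n; j++)        # key = numbers of the sets holding this gene
                if ($(j + 1) == 1) key = key j
        if (key != "") count[key]++
}
END {
        total = 2 ^ n - 1               # number of non-empty combinations
        for (size = 1; size <= n; size++)               # smaller combinations first
                for (mask = total; mask >= 1; mask--) { # set 1 is the highest bit
                        key = ""; bits = 0
                        for (j = 1; j <= n; j++)
                                if (int(mask / 2 ^ (n - j)) % 2) { key = key j; bits++ }
                        if (bits != size) continue
                        suffix = (size == 1) ? "_unique" : "_common"
                        printf "Set%s%s=%d\n", key, suffix, count[key]
                }
}
' matrix.csv > Intersectionlist.txt

cat Intersectionlist.txt
```

Walking over every combination in END, rather than only over the keys that occurred, is what produces the zero lines such as Set13_common=0 and keeps the line count at 2^(number of sets) - 1.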