Venn Data Maker


 
# 1  
Old 08-18-2016

Hi,

My input looks like this:

Code:
head input.txt
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8


My output file should look like this:

Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Logic
1. First, collect all unique gene names (g1, g2, ..., g8).
2. Then check, for each gene, which columns (sets) of the input file it appears in.
3. Print 1 for every column where the gene is present and 0 where it is absent.

Special Notes
1. Any of the columns (Set1, Set2, Set3) in input.txt can have missing values (see the last two records in input.txt).
2. The number of columns is not always three. My actual input file has 7, so the column count should be handled dynamically.

Thanks
# 2  
Old 08-18-2016
An approach using gawk:
Code:
gawk -F, '
        NR == 1  {
                print "Name," $0
                C = NF                  # number of sets, taken from the header
        }
        NR > 1 {
                for ( i = 1; i <= C; i++ )
                {
                        if ( $i != "" )
                        {
                                T[$i]           # every unique gene name
                                R[$i FS i]      # gene seen in column i
                        }
                }
        }
        END {
                n = asorti(T)           # T[1..n] now holds the sorted names
                for ( i = 1; i <= n; i++ )
                {
                        for ( j = 1; j <= C; j++ )
                        {
                                if ( ( T[i] FS j ) in R )
                                        S = S ? S FS 1 : T[i] FS 1
                                else
                                        S = S ? S FS 0 : T[i] FS 0
                        }
                        print S
                        S = ""
                }
        }
' file

# 3  
Old 08-18-2016
Hello jacobs.smith,

If the order of the names in the first field does not matter to you, the following may help.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}'  Input_file

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0

If you need the output sorted by name, as in your expected output, the following may help. The header line is read and printed first so that sort cannot move it, whatever the locale.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}' Input_file | { IFS= read -r header; printf '%s\n' "$header"; sort; }

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

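EDIT: Adding a non-one-liner form of the solutions here.
Solution1:
Code:
awk -F, '
NR == 1 {                               # header: print it and remember the number of sets
        print "Name," $0
        R = NF
}
NR > 1 {
        for (i = 1; i <= NF; i++) {
                A[$i, i]++              # how often this name occurs in column i
                if ($i) {
                        C[$i]           # every non-empty name
                }
        }
}
END {
        for (i in C) {                  # traversal order is unspecified
                for (j = 1; j <= R; j++) {
                        Q = (Q ? Q : i) FS (A[i, j] >= 1 ? 1 : 0)
                }
                print Q
                Q = ""
        }
}
' Input_file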

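Solution2:
Code:
awk -F, '
NR == 1 {                               # header: print it and remember the number of sets
        print "Name," $0
        R = NF
}
NR > 1 {
        for (i = 1; i <= NF; i++) {
                A[$i, i]++              # how often this name occurs in column i
                if ($i) {
                        C[$i]           # every non-empty name
                }
        }
}
END {
        for (i in C) {
                for (j = 1; j <= R; j++) {
                        Q = (Q ? Q : i) FS (A[i, j] >= 1 ? 1 : 0)
                }
                print Q
                Q = ""
        }
}
' Input_file | { IFS= read -r header; printf '%s\n' "$header"; sort; }   # keep the header first, sort the rest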

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-18-2016 at 02:59 PM.. Reason: Added non-one liner form of solutions now.
# 4  
Old 08-18-2016
Try also
Code:
awk '
NR==1   {print "Name", $0
         n = NF                         # number of sets, taken from the header
         next
        }
        {for (i=1; i<=n; i++)   {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {printf "%s", t
                         for (i=1; i<=n; i++) printf "%s%d", OFS, R[t,i]+0
                         printf "\n"
                        }
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

# 5  
Old 08-19-2016
Thank you folks - @Yoda, @RavinderSingh13 and @RudiC.

Can I also get the intersection list into another file?

Code:
Intersectionlist.txt
Set1_unique=2
Set2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

Thanks
# 6  
Old 08-19-2016
Do we need to guess what an "intersection" is? Any attempt from your side?
# 7  
Old 08-19-2016
Quote:
Originally Posted by RudiC
Do we need to guess what an "intersection" is? Any attempt from your side?

My apologies, RudiC.

A value of "1" for a set means the gene is present in that set, so it should be counted.

A value of "0" for a set means the gene is absent from that set, so it should not be counted.

Ex:

Code:
Name, set1, set2, set3
g1,0,0,1
g2,0,0,1
g3,1,1,0

g1 and g2 are present only in set3, so set3_unique=2.

g3 is present in both set1 and set2, so set12_common=1.

Please ask me more questions and I will be glad to reply.

Also, the number of lines in Intersectionlist.txt should be equal to (2^(number of sets)) - 1.

Thanks.
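
The counting described above could be sketched in awk roughly as follows. This is only a minimal sketch under a few assumptions, not a tested solution for the real data: the 0/1 matrix from the earlier posts is assumed to be saved as venn_matrix.csv (a hypothetical file name), each gene is counted only for the exact combination of sets it belongs to (this reading reproduces the counts requested for Intersectionlist.txt), and set numbers are single digits, so it is limited to at most 9 sets. The helper name label is made up for this example.

```shell
#!/bin/sh
# Sample 0/1 matrix as produced by the solutions earlier in the thread.
# The file name venn_matrix.csv is an assumption made for this sketch.
cat > venn_matrix.csv <<'EOF'
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
EOF

awk -F, '
# Turn a combination number into its set list, with Set1 as the highest bit.
# Example with n=3: mask 6 gives "12", mask 1 gives "3".
function label(mask, n,    j, bit, s) {
        s = ""
        bit = 2 ^ (n - 1)
        for (j = 1; j <= n; j++) {
                if (mask >= bit) { s = s j; mask -= bit }
                bit /= 2
        }
        return s
}
NR == 1 { n = NF - 1; next }            # number of sets, taken from the header
{
        key = ""                        # exact combination of sets for this gene
        for (j = 1; j <= n; j++)
                if ($(j + 1) == 1) key = key j
        if (key != "") count[key]++
}
END {
        total = 2 ^ n - 1               # number of non-empty combinations
        for (size = 1; size <= n; size++)
                for (mask = total; mask >= 1; mask--) {
                        key = label(mask, n)
                        if (length(key) == size)
                                printf "Set%s_%s=%d\n", key, (size == 1 ? "unique" : "common"), count[key]
                }
}
' venn_matrix.csv > Intersectionlist.txt
```

Every one of the (2^(number of sets)) - 1 combinations is printed, including those with a count of 0. Single sets come first, then pairs, and so on. For more than 9 sets the labels would need a separator between set numbers.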