Venn Data Maker

# 1  
Old 08-18-2016
Hi,

My input is like this

Code:
head input.txt
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8


My output file should be

Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

Logic
1. First, collect all unique gene names (g1, g2, ..., g8).
2. Then check whether each gene is present in each column of the input file.
3. If it is present, print 1. If it is absent, print 0.

Special Notes
1. Please note that each of the columns (Set1, Set2, Set3) in input.txt can have missing values (see the last two records in input.txt).
2. The number of columns is not always three. My actual input file has 7, so I want the column count to be dynamic.
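
The three logic steps and the two notes above can be sketched in portable awk roughly as follows. This is only an illustration under stated assumptions: the function name venn_matrix is made up for this example, the header row is assumed to carry no trailing comma (so its field count is the number of sets), and names are printed in the order they first appear rather than sorted.

```shell
# venn_matrix: hypothetical helper name (not from the thread).
# Reads the comma-separated sets on stdin, writes the 0/1 matrix on stdout.
venn_matrix() {
    awk -F, '
    NR == 1 { nc = NF; print "Name," $0; next }     # header fixes the column count
    {
        for (i = 1; i <= nc; i++) {
            if ($i == "") continue                  # missing value: nothing to record
            if (!($i in seen)) {                    # step 1: collect unique names
                seen[$i]
                order[++n] = $i
            }
            hit[$i, i] = 1                          # step 2: name occurs in column i
        }
    }
    END {
        for (k = 1; k <= n; k++) {
            line = order[k]
            for (i = 1; i <= nc; i++)               # step 3: 1 if present, else 0
                line = line FS (((order[k], i) in hit) ? 1 : 0)
            print line
        }
    }'
}

# Run it on the sample input from this post:
venn_matrix <<'EOF'
Set1,Set2,Set3
g1,g2,g3
g2,g1,g3,
g4,g5,g5
g1,g1,g1,
g2,g1,g1,
g6,g7,g8
,g7,g8
,,g8
EOF
```

Because the column count is taken from the header, the same function should work unchanged for an input with 7 sets. Trailing commas on data rows are ignored, since only the first nc fields are examined.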

Thanks
This User Gave Thanks to jacobs.smith For This Post:
RavinderSingh13 (08-19-2016)
# 2  
Old 08-18-2016
An approach using gawk:-
Code:
gawk -F, '
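        NR == 1  {
                print "Name," $0
                nc = NF         # remember the column count from the header
        }
        NR > 1 {
                for ( i = 1; i <= NF; i++ )
                {
                        if ( $i )
                                T[$i]
                        R[$i FS i]
                }
        }
        END {
                n = asorti(T)
                for ( i = 1; i <= n; i++ )
                {
                        # use nc here: NF in END reflects only the last record
                        for ( j = 1; j <= nc; j++ )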
                        {
                                if ( ( T[i] FS j ) in R )
                                        S = S ? S FS 1 : T[i] FS 1
                                else
                                        S = S ? S FS 0 : T[i] FS 0
                        }
                        print S
                        S = ""
                }
        }
' file

This User Gave Thanks to Yoda For This Post:
jacobs.smith (08-18-2016)
# 3  
Old 08-18-2016
Hello jacobs.smith,

If the order of the names in the first field does not matter to you, the following may help.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}'  Input_file

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0

If you need the output in sorted order, as in your expected output, piping the same command through sort may help.
Code:
awk -F, 'NR==1{print "Name," $0;R=NF} NR>1{for(i=1;i<=NF;i++){A[$i,i]++;if($i){C[$i]}}} END{for(i in C){for(j=1;j<=R;j++){Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)};print Q;Q=""}}' Input_file  | sort -k1

Output will be as follows.
Code:
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
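
One caveat with piping everything through sort: the header line is sorted together with the data. In the C locale "Name" sorts before the lowercase gene names, but in other locales it may end up at the bottom. A minimal sketch of one way around this, assuming a POSIX shell; the function name sort_body is made up for this example and is not part of the solution above.

```shell
# sort_body: hypothetical helper name (not from the thread).
# Passes the first line (the header) through untouched and sorts the rest.
sort_body() {
    IFS= read -r header && printf '%s\n' "$header"
    LC_ALL=C sort -t, -k1,1     # sort remaining lines on the first comma field
}

# Example with a few unsorted matrix lines:
printf '%s\n' 'Name,Set1,Set2,Set3' 'g5,0,1,1' 'g1,1,1,1' 'g3,0,0,1' | sort_body
```

The awk one-liner would then be piped into sort_body instead of sort -k1.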

EDIT: Adding a non-one liner form of solutions here.
Solution1:
Code:
awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                        A[$i,i]++;
                        if($i){
                                C[$i]
                        }
                }
              }
         END  {
                for(i in C){
                        for(j=1;j<=R;j++){
                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)
                        }
                        print Q;
                        Q=""
                }
              }
         ' Input_file

Solution2:
Code:
awk -F, 'NR==1{
                print "Name," $0;
                R=NF
              }
         NR>1 {
                for(i=1;i<=NF;i++){
                        A[$i,i]++;
                        if($i){
                                C[$i]
                        }
                }
              }
         END  {
                for(i in C){
                        for(j=1;j<=R;j++){
                                Q=Q?Q FS (A[i,j]=A[i,j]>=1?1:0):i FS  (A[i,j]=A[i,j]>=1?1:0)
                        }
                        print Q;
                        Q=""
                }
              }
         ' Input_file  | sort -k1

Thanks,
R. Singh

Last edited by RavinderSingh13; 08-18-2016 at 02:59 PM.. Reason: Added non-one liner form of solutions now.
This User Gave Thanks to RavinderSingh13 For This Post:
jacobs.smith (08-18-2016)
# 4  
Old 08-18-2016
Try also
Code:
awk '
NR==1   {print "Name", $0
         nc = NF                        # column count taken from the header
         next
        }
        {for (i=1; i<=nc; i++)  {T[$i]
                                 R[$i,i] = 1
                                }
        }
END     {delete T[""]
         for (t in T)   {printf "%s", t
                         for (i=1; i<=nc; i++) printf "%s%d", OFS, R[t,i]+0
                         print ""
                        }
        }
' FS=, OFS=, file
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1

# 5  
Old 08-19-2016
Thank you folks - @Yoda, @RavinderSingh13 and @RudiC.

Can I also get the intersection list into another file?

Code:
Intersectionlist.txt
Set1_unique=2
Set2_unique=1
Set3_unique=2
Set12_common=1
Set13_common=0
Set23_common=1
Set123_common=1

Thanks
# 6  
Old 08-19-2016
Do we need to guess what an "intersection" is? Any attempt from your side?
# 7  
Old 08-19-2016
Quote:
Originally Posted by RudiC
Do we need to guess what an "intersection" is? Any attempt from your side?

My apologies, RudiC.

If the value is "1" for a set, the gene is present in that set and should be counted.

If the value is "0" for a set, the gene is absent from that set and should not be counted.

Ex:

Code:
Name, set1, set2, set3
g1,0,0,1
g2,0,0,1
g3,1,1,0

g1 and g2 are present only in set3, so set3_unique=2.

g3 is present in both set1 and set2, so set12_common=1.

Please ask me more questions and I will be glad to reply.

Also, the number of lines in Intersectionlist.txt should equal (2^(number of sets))-1, which is 7 for three sets.

Thanks.
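
A minimal sketch of one way to build such a list from the 0/1 matrix, assuming "common" means present in exactly the named sets and in no others (which is what the sample counts in post #5 imply). The function name intersection_list is made up for this example, and the labels are built from column positions (Set1, Set12, ...), so they become ambiguous with ten or more sets.

```shell
# intersection_list: hypothetical helper name (not from the thread).
# Reads the Name,Set1,...,SetN 0/1 matrix on stdin and prints one line
# per non-empty combination of sets, i.e. 2^N - 1 lines.
intersection_list() {
    awk -F, '
    NR == 1 { nc = NF - 1; next }               # header: Name plus nc set columns
    {
        key = ""
        for (i = 2; i <= nc + 1; i++)           # membership pattern, e.g. 110
            key = key ($i + 0 ? 1 : 0)
        count[key]++                            # genes sharing this exact pattern
    }
    END {
        total = 2 ^ nc - 1
        for (size = 1; size <= nc; size++)      # singles first, then pairs, ...
            for (m = total; m >= 1; m--) {
                key = ""; label = "Set"; bits = 0
                for (i = 1; i <= nc; i++) {
                    b = int(m / 2 ^ (nc - i)) % 2   # column 1 is the highest bit
                    key = key b
                    if (b) { label = label i; bits++ }
                }
                if (bits == size)
                    printf "%s%s=%d\n", label, (size == 1 ? "_unique" : "_common"), count[key]
            }
    }'
}

# Run it on the matrix from post #1:
intersection_list <<'EOF'
Name,Set1,Set2,Set3
g1,1,1,1
g2,1,1,0
g3,0,0,1
g4,1,0,0
g5,0,1,1
g6,1,0,0
g7,0,1,0
g8,0,0,1
EOF
```

Redirecting the output (for example intersection_list < matrix.csv > Intersectionlist.txt, where matrix.csv is a placeholder name) would write the list to a separate file.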
