Awk: conversion of matrix formats


 
# 1  
Old 03-14-2013
Hello,

I need a fast awk script to convert a network file from 'sif' format to 'adjacency' format:

sif (pp just means protein-protein interaction):
A pp B
A pp C
B pp D
D pp E

into an n x n adjacency matrix:
Code:
 
  A B C D E
A 0 1 1 0 0
B 1 0 0 1 0
C 1 0 0 0 0
D 0 1 0 0 1
E 0 0 0 1 0

my idea:

go through all rows and build two indexed arrays (if array-names taken from the input file, i.e. $1, are allowed - i think this is called name substitution):
Code:
 
names[$1]=dummy
names[$3]=dummy
$1[$3] = 1
$3[$1] = 1

then loop over all array-names for (i in names) to write the column headers.

then loop nested two times over all array-names for (j in names); for (k in names) and write "1" if j[k] is 1 else "0". (I hope indices are always sorted the same way).


do you think this could work? and perhaps you can provide some code drafts (I am rather untrained in awk).

if substitution for array names doesn't work, perhaps 'two dimensional' arrays would work?
Code:
 
names($1)=dummy
names($3)=dummy
pp($1,$3)=1
pp($3,$1)=1

the rest as above, loop two times over name indices and check if pp(j,k) is 1.

thank you very much...

dietmar
# 2  
Old 03-14-2013
Are names always a single character?
If not, are names always the same length?
If not, is there a maximum length you want to consider, or should the output column widths adjust to the names found in the input?
How big is the array (i.e., number of different names)?
# 3  
Old 03-14-2013
Your thoughts were quite easy to translate into awk speak, at least for the simple sample that you gave:
Code:
awk     '!HD[$1]        {HD[$1]++}
         !HD[$3]        {HD[$3]++}
                        {PP[$1,$3] = 1
                         PP[$3,$1] = 1
                        }
         END            {printf "  "
                         for (i in HD)   printf "%s ", i; printf "\n"
                         for (i in HD)  {printf "%s ", i
                                         for (j in HD) printf "%s ", PP[i,j]?PP[i,j]:"0"
                                         printf "\n"
                                        }
                        }
        ' file
  A B C D E 
A 0 1 1 0 0 
B 1 0 0 1 0 
C 1 0 0 0 0 
D 0 1 0 0 1 
E 0 0 0 1 0

If Don Cragun's suspicions come true, this might need to be seriously reworked, though.
Also, (i in HD) supplies the names in arbitrary order, especially with large files; it is by sheer luck that they come out sorted in the case above. So in the end you may need to add some sorting code on top.
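
A minimal sketch of what such sorting code could look like, assuming a POSIX awk and whitespace-separated input as in the sample. The function name sif2adj_sorted and the insertion sort are illustrative choices, not something posted in this thread. The names are copied into a numerically indexed array, sorted there, and the output loops walk that array instead of using (i in HD):

```shell
# Hypothetical helper: adjacency matrix with names in sorted order.
sif2adj_sorted() {
    awk '
        { HD[$1]; HD[$3]; PP[$1,$3] = PP[$3,$1] = 1 }
        END {
            # copy the names into a numerically indexed array
            n = 0
            for (name in HD) sorted[++n] = name
            # plain insertion sort using string comparison
            for (a = 2; a <= n; a++) {
                v = sorted[a]
                for (b = a - 1; b >= 1 && (sorted[b] "") > (v ""); b--)
                    sorted[b + 1] = sorted[b]
                sorted[b + 1] = v
            }
            # header line, then one row per name, all in the same order
            for (a = 1; a <= n; a++) printf "\t%s", sorted[a]
            printf "\n"
            for (a = 1; a <= n; a++) {
                printf "%s", sorted[a]
                for (b = 1; b <= n; b++)
                    printf "\t%d", ((sorted[a], sorted[b]) in PP) ? 1 : 0
                printf "\n"
            }
        }
    ' "$@"
}
```

It would be called as sif2adj_sorted file > file.adj. The insertion sort is quadratic, so for a large number of names it is probably better to sort the name list with an external sort and read it back.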
# 4  
Old 03-15-2013
thank you RudiC and Don Cragun

names are of any length
output should be tab delimited
there are up to millions of names
(I should have clarified this)

I think RudiC's script is nearly perfect. Only the space delimiter has to be changed to a tab (perhaps you could change it, otherwise I will try: if I am right, I only have to remove the "%s " part and set OFS to tab).

and just out of curiosity:
why check for !HD[$1]? If $1 is already in use, nothing happens, since the index is already set. Is the ++ only there to touch the array (to fill in the indices), while the number stored for each index is never used?

sorting is no problem, if they come for all three loops in the same order!

thank you very much.

dietmar

---------- Post updated at 02:39 AM ---------- Previous update was at 01:34 AM ----------

now the script works with one exception:

Code:
#!/bin/bash

fn=$1
fname=${fn%.*}
echo $fname

awk 'BEGIN {FS="\t"};
    NF >= 3    
    {HD[$1]++; HD[$3]++; PP[$1,$3] = 1; PP[$3,$1] = 1 }
    END    {printf "\t"
        for (i in HD) { printf "%s\t" ,i } printf "\n"
        for (i in HD) {printf "%s\t", i ;
            for (j in HD) { printf "%s\t", PP[i,j]?PP[i,j]:"0" } ;
            printf "\n" } }' $fn > $fname.adj

BUT: I get the complete input file in front of my matrix output file, and I don't see why this happens...
# 5  
Old 03-15-2013
You are absolutely right - the !HD check is redundant; I used it out of sheer habit to keep HD at the logic levels 1 and 0. Try this simplified version with <TAB>s as separators; setting OFS will not work here, as OFS only affects print, not printf:
Code:
awk     '               {HD[$1]++
                         HD[$3]++
                         PP[$1,$3] = PP[$3,$1] = 1
                        }
         END            {printf "\t"
                         for (i in HD)   printf "%s\t", i; printf "\n"
                         for (i in HD)  {printf "%s\t", i
                                         for (j in HD) printf "%s\t", PP[i,j]?PP[i,j]:"0"
                                         printf "\n"
                                        }
                        }
        ' file
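
To illustrate the OFS remark: OFS is inserted between the comma-separated arguments of print, while printf writes exactly what its format string says. A small sketch (the function name ofs_demo and the sample strings are made up for the demonstration):

```shell
ofs_demo() {
    awk 'BEGIN {
        OFS = "\t"
        print  "A", "B"             # OFS is used:    A<TAB>B
        printf "%s %s\n", "A", "B"  # OFS is ignored: A<space>B
    }'
}
ofs_demo
```

So with printf the tab has to go into the format string itself, as in the script above.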

---------- Post updated at 08:48 ---------- Previous update was at 08:41 ----------

Quote:
Originally Posted by dietmar13
. . . now the script works with one exception:

Code:
. . .     NF >= 3   . . .

BUT: I get the complete input file in front of my matrix output file, and I don't see why this happens...
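That's because the pattern NF >= 3 stands on a line of its own, separated by a newline from what you intended to be its action. awk therefore gives the pattern the default action, which is print $0, and treats the { ... } block on the next line as a separate rule without a pattern, so that block runs for every input line.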
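
A small sketch of the difference, using a made-up two-line input (one line with three fields, one with a single field); the function names are only for illustration:

```shell
# Pattern and action on separate lines: two independent rules.
newline_demo() {
    printf 'A pp B\nX\n' | awk '
        NF >= 3
        { seen++ }
        END { print "block ran for " seen " lines" }'
}

# Opening brace on the same line as the pattern: one rule.
same_line_demo() {
    printf 'A pp B\nX\n' | awk '
        NF >= 3 { seen++ }
        END { print "block ran for " seen " lines" }'
}

newline_demo     # prints "A pp B", then "block ran for 2 lines"
same_line_demo   # prints "block ran for 1 lines"
```

The opening brace of an action has to start on the same line as its pattern; line breaks inside the braces are fine.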
# 6  
Old 03-15-2013
RudiC,

thank you - now it works perfectly.

what a newline does in awk is new to me...
# 7  
Old 03-15-2013
Quote:
Originally Posted by dietmar13
names are of any length
output should be tab delimited
there are up to millions of names
(I should have clarified this)
Hi dietmar,
If any names are more than 7 characters long (assuming tab stops set at every 8 column positions), your output headings won't line up with the data values. If this is a concern to you, you could change the script to print the top headings vertically instead of horizontally and adjust the printing of the row headings to make the 1st column in your output be the width of the longest name.
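
A rough sketch of that layout, assuming a POSIX awk; the function name sif2adj_vertical is hypothetical, and the columns are separated by single spaces here rather than tabs so that the alignment does not depend on tab stops:

```shell
# Hypothetical helper: vertical column headings, padded row headings.
sif2adj_vertical() {
    awk '
        { HD[$1]; HD[$3]; PP[$1,$3] = PP[$3,$1] = 1 }
        END {
            # fix one order for rows and columns, find the longest name
            n = 0; w = 0
            for (name in HD) {
                col[++n] = name
                if (length(name) > w) w = length(name)
            }
            pad = sprintf("%" w "s", "")   # w blanks over the row headings
            rowfmt = "%-" w "s"            # left-justify row names to width w
            # heading line r holds the r-th character of every name
            for (r = 1; r <= w; r++) {
                printf "%s", pad
                for (c = 1; c <= n; c++) {
                    ch = substr(col[c], r, 1)
                    printf " %s", (ch == "" ? " " : ch)
                }
                printf "\n"
            }
            for (r = 1; r <= n; r++) {
                printf rowfmt, col[r]
                for (c = 1; c <= n; c++)
                    printf " %d", ((col[r], col[c]) in PP) ? 1 : 0
                printf "\n"
            }
        }
    ' "$@"
}
```

Each name is then read top to bottom above its column, and every data row starts at the same character position regardless of name length.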

Furthermore, with millions of rows and columns, the output produced will not be a text file in the POSIX sense (its lines will be longer than LINE_MAX bytes), so you will be restricted in the number of utilities you can use to post-process your output.
Quote:
Originally Posted by dietmar13
... ... ...
RudiC has already commented on the part removed above.
Quote:
Originally Posted by dietmar13
. . . now the script works with one exception:

Code:
. . .     NF >= 3   . . .

BUT: I get the complete input file in front of my matrix output file, and I don't see why this happens...
Assuming that there are no empty or blank lines in your input files, that column and row headings are less than 8 characters long (or you don't care about column alignment), and that you don't want the extra tab at the end of each line that your script currently produces, you could also try this slightly simplified script:
Code:
#!/bin/bash
fn=$1
fname=${fn%.*}          # input name without its last extension
echo "$fname"

# Merely referencing HD[x] creates the index; no value is needed.
awk '{  HD[$1]; HD[$3]; PP[$1,$3] = PP[$3,$1] = 1}
END {   for (i in HD) printf "\t%s", i
        printf "\n"
        for (i in HD) {
                printf "%s", i
                for (j in HD) printf "\t%d", PP[i,j]?1:0
                printf "\n"
        }
}' "$fn" > "$fname.adj"

which puts:
Code:
	A	B	C	D	E
A	0	1	1	0	0
B	1	0	0	1	0
C	1	0	0	0	0
D	0	1	0	0	1
E	0	0	0	1	0

in your output file when the file named by $1 contains the input given in the 1st message in this thread. (Again as RudiC stated, the order of rows and columns may vary, but the row headings and the column headings should be in the same order.)
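
One detail that may matter with very many names: a plain reference such as PP[i,j] creates the element if it does not exist yet, so the output loops end up storing an entry for every pair of names. The membership test (i,j) in PP gives the same answer without creating anything. A small sketch of the difference (the function name and the keys are made up):

```shell
count_after_lookup() {
    awk 'BEGIN {
        PP["A", "B"] = 1
        n0 = 0; for (k in PP) n0++
        if (PP["A", "C"]) hit = 1        # plain reference creates PP["A","C"]
        n1 = 0; for (k in PP) n1++
        if (("A", "D") in PP) hit = 1    # membership test creates nothing
        n2 = 0; for (k in PP) n2++
        print n0, n1, n2
    }'
}
count_after_lookup   # prints: 1 2 2
```

In the script above that would mean writing printf "\t%d", ((i,j) in PP) ? 1 : 0 in the inner loop.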

If you wanted to run this on a Solaris/SunOS system, you would need to use /usr/xpg4/bin/awk or nawk instead of awk.

Last edited by Don Cragun; 03-15-2013 at 08:35 AM.. Reason: s/empty/empty or blank/