Calculating correlations across columns in awk


 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Calculating correlations across columns in awk
# 1  
Old 11-21-2018
Calculating correlations across columns in awk

Hello,

I came across a previous thread "awk-calculating-simple-correlation-rows" which calculated correlations across rows in awk.

Code:
    Code:
    awk '{
      a = 0; for (i = 2; i <= NF; ++i) a += $i; a /= NF-1
      b = 0; for (i = 2; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)
      if (b <= 0) next
      for (i = 2; i <= NF; ++i) x[NR, i] = ($i - a) / b
      n[NR] = $1
      for (i = 2; i <= NR; ++i) {
        if (!(i in n)) continue
        a = 0
        for (k = 2; k <= NF; ++k)
          a += x[NR, k] * x[i, k]
        print n[NR], n[i], a
      }
    }'

I need something similar which can calculate correlations across columns and produce a similar output

Input file:
Code:
    
    Name C1 C2 C3 C4 C5 C6
    
    R1 1 2 3 4 5 6
    
    R2 2 1 1 0 1 0
    
    R3 1 3 1 1 2 1
    
    R4 1 1 0 2 0 1
    
    R5 1 2 2 2 0 2
    
    R6 1 1 0 1 2 0
    
    Desired Output
    
    C1 C1 1.00
    
    C1 C2 -0.4
    
    C1 C3 -0.069
    
    C1 C4 -0.597
    
    C1 C5 -0.175
    
    C1 C5 -0.362
    
    C2 C2 1.00
    
    C2 C3 0.4889
    
    etc.


Any help would be much appreciated.
# 2  
Old 11-21-2018
Please post the code you have written / attempted so far and any error messages you get when you run the file.

Thanks.
# 3  
Old 11-21-2018
At the moment i have got around this by transposing the file first and then running the existing code.

------ Post updated at 06:10 PM ------

Apologies i sent the reply before i had finished. The actual file is quite large (1.5 million rows x 50240 columns). If there was a solution without the need to transpose then that would be very useful.
# 4  
Old 11-21-2018
Code which doesn't do what you want isn't helpful in explaining what you do want -- especially without sample data.

Please show sample data input and output and explain how they're related.
# 5  
Old 11-21-2018
Apologies if unclear but the sample data was described in the initial post. I have added it again below.

Input file:
Code:
 Name C1 C2 C3 C4 C5 C6
    
    R1 1 2 3 4 5 6
    
    R2 2 1 1 0 1 0
    
    R3 1 3 1 1 2 1
    
    R4 1 1 0 2 0 1
    
    R5 1 2 2 2 0 2
    
    R6 1 1 0 1 2 0

Desired Output

Code:
   C1 C1 1.00
    
    C1 C2 -0.4
    
    C1 C3 -0.069
    
    C1 C4 -0.597
    
    C1 C5 -0.175
    
    C1 C5 -0.362
    
    C2 C2 1.00
    
    C2 C3 0.4889

The first row in the desired output (C1 C1 1) is the correlation between column 2 and column 2 which is a correlation of1 as it is the same data. The second row in the desired output (C1 C2 -0.4) is the correlation between column 2 and column 3 etc.

Last edited by Scrutinizer; 11-21-2018 at 05:25 PM.. Reason: Code Tags
# 6  
Old 11-22-2018
Apologies, I thought the data was old data.

By simple correlation you mean pearson's? And does your data file actually have the double-newlines and odd spacing shown?
# 7  
Old 11-22-2018
I've broken my code up into stdev and pearson parts for readability.

Code:
$ cat stdev.awk

function stdev_mean(TITLE) { return(DATA[TITLE,"T"]/DATA[TITLE,"C"]); }
function stdev_count(TITLE) { return(DATA[TITLE,"C"]-1); }

# Run first for data 1-n to get means
function stdev_pass1(TITLE, VAL) {
        DATA[TITLE,"T"] += VAL+0;
        DATA[TITLE,"C"] ++;
}

# Run second for data 1-n to get standard deviations
function stdev_pass2(TITLE,VAL,X) {
        X = stdev_mean(TITLE) - VAL+0;
        DATA[TITLE,"D"] += X*X;
}

# Final result after both passes
function stdev(TITLE) { return(sqrt(DATA[TITLE,"D"] / stdev_count(TITLE)));}

$ cat pearson.awk

# Integrate titles and skip first line
FNR==1 {        for(N=2; N<=NF; N++) COL[N]=$N ; MAX=NF; next   }

# First pass, calculate means and skip to next line
NR==FNR {       for(N=2; N<=NF; N++) stdev_pass1(N, $N); next }

# Second pass, means are now valid, calculate deviation and correlation
{
        for(N=2; N<=NF; N++) stdev_pass2(N, $N);
        for(N=2; N<=NF; N++) for(M=N; M<=NF; M++)
                CORR[N,M]+=(stdev_mean(N) - ($N+0)) * (stdev_mean(M) - ($M+0));
}

END { # Print final data
        for(N=2; N<=MAX; N++)   for(M=N; M<=MAX; M++)
        print COL[N], COL[M], CORR[N,M] / (stdev_count(N)*stdev(N)*stdev(M));
}

# Not a typo, awk is fed inputfile twice.
# This avoids needing to store the entire massive file in memory.
$ awk -f stdev.awk -f pearson.awk inputfile inputfile

C1 C1 1
C1 C2 -0.4
C1 C3 -0.069843
C1 C4 -0.597614
C1 C5 -0.175412
C1 C6 -0.362738
C2 C2 1
C2 C3 0.488901
C2 C4 0.239046
C2 C5 0.30697
C2 C6 0.362738
C3 C3 1
C3 C4 0.667827
C3 C5 0.581936
C3 C6 0.861381
C4 C4 1
C4 C5 0.576557
C4 C6 0.932143
C5 C5 1
C5 C6 0.731727
C6 C6 1

$

Be sure to use GNU awk, other versions may have a 2048-byte line size limit.
These 2 Users Gave Thanks to Corona688 For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Calculating cumulative frequency using awk

Hi, I wanted to calculate cumulative frequency distribution of my data that involves several arithmetic calls. I did things in excel but its taking me forever. this is what I want to do: var1.txt contains n observations which I have to compute for frequency which is given by 1/n and subsequently... (7 Replies)
Discussion started by: ida1215
7 Replies

2. Shell Programming and Scripting

Calculating Running Variance Using Awk

Hi all, I am attempting to calculate a running variance for a file containing a column of numbers. I am using the formula variance=sum((x-mean(x))^2)/(n-1), where x is the value on the current row, and mean(x) is the average of all of the values up until that row. n represents the total number... (1 Reply)
Discussion started by: Jahn
1 Replies

3. Shell Programming and Scripting

Calculating average with awk

I need to find the average from a file like: data => BW:123 M:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1... (4 Replies)
Discussion started by: Slagle
4 Replies

4. Shell Programming and Scripting

Calculating the epoch time from standard time using awk and calculating the duration

Hi All, I have the following time stamp data in 2 columns Date TimeStamp(also with milliseconds) 05/23/2012 08:30:11.250 05/23/2012 08:30:15.500 05/23/2012 08:31.15.500 . . etc From this data I need the following output. 0.00( row1-row1 in seconds) 04.25( row2-row1 in... (5 Replies)
Discussion started by: ks_reddy
5 Replies

5. Shell Programming and Scripting

AWK way of calculating growth

Hi All, IS there any 'awk' way to manipulate following data? Fruit Date Count Apple 20/08/2011 5 Apple 27/08/2011 7 Apple 05/09/2011 11 Apple 12/09/2011 3 Apple 19/09/2011 25 . . . . Orange 20/08/2011 9 Orange 27/08/2011 20 Orange 27/08/2011 7 Orange 05/09/2011 15 Orange... (3 Replies)
Discussion started by: aniketdixit
3 Replies

6. Shell Programming and Scripting

Calculating an integer with awk

I would like to extract a number from $0 and calculate if it can be devided by 25. Though the number can also be less then 25 or bigger than 100. How do i extract the number and how can the integer be calculated? String: "all_results">39</span>I am looking for the number between "all_results"> ... (5 Replies)
Discussion started by: sdf
5 Replies

7. Shell Programming and Scripting

[Solved] awk calculating between lines

Hey guys, maybe you can help me with this... I want to read input.dat line by line, while doing a simple calculation between the second column value of the current line and the second column value of the next line (like a difference). input is something like this: 0 3.945757 1 ... (1 Reply)
Discussion started by: origamisven
1 Replies

8. Shell Programming and Scripting

Awk program for calculating dates.

Hi All, I have a txt file which has hundreds of lines and 41 fields. I have a requirement to pick up field 14 from the text file which is a date fiels in the following format. Field 14 : Data Type : NUMERIC DATE (YYYYMMDD) Field Length : 8 Example of Data :20090415 Field 42 : Data Type... (2 Replies)
Discussion started by: nua7
2 Replies

9. Shell Programming and Scripting

calculating endless columns

I have about 5000 columns of data that i need to convert all of it into pecentages. for shorter colums i have been using this code: {print $1/($1+$2)*100,$2/($1+$2),$3/($3+$4)*100 .....} but this is a teadious process... is there anyway to do it without having to write all of them out? sample... (20 Replies)
Discussion started by: chronicx
20 Replies

10. Shell Programming and Scripting

Calculating totals in AWK

Hello, With the following small script I list the size of documents belonging to a certain user by each time selecting the bytes-field of that file ($7). Now it fills the array with every file it finds so in the end the output of some users contains up to 200.000 numbers. So how can I calculate... (7 Replies)
Discussion started by: Hille
7 Replies
Login or Register to Ask a Question