Calculating correlations across columns in awk

11-21-2018

Hello,

I came across a previous thread "awk-calculating-simple-correlation-rows" which calculated correlations across rows in awk.

Code:

    awk '{
      a = 0; for (i = 2; i <= NF; ++i) a += $i; a /= NF-1
      b = 0; for (i = 2; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)
      if (b <= 0) next
      for (i = 2; i <= NF; ++i) x[NR, i] = ($i - a) / b
      n[NR] = $1
      for (i = 2; i <= NR; ++i) {
        if (!(i in n)) continue
        a = 0
        for (k = 2; k <= NF; ++k)
          a += x[NR, k] * x[i, k]
        print n[NR], n[i], a
      }
    }'

I need something similar that calculates correlations across columns and produces similar output.

Input file:

Code:

    
    Name C1 C2 C3 C4 C5 C6
    
    R1 1 2 3 4 5 6
    
    R2 2 1 1 0 1 0
    
    R3 1 3 1 1 2 1
    
    R4 1 1 0 2 0 1
    
    R5 1 2 2 2 0 2
    
    R6 1 1 0 1 2 0
    
Desired Output

Code:
    
    C1 C1 1.00
    
    C1 C2 -0.4
    
    C1 C3 -0.069
    
    C1 C4 -0.597
    
    C1 C5 -0.175
    
    C1 C6 -0.362
    
    C2 C2 1.00
    
    C2 C3 0.4889
    
    etc.

Any help would be much appreciated.

Ross

11-21-2018

Please post the code you have written / attempted so far and any error messages you get when you run the file.

Thanks.

Neo

11-21-2018

At the moment I have got around this by transposing the file first and then running the existing code.

------ Post updated at 06:10 PM ------

Apologies, I sent the reply before I had finished. The actual file is quite large (1.5 million rows x 50240 columns). A solution that does not need the transpose step would be very useful.
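
For reference, the transpose workaround described above can be sketched in awk roughly as follows. This is only an illustration on the sample data, not the script that was actually used; the `transpose` wrapper name is hypothetical. It stores every field in memory, which is why it scales poorly to a file of this size.

```shell
# Hypothetical helper: swap rows and columns of whitespace-separated input.
# Every field is held in memory, so this only suits files that fit in RAM.
transpose() {
  awk '
    {
      # Remember each field by (row, column) and track the widest row
      for (i = 1; i <= NF; i++) cell[NR, i] = $i
      if (NF > ncol) ncol = NF
    }
    END {
      # Emit one output line per input column
      for (i = 1; i <= ncol; i++) {
        line = cell[1, i]
        for (r = 2; r <= NR; r++) line = line " " cell[r, i]
        print line
      }
    }
  ' "$@"
}

# Sample data from the first post
transpose <<'EOF'
Name C1 C2 C3 C4 C5 C6
R1 1 2 3 4 5 6
R2 2 1 1 0 1 0
R3 1 3 1 1 2 1
R4 1 1 0 2 0 1
R5 1 2 2 2 0 2
R6 1 1 0 1 2 0
EOF
```

The transposed output can then be fed to the row-wise script quoted in the first post.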

Ross

11-21-2018

Code which doesn't do what you want isn't helpful in explaining what you do want -- especially without sample data.

Please show sample data input and output and explain how they're related.

Corona688

11-21-2018

Apologies if unclear but the sample data was described in the initial post. I have added it again below.

Input file:

Code:

 Name C1 C2 C3 C4 C5 C6
    
    R1 1 2 3 4 5 6
    
    R2 2 1 1 0 1 0
    
    R3 1 3 1 1 2 1
    
    R4 1 1 0 2 0 1
    
    R5 1 2 2 2 0 2
    
    R6 1 1 0 1 2 0

Desired Output

Code:

   C1 C1 1.00
    
    C1 C2 -0.4
    
    C1 C3 -0.069
    
    C1 C4 -0.597
    
    C1 C5 -0.175
    
    C1 C6 -0.362
    
    C2 C2 1.00
    
    C2 C3 0.4889

The first row of the desired output (C1 C1 1) is the correlation between column 2 and itself, which is 1 because it is the same data. The second row (C1 C2 -0.4) is the correlation between column 2 and column 3, and so on.
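
For reference, and assuming the usual Pearson coefficient is what is meant (the thread has not confirmed this at this point), the value for two columns x and y with n rows can be written as below, followed by a hand check of the C1 C2 entry from the sample data.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Worked check for C1 = (1,2,1,1,1,1) and C2 = (2,1,3,1,2,1):
\bar{x} = \tfrac{7}{6}, \quad \bar{y} = \tfrac{5}{3}, \quad
\sum (x_i - \bar{x})(y_i - \bar{y}) = -\tfrac{2}{3}, \quad
\sum (x_i - \bar{x})^2 = \tfrac{5}{6}, \quad
\sum (y_i - \bar{y})^2 = \tfrac{10}{3}

r_{xy} = \frac{-2/3}{\sqrt{(5/6)(10/3)}} = \frac{-2/3}{5/3} = -0.4
```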

Ross

11-22-2018

Apologies, I thought the data was old data.

By simple correlation, do you mean Pearson's? And does your data file actually have the double newlines and odd spacing shown?

Corona688

11-22-2018

I've broken my code up into stdev and pearson parts for readability.

Code:

$ cat stdev.awk

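# Mean of the values accumulated for this column in pass 1
function stdev_mean(TITLE) { return(DATA[TITLE,"T"]/DATA[TITLE,"C"]); }
# Degrees of freedom (n-1) used for the sample standard deviation
function stdev_count(TITLE) { return(DATA[TITLE,"C"]-1); }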

# Run first for data 1-n to get means
function stdev_pass1(TITLE, VAL) {
        DATA[TITLE,"T"] += VAL+0;
        DATA[TITLE,"C"] ++;
}

# Run second for data 1-n to get standard deviations
function stdev_pass2(TITLE,VAL,X) {
        X = stdev_mean(TITLE) - (VAL+0);
        DATA[TITLE,"D"] += X*X;
}

# Final result after both passes
function stdev(TITLE) { return(sqrt(DATA[TITLE,"D"] / stdev_count(TITLE)));}

$ cat pearson.awk

# Record column titles and skip the header line (runs on both passes)
FNR==1 {        for(N=2; N<=NF; N++) COL[N]=$N ; MAX=NF; next   }

# First pass, calculate means and skip to next line
NR==FNR {       for(N=2; N<=NF; N++) stdev_pass1(N, $N); next }

# Second pass, means are now valid, calculate deviation and correlation
{
        for(N=2; N<=NF; N++) stdev_pass2(N, $N);
        for(N=2; N<=NF; N++) for(M=N; M<=NF; M++)
                CORR[N,M]+=(stdev_mean(N) - ($N+0)) * (stdev_mean(M) - ($M+0));
}

END { # Print final data
        for(N=2; N<=MAX; N++)   for(M=N; M<=MAX; M++)
        print COL[N], COL[M], CORR[N,M] / (stdev_count(N)*stdev(N)*stdev(M));
}

# Not a typo, awk is fed inputfile twice.
# This avoids needing to store the entire massive file in memory.
$ awk -f stdev.awk -f pearson.awk inputfile inputfile

C1 C1 1
C1 C2 -0.4
C1 C3 -0.069843
C1 C4 -0.597614
C1 C5 -0.175412
C1 C6 -0.362738
C2 C2 1
C2 C3 0.488901
C2 C4 0.239046
C2 C5 0.30697
C2 C6 0.362738
C3 C3 1
C3 C4 0.667827
C3 C5 0.581936
C3 C6 0.861381
C4 C4 1
C4 C5 0.576557
C4 C6 0.932143
C5 C5 1
C5 C6 0.731727
C6 C6 1

$

Be sure to use GNU awk; other versions may have a 2048-byte line size limit.
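
If reading the file twice is too slow, a single-pass variant is possible using running sums. The sketch below is an alternative to the two-pass scripts above, not something posted in the original thread; the wrapper name `pearson_onepass` and the "NA" marker for constant columns are assumptions. It uses the textbook sum-of-products form of Pearson's r, which can lose precision when values are large relative to their spread, so the two-pass version remains the safer choice numerically. Like the two-pass version, it keeps one accumulator per column pair.

```shell
# Hypothetical single-pass variant: accumulate sums, then apply
#   r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
pearson_onepass() {
  awk '
    # Header line: remember the column titles
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
    # Ignore blank lines
    NF == 0 { next }
    {
      n++
      for (i = 2; i <= ncol; i++) {
        sx[i] += $i
        # sxy[i, i] doubles as the sum of squares for column i
        for (j = i; j <= ncol; j++) sxy[i, j] += $i * $j
      }
    }
    END {
      for (i = 2; i <= ncol; i++)
        for (j = i; j <= ncol; j++) {
          num = n * sxy[i, j] - sx[i] * sx[j]
          den = sqrt((n * sxy[i, i] - sx[i] ^ 2) * (n * sxy[j, j] - sx[j] ^ 2))
          # A constant column has zero variance, so r is undefined
          if (den > 0) print name[i], name[j], num / den
          else         print name[i], name[j], "NA"
        }
    }
  ' "$@"
}

# Sample data from the thread
pearson_onepass <<'EOF'
Name C1 C2 C3 C4 C5 C6
R1 1 2 3 4 5 6
R2 2 1 1 0 1 0
R3 1 3 1 1 2 1
R4 1 1 0 2 0 1
R5 1 2 2 2 0 2
R6 1 1 0 1 2 0
EOF
```

On the sample data the second output line is `C1 C2 -0.4`, which agrees with the two-pass output above.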

Corona688
