Sponsored Content
Top Forums UNIX for Beginners Questions & Answers Calculating correlations across columns in awk Post 303026256 by Corona688 on Thursday 22nd of November 2018 11:52:27 AM
Old 11-22-2018
I've broken my code up into stdev and pearson parts for readability.

Code:
$ cat stdev.awk

function stdev_mean(TITLE) { return(DATA[TITLE,"T"]/DATA[TITLE,"C"]); }
function stdev_count(TITLE) { return(DATA[TITLE,"C"]-1); }

# Run first for data 1-n to get means
function stdev_pass1(TITLE, VAL) {
        DATA[TITLE,"T"] += VAL+0;
        DATA[TITLE,"C"] ++;
}

# Run second for data 1-n to get standard deviations
function stdev_pass2(TITLE,VAL,X) {
        X = stdev_mean(TITLE) - VAL+0;
        DATA[TITLE,"D"] += X*X;
}

# Final result after both passes
function stdev(TITLE) { return(sqrt(DATA[TITLE,"D"] / stdev_count(TITLE)));}

$ cat pearson.awk

# Integrate titles and skip first line
FNR==1 {        for(N=2; N<=NF; N++) COL[N]=$N ; MAX=NF; next   }

# First pass, calculate means and skip to next line
NR==FNR {       for(N=2; N<=NF; N++) stdev_pass1(N, $N); next }

# Second pass, means are now valid, calculate deviation and correlation
{
        for(N=2; N<=NF; N++) stdev_pass2(N, $N);
        for(N=2; N<=NF; N++) for(M=N; M<=NF; M++)
                CORR[N,M]+=(stdev_mean(N) - ($N+0)) * (stdev_mean(M) - ($M+0));
}

END { # Print final data
        for(N=2; N<=MAX; N++)   for(M=N; M<=MAX; M++)
        print COL[N], COL[M], CORR[N,M] / (stdev_count(N)*stdev(N)*stdev(M));
}

# Not a typo, awk is fed inputfile twice.
# This avoids needing to store the entire massive file in memory.
$ awk -f stdev.awk -f pearson.awk inputfile inputfile

C1 C1 1
C1 C2 -0.4
C1 C3 -0.069843
C1 C4 -0.597614
C1 C5 -0.175412
C1 C6 -0.362738
C2 C2 1
C2 C3 0.488901
C2 C4 0.239046
C2 C5 0.30697
C2 C6 0.362738
C3 C3 1
C3 C4 0.667827
C3 C5 0.581936
C3 C6 0.861381
C4 C4 1
C4 C5 0.576557
C4 C6 0.932143
C5 C5 1
C5 C6 0.731727
C6 C6 1

$

Be sure to use GNU awk, other versions may have a 2048-byte line size limit.
These 2 Users Gave Thanks to Corona688 For This Post:
 

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Calculating totals in AWK

Hello, With the following small script I list the size of documents belonging to a certain user by each time selecting the bytes-field of that file ($7). Now it fills the array with every file it finds so in the end the output of some users contains up to 200.000 numbers. So how can I calculate... (7 Replies)
Discussion started by: Hille
7 Replies

2. Shell Programming and Scripting

calculating endless columns

I have about 5000 columns of data that i need to convert all of it into pecentages. for shorter colums i have been using this code: {print $1/($1+$2)*100,$2/($1+$2),$3/($3+$4)*100 .....} but this is a teadious process... is there anyway to do it without having to write all of them out? sample... (20 Replies)
Discussion started by: chronicx
20 Replies

3. Shell Programming and Scripting

Awk program for calculating dates.

Hi All, I have a txt file which has hundreds of lines and 41 fields. I have a requirement to pick up field 14 from the text file which is a date fiels in the following format. Field 14 : Data Type : NUMERIC DATE (YYYYMMDD) Field Length : 8 Example of Data :20090415 Field 42 : Data Type... (2 Replies)
Discussion started by: nua7
2 Replies

4. Shell Programming and Scripting

[Solved] awk calculating between lines

Hey guys, maybe you can help me with this... I want to read input.dat line by line, while doing a simple calculation between the second column value of the current line and the second column value of the next line (like a difference). input is something like this: 0 3.945757 1 ... (1 Reply)
Discussion started by: origamisven
1 Replies

5. Shell Programming and Scripting

Calculating an integer with awk

I would like to extract a number from $0 and calculate if it can be devided by 25. Though the number can also be less then 25 or bigger than 100. How do i extract the number and how can the integer be calculated? String: "all_results">39</span>I am looking for the number between "all_results"> ... (5 Replies)
Discussion started by: sdf
5 Replies

6. Shell Programming and Scripting

AWK way of calculating growth

Hi All, IS there any 'awk' way to manipulate following data? Fruit Date Count Apple 20/08/2011 5 Apple 27/08/2011 7 Apple 05/09/2011 11 Apple 12/09/2011 3 Apple 19/09/2011 25 . . . . Orange 20/08/2011 9 Orange 27/08/2011 20 Orange 27/08/2011 7 Orange 05/09/2011 15 Orange... (3 Replies)
Discussion started by: aniketdixit
3 Replies

7. Shell Programming and Scripting

Calculating the epoch time from standard time using awk and calculating the duration

Hi All, I have the following time stamp data in 2 columns Date TimeStamp(also with milliseconds) 05/23/2012 08:30:11.250 05/23/2012 08:30:15.500 05/23/2012 08:31.15.500 . . etc From this data I need the following output. 0.00( row1-row1 in seconds) 04.25( row2-row1 in... (5 Replies)
Discussion started by: ks_reddy
5 Replies

8. Shell Programming and Scripting

Calculating average with awk

I need to find the average from a file like: data => BW:123 M:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1 0 1 0 0 1 1 1 1 0 0 1 1 0' data => BW:123 N:30 RTD:0 1... (4 Replies)
Discussion started by: Slagle
4 Replies

9. Shell Programming and Scripting

Calculating Running Variance Using Awk

Hi all, I am attempting to calculate a running variance for a file containing a column of numbers. I am using the formula variance=sum((x-mean(x))^2)/(n-1), where x is the value on the current row, and mean(x) is the average of all of the values up until that row. n represents the total number... (1 Reply)
Discussion started by: Jahn
1 Replies

10. UNIX for Dummies Questions & Answers

Calculating cumulative frequency using awk

Hi, I wanted to calculate cumulative frequency distribution of my data that involves several arithmetic calls. I did things in excel but its taking me forever. this is what I want to do: var1.txt contains n observations which I have to compute for frequency which is given by 1/n and subsequently... (7 Replies)
Discussion started by: ida1215
7 Replies
IGAWK(1)							 Utility Commands							  IGAWK(1)

NAME
igawk - gawk with include files SYNOPSIS
igawk [ all gawk options ] -f program-file [ -- ] file ... igawk [ all gawk options ] [ -- ] program-text file ... DESCRIPTION
Igawk is a simple shell script that adds the ability to have ``include files'' to gawk(1). AWK programs for igawk are the same as for gawk, except that, in addition, you may have lines like @include getopt.awk in your program to include the file getopt.awk from either the current directory or one of the other directories in the search path. OPTIONS
See gawk(1) for a full description of the AWK language and the options that gawk supports. EXAMPLES
cat << EOF > test.awk @include getopt.awk BEGIN { while (getopt(ARGC, ARGV, "am:q") != -1) ... } EOF igawk -f test.awk SEE ALSO
gawk(1) Effective AWK Programming, Edition 1.0, published by the Free Software Foundation, 1995. AUTHOR
Arnold Robbins (arnold@skeeve.com). Free Software Foundation Nov 3 1999 IGAWK(1)
All times are GMT -4. The time now is 05:10 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy