AWK - calculating simple correlation of rows


 
# 1  
Old 05-29-2011

Is there any way to calculate a simple correlation of a few selected rows against all the rows in the input?
In the example below I selected Row01, Row02 and Row03 and correlated them with all the rows.
I tried to do this in R, but the data matrix is too big for R to handle and my system eventually hangs.

Note: The output values in the example below are not real.

Code:
ID A B C D E F G H I [162 columns]
Row0$-1 0.08 0.47 0.94 0.33 0.08 0.93 0.72 0.51 0.55
Row02$1 0.37 0.87 0.72 0.96 0.20 0.55 0.35 0.73 0.44
Row03$ 0.19 0.71 0.52 0.73 0.03 0.18 0.13 0.13 0.30
Row04$- 0.08 0.77 0.89 0.12 0.39 0.18 0.74 0.61 0.57
Row05$- 0.09 0.60 0.73 0.65 0.43 0.21 0.27 0.52 0.60
Row06-$ 0.60 0.54 0.70 0.56 0.49 0.94 0.23 0.80 0.63
Row07$- 0.02 0.33 0.05 0.90 0.48 0.47 0.51 0.36 0.26
Row08$_ 0.34 0.96 0.37 0.06 0.20 0.14 0.84 0.28 0.47
........
(300,000 rows!)

output
Code:
Name   Correlation Coefficient(r-square)
Row0$-1:Row0$-1   1
Row01$-1:Row02$   0.25
Row01$-1:Row03$   0.01
..................
..................
Row02$:Row01$-1   0.09
Row02$:Row02$   1
..................
..................
Row03$:Row01$-1   0.19
Row03$:Row02$   0.11
..................
..................

Thanx in advance
# 2  
Old 05-29-2011
I have read your posting several times and I still cannot understand it.

It will be very helpful if you:

1) Display a sample of your input data.

2) Explain in plain English what you want to do.
Do not assume any prior knowledge of your issue.
Provide as much detail as you can.

3) Display the desired output based on your sample input.

Following these steps will help members find a solution for your issue.
# 3  
Old 05-29-2011
OK

Here is a real example (computed in R).
The input has 6 rows and 4 columns.
The output shows the correlation of specific rows (R1 to R4) with all rows (R1 to R6).
Please let me know if you need a clearer explanation. My apologies if the previous post was too confusing.

Thanx
Q


input
Code:
Name    C1      C2      C3      C4
R1      1       2       3       4
R2      2       1       3       0
R3      1       1       1       1
R4      1       1       0       0
R5      1       2       2       2
R6      1       1       0       0

output
Code:
Name    Correlation
R1:R1   1
R1:R2   -0.40
R1:R3   NA
R1:R4   -0.89
R1:R5   0.77
R1:R6   -0.89
R2:R1   -0.40
R2:R2   1
R2:R3   NA
R2:R4   0
R2:R5   -0.26
R2:R6   0
R3:R1   NA
R3:R2   NA
R3:R3   NA
R3:R4   NA
R3:R5   NA
R3:R6   NA
R4:R1   -0.89
R4:R2   0
R4:R3   NA
R4:R4   1
R4:R5   -0.58
R4:R6   1

# 4  
Old 05-30-2011
Code:
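awk '{
  # store the fields of this row, its mean, and its name
  for (i = 2; i <= NF; ++i)
    s[NR] += (x[NR, i] = $i)
  s[NR] /= (NF-1)
  n[NR] = $1

  # correlate this row with every data row read so far (line 1 is the header)
  for (i = 2; i <= NR; ++i) {
    a = b = c = 0
    for (k = 2; k <= NF; ++k) {
      a += (x[NR,k] - s[NR]) * (x[i,k] - s[i])   # covariance term
      b += (x[NR,k] - s[NR]) ^ 2                 # variance term, current row
      c += (x[i,k] - s[i]) ^ 2                   # variance term, earlier row
    }
    # a constant row has zero variance, so its correlation is undefined
    print n[NR], n[i], (b > 0 && c > 0)? a/sqrt(b)/sqrt(c) : "NA"
  }
}' <<EOF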
Name    C1      C2      C3      C4
R1      1       2       3       4
R2      2       1       3       0
R3      1       1       1       1
R4      1       1       0       0
R5      1       2       2       2
R6      1       1       0       0
EOF

R1 R1 1
R2 R1 -0.4
R2 R2 1
R3 R1 NA
R3 R2 NA
R3 R3 NA
R4 R1 -0.894427
R4 R2 0
R4 R3 NA
R4 R4 1
R5 R1 0.774597
R5 R2 -0.258199
R5 R3 NA
R5 R4 -0.57735
R5 R5 1
R6 R1 -0.894427
R6 R2 0
R6 R3 NA
R6 R4 1
R6 R5 -0.57735
R6 R6 1
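
For reference, the accumulators a, b and c in the script above are the three sums of the usual Pearson correlation formula. The block below writes that out and checks the "R2 R1" line by hand; the symbols x, y and r are notation introduced here, not taken from the script.

```latex
% x = current row (R2), y = earlier row (R1), k runs over the data columns
r_{xy} = \frac{\sum_k (x_k - \bar{x})(y_k - \bar{y})}
              {\sqrt{\sum_k (x_k - \bar{x})^2}\,\sqrt{\sum_k (y_k - \bar{y})^2}}
       = \frac{a}{\sqrt{b}\,\sqrt{c}}

% R2 = (2, 1, 3, 0), \bar{x} = 1.5;  R1 = (1, 2, 3, 4), \bar{y} = 2.5
a = (0.5)(-1.5) + (-0.5)(-0.5) + (1.5)(0.5) + (-1.5)(1.5) = -2
b = 0.25 + 0.25 + 2.25 + 2.25 = 5, \qquad c = 2.25 + 0.25 + 0.25 + 2.25 = 5
r_{R2,R1} = \frac{-2}{\sqrt{5}\,\sqrt{5}} = -0.4
```

A row such as R3 = (1, 1, 1, 1) has b = 0, so the quotient is undefined, which is why the script prints NA for it.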

---------- Post updated at 07:28 PM ---------- Previous update was at 05:42 AM ----------

I just realized that my previous code repeats far too many calculations for over 300,000 rows. This will be much more efficient (NA correlations are omitted; adjust if you want them printed as well):
Code:
awk '{
  # a: mean of the row; b: length (Euclidean norm) of the centered row
  a = 0; for (i = 2; i <= NF; ++i) a += $i; a /= NF-1
  b = 0; for (i = 2; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)

  # constant rows (and the non-numeric header line) have b == 0: skip them
  if (b <= 0) next
  for (i = 2; i <= NF; ++i) x[NR, i] = ($i - a) / b
  n[NR] = $1

  # with the data normalized, the following loop does very little computation
  for (i = 2; i <= NR; ++i) {
    if (!(i in n)) continue
    a = 0
    for (k = 2; k <= NF; ++k)
      a += x[NR, k] * x[i, k]
    print n[NR], n[i], a
  }
}'
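
The original question asked for only a few selected rows against all rows. The same normalization idea (center each row, scale it to unit length, then the correlation is just a dot product) can be applied in two passes over the file, so that only the selected rows are kept in memory. The following is a minimal sketch under assumptions, not code from the thread: the function name corr_selected, the comma-separated ID list, and the "sel:row value" output format are all made up for illustration, and the file is read twice, so it must be a regular file rather than a pipe.

```shell
# corr_selected IDS FILE  (hypothetical helper, e.g. corr_selected "R1,R2" data.txt)
# Prints the correlation of each selected row with every row of FILE.
corr_selected() {
  awk -v ids="$1" '
    # Center the current record on its mean and scale it to unit length.
    # Fills v[2..NF]; returns 0 for a constant row (correlation undefined).
    function normalize(v,    i, m, s) {
      m = 0; for (i = 2; i <= NF; ++i) m += $i; m /= NF - 1
      s = 0; for (i = 2; i <= NF; ++i) s += ($i - m) ^ 2; s = sqrt(s)
      if (s <= 0) return 0
      for (i = 2; i <= NF; ++i) v[i] = ($i - m) / s
      return 1
    }
    BEGIN {
      split("", v)                               # make v an (empty) array
      k = split(ids, t, ",")
      for (i = 1; i <= k; ++i) want[t[i]] = 1
    }
    FNR == 1 { next }                            # skip the header in both passes
    NR == FNR {                                  # pass 1: keep selected rows only
      if ($1 in want) {
        ok = normalize(v)
        order[++nsel] = $1; valid[$1] = ok
        if (ok) for (i = 2; i <= NF; ++i) sel[$1, i] = v[i]
      }
      next
    }
    {                                            # pass 2: stream every row once
      ok = normalize(v)
      for (j = 1; j <= nsel; ++j) {
        id = order[j]
        if (!ok || !valid[id]) { print id ":" $1, "NA"; continue }
        r = 0
        for (i = 2; i <= NF; ++i) r += sel[id, i] * v[i]
        printf "%s:%s %.2f\n", id, $1, r
      }
    }
  ' "$2" "$2"
}
```

The lines come out grouped by the streamed row (R1:R1, R2:R1, R1:R2, R2:R2, ...) rather than by the selected row; pipe the result through sort if the grouping in post #3 is wanted. With 300,000 rows and a handful of selected IDs, memory use stays proportional to the number of selected rows times the number of columns.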

These 2 Users Gave Thanks to binlib For This Post: