Pearson correlation between two files


 
# 1  
Old 05-27-2012

Hi, I want a quick way to compute the Pearson correlation between two files. Both files have the same format; only the third column differs.

Example of file 1:

Code:
chr1	0	62
chr1	1	260
chr1	2	474
chr1	3	562
chr1	4	633
chr1	5	870
chr1	6	931
chr1	7	978
chr1	8	1058
chr1	9	1151

Example of file 2:


Code:
chr1	0	76
chr1	1	455
chr1	2	806
chr1	3	914
chr1	4	986
chr1	5	1391
chr1	6	1484
chr1	7	1563
chr1	8	1705
chr1	9	1859

So I want the correlation between column 3 of file 1 and column 3 of file 2.

Thanks
# 2  
Old 05-28-2012
Code:
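#!/usr/bin/perl
use strict;
use warnings;

# Read column 3 (index 2) of a whitespace-delimited file into a list.
sub read_col3 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my @col;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;            # skip blank lines
        push @col, (split ' ', $line)[2];
    }
    close $fh;
    return @col;
}

# Return the mean and the population standard deviation of a list.
sub avg_sd {
    my @data = @_;
    my ($sum, $sum_of_sq) = (0, 0);
    $sum += $_ for @data;
    my $avg = $sum / @data;
    $sum_of_sq += ($_ - $avg) ** 2 for @data;
    return ($avg, sqrt($sum_of_sq / @data));
}

my @f1_data = read_col3('file1');
my @f2_data = read_col3('file2');
die "file1 and file2 must have the same non-zero number of rows\n"
    unless @f1_data && @f1_data == @f2_data;

my ($x_bar, $x_sd) = avg_sd(@f1_data);
my ($y_bar, $y_sd) = avg_sd(@f2_data);
die "Column 3 is constant in one of the files; the correlation is undefined\n"
    if $x_sd == 0 || $y_sd == 0;

# Sum of the products of the paired deviations from the means.
my $numerator = 0;
for my $i (0 .. $#f1_data) {
    $numerator += ($f1_data[$i] - $x_bar) * ($f2_data[$i] - $y_bar);
}

# r = sum of products / (n * sd_x * sd_y)
my $r = $numerator / (@f1_data * $x_sd * $y_sd);
print "$r\n";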

For the two sample inputs, saved as file1 and file2, the script prints a correlation coefficient of 0.999125083532687.
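
For reference, the script appears to follow the usual definition of Pearson's r, with population standard deviations (avg_sd divides by the number of values). In the notation below, x_i and y_i stand for the column 3 values of file1 and file2 and n for the number of rows; these symbols are my own labels, not names taken from the script:

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \qquad
r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n\,\sigma_x\,\sigma_y}
\]
```

Here \bar{y} and \sigma_y are defined the same way from the y_i. Dividing by n - 1 instead of n in both the covariance and the standard deviations gives the same r, because the factor cancels.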

By the way, if there are only a few data points, I'd suggest a scientific calculator. I used a Casio FX 991 MS back in college, and I still have it. Masterpiece.
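
For larger inputs where Perl is not a requirement, the same coefficient can be sketched in awk with a single pass over each file. This is not the method posted above: it uses the algebraically equivalent sums-of-products form, and the function name pearson and the six-decimal output format are my own choices. It assumes the two files are row-aligned and whitespace-delimited with the value in column 3.

```shell
# pearson FILE1 FILE2
# Print Pearson's r for column 3 of two row-aligned files.
# (Hypothetical helper name; output format chosen for illustration.)
pearson() {
    awk '
        # First file: store column 3 in reading order.
        NR == FNR { if (NF >= 3) x[++n] = $3; next }

        # Second file: pair each value with the stored one and
        # accumulate the five running sums the formula needs.
        NF >= 3 {
            m++
            sx  += x[m];        sy  += $3
            sxx += x[m] * x[m]; syy += $3 * $3
            sxy += x[m] * $3
        }

        END {
            if (n != m || n < 2) {
                print "need two files with the same number of rows (2 or more)" > "/dev/stderr"
                exit 1
            }
            den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
            if (den == 0) {
                print "column 3 is constant in one file; r is undefined" > "/dev/stderr"
                exit 1
            }
            printf "%.6f\n", (n * sxy - sx * sy) / den
        }
    ' "$1" "$2"
}
```

Called as pearson file1 file2 on the sample data from post #1, this should print 0.999125, which agrees with the Perl result to six decimals. The single-pass form can lose precision when the values are large and their spread is small; the two-pass approach in the Perl script is the safer choice in that case.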