Posted 06-27-2012
Standard deviation of one column when another column has the same value

Hey guys, I am currently learning different bioinformatics applications, but I do not have much of a computer science background. I have been asked to compute the mean and standard deviation of coverage for each transcript ID in a huge file of about 30 million lines. Basically, for every group of lines that share the same value in one column/field, I want the mean and standard deviation of another column/field over that group. My input and desired output are below, but imagine there being thousands to millions of different transcript IDs. I also want each output line to keep all the other fields from the original line. The other fields do not follow any special pattern.

So far I have been using a lot of awk, so if you have an awk solution that would be great.

Also, if you could give me a formula to then calculate how many standard deviations each coverage value is away from its group mean, and put that in a separate field, that would be even better, but I think I can figure this part out on my own.

Input

Code:
Transcript ID   Other field   Other field   Coverage
1               3             6             1
2               4             8             2
1               5             10            3
2               6             12            6

Output

Code:
Transcript ID   Other field   Other field   Coverage   Mean   Standard deviation
1               3             6             1          2      1
2               4             8             2          4      2
1               5             10            3          2      1
2               6             12            6          4      2


Last edited by Scrutinizer; 06-28-2012 at 03:05 AM.
Reply from radoulov, posted 06-28-2012
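Hope I got the maths right. The script reads the file twice: the first pass accumulates, per transcript ID, the sum, the count and the sum of squares of the coverage field; the second pass prints every line again with the mean and the population standard deviation of its ID appended.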
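For anyone checking the maths: the script below appears to rely on the usual shortcut for the population variance, which needs only the count, the sum and the sum of squares per ID. The symbols n, x_i, \mu and \sigma are notation introduced here for illustration; they do not appear in the original post.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2, \qquad
\sigma = \sqrt{\sigma^2}
```

On the sample input, transcript ID 1 has coverage values 1 and 3, so the mean is 4/2 = 2 and the variance is (1 + 9)/2 - 2^2 = 1, giving a standard deviation of 1. Transcript ID 2 has values 2 and 6, so the mean is 8/2 = 4 and the variance is (4 + 36)/2 - 4^2 = 4, giving a standard deviation of 2. Both agree with the desired output in the question. Note that this is the population form (dividing by n); the sample form would divide by n - 1 and give different numbers.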


Code:
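awk 'BEGIN {
  # Append the input file to the argument list again so it is read twice
  ARGV[ARGC] = ARGV[ARGC-1]
  ARGC++
  }
# First pass, data lines: per-ID sum, count and sum of squares of coverage
NR == FNR && FNR > 1 {
  id[$1] += $4; cid[$1]++
  idq[$1] += $4 * $4
  next
  }
# Header line: skip it on the first pass, extend it on the second
FNR == 1 {
  if (NR == FNR) next
  print $0, "Mean", "Standard deviation"
  next
  }
# Second pass, data lines: append mean and population standard deviation
{
  $1 = $1   # rebuild the record so the fields are joined with OFS
  mean = id[$1] / cid[$1]
  var = idq[$1] / cid[$1] - mean * mean
  if (var < 0) var = 0   # rounding can push a zero variance slightly negative
  print $0, mean, sqrt(var)
  }' OFS='\t' infile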
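The original question also asked how many standard deviations each coverage value lies from its group mean. A minimal sketch of one way to add that as an extra field follows, using z = (coverage - mean) / standard deviation. It is not part of radoulov's answer. The names `sum`, `n`, `sumsq` and the `Z-score` column are made up for illustration, the file is passed twice on the command line instead of being re-queued in `BEGIN`, and reporting 0 when a group's standard deviation is 0 is an assumption, not something the thread specifies.

```shell
# Hypothetical sample file reproducing the input from the question
infile=$(mktemp)
cat > "$infile" <<'EOF'
Transcript ID   Other field Other field Coverage
1   3   6   1
2   4   8   2
1   5   10  3
2   6   12  6
EOF

# Pass 1 collects per-ID sum, count and sum of squares of field 4.
# Pass 2 appends mean, population standard deviation and z-score.
result=$(awk '
NR == FNR {
  if (FNR > 1) { sum[$1] += $4; n[$1]++; sumsq[$1] += $4 * $4 }
  next
}
FNR == 1 { print $0, "Mean", "Standard deviation", "Z-score"; next }
{
  $1 = $1                                # rebuild the record with OFS
  mean = sum[$1] / n[$1]
  var = sumsq[$1] / n[$1] - mean * mean
  if (var < 0) var = 0                   # guard against rounding error
  sd = sqrt(var)
  z = (sd > 0) ? ($4 - mean) / sd : 0    # assumption: 0 when all values are equal
  print $0, mean, sd, z
}' OFS='\t' "$infile" "$infile")

printf '%s\n' "$result"
rm -f "$infile"
```

On the four-line sample, the two rows for each transcript ID get z-scores of -1 and 1, since each value sits exactly one standard deviation from its group mean. Only three small arrays keyed by transcript ID are held in memory, so the approach should scale to the 30 million lines mentioned in the question at the cost of reading the file twice.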
