AWK script for standard deviation / root mean square deviation

01-12-2012


I have a file with, say, 50 columns, each containing a lot of data.

Each column holds data from a separate simulation, and each simulation is related to the data in the last (REFERENCE) column, $50.

I need to calculate the RMS deviation for each data line, i.e. column 1 relative to column 50, column 2 relative to column 50, etc. My expected outcome will be a column 51 containing the RMSD for each line.

Code:

#!/usr/bin/awk

      BEGIN { s=0;n=0 }
                   { n++; s=s+(($n)^2-($50)^2) }
      END
      { print sqrt(s/50) }

But this does not work: it fails with a syntax error.

Any help is appreciated!

Moderator's Comments:

Please use code tags next time.

chrisjorg


01-12-2012


What syntax error does it show?

Please show a sample of input data and the output you'd want for it.

Corona688


01-12-2012


Code:

2.91187 4.25656 7.3225   ..... until column 50 
3.4187   2.67656 6.3225
3.54117 6.27656 4.3225
5.61187 6.27656 2.3225
....          ...           ....

The output should just be a column of numbers, where each entry is the root-mean-square deviation (RMSD) calculated across the 50 columns of that line. For example (these are just random numbers):

Code:

RMSD
4.31185 
3.4185   
2.64115
4.71183 
....

Running the script as it stands gives this error:

Code:

awk -f rmsd merge.pmf 
awk: syntax error at source line 6 source file rmsd
 context is
          END  >>> 
 <<< 
awk: bailing out at source line 7

Moderator's Comments:

Use code tags please, check your PMs.

Last edited by zaxxon; 01-12-2012 at 01:59 PM.. Reason: code tags

chrisjorg


01-12-2012


Given that the awk script doesn't even have 7 lines, I'm not sure what to tell you about the syntax error. But if you want a standard deviation for each column, you're going to need to loop over each column yourself; awk by itself only loops over lines.
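On the syntax error itself, one likely cause (an assumption based on the script as posted, not something confirmed in this thread): awk requires the opening brace of an action to be on the same line as its pattern, and END cannot stand alone without an action. With END on one line and { print ... } on the next, the parser gives up right after END, which matches the context shown in the error. A minimal sketch of the corrected layout, reduced to a single column and using a made-up helper name, rms_col1:

```shell
# rms_col1 is a hypothetical helper name, not from the thread.
# It prints the root mean square of the first column of its input.
rms_col1() {
    awk '
        { n++; s += $1^2 }           # runs once per input line
    END { if (n) print sqrt(s / n) } # brace opens on the same line as END
    '
}

printf '1\n7\n' | rms_col1
```

If the script is kept in a file and run directly, the interpreter line also needs the -f option, i.e. #!/usr/bin/awk -f; running it as awk -f rmsd merge.pmf sidesteps that.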

You also don't usually need to bother setting s=0 before you start. An unset variable counts as 0 in arithmetic, so a blank variable + 1 gracefully becomes 1, the same way 0 + 1 becomes 1.
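A quick way to see this, as a small sketch (the helper name count_lines is made up for illustration): n is never initialised, yet the count comes out right, and n + 0 forces a numeric 0 when there was no input at all.

```shell
# count_lines is a hypothetical helper name, not from the thread.
count_lines() {
    # n starts out unset; n++ treats it as 0 on the first line.
    awk '{ n++ } END { print n + 0 }'
}

printf 'a\nb\nc\n' | count_lines
```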

Also, since you need to subtract the average from every single number to calculate a standard deviation, you need all the numbers twice and might as well just store everything to calculate in an END{} block.

Working on something.

---------- Post updated at 11:21 AM ---------- Previous update was at 11:10 AM ----------

Code:

# Store every field, indexed by (row, column); blank lines are skipped.
# MAX tracks the widest row seen, in case rows differ in length.
NF {    LINE++; if (NF > MAX) MAX=NF; for(N=1; N<=NF; N++) A[LINE,N]=$N       }
END     {
        # Calculate the sum of each column over all rows
        for(COL=1; COL<=MAX; COL++)
        for(ROW=1; ROW<=LINE; ROW++)
                AVG[COL]+=A[ROW,COL];

        # Turn the sums into averages
        for(COL=1; COL<=MAX; COL++) AVG[COL] /= LINE

        # Sum the squared deviations from the average, per column
        for(COL=1; COL<=MAX; COL++)
        for(ROW=1; ROW<=LINE; ROW++)
                DEV[COL]+=(AVG[COL]-A[ROW,COL])^2

        # Divide each sum by the number of lines, then print the square root
        # (one standard deviation per column, one per output line).
        for(COL=1; COL<=MAX; COL++)     print sqrt(DEV[COL]/LINE);
}
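To check the store-then-compute idea on something small, here is a reduced sketch for a single column. The function name col_stddev and the sample numbers are illustrative assumptions, not from the thread. It uses the same formula as the script above, dividing by the number of lines rather than by one less.

```shell
# col_stddev is a hypothetical helper name, not from the thread.
col_stddev() {
    awk 'NF { x[++n] = $1; sum += $1 }   # first pass: store values, add them up
    END {
        if (!n) exit                     # no data, nothing to print
        mean = sum / n
        for (i = 1; i <= n; i++)         # second pass over the stored values
            dev += (x[i] - mean)^2
        print sqrt(dev / n)
    }'
}

printf '2\n4\n4\n4\n5\n5\n7\n9\n' | col_stddev
```

The eight sample values have a mean of 5 and squared deviations that add up to 32, so the result is sqrt(32/8) = 2.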


Corona688


01-12-2012


Thanks!

Yes, I want a standard deviation of all data relative to the last column of data, which I suppose represents the 'average' or 'reference'.

That is a huge help with the script. Thanks a lot for now, I will try to see if it works.
It is true, I want to parse over each line in the columns,
so e.g. line 1 of column 1 relative to line 1 of column 50, line 2 of column 1 relative to line 2 of column 50, etc., then line 1 of column 2 relative to line 1 of column 50, and so on.

---------- Post updated at 01:03 PM ---------- Previous update was at 12:37 PM ----------

OK, maybe I can clarify.

The last (50th) column already constitutes the average,
so I want to subtract the 50th column from each column of data to
see how much the data deviates from the reference column.

chrisjorg


01-12-2012


Which average? If you have 49 different data columns, wouldn't you need 49 different averages?

What would help a whole lot more is a sample of your input data. And labels. And a sample of your output data.

Corona688


01-12-2012


OK, let me clarify.

I want to parse each *line* individually; I should have said this earlier.
So if there are 2000 lines, line 1 is handled separately from line 2, etc.

There are 50 columns of data. All the data has to be compared to the *last* column, which is special. So I am performing an RMSD calculation, which is not exactly the same as a standard deviation, because the 'average' is that 50th column of data.

Let us forget for a moment that there are e.g. 2000 lines of data, and imagine there is only one.

My data looks something like this:

Code:

2.91187  2.27656  3.3225  2.33938 2.55781 3.05656 2.66063 2.02781... ... 2.31319

where 2.31319 would represent the 50th column.

Ok, so I want to do the following

Code:

sqrt( ([2.31319-2.91187]^2 + [2.31319-2.27656]^2 + [2.31319-3.3225]^2 + [2.31319-2.33938]^2 + [2.31319-2.55781]^2 + ... + [2.31319-2.31319]^2) / 50 )

or in words

Code:

sqrt( ([col.50 - col.1]^2 + [col.50 - col.2]^2 + [col.50 - col.3]^2 + ... + [col.50 - col.50]^2) / 50 )

UPDATE:
And yes, you are right: because I parse each line separately, if there are 2000 lines I want to end up with 2000 RMSD values lined up in a column.
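The formula above can be sketched per line as follows. This is only one reading of the description, under these assumptions: the reference is the last field ($NF, so column 50 in a 50-column file), the reference's own zero term is included, and the sum is divided by the number of columns, exactly as the formula is written. The name rmsd_rows is hypothetical.

```shell
# rmsd_rows is a hypothetical helper name, not from the thread.
# Assumption: the last field of every line is the reference column.
rmsd_rows() {
    awk 'NF {
        sum = 0                       # reset for every line
        for (i = 1; i <= NF; i++)     # every column, reference included
            sum += ($NF - $i)^2       # squared deviation from the reference
        print sqrt(sum / NF)          # one RMSD value per input line
    }' "$@"
}

printf '1 2 3 2\n4 4 4 4\n' | rmsd_rows
```

On the real data this would be called as rmsd_rows merge.pmf. To get the RMSD as an extra last column instead of a separate list, the print statement could become print $0, sqrt(sum / NF).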

---------- Post updated at 03:28 PM ---------- Previous update was at 02:08 PM ----------

Code:

set mean  = `awk '{++n;sum+=$NF} END{if(n) print sum/n}' slice.txt`

set rmsd  = `awk -v mean=$mean '{++n;sum+=($NF-mean)^2} END{if(n) print sqrt(sum/n)}' slice.txt`

Maybe something like that? But I need to be able to distinguish between columns and lines.

Moderator's Comments:

Please use code tags when posting data and code samples!

---------- Post updated at 03:29 PM ---------- Previous update was at 03:28 PM ----------

Root-mean-square deviation - Wikipedia, the free encyclopedia

Last edited by vgersh99; 01-12-2012 at 04:31 PM.. Reason: code tags, PLEASE!

chrisjorg
