Standardization


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Standardization
# 1  
Old 01-28-2013
Standardization procedure using awk programming

Hi,

I am trying to do standardization (subtracting the mean from original value and dividing by stdev) using an automated script.
I have a file with 10000 rows. I need to initially calculate mean and stdev for each column and then using those values I have to obatin the new standardized values.. I can do this pretty easily in excel.. But I am looking for an automated script.

input
Code:
DOTR1	10.29006	10.06744	10.47105	10.05041	10.18407	9.770205	10.90548	10.75112
RCC2	6.699481	7.240353	7.263434	6.654058	6.86063	7.151931	6.796337	6.78525
HHPA6	7.31182	7.547056	8.338827	7.278408	7.545548	7.409964	7.149899	7.300342
PAX8	8.336847	8.651292	8.493323	8.5056	8.445139	8.651406	8.664237	8.56571
ACA1A	4.233111	4.320666	4.232803	4.390224	4.269969	4.314899	4.264211	4.142419
UBA7	8.196608	8.164725	7.361889	8.055019	8.882745	7.6884	7.835754	8.354209
OOA	5.098222	5.212986	5.301191	5.211401	5.13133	5.153725	5.269111	5.195991
ACX1	4.875679	5.01305	4.921618	4.930978	4.899562	4.92918	4.970339	4.986362

The mean for column 1 is 6.880 and stdev is 2.066

I will now subtract the mean from my observation and divide by stdev to (10.29006-6.880)/2.066. I will do this on all subsequent observations row-wise in column 1. For column 2 again I will find its mean and corresponding stdev and follow the same procedure.

Thanks,

Last edited by Diya123; 01-28-2013 at 08:10 PM..
# 2  
Old 01-29-2013
In bash, you could read each record into a simple array with read -a, put the first column values in a simple array and put the numeric values in an associative array that mimics a two dimensional array by keying it with "row:col". Then you can walk the rows column by column and do your calculations and modifications. When done, you regenerate the file from the arrays for rows 0-n, columns 0-m. I suppose awk can do this, too.
# 3  
Old 01-29-2013
Hi,

Thanks for the reply.. But I am not familiar with bash scripting as I am with awk, but anyhow i will try the algorithm you mentioned.

Thanks
# 4  
Old 01-29-2013
You could use one dimensional arrays tricked into another dimension by concatenation: $row$col, as $col is always one digit. typeset'ing variables can prevent repeated conversion from numeric to character form and back. For problems like this, life is simpler in PERL, C/C++, JAVA.
# 5  
Old 01-30-2013
Hello
with examples and numbers you provide i wrote this code .
my shell is bourne . save the code in a file called diya123 and chmod it to 755 :
Code:
chmod 755 diya123

save your data in a file and , i assume that program and file both are in currrent working directory , invoke the program with

Code:
./diya123 filename

Code:
case $# in
        1) ;;
        *) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac

test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }

( cat  "$1" | awk '
                { for (i=2; i<=9; i++) sum[i]+=$i; }
                END { printf "mean"; for (i=2; i<=9; i++) printf "\t%.6f", sum[i]/NR ; printf "\n"; }
                ' | tee /dev/tty
 cat "$1" ) |

                awk '
                        NR == 1 { for(i=2;i<=9;i++) mean[i]=$i }
                        NR == 2 { for(i=2;i<=9;i++) obser[i]=$i }
                        NR != 1 { for(i=2;i<=9;i++) std[i] += ($i-mean[i]) ** 2  }

                        END {
                                printf "stddev";
                                for(i=2;i<=9;i++) {
                                        stddev[i]=sqrt(std[i]/(NR-2));
                                        printf "\t%.6f", stddev[i];
                                }
                                printf "\n";

                                printf "sub";
                                for(i=2;i<=9;i++)
                                        printf "\t%.6f", (obser[i] - mean[i])/stddev[i];
                                printf "\n";
                        }'

# 6  
Old 01-30-2013
Hi mstafreshi,

Thank you so much.. I tried using your code on the example data I provided, but in the output it prints only the first row..

Here is the output
Code:
[dvaka@cabiopeds new]$ vi test_file
[diya@cab new]$ vi diya123
[diya@cab new]$ chmod 755 diya123
[diya@cab new]$ ./diya123 test_file
mean    6.880228        7.027196        7.048017        6.884512        7.0273746.883714        6.981921        7.010175
stddev  2.066795        2.005844        2.108336        1.966369        2.1192771.922232        2.183782        2.202813
sub     1.649817        1.515693        1.623571        1.610022        1.4895161.501635        1.796681        1.698258
[diya@cab new]$

How can I get the result for all 10,000 rows.

Thanks,
# 7  
Old 01-31-2013
Hello
I don't know your bussiness and i has not work with excell !
I just test this script with one line of example you provided . can this scripts works as expected?

Code:
case $# in
        1) ;;
        *) echo 'Usage: diya123 file' 1>&2; exit 1;;
esac

test -r "$1" || { echo "diya123: can't open '$1'" 1>&2; exit 2; }

( cat  "$1" | awk '
                { for (i=2; i<=9; i++) sum[i]+=$i; }
                END { printf "mean4col"; for (i=2; i<=9; i++) printf "\t%.6f", sum[i]/NR; printf "\n"}
                '
 cat "$1" ) | ( # tee /dev/tty
                awk '
                        NR == 1 { print; for(i=2;i<=9;i++) mean[i]=$i }
                        NR != 1 { for(i=2;i<=9;i++) std[i] += ($i-mean[i]) ** 2  }

                        END {
                                printf "stddev4col";
                                for(i=2;i<=9;i++) {
                                        stddev[i]=sqrt(std[i]/(NR-2));
                                        printf "\t%.6f", stddev[i];
                                }
                                printf "\n";
                        }'; cat "$1" )| awk '
                                NR == 1 { print; for(i=2;i<=9;i++) mean[i]=$i }
                                NR == 2 { print; for(i=2;i<=9;i++) std[i]=$i }

                                NR != 1 && NR != 2 {
                                        printf "sub[%s]", $1;
                                        for(i=2;i<=9;i++)
                                                printf "\t%.6f", ($i - mean[i])/std[i];
                                        printf "\n";
                                }'

Login or Register to Ask a Question

Previous Thread | Next Thread

1 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Standardization of input source data files using shell script

Hi there, I'm a newbie in unix and am fishing for options related to how raw input data files are handled. The scenario, as I'm sure y'all must be very familiar with, is this : we receive upwards of 50 data files in ASCII format from various source systems - now each file has its own structure... (3 Replies)
Discussion started by: Prat Khos
3 Replies
Login or Register to Ask a Question