Normalization using awk

08-06-2011

Registered User

1,466, 512

Join Date: Jul 2010

Last Activity: 7 April 2014, 3:02 PM EDT

Location: earth>US>UTC-5

Posts: 1,466

Thanks Given: 110

Thanked 512 Times in 491 Posts

Small changes below. I've assumed that regardless of the number of columns, the data to normalise is always in the next to last (NF-1) column. This handles the odd case of the "all" file without the need for a specific test.
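
As a quick illustration of that assumption, here is a minimal sketch showing that $(NF-1) selects the next-to-last field whatever the column count. The second input line is made up for the demonstration, and the variable name out is only there so the result can be inspected.

```shell
# Two records with different column counts; the second one is invented.
out=$(printf 'a1 10 100 nameX 0 2 +\nb7 3 -\n' |
    awk '{ print $(NF-1) }')    # next-to-last field of each record
echo "$out"
```

The first record has seven fields and the second has three, yet the same expression picks the value to normalise in both.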

I'm a bit confused by your new computation for "average." Your words say the sum of all values divided by the number of input files, but your example shows the sum divided by 4. The code below computes the output based on your description, not the example, and thus the output for the first record in the first sample file you gave is

Code:

a1    10      100     nameX   0       2       +       5.500

because 44 is divided by 2 input files, not 4. If that is wrong, where are you getting 4 from? It might be that in your testing you have two other input files that have all zeros in the NF-1 column, in which case both your example and the code are correct.
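
The arithmetic behind that 5.500 can be checked with a small sketch. Only the first record, the total of 44 and the count of 2 files come from this thread; the file names f1 and f2 and every other value are made up so that the per-file sums come to 8 and 36.

```shell
# Hypothetical data: only the first record of f1 is taken from the thread.
tmp=$(mktemp -d)
printf 'a1 10 100 nameX 0 2 +\na2 20 200 nameY 0 6 -\n' > "$tmp/f1"
printf 'b1 10 100 nameX 0 30 +\nb2 20 200 nameY 0 6 -\n' > "$tmp/f2"
result=$(awk '
    FNR == 1 { nin++ }                               # count input files
    { sum[FILENAME] += $(NF-1); tsum += $(NF-1) }    # per-file and overall sums
    NR == 1 { first = $(NF-1); ffile = FILENAME }    # remember the first record
    END {
        scale = tsum / nin                           # 44 / 2 = 22
        printf "%.3f\n", first / sum[ffile] * scale  # 2 / 8 * 22
    }
' "$tmp/f1" "$tmp/f2")
echo "$result"
rm -rf "$tmp"
```

With four input files instead of two, the scale would be 44/4 = 11 and the same record would come out as 2.750, which is why the file count matters.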

Small revisions....

Code:

#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        # we assume that data to snarf is always next to last column regardless
        # of the number of columns in the input file.
        if( !seen[FILENAME]++ )   # must now count input files here
            nin++;                      # number of input files

        sum[FILENAME] += $(NF-1);       # sum across current file
        tsum += $(NF-1);                # sum across all files
        tnv++;                      # total number of values
    }

    END {
        statsf = "stats.out";   # stats output file name
        #tmean = tsum/tnv;      # mean of values across all files (unused)
        tmean = tsum/nin;       # not the mean anymore though we keep the original name
        nin = 0;                # number of input files
        for( fn in seen )       # make second pass across the input files
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )
            {
                nv = ($(NF-1)/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
            }
            close( fn );
            close( ofn );
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, tnv, tsum/tnv ) >statsf;
    }
' "$@"
exit

agama

08-06-2011

Registered User

184, 0

Join Date: Sep 2010

Last Activity: 10 July 2017, 5:54 AM EDT

Posts: 184

Thanks Given: 53

Thanked 0 Times in 0 Posts

Yes, it should be 2, not 4. My mistake. I ended up writing 4 because I have 4 real data sets.

---------- Post updated at 08:45 AM ---------- Previous update was at 08:40 AM ----------

And one personal question regarding awk: how come you write awk so well? How did you get into awk, and how did you practice it? I started with a free online awk book. The first few chapters were easy to follow, but after that the content became very difficult to grasp. Your suggestions would be really helpful to me. And thank you for the modifications!

---------- Post updated at 09:12 AM ---------- Previous update was at 08:45 AM ----------

I think something is wrong with the mean in the stats file. It should be

Quote:

mean across 2 input files $5SUM/2 i.e. 44/2 = 22

quincyjones

08-06-2011

Quote:

Originally Posted by quincyjones

And one personal question regarding awk: how come you write awk so well? How did you get into awk, and how did you practice it? I started with a free online awk book. The first few chapters were easy to follow, but after that the content became very difficult to grasp. Your suggestions would be really helpful to me. And thank you for the modifications!

Thanks.

I started using awk at some point in 1990 or 91. I bought the O'Reilly Sed & Awk book and went from there. Awk takes a while to wrap your head around, so don't give up. A great way to improve your skills is to look at the posted solutions on this forum. Try to solve the problem yourself, and use the posted solution(s) as a way to "check your answer." Also, having the answer can help if you just don't see how to solve the problem. Do remember that there may be lots of different approaches so your solution might not look like what was posted, but may still work.

And you are most welcome for the code and tweaks.

Sed & Awk at Amazon:
sed & awk (2nd Edition), by Dale Dougherty and Arnold Robbins (ISBN 9781565922259)

---------- Post updated at 10:30 ---------- Previous update was at 10:24 ----------

Quote:

Originally Posted by quincyjones

I think something is wrong with the mean in the stats file. It should be

Oops. Yep, I missed that one, and another small mistake earlier.

Code:

#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        # we assume that data to snarf is always next to last column regardless
        # of the number of columns in the input file.
        if( !seen[FILENAME]++ )
            nin++;                      # number of input files

        sum[FILENAME] += $(NF-1);       # sum across current file
        tsum += $(NF-1);                # sum across all files
        tnv++;                      # total number of values
    }

    END {
        statsf = "stats.out";   # stats output file name
        #tmean = tsum/tnv;      # mean of values across all files (unused)
        tmean = tsum/nin;       # not the mean anymore, we keep the original name
        for( fn in seen )       # make second pass across the input files
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )
            {
                nv = ($(NF-1)/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
            }
            close( fn );
            close( ofn );
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, nin, tmean ) >statsf;
    }
' "$@"
exit

Better now I hope!

Last edited by agama; 08-06-2011 at 01:07 PM.. Reason: pulled output to tty

agama

08-06-2011

Thanks for the valuable suggestions. However, this time the code prints the results to the terminal instead of writing them to separate output files.

quincyjones

08-06-2011

For my testing I write the small output to the tty and forgot to pull that when I cut/pasted the last sample.

In the END section, find the two lines that look like this, then uncomment the first and remove the second:

Code:

#printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
printf( "%s: %s\t%.3f\n", fn, $0, nv ); # write to stdout

Should become

Code:

printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
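
To see the effect of that redirection in isolation, here is a minimal sketch, separate from the script above. The file names in1 and in2 and their contents are made up; the point is only that a print or printf with >ofn lands in a per-input file and leaves nothing on the terminal.

```shell
# Hypothetical inputs in a scratch directory.
tmp=$(mktemp -d)
printf 'x 1 a\n' > "$tmp/in1"
printf 'y 2 b\n' > "$tmp/in2"
screen=$(awk '
    FNR == 1 && ofn != "" { close(ofn) }   # done with the previous output file
    {
        ofn = FILENAME ".out"              # one output file per input file
        print $0 > ofn                     # written to the file, not to stdout
    }
' "$tmp/in1" "$tmp/in2")
first_out=$(cat "$tmp/in1.out")
second_out=$(cat "$tmp/in2.out")
echo "stdout: [$screen]"
echo "in1.out: $first_out"
echo "in2.out: $second_out"
rm -rf "$tmp"
```

Building the name in a variable first avoids the ambiguity some awk versions have with an expression directly after the > operator.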

agama

08-06-2011

AWESOME! have a great day!

quincyjones