Normalization using awk


 
# 1  
Old 08-05-2011

I have tried to make my explanation precise in the CODE section below.
I can do this manually, but is there a way to automate it?
Given any number of inputs (4, 10, or more), the script should apply the CODE formula and print a separate output for each input, with the normalized value appended.
Something like: script.sh input1 input2 input3 input4 >>output1 output2 output3 output4 ?

input1
Code:
a1    10    100    nameX    0    2    +
a1    123    126    nameu    0    6    -
a2    10    100    nameT    0    0    +

input2
Code:
a1    10    100    name1    0    2    +
a1    123    126    name2    0    6    -
a1    223    226    name10    0    6    -
a1    323    326    name5    0    6    -
a2    10    100    name7    0    0    +
a4    10    100    name9    0    2    +
a5    123    126    name8    0    6    -
a6    10    100    name6    0    8    +

CODE

Code:
Normalized value = ($6 / SUM of $6 in that input file) * (average of $6 across all input files = total $6 sum / total row count)

Code:
output_input1
a1    10    100    nameX    0    2    +    (2/8)*(44/11)=0.25*4=1
…………
…………..

Code:
output_input2
a1    10    100    name1    0    2    +    (2/36)*(44/11)=0.055*4=0.22
…………
…………..
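As a sanity check on the arithmetic (this is just a sketch, not the requested script, and it assumes the samples above are saved as input1 and input2), the sums and the average can be reproduced with a short awk program:

```shell
# Sketch: reproduce the numbers used in the worked examples above.
cd "$(mktemp -d)"   # scratch directory so we don't clutter anything
printf 'a1 10 100 nameX 0 2 +\na1 123 126 nameu 0 6 -\na2 10 100 nameT 0 0 +\n' > input1
printf 'a1 10 100 name1 0 2 +\na1 123 126 name2 0 6 -\na1 223 226 name10 0 6 -\na1 323 326 name5 0 6 -\na2 10 100 name7 0 0 +\na4 10 100 name9 0 2 +\na5 123 126 name8 0 6 -\na6 10 100 name6 0 8 +\n' > input2
awk '{ sum[FILENAME] += $6; tsum += $6; tnv++ }      # per-file and overall sums of col 6
     END {
         tmean = tsum / tnv                          # 44 / 11 = 4
         printf "sum1=%d sum2=%d tmean=%g\n", sum["input1"], sum["input2"], tmean
         printf "first row of input1: (2/%d)*%g = %g\n", sum["input1"], tmean, (2/sum["input1"])*tmean
     }' input1 input2
```

This prints `sum1=8 sum2=36 tmean=4` and `first row of input1: (2/8)*4 = 1`, matching the worked examples.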


Last edited by quincyjones; 08-05-2011 at 10:59 PM..
# 2  
Old 08-05-2011
This should work provided your input files aren't too large (it reads everything into core). If you have large input files, then I'd suggest modifying this to make two passes over the input so that only the sums and counts need to be saved in core.

Code:
#!/usr/bin/env ksh
awk '
    {
        input[FILENAME,FNR] = $0;   # original input line
        ovalue[FILENAME,FNR] = $6;  # original value in col 6
        nr[FILENAME]++;             # num rec current file
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values

        if( !seen[FILENAME]++ )
            order[oidx++] = FILENAME;
    }

    END {
        tmean = tsum/tnv;           # mean of values across all files
        for( i = 0; i < oidx; i++ )
        {
            fn = order[i];
            ofn = sprintf( "%s.out", fn );
            for( j = 1; j <= nr[fn]; j++ )
            {
                nv = (ovalue[fn,j]/sum[fn]) * tmean;
                printf( "%s %.3f\n", input[fn,j], nv ) >ofn;    # write to output files
                #printf( "%s: %s %.3f\n", fn, input[fn,j], nv ); # uncomment to write all to stdout
            }

            close( ofn );
        }

    }
' "$@"

This will accept the input files on the command line and create output files named <input-name>.out.

Last edited by agama; 08-05-2011 at 11:56 PM.. Reason: typo
# 3  
Old 08-06-2011
Quote:
This should work provided your input files aren't too large (it reads everything into core). If you have large input files, then I'd suggest modifying this to make two passes over the input so that only the sums and counts need to be saved in core.
First of all thanks for your time and script.
Yes, my files are very large; they may contain 2-3 million rows. I don't quite understand "I'd suggest modifying this to make two passes over the input"?

And is it also possible to make the output tab-delimited?

Last edited by quincyjones; 08-06-2011 at 12:12 AM..
# 4  
Old 08-06-2011
I almost assumed you'd have a huge dataset, but of course if I had assumed that, you wouldn't have had one. :)

Here is the script with a few tweaks: it now makes two passes over the data, and the output records are tab-separated.

Code:
#!/usr/bin/env ksh

awk '
    {   # first pass to compute sums and total number of lines from all files
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values
        seen[FILENAME]  = 1;  
    }

    END {
        tmean = tsum/tnv;           # mean of values across all files
        for( fn in seen )                # for each of the original input files
        {
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )     # make second pass across the input file
            {
                nv = ($6/sum[fn]) * tmean;
                gsub( "\t+", " " );   #maybe overkill, but ensure one tab between fields
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;    # write to output files
            }
            close( fn );
            close( ofn );
        }

    }
' "$@"

In case you don't know, if you need higher precision, change the 3 in %.3f to a larger value.
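For instance, the 0.22 value from the output_input2 example comes out like this at three versus six decimal places:

```shell
# Same value printed at two precisions; (2/36)*(44/11) is the output_input2 example.
awk 'BEGIN { v = (2/36) * (44/11); printf "%.3f vs %.6f\n", v, v }'
# prints: 0.222 vs 0.222222
```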

And you are very welcome.

Last edited by agama; 08-06-2011 at 12:51 AM.. Reason: output file creation order not important, so cleaned it up a bit
# 5  
Old 08-06-2011
And a small addition

Awesome. It is much faster.
Is it possible to also produce a stats text file that contains the sum for each input and their average?

Code:
input1 $5 SUM = 8
input2 $5 SUM = 36
average of input1 and input2 = 44/11=4

# 6  
Old 08-06-2011
Glad that worked. Small tweaks below to generate a summary file.

Code:
#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values
        seen[FILENAME] = 1;
    }

    END {
        statsf = "stats.out";       # stats output file name
        tmean = tsum/tnv;           # mean of values across all files
        nin = 0;                    # number of input files
        for( fn in seen )
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )     # make second pass across the input file
            {
                nv = ($6/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;    # write to output files
            }
            close( fn );
            close( ofn );

            nin++;
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, tnv, tsum/tnv ) >statsf;
    }
' "$@"

The summary file will be created in the same directory and will have the format:
Code:
t31.data sum = 8
t31.data2 sum = 36  
mean across 2 input files 44/11 = 4.000

I didn't list the file names on the last line, just the count, as the list could get unruly, and the files are named on the lines above it anyway. I wasn't sure what you meant by $5 in your example. Did you mean the column that was used ($6)?

You should be able to add that to the printf() if you want to see that.
# 7  
Old 08-06-2011
Hi,

Yes, you are right, it is $6 and not $5. Great work. I like the code as well; it is very clean and understandable.

I just found a small mistake in my calculation of the average: it should be the total sum divided by the number of inputs, so it would look like this. Could you please modify the above script? Really sorry for this correction.

Also, note that all the inputs now have only 6 columns.
input1
Code:
a1    10    100    nameX    2    +
a1    123    126    nameu    6    -
a2    10    100    nameT    0    +

output_input1
Code:
a1    10    100    nameX    2    +    (2/8)*(44/4)=0.25*11=2.75
…………
…………..


Last edited by quincyjones; 08-06-2011 at 07:49 AM..
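For what it's worth, this correction only touches two things in the script from post #6: the value column moves from $6 to $5 (the inputs now have 6 columns), and the average becomes the total sum divided by the number of input files instead of the number of rows. A minimal sketch of that variant, untested against the real data and with the stats output omitted:

```shell
#!/usr/bin/env ksh
awk '
    {   # first pass: per-file and overall sums of the value column ($5 now)
        sum[FILENAME] += $5;
        tsum += $5;
        seen[FILENAME] = 1;
    }

    END {
        nin = 0;
        for( fn in seen )           # count the input files first
            nin++;
        tmean = tsum / nin;         # corrected average: total sum / number of inputs

        for( fn in seen )
        {
            ofn = fn ".out";
            while( (getline < fn) > 0 )     # second pass over each file
                printf( "%s\t%.3f\n", $0, ($5 / sum[fn]) * tmean ) > ofn;
            close( fn );
            close( ofn );
        }
    }
' "$@"
```

Note that with only the two sample files, nin is 2, so the average would be 44/2 = 22 (and the first row of input1 would get (2/8)*22 = 5.500); the 44/4 in the example above presumably assumes four input files.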