Normalization using awk


 
# 1  
Old 08-05-2011

I have tried to make my explanation precise in the CODE section below.
I can do this manually, but is there a way to automate it?
Given any number of inputs (4, 10, or more), the script should apply the CODE formula and print a separate output for each input, with the normalized value appended.
Something like: script.sh input1 input2 input3 input4 >>output1 output2 output3 output4 ?

input1
Code:
a1    10    100    nameX    0    2    +
a1    123    126    nameu    0    6    -
a2    10    100    nameT    0    0    +

input2
Code:
a1    10    100    name1    0    2    +
a1    123    126    name2    0    6    -
a1    223    226    name10    0    6    -
a1    323    326    name5    0    6    -
a2    10    100    name7    0    0    +
a4    10    100    name9    0    2    +
a5    123    126    name8    0    6    -
a6    10    100    name6    0    8    +

CODE

Code:
Normalized value = ($6 / SUM of $6 in that input file) * (average of $6 across all input files = total $6 sum / total row count)

Code:
output_input1
a1    10    100    nameX    0    2    +    (2/8)*(44/11)=0.25*4=1
…………
…………..

Code:
output_input2
a1    10    100    name1    0    2    +    (2/36)*(44/11)=0.055*4=0.22
…………
…………..
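As a sanity check on the arithmetic (this is just a sketch, not the requested script, and it assumes the samples above are saved as input1 and input2), the sums and the average can be reproduced with a short awk program:

```shell
# Sketch: reproduce the numbers used in the worked examples above.
cd "$(mktemp -d)"   # scratch directory so we don't clutter anything
printf 'a1 10 100 nameX 0 2 +\na1 123 126 nameu 0 6 -\na2 10 100 nameT 0 0 +\n' > input1
printf 'a1 10 100 name1 0 2 +\na1 123 126 name2 0 6 -\na1 223 226 name10 0 6 -\na1 323 326 name5 0 6 -\na2 10 100 name7 0 0 +\na4 10 100 name9 0 2 +\na5 123 126 name8 0 6 -\na6 10 100 name6 0 8 +\n' > input2
awk '{ sum[FILENAME] += $6; tsum += $6; tnv++ }      # per-file and overall sums of col 6
     END {
         tmean = tsum / tnv                          # 44 / 11 = 4
         printf "sum1=%d sum2=%d tmean=%g\n", sum["input1"], sum["input2"], tmean
         printf "first row of input1: (2/%d)*%g = %g\n", sum["input1"], tmean, (2/sum["input1"])*tmean
     }' input1 input2
```

This prints `sum1=8 sum2=36 tmean=4` and `first row of input1: (2/8)*4 = 1`, matching the worked examples.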


Last edited by quincyjones; 08-05-2011 at 10:59 PM..
# 2  
Old 08-05-2011
This should work provided your input files aren't too large (it reads everything into core). If you have large input files, then I'd suggest modifying this to make two passes over the input so that only the sums and counts need to be saved in core.

Code:
#!/usr/bin/env ksh
awk '
    {
        input[FILENAME,FNR] = $0;   # original input line
        ovalue[FILENAME,FNR] = $6;  # original value in col 6
        nr[FILENAME]++;             # num rec current file
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values

        if( !seen[FILENAME]++ )
            order[oidx++] = FILENAME;
    }

    END {
        tmean = tsum/tnv;           # mean of values across all files
        for( i = 0; i < oidx; i++ )
        {
            fn = order[i];
            ofn = sprintf( "%s.out", fn );
            for( j = 1; j <= nr[fn]; j++ )
            {
                nv = (ovalue[fn,j]/sum[fn]) * tmean;
                printf( "%s %.3f\n", input[fn,j], nv ) >ofn;    # write to output files
                #printf( "%s: %s %.3f\n", fn, input[fn,j], nv ); # uncomment to write all to stdout
            }

            close( ofn );
        }

    }
' "$@"

This will accept the input files on the command line and create output files named <input-name>.out.

Last edited by agama; 08-05-2011 at 11:56 PM.. Reason: typo
# 3  
Old 08-06-2011
Quote:
This should work provided your input files aren't too large (it reads everything into core). If you have large input files, then I'd suggest modifying this to make two passes over the input so that only the sums and counts need to be saved in core.
First of all thanks for your time and script.
Yes, my files are very large; they may contain 2-3 million rows. I don't quite understand "I'd suggest modifying this to make two passes over the input"?

And is it also possible to make the output tab-delimited?

Last edited by quincyjones; 08-06-2011 at 12:12 AM..
# 4  
Old 08-06-2011
I almost assumed you'd have a huge dataset, but of course if I had assumed that, you wouldn't have had one. :)

Here is the script with a few tweaks: it now makes two passes over the data, and the output records are tab-separated.

Code:
#!/usr/bin/env ksh

awk '
    {   # first pass to compute sums and total number of lines from all files
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values
        seen[FILENAME]  = 1;  
    }

    END {
        tmean = tsum/tnv;           # mean of values across all files
        for( fn in seen )                # for each of the original input files
        {
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )     # make second pass across the input file
            {
                nv = ($6/sum[fn]) * tmean;
                gsub( "\t+", " " );   #maybe overkill, but ensure one tab between fields
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;    # write to output files
            }
            close( fn );
            close( ofn );
        }

    }
' "$@"

In case you don't know, if you need higher precision, change the 3 in %.3f to a larger value.
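For instance, the 0.22 value from the output_input2 example comes out like this at three versus six decimal places:

```shell
# Same value printed at two precisions; (2/36)*(44/11) is the output_input2 example.
awk 'BEGIN { v = (2/36) * (44/11); printf "%.3f vs %.6f\n", v, v }'
# prints: 0.222 vs 0.222222
```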

And you are very welcome.

Last edited by agama; 08-06-2011 at 12:51 AM.. Reason: output file creation order not important, so cleaned it up a bit
# 5  
Old 08-06-2011
And a small addition

Awesome. It is much faster.
Is it possible to also produce a stats text file that contains the sum for each input and their average?

Code:
input1 $5 SUM = 8
input2 $5 SUM = 36
average of input1 and input2 = 44/11=4

# 6  
Old 08-06-2011
Glad that worked. Small tweaks below to generate a summary file.

Code:
#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        sum[FILENAME] += $6;        # sum across current file
        tsum += $6;                 # sum across all files
        tnv++;                      # total number of values
        seen[FILENAME] = 1;
    }

    END {
        statsf = "stats.out";       # stats output file name
        tmean = tsum/tnv;           # mean of values across all files
        nin = 0;                    # number of input files
        for( fn in seen )
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )     # make second pass across the input file
            {
                nv = ($6/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;    # write to output files
            }
            close( fn );
            close( ofn );

            nin++;
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, tnv, tsum/tnv ) >statsf;
    }
' "$@"

The summary file will be created in the same directory and will have the format:
Code:
t31.data sum = 8
t31.data2 sum = 36  
mean across 2 input files 44/11 = 4.000

I didn't list the file names on the last line, just the count, as the list could get unruly, and the files are named on the lines above it anyway. I wasn't sure what you meant by $5 in your example. Did you mean the column that was used ($6)?

You should be able to add that to the printf() if you want to see that.
# 7  
Old 08-06-2011
Hi,

Yes, you are right, it is $6 and not $5. Great work. I like the code as well; it is very clean and understandable.

I just found a small mistake in my calculation of the average: it should be the total sum divided by the number of inputs, so it would look like this. Could you please modify the above script? Really sorry for this correction.

Also, note that all the inputs now have only 6 columns.
input1
Code:
a1    10    100    nameX    2    +
a1    123    126    nameu    6    -
a2    10    100    nameT    0    +

output_input1
Code:
a1    10    100    nameX    2    +    (2/8)*(44/4)=0.25*11=2.75
…………
…………..


Last edited by quincyjones; 08-06-2011 at 07:49 AM..
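For what it's worth, this correction only touches two things in the script from post #6: the value column moves from $6 to $5 (the inputs now have 6 columns), and the average becomes the total sum divided by the number of input files instead of the number of rows. A minimal sketch of that variant, untested against the real data and with the stats output omitted:

```shell
#!/usr/bin/env ksh
awk '
    {   # first pass: per-file and overall sums of the value column ($5 now)
        sum[FILENAME] += $5;
        tsum += $5;
        seen[FILENAME] = 1;
    }

    END {
        nin = 0;
        for( fn in seen )           # count the input files first
            nin++;
        tmean = tsum / nin;         # corrected average: total sum / number of inputs

        for( fn in seen )
        {
            ofn = fn ".out";
            while( (getline < fn) > 0 )     # second pass over each file
                printf( "%s\t%.3f\n", $0, ($5 / sum[fn]) * tmean ) > ofn;
            close( fn );
            close( ofn );
        }
    }
' "$@"
```

Note that with only the two sample files, nin is 2, so the average would be 44/2 = 22 (and the first row of input1 would get (2/8)*22 = 5.500); the 44/4 in the example above presumably assumes four input files.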