Normalization using awk


 
# 8  
Old 08-06-2011
Small changes below. I've assumed that regardless of the number of columns, the data to normalise is always in the next to last (NF-1) column. This handles the odd case of the "all" file without the need for a specific test.

I'm a bit confused by your new computation for "average." Your description says the sum of all values divided by the number of input files, but your example shows the sum divided by 4. The code below follows your description rather than the example, so the output for the first record in the first sample file you gave is

Code:
a1    10      100     nameX   0       2       +       5.500

because 44 is divided by 2 input files, not 4. If that is wrong, where are you getting 4 from? It might be that in your testing you have two other input files that have all zeros in the next-to-last (NF-1) column, in which case both your example and the code are correct.
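The arithmetic behind that 5.500 can be checked in isolation. This is only a sketch: `normalise_one` is a hypothetical helper, and the per-file sum of 8 is an assumption, chosen because it is the value that scales a next-to-last field of 2 to 5.500 when the grand total is 44 over 2 files. The real sample data may differ.

```shell
# Hypothetical helper: scale one value the way the END block of the script does.
# Arguments: value, per-file sum (assumed), total across all files, number of files.
normalise_one() {
    awk -v value="$1" -v file_sum="$2" -v tsum="$3" -v nin="$4" 'BEGIN {
        scale = tsum / nin                      # what the script calls tmean
        printf( "%.3f\n", (value / file_sum) * scale )
    }'
}

normalise_one 2 8 44 2    # prints 5.500
normalise_one 2 8 44 4    # prints 2.750
```

The second call shows what dividing by 4 instead of 2 would do to the same record.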

Small revisions....
Code:
#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        # we assume that data to snarf is always next to last column regardless
        # of the number of columns in the input file.
        if( !seen[FILENAME]++ )   # must now count input files here
            nin++;                      # number of input files

        sum[FILENAME] += $(NF-1);       # sum across current file
        tsum += $(NF-1);                # sum across all files
        tnv++;                      # total number of values
    }

    END {
        statsf = "stats.out";   # stats output file name
        #tmean = tsum/tnv;      # mean of values across all files (unused)
        tmean = tsum/nin;       # not the mean anymore though we keep the original name
        nin = 0;                # number of input files
        for( fn in seen )       # make second pass across the input files
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )
            {
                nv = ($(NF-1)/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
            }
            close( fn );
            close( ofn );
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, tnv, tsum/tnv ) >statsf;
    }
' "$@"
exit
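The `!seen[FILENAME]++` test in the first pass is what counts the input files: it is true only for the first record read from each file name. A minimal sketch of that idiom on its own, using a hypothetical `count_inputs` helper and throwaway files with made-up rows:

```shell
# Hypothetical helper: count distinct input files and total records.
count_inputs() {
    awk '
        !seen[FILENAME]++ { nin++ }     # true only on the first record of each file
        END { printf( "%d files, %d lines\n", nin, NR ) }
    ' "$@"
}

dir=$(mktemp -d)
printf 'a 1 +\nb 2 +\n' > "$dir/one"
printf 'c 3 +\nd 4 +\ne 5 +\n' > "$dir/two"
count_inputs "$dir/one" "$dir/two"    # prints: 2 files, 5 lines
rm -r "$dir"
```

Note that an empty file never produces a record, so it is not counted, and a file named twice on the command line is counted once.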

# 9  
Old 08-06-2011
hi

Yes, it should be 2, not 4. My mistake. The reason I ended up writing 4 is that I have 4 real data sets.

---------- Post updated at 08:45 AM ---------- Previous update was at 08:40 AM ----------

And one personal question regarding awk: how come you write awk so well? How did you get into awk, and how did you practice it? I started with a free online awk book. The first few chapters were easy to follow, but after that the contents became very difficult to grasp. Your suggestions would be really helpful to me. And thank you for the modifications!

---------- Post updated at 09:12 AM ---------- Previous update was at 08:45 AM ----------

I think something is wrong with the mean in the stats file. It should be

Quote:
mean across 2 input files $5SUM/2 i.e. 44/2 = 22
# 10  
Old 08-06-2011
Quote:
Originally Posted by quincyjones
And one personal question regarding awk: how come you write awk so well? How did you get into awk, and how did you practice it? I started with a free online awk book. The first few chapters were easy to follow, but after that the contents became very difficult to grasp. Your suggestions would be really helpful to me. And thank you for the modifications!
Thanks.

I started using awk at some point in 1990 or 91. I bought the O'Reilly Sed & Awk book and went from there. Awk takes a while to wrap your head around, so don't give up. A great way to improve your skills is to look at the posted solutions on this forum. Try to solve the problem yourself, and use the posted solution(s) as a way to "check your answer." Also, having the answer can help if you just don't see how to solve the problem. Do remember that there may be lots of different approaches so your solution might not look like what was posted, but may still work.

And you are most welcome for the code and tweaks.

Sed & Awk at amazon:
Amazon.com: sed & awk (2nd Edition) (9781565922259): Dale Dougherty, Arnold Robbins: Books

---------- Post updated at 10:30 ---------- Previous update was at 10:24 ----------

Quote:
Originally Posted by quincyjones
I think something is wrong with the mean in the stats file. It should be
Oops. Yep, missed that one, and another small mistake earlier.

Code:
#!/usr/bin/env ksh
awk '
    {   # first pass to compute sums and total number of lines from all files
        # we assume that data to snarf is always next to last column regardless
        # of the number of columns in the input file.
        if( !seen[FILENAME]++ )
            nin++;                      # number of input files

        sum[FILENAME] += $(NF-1);       # sum across current file
        tsum += $(NF-1);                # sum across all files
        tnv++;                      # total number of values
    }

    END {
        statsf = "stats.out";   # stats output file name
        #tmean = tsum/tnv;      # mean of values across all files (unused)
        tmean = tsum/nin;       # not the mean anymore, we keep the original name
        for( fn in seen )       # make second pass across the input files
        {
            printf( "%s sum = %.0f\n", fn, sum[fn] ) >statsf;       # collect stats
            ofn = sprintf( "%s.out", fn );
            while( (getline < fn) > 0 )
            {
                nv = ($(NF-1)/sum[fn]) * tmean;
                gsub( "\t+", " " );
                gsub( "  +", " " );
                gsub( " ", "\t" );
                printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
            }
            close( fn );
            close( ofn );
        }
        printf( "mean across %d input files %.0f/%.0f = %.03f\n", nin, tsum, nin, tmean ) >statsf;
    }
' "$@"
exit

Better now I hope!
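One case the script does not cover is the one raised earlier in the thread: an input file whose next-to-last column is all zeros. Its `sum[fn]` is then 0, and the division in the second pass fails. A possible guard is sketched here; it is not a drop-in replacement. `normalise_guarded` is a hypothetical name, the sample rows are made up, giving a zero-sum file 0.000 is an assumed policy, and the whitespace-to-tab clean-up and the stats file are left out to keep it short.

```shell
# Sketch: same two-pass idea, but a file whose column sums to zero
# gets 0.000 appended instead of stopping awk with a division by zero.
normalise_guarded() {
    awk '
        {   # first pass: per-file sums, grand total, number of files
            if( !seen[FILENAME]++ )
                nin++
            sum[FILENAME] += $(NF-1)
            tsum += $(NF-1)
        }
        END {
            if( nin == 0 )                      # no records read at all
                exit
            scale = tsum / nin                  # tmean in the original script
            for( fn in seen )                   # second pass over each file
            {
                ofn = fn ".out"
                while( (getline line < fn) > 0 )
                {
                    n = split( line, f )        # default whitespace split
                    nv = (sum[fn] != 0) ? (f[n-1] / sum[fn]) * scale : 0
                    printf( "%s\t%.3f\n", line, nv ) > ofn
                }
                close( fn )
                close( ofn )
            }
        }
    ' "$@"
}

dir=$(mktemp -d)
printf 'a1 2 +\na2 6 +\n' > "$dir/f1"    # column sums to 8
printf 'b1 0 +\nb2 0 +\n' > "$dir/f2"    # column sums to 0
normalise_guarded "$dir/f1" "$dir/f2"
cat "$dir/f1.out" "$dir/f2.out"
rm -r "$dir"
```

With these rows the total is 8 over 2 files, so the rows of f1 gain 1.000 and 3.000, and both rows of f2 gain 0.000.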

Last edited by agama; 08-06-2011 at 01:07 PM.. Reason: pulled output to tty
# 11  
Old 08-06-2011
hi

Thanks for the valuable suggestions. However, this time the code prints the results on the terminal instead of writing them to separate output files.
# 12  
Old 08-06-2011
For my testing I wrote the small output to the tty and forgot to pull that when I cut/pasted the last sample.

In the END section there are two lines that look like the following; uncomment the first and remove the second:

Code:
#printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
printf( "%s: %s\t%.3f\n", fn, $0, nv ); # write to stdout

Should become

Code:
printf( "%s\t%.3f\n", $0, nv ) >ofn;   # write to output files
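An alternative to commenting the debugging line in and out is to choose the output target with a variable. The sketch below assumes `/dev/stdout` is usable as an output name (it exists on Linux, and gawk treats the name specially); `emit_rows` and `to_tty` are hypothetical names, and the computed value is only a placeholder for the real normalisation.

```shell
# Hypothetical wrapper: emit_rows 1 FILE... prints to the terminal,
# emit_rows 0 FILE... writes FILE.out next to each input file.
emit_rows() {
    to_tty=$1
    shift
    awk -v to_tty="$to_tty" '
        FNR == 1 {                              # first record of each input file
            if( ofn != "" && !to_tty )
                close( ofn )                    # finished with the previous .out file
            ofn = to_tty ? "/dev/stdout" : (FILENAME ".out")
        }
        {
            nv = $(NF-1)                        # placeholder for the real computation
            printf( "%s\t%.3f\n", $0, nv ) > ofn
        }
    ' "$@"
}
```

Calling `emit_rows 1 file1 file2` while testing and `emit_rows 0 file1 file2` for the real run means the same printf line serves both cases, so there is nothing to forget when pasting the script.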

# 13  
Old 08-06-2011
AWESOME! have a great day!