Use awk to read multiple files twice


 
# 1  
Old 03-09-2012

Hi folks,
I am trying to use awk to compute the mean and standard deviation of a variable that spans multiple files. Every file has the same layout: 3 columns, with a comma as the delimiter.

Code:
File1 layout:

col1,col2,col3

0,0-1,0.2345
1,1-2,0.3456
1,1-2,0.4567
2,2-3,0.5678

What I need to do is first scan every file (I have at least 200 files) and compute the global mean of the third column for each index value in the first column, over all files. Then I need a second pass to calculate the global standard deviation, again for each index value in the first column and over all files, using the global mean from the first pass.

I thought of using awk because my files are big, and other scripting languages like Perl or plain bash are turning out to be too slow. I did a test and awk can read these huge files line by line really quickly, but I am stuck on how to implement the actual calculation in awk.

Any help will be very useful.
Thanks
# 2  
Old 03-09-2012
I think it is a bit confusing.

Could you please give an example with at least 2 input files and show us the expected output?

Thanks for clarifying a little more
# 3  
Old 03-09-2012
Sure, I will try my best. Okay, here are two files, test1.txt and test2.txt:

Code:
test1.txt

0,0.0-0.1,0.00087
0,0.0-0.1,0.00089
1,0.1-0.2,0.00100
1,0.1-0.2,0.00074
1,0.1-0.2,0.00097
2,0.2-0.4,0.00208
2,0.2-0.4,0.00218
2,0.2-0.4,0.00227
3,0.9-1.0,0.00845
3,0.9-1.0,0.01016

Code:
test2.txt

0,0.0-0.1,0.00118
0,0.0-0.1,0.00131
0,0.0-0.1,0.00101
1,0.1-0.2,0.00015
1,0.1-0.2,0.00038
1,0.1-0.2,0.00122
2,0.2-0.4,0.00219
2,0.2-0.4,0.00214
2,0.2-0.4,0.00216
2,0.2-0.4,0.00199
3,0.9-1.0,0.01002
3,0.9-1.0,0.01070

the final output should be:
Code:
index    mean    std
0           m0        std0
1           m1        std1
2           m2       std2
3           m3        std3

where m0-m3 are the global mean values for each index and std0-std3 are the corresponding global standard deviations. The index values are the ones in the first column of each file. The third column is the one I need the global mean and std for.
I can already calculate the mean over all files. The problem comes once I know the mean: how do I make awk rescan all the files and use this mean to calculate the standard deviation?

Here's my awk code for calculating the global mean:
Code:
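#!/bin/awk -f
# Global mean of column 3 for each index in column 1, over all input files.
BEGIN{
    FS = ",";
    OFS = "\t";
}
# To skip a header line in each file, put the pattern FNR > 1 on this block.
{
    glbacc[$1] += $3;   # running sum per index
    glbcnt[$1]++;       # number of values per index
}
END{
    for (i in glbcnt){
        if(i != ""){    # skip records with an empty first field
            glbacc[i] = glbacc[i]/glbcnt[i];   # the sum becomes the mean
            print i, glbacc[i], glbcnt[i];
        }
    }
}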

which I call like this:
Code:
awk -f test.awk test*.txt

where test.awk is my awk script and test*.txt are all my txt files with the 3 columns of values.
I hope it is clearer now.
# 4  
Old 03-12-2012
Code:
$ cat myawk
BEGIN{
    FS = ",";
    OFS = "\t";
}
{
    k[$1]=$2          # range label of each index
    e[$1":"NR]=$3     # every value, keyed by index and record number
    n=NR
}
END{
    print "indx","range","deviation","mean","num of elements";
    # first loop over the stored values: sum, count, then mean per index
    for(i in k){
        for (j=0;++j<=n;){
            if (!((i":"j) in e)) continue
            glbacc[i]+=e[i":"j]
            glbcnt[i]++
        }
        glbacc[i] = glbacc[i]/glbcnt[i];
    }
    # second loop: sum of squared differences from the mean
    for(o in k){
        for (p=0;++p<=n;){
            if (!((o":"p) in e)) continue
            sumdelta[o]+=(e[o":"p]-glbacc[o])^2
        }
    }
    for(d in glbacc){
        if(d=="") continue
        drift[d]=sqrt(sumdelta[d]/glbcnt[d]);
        print d,k[d],drift[d],glbacc[d],glbcnt[d];
    }

}

Code:
$ awk -f myawk t1 t2
indx    range    deviation    mean    num of elements
0    0.0-0.1    0.000169753    0.001052    5
1    0.1-0.2    0.000371603    0.000743333    6
2    0.2-0.4    8.22639e-05    0.00214429    7
3    0.9-1.0    0.000837597    0.0098325    4
$

This code assumes that the formula for the deviation is, in pseudo code:

squareroot_of ( sum_of ( square_of(element - mean of elements) ) / number of elements )

Feel free to adapt it to your needs.

Last edited by ctsgnb; 03-12-2012 at 08:10 AM..
# 5  
Old 03-12-2012
Hi ctsgnb,
First of all, please accept my thanks for taking the time to offer me a solution. I appreciate your effort.
Your solution looks good, and frankly I had thought along similar lines. The only problem is that you are in effect storing all the values from column 3 in awk arrays and then going over the arrays twice. I will have at least 200+ files to process, each with at least 11 million records, so my main concern is that storing all these values in arrays will be a huge drain on memory. That is why I was looking for a way to do this without holding all the values in memory.
But again, thanks a lot for your response. Maybe there is no such solution out there.
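For what it's worth, one approach that avoids both the arrays of values and the second pass is a running (Welford-style) update of the mean and of the sum of squared deviations for each index. The following is only a minimal sketch and was not posted in this thread: the temporary directory, the heredoc files and the array names n, mean and m2 are made up for illustration, and the rows are the index 0 and index 3 records of test1.txt and test2.txt from above. It uses the same "divide by the number of elements" definition of the deviation.

```shell
#!/bin/sh
# Sketch: one pass, no per-value storage (Welford-style running update).
# File and array names are hypothetical; rows are taken from the sample files.
dir=$(mktemp -d)
cat > "$dir/test1.txt" <<'EOF'
0,0.0-0.1,0.00087
0,0.0-0.1,0.00089
3,0.9-1.0,0.00845
3,0.9-1.0,0.01016
EOF
cat > "$dir/test2.txt" <<'EOF'
0,0.0-0.1,0.00118
0,0.0-0.1,0.00131
0,0.0-0.1,0.00101
3,0.9-1.0,0.01002
3,0.9-1.0,0.01070
EOF

result=$(awk -F, '
    {
        n[$1]++                        # values seen so far for this index
        d = $3 - mean[$1]              # distance from the old running mean
        mean[$1] += d / n[$1]          # updated running mean
        m2[$1] += d * ($3 - mean[$1])  # running sum of squared deviations
    }
    END {
        for (i in n)
            printf "%s\t%.6g\t%.6g\n", i, mean[i], sqrt(m2[i] / n[i])
    }
' "$dir"/test*.txt | sort -n)

echo "$result"    # columns: index, mean, std
rm -rf "$dir"
```

Memory use grows only with the number of distinct index values, not with the number of records. A simpler variant keeps the sum and the sum of squares per index, but subtracting two large, nearly equal numbers can lose precision, which is why the running update is sketched here instead.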
# 6  
Old 03-12-2012
With the formula above you will have to make 2 passes one way or another, since the mean is needed to calculate the deviation.

If you can't do the 2 passes in memory, you can either split the task into smaller ones that the memory can handle, and/or use a temporary file that you then scan to calculate your deviation.
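The two passes do not necessarily need the values held in arrays or in a temporary file. awk processes its operands in order, and a var=value operand between file names is assigned when awk reaches it, so the same files can simply be listed twice with a flag in front of each list. The following is a minimal sketch under that assumption, not something tested on the real 200-file data set; the variable name pass, the array names and the temporary directory are made up for illustration, and the heredocs reproduce test1.txt and test2.txt from above.

```shell
#!/bin/sh
# Sketch: read every file twice by listing it twice on the command line.
# "pass" and the array names are hypothetical; data is the sample from post #3.
dir=$(mktemp -d)
cat > "$dir/test1.txt" <<'EOF'
0,0.0-0.1,0.00087
0,0.0-0.1,0.00089
1,0.1-0.2,0.00100
1,0.1-0.2,0.00074
1,0.1-0.2,0.00097
2,0.2-0.4,0.00208
2,0.2-0.4,0.00218
2,0.2-0.4,0.00227
3,0.9-1.0,0.00845
3,0.9-1.0,0.01016
EOF
cat > "$dir/test2.txt" <<'EOF'
0,0.0-0.1,0.00118
0,0.0-0.1,0.00131
0,0.0-0.1,0.00101
1,0.1-0.2,0.00015
1,0.1-0.2,0.00038
1,0.1-0.2,0.00122
2,0.2-0.4,0.00219
2,0.2-0.4,0.00214
2,0.2-0.4,0.00216
2,0.2-0.4,0.00199
3,0.9-1.0,0.01002
3,0.9-1.0,0.01070
EOF

result=$(awk -F, '
    # pass 1: sum and count per index
    pass == 1 { sum[$1] += $3; cnt[$1]++; next }
    # pass 2: squared deviations from the mean found in pass 1
    pass == 2 {
        if (!($1 in mean)) mean[$1] = sum[$1] / cnt[$1]
        d = $3 - mean[$1]
        sq[$1] += d * d
    }
    END {
        for (i in cnt)
            printf "%s\t%.6g\t%.6g\n", i, mean[i], sqrt(sq[i] / cnt[i])
    }
' pass=1 "$dir"/test*.txt pass=2 "$dir"/test*.txt | sort -n)

echo "$result"    # columns: index, mean, std
rm -rf "$dir"
```

With the program saved in a script file (say twopass.awk, a hypothetical name), the call would look like awk -F, -f twopass.awk pass=1 test*.txt pass=2 test*.txt. Every file is read from disk twice, but only a few numbers per index stay in memory.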

---------- Post updated at 02:42 PM ---------- Previous update was at 02:39 PM ----------

The real point is: calculation and processing on such a data volume should be done at the database level, not at the scripting level.

Last edited by ctsgnb; 03-12-2012 at 11:20 AM..
# 7  
Old 03-12-2012
Yeah, I think those are the only options. I was hoping there was some neat trick in awk that I don't know of which would let me scan the same input files twice, but it seems I was just imagining it.

This operation has to run every day on a set of files that gets updated daily, and the files themselves are generated by C code. A master program takes care of everything overall, so there are other implications that make a database implementation, though logical, not the preferred choice.

Thanks again for the feedback.