Compute average ignoring outliers of different segments within a dat file using awk


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Compute average ignoring outliers of different segments within a dat file using awk
# 1  
Old 09-18-2014
Compute average ignoring outliers of different segments within a dat file using awk

I have data files that look like this, say data.txt
Code:
0.00833 6.34
0.00833 6.95
0.00833 7.08
0.00833 8.07
0.00833 8.12
0.00833 8.26
0.00833 8.70
0.00833 9.36
0.01667 20.53
0.01667 6.35
0.01667 6.94
0.01667 7.07
0.01667 8.06
0.01667 8.10
0.01667 8.25
0.01667 8.71
0.01667 9.31
0.02500 20.19
0.02500 6.35
0.02500 6.92
0.02500 7.07
0.02500 8.08
0.02500 8.09
0.02500 8.24
0.02500 8.70
0.02500 9.26
0.03333 19.89
0.03333 6.33
0.03333 6.90
0.03333 7.07
0.03333 8.07
0.03333 8.09
0.03333 8.22
0.03333 8.70
0.03333 9.22
0.04167 19.65
0.04167 6.34
0.04167 6.87
0.04167 7.07
0.04167 8.03
0.04167 8.08
0.04167 8.19
0.04167 8.69
0.04167 9.19

As you can see the data has various segments based on column 1. I use the following code to compute the mean of each segment and output the value of column 1 for that segment and the mean of the values of column 2 and some other things just so I can check am doing the right thing.
Code:
awk '{if($1<0)$1=0}
{
    sum[$1]+=$2
    cnt[$1]++
}
END {
#     print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum[i], cnt[i], sum[i]/cnt[i]

}' data.txt  | sort -n -k1 > avgFile.txt

Unfortunately as you can see, my data has outliers in these segments. I need to remove these outliers before I compute the mean so that they don't mess up my results. I am using awk to process my data.

This is what I have been able to do so far, if I get one segment to a file say temp.txt I am able to use the following code to remove the outlier in that segment
Code:
awk 'BEGIN{CNT=0} {ROW[CNT]=$0;DATA[CNT]=$2; 
    TOTAL+=$2;CNT+=1;} END{for (i = 0;i < NR; i++){if ((sqrt((DATA[i]-(TOTAL/NR))^2))<((TOTAL/NR)*30/100)) 
    {print ROW[i] ;}}}' temp.txt

But I need to able to do this within the code that computes the average so that my mean value excludes this outlier.

Any assistance will be highly appreciated.

Malandisa

Last edited by Scott; 09-18-2014 at 03:40 PM.. Reason: Moved from Programming forum
# 2  
Old 09-18-2014
Do you mean use standard deviation to identify "outliers"? That is usually the accepted approach - 3 stddev from the mean.
# 3  
Old 09-18-2014
Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.
# 4  
Old 09-18-2014
This can be done with associative arrays in awk i.e. if you are familiar with them and if you're not then I'd suggest reading up on them...

---------- Post updated at 01:12 PM ---------- Previous update was at 12:22 PM ----------

Quote:
Originally Posted by malandisa
Yes please, In this case, in the little code, for each segment, I am removing the rows where the second column element has more than 30% divergence from the average, then I consider such to be an outlier.
Here is how you'd go about eliminating outliers from your data in order to compute the mean...
Code:
awk '{
    cnt[$1]++
    val[$1] = (val[$1] ? val[$1] "," $2 : $2)
    sum[$1] += $2
} END {
    for (i in val) {
        n = split(val[i], a, ",")
        for (k=1; k<=n; k++)
            if (!((sqrt((a[k] - (sum[i]/cnt[i]))^2)) < ((sum[i] / cnt[i]) * (30/100)))) {
                cnt[i]--
                sum[i] -= val[i]
            }
    }
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum[i], cnt[i], sum[i] / cnt[i] | "sort -nk1"
}' data.txt

# 5  
Old 09-19-2014
Thank you for your suggestion, but interestingly it suggestion for a small file, when I run this on a big file it complains about attempted division by 0. Let me attach a large file and you see what I am talking about. However I am very grateful for your response, it gives me a starting point.
# 6  
Old 09-19-2014
Shamrock please help me learn something. What does this line
Code:
val[$1] = (val[$1] ? val[$1] "," $2 : $2)

do exactly in this code! Am sure once I understand this, I would be able to see where the problem could be

---------- Post updated at 05:58 AM ---------- Previous update was at 03:04 AM ----------

Okay I understand that this creates a string with the values in column 2 separated by a comma. so that it is latter split

---------- Post updated at 06:06 AM ---------- Previous update was at 05:58 AM ----------

and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
Code:
sum[i] -= val[i]

sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.
# 7  
Old 09-19-2014
Quote:
Originally Posted by malandisa
and the problem I see here is if indeed val is a string created from the values of column 2each value separated by comma, how does the following part of the code work?
Code:
sum[i] -= val[i]

sum is a number and val is a string? Sorry for so many questions I am new to awk and I really want to learn it.
Good catch...it should be sum[i] -= a[k] and I did it run the modified code on "temp.txt" and it gave me no such errors...

Last edited by shamrock; 09-19-2014 at 05:41 PM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

awk to average matching lines in file

The awk below executes and is close (producing the first 4 columns in desired). However, when I add the sum of $7, I get nothing returned. Basically, I am trying to combine all the matching $4 in f1 and output them with the average of $7 in each match. Thank you :). f1 ... (2 Replies)
Discussion started by: cmccabe
2 Replies

2. Shell Programming and Scripting

Compute average based on field values

Im looking for a way to average the values in field 14 (when field 2 is equal to 2016) and fields 3 and 4 (when field 2 is equal to 2017). Any help is appreciated. 001001 2016 33.22 38.19 48.07 51.75 59.77 67.68 70.86 72.21 66.92 53.67 42.31 40.15 001001 2017 ... (10 Replies)
Discussion started by: ncwxpanther
10 Replies

3. Shell Programming and Scripting

Shell or awk script to compute average of all the points within a circle

HI Help, I have a file which looks like below --- Input file ---> 1970113.00000 3460.00000 1.09516 1970116.00000 3791.00000 1.06350 1970120.00000 4120.00000 1.07588 1970115.00000 4450.00000 1.09591 1970116.00000 4780.00000 1.09965 1970120.00000 5109.00000 1.06733 ... (7 Replies)
Discussion started by: Indra2011
7 Replies

4. Shell Programming and Scripting

Average, min and max in file with header, using awk

Hi, I have a file which looks like this: FID IID MISS_PHENO N_MISS N_GENO F_MISS 12AB43131 12AB43131 N 17774 906341 0.01961 65HJ87451 65HJ87451 N 10149 906341 0.0112 43JJ21345 43JJ21345 N 2826 906341 0.003118I would... (11 Replies)
Discussion started by: kayakj
11 Replies

5. Shell Programming and Scripting

awk command on .DAT file not working?

Hi All, I am trying to run awk command on .DAT file and it is not working. The same command is working on .txt file: Contents of the file ZZ_55555555_444444_ZZZZZZ_7777777_888_99.DAT: HEADER|ZZ_55555555_444444_ZZZZZZ_7777777_888_99.DAT... (10 Replies)
Discussion started by: sagar.cumar
10 Replies

6. Shell Programming and Scripting

awk based script to find the average of all the columns in a data file

Hi All, I need the modification for the below mentioned code (found in one more post https://www.unix.com/shell-programming-scripting/27161-script-generate-average-values.html) to find the average values for all the columns(but for a specific rows) and print the averages side by side. I have... (4 Replies)
Discussion started by: ks_reddy
4 Replies

7. Shell Programming and Scripting

Remove interspersed headers in .dat file with AWK

Heya there, A small selection of my data is shown below. DATE TIME FRAC_DAYS_SINCE_JAN1 2011-06-25 08:03:20.000 175.33564815 2011-06-25 08:03:25.000 175.33570602 2011-06-25 ... (4 Replies)
Discussion started by: gd9629
4 Replies

8. Shell Programming and Scripting

Compute the median of a set of numbers with AWK?

Is there a way in awk to compute the median of a set of numbers in a file in the following format. 34 67 78 100 23 45 67 (3 Replies)
Discussion started by: Lucky Ali
3 Replies

9. Shell Programming and Scripting

using awk to print average and standard deviation into a file

Hi I want to use awk to print avg and st deviation but it does not go into a file for column 1 only. I can do average and # of records but i cannot get st deviation. awk '{sum+=$1} END { print "Average = ",sum/NR}' thanks (1 Reply)
Discussion started by: phil_heath
1 Replies

10. Shell Programming and Scripting

[Splitting file] Extracting group of segments from one file to others

Hi there, I need to split one huge file into separate files if the condition is fulfilled according to that the position between 97 and 98 matches with “IT” at the segment MAS. There is no delimiter file is fix-width with varous line length. Could you please help me how I do split the file... (1 Reply)
Discussion started by: ozgurgul
1 Replies
Login or Register to Ask a Question