Compute average ignoring outliers of different segments within a dat file using awk

09-18-2014

Registered User

30, 1

Join Date: Sep 2010

Last Activity: 9 March 2015, 9:06 PM EDT

Location: Lusaka, Zambia

Posts: 30

Thanks Given: 6

Thanked 1 Time in 1 Post

I have data files that look like this, say data.txt

Code:

0.00833 6.34
0.00833 6.95
0.00833 7.08
0.00833 8.07
0.00833 8.12
0.00833 8.26
0.00833 8.70
0.00833 9.36
0.01667 20.53
0.01667 6.35
0.01667 6.94
0.01667 7.07
0.01667 8.06
0.01667 8.10
0.01667 8.25
0.01667 8.71
0.01667 9.31
0.02500 20.19
0.02500 6.35
0.02500 6.92
0.02500 7.07
0.02500 8.08
0.02500 8.09
0.02500 8.24
0.02500 8.70
0.02500 9.26
0.03333 19.89
0.03333 6.33
0.03333 6.90
0.03333 7.07
0.03333 8.07
0.03333 8.09
0.03333 8.22
0.03333 8.70
0.03333 9.22
0.04167 19.65
0.04167 6.34
0.04167 6.87
0.04167 7.07
0.04167 8.03
0.04167 8.08
0.04167 8.19
0.04167 8.69
0.04167 9.19

As you can see, the data is divided into segments by the value in column 1. I use the following code to compute the mean of column 2 for each segment. It prints the column 1 value for the segment, the sum, the count, and the mean, so that I can check I am doing the right thing.

Code:

awk '{if($1<0)$1=0}
{
    sum[$1]+=$2
    cnt[$1]++
}
END {
#     print "Name" "\t" "sum" "\t" "cnt" "\t" "avg"
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum[i], cnt[i], sum[i]/cnt[i]

}' data.txt  | sort -n -k1 > avgFile.txt

Unfortunately, as you can see, my data has outliers in these segments. I need to remove them before computing the mean so that they don't distort my results. I am using awk to process my data.

This is what I have managed so far: if I extract one segment to a file, say temp.txt, the following code removes the outliers in that segment.

Code:

awk '{ ROW[NR] = $0; DATA[NR] = $2; TOTAL += $2 }
END {
    for (i = 1; i <= NR; i++)   # keep rows within 30% of the mean
        if (sqrt((DATA[i] - TOTAL/NR)^2) < (TOTAL/NR) * 30/100)
            print ROW[i]
}' temp.txt

But I need to be able to do this within the code that computes the average, so that my mean value excludes the outliers.

Any assistance will be highly appreciated.

Malandisa

Last edited by Scott; 09-18-2014 at 03:40 PM.. Reason: Moved from Programming forum

malandisa

09-18-2014

Registered User

11,728, 1,345

Join Date: Feb 2004

Last Activity: 8 May 2020, 9:07 AM EDT

Location: NM

Posts: 11,728

Thanks Given: 903

Thanked 1,345 Times in 1,201 Posts

Do you mean use standard deviation to identify "outliers"? That is usually the accepted approach - 3 stddev from the mean.
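For what it's worth, a standard-deviation filter could be sketched along these lines. This is only an illustration, not code from the thread: the function name `sd_filtered_mean` and the cutoff parameter `k` are made up here, and it uses the population standard deviation. One thing to keep in mind: a single value can never lie more than sqrt(n-1) population standard deviations from the mean of n values, so with nine-row segments like the ones posted nothing can exceed about 2.83, and a cutoff of 3 would reject nothing. The usage line therefore assumes a smaller cutoff of 2.

```shell
# Hypothetical helper: per-segment mean after dropping values that lie more
# than k population standard deviations from the segment mean.
# Usage: sd_filtered_mean K [FILE...]
sd_filtered_mean() {
    k=$1; shift
    awk -v k="$k" '
    {
        n[$1]++
        sum[$1]   += $2
        sumsq[$1] += $2 * $2
        val[$1, n[$1]] = $2            # keep every value for the filtering step
    }
    END {
        for (s in n) {
            mean = sum[s] / n[s]
            var  = sumsq[s] / n[s] - mean * mean
            sd   = (var > 0) ? sqrt(var) : 0
            keptsum = 0; kept = 0
            for (j = 1; j <= n[s]; j++) {
                d = val[s, j] - mean
                if (d < 0) d = -d
                if (d <= k * sd) { keptsum += val[s, j]; kept++ }
            }
            if (kept > 0)              # columns: segment, rows kept, mean
                printf "%8.5f   %6d   %6.3f\n", s, kept, keptsum / kept
        }
    }' "$@" | sort -n -k1
}
```

It would be called as `sd_filtered_mean 2 data.txt`.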

jim mcnamara

09-18-2014

Yes please. In this case, in the small script above, for each segment I remove the rows whose second-column value diverges from the segment average by more than 30%; I consider those values outliers.
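As a quick check of that rule on the posted data, the sketch below applies the 30% test to the second column of the 0.01667 segment and lists what gets rejected. The helper name `rejected_30pct` is hypothetical. The mean of the nine values is about 9.258, so the cutoff is about 2.777, and both 20.53 and 6.35 fall outside it: the outlier pulls the mean up far enough that an otherwise ordinary value is dropped as well.

```shell
# Hypothetical helper: read one value per line and print the values that
# deviate from the mean of all values by 30% of that mean or more.
rejected_30pct() {
    awk '{ v[NR] = $1; total += $1 }
    END {
        if (NR == 0) exit              # nothing to do on empty input
        mean = total / NR
        for (i = 1; i <= NR; i++) {
            d = v[i] - mean
            if (d < 0) d = -d
            if (d >= mean * 30 / 100) print v[i]
        }
    }'
}

# Column 2 of the 0.01667 segment from data.txt
printf '%s\n' 20.53 6.35 6.94 7.07 8.06 8.10 8.25 8.71 9.31 | rejected_30pct
# prints 20.53 and 6.35, each on its own line
```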

malandisa

09-18-2014

Registered User

1,613, 160

Join Date: Oct 2007

Last Activity: 12 February 2019, 12:19 PM EST

Location: USA

Posts: 1,613

Thanks Given: 40

Thanked 160 Times in 150 Posts

This can be done with associative arrays in awk. If you are not familiar with them, I'd suggest reading up on them...

---------- Post updated at 01:12 PM ---------- Previous update was at 12:22 PM ----------

Here is how you'd go about eliminating outliers from your data in order to compute the mean...

Code:

awk '{
    cnt[$1]++
    val[$1] = (val[$1] ? val[$1] "," $2 : $2)
    sum[$1] += $2
} END {
    for (i in val) {
        n = split(val[i], a, ",")
        for (k=1; k<=n; k++)
            if (!((sqrt((a[k] - (sum[i]/cnt[i]))^2)) < ((sum[i] / cnt[i]) * (30/100)))) {
                cnt[i]--
                sum[i] -= val[i]
            }
    }
    for (i in sum)
        printf "%8.5f   %6.2f   %6d   %6.3f\n", i, sum[i], cnt[i], sum[i] / cnt[i] | "sort -nk1"
}' data.txt

shamrock

09-19-2014

Thank you for your suggestion. Interestingly, it works for a small file, but when I run it on a big file it complains about an attempted division by 0. Let me attach a large file so you can see what I am talking about. I am very grateful for your response, though; it gives me a starting point.

temp.txt (262.2 KB)

malandisa

09-19-2014

Shamrock please help me learn something. What does this line

Code:

val[$1] = (val[$1] ? val[$1] "," $2 : $2)

do exactly in this code? I am sure that once I understand this, I will be able to see where the problem could be.

---------- Post updated at 05:58 AM ---------- Previous update was at 03:04 AM ----------

Okay, I understand that this creates a string of the column 2 values separated by commas, so that it can later be split.
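A tiny experiment along these lines seems to confirm that reading. The function name `join_demo` and the three input rows are invented for the illustration; only the `val[$1] = ...` line comes from the posted script.

```shell
# Hypothetical demo: build the comma-joined string per key, then split it again.
join_demo() {
    awk '
    { val[$1] = (val[$1] ? val[$1] "," $2 : $2) }   # append with a comma unless val[$1] is still empty
    END {
        n = split(val["a"], parts, ",")             # back to an array: parts[1..n]
        print val["a"], n, parts[2]
    }'
}

printf 'a 1.5\na 2.5\nb 7\n' | join_demo
# prints: 1.5,2.5 2 2.5
```

One caveat, as far as I can tell: the test `val[$1] ? ...` treats a stored value of 0 as false, so a segment whose first value is 0 would have that value overwritten rather than appended to. Testing `val[$1] != ""` instead should avoid that.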

---------- Post updated at 06:06 AM ---------- Previous update was at 05:58 AM ----------

And the problem I see here is this: if val is indeed a string built from the column 2 values, each separated by a comma, how does the following part of the code work?

Code:

sum[i] -= val[i]

sum is a number and val is a string? Sorry for so many questions; I am new to awk and I really want to learn it.
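A small experiment suggests why awk does not complain here: when a string is used in arithmetic, awk converts its leading numeric part and ignores the rest, so a comma-joined list behaves like its first value. The helper name `first_number` is made up for this check.

```shell
# Hypothetical demo: use a comma-joined string in arithmetic and print the result.
first_number() {
    awk -v s="$1" 'BEGIN { print s + 0 }'   # s is a string; s + 0 forces numeric conversion
}

first_number "20.53,6.35,6.94"
# prints 20.53
```

So `sum[i] -= val[i]` runs without an error, but it subtracts the first value of the segment each time instead of the value that was judged to be an outlier.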

malandisa

09-19-2014

Quote:

Originally Posted by malandisa

And the problem I see here is this: if val is indeed a string built from the column 2 values, each separated by a comma, how does the following part of the code work?

Code:

sum[i] -= val[i]

sum is a number and val is a string? Sorry for so many questions; I am new to awk and I really want to learn it.

Good catch... it should be sum[i] -= a[k]. I did run the modified code on "temp.txt" and it gave me no such errors...
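For anyone who wants a complete script with that correction, here is one possible reworking, offered as a sketch rather than the definitive fix. It differs from the posted script in three ways that are assumptions of this sketch. Values are stored in an array keyed by segment and position instead of a comma-joined string. The segment mean is computed once, before filtering, as in the single-segment filter from the first post; the posted loop adjusts sum[i] and cnt[i] while still using them for the mean, so its result depends on row order. Finally, a segment is skipped when every value is rejected, which is one way a division by zero could arise. The function name `segment_mean` is hypothetical.

```shell
# Hypothetical helper: per-segment mean of column 2 after dropping values that
# deviate from the segment mean by 30% of that mean or more.
# Usage: segment_mean [FILE...]
segment_mean() {
    awk '
    {
        n[$1]++
        total[$1] += $2
        val[$1, n[$1]] = $2                  # store values per segment, no string joins
    }
    END {
        for (s in n) {
            mean  = total[s] / n[s]          # fixed before any filtering
            limit = mean * 30 / 100
            sum = 0; cnt = 0
            for (j = 1; j <= n[s]; j++) {
                d = val[s, j] - mean
                if (d < 0) d = -d
                if (d < limit) { sum += val[s, j]; cnt++ }
            }
            if (cnt > 0)                     # guard against division by zero
                printf "%8.5f   %6.2f   %6d   %6.3f\n", s, sum, cnt, sum / cnt
        }
    }' "$@" | sort -n -k1
}
```

It would be run as `segment_mean data.txt > avgFile.txt`, and it prints the same four columns as the original averaging script: segment, sum, count, and mean of the values that were kept.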

Last edited by shamrock; 09-19-2014 at 05:41 PM..

shamrock
