Compute average based on field values

02-22-2018

Registered User

Join Date: Aug 2011

Last Activity: 5 January 2020, 12:29 PM EST

Posts: 152

Thanks Given: 48

Thanked 3 Times in 3 Posts

I'm looking for a way to average the value in field 14 (when field 2 is equal to 2016) together with the values in fields 3 and 4 (when field 2 is equal to 2017).

Any help is appreciated.

Code:

001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83

(40.15 + 42.04 + 44.02) / 3

Expected output

Code:

001001  42.07

ncwxpanther

02-22-2018

Registered User

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

After six and a half years as a member and with more than 130 posts, some of them on handling / calculating averages, one would presume you have at least some idea of an approach. So, any attempts from your side?

RudiC

02-22-2018

Using the example that RudiC linked to, I know how to get the average of fields 3 and 4, but I'm not sure how to include field 14 in the calculation. The script below is not clean but appears to work.

Code:

 awk 'NR == FNR {if ($2==2017) {SUM[$1] += ($3 + $4)/2; CNT[$1]++}; next} {print $0, SUM[$1]/CNT[$1]}' OFS="\t" OFMT="%.2f" file file

Code:

001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.150          43.03
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.830          43.03

ncwxpanther

02-22-2018

Quote:

Originally Posted by ncwxpanther

. . . but appears to work.

Does it? For your 2017 line, yes, but not for your specified task across the entire data file. To implement that, you need to tell us:
- are the records sorted in ascending / descending order?
- are there more / other lines, on top of 2016 or 2017 lines, like 2015 or 2018? Should those be ignored?

Try

Code:

awk '$2==2016 {SUM[$1]+=$14; CNT[$1]++} $2==2017 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}' file

If there are no other data / years to be found, this can be simplified. The order of data is not conserved.
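On the point that the order of data is not conserved: `for (s in SUM)` visits the keys in an unspecified order. A minimal sketch of one way to keep the first-seen order follows, run against the two sample lines from the first post. The helper array `ORDER`, the counter `N`, the file name `sample.txt` and the `printf` formatting are assumptions for illustration, not part of the proposal above.

```shell
# Sample data from the first post, written to a hypothetical file name.
cat > sample.txt <<'EOF'
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
EOF

result=$(awk '
# Remember each key the first time a relevant row shows up.
($2==2016 || $2==2017) && !($1 in CNT) {ORDER[++N] = $1}
$2==2016 {SUM[$1]+=$14; CNT[$1]++}
$2==2017 {SUM[$1]+=$3+$4; CNT[$1]+=2}
# Walk the keys by index instead of "for (s in SUM)", so input order is kept.
END {for (i=1; i<=N; i++) {s=ORDER[i]; printf "%s %.2f\n", s, SUM[s]/CNT[s]}}
' sample.txt)

echo "$result"
```

On the sample this prints `001001 42.07`, the same value as the one-liner above. If the input is already sorted by field 1, piping the original output through `sort` would be a simpler alternative.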

EDIT: IF there's 2016 & 2017 only, try

Code:

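awk '
        # 2016 rows contribute field 14; all other rows (assumed to be 2017) contribute fields 3 and 4
        {SUM[$1] += ($2==2016 ? $14 : $3+$4)
        # $2-2015 is 1 for 2016 and 2 for 2017: the number of values just added
         CNT[$1] += $2-2015
        }

        # the order in which "for (s in SUM)" visits the keys is unspecified
END     {for (s in SUM) print s, SUM[s]/CNT[s]
        }

' file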
001001 42.07

Last edited by RudiC; 02-22-2018 at 03:39 PM..

RudiC

02-23-2018

The records are sorted by field 2 then by field 1.

Code:

001001  2015    32.97   30.74   48.52   56.88   60.67   69.22   72.41   69.48   64.44   54.18   49.64   47.21
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
001001  2018    30.78   -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99
001002  1900    39.87   34.00   49.69   56.97   63.68   71.04   73.20   73.63   71.76   54.52   47.71   40.05
001002  1901    40.68   43.45   49.51   60.91   69.08   71.44   73.69   73.94   67.93   57.72   52.77   41.38
001002  1902    38.23   47.23   58.32   56.21   60.51   72.48   73.54   72.19   67.95   60.04   49.51   43.66

Your script

Code:

awk '$2==2016 {SUM[$1]+=$14; CNT[$1]++} $2==2017 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}'

works for a 3 value computation ((2016 field 14 + 2017 field 3 + field 4)/3)

Output

Code:

001001 42.07

but once you add more values, the calculation is not correct.

For instance

Code:

awk '$2>=2015 {SUM[$1]+=$14; CNT[$1]++} $2<=2018 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}'

Output

Code:

001001 35.2234

I could not replicate the end result in Excel. Perhaps it's the ordering/sorting? There are additional lines on top of 2015, so that could be an issue. They should not be ignored.

ncwxpanther

02-23-2018

Quote:

Originally Posted by ncwxpanther

The records are sorted by field 2 then by field 1.

They are not. It's field 1, then field 2.

Quote:

. . . but once you add more values, the calculation is not correct. . . .

How can that be "not correct"? You didn't specify what to do for field 2 values other than 2016 and 2017, so "ignore" was assumed. With your NEW sample data, the proposal given yields

Code:

001001 42.07

- exactly what was requested.

With your modified code, several fields will be counted more than once, falsifying the average.
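To see how the modified conditions behave, it can help to count how many values each pattern feeds into the sum. The sketch below does that for the seven sample lines posted above. The file name `data.txt` and the counter arrays `N14`, `N34` and `BOTH` are names chosen here for illustration.

```shell
# The seven sample lines from the thread, written to a hypothetical file name.
cat > data.txt <<'EOF'
001001  2015    32.97   30.74   48.52   56.88   60.67   69.22   72.41   69.48   64.44   54.18   49.64   47.21
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
001001  2018    30.78   -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99
001002  1900    39.87   34.00   49.69   56.97   63.68   71.04   73.20   73.63   71.76   54.52   47.71   40.05
001002  1901    40.68   43.45   49.51   60.91   69.08   71.44   73.69   73.94   67.93   57.72   52.77   41.38
001002  1902    38.23   47.23   58.32   56.21   60.51   72.48   73.54   72.19   67.95   60.04   49.51   43.66
EOF

counts=$(awk '
$2>=2015            {N14[$1]++}      # values taken from field 14
$2<=2018            {N34[$1]+=2}     # values taken from fields 3 and 4
$2>=2015 && $2<=2018 {BOTH[$1]++}    # rows matched by both patterns
END {for (s in N34) print s, N14[s]+0, N34[s], BOTH[s]+0}
' data.txt | sort)

echo "$counts"
# prints:
# 001001 4 8 4
# 001002 0 6 0
```

Every 2015 to 2018 row matches both patterns, so it contributes field 14 as well as fields 3 and 4, and every row before 2015 still matches `$2<=2018`. The 2018 row also carries `-99.99` values, which look like missing-data markers (an assumption) and would pull the average down if summed.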

RudiC

02-23-2018

You are correct. The data is sorted by field 1, then field 2.

Can the code be modified to work across multiple fields, for instance for values between 2015 and 2018? Much like the script you helped me with here. It seems like you would just divide by the number of fields examined.
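One possible reading of this request, sketched under stated assumptions rather than as a confirmed answer: for a year range FROM to TO, pool field 14 of every year from FROM up to TO-1 with fields 3 and 4 of every year from FROM+1 up to TO, then divide by the number of values actually used. The helper name `range_avg`, the variables `from`, `to` and `miss`, the file name `data.txt`, and the treatment of `-99.99` as a missing value to be skipped are all assumptions.

```shell
# The seven sample lines from the thread, written to a hypothetical file name.
cat > data.txt <<'EOF'
001001  2015    32.97   30.74   48.52   56.88   60.67   69.22   72.41   69.48   64.44   54.18   49.64   47.21
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
001001  2018    30.78   -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99
001002  1900    39.87   34.00   49.69   56.97   63.68   71.04   73.20   73.63   71.76   54.52   47.71   40.05
001002  1901    40.68   43.45   49.51   60.91   69.08   71.44   73.69   73.94   67.93   57.72   52.77   41.38
001002  1902    38.23   47.23   58.32   56.21   60.51   72.48   73.54   72.19   67.95   60.04   49.51   43.66
EOF

# Hypothetical helper. Usage: range_avg FROM TO FILE
range_avg() {
  awk -v from="$1" -v to="$2" -v miss=-99.99 '
  # Add one value to the running sum unless it is the assumed missing marker.
  function add(key, v) { if ((v + 0) != (miss + 0)) { SUM[key] += v; CNT[key]++ } }
  $2 >= from && $2 <  to { add($1, $14) }              # field 14 of FROM .. TO-1
  $2 >  from && $2 <= to { add($1, $3); add($1, $4) }  # fields 3, 4 of FROM+1 .. TO
  END { for (s in SUM) printf "%s %.2f\n", s, SUM[s]/CNT[s] }
  ' "$3"
}

r1=$(range_avg 2016 2017 data.txt)
r2=$(range_avg 2015 2018 data.txt)
echo "$r1"   # prints: 001001 42.07
echo "$r2"   # prints: 001001 39.18
```

With 2016 and 2017 as the bounds this reproduces the 42.07 from earlier in the thread. With 2015 and 2018 it averages eight values for 001001, because the `-99.99` in field 4 of the 2018 row is skipped. Station 001002 produces no output because none of its sample rows fall in either range.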

ncwxpanther
