Average across multiple columns group by

01-26-2015


Quote:

Originally Posted by Don Cragun

The code you showed us in your 1st post in this thread skips data in the 1st line of your file (which I assumed was intended to skip over a header line). But, I don't see any headers in this sample. Is there a header, or not? If there is a header, should it be copied to the output?

Yes, there is a header, and it should be carried over to the output.

Is the number of fields constant in an input file, or can it vary from line to line?

It is constant at 22. The missing values are designated as NA

It looks like there is a leading space in your sample input and output. Is a leading space required in your output?

No leading spaces and none required.

Do you want 2 decimal places in all computed output fields, or do you want values to be printed without decimal places (as in your sample output) in cases where the computed result is an integral value?

Integer values can also be printed with 2 decimal places; it doesn't matter whether .00 is added or not.

You say you want to calculate averages for fields 7 through NF, but your sample data also calculates the average for field 6? Is field 6 supposed to be ignored in calculations and removed from the output, or is field 6 to be averaged as well as fields 7 through NF?

Yes, $6 is to be ignored, and it doesn't appear in the output.

ritakadm


01-26-2015


I want to edit the sample input slightly to accommodate NA missing values and to remove the leading spaces. For a small, fixed number of columns you can of course use arr1, arr2, arr3, etc., just as you have used arr.

Code:

 
a1 b1 c1 d1 e1 12 13 14 15
a1 b1 c1 d1 e1 14 15 16 17
a1 b1 c1 d1 e1 NA 14 15 16
a2 b1 c1 d1 e1 112 113 114 115
a2 b1 c1 d1 e1 114 115 116 117
a2 b1 c1 d1 e1 113 NA 115 116

Output should be

Code:

a1 b1 c1 d1 e1 13 14 15 16
a2 b1 c1 d1 e1 113 114 115 116
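
As a rough sketch of the arr1, arr2, arr3 idea for this particular sample (no header line, values in fields 6 through 9), one sum array and one count array per column might look like the following. The function name avg_fixed_columns, the array names and the avg() helper are made up for illustration, and the final sort is only there because awk does not define the order of "for (k in keys)":

```shell
# Hypothetical helper; reads header-less rows like the sample above on stdin.
# Example call (file name is a placeholder): avg_fixed_columns < file
avg_fixed_columns() {
    awk '
    # Return the mean, or NA when every value in the group was missing.
    function avg(sum, n) { return n ? sum / n : "NA" }
    {
        k = $1 FS $2 FS $3 FS $4 FS $5
        keys[k]
        # One sum/count pair of arrays per averaged column.
        if ($6 != "NA") { sum6[k] += $6; n6[k]++ }
        if ($7 != "NA") { sum7[k] += $7; n7[k]++ }
        if ($8 != "NA") { sum8[k] += $8; n8[k]++ }
        if ($9 != "NA") { sum9[k] += $9; n9[k]++ }
    }
    END {
        for (k in keys)
            print k, avg(sum6[k], n6[k]), avg(sum7[k], n7[k]),
                     avg(sum8[k], n8[k]), avg(sum9[k], n9[k])
    }' | sort    # "for (k in keys)" has no defined order, so sort the rows
}
```

This quickly becomes unwieldy as the number of columns grows, which is why a loop over the fields is the better fit for 22 columns. Non-integral averages would be printed using awk's default output format rather than a fixed 2 decimal places.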

Last edited by senhia83; 01-26-2015 at 08:55 PM.. Reason: formatted sample data


senhia83


01-26-2015


I adjusted your original code to support N counts/accumulators (one pair per column). The isnum function is taken from Wikipedia.

Code:

 
awk '
function isnum(x){
        return(x==x+0);
}
{
    if(NR>1) {
        a = $1 $2 $3 $4 $5;
        keys[a] = 1;
        for(I = 7; I <= NF; I++) {
                b = I a;
                arr[b]   += $I;
                if(isnum($I)) {
                        count[b] += 1;
                }
        }
    }
}
END{
        for (key in keys) {
                printf "%-16s ", key;
                for(I = 7; I <= NF; I++) {
                        b = I key;
                        printf "%8.2f ", arr[b] / count[b];
                }
                printf "\n";
        }
}
'

The input file:

Code:

 
HEADER
a1 b1 c1 d1 e1 f1  1  2  4  5
a1 b1 c1 d1 e1 f1 NA  2  6  7
a1 b1 c1 d1 e1 f1  1  2  5  6
a2 b1 c1 d1 e1 f1 12 13 14 15
a2 b1 c1 d1 e1 f1 14 15 16 17
a2 b1 c1 d1 e1 f1 13 14 15 16

Results

Code:

 
a2b1c1d1e1          13.00    14.00    15.00    16.00
a1b1c1d1e1           1.00     2.00     5.00     6.00
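
The isnum() test in the script above relies on awk's comparison rules: a field that looks like a number is compared numerically with x+0, while a field such as NA is compared as a string against "0" and so fails the test. A small sketch to check this on your own awk (the classify function name and the sample values are made up for illustration):

```shell
# Hypothetical helper: report which whitespace-separated input fields
# pass the isnum() test used in the script above.
classify() {
    awk '
    function isnum(x) { return (x == x + 0) }
    {
        for (i = 1; i <= NF; i++)
            printf "%s %s\n", $i, (isnum($i) ? "numeric" : "non-numeric")
    }'
}

printf '12 NA 3.5 -7\n' | classify
```

On the awk versions I would expect this to report NA as non-numeric and the other three values as numeric, which is why NA fields add nothing to the sum and are left out of the count.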

Last edited by migurus; 01-26-2015 at 09:10 PM.. Reason: forgot to show input and results


migurus


01-26-2015


You could also try something like this:

Code:

awk '
NR == 1 {
	# Copy the input header line to the output as is. Note that the
	# data lines in the output will NOT include field 6 from the input.
	print
	next
}
{	# Gather keys:
	key[k = $1 FS $2 FS $3 FS $4 FS $5]
	# Accumulate counts and data from non-"NA" data fields.
	for(i = 7; i <= NF; i++) {
		if($i != "NA") {
			data[k, i] += $i
			cnt[k, i]++
		}
	}
}
END {	for(k in key) {
		printf("%s ", k)
		for(i = 7; i <= NF; i++)
			if(cnt[k, i])
				printf("%.2f%s", data[k, i] / cnt[k, i],
					(i == NF) ? "\n" : " ")
			else	printf("NA%s", (i == NF) ? "\n" : " ")
	}
}' file

With the following sample input:

Code:

This is a header line
a1 b1 c1 d1 e1 12 13 14 15
a1 b1 c1 d1 e1 14 15 16 17
a1 b1 c1 d1 e1 13 NA 15 16
a2 b1 c1 d1 e1 112 113 114 115
a2 b1 c1 d1 e1 114 115 116 117
a2 b1 c1 d1 e1 113 114 115 NA
a3 b2 c1 d2 e3 110 111 112 NA
a3 b2 c1 d2 e3 110 111 113 NA
a3 b2 c1 d2 e3 110 112 113 NA

the above code produces the output:

Code:

This is a header line
a2 b1 c1 d1 e1 114.00 115.00 116.00
a1 b1 c1 d1 e1 14.00 15.00 16.00
a3 b2 c1 d2 e3 111.33 112.67 NA

Note that with this input, the code migurus suggested will produce output similar to the following:

Code:

a2b1c1d1e1         114.00   115.00   116.00 
a1b1c1d1e1          14.00    15.00    16.00 
a3b2c1d2e3         111.33   112.67 awk: division by zero
 input record number 10, file file
 source line number 23

If someone wants to try the above awk script on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
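
Note also that "for(k in key)" visits the keys in an unspecified order, which is why a2 is printed before a1 in the output above. If the groups should come out in the order in which they first appear in the input, one possible variation is sketched below. It is not part of the script above: the function name avg_in_order and the seen, order, n and nf variables are introduced here purely for illustration.

```shell
# Hypothetical variation: same averaging logic, but groups are printed
# in the order they first appear in the input.
# Example call (file name is a placeholder): avg_in_order file
avg_in_order() {
    awk '
    NR == 1 { print; next }            # copy the header line unchanged
    {
        k = $1 FS $2 FS $3 FS $4 FS $5
        if (!(k in seen)) {            # remember each key the first time it appears
            seen[k]
            order[++n] = k
        }
        for (i = 7; i <= NF; i++)
            if ($i != "NA") {
                data[k, i] += $i
                cnt[k, i]++
            }
        nf = NF                        # remember the field count for END
    }
    END {
        for (j = 1; j <= n; j++) {
            k = order[j]
            printf("%s", k)
            for (i = 7; i <= nf; i++)
                if (cnt[k, i])
                    printf(" %.2f", data[k, i] / cnt[k, i])
                else
                    printf(" NA")
            printf("\n")
        }
    }' "$@"
}
```

With the sample input shown earlier in this post, this should print the header followed by the a1, a2 and a3 lines in that order, with the same averages as before.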


Don Cragun

