Average across multiple columns - awk


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Average across multiple columns - awk
# 1  
Old 11-20-2016
Average across multiple columns - awk

Hi forum members,

I'm trying to get an average of multiple columns in a csv file using awk. A small example of my input data is as follows:
Code:
cu,u3o8,au,ag
-9,20,-9,3.6
0.005,30,-9,-9
0.005,50,10,3.44
0.021,-9,8,3.35

The following code seems to do most of what I want
Code:
gawk -F"," 'NR==1;NR>1 {for (i=1;i<=4;i++){if($i>0)a[i]+=$i}}END{ for (i=1;i<=4;i++){printf a[i]/(NR-1)","};printf "\n"}' OFS=, input.csv > average.csv

However, the issue is i'm using the $i>o condition to filter out the -9 values and then dividing by NR to calculate the average, which gives the wrong answer. For the Cu column, ignoring -9 values, I should get an average of 0.0103 but instead I get 0.0077.
I tried using a counter as follows:
Code:
gawk -F"," 'NR==1;NR>1 {for (i=1;i<=4;i++){if($i>0)a[i]+=$i}; ++count[$i]}END{ for (i=1;i<=4;i++){printf a[i]/count[$i]","};printf "\n"}' OFS=, input.csv > average.csv

I can't seem to get it to work. Any help would be greatly appreciated.
# 2  
Old 11-20-2016
A couple of issues corrected:
Code:
gawk -F"," 'NR>1 {for (i=1;i<=4;i++){if($i>0){a[i]+=$i;++count[i]}}}END{ for (i=1;i<=4;i++){printf a[i]/count[i]","};printf "\n"}' input.csv > average.csv

# 3  
Old 11-20-2016
Thanks jlliagre, that worked perfectly. I have one final question. Say I have the same data, but added two groups, rd and ru. How would I get the average of the groups?
Code:
cu,u3o8,au,ag,group
-9,20,-9,3.6,ru
0.005,30,-9,-9,ru
0.005,50,10,3.44,rd
0.021,-9,8,3.35,rd

Desired output:
Code:
0.005,25,2,3.6,ru
0.013,50,9,3.39,rd

Thanks in advance, Smilie
# 4  
Old 11-20-2016
Since jillagre was kind enough to fix your earlier code to do what you said you wanted to do, why don't you show us what you have learned and try to add your new requirements to the code you now have. If you get stuck, we'll try to help; but we are here to help you learn how to write your own code. We are not here to act as your unpaid programming staff when your requirements change.

Will there always be two groups (rd and ru)? Or do you want code that will work for any number of groups with names consisting of any strings?

Will all lines for a given group be adjacent (as in your example), or will data for various groups be intermixed?
# 5  
Old 11-20-2016
Fair enough Don

My various attempts were something along the line of adding a
Code:
for (group in a)

as part of the code. For example:
Code:
gawk -F"," 'NR>1 {for (i=1;i<=4;i++){if($i>0){a[i]+=$i;++count[i];++group[$5]}}} END{for (g in group) {for (i=1;i<=4;i++){printf a[i]/count[i]","};printf "\n"}}' input.csv > average.csv

I just used "rd" and "ru" as an example. I would like to match any string in column 5.
Smilie
# 6  
Old 11-21-2016
You didn't answer the question: "Will all lines for a given group be adjacent (as in your example), or will data for various groups be intermixed?" so I tried to write code that would work either way. Note that the order of the groups in the output from this is random. It uses two dimensional arrays to keep track of the sums and counts for each column and group. (If all lines for a group are adjacent, you could still use one-dimensional arrays and dump the data for each group when a new group is found.)

Given that both entries in the 3rd column for group ru are -9, I don't understand why you believe the output for that column for that group should be 2; the following code produces 0 in this case:
Code:
gawk '
BEGIN {	FS = OFS = ","
}
NR > 1 {for(i = 1; i <= 4; i++)
		if($i > 0) {
			s[i, $5] += $i
			++count[i, $5]
			group[$5]
		}
}
END {	for(g in group) {
		for(i = 1; i <= 4; i++)
			printf("%g%s", count[i, g] ? s[i, g] / count[i, g] : 0,
			    OFS)
		print g
	}
}' input.csv > average.csv

Using awk on macOS 10.12.1, I get the output:
Code:
0.005,25,0,3.6,ru
0.013,50,9,3.395,rd

in average.csv with the sample input.csv data you provided in post #3.
# 7  
Old 11-21-2016
Thanks for your help Don and sorry for not answering all your questions. The code works perfectly and matches my requirements. Regarding the "group ru are -9" in your post, it was a typo error. I had been working on this all morning and trying different input data. Thanks again for all your help Smilie
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Comparing multiple columns using awk

Hello All; I have two files with below conditions: 1. Entries in file A is missing in file B (primary is field 1) 2. Entries in file B is missing in file A (primary is field 1) 3. Field 1 is present in both files but Field 2 is different. Example Content: File A ... (4 Replies)
Discussion started by: mystition
4 Replies

2. Shell Programming and Scripting

Awk: is it possible to print into multiple columns?

Hi guys, I have hundreds file like this, here I only show two of them: file 1 feco4_s_BB95.log ZE_1=-1717.5206260 feco4_t_BB95.log ZE_1=-1717.5169250 feco5_s_BB95.log ZE_1=-1830.9322060... (11 Replies)
Discussion started by: liuzhencc
11 Replies

3. Shell Programming and Scripting

Match first two columns and average third from multiple files

I have the following format of input from multiple files File 1 24.01 -81.01 1.0 24.02 -81.02 5.0 24.03 -81.03 0.0 File 2 24.01 -81.01 2.0 24.02 -81.02 -5.0 24.03 -81.03 10.0 I need to scan through the files and when the first 2 columns match I... (18 Replies)
Discussion started by: ncwxpanther
18 Replies

4. Shell Programming and Scripting

Average across multiple columns group by

Hi experts, I want to group by average, for multiple columns starting column $7 until NF, group by ($1-$5), please help For just 7th column, I can do awk ' NR>1{ arr += $7 count += 1 } END{ for (a in arr) { print a, arr/count ... (10 Replies)
Discussion started by: ritakadm
10 Replies

5. Shell Programming and Scripting

awk based script to find the average of all the columns in a data file

Hi All, I need the modification for the below mentioned code (found in one more post https://www.unix.com/shell-programming-scripting/27161-script-generate-average-values.html) to find the average values for all the columns(but for a specific rows) and print the averages side by side. I have... (4 Replies)
Discussion started by: ks_reddy
4 Replies

6. Shell Programming and Scripting

Extracting multiple columns with awk

Hi everyone!! I need to apply a simple command to extract columns from a matrix, but I need to extract contemporary from the first to the tenth columns, than from the eleventh to the twentyth and so on... how can i do that? (1 Reply)
Discussion started by: gabrysfe
1 Replies

7. Shell Programming and Scripting

Awk match multiple columns in multiple lines in single file

Hi, Input 7488 7389 chr1.fa chr1.fa 3546 9887 chr5.fa chr9.fa 7387 7898 chrX.fa chr3.fa 7488 7389 chr21.fa chr3.fa 7488 7389 chr1.fa chr1.fa 3546 9887 chr9.fa chr5.fa 7898 7387 chrX.fa chr3.fa Desired Output 7488 7389 chr1.fa chr1.fa 2 3546 9887 chr5.fa chr9.fa 2... (2 Replies)
Discussion started by: jacobs.smith
2 Replies

8. Shell Programming and Scripting

Awk if-else syntax with multiple columns

I can't seem to get this to work. I can reformat the date field if it's the first field (and only field) in the file: However, I get a syntax error when the date field is the second field (or has any other columns following): I can use a ";" but then it puts each column on separate... (8 Replies)
Discussion started by: giannicello
8 Replies

9. Shell Programming and Scripting

Generating multiple new columns with awk

Hi, I'm trying to reformat a file to create a new columns reflecting the previous 2 over and over. By that I mean currently each observation has two columns and I want to create a third which has a value equal to 1 minus the sum of the previous two. This is slightly complicated as 1) I... (6 Replies)
Discussion started by: reformatplink
6 Replies

10. Shell Programming and Scripting

AWK subtraction in multiple columns

AWK subtraction in multiple columns Hi there, Can not get the following: input: 34523 934 9485 3847 394 3847 3456 9384 awk 'NR==1 {for (i = 1; i <= NF; i++) {n=$i; next}; {n-=$i} END {print n}' input output: 21188 first column only,... (2 Replies)
Discussion started by: awkward
2 Replies
Login or Register to Ask a Question