Average each numeric column

02-12-2019

Registered User

46, 3

Join Date: Jan 2018

Last Activity: 11 February 2019, 11:45 PM EST

Posts: 46

Thanks Given: 44

Thanked 3 Times in 3 Posts

Average each numeric column

Hi all,

Does anyone know of an efficient unix script to average each numeric column of a multi-column tab delimited file (with header) with some character columns.

Here is an example input file:

Code:

CHR    RS_ID    ALLELE    POP1    POP2    POP3    POP4    POP5    POP6    POP7    POP8    POP9    SUBJECT_A1    SUBJECT_A2    POP10    POP11    POP12    POP13    POP14    POP15    POP16    POP17    POP18
11    rs1201342    G    0.25    0.25    0.07    0.64    0.13    0.16    0.17    0.04    0.01    G    C    0.25    0.25    0.07    0.64    0.13    0.16    0.17    0.04    0.01
6    rs6941158    T    0.16    0.16    0.16    0.58    0.29    0.27    0.27    0.05    0.03    T    C    0.16    0.16    0.16    0.58    0.29    0.27    0.27    0.05    0.03
5    rs17612017    T    0.15    0.15    0.11    0.52    0.24    0.19    0.25    0.00    0.01    G    G    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
16    rs1133238    A    0.10    0.07    0.02    0.72    0.03    0.04    0.08    0.00    0.01    G    G    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

Desired output file:

Code:

AVG_POP1    AVG_POP2    AVG_POP3    AVG_POP4    AVG_POP5    AVG_POP6    AVG_POP7    AVG_POP8    AVG_POP9    AVG_POP10    AVG_POP11    AVG_POP12    AVG_POP13    AVG_POP14    AVG_POP15    AVG_POP16    AVG_POP17    AVG_POP18
0.17    0.16    0.09    0.61    0.17    0.16    0.19    0.02    0.02    0.10    0.10    0.06    0.30    0.11    0.11    0.11    0.02    0.01

Thanks

Last edited by RudiC; 02-12-2019 at 09:35 AM.. Reason: Corrected formatting.

Geneanalyst

View Public Profile for Geneanalyst

Find all posts by Geneanalyst

02-12-2019

Administrator

19,118, 3,359

Join Date: Sep 2000

Last Activity: 15 July 2022, 8:51 AM EDT

Location: Asia Pacific, Cyberspace, in the Dark Dystopia

Posts: 19,118

Thanks Given: 2,351

Thanked 3,359 Times in 1,878 Posts

Moderator's Comments:

First of all, you should be writing scripts and asking us for help on your attempts to create your own scripts, not posting asking us for scripts. Please do your own work and homework. We are here to help you, not do your work for you.

Second, please be careful when you are cutting and pasting into this site. I just noticed you cut-and-paste a bunch of CSS (style sheet) "garbage" from another site into your post above and had to delete it and reformat your post.

Cutting-and-pasting "junk" into our site and at the same time asking us to do your work for you is not an acceptable community standard for forums, here and elsewhere.

Neo

Neo

View Public Profile for Neo

Visit Neo's homepage!

Find all posts by Neo

02-12-2019

Registered User

46, 3

Join Date: Jan 2018

Last Activity: 11 February 2019, 11:45 PM EST

Posts: 46

Thanks Given: 44

Thanked 3 Times in 3 Posts

Never mind, the following AWK code works:

Code:

 awk -F '\t' '{ for(i=1;i<=NF;i++) total[i]+=$i ; } END { for(i=1;i<=NF;i++) printf "%f ",total[i]/NR ;}'

Header can easily be added with:

Code:

printf '%s\n' '0r header' x | ex file

Moderator's Comments:

Added code tags. Issued reminder to user.

Geneanalyst

View Public Profile for Geneanalyst

Find all posts by Geneanalyst

02-12-2019

Administrator

19,118, 3,359

Join Date: Sep 2000

Last Activity: 15 July 2022, 8:51 AM EDT

Location: Asia Pacific, Cyberspace, in the Dark Dystopia

Posts: 19,118

Thanks Given: 2,351

Thanked 3,359 Times in 1,878 Posts

Quote:

Originally Posted by Geneanalyst

Never mind, ...

We mind. Everyone here should follow the forum rules and community standards.

Here is a link to the forum community standards:

Code:

https://www.unix.com/misc.php?do=cfrules

Thanks!

This User Gave Thanks to Neo For This Post:

Neo

View Public Profile for Neo

Visit Neo's homepage!

Find all posts by Neo

02-12-2019

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Quote:

Originally Posted by Geneanalyst

Never mind, the following AWK code works:

Code:

 awk -F '\t' '{ for(i=1;i<=NF;i++) total[i]+=$i ; } END { for(i=1;i<=NF;i++) printf "%f ",total[i]/NR ;}'

Header can easily be added with:

Code:

printf '%s\n' '0r header' x | ex file

Moderator's Comments:

Added code tags. Issued reminder to user.

And, you should mind as well, as the results you get from your line are plainly wrong:

Code:

7.600000    0.000000    0.000000    0.132000    0.126000    0.072000    0.492000    0.138000    0.132000    0.154000    0.018000    0.012000    0.000000    0.000000    0.082000    0.082000    0.046000   0.244000    0.084000    0.086000    0.088000    0.018000    0.008000

Try

Code:

awk 'NR==1 {print; next} {for (i=1; i<=NF; i++) sum[i]+=$i} END {for (i=1; i<=NF; i++) printf "%.2f\t",sum[i]/(NR-1); print ""}' file
CHR    RS_ID    ALLELE    POP1    POP2    POP3    POP4    POP5    POP6    POP7    POP8    POP9    SUBJECT_A1    SUBJECT_A2    POP10    POP11    POP12    POP13    POP14    POP15    POP16    POP17    POP18
9.50    0.00    0.00    0.17    0.16    0.09    0.61    0.17    0.17    0.19    0.02    0.02    0.00    0.00    0.10    0.10    0.06    0.30    0.10    0.11    0.11    0.02    0.01

instead.

This User Gave Thanks to RudiC For This Post:

RudiC

View Public Profile for RudiC

Find all posts by RudiC

02-12-2019

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

If we look very closely at the desired output shown in post #1 in this thread and do a LOT of reading between the lines, the awk program with the printf piped through ex helper script shown in post #3 is even further off the mark than RudiC noticed.

Like RudiC's code, the code shown in post #3 sums and prints every input field in the output. But the sample input contains 23 fields and the sample (desired) output only contains 18 fields??? So reading between the lines and assuming that the header for the desired 1st output field was supposed to be AVG_POP1 instead of VG_POP1, we might guess that what is really wanted is to only calculate averages of the input fields that have headers that start with POP. If we make that assumption and also assume that the heading for the output field should have the input field header with the string AVG_ prepended to it AND assume that the average should only count the data lines being averaged (as was done in RudiC's code but was not done in the code in post #3), we could come closer to getting what seems to be the desired output.

But, then we also note that the code in post #3 uses a <tab> character as the input field separator and there are no <tab>s in the sample input provided (so the code in post #3 only produces one output field).

So, if we modify the sample input data to be <tab> separated and assume that the desired output should also be <tab> separated instead of separating output fields with a single <space> character and not even including a <newline> terminator to the single partial line of output produced by the script in post #3 AND not just as sequences of <space>s as shown in the given in the desired output shown in post #1, we might try something more like:

Code:

awk '
BEGIN {	#Set input and output field separators.
	FS = OFS = "\t"
}
NR == 1 {
	# Process input header line...
	# Determine which input fields are to become output fields.
	for(i = 1; i <= NF; i++)
		if($i ~ /^POP/) {
			# Add an entry to POP[] for each field number to be
			# processed, and set "last" to the field number of the
			# last field number to be processed.
			POP[last = i]
		}

	# Print output file header.
	for(i = 1; i <= NF; i++)
		if(i in POP)
			printf("AVG_%s%s", $i, (i == last) ? ORS : OFS)
}
NR > 1 {# Loop through each input field...
	for(i = 1; i <= NF; i++)
		# Is this an input field to be processed...
		if(i in POP)
			# Yes.  Keep a running total of values in this column.
			total[i] += $i
}
END {	# We have hit the EOF on the input...
	# Calculate and print the results for the selected fields...
	for(i = 1; i <= NF; i++)
		if(i in POP)
			printf("%9.2f%s", total[i] / (NR - 1),
			    (i == last) ? ORS : OFS)
}' file

Using %9.2f as the output format for the calculated values makes the field values line up with the output field headers. It isn't obvious whether or not this was required by the sample output provided in post #1. The above code produces the output:

Code:

Output produced by above code:
AVG_POP1	AVG_POP2	AVG_POP3	AVG_POP4	AVG_POP5	AVG_POP6	AVG_POP7	AVG_POP8	AVG_POP9	AVG_POP10	AVG_POP11	AVG_POP12	AVG_POP13	AVG_POP14	AVG_POP15	AVG_POP16	AVG_POP17	AVG_POP18
     0.17	     0.16	     0.09	     0.61	     0.17	     0.17	     0.19	     0.02	     0.02	     0.10	     0.10	     0.06	     0.30	     0.10	     0.11	     0.11	     0.02	     0.01

Output requested in post #1 (with columns aligned):
 VG_POP1	AVG_POP2	AVG_POP3	AVG_POP4	AVG_POP5	AVG_POP6	AVG_POP7	AVG_POP8	AVG_POP9	AVG_POP10	AVG_POP11	AVG_POP12	AVG_POP13	AVG_POP14	AVG_POP15	AVG_POP16	AVG_POP17	AVG_POP18
     0.17	     0.16	     0.09	     0.61	     0.17	     0.16	     0.19	     0.02	     0.02	     0.10	     0.10	     0.06	     0.30	     0.11	     0.11	     0.11	     0.02	     0.01

Output produced by code in post #3:
7.600000 0.000000 0.000000 0.132000 0.126000 0.072000 0.492000 0.138000 0.132000 0.154000 0.018000 0.012000 0.000000 0.000000 0.082000 0.082000 0.046000 0.244000 0.084000 0.086000 0.088000 0.018000 0.008000

The values shown in red differ from the desired sample output (with fields aligned for comparison purposes) in post #1 by ±.01 for the AVG_POP6 and AVG_POP14 fields shown in red above; all of the other values exactly match the desired output shown on the last line of the above comparison.

If you strip out the values shown in orange in the last line of the output (corresponding to fields that have been deleted from the output in the other two sets of output shown), the values given by the code in post #3 are about 80% of the desired values (as I would expect since that code is calculating the average by dividing the sum of four numeric fields by five instead of dividing the sum of four numeric fields by four).

I hope this helps. It would certainly be a lot easier to come up with code like the above if the description of the problem matched the desired output a lot closer.

Last edited by Don Cragun; 02-13-2019 at 01:56 AM.. Reason: Fix font problem with <plus-or-minus> character.

These 2 Users Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

02-13-2019

Registered User

489, 285

Join Date: Nov 2018

Last Activity: 30 October 2021, 10:47 AM EDT

Location: undefined

Posts: 489

Thanks Given: 382

Thanked 285 Times in 215 Posts

Code:

awk '           { $1=$2=$3=$13=$14=""; $0=$0 }
(NR == 1)       { print; next }
                { for(i=1; i<=NF; i++) T[i]+=$i }
END             { for(i=1; i<=NF; i++) printf OFS"%0.2f", T[i] / ( NR - 1 ) }
' file | column -t

nezabudka

View Public Profile for nezabudka

Find all posts by nezabudka

Shell Programming and Scripting

Average each numeric column

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Replace a numeric values in a certain column

Discussion started by: arunkumar_mca

2. UNIX for Advanced & Expert Users

Sort by second column numeric values

Discussion started by: cokedude

3. Shell Programming and Scripting

Check first column - average second column based on a condition

Discussion started by: jacobs.smith

4. Shell Programming and Scripting

Get column average using ID

Discussion started by: jwbucha

5. Shell Programming and Scripting

Remove certain column with numeric value

Discussion started by: radius

6. Shell Programming and Scripting

Calculate the average of a column based on the value of another column

Discussion started by: jackken007

7. Shell Programming and Scripting

average of rows with same value in the first column

Discussion started by: paolo.kunder

8. Shell Programming and Scripting

How to check if a column is having a numeric value or not in a file?

Discussion started by: keerthan

9. UNIX for Dummies Questions & Answers

average of a column in a table

Discussion started by: Gussifinknottle

10. UNIX for Dummies Questions & Answers

calculate average of column 2

Discussion started by: onthetopo