Average values of duplicate rows

06-02-2014

I have a file, input.txt, and I want to average each column across the rows that share the same gene name.

Code:

Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

Desired output:

Code:

gene_A 4.3 7 8.3
gene_B 2.5 3.5 4.5
gene_C 2 3 4

Thanks in advance

Sanchari

06-02-2014

This version reads the same file twice; the same result can also be computed in an END block.

Code:

$ cat file
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

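Code:

awk '
# Header line: print it as-is.
NR == 1 { print; next }

# First pass: accumulate per-gene column sums and row counts.
FNR == NR {
    for (i = 2; i <= NF; i++) {
        A[$1, i] += $i
        C[$1, i]++
    }
    next
}

# Second pass: on the first occurrence of each gene, print its name and averages.
FNR > 1 && !x[$1]++ {
    printf "%s", $1
    for (i = 2; i <= NF; i++)
        printf "%s%s", OFS, A[$1, i] / C[$1, i]
    printf "%s", ORS
}
' OFS="\t" file file

Result:

Code:

Gene Sample_1 Sample_2 Sample_3
gene_A	4.33333	7	8.33333
gene_B	2.5	3.5	4.5
gene_C	2	3	4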
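The two-pass approach depends on NR counting records across all input files while FNR restarts at 1 for each file, so FNR==NR is true only while the first copy of the file is being read. A small sketch that makes this visible (the file name demo.txt and the function name show_passes are made up for illustration):

```shell
# Two-line demo file (name is illustrative).
printf 'a\nb\n' > demo.txt

# Print NR, FNR and which pass each record belongs to.
show_passes() {
    awk '{ print NR, FNR, (FNR == NR ? "first pass" : "second pass") }' "$1" "$1"
}

show_passes demo.txt
# 1 1 first pass
# 2 2 first pass
# 3 1 second pass
# 4 2 second pass
```

Note that the test breaks down if the first file is empty, because FNR==NR then also holds while the second file is read.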

---------- Post updated at 09:22 PM ---------- Previous update was at 09:13 PM ----------

This version reads the file once and does the processing in the END block:

Code:

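awk '
# Header line: print it as-is.
NR == 1 { print; next }

# Accumulate per-gene column sums and counts; remember the column count.
{
    nf = NF
    for (i = 2; i <= NF; i++) {
        A[$1, i] += $i
        C[$1, i]++
    }
}

END {
    # Keys have the form gene SUBSEP column; x[] makes each gene print once.
    for (i in A) {
        split(i, X, SUBSEP)
        if (!(X[1] in x)) {
            printf "%s", X[1]
            for (j = 2; j <= nf; j++)
                printf "%s%s", OFS, A[X[1], j] / C[X[1], j]
            printf "%s", ORS
            x[X[1]]
        }
    }
}
' OFS="\t" file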

---------- Post updated at 09:23 PM ---------- Previous update was at 09:22 PM ----------

Use this END-block version only if the output order does not matter: for (i in A) visits the keys in an unspecified order.
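If the order does matter and the file should still be read only once, one option is to record each gene the first time it appears and loop over that list in END. A minimal sketch, assuming the same whitespace-separated layout as the sample data; the function name avg_by_gene and the arrays order, sum and cnt are illustrative and do not come from the posts above:

```shell
# Sample data from the first post.
cat > input.txt <<'EOF'
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4
EOF

# avg_by_gene FILE: single pass, output in first-seen gene order.
avg_by_gene() {
    awk '
    NR == 1 { print; next }                   # pass the header through
    !($1 in cnt) { order[++n] = $1 }          # record first appearance
    {
        cnt[$1]++
        nf = NF                               # remember the column count
        for (i = 2; i <= NF; i++) sum[$1, i] += $i
    }
    END {
        for (k = 1; k <= n; k++) {
            g = order[k]
            printf "%s", g
            for (i = 2; i <= nf; i++) printf "%s%s", OFS, sum[g, i] / cnt[g]
            printf "%s", ORS
        }
    }
    ' OFS="\t" "$1"
}

avg_by_gene input.txt
```

The extra order array costs one entry per distinct gene, which is usually negligible next to the sums that are stored anyway.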

Last edited by Akshay Hegde; 06-02-2014 at 02:34 PM.. Reason: typo fix

Akshay Hegde

06-02-2014

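Try also:

Code:

awk '
# Skip the header; l[] holds the gene names, c[] the row count per gene.
NR>1{l[$1]=$1; c[$1]++;
   for (i=2; i<=NF; i++) a[$1,i]+=$i;
}
END {
   # In END, NF still holds the field count of the last record read.
   for (g in l) {
      printf "%s ", g;
      for (i=2; i<=NF; i++) printf ("%.1f ", (a[g,i]/c[g]));
      print "";
   }
}
' infile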
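Note that "%.1f" prints 7.0 where the requested output shows 7. One way to match the requested format is to round to one decimal place and then let awk's default number-to-string conversion drop the trailing zero. A sketch under two assumptions: the values are non-negative (the int(x * 10 + 0.5) trick rounds the wrong way for negative numbers), and the names round1 and avg_round1 are made up for this example:

```shell
# Sample data from the first post.
cat > input.txt <<'EOF'
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4
EOF

# avg_round1 FILE: per-gene averages rounded to one decimal place.
avg_round1() {
    awk '
    # Round a non-negative number to one decimal place.
    function round1(x) { return int(x * 10 + 0.5) / 10 }

    NR > 1 {
        if (!($1 in c)) order[++n] = $1       # keep first-seen order
        c[$1]++
        nf = NF
        for (i = 2; i <= NF; i++) a[$1, i] += $i
    }
    END {
        for (k = 1; k <= n; k++) {
            g = order[k]
            printf "%s", g
            for (i = 2; i <= nf; i++) printf " %s", round1(a[g, i] / c[g])
            print ""
        }
    }
    ' "$1"
}

avg_round1 input.txt
```

With the sample data this prints gene_A 4.3 7 8.3, gene_B 2.5 3.5 4.5 and gene_C 2 3 4, one gene per line and without the header.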

rdrtx1

06-03-2014

In l[$1]=$1 the stored value is never used; referencing l[$1] on its own is enough to create the key (with an empty value).
Alternatively, store the field count, which allows each gene to have its own number of columns:

Code:

awk '
NR>1 {
   L[$1]=NF; c[$1]++
   for (i=2; i<=NF; i++) a[$1,i]+=$i
}
END {
   for (g in L) {
      printf "%s", g
      for (i=2; i<=L[g]; i++) printf " %s", (a[g,i]/c[g])
      print ""
   }
}
' infile

NB: with the %s format, awk converts a number to a string itself (using CONVFMT, "%.6g" by default), so printf "%s\n", number gives the same result as print number under the default settings.
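The conversion can be checked directly. The sketch below relies on the POSIX rules that print formats numbers with OFMT, that number-to-string conversion (as for %s) uses CONVFMT, and that integral values are converted as plain integers; the function name convfmt_demo is made up for this example:

```shell
# Show how print and printf "%s" turn numbers into text.
convfmt_demo() {
    awk 'BEGIN {
        print 13 / 3               # print uses OFMT (default "%.6g")
        printf "%s\n", 13 / 3      # %s uses CONVFMT (default "%.6g")
        CONVFMT = "%.2f"
        printf "%s\n", 13 / 3      # CONVFMT now asks for two decimals
        printf "%s\n", 21 / 3      # integral values are converted as integers
    }'
}

convfmt_demo
```

Changing CONVFMT in this way affects only the %s conversion; print keeps following OFMT.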

MadeInGermany

07-09-2014

Hi, I was using your second program and wanted to know how to run it from a script. Would just saving it in a .sh file work?

Sanchari

07-09-2014

Yes, it would. Make sure the input file name is passed to the script correctly and handled in the code.
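As a rough sketch of that advice, the awk program can be wrapped in a small shell script that takes the input file as its first argument and refuses to run without one. The script name avg_genes.sh, the usage message and the array names are assumptions made for this example, and the awk program is a condensed, order-preserving variant rather than an exact copy of any program posted above:

```shell
# Write the script (avg_genes.sh is a hypothetical name).
cat > avg_genes.sh <<'SCRIPT'
#!/bin/sh
# Usage: sh avg_genes.sh FILE
# Averages each numeric column over rows that share the gene name in column 1.
if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
    echo "usage: $0 FILE" >&2
    exit 1
fi

awk '
NR == 1 { print; next }                      # pass the header through
{
    if (!($1 in cnt)) order[++n] = $1        # remember first-seen order
    cnt[$1]++
    nf = NF
    for (i = 2; i <= NF; i++) sum[$1, i] += $i
}
END {
    for (k = 1; k <= n; k++) {
        g = order[k]
        printf "%s", g
        for (i = 2; i <= nf; i++) printf "%s%s", OFS, sum[g, i] / cnt[g]
        printf "%s", ORS
    }
}
' OFS="\t" "$1"
SCRIPT

# Sample data from the first post.
cat > input.txt <<'DATA'
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4
DATA

sh avg_genes.sh input.txt
```

After chmod +x avg_genes.sh the script can also be started as ./avg_genes.sh input.txt, and the result can be redirected to a file in the usual way.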

SriniShoo

08-23-2014

Hello,

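The following may also help; it reads the file twice and is written for exactly three sample columns.

Code:

awk '
# First pass (header skipped): per-gene sums of columns 2-4 and row count.
NR == FNR && NR > 1 { a[$1] += $2; c[$1] += $3; e[$1] += $4; b[$1]++; next }
# Second pass: print the averages once per gene, in first-seen order.
($1 in a) && !s[$1]++ { print $1, a[$1] / b[$1], c[$1] / b[$1], e[$1] / b[$1] }
' OFS="\t" filename filename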

Output will be as follows.

Code:

gene_A  4.33333 7       8.33333
gene_B  2.5     3.5     4.5
gene_C  2       3       4

Thanks,
R. Singh

RavinderSingh13
