Average values of duplicate rows


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Average values of duplicate rows
# 1  
Old 06-02-2014
Average values of duplicate rows

I have this file input.txt. I want to take average column-wise for the rows having duplicate gene names.

Code:
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

Desired output:
Code:
gene_A 4.3 7 8.3
gene_B 2.5 3.5 4.5
gene_C 2 3 4

Thanks in advance
# 2  
Old 06-02-2014
This is by reading same file twice, you can also process this in END block

Code:
$ cat file
Gene Sample_1 Sample_2 Sample_3
gene_A 2 4 5
gene_B 1 2 3
gene_A 0 5 7
gene_B 4 5 6
gene_A 11 12 13
gene_C 2 3 4

Code:
awk '   NR==1{
		print
		next
             }
 	FNR==NR \
	     {
		  for(i=2;i<=NF;i++)
		  {
		 	A[$1,i]+=$i
		 	C[$1,i]++ 
		  } next
             }
       !x[$1]++ && FNR>1 \
             {
		for(i=2;i<=NF;i++)
		printf "%s%s",(i==2?"" : OFS),A[$1,i]/C[$1,i];
		printf RS
	     }
    ' OFS="\t" file file

Resulting
Code:
Gene Sample_1 Sample_2 Sample_3
4.33333	7	8.33333
2.5	3.5	4.5
2	3	4

---------- Post updated at 09:22 PM ---------- Previous update was at 09:13 PM ----------

This is processing in END block reading file once

Code:
 awk '  NR==1{
		print
		next
             }
 	FNR==NR \
	     {
		  for(i=2;i<=NF;i++)
		  {
		 	A[$1,i]+=$i
		 	C[$1,i]++ 
		  } next
             }
         END {
		for( i in A)
		{
			split(i,X,SUBSEP)
			if(!(X[1] in x))
			{
				printf X[1] OFS
				for(j=2;j<=NF;j++)
				{
					printf "%s%s",j==2?"":OFS,A[X[1],j]/C[X[1],j]
				}
				printf RS
				x[X[1]]
			}
		}
	     }
       ' OFS="\t" file

---------- Post updated at 09:23 PM ---------- Previous update was at 09:22 PM ----------

If you don't care order use this..

Last edited by Akshay Hegde; 06-02-2014 at 02:34 PM.. Reason: typo fix
This User Gave Thanks to Akshay Hegde For This Post:
# 3  
Old 06-02-2014
try also:
Code:
awk '
NR>1{l[$1]=$1; c[$1]++;
   for (i=2; i<=NF; i++) a[$1,i]+=$i;
}
END {
   for (g in l) {
      printf g " ";
      for (i=2; i<=NF; i++) printf ("%.1f ", (a[g,i]/c[g]));
      print "";
   }
}
' infile

This User Gave Thanks to rdrtx1 For This Post:
# 4  
Old 06-03-2014
l[$1]=$1 is a useless value; l[$1] alone defines the key (no value).
Or store the length, i.e. allow an individual length for each gene type:
Code:
awk '
NR>1 {
   L[$1]=NF; c[$1]++
   for (i=2; i<=NF; i++) a[$1,i]+=$i
}
END {
   for (g in L) {
      printf "%s", g
      for (i=2; i<=L[g]; i++) printf " %s", (a[g,i]/c[g])
      print ""
   }
}
' infile

NB: the %s format allows any cast from a number to a string; awk indeed seems to handle printf "%s\n", number like print number.
This User Gave Thanks to MadeInGermany For This Post:
# 5  
Old 07-09-2014
Hi, I was using your second program, wanted to know how to run this program using a script? Just saving in a .sh file would work ?
# 6  
Old 07-09-2014
yes, it would. Try to pass the file name correctly and handle it in code
This User Gave Thanks to SriniShoo For This Post:
# 7  
Old 08-23-2014
Hello,

Following may also help in same.

Code:
awk 'NR==FNR && NR>1{a[$1]+=$2;b[$1]++;c[$1]+=$3;d[$1]+=$3;e[$1]+=$4;next} ($1 in a){ {if(s[$1] == ""){{f=a[$1]/b[$1]; g=c[$1]/b[$1]; h=d[$1]/b[$1]; i=e[$1]/b[$1];s[$1]=1}; {print $1 OFS f OFS g OFS i}}}}'  OFS="\t" filename filename

Output will be as follows.

Code:
gene_A  4.33333 7       8.33333
gene_B  2.5     3.5     4.5
gene_C  2       3       4


Thanks,
R. Singh
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Average select rows

I have no idea how to even get started with this script. I need to average field 3 for each of the unique identifiers found in field 1. However, I only want to average these rows when field 2 is equal to 1506 - 2000 (note that i replaced the values field 2 for security reasons, but the real... (6 Replies)
Discussion started by: ncwxpanther
6 Replies

2. Shell Programming and Scripting

Extract and exclude rows based on duplicate values

Hello I have a file like this: > cat examplefile ghi|NN603762|eee mno|NN607265|ttt pqr|NN613879|yyy stu|NN615002|uuu jkl|NN607265|rrr vwx|NN615002|iii yzA|NN618555|ooo def|NN190486|www BCD|NN628717|ppp abc|NN190486|qqq EFG|NN628717|aaa HIJ|NN628717|sss > I can sort the file by... (5 Replies)
Discussion started by: CHoggarth
5 Replies

3. Shell Programming and Scripting

Find duplicate values in specific column and delete all the duplicate values

Dear folks I have a map file of around 54K lines and some of the values in the second column have the same value and I want to find them and delete all of the same values. I looked over duplicate commands but my case is not to keep one of the duplicate values. I want to remove all of the same... (4 Replies)
Discussion started by: sajmar
4 Replies

4. Shell Programming and Scripting

Average across rows with a condition

Hi Friends, My input file Gene1 10 20 0 Gene2 5 0 15 Gene3 10 10 10 Gene4 5 0 0 If there is a zero for any gene in any column, I don't want that column to be considered which reduces the denominator value during average. Here is my output Gene1 10 20 0 10 Gene2 5 0 15 10 Gene3... (5 Replies)
Discussion started by: jacobs.smith
5 Replies

5. Shell Programming and Scripting

Get the average from column, and eliminate the duplicate values.

Dear Experts, Kindly help me please, I have a big file where there is duplicate values in col 11 till col 23, every 2 rows appers a new numbers, but in each row there is different coordinates x and y in col 57 till col 74. Please i will like to get a single value and average of the x and y... (8 Replies)
Discussion started by: jiam912
8 Replies

6. UNIX for Dummies Questions & Answers

Writing a script to take the average of two columns every 3 rows

I have a dataset with 120 columns. I would like to write a script, that takes the average of every two columns, starting from columns 2 and 3, and moving consecutively in frames of 3 columns, all the way until the last column. The first column in the output file would be the averages of columns... (1 Reply)
Discussion started by: evelibertine
1 Replies

7. Shell Programming and Scripting

average of rows with same value in the first column

Dear All, I have this file tab delimited A 1 12 22 B 3 34 33 C 55 9 32 A 12 81 71 D 11 1 66 E 455 4 2 B 89 4 3 I would like to make the average every column where the first column is the same, for example, A 6,5 46,5 46,5 B 46,0 19,0 18,0 C 55,0 9,0 32,0 D 11,0 1,0 66,0... (8 Replies)
Discussion started by: paolo.kunder
8 Replies

8. Shell Programming and Scripting

Duplicate rows in CSV files based on values

I am new to this forum and this is my first post. I am looking at an old post with exactly the same name. Can not paste URL because I do not have 5 posts My requirement is exactly opposite. I want to get rid of duplicate rows and try to append the values of columns in those rows ... (10 Replies)
Discussion started by: vbhonde11
10 Replies

9. Shell Programming and Scripting

Duplicate rows in CSV files based on values

I want to duplicate a row if found two or more values in a particular column for corresponding row which is delimitted by comma. Input abc,line one,value1 abc,line two, value1, value2 abc,line three,value1 needs to converted to abc,line one,value1 abc,line two, value1 abc,line... (8 Replies)
Discussion started by: Incrediblian
8 Replies

10. UNIX for Dummies Questions & Answers

Calculating the Number of Rows and Average

Hi All I like to know how can we calculate the number of rows and the average of the values present in the file. I will not know what will be the rowcount, which will be dynamic in nature of the file. eg. 29 33 48 30 28 (6 Replies)
Discussion started by: pk_eee
6 Replies
Login or Register to Ask a Question