awk to combine by field and average by another


 
# 1  
Old 02-18-2016
In the awk below I am trying to combine all lines with a matching $4 into a single line that shows $5 (up to the -), the count of the lines in $6, and the average of all values in $7. The awk is close, but it seems to use only the last line in the file and skip all others. The posted input is a sample of a file that is over 20 MB. The current output isn't in the desired format either. What am I doing wrong? Thank you.

input (/home/cmccabe/Desktop/NGS/API/2-12-2015/bedtools/30x/*30reads_perbase.txt)
Code:
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    1   15
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    2   16
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    3   16
chr1    955543  955763  chr1:955543-955763  AGRN-6|gc=75    4   14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1  28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2   27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3   27

desired output
Code:
chr1:955543-955763 4 AGRN 15
chr1:976035-976270 3 AGRN 27

bash
Code:
for f in /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/*30reads_perbase.txt ; do
     bname=`basename $f`
     pref=${bname%%.txt}
     awk '{k=$4 FS $5; a[k]+=$7; c[k]++}
     END{for(k in a)
     split(k,ks,FS);
     print ks[1],c[k],ks[2],a[k]/c[k]}}' $f > /home/cmccabe/Desktop/NGS/API/2-12-2015/30x/${pref}_genes.txt
done

current output
Code:
chr1:976035-976270 3 AGRN-9|gc=74.5 27.3333

---------- Post updated at 12:49 PM ---------- Previous update was at 11:54 AM ----------

This awk is closer, but it still uses only the last line in the file, not the entire file.

Code:
cat input | awk '{k=$4 FS $5; a[k]+=$7; c[k]++}END{for(k in a)split(k,ks,FS);printf ("%s %d %s %.0f\n",ks[1],c[k],substr(ks[2],0,match(ks[2],"-")-1),a[k]/c[k])}' 

chr1:976035-976270 3 AGRN 27 (output order: $4, count of $6, $5, average of $7)


Last edited by cmccabe; 02-18-2016 at 02:50 PM.. Reason: added details
# 2  
Old 02-18-2016
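It splits ALL of a's elements but prints just the last one's: without braces, only the split() belongs to the for loop, so the print runs once, after the loop has finished.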
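
A minimal sketch of that fix, assuming the sample rows from post #1: the END loop gets braces so that both the split and the printf run for every key. The `summarize` function name, the `sample` variable, and the use of sub() to trim the gene name at the first "-" are illustrative choices, not something taken from the thread.

```shell
# Sample rows copied from the question.
sample='chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 14
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 2 27
chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 3 27'

# Hypothetical helper: reads rows on stdin, prints one summary line per $4/$5 pair.
summarize() {
    awk '{ k = $4 FS $5; a[k] += $7; c[k]++ }
    END {
        for (k in a) {              # braces make split and printf both part of the loop
            split(k, ks, FS)
            gene = ks[2]
            sub(/-.*/, "", gene)    # keep the gene name up to the first "-"
            printf "%s %d %s %.0f\n", ks[1], c[k], gene, a[k] / c[k]
        }
    }' | sort                       # for (k in a) order is unspecified, so sort the result
}

printf '%s\n' "$sample" | summarize
```

Because `for (k in a)` visits keys in an unspecified order, the sketch pipes the result through sort; drop that if the order does not matter.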

Last edited by RudiC; 02-18-2016 at 04:18 PM.. Reason: typos
# 3  
Old 02-18-2016
You might want to try something more like:
Code:
#!/bin/ksh
DIR='/home/cmccabe/Desktop/NGS/API/2-12-2015/30x'

cd "$DIR"
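awk '
# Print one summary line per key to the current output file, then reset the arrays.
function file_print() {
	for(k in a) {
		# Split the key on the space or on "-<digits>|": ks[1] is $4, ks[2] is the gene name.
		split(k, ks, / |(-[0-9]*[|])/)
		printf("%s %d %s %d\n", ks[1], c[k], ks[2], a[k] / c[k]) > ofn
		delete a[k]
		delete c[k]
	}
	close(ofn)
}
# At the first line of every input file after the first, flush the previous file.
NR > 1 && FNR == 1 {
	file_print()
}
# At the first line of each file, build the output name: ".txt" becomes "_genes.txt".
FNR == 1 {
	ofn = substr(FILENAME, 1, length(FILENAME) - 4) "_genes.txt"
}
# Accumulate the sum of $7 and the line count for each $4/$5 pair.
{	a[k = $4 " " $5] += $7
	c[k]++
}
# Flush the results for the last file.
END {	file_print()
}' *30reads_perbase.txt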

This was written and tested with a Korn shell, but should work with any shell based on Bourne shell syntax.

Note that this script invokes awk only once instead of once per input file, so if you have several files to process it should run a little faster. (It does, however, assume that the list of filenames to be processed doesn't push your awk command line over the ARG_MAX limit.)

And, as always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
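
If the file list ever did exceed ARG_MAX, one possible workaround (a sketch, not something proposed in the thread) is to let find pass the names to awk in batches with `-exec ... {} +`. Each batch contains whole files, so the FNR == 1 logic should still work per invocation. The temporary directory, the two tiny input files, and the `genes.awk` file name are made up for the demonstration.

```shell
# Hypothetical demo directory standing in for the real 30x directory.
demo_dir=$(mktemp -d)
printf '%s\n' \
    'chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15' \
    'chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16' \
    > "$demo_dir/s1_30reads_perbase.txt"
printf '%s\n' \
    'chr1 976035 976270 chr1:976035-976270 AGRN-9|gc=74.5 1 28' \
    > "$demo_dir/s2_30reads_perbase.txt"

# The awk program from the post above, saved to a file (genes.awk is an assumed name).
cat > "$demo_dir/genes.awk" <<'EOF'
function file_print() {
	for (k in a) {
		split(k, ks, / |(-[0-9]*[|])/)
		printf("%s %d %s %d\n", ks[1], c[k], ks[2], a[k] / c[k]) > ofn
		delete a[k]
		delete c[k]
	}
	close(ofn)
}
NR > 1 && FNR == 1 { file_print() }
FNR == 1 { ofn = substr(FILENAME, 1, length(FILENAME) - 4) "_genes.txt" }
{ a[k = $4 " " $5] += $7; c[k]++ }
END { file_print() }
EOF

# find passes as many names per awk invocation as fit and starts another awk if needed.
find "$demo_dir" -name '*30reads_perbase.txt' -exec awk -f "$demo_dir/genes.awk" {} +
```

Each `*_genes.txt` file is written next to its input, because the output name is derived from FILENAME, which here includes the directory that find prepends.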
# 4  
Old 02-18-2016
Thank you very much... works great.