Combine identical lines and average the one variable field

06-11-2014


I have the following file:

Code:

299899 chrX_299716_300082 196  78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:1.45
299899 chrX_299716_300082 196  78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:0.283
311027 chrX_310892_311162 300  91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:0.283
311027 chrX_310892_311162 300  91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:-0.324
388608 chrX_388393_388823 562  50.619 388603 chrX_388594_388612 18.4584 Tajd:0.342217 FayWu:-0.742664 T2:-0.421
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:0.803
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:-1.233
1220600 chrX_1220404_1220797 510 -0 1220617 chrX_1220608_1220626 16.7085 Tajd:0.391032 FayWu:-0.421912 T2:1.093

Many lines are identical except for the last field (T2:#). I'm looking for a way to combine each group of such lines into a single line whose T2 value is the average of the group. For this excerpt, I would like to get something like:

Code:

299899 chrX_299716_300082 196  78.2903 299991 chrX_299982_300000 18.2538 Tajd:0.745591 FayWu:-0.245701 T2:0.8665
311027 chrX_310892_311162 300  91.6452 311022 chrX_311013_311031 14.9526 Tajd:0.640409 FayWu:-0.278087 T2:-0.0205
388608 chrX_388393_388823 562  50.619 388603 chrX_388594_388612 18.4584 Tajd:0.342217 FayWu:-0.742664 T2:-0.421
688781 chrX_688561_689002 552 -0 688817 chrX_688808_688826 10.6874 Tajd:0.302043 FayWu:-1.079566 T2:-0.215
1220600 chrX_1220404_1220797 510 -0 1220617 chrX_1220608_1220626 16.7085 Tajd:0.391032 FayWu:-0.421912 T2:1.093

The file is sorted, so all identical lines will be consecutive entries. The closest I have gotten is:

Code:

more input.file | awk '{split($10,a,":");avt2[$1]+=a[2];c[$1]++}END{for(i in avt2) print $0,avt2[i]/c[i]}' > output.file

but it has not produced any useful results.
Thanks a lot for any help,
Jonas

jfern


06-11-2014


Maybe try this:

Code:

awk -F: '{S=$1 FS $2 FS $3;a[S]++;b[S]=b[S]+$4} END {for (i in a) {print i FS b[i]/a[i]}}' file

Note: I assumed each line has 4 colon-separated fields.

Also, the file does not need to be sorted for this to work. It handles both sorted and unsorted input.
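One caveat with the array-based one-liner: `for (i in a)` visits keys in an unspecified order, so the output may not come out in input order. Since the file in the question is already sorted, a streaming variant can flush each group as soon as the next one starts. The following is only a sketch under stated assumptions: the function name `avg_t2_sorted` is made up for illustration, and it assumes the variable value is whatever follows the last colon on the line.

```shell
# Hypothetical helper: average the value after the last colon for each run
# of consecutive lines that are otherwise identical.
# Assumption: identical lines are adjacent (sorted input).
avg_t2_sorted() {
    awk '
    {
        val = $0
        sub(/.*:/, "", val)          # val = text after the last colon (the T2 number)
        key = $0
        sub(/[^:]*$/, "", key)       # key = line up to and including the last colon
        if (NR > 1 && key != prev) { # a new group starts: print the finished one
            avg = sum / cnt
            print prev avg
            sum = cnt = 0
        }
        prev = key
        sum += val
        cnt++
    }
    END {
        if (cnt) {                   # flush the final group
            avg = sum / cnt
            print prev avg
        }
    }' "$@"
}

# Usage: avg_t2_sorted input.file > output.file
```

This keeps the input order and the original spacing of each line, and holds only one group in memory at a time. It would give wrong results on unsorted input, where the array-based approach is the safer choice.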

EDIT:
I see you used almost the same approach, except for the field separator (FS set to ":" via -F:).
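For reference, the original attempt goes wrong in two places. It keys the arrays on `$1` only, and in the `END` block it prints `$0`, which at that point typically still holds the last input record, so every output line repeats the last line of the file. A possible repair that keeps the `split` idea is sketched below. The function name `avg_t2_unsorted` and the `order` array are illustrative choices, not anything from the posts above, and the sketch assumes the last field always has the form LABEL:VALUE.

```shell
# Hypothetical helper: group on the whole line minus the trailing value,
# and print groups in the order they were first seen.
avg_t2_unsorted() {
    awk '
    {
        split($NF, a, ":")                    # a[1] = label (e.g. "T2"), a[2] = value
        key = $0
        sub(/[^:]*$/, "", key)                # drop the value, keep "…T2:"
        if (!(key in cnt)) order[++n] = key   # remember first-seen order
        sum[key] += a[2]
        cnt[key]++
    }
    END {
        for (i = 1; i <= n; i++) {
            k = order[i]
            avg = sum[k] / cnt[k]
            print k avg
        }
    }' "$@"
}

# Usage: avg_t2_unsorted input.file > output.file
```

Storing the line prefix as the array key is what replaces the stray `$0` in `END`, and the extra `order` array trades a little memory for predictable output order.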

Last edited by clx; 06-11-2014 at 07:24 AM.

clx


06-11-2014


Seems to have worked.

Thanks!

jfern
