Calculate average of top n% of values

05-20-2014

Registered User

60, 0

Join Date: Jul 2012

Last Activity: 23 August 2016, 11:46 AM EDT

Location: Sweden

Posts: 60

Thanks Given: 25

Thanked 0 Times in 0 Posts

Calculate average of top n% of values - UNIX

Hey guys,

I have several huge tab delimited files which look like this:

Code:

what I am interested is to calculate the average of top n% of data in third column. So for example for this file the top 50% values are:

Code:

23
20

(Please note that it is not top 50% of the number of inputs, it's top 50% of values. Basically repetitions are ignored)
And then the average (the output) will be:

Code:

21.5

I can do it with making 2 temporary files for each input file which is dumber even than the most possible stupid way! :|

I appreciate if you have something neat and clean on your mind for this.
Thanks in advance!

@man

View Public Profile for @man

Find all posts by @man

05-20-2014

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

Code:

gawk -v p=50 '{a[$3]}END{asorti(a,as);l=length(a);for(i=l;(i/l)*100>p;i--) print as[i]}' myFile

vgersh99

View Public Profile for vgersh99

Find all posts by vgersh99

05-20-2014

Registered User

60, 0

Join Date: Jul 2012

Last Activity: 23 August 2016, 11:46 AM EDT

Location: Sweden

Posts: 60

Thanks Given: 25

Thanked 0 Times in 0 Posts

Thanks for your reply vgersh99!

It has some problems
First of all it does not do the sorting properly. Here is what I got for top 0.1% for example:

Code:

apparently it sort them like text.

Also I would like to have the average of all these numbers (current output) as thefinal output.

I also tried to make it automatic for many files like below, it did not work!

Code:

for file in *.txt do; gawk -v p=50 '{a[$3]}END{asorti(a,as);l=length(a);for(i=l;(i/l)*100>p;i--) print as[i]}' $file; done

Thanks 1000 times for your time.

@man

View Public Profile for @man

Find all posts by @man

05-20-2014

Moderator

3,791, 1,452

Join Date: Oct 2010

Last Activity: 1 August 2020, 1:38 AM EDT

Posts: 3,791

Thanks Given: 183

Thanked 1,452 Times in 1,302 Posts

What's your longest number?

Here is a update with assumption of 7 or less digits (adjust red value as you need):

Code:

gawk -v p=50 '{a[sprintf("%07d",$3)]}
  END{asorti(a,as);l=length(a);for(i=l;(i/l)*100>p;i--)
  t+=as[i]; printf "%.2f\n", t/l/2}' infile

For multi files:

Code:

for file in *.txt
do
   gawk -v p=50 '
   {a[sprintf("%07d",$3)]}
   END{
      asorti(a,as)
      l=length(a)
      for(i=l;(i/l)*100>p;i--) t+=as[i]
      printf "%s\t%.2f\n", FILENAME, t/l/2}' "$file"
done

Chubler_XL

View Public Profile for Chubler_XL

Find all posts by Chubler_XL

05-21-2014

Moderator

3,843, 841

Join Date: Jun 2007

Last Activity: 29 June 2020, 12:30 PM EDT

Location: Lancashire, UK

Posts: 3,843

Thanks Given: 2,004

Thanked 841 Times in 727 Posts

Quote:

I can do it with making 2 temporary files for each input file which is dumber even than the most possible stupid way! :|

What way have you tried?

Can you send the output from one command straight into the next, a bit like this?:-

Code:

cut -f3 -d " " input_file | sort -unr | ........

It would probably be more beneficial if it is in the tools that you are most comfortable with rather than a bespoke one-off that you dare not adjust.

What suits you?

Robin

rbatte1

View Public Profile for rbatte1

Visit rbatte1's homepage!

Find all posts by rbatte1

05-21-2014

Registered User

60, 0

Join Date: Jul 2012

Last Activity: 23 August 2016, 11:46 AM EDT

Location: Sweden

Posts: 60

Thanks Given: 25

Thanked 0 Times in 0 Posts

Quote:

What's your longest number?

Here is a update with assumption of 7 or less digits (adjust red value as you need):

Code:
gawk -v p=50 '{a[sprintf("%07d",$3)]}
END{asorti(a,as);l=length(a);for(i=l;(i/l)*100>p;i--)
t+=as[i]; printf "%.2f\n", t/l/2}' infile

For multi files:

Code:
for file in *.txt
do
gawk -v p=50 '
{a[sprintf("%07d",$3)]}
END{
asorti(a,as)
l=length(a)
for(i=l;(i/l)*100>p;i--) t+=as[i]
printf "%s\t%.2f\n", FILENAME, t/l/2}' "$file"
done

I don't think I have more than 7 digits. I tried it with two following sample files:
file1.txt

Code:

chr2L	10	23
chr2L	20	20
chr2L	35	15
chr2L	36	10
chr3R	12	10
chrX	10	15

file2.txt

Code:

chr2L	10	230
chr2L	20	20
chr2L	35	1.5
chr2L	36	1000
chr3R	12	100
chr3R	20	300
chrX	10	15
chrX	26	1500

and this is what I got as output when p=50:

Code:

file1.txt 21.50
file2.txt 1515.00

I think something is wrong but I don't know what! For sure the average of top 50% (1515 for second file) cannot be more than the maximum(1500)!!

I also tried it with different percentages. (p=12.5,p=10,...) but it does not seem to work properly. my desired percentage is p=0.1 if that is important to know.

Thanks once again for helping me.

---------- Post updated at 12:18 PM ---------- Previous update was at 12:10 PM ----------

Quote:

It would probably be more beneficial if it is in the tools that you are most comfortable with rather than a bespoke one-off that you dare not adjust.

Dear rbatte1,

You are absolutely right with no doubt!
I am a student in "bioinformatics" and my knowledge in programming is sub zero which is a shame. you asked what I did. What I could think of was to extract the third column first and using 'pipe' sort it numerically and then again use pipe to have only unique values and then sort them again and count the numbers of values take 0.1% of them and another pipe and then the average function!!
you see how stupid one can be!!
But even for each step of this dumb way I have to Google and such simple script takes me a half a day or more to complete! and this is part of a huge analysis of-course...

Now you see why I decided to ask for help! :|

Kindest regards,
aman

@man

View Public Profile for @man

Find all posts by @man

05-21-2014

Moderator

3,843, 841

Join Date: Jun 2007

Last Activity: 29 June 2020, 12:30 PM EDT

Location: Lancashire, UK

Posts: 3,843

Thanks Given: 2,004

Thanked 841 Times in 727 Posts

We're always happy to have questions & I've learned lots from this site by asking what many consider as daft questions. Hopefully this helps you understand and become better, but it sometimes helps us who know a bit to have problems to solve. We all contribute and see different ways of doing things.

If you make an effort, folks here are happy to help - and I've learned lots of new things. Most people will have joined to ask a question in the first place, so you are most welcome.

Whatever suits you best is a good way to steer us. Some tools like awk and their variants can take years to get a good understanding on. I'm still struggling along trying to learn, so don't get disheartened.

If you have a go, then show what you've done when asking for help the collective group will no doubt help you out.

Regards,
Robin

rbatte1

View Public Profile for rbatte1

Visit rbatte1's homepage!

Find all posts by rbatte1

Shell Programming and Scripting

Calculate average of top n% of values - UNIX

8 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Calculate average from a given set of keys and values

Discussion started by: nans

2. Shell Programming and Scripting

Calculate the average per block.

Discussion started by: invinzin21

3. Shell Programming and Scripting

Calculate average, azimut and distance

Discussion started by: jiam912

4. Shell Programming and Scripting

Calculate Average AWK

Discussion started by: AriasFco

5. Shell Programming and Scripting

AWK novice - calculate the average

Discussion started by: alex2005

6. Shell Programming and Scripting

Calculate average time using a script

Discussion started by: jaredhanks

7. Programming

calculate average

Discussion started by: cdfd123

8. UNIX for Dummies Questions & Answers

calculate average of column 2

Discussion started by: onthetopo