I have several huge tab-delimited files which look like this:
What I am interested in is calculating the average of the top n% of the values in the third column. For example, for this file the top 50% of the values are:
(Please note that it is not the top 50% of the number of entries; it is the top 50% of the distinct values, so repeated values are only counted once.)
And then the average (the output) will be:
I can do it by making two temporary files for each input file, but that is about the clumsiest way possible! :|
I would appreciate it if you have something neat and clean in mind for this.
Thanks in advance!
Quote:
I can do it by making two temporary files for each input file, but that is about the clumsiest way possible! :|
What way have you tried?
Can you send the output from one command straight into the next, a bit like this:
It would probably be more beneficial if it is done with the tools you are most comfortable with, rather than a bespoke one-off that you dare not adjust.
Code:
for file in *.txt
do
gawk -v p=50 '
{a[sprintf("%07d",$3)]}   # one entry per unique value; zero-padding makes the string sort numeric (assumes non-negative integers of up to 7 digits)
END{
asorti(a,as)              # as[1..l] now holds the unique values in ascending order
l=length(a)
for(i=l;(i/l)*100>p;i--){t+=as[i];k++}   # sum the top p% of the values and count how many were summed
printf "%s\t%.2f\n", FILENAME, (k?t/k:0)}' "$file"   # divide by the count actually summed, not l/2
done
I don't think I have more than 7 digits. I tried it with the following two sample files:
file1.txt
file2.txt
and this is what I got as output when p=50:
I think something is wrong, but I don't know what! Surely the average of the top 50% (1515 for the second file) cannot be more than the maximum (1500)!
I also tried it with different percentages (p=12.5, p=10, ...), but it does not seem to work properly either. My desired percentage is p=0.1, if that is important to know.
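An average above the maximum is the classic symptom of dividing the sum by the wrong count: the loop sums roughly p% of the values, so the divisor must be the number of values actually summed, not l/2. A portable standalone check on made-up data (the file name and values are hypothetical, and plain awk is used instead of gawk's asorti):

```shell
# Made-up sample: third column holds 100, 200, ..., 1000 (tab-delimited)
printf 'x\ty\t%s\n' 100 200 300 400 500 600 700 800 900 1000 > sample.txt

# Unique-sort column 3 numerically, then average the top p% of the values,
# dividing by k, the number of values actually summed
avg=$(cut -f3 sample.txt | sort -nu |
      awk -v p=50 '
        { v[++l] = $1 }                       # values arrive in ascending order
        END {
            for (i = l; (i/l)*100 > p; i--) { t += v[i]; k++ }
            printf "%.2f", (k ? t/k : 0)      # divide by the count, not l/2
        }')
echo "$avg"    # top 50% of 100..1000 is 600..1000, whose average is 800.00
```

With the divisor fixed this way, the result can never exceed the maximum value, which is an easy sanity check to run after any change.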
Thanks once again for helping me.
---------- Post updated at 12:18 PM ---------- Previous update was at 12:10 PM ----------
Quote:
It would probably be more beneficial if it is done with the tools you are most comfortable with, rather than a bespoke one-off that you dare not adjust.
Dear rbatte1,
You are absolutely right, no doubt about it!
I am a student in bioinformatics and my programming knowledge is next to zero, which is a shame. You asked what I tried: all I could think of was to extract the third column, pipe it into a numeric sort, pipe that into a filter keeping only the unique values, count the remaining values, take the top 0.1% of them, and finally pipe those into an averaging step!
You see how clumsy one can be!!
But even for each step of this crude approach I have to Google, and such a simple script takes me half a day or more to complete! And this is part of a huge analysis, of course...
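For the record, the step-by-step pipeline described above (extract the column, sort numerically, keep unique values, count them, take the top slice, average) can be written down directly; it is not elegant, but it does work. A sketch, assuming tab-delimited input and a hypothetical sample file, with p=50 so the numbers are easy to verify by hand:

```shell
# Made-up sample: third column holds 100, 200, ..., 1000 (tab-delimited)
printf 'x\ty\t%s\n' 100 200 300 400 500 600 700 800 900 1000 > sample.txt

p=50                                                    # percentage of top values to average
total=$(cut -f3 sample.txt | sort -n | uniq | wc -l)    # number of unique values
k=$(awk -v t="$total" -v p="$p" 'BEGIN { k = int(t*p/100); print (k < 1 ? 1 : k) }')

# Sort the unique values, keep the top k, then average them
avg=$(cut -f3 sample.txt | sort -n | uniq | tail -n "$k" |
      awk '{ s += $1; n++ } END { if (n) printf "%.2f", s/n }')
echo "$avg"    # top 5 of 100..1000 is 600..1000, whose average is 800.00
```

The `k < 1 ? 1 : k` guard matters for very small percentages such as p=0.1: on a file with only a few thousand unique values the slice would otherwise round down to zero and average nothing.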
We're always happy to have questions, and I've learned lots from this site by asking what many would consider daft questions. Hopefully this helps you understand and improve, but it also helps those of us who know a bit to have problems to solve. We all contribute and see different ways of doing things.
If you make an effort, folks here are happy to help - and I've learned lots of new things myself. Most people joined to ask a question in the first place, so you are most welcome.
Whatever suits you best is a good way to steer us. Some tools, like awk and its variants, can take years to understand well. I'm still struggling along trying to learn, so don't get disheartened.
If you have a go, then show what you've done when asking for help, and the collective group will no doubt help you out.