AWK counting interval / histogram data


 
# 1  
Old 02-15-2012

My data looks like this:
frame phi psi
Code:
0 68.466774 -58.170494
1  75.128593 -51.646816
2 76.083946 -64.300102
3 77.578056  -76.464218
4 63.180199 -76.067680
5 77.203979 -58.560757
6  66.574913 -60.000214
7 73.218269 -70.978203
8 70.956879 -76.096558
9  65.538872 -76.716568
10 57.107117 -67.572067
11 63.389595  -49.936893
12 83.935219 -65.073227
13 78.492310 -69.225609
14  58.567463 -77.028725
15 60.258656 -85.608917
16 80.604012  -68.479416
17 79.839516 -58.189476
18 68.693405 -66.911407
19  48.195873 -56.744625
20 75.479187 -48.657692
21 80.180649  -69.976234
22 71.216110 -70.213730
23 67.672768 -50.655262
24  55.870106 -63.952560
25 65.091850 -59.066532
26 64.395363  -40.585659
27 80.011673 -56.789768
28 74.003281 -69.651680
29  65.848534 -60.928204
30 65.260933 -78.133301
...

I would like to bin this data by its phi and psi values.

For example, with a bin width of 10, my desired output would have this form:

Code:
phi   psi   count
-180  -180  464
-170  -170  324
-160  -160  133

...


So, for an AWK script, I need a condition that tests $2 and $3 against ranges with a bin width of 10, e.g.:
$2<=-170&&$2>=-180&&$3<=-170&&$3>=-180
$2<=-160&&$2>=-170&&$3<=-160&&$3>=-170
$2<=-150&&$2>=-160&&$3<=-150&&$3>=-160
$2<=-140&&$2>=-150&&$3<=-140&&$3>=-150
$2<=-130&&$2>=-140&&$3<=-130&&$3>=-140
$2<=-120&&$2>=-130&&$3<=-120&&$3>=-130
$2<=-110&&$2>=-120&&$3<=-110&&$3>=-120
$2<=-100&&$2>=-110&&$3<=-100&&$3>=-110
$2<=-90&&$2>=-100&&$3<=-90&&$3>=-100
...

For each of these ranges, I want to count the number of data points that fall within the interval. Any help here?
I can sort of see how to count in AWK, but how do you discretise the count into intervals of this kind, e.g. how do you loop with a step of 10 units between iterations?
Thanks
# 2  
Old 02-15-2012
I see no values in your output that have anything to do with your input, so I'm left a bit confused.

What about values that don't match, e.g. phi between 80 and 90 and psi between 70 and 80? Should they be ignored?

---------- Post updated at 11:32 AM ---------- Previous update was at 11:29 AM ----------

Based on what I'm guessing you want:

Code:
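awk 'BEGIN { MIN=99999999; MAX=-MIN }

{        # int() truncates toward zero and keeps A and B numeric;
          # sprintf() would return strings and force string comparisons below
          A=int($2/10);
          B=int($3/10);
          if(A == B)
          {
                  BIN[A]++;
                  # two independent tests, so the first match sets both MIN and MAX
                  if(A<MIN) MIN=A;
                  if(A>MAX) MAX=A;
          }
}
# BIN[N]+0 prints 0 instead of an empty string for bins with no data
END { for(N=MIN; N<=MAX; N++) print N*10, N*10, BIN[N]+0; }' inputfile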

# 3  
Old 02-15-2012
No, they should not be ignored.

Maybe it would be easier to bin the data as in this Perl script, which only works for binning a one-column array such as phi on its own (@list is the input array containing phi, and $bin_width is 10):

Code:
use POSIX qw(ceil);   # ceil() is not a Perl built-in

sub histogram
{
   my($bin_width, @list) = @_;
   my %histogram;
   # Map each value to its bin index and count it
   $histogram{ceil(($_ + 1) / $bin_width) -1}++ for @list;
   my $max;
   my $min;

   # Find the lowest and highest occupied bin index
   while (my ($key, $value) = each(%histogram))
   {
     $max = $key if !defined($max) || $key > $max;
     $min = $key if !defined($min) || $key < $min;
   }
   # Print every bin from $min to $max, including empty ones
   for (my $i = $min; $i <= $max; $i++)
   {
     my $bin = sprintf("% 10d", ($i)*$bin_width);
     my $frequency = $histogram{$i} || 0;

     print $bin." ".$frequency."\n";
   }
   print "    Width: ".$bin_width."\n";
   print "    Range: ".$min."-".$max."\n\n";
}

In this Perl script, the hash maps a bin index ($key) to a count ($value). Take a bin width of 10. For an input value of -173 the ceiling calculation gives
ceil((-173 + 1)/10) - 1 = -18
so -173 falls in bin -18: $key is -18 and its $value becomes 1. The next time the script finds a value in bin -18, it increments (++) $value to 2, and so on. The data is binned this way without any explicit range tests. I would like to extend this script to a more complicated hash with 2 columns (one for phi, one for psi).
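For comparison, the same index calculation can be sketched in awk. This is only an illustration: awk has no built-in ceil(), so the helper below is hand-written, and the names bin_index and ceil are made up for this example.

```shell
# Sketch: the Perl bin index ceil((x + 1) / width) - 1, written in awk.
# bin_index and ceil are hypothetical helper names, not part of awk.
bin_index() {
    awk -v W="${1:-10}" '
    # ceil(x): smallest integer >= x (int() truncates toward zero)
    function ceil(x,    i) { i = int(x); return (i < x) ? i + 1 : i }
    { print $1, ceil(($1 + 1) / W) - 1 }'
}

# Example: print the value and its bin index for a few inputs
printf '%s\n' -173 -180 -171 68.466774 | bin_index 10
```

With a width of 10, index -18 collects the values x with -181 < x <= -171, so the +1 shift suits integer input; for real-valued angles a plain floor(x / width) may be the more natural index.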

Anyway, maybe this helps?

The output I showed earlier is just an illustration of the format:
Code:
phi   psi   count
-180  -180  464
-170  -170  324
-160  -160  133

# 4  
Old 02-15-2012
Okay then:

Code:
awk 'BEGIN { MIN=99999999; MAX=-MIN; OFS="\t"; BINSIZE=10; }

{        A=sprintf("%d", $2/BINSIZE);
          B=sprintf("%d", $3/BINSIZE);
          BIN[A OFS B]++;
}
END { for(X in A) print X, BIN[X]; }' inputfile

# 5  
Old 02-15-2012
awk: can't assign to A; it's an array name.
input record number 1, file angles_merge.dat
source line number 3
# 6  
Old 02-15-2012
Typo.

Code:
awk 'BEGIN { MIN=99999999; MAX=-MIN; OFS="\t"; BINSIZE=10; }

{        A=sprintf("%d", $2/BINSIZE);
          B=sprintf("%d", $3/BINSIZE);
          BIN[A OFS B]++;
}
END { for(X in BIN) print X, BIN[X]; }' inputfile
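
One thing to watch with the script above: sprintf("%d", ...) truncates toward zero, so every value in (-10, 10) lands in bin 0, and the printed key is the bin index rather than the bin edge. A minimal sketch of a floor-based variant follows. The names bin2d and floor are made up for illustration, and it assumes the lower bin edge is the label you want.

```shell
# Sketch: 2D binning with floor-based bins; bin2d and floor are hypothetical names.
bin2d() {
    awk -v BINSIZE="${BINSIZE:-10}" '
    # floor(x): largest integer <= x (int() alone truncates toward zero)
    function floor(x,    i) { i = int(x); return (i > x) ? i - 1 : i }
    # only lines whose second field looks numeric, which skips a header line
    NF >= 3 && $2 ~ /^-?[0-9.]/ {
        A = floor($2 / BINSIZE) * BINSIZE   # lower edge of the phi bin
        B = floor($3 / BINSIZE) * BINSIZE   # lower edge of the psi bin
        BIN[A "\t" B]++
    }
    END { for (X in BIN) print X "\t" BIN[X] }' "$@"
}
```

Call it as bin2d inputfile. The order of for (X in BIN) is unspecified, so pipe the result through sort -k1,1n -k2,2n if you need it ordered.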

# 7  
Old 02-16-2012
Fantastic, thanks.
What if I wanted to output bins that were not visited?
Currently I only get the bins that contain data, but
I would like to include the empty bins as well. Is this possible?

In the Perl script, if no data was accrued for a certain bin during the loop, the
bin is still printed with a frequency of 0:
Code:
   my $frequency = $histogram{$i} || 0;

---------- Post updated at 01:31 PM ---------- Previous update was at 01:28 PM ----------

Currently my output is of little use for plotting, because the empty bins are missing:

Code:
-170 -140 4
-170 -150 14
-170 -160 46
-170 -170 122
-170 -30 1
-170 -40 7
-170 -50 3
-170 -60 3
-170 120 9
-170 130 83
-170 140 258
-170 150 366
-170 160 384
-170 170 246
-160 -130 4
-160 -140 9
-160 -150 38
-160 -160 164
-160 -170 587
-160 -30 3
-160 -40 4
-160 -50 8
-160 -60 1
-160 100 2
-160 110 13
-160 120 35
-160 130 339
-160 140 1135
-160 150 1903
-160 160 1975
-160 170 1414
-150 -110 3
-150 -120 1
-150 -130 6
-150 -140 26
-150 -150 95
-150 -160 453
-150 -170 1771
-150 -20 3
-150 -30 8
-150 -40 4
-150 -50 10
-150 -60 9
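
One possible way to get the unvisited bins as well is to loop over the whole grid in the END block rather than over the array keys. The sketch below assumes the angles lie in the range -180 to 180. The names bin2d_full, LO, HI and floor are made up for illustration.

```shell
# Sketch: emit every (phi, psi) bin in [LO, HI), including empty ones.
# bin2d_full, LO and HI are hypothetical names; -180..180 is an assumed range.
bin2d_full() {
    awk -v BINSIZE="${BINSIZE:-10}" -v LO="${LO:--180}" -v HI="${HI:-180}" -v OFS='\t' '
    # floor(x): largest integer <= x (int() alone truncates toward zero)
    function floor(x,    i) { i = int(x); return (i > x) ? i - 1 : i }
    $2 ~ /^-?[0-9.]/ && $3 ~ /^-?[0-9.]/ {
        BIN[floor($2 / BINSIZE) * BINSIZE, floor($3 / BINSIZE) * BINSIZE]++
    }
    END {
        # step through the grid in units of BINSIZE
        for (P = LO + 0; P < HI; P += BINSIZE)
            for (Q = LO + 0; Q < HI; Q += BINSIZE)
                print P, Q, BIN[P, Q] + 0     # + 0 turns a missing entry into 0
    }' "$@"
}
```

Call it as bin2d_full inputfile. Each bin is half-open, [edge, edge + BINSIZE), so a value of exactly 180 falls outside the printed grid; raise HI if that matters for your data.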
