Generate Regex numeric range with specific sub-ranges

03-17-2013

alister's proposal assumes a fixed bucket size (100 ms per bucket in this case) and a fixed number of buckets (10). Your header does not: its bucket widths are 5 ms, 5 ms, 10 ms, 8 x 10 ms, 50 ms, 50 ms, and infinity, so it is incompatible with that nice, simple, linear solution. You would need to pass the buckets to awk explicitly; then it would also be easy both to print the header and to check for "out of range" values.

EDIT: Chubler_XL just beat me to it; his proposal comes close to what I had in mind. It just doesn't put the 279 ms value from the sample file into the right bin.

EDIT 2: massaging Chubler_XL's proposal slightly, this might be acceptable to the requestor:

Code:

awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
         BEGIN                  {n=split(buckets,B,",");B[n+1]=">"B[n]};
         /^Response time/       {for(i=1;B[i]&&($3>B[i]);i++);v[i]++}
         END                    {for (i=1; i<=n+1; i++) printf "%3sms,", B[i]
                                 printf "\n"
                                 for (i=1; i<=n+1; i++) printf "%3d  ,", v[i]
                                 printf "\n"
                                }
        ' OFS=, file
  5ms, 10ms, 20ms, 30ms, 40ms, 50ms, 60ms, 70ms, 80ms, 90ms,100ms,150ms,200ms,>200ms,
  1  ,  0  ,  1  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  1  ,  0  ,  1  ,

Last edited by RudiC; 03-17-2013 at 07:18 PM..

RudiC

03-17-2013

Quote:

Originally Posted by RudiC

Code:

awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
         BEGIN                  {n=split(buckets,B,",");B[n+1]=">"B[n]};
         /^Response time/       {for(i=1;B[i]&&($3>B[i]);i++);v[i]++}
         END                    {for (i=1; i<=n+1; i++) printf "%3sms,", B[i]
                                 printf "\n"
                                 for (i=1; i<=n+1; i++) printf "%3d  ,", v[i]
                                 printf "\n"
                                }
        ' OFS=, file
  5ms, 10ms, 20ms, 30ms, 40ms, 50ms, 60ms, 70ms, 80ms, 90ms,100ms,150ms,200ms,>200ms,
  1  ,  0  ,  1  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  1  ,  0  ,  1  ,

The problem with your and Chubler_XL's suggestion is that I'll have to define the upper limit of every bucket, which is the main reason I'm moving away from my current solution. Otherwise, for a range of 0 - 1000 with a bucket size of 10 ms, it would take me ages to define them all.

Alister's solution is very simple, and I have to define only 2 values.

With regard to the header: I only gave an example, but as I said, to keep the solution nice and tidy, the header should be generated from the n and s values.

Cheers

varu0612

03-17-2013

Quote:

Originally Posted by Chubler_XL

Code:

B[i]&&($3>B[i])

Quote:

Originally Posted by RudiC

Code:

B[i]&&($3>B[i])

If $3>B[i] works with your awk implementation (I know it works with at least some mawk versions, if not all) then it's because it's violating POSIX. That should be performing a string comparison for all iterations of the loop, even when both B[i] and $3 are numeric strings. A compliant implementation can yield an incorrect result (such as when "200" is treated as greater than "10").

From http://pubs.opengroup.org/onlinepubs...ities/awk.html:

Quote:

Comparisons (with the '<' , "<=" , "!=" , "==" , '>' , and ">=" operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required and a string comparison shall be made using the locale-specific collation sequence. The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.

Except for the final value in B, every member of B that results from split() is a numeric string. Every "number" assigned from the input data to a field variable (such as $3) is also a numeric string. Note that the case of comparing a numeric string with a numeric string should be handled as a string comparison; at least one operand should be numeric for a numeric comparison to occur (which means "casting" with +0, or using the result of a function that returns a number, or using a numeric literal).

Another issue is that the terminating condition is locale dependent. The only reason the loop terminates is because a string comparison is used to compare the value of $3 against ">200" (in this instance). If a locale-aware implementation were run under a locale that did not place the ">" after all of the digits, an infinite loop would result upon encountering a value that should land in the last bucket.
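
One way to sidestep both problems, offered as a minimal sketch rather than a definitive fix: coerce both operands with +0 so that the comparison is numeric, and end the loop on the index bound instead of relying on how ">200" collates. The function name bucket_counts and the sample input are assumptions made for illustration; the bucket list is the one from the quoted code.

```shell
# Hypothetical helper: reads "Response time N ms" lines on stdin.
bucket_counts() {
    awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
    BEGIN { n = split(buckets, B, ",") }
    /^Response time/ {
        v = $3 + 0                      # +0 makes the operand numeric
        i = 1
        # Stop on the index bound, not on a string comparison with ">200".
        while (i <= n && v > B[i] + 0) i++
        count[i]++                      # i == n+1 is the overflow bucket
    }
    END {
        for (i = 1; i <= n; i++) printf "%sms,", B[i]
        print ">" B[n] "ms"
        for (i = 1; i <= n; i++) printf "%d,", count[i]
        print count[n + 1] + 0
    }'
}

printf 'Response time 3 ms\nResponse time 15 ms\nResponse time 120 ms\nResponse time 279 ms\n' | bucket_counts
```

With this form, a value such as 9 is compared to 10 as a number, and a value above the last boundary simply runs i up to n+1.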

Regards,
Alister

---------- Post updated at 08:04 PM ---------- Previous update was at 07:18 PM ----------

Quote:

Originally Posted by varu0612

Alister,

Your method is a very tidy/ nice one (balajesuri yours works ok as well, so thank you!).

Two more questions:

a) how can i add a header like this which should take into account the n buckets of size s

Buckets,0-5ms,5-10ms,10-20ms,20-30ms,30-40ms,40-50ms,50-60ms,60-70ms,70-80ms,80-90ms,90-100ms,100-150ms,150-200ms,> 200ms

b) if the values are beyond the valid range, how can i add it under >200ms for example?

Many thanks,

Code:

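awk -v n=10 -v s=100 '
# Count each value in bucket int($3/s); indices above n fold into bucket n.
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END {
    # Header: one "low-high" range per regular bucket, then the overflow label.
    for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
    print ">" (n*s)
    # Counts: b[i]+0 prints 0 for buckets that received no values.
    for (i=0; i<n; i++) printf "%s,", (b[i]+0)
    print b[i]+0
}' file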

Regards,
Alister

alister

03-18-2013

thank you!

Alister,

What does the code in bold mean?

In case this helps others, the code and output are below.

Input file

Code:

cat file1.txt
Response time 2 ms
Response time 15 ms
Response time 17 ms
Response time 50 ms
Response time 45 ms
Response time 80 ms
Response time 89 ms
Response time 50 ms
Response time 53 ms
Response time 58 ms
Response time 57 ms
Response time 56 ms
Response time 98 ms
Response time 99 ms
Response time 100 ms
Response time 102 ms
Response time 110 ms

Code:

awk -v n=10 -v s=10 '
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END { printf "Buckets, "
for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
print ">" (n*s)
for (i=0; i<n; i++) printf "%s,", (b[i]+0)
print b[i]+0
}' file1.txt

Output/ result

Code:

Buckets, 0-9,10-19,20-29,30-39,40-49,50-59,60-69,70-79,80-89,90-99,>100
1,2,0,0,1,6,0,0,2,2,3

Just for my own knowledge: should I understand that it is very hard to implement this using regular expressions? Has anyone done it?

Cheers

Last edited by varu0612; 03-18-2013 at 09:12 AM..

varu0612

03-18-2013

Quote:

Originally Posted by varu0612

Alister,

the code in bold what does it mean?

Code:

++b[((i=int($3/s)) > n) ? n : i]

It uses the ternary operator, e1 ? e2 : e3, which involves three expressions, e1, e2, and e3. If the first expression, e1, evaluates to true, then the result is e2. If e1 is instead false, the result is e3.

In the quoted code fragment:
e1: (i=int($3/s)) > n
e2: n
e3: i

e1 calculates the bucket index to which $3 belongs, stores that value in i, and then compares the value of the assignment (which is the value stored in i) to n. If i is greater than n, which would indicate a bucket beyond the final bucket, then e1 is true and the result is e2, which is n. This is the logic which folds all values that would fall into a bucket beyond the final bucket into that final bucket. If, however, i is not greater than n, then e1 is false, i is a valid bucket index, and the ternary operator returns e3 (i).
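
To make the folding concrete, here is a small sketch that prints the raw index and the resulting bucket for a few hand-picked values, assuming n=10 and s=10. The function name show_buckets and the test values are made up for illustration.

```shell
# Hypothetical demo: no input file needed, everything runs in BEGIN.
show_buckets() {
    awk -v n=10 -v s=10 'BEGIN {
        count = split("2 57 99 100 110 1000", t, " ")
        for (k = 1; k <= count; k++) {
            i = int(t[k] / s)              # raw bucket index
            bucket = (i > n) ? n : i       # fold anything past n into n
            printf "%s -> raw %d, bucket %d\n", t[k], i, bucket
        }
    }'
}

show_buckets
```

The values 100, 110, and 1000 have raw indices 10, 11, and 100, and all three end up in bucket 10.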

I don't recommend this type of coding, as it's difficult to decipher. Even an expert programmer has to give it a close look to be certain of what's going on. My only defense is that it makes it more fun for me to contribute here, as I attempt to be as concise as possible. A possible beneficial side effect is that it may help others learn more about the language in question.

A much more readable, maintainable, and professional version:

Code:

i = int($3/s)
if (i > n)
    i = n
b[i] = b[i] + 1
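
Dropped into a complete script, the readable form might look like the sketch below. The wrapper function histogram, its two positional arguments, and the sample input are assumptions for illustration. The last column is labelled ">=" here because a value equal to n*s also lands in it; that label is a choice made for this sketch, not part of the code above.

```shell
# Hypothetical wrapper. Usage: histogram N S < file
histogram() {
    awk -v n="$1" -v s="$2" '
    /^Response time/ {
        i = int($3 / s)
        if (i > n)
            i = n                       # fold overflow into the last bucket
        b[i] = b[i] + 1
    }
    END {
        for (i = 0; i < n; i++) printf "%d-%d,", i * s, (i + 1) * s - 1
        print ">=" (n * s)
        for (i = 0; i < n; i++) printf "%d,", b[i]
        print b[n] + 0
    }'
}

printf 'Response time 2 ms\nResponse time 57 ms\nResponse time 110 ms\n' | histogram 10 10
```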

Regards,
Alister

alister

03-18-2013

The fact that you took the time to explain in detail how it works, so that even a 5-year-old could understand it, is very much appreciated.

I've seen many smart users reply with solutions but fail to explain the logic. In my view that is a useless answer, since it doesn't help the requester understand or learn how it works.

All the best!!

varu0612
