Generate Regex numeric range with specific sub-ranges


 
# 8  
Old 03-17-2013
alister's proposal assumes a fixed bucket size (in this case 100 ms per bucket) and a fixed number of buckets (10). Your header has neither: its bucket widths vary (5 ms, 5 ms, 9 x 10 ms, 50 ms, 50 ms, then unbounded), so it is incompatible with that nice, simple, linear solution. You would need to pass the bucket limits to awk explicitly; then it would also be easy both to print the header and to check for "out of range".

EDIT: Chubler_XL just beat me to it; his proposal comes close to what I had in mind. It just doesn't put the 279 ms from the sample file into the right bin.

EDIT 2: massaging Chubler_XL's proposal slightly, this might be acceptable to the requestor:
Code:
awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
         BEGIN                  {n=split(buckets,B,",");B[n+1]=">"B[n]};
         /^Response time/       {for(i=1;B[i]&&($3>B[i]);i++);v[i]++}
         END                    {for (i=1; i<=n+1; i++) printf "%3sms,", B[i]
                                 printf "\n"
                                 for (i=1; i<=n+1; i++) printf "%3d  ,", v[i]
                                 printf "\n"
                                }
        ' OFS=, file
  5ms, 10ms, 20ms, 30ms, 40ms, 50ms, 60ms, 70ms, 80ms, 90ms,100ms,150ms,200ms,>200ms,
  1  ,  0  ,  1  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  0  ,  1  ,  0  ,  1  ,


Last edited by RudiC; 03-17-2013 at 07:18 PM..
# 9  
Old 03-17-2013
The problem with your and Chubler_XL's suggestion is that I'd have to define the upper limit of every bucket by hand, which is the main reason I'm moving away from my current solution: for a range of 0 - 1000 with 10 ms buckets, it would take me ages to define them all.

Alister's solution is very simple, and I only have to define 2 values.

With regard to the header: I only gave an example, but as I said, to keep the solution nice and tidy, the header should be generated from the n and s values.

Cheers
# 10  
Old 03-17-2013
Quote:
Originally Posted by Chubler_XL
Code:
B[i]&&($3>B[i])

Quote:
Originally Posted by RudiC
Code:
B[i]&&($3>B[i])

If $3>B[i] works with your awk implementation (I know it works with at least some mawk versions, if not all), then it's because the implementation is violating POSIX. That expression should perform a string comparison on every iteration of the loop, even when both B[i] and $3 are numeric strings. A compliant implementation can yield an incorrect result (such as when "95" is treated as greater than "100").

From http://pubs.opengroup.org/onlinepubs...ities/awk.html:
Quote:
Comparisons (with the '<' , "<=" , "!=" , "==" , '>' , and ">=" operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required and a string comparison shall be made using the locale-specific collation sequence. The value of the comparison expression shall be 1 if the relation is true, or 0 if the relation is false.
Except for the final value in B, every member of B that results from split() is a numeric string. Every "number" assigned from the input data to a field variable (such as $3) is also a numeric string. Note that the case of comparing a numeric string with a numeric string should be handled as a string comparison; at least one operand should be numeric for a numeric comparison to occur (which means "casting" with +0, or using the result of a function that returns a number, or using a numeric literal).
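
To see the difference between the two kinds of comparison without depending on how a particular awk treats numeric strings, here is a small sketch that uses string constants, which are always plain strings. The values "95" and "100" and the variable names are made up for illustration.

```shell
result=$(awk 'BEGIN {
    a = "95"; b = "100"       # string constants, so never numeric strings
    str = (a > b)             # string comparison: "9" sorts after "1"
    num = (a+0 > b+0)         # adding 0 forces a numeric comparison
    print str, num
}')
printf '%s\n' "$result"
```

The string comparison reports "95" as the greater value, while the coerced comparison does not, which is exactly the kind of mix-up that would send a value to the wrong bucket.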

Another issue is that the terminating condition is locale dependent. The only reason the loop terminates is because a string comparison is used to compare the value of $3 against ">200" (in this instance). If a locale-aware implementation were run under a locale that did not place the ">" after all of the digits, an infinite loop would result upon encountering a value that should land in the last bucket.
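
A minimal sketch of one way around both issues, not a drop-in replacement: adding 0 to each operand forces a numeric comparison, and bounding the loop with i <= n means termination no longer depends on how ">200" collates. The five sample values fed in by printf are made up for illustration (only 279 appears in the thread), and the plain comma-separated output format is an assumption.

```shell
result=$(printf 'Response time %s ms\n' 3 17 95 120 279 |
awk -v buckets="5,10,20,30,40,50,60,70,80,90,100,150,200" '
BEGIN { n = split(buckets, B, ",") }
/^Response time/ {
    i = 1
    # i <= n bounds the loop explicitly; +0 makes both operands numeric
    while (i <= n && $3+0 > B[i]+0)
        i++
    v[i]++                      # i == n+1 collects values above the last limit
}
END {
    for (i = 1; i <= n; i++) printf "%sms,", B[i]
    print ">" B[n] "ms"
    for (i = 1; i <= n; i++) printf "%d,", v[i]
    print v[n+1] + 0
}')
printf '%s\n' "$result"
```

With these inputs, 95 lands in the 100ms bucket and 279 in the >200ms bucket, whatever the locale.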

Regards,
Alister

---------- Post updated at 08:04 PM ---------- Previous update was at 07:18 PM ----------

Quote:
Originally Posted by varu0612
Alister,

Your method is a very tidy and nice one (balajesuri, yours works OK as well, so thank you!).

Two more questions:

a) How can I add a header like this, which should take into account the n buckets of size s?

Buckets,0-5ms,5-10ms,10-20ms,20-30ms,30-40ms,40-50ms,50-60ms,60-70ms,70-80ms,80-90ms,90-100ms,100-150ms,150-200ms,> 200ms

b) If values are beyond the valid range, how can I count them under >200ms, for example?

Many thanks,
Code:
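awk -v n=10 -v s=100 '
# n regular buckets of width s; bucket n collects everything from n*s upward
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END {
    # header: one "low-high" label per regular bucket, then the overflow label
    for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
    print ">=" (n*s)
    # counts; +0 makes empty buckets print as 0
    for (i=0; i<n; i++) printf "%s,", (b[i]+0)
    print b[i]+0
}' file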

Regards,
Alister
# 11  
Old 03-18-2013
thank you!

Alister,

what does the code in bold, ++b[((i=int($3/s)) > n) ? n : i], mean?

In case this helps others, here are the code and its output.

Input file
Code:
cat file1.txt
Response time 2 ms
Response time 15 ms
Response time 17 ms
Response time 50 ms
Response time 45 ms
Response time 80 ms
Response time 89 ms
Response time 50 ms
Response time 53 ms
Response time 58 ms
Response time 57 ms
Response time 56 ms
Response time 98 ms
Response time 99 ms
Response time 100 ms
Response time 102 ms
Response time 110 ms

Code:
awk -v n=10 -v s=10 '
/^Response time/ { ++b[((i=int($3/s)) > n) ? n : i] }
END {
    printf "Buckets, "
    for (i=0; i<n; i++) printf "%s-%s,", (i*s), ((i+1)*s-1)
    # the last bucket also holds n*s itself (here 100), hence ">="
    print ">=" (n*s)
    for (i=0; i<n; i++) printf "%s,", (b[i]+0)
    print b[i]+0
}' file1.txt

Output/ result
Code:
Buckets, 0-9,10-19,20-29,30-39,40-49,50-59,60-69,70-79,80-89,90-99,>=100
1,2,0,0,1,6,0,0,2,2,3

Just for my own knowledge: should I understand that this is very hard to implement using regular expressions? Has anyone done it?

Cheers

Last edited by varu0612; 03-18-2013 at 09:12 AM..
# 12  
Old 03-18-2013
Quote:
Originally Posted by varu0612
Alister,

the code in bold what does it mean?

Code:
++b[((i=int($3/s)) > n) ? n : i]

It uses the ternary operator, e1 ? e2 : e3, which involves three expressions, e1, e2, and e3. If the first expression, e1, evaluates to true, then the result is e2. If e1 is false instead, the result is e3.

In the quoted code fragment:
e1: (i=int($3/s)) > n
e2: n
e3: i

e1 calculates the bucket index to which $3 belongs, stores that value in i, and then compares the value of the assignment (which is the value stored in i) to n. If i is greater than n, which would indicate a bucket beyond the final bucket, then e1 is true and the result is e2, which is n. This is the logic which folds all values that would fall into a bucket beyond the final bucket into that final bucket. If, however, i is not greater than n, then e1 is false, i is a valid bucket index, and the ternary operator returns e3 (i).

I don't recommend this type of coding, as it's difficult to decipher. Even an expert programmer has to give it a close look to be certain of what's going on. My only defense is that it makes it more fun for me to contribute here, as I attempt to be as concise as possible. A possible beneficial side effect is that it may help others learn more about the language in question.

A much more readable, maintainable, and professional version:
Code:
i = int($3/s)
if (i > n)
    i = n
b[i] = b[i] + 1
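
For completeness, here is that fragment dropped into a whole script, as a sketch. It is fed the values from the file1.txt sample posted earlier in the thread via printf (an assumption made so the example runs without the file) and prints only the counts line.

```shell
result=$(printf 'Response time %s ms\n' 2 15 17 50 45 80 89 50 53 58 57 56 98 99 100 102 110 |
awk -v n=10 -v s=10 '
/^Response time/ {
    i = int($3 / s)             # bucket index for this response time
    if (i > n)                  # anything past the last bucket...
        i = n                   # ...is folded into the last bucket
    b[i] = b[i] + 1
}
END {
    for (i = 0; i < n; i++) printf "%s,", b[i] + 0
    print b[n] + 0
}')
printf '%s\n' "$result"
```

The counts come out the same as with the one-line ternary version; only the readability changes.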

Regards,
Alister
# 13  
Old 03-18-2013
The fact that you took the time to explain in detail how it works, in a way that even a 5-year-old could understand, is very much appreciated.

I've seen many smart users reply with solutions but fail to explain the logic. In my view such an answer is of little use, since it doesn't help the requester understand or learn how it works.

All the best!!