Grep regex to ignore sequence only if surrounded by fwd-slashes

12-10-2013

Registered User

51, 0

Join Date: Mar 2010

Last Activity: 16 December 2013, 11:39 AM EST

Posts: 51

Thanks Given: 28

Thanked 0 Times in 0 Posts

Hi Don,

Quote:

Originally Posted by Don Cragun

I apologize for taking so long to get back to you. But when I have a choice between spending some time with the grandkids or evaluating an awk script, the grandkids are going to win every time.

Obviously absolutely no apology needed whatsoever. I hope you had a great time with your grandkids.

Quote:

Originally Posted by Don Cragun

Do you understand now why we don't need to waste time or space assigning the index of each array element as the value of the array element as well instead of just using the index itself?

Yes. Thanks for the excellent explanation, now I get it.

Chubler_XL: My humblest apologies for taking your name in vain, and incorrectly stating that you were wrong when you wrote that "the code never uses the value of the array element" - clearly you were not wrong, it was me that was wrong and I'm currently feeling somewhat guilty.

The problem was that I have seen this kind of loop construct so many times before: for (ArrayElement in Array). In fact in something like half a dozen languages (PHP, Perl, Java, Python, JavaScript...) and always in the past it has meant to loop through all the elements of the array placing the value of the element in ArrayElement. I've never come across that same construct but meaning for (ArrayIndex in Array) before, and suspect this may be unique to Awk. I suppose it's not quite not being able to see the forest for the trees but more like not being able to see the trees for the forest.
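For anyone following along, here is a minimal sketch of the awk behaviour described above. The array name fruit and the function name show_for_in are made up for illustration and are not from the thread's script.

```shell
# Hypothetical demo: in awk, "for (k in arr)" walks the array's indices,
# so the value has to be looked up explicitly as arr[k].
show_for_in() {
    awk 'BEGIN {
        fruit["a"] = "apple"
        fruit["b"] = "banana"
        for (k in fruit)            # k is the index, not the value
            print k "=" fruit[k]
    }' | sort                       # traversal order is unspecified, so sort
}

show_for_in
```

The sort is only there to make the output stable, since awk does not promise any particular traversal order.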

Interestingly the Wikipedia Foreach loop page lists 33 languages which use the for (ArrayElement in Array) type of loop and neither that page, nor the loop section of the Control flow page (both linked below), mention the Awk variation of the construct. I suppose Awk predates all of those languages and maybe even inspired the modern ForEach construct. Brian Kernighan is certainly a very clever man.

https://en.wikipedia.org/wiki/Foreach_loop
https://en.wikipedia.org/wiki/Loop_%...uting%29#Loops

Thanks for being so patient with me, best wishes and all that,

Gencon

Last edited by gencon; 12-10-2013 at 10:53 AM..

gencon


12-11-2013


...and just because this might be of interest:

Quote:

Originally Posted by Don Cragun

Why did you use [0-9][0-9]* at the end of digitSequenceTooLongNotIP instead of using [0-9]+?

Quote:

Originally Posted by gencon

I'm far from being a regex pro, my thought process went like this: I need to match a minimum of 4 digits in a row, so [0-9][0-9][0-9][0-9], then optionally a 5th or more digit so I need [0-9][0-9][0-9][0-9][0-9]*.

[0-9][0-9][0-9][0-9]+ is more concise, is it any more efficient?

I've had a look into the question I've placed in bold above...

I created a dataset of 5 million numbers each with a random number of digits (between 1 and 10 digits). 10 numbers per line, each separated by a space.

Then I used time to time 10 runs of an awk program which used [0-9][0-9][0-9][0-9]+ and then 10 runs with [0-9][0-9][0-9][0-9][0-9]*.

Since it was being run on my Linux desktop PC, I used chrt and set the scheduling policy to SCHED_FIFO with a priority of 99 which as far as I know gives the process the highest priority possible. The commands were:

Code:

chrt -f 99 time -f "\n***\nSecs: %e \nCPU: %P \nContext Switches: %c \nWaits: %w" \
    awk 'BEGIN { regex = "[0-9][0-9][0-9][0-9]+"; } { line = $0; gsub(regex, "x", line); }' \
    < NumsData5MillionNums >> ResRegex4 2>&1

chrt -f 99 time -f "\n***\nSecs: %e \nCPU: %P \nContext Switches: %c \nWaits: %w" \
    awk 'BEGIN { regex = "[0-9][0-9][0-9][0-9][0-9]*"; } { line = $0; gsub(regex, "x", line); }' \
    < NumsData5MillionNums >> ResRegex5 2>&1

I don't think the results can be considered as particularly scientific... But they were fairly consistent. BTW as expected each run had 0 context switches and 1 wait.

In fact the results were so close that I think that the Awk interpreter was probably running the same code in both cases, after all the 2 regexes [0-9][0-9][0-9][0-9]+ and [0-9][0-9][0-9][0-9][0-9]* are logically interchangeable.

I sorted the times and discarded the 3 fastest and 3 slowest times of the 10 runs, leaving me with:

Code:

ResRegex4 - regex = "[0-9][0-9][0-9][0-9]+" :

Secs: 7.16
Secs: 7.21
Secs: 7.27
Secs: 7.28

Mean: 7.23
Median: 7.24

ResRegex5 - regex = "[0-9][0-9][0-9][0-9][0-9]*" :

Secs: 7.14
Secs: 7.17
Secs: 7.18
Secs: 7.25

Mean: 7.185
Median: 7.175

Full output of "[0-9][0-9][0-9][0-9]+" is here: http://pastebin.com/VqC5dbna

Full output of "[0-9][0-9][0-9][0-9][0-9]*" is here: http://pastebin.com/U6rpULd6

The C code to create the data file of 5 million numbers, each 1-10 digits in length, and with 10 numbers on each line is here: http://pastebin.com/6vG9WQwj
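The mean and median figures above can be rechecked with a few lines of awk. This is only a sketch: the function name summarise is an assumption, and it is fed the four retained timings quoted in the post rather than the full pastebin output.

```shell
# Hypothetical helper: mean and median of the timings passed as arguments.
summarise() {
    printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
        { t[NR] = $1; sum += $1 }
        END {
            mid = int((NR + 1) / 2)
            # Odd count: middle value. Even count: average the two middle values.
            median = (NR % 2) ? t[mid] : (t[mid] + t[mid + 1]) / 2
            printf("Mean: %.3f\nMedian: %.3f\n", sum / NR, median)
        }'
}

summarise 7.16 7.21 7.27 7.28    # the ResRegex4 timings
summarise 7.14 7.17 7.18 7.25    # the ResRegex5 timings
```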

gencon


12-11-2013

Moderator

3,791, 1,452

Join Date: Oct 2010

Last Activity: 1 August 2020, 1:38 AM EDT

Posts: 3,791

Thanks Given: 183

Thanked 1,452 Times in 1,302 Posts

Quote:

Originally Posted by gencon

My humblest apologies for taking your name in vain, and incorrectly stating that you were wrong when you wrote that "the code never uses the value of the array element" - clearly you were not wrong, it was me that was wrong and I'm currently feeling somewhat guilty.

No need to feel guilty. I have a feeling I'm much better at solving problems than explaining what I mean; my wife says I'm the most left-brained person she knows.

On your RegEx performance testing.

I've done this sort of analysis myself in the past and have found little difference in the various flavours of RE.

However I did find catching lines in the pattern section (pattern { action }) with a well constructed RE instead of using logic in the action section gave significant performance improvements. Probably because you avoid all the overhead of splitting up the fields assigning NF and such.
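To make the two styles concrete, here is a small sketch of the same filter written both ways. The function names in_pattern and in_action are made up for illustration, and no timing claim is attached to this example; it only shows that the two forms select the same lines.

```shell
# Filter in the pattern section: the action only runs for matching lines.
in_pattern() { awk '/[0-9][0-9][0-9][0-9]+/ { print }'; }

# Same filter done with logic inside the action section.
in_action()  { awk '{ if ($0 ~ /[0-9][0-9][0-9][0-9]+/) print }'; }

printf 'abc 12345\nno digits\n9876\n' | in_pattern
printf 'abc 12345\nno digits\n9876\n' | in_action
```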


Chubler_XL


12-12-2013

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

While doing some further testing, I came up with a few questions. If you had the following input file:

Code:

1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28
11.22.33.44version 55.66.77.99.100.110.120.130.140.150.160.170.180.190
99.88.77.66/55.44.33.22.11/111.112.113.114

what, if any, valid IP addresses would you like your script to report? I'm guessing that none should be found here, but one of the scripts you posted early in this thread will come up with something like the following:

Code:

21.22.23.24
55.44.33.22
17.18.19.20
9.10.11.12
13.14.15.16
25.26.27.28
1.2.3.4
111.112.113.114
100.110.120.130
5.6.7.8
99.88.77.66
11.22.33.44
140.150.160.170

I'm looking at a different way to evaluate possible IP addresses, but I need to know what you want to be required to appear before and after a valid IP address. Am I correct in assuming that a valid IP address should appear at the start of a line or be preceded by a white-space character, be followed by a white-space character or appear at the end of a line, and contain four 1 to 3 digit numbers separated by single occurrences of a period where the values of the numbers are 0 <= number <= 255?

Note that if my assumption is correct, an IP address surrounded by alphabetic or punctuation characters (in addition to slashes) should also be rejected. If my assumption is correct, should an exception be made allowing commas (or comma followed by space) to separate IP addresses?

Are we having fun yet?


Don Cragun


12-12-2013


Hi Don,

Quote:

Originally Posted by Don Cragun

While doing some further testing, I came up with a few questions. If you had the following input file:

Code:

1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28
11.22.33.44version 55.66.77.99.100.110.120.130.140.150.160.170.180.190
99.88.77.66/55.44.33.22.11/111.112.113.114

Code:

21.22.23.24
55.44.33.22
17.18.19.20
9.10.11.12
13.14.15.16
25.26.27.28
1.2.3.4
111.112.113.114
100.110.120.130
5.6.7.8
99.88.77.66
11.22.33.44
140.150.160.170

The current script would not match all those and wouldn't have since I moved to an Awk only solution. I posted about this in post num 17, see paragraph "The code has failed to handle one simple limitation." Post num 17 is here: https://www.unix.com/showpost.php?p=3...9&postcount=17

BUT the key point was changing:

ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"

to:

ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";

Which more greedily matches number dot sequences and leaves it to the conditional in the END section to determine actual IP validity.
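The difference between the two regexes can be seen with a quick sketch. The helper name first_match and the sample text are assumptions for illustration; the two regexes are the ones quoted above.

```shell
# Hypothetical helper: print the first substring of $2 matched by regex $1.
first_match() {
    awk -v re="$1" -v text="$2" 'BEGIN {
        if (match(text, re))
            print substr(text, RSTART, RLENGTH)
    }'
}

# ipSequence stops after four segments, so the result still looks like an IP.
first_match '[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+' 'seq 1.2.3.4.5 end'

# ipLikeSequence swallows the trailing ".5", so the END check can reject it.
first_match '[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*' 'seq 1.2.3.4.5 end'
```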

Putting your input file (above) into the script should have given just this IP address: 11.22.33.44 - that being the only valid IP address in your data (as far as my intentions are concerned).

What it actually gave was this:

Code:

99.88.77.66
111.112.113.114
11.22.33.44

Well done, and thanks: you've spotted a bug. What happened was this:

Code:

Before and after the gsub() calls:
Line In:  99.88.77.66/55.44.33.22.11/111.112.113.114
Line Out: 99.88.77.66x111.112.113.114

The regex encInFwdSlashesNotIP = "[/]" ipLikeSequence "[/]"; replaced the IP surrounded by forward slashes with an x, leaving 2 valid IP addresses on either side of the x - both of which should have been removed because, in the input line, one begins with a forward slash while the other ends with one.

I've modified the code. I had already introduced the self-explanatory beginsWithFwdSlashNotIP and endsWithFwdSlashNotIP regexes back in post number 20 to handle version numbers in URLs (which look like IPs) more robustly. I've now removed encInFwdSlashesNotIP, realizing that it is redundant (which also makes the thread title redundant), and solved the issue by using '/' instead of 'x' as the replacement char in the 'begins with fwd slash' and 'ends with fwd slash' regex gsub() calls. So now I have:

Code:

    BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+";
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence;
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence;
        endsWithFwdSlashNotIP = ipLikeSequence "[/]";
    }
    {
        line = $0;

        gsub(digitSequenceTooLongNotIP, "x", line);
        gsub(versioningNotIP, "x", line);
        gsub(beginsWithFwdSlashNotIP, "/", line);
        gsub(endsWithFwdSlashNotIP, "/", line);

        snippity...snip

and with your input file I now get just 11.22.33.44 which is what I want. The problem line now has all 3 IP addresses which begin or end with a forward slash removed:

Code:

Before and after the gsub() calls:
Line In:  99.88.77.66/55.44.33.22.11/111.112.113.114
Line Out: //
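The before and after behaviour on the problem line can be reproduced in isolation with the sketch below. The function names old_way and new_way are made up here; the regexes are the ones from the post, reduced to just the slash handling.

```shell
# Old approach: one regex for "enclosed in slashes", replaced with "x".
old_way() {
    awk 'BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*"
        encInFwdSlashesNotIP = "[/]" ipLikeSequence "[/]"
    }
    { line = $0; gsub(encInFwdSlashesNotIP, "x", line); print line }'
}

# New approach: two regexes, each replaced with "/" so that the slash
# survives and can still disqualify the neighbouring sequence.
new_way() {
    awk 'BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*"
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence
        endsWithFwdSlashNotIP = ipLikeSequence "[/]"
    }
    {
        line = $0
        gsub(beginsWithFwdSlashNotIP, "/", line)
        gsub(endsWithFwdSlashNotIP, "/", line)
        print line
    }'
}

echo '99.88.77.66/55.44.33.22.11/111.112.113.114' | old_way
echo '99.88.77.66/55.44.33.22.11/111.112.113.114' | new_way
```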

Quote:

Originally Posted by Don Cragun

No, your assumptions are not correct: typical input is HTML, though sometimes just plain text, and sometimes just a file consisting of nothing at all except for a single IP address. The script as a whole retrieves a computer's WAN IP address from behind a router by downloading web pages which display a user's IP address. The script randomly downloads 2 or more such pages to use as verification. Do you want a copy of the whole thing to have a look at?

You are however correct in thinking I want (quote Don): "four 1 to 3 digit numbers separated by single occurrences of a period where the values of the numbers are 0 <= number <= 255". [I know that that notation means the inclusive range of 0..255 but I've never understood why the accepted notation is not "0 >= number <= 255" since what is wanted is greater than or equal to 0 and not less than or equal to 0 which is how it reads to me.]

All of the below are real world examples of valid IPs that should be accepted (many of these are sections of a line and not the whole line which are often quite long):

Code:

Note: My IP address replaced with one of Google's.

<textarea id="do" class="ip">64.15.115.103</textarea>
X-Real-Ip: 64.15.115.103<br>
<input type="text" id="ip" name="ip" value="64.15.115.103"
<b>Your IP: 64.15.115.103&nbsp;</b>
<b><img src="flags/gb.gif"> 64.15.115.103</b><br />
Show Me My IP! - Your IP address is: 64.15.115.103
<tr class="l2"><td>64.15.115.103</td></tr>
<h1 id="current-ip">Current IP: <em>64.15.115.103</em></h1>
var ips = $H({"ip0":"64.15.115.103"});
<h2>Your current IP address is: <span style="color:black">64.15.115.103</span></h2>
        64.15.115.103        <br>
<meta name="DESCRIPTION" content="View my IP information: 64.15.115.103">
<TITLE>Your Ip Address 64.15.115.103</TITLE>
<a href='ipwhois-64.15.115.103.html'>whois 64.15.115.103</a>
'http://www.geobytes.com/IpLocator.htm?GetLocation&amp;template=php3.txt&amp;IpAddress=64.15.115.103');<br />
<p style="text-align: center;"><span style="font-size:24pt"><B>&nbsp;64.15.115.103</B>&nbsp;</p>
<script type="text/javascript" src="http://api.my-proxy.com/ip.js.php?ip=64.15.115.103"></script>
<input name="ip" type="text" value="64.15.115.103"> <input type="submit" value="Get location">
infowindow.setContent('64.15.115.103');

Quote:

Originally Posted by Don Cragun

Are we having fun yet?

Yes Sir.

The current Awk code with some helpful debugging print statements is:

Code:

    BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+";
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence;
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence;
        endsWithFwdSlashNotIP = ipLikeSequence "[/]";
    }
    {
        line = $0;

        printf("Line<%s\n", line);
        gsub(digitSequenceTooLongNotIP, "x", line);
        gsub(versioningNotIP, "x", line);
        gsub(beginsWithFwdSlashNotIP, "/", line);
        gsub(endsWithFwdSlashNotIP, "/", line);
        printf("Line>%s\n", line);

        while (match(line, ipLikeSequence))
        {
            ip = substr(line, RSTART, RLENGTH);
            ipUnique[ip];
            line = substr(line, RSTART + RLENGTH + 1);
            printf("Storing possible IP: %s\n", ip);
        }
    }
    END {
        ipRangeMin = 0;
        ipRangeMax = 255;
        ipNumSegments = 4;
        ipDelimiter = ".";

        for (ipIndex in ipUnique)
        {
            numSegments = split(ipIndex, ipSegments, ipDelimiter);
            if (numSegments == ipNumSegments &&
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)
            {
                printf("Valid IP:   %s\n", ipIndex);
            }
            else
            {
                printf("Invalid IP: %s\n", ipIndex);
            }
        }
    }

Thanks again,

Gencon

Last edited by gencon; 12-12-2013 at 02:03 PM..

gencon


12-13-2013


I can accept your decision on what is allowed as an IP address, but I don't believe that any of the following lines should be thought to contain any valid IP addresses:

Code:

1.2.3.4.5555.6.7.8.9
ab1.2.3.5
1.2.3.6cd
1.2.3.7version...1.2.3.8

although I believe your current script will say that all of the following are valid IP addresses from the above input:

Code:

6.7.8.9
1.2.3.5
1.2.3.6
1.2.3.7

I believe the 1st one is still a bug in your code; the other three are a difference of opinion. I understand (and agree) that an IP address, surrounded by punctuation characters (other than period and slash) needs to be recognized as an IP address, but I don't see the logic behind allowing other characters to be adjacent to an IP address. It also seems that you might want to recognize an IP address followed by a period at the end of a sentence as a valid IP address. (If you want to do that it takes about ten more lines of code in my latest test script.)

We could also debate whether or not an IP address embedded in an HTML tag should be excluded from consideration, but excluding them is a bigger project.

I would like to see your larger test files.

I note that the HTML sample you posted hides a lot of details about what is allowed and disallowed, because the same IP address is used both in places where it is valid and in at least one place where it is not (64.15.115.103.html). I'm not sure whether that one should be excluded just because of the .html or whether it should also be excluded because it is inside an HTML a tag:

Code:

<a href='ipwhois-64.15.115.103.html'>

PS Also try the following with your current script:

Code:

/1.2.3.4/.10.9.8.7

Last edited by Don Cragun; 12-13-2013 at 12:38 AM.. Reason: Add PS.

Don Cragun


12-15-2013



Back to my long promised response to message #17 in this thread:

Quote:

Originally Posted by gencon

Hi Don,

Thanks for such a comprehensive reply. Some points for me to answer and some questions to ask.

On the use of local variables it simply didn't occur to me that this was not standardized. I don't need to use them and will remove them. I have never tried the script with any shell other than Bash but will do so as there may be other issues.

My code is inside a function so no complaint at my end... and in my test file, where the code isn't in a function, I'd removed the 2 local keywords. When posting I've been copy'n'pasting from the actual source code file that it's part of.

BUT on that subject, am I correct in thinking that, even if Bash is not the shell a user has chosen to use, /bin/bash will still be present on all (modern) Unix/Linux systems? Ignoring the fact that, of course, a user could remove it themselves. Bash has certainly been available on every Unix/Linux system that I've used - almost always it being the default shell as well (though I seem to remember changing my own default shell to Bash from sh on a Solaris system circa 1995).

Understood. I don't care whether you use local or not, nor whether you restrict yourself to POSIX defined shell features or not. I just thought that since you're restricting yourself to the subset of POSIX awk features that are supported by default in gawk, you would at least want to know that local is not in the standards.

While I don't know of any modern UNIX- or Linux-systems that do not include bash (or at least have a way to install one if it isn't there by default), I do not know if it would always be installed in /bin.

Quote:

Originally Posted by gencon

If there is a maximum of 1 valid IP address on each line then they get printed in reverse order, if more than 1 valid IP on a single line then the ordering is a mystery to me, but I've not devoted a lot of attention to it as ordering is not an issue for me.

The standards are very clear that when you go through an array using:

Code:

for (index in array)

the order in which array elements are processed is unspecified. We could add code to make sure that we printed them out in the order in which they were encountered or to print them in sorted order. Since you said the output order doesn't matter, I just added the note in case you tried my input and got a different output order from the awk (or gawk) you're using.
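As a rough sketch of the "order in which they were encountered" option mentioned above, a second array indexed by a counter can remember arrival order. The names unique_in_order, seen, and order are assumptions for illustration and do not come from the thread's script.

```shell
# Hypothetical helper: print each distinct input line once, in first-seen order.
unique_in_order() {
    awk '
        !($0 in seen) {
            seen[$0]              # referencing the element creates it
            order[++n] = $0       # remember arrival order separately
        }
        END {
            for (i = 1; i <= n; i++)   # numeric loop, so the order is defined
                print order[i]
        }'
}

printf '3.3.3.3\n1.1.1.1\n3.3.3.3\n2.2.2.2\n' | unique_in_order
```

Looping over a counter instead of using for (index in array) is what makes the output order predictable.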

Quote:

Originally Posted by gencon

Quote:

Originally Posted by Don Cragun
As Chubler_XL suggested earlier, I would change all occurrences of "\\." to "[.]" (except for the one marked in red above). Using the backslash escapes instead of the matching list bracket expression keeps you from reusing the common parts of three of these expressions. The backslashes in the occurrence of "\\." marked in red can just be removed. (The period is just a period in a bracket expression and doesn't need to be escaped.)

I must have overlooked that suggestion, but agreed that's better.

This has been a lengthy discussion; it is easy to miss details. I'm glad I pointed it out again.

Quote:

Originally Posted by gencon

I'm far from being a regex pro, my thought process went like this: I need to match a minimum of 4 digits in a row, so [0-9][0-9][0-9][0-9], then optionally a 5th or more digit so I need [0-9][0-9][0-9][0-9][0-9]*. [0-9][0-9][0-9][0-9]+ is more concise, is it any more efficient?

Which is more efficient would depend on the RE engine used by your version of awk. It contains fewer items to evaluate, so my gut feeling is it should be faster; but gut feelings are frequently wrong when it comes to performance.

As I have hinted in my last two posts, I think a different approach may be better. I believe the gsub() calls in your current script that replace "rejected strings" with an "x" or "/" may lead to unintended false positives.

Quote:

Originally Posted by gencon

[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)? along with skipping the tolower() bit of line = tolower($0), is a better solution (though the nested brackets make it hard for me to read, no probs a comment to explain it will help me understand it if I revisit the code years down the line).

It doesn't save much time though. I had to create a 1 MB file from 50 real world datasets to work it out but the low-down is that it saves approx. 0.6 milliseconds in a typical real world sized dataset. 30 milliseconds when processing the 1 MB file. Not exactly scientific but close enough. What I can tell you as a result is that calls to tolower() in awk are cheap. Discounting whatever other processes were running on my PC and the speed difference between the 2 regexes, 30 ms per 1 MB ain't at all bad IMO. [Note: median value from 5 runs used for both, mine was 115 ms, yours 85 ms.]

There's always a tradeoff. My suggestion is slightly faster. Your code is easier for some people to read. I certainly won't be offended if you decide to skip this suggestion as too confusing to use. However, from your latest scripts, it looks like you're using it.

Quote:

Originally Posted by gencon

Quote:

Originally Posted by Don Cragun
On a completely different note; why did you choose to define the awk script as a single line awk script using backslashes at the end of the awkExtractIPAddresses variable assignment to denote line continuation?

Only because when I tried the code without them it failed, and I did not know that to fix it I just needed to keep the '{' of BEGIN { and END { on one line. Thanks for that, keeping the backslashes nice and neat was a pain - I'm a bit OCD when it comes to things like that.

Every fibre of my being revolts at the mere suggestion of that. It's the C/C++ programmer in me, I just can't do it. If I'm allowed to use ';' at the end of a line then I will do so until the end of time, even under threat of execution or the Spanish Inquisition... well, maybe not then.

[Please note I know you wrote: (but would not have to).]

Note that awk is not C or C++; the rules are different. In C and C++ semicolon is a statement terminator; in awk semicolon is a statement separator. In awk the BEGIN clause:

Code:

BEGIN {
        x = "abc";
        y = "def";
        z = "ghi";
}

is logically equivalent to:

Code:

BEGIN {
        x = "abc"

        y = "def"

        z = "ghi"

}

because the semicolon at the end of each of the assignments separates that assignment from the empty command that follows it. The empty command does nothing, so you get the same results, but why add separators when you don't have anything to separate? If you want to make your awk code look sort of like C, you can do that up to a limit; if you want to make your awk code look like shell code, you can do that up to a limit. Personally, I prefer to make my awk code look like awk code, my shell code look like shell code, and my C code look like C.

The general form of an awk program is one or more occurrences of expression action pairs of the form:

Code:

expression{action}

If {action} is missing, the action defaults to printing every line for which expression evaluates to a non-zero value. But BEGIN and END are special expressions, and the results are unspecified if they have no action associated with them (there is no current line when either of these special expressions evaluates to true). So when you had the backslashes at the end of the lines of your script definition (which turned your entire script into a very long single line awk script), awk saw:

Code:

    BEGIN                                                                                \
    {                                                                                    \
        ipSequence = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";                                \
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9][0-9]*";                        \
        encInFwdSlashesNotIP = "[/][0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+[/]";                \
        versioningNotIP = "(version|ver|v)+[ \\.]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";    \
    }

as a single long line and it worked, but this code:

Code:

    BEGIN
    {
        ipSequence = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9][0-9]*";
        encInFwdSlashesNotIP = "[/][0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+[/]";
        versioningNotIP = "(version|ver|v)+[ \\.]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";
    }

is seen as a BEGIN clause with no action followed by an action with no expression (which will be evaluated once after every input line is read). Redefining these variables to the same values every time a line is read won't do anything bad in this script, but a BEGIN with no action yields a diagnostic message and an abnormal termination of awk on the version of awk I use (on Mac OS X). The same logic holds for the END clause, but running the END clause's action for each line you read will definitely produce unwanted output.
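The two defaults described above (a missing action and a missing expression) can be seen in a small sketch. The function names pattern_only and action_only are made up for illustration.

```shell
# Expression with no action: the default action prints each line for which
# the expression is non-zero (here, the odd-numbered lines).
pattern_only() { awk 'NR % 2'; }

# Action with no expression: the action runs once for every input line,
# which is what happens to a "{ ... }" block left on its own line.
action_only() { awk '{ print NR ": " $0 }'; }

printf 'a\nb\nc\n' | pattern_only
printf 'a\nb\nc\n' | action_only
```

Keeping the opening brace on the same line as BEGIN or END avoids the second case for those clauses.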

Quote:

Originally Posted by gencon

Well one of them does and one of them doesn't.

The one that doesn't is the replacement of 'xxx' with 'x', which was always my intention, but when testing 'xxx' is so much easier to spot when scanning the test output.

Agreed.

Quote:

Originally Posted by gencon

The one that does has got me flummoxed. Your line ipUnique[ip] in:

... ... ...

I believe this point has already been resolved...

Quote:

Originally Posted by gencon

Back to the code debugging - I have spotted another 'bug' / overlooked possibility:

The code has failed to handle one simple limitation. My original (not fully posted) code did actually handle it but not since the move to an awk only solution and stupidly I forgot about it.

Consider (Phlebas) this, no slashes, no out of range numbers, no versioning:

A random number and dot sequence that happens to exist 1.2.3.4.5 but is too long to be an IP address.

ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+" will match just the 1.2.3.4 part of it, ignoring the final ".5", and the IP number of segments and number range checking will confirm it. A number and dot sequence which is definitely not an IP address has been identified as a valid one. Ouch!

Easily solved by simply making a more greedy version of ipSequence:

ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";

With that the whole of 1.2.3.4.5 (and anything too long to be an IP) gets matched, it gets added to the ipUnique array and the code below does not print it because of the if (numSegments == ipNumSegments &&... section which was totally unnecessary before but was added due to my OCD need to check array bounds before evaluation slash good programming practice.

Code:

for (ip in ipUnique)
{
    numSegments = split(ip, ipSegments, ipDelimiter);
    if (numSegments == ipNumSegments &&
        ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
        ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
        ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
        ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)
    {
        print ip;
    }
}

So now we must be at the finished article, or very near it. Wow this has ended up being a long post.

Note that none of the above comparisons against ipRangeMin can ever fail. None of the strings in ipSegments can contain anything other than decimal digits, so they can't yield a negative value (unless there is overflow, and the gsub() with digitSequenceTooLongNotIP avoids that).

If we can anchor the RE matches and make the RE a little more complex, we can guarantee that elements added to ipUnique only contain 4 sequences of one, two, or three digit numbers separated by periods (without the gsubs()). The RE to do this is more concise with a POSIX Conforming awk:

Code:

RE = "^([0-9]{1,3}[.]){3}[0-9]{1,3}$"

but we can get the same results using gawk with the RE:

Code:

RE = "^(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
        "(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
        "(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
        "([0-9]|[0-9][0-9]|[0-9][0-9][0-9])$"

Splitting on space, tab, and punctuation characters (other than period and slash) seems to do what we want as long as strings like "abc1.2.3.4" don't have to be treated as valid IP addresses. And, with the addition of six lines of code, we can have the code recognize that the following lines in an input file:

Code:

IP Addresses: 1.2.3.4,1.2.3.5, 1.2.3.6,1.2.3.7.
No IP Addresses: 1.2.3.8/1.2.3.9/.1.2.3.10
.1.2.3.11
1.2.3.12..
1.2.3.13xyz

contain the four IP addresses:

Code:

1.2.3.4
1.2.3.5
1.2.3.6
1.2.3.7
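The anchored RE described earlier in this post can be tried on single strings with the sketch below. It uses the longer form without interval expressions; the helper name looks_like_ip and the intermediate variable octet are assumptions added for readability. Note that it only checks the shape (four groups of 1 to 3 digits), not the 0 to 255 range, which is still left to the END section.

```shell
# Hypothetical helper: print "yes" if $1 is exactly four 1-3 digit numbers
# separated by single periods, otherwise "no".
looks_like_ip() {
    awk -v s="$1" 'BEGIN {
        octet = "([0-9]|[0-9][0-9]|[0-9][0-9][0-9])"
        RE = "^" octet "[.]" octet "[.]" octet "[.]" octet "$"
        print ((s ~ RE) ? "yes" : "no")
    }'
}

looks_like_ip 1.2.3.4       # shape is right
looks_like_ip 1.2.3.4.5     # too many segments
looks_like_ip 1.2.3.5555    # segment too long
looks_like_ip ab1.2.3.5     # leading letters break the ^ anchor
looks_like_ip 999.1.1.1     # shape is right; range is checked elsewhere
```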

The following script also includes several diagnostic messages showing how it works. Try it and see if this alternative approach is worth considering:

Code:

    BEGIN {
        doubleQuote = "\""
        # Set field separator to one or more adjacent <space>, <tab>, and all
        # punctuation characters in the C Locale except <slash> and <period>.
        FS = "[][ \t!#$%&'()*+,:;<=>?@\^{|}~" doubleQuote "-]+"
        endsInVersion = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[.]*$"
        # Use following definition of possibleIP with gawk:
        possibleIP= "^(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
                        "(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
                        "(([0-9]|[0-9][0-9]|[0-9][0-9][0-9])[.])" \
                        "([0-9]|[0-9][0-9]|[0-9][0-9][0-9])"
        # Use following definition of possibleIP with POSIX-conforming awk:
#       possibleIP= "^([0-9]{1,3}[.]){3}[0-9]{1,3}"
        IPonly = possibleIP "$"
        # Comment out following line if valid IP address followed by a period
        # should not be accepted as a valid IP address.
        IPwithDot = possibleIP "[.]$"
    }
    {
        printf("Line %d, fields %d<%s\n", NR, NF, $0);
        for (i = 1; i <= NF; i++) {
            printf("\tfield %d>%s\n", i, $i);
            if ($i ~ IPonly && $(i - 1) !~ endsInVersion) {
                printf("Storing possible IP:\t%s\n", $i);
                maybeIP[$i]
            # Delete next three lines if an IP address at the end of a sentence
            # (i.e., with a single trailing <period>) should not be accepted as
            # a valid IP address.
            } else if ($i ~ IPwithDot && $(i - 1) !~ endsInVersion) {
                printf("Storing possible IP found at EOS:\t%s\n", $i)
                maybeIP[substr($i, 1, length($i) - 1)]
            }
        }
    }
    END {
        ipDelimiter = "."
        ipRangeMax = 255
        for (ip in maybeIP) {
            split(ip, ipSegments, ipDelimiter)
            if (ipSegments[1] <= ipRangeMax && ipSegments[2] <= ipRangeMax &&
                ipSegments[3] <= ipRangeMax && ipSegments[4] <= ipRangeMax) {
                    printf("Valid IP:\t%s\n", ip);
            } else {
                printf("Invalid IP:\t%s\n", ip);
            }
        }
    }

And in response to your comment in message #26 in this thread:

Quote:

You are however correct in thinking I want (quote Don): "four 1 to 3 digit numbers separated by single occurrences of a period where the values of the numbers are 0 <= number <= 255". [I know that that notation means the inclusive range of 0..255 but I've never understood why the accepted notation is not "0 >= number <= 255" since what is wanted is greater than or equal to 0 and not less than or equal to 0 which is how it reads to me.]

The notation:

Code:

min <= variable <= max

is mathematical shorthand for:

Code:

(min <= variable) && (variable <= max)

which says that variable must have a value between min and max, inclusive. With the range specified as:

Code:

0 >= number <= 255

the expression is shorthand for (0 >= number) && (number <= 255). The first subexpression is true only for numbers less than or equal to 0 (no positive value satisfies it), and the second is true for any number less than or equal to 255 (including -1000). Only zero and negative numbers make both subexpressions true.
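A quick way to see the difference is to evaluate both expanded forms for a few values. This is only a sketch; the function name range_checks and the labels intended and misread are made up for illustration.

```shell
# Hypothetical helper: evaluate both readings of the range for the number $1.
range_checks() {
    awk -v n="$1" 'BEGIN {
        intended = (0 <= n) && (n <= 255)   # 0 <= number <= 255
        misread  = (0 >= n) && (n <= 255)   # 0 >= number <= 255
        printf("%s intended=%d misread=%d\n", n, intended, misread)
    }'
}

for n in -1000 0 128 255 256; do
    range_checks "$n"
done
```

Only 0 satisfies both forms; every other value in the valid IP range fails the second one.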

Cheers,
Don

Last edited by Don Cragun; 12-15-2013 at 11:24 PM.. Reason: Fix discussion of 0 >= number <= 255

Don Cragun

