Grep regex to ignore sequence only if surrounded by fwd-slashes

12-05-2013

I didn't notice when you slipped in local in your variable definitions; local isn't in the POSIX standards. (A proposal to add local to a future revision of the POSIX standards is being discussed, but the shells that provide local variables disagree on both the syntax and the semantics of how it should be done. I'm not confident at this point that this proposal will make it into the standards.)
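
If you ever do want function-private variables without local, one portable option is to write the function body as a subshell, using ( ... ) instead of { ... }. This is only a sketch of that idea and not something taken from your script; the names count_words, n, and word are made up for the example.

```shell
#!/bin/sh
# Sketch: a POSIX-portable substitute for "local".
# The function body is a subshell "( ... )", so assignments made inside it
# vanish when the function returns. All names here are hypothetical.
n="outer value"

count_words() (
    n=0                          # changes only the subshell's copy of n
    for word in "$@"; do
        n=$((n + 1))
    done
    printf '%s\n' "$n"
)

count_words a b c                # prints 3
printf 'n is still: %s\n' "$n"   # the caller's n was not touched

result=$(count_words x y)        # capture the function's output
```

The price is an extra process for each call, and a function written this way cannot set variables in its caller, so it only suits functions that report their results on standard output.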

After adding:

Code:

#!/bin/ksh
tempFileName=${1:-file}
    or
#!/bin/bash
tempFileName=${1:-file}

to the start of the script, changing:

Code:

    local ipAddressMatches=$(awk "$awkExtractIPAddresses" < "$tempFileName")

to:

Code:

    local ipAddressMatches=$(awk "$awkExtractIPAddresses" < "$tempFileName")
    printf "%s\n" "$ipAddressMatches"
                or equivalently
    ipAddressMatches=$(awk "$awkExtractIPAddresses" "$tempFileName")
    printf "%s\n" "$ipAddressMatches"

and saving the script in an executable file named tester, and putting the following sample data in a file named file:

Code:

1.2.3.4
V1.2.3.5
Version 1.2.3.6
VeRsIoNvErSv 1.2.3.7
1.2.3.8 vErSiOn 1.2.3.9 1.2.3.10 1.2.3.257
12.34.56.78 http://12.34.56.79/ 12.34.56.80
12.34.45.100 12.34/12.34.56.101/.56.102 12.34.56.103

neither ksh nor bash (on Mac OS X version 10.7.5) accepted local in this context. ksh said:

Code:

tester: line 3: local: not found
tester: line 43: local: not found

and bash said:

Code:

tester: line 41: local: can only be used in a function
tester: line 43: local: can only be used in a function

After removing both occurrences of local, running tester produced:

Code:

12.34.56.78
12.34.45.100
12.34.56.80
1.2.3.4
12.34.56.103
1.2.3.8
1.2.3.10

which contains exactly the output I expected (in seemingly random order).
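
The order is not really random; awk's for (ip in ipUnique) loop visits the array elements in an unspecified order. If a stable order matters, one possible approach (my suggestion, not part of your script) is to pipe the result through sort, treating each dot-separated field as a number:

```shell
#!/bin/sh
# Sketch: sort dotted-quad addresses numerically, one field at a time.
# The input is the tester output shown above.
sorted=$(printf '%s\n' 12.34.56.78 12.34.45.100 12.34.56.80 1.2.3.4 \
                       12.34.56.103 1.2.3.8 1.2.3.10 |
    sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n)
printf '%s\n' "$sorted"
```

A plain sort would compare the lines as text and put 1.2.3.10 before 1.2.3.4, which is why each key carries the n modifier.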

Your ERE definitions:

Code:

        ipSequence = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9][0-9]*"
        encInFwdSlashesNotIP = "[/][0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+[/]"
        versioningNotIP = "(version|ver|v)+[ \\.]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"

all conform to POSIX ERE requirements. But, I would make several tweaks:

As Chubler_XL suggested earlier, I would change all occurrences of "\\." to "[.]" (except for the one inside the bracket expression "[ \\.]" in versioningNotIP, which was marked in red in the original post). Using the backslash escapes instead of the matching list bracket expression keeps you from reusing the common parts of three of these expressions. The backslashes in the "\\." inside that bracket expression can just be removed. (The period is just a period in a bracket expression and doesn't need to be escaped.)
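
As a small sketch of the reuse I have in mind, the shared part can be written once and the longer expression built by string concatenation. The two sample input lines are invented for the demonstration:

```shell
#!/bin/sh
# Sketch: build a longer ERE from one shared ipSequence string.
# "[.]" needs no backslashes inside an awk string literal.
out=$(printf '%s\n' 'a /1.2.3.4/ b' 'c 5.6.7.8 d' |
awk '
BEGIN {
    ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"
    encInFwdSlashesNotIP = "[/]" ipSequence "[/]"
}
{
    line = $0
    gsub(encInFwdSlashesNotIP, "x", line)    # hide slash-enclosed sequences
    if (match(line, ipSequence))
        print substr(line, RSTART, RLENGTH)  # print the first remaining match
}')
printf '%s\n' "$out"
```

Only 5.6.7.8 is printed, because the slash-enclosed sequence on the first line is replaced by x before the search for ipSequence.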

Why did you use [0-9][0-9]* at the end of digitSequenceTooLongNotIP instead of using [0-9]+?

Why did you use (version|ver|v)+ instead of just (version|ver|v) at the start of versioningNotIP? (Any string that ends with one or more repetitions of "v", "ver", or "version" also ends with a single one of them, so the + doesn't change which lines match.) Note that if you change that expression to:

Code:

[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?

it will match exactly the same strings, but you won't need to translate all of your input records to lowercase with:

Code:

        line = tolower($0)

You could get rid of the line variable completely and just use $0, but changing the above line to:

Code:

        line = $0

rather than just using $0 feels like it would be more efficient (since every update to $0 forces awk to split the record into fields again).
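
To illustrate, here is a short sketch that applies the bracket-expression form to mixed-case input without any tolower() call. The sample lines are a subset of the test data above, and the count printed in front of each line is just the return value of gsub():

```shell
#!/bin/sh
# Sketch: match "v", "ver" or "version" in any letter case without tolower().
out=$(printf '%s\n' 'V1.2.3.5' 'Version 1.2.3.6' 'vErSiOn 1.2.3.9' '1.2.3.4' |
awk '
BEGIN {
    ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"
    versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .]*" ipSequence
}
{
    line = $0                              # work on a copy; $0 is untouched
    n = gsub(versioningNotIP, "x", line)   # n is the number of replacements
    print n ": " line
}')
printf '%s\n' "$out"
```

The first three lines come out as "1: x" and the last as "0: 1.2.3.4", so only the bare address survives for the later ipSequence search.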

On a completely different note, why did you choose to define the awk script as a single-line awk script, using backslashes at the ends of the lines in the awkExtractIPAddresses variable assignment to denote line continuation? If you delete the trailing backslashes and all of the spaces and/or tabs that come just before them, the size of your awk script drops from 3,373 bytes to just over 1,000 bytes, requiring only two other changes in your script:

Code:

    BEGIN                                                                                \
    {                                                                                    \

and:

Code:

    END                                                                                  \
    {                                                                                    \

have to be changed to:

Code:

    BEGIN {

and:

Code:

    END {

respectively. If your code had tabs rather than spaces at the ends of lines before the backslashes, the space savings won't be as drastic but may still be significant, and dropping the backslashes makes it easier to change the script without worrying about keeping them lined up. If you do this, you could (but would not have to) also remove a lot of semicolons from your code.
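
As a minimal sketch of that layout, here is a tiny multi-line awk program held in a shell variable with no continuation backslashes and no semicolons. The awkCountLines name and the program itself are made up for the example:

```shell
#!/bin/sh
# Sketch: a multi-line awk program stored in a shell variable.
# Single quotes preserve the newlines, so no trailing backslashes are
# needed, and a newline ends a statement just as a semicolon would.
awkCountLines='
BEGIN {
    total = 0
}
{
    total++
}
END {
    print "lines: " total
}'

out=$(printf '%s\n' one two three | awk "$awkCountLines")
printf '%s\n' "$out"
```

The opening brace has to stay on the same line as BEGIN and END, and the awk program itself cannot contain a single quote while it is quoted this way.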

If you like these ideas (or at least would like to investigate them further), the following script incorporates these suggestions (and a few tiny changes not worth discussing) and produces exactly the same output:

Code:

#!/bin/bash
tempFileName=${1:-file}

    awkExtractIPAddresses='
    BEGIN {
        ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+"
        encInFwdSlashesNotIP = "[/]" ipSequence "[/]"
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .]*" ipSequence
    }
    {
        line = $0
        gsub(digitSequenceTooLongNotIP, "x", line)
        gsub(encInFwdSlashesNotIP, "x", line)
        gsub(versioningNotIP, "x", line)
        while (match(line, ipSequence)) {
            ip = substr(line, RSTART, RLENGTH)
            ipUnique[ip]
            line = substr(line, RSTART + RLENGTH + 1)
        }
    }
    END {
        ipRangeMin = 0
        ipRangeMax = 255
        ipNumSegments = 4
        ipDelimiter = "."
        for (ip in ipUnique) {
            numSegments = split(ip, ipSegments, ipDelimiter)
            if (numSegments == ipNumSegments &&
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax) {
                    print ip
            }
        }
    }'

    ipAddressMatches=$(awk "$awkExtractIPAddresses" "$tempFileName")

printf "%s\n" "$ipAddressMatches"
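
In case the bare ipUnique[ip] statement looks odd: in awk, merely referencing an array element creates it, so using each address as a subscript collects the unique addresses. A stripped-down sketch of the idiom, with a hypothetical seen array and invented input:

```shell
#!/bin/sh
# Sketch: de-duplicate values by using them as awk array subscripts.
out=$(printf '%s\n' 1.2.3.4 5.6.7.8 1.2.3.4 |
awk '
{
    seen[$0]             # a bare reference creates the element if missing
}
END {
    count = 0
    for (key in seen)    # visits each distinct subscript once
        count++
    print count
}')
printf '%s\n' "$out"
```

Three input lines go in and the count printed is 2, since the repeated address maps to the same subscript.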

I hope this helps...

Don Cragun

12-05-2013

Well done, Don Cragun and Gencon. It's good to see a script followed through to a nice, compliant solution.

Chubler_XL

12-06-2013

Hi Don,

Thanks for such a comprehensive reply. Some points for me to answer and some questions to ask.

On the use of local variables, it simply didn't occur to me that this was not standardized. I don't need to use them and will remove them. I have never tried the script with any shell other than Bash, but I will do so, as there may be other issues.

Quote: