Grep regex to ignore sequence only if surrounded by fwd-slashes

12-02-2013


Quote:

Originally Posted by gencon

Thanks. That is an ever so slightly more elegant solution and the one which I shall in fact use. Don will get over it eventually, I'm sure it's nothing that some intensive therapy won't cure.

Of course Don might point out that yours will not actually work at the moment due to the mysterious disappearance of any actual input.

Thanks all.

I'm over it. No intensive therapy required.

Note that I clearly stated that my suggestion had a limitation because I knew it didn't work with one of your sample lines of input. (It seemed to work because the sample input used the IP address 11.11.11.11 multiple times.) I see no reason to believe that removing the IP-like strings between slashes should have any bad effect, and I agree with using that concept to improve my code.

Do note, however, that although grep -E is required by the standards and grep -E (or egrep or both) is available on any UNIX or Linux implementation, the -E option to sed is not required by the standards and is not always available. But, if this is a concern, the -o option to grep is not required by the standards either and is not always available.

This could be rewritten using only options specified by the standards, but if the systems you care about have sed -E and grep -o, there isn't any reason to spend the time to work it out.
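
One way to find out whether the systems you care about have these extensions is to probe for them at run time. This is only a minimal sketch, assuming a POSIX shell; the function names have_sed_E and have_grep_o are made up for this illustration and are not part of any script in this thread.

```shell
# Hypothetical helpers (not from the thread): each returns success only if the
# local utility accepts the non-standard option and produces the expected result.
have_sed_E() {
    [ "$(printf 'a1b\n' | sed -E 's/[0-9]+/X/' 2>/dev/null)" = "aXb" ]
}

have_grep_o() {
    [ "$(printf 'a1b\n' | grep -o '[0-9]' 2>/dev/null)" = "1" ]
}

if have_sed_E && have_grep_o; then
    echo "sed -E and grep -o are both available"
else
    echo "fall back to a standards-only rewrite"
fi
```

A probe like this only tells you about the machine it runs on, so it does not replace a standards-only rewrite if the script has to run on systems you cannot test.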

Cheers,
Don

Don Cragun


12-02-2013


[EDIT: Forget the below - I'll re-write without using -o, may as well get it right. I wrote the below before noticing Don's latest reply, having spotted the possible problem with sed -E myself. I didn't however realize grep -o is not in the standard.]

Thanks for your help everyone.

The final code is:

Code:

# This sed expression removes IP-like sequences if they are surrounded by a / at each end.
# e.g. src="http://web.com/libs/1.6.1.0/file.js" would have the "/1.6.1.0/" bit removed.
# A basic regex is used to remain POSIX compatible and because the FreeBSD / Apple OS X
# sed uses -E to specify extended regex, while GNU sed uses -r (both may accept both?).
# The "/" needs no backslash because ":" is the s command delimiter; escaping an
# ordinary character is undefined in a POSIX basic regex.
local sedRemoveExpr="s:/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*/::g"

# This grep extended regex will match all remaining IP-like sequences. -E to specify
# extended regex is specified by POSIX and is accepted by all versions of grep.
local grepExtractExpr="[0-9]+\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]+"

# grep options: -E = extended regex, -o = only matching. sort -u removes duplicates.
local ipLikeAddressMatches=$(sed "$sedRemoveExpr" < "$tempFileName" | \
                             grep -Eo "$grepExtractExpr" | sort -u)

Cheers.

Last edited by gencon; 12-02-2013 at 02:40 PM..

gencon


12-02-2013


Perhaps you could get some more portability out of an awk script (you could also test the 0-255 limit on the octets with this awk script if you liked):

Code:

local ipLikeAddressMatches=$(awk '
BEGIN { IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" }
{
 gsub("/"IP_RE"/","")
 while(match($0, IP_RE)) {
    fnd=substr($0,RSTART,RLENGTH);
    if(!have[fnd]++) print fnd;
    $0=substr($0,RSTART+RLENGTH+1);
 }
}' "$tempFileName")
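
The 0-255 test mentioned above could look something like the following. This is only a sketch, separate from the script above; valid_ipv4 is a made-up name, and it assumes one candidate address per input line.

```shell
# Hypothetical filter: print only lines that consist of four dot-separated
# decimal numbers, each in the range 0..255.
valid_ipv4() {
    awk '
    {
        n = split($0, seg, ".")
        ok = (n == 4)
        # Each segment must be all digits and no larger than 255.
        for (i = 1; ok && i <= 4; i++)
            if (seg[i] !~ /^[0-9]+$/ || seg[i] + 0 > 255) ok = 0
        if (ok) print
    }'
}

printf '10.0.0.1\n256.1.1.1\n1.2.3\n' | valid_ipv4   # prints 10.0.0.1
```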

Chubler_XL


12-03-2013


Thanks very much Chubler_XL. That's an excellent way to do it and what I've used with some changes.

Your IP regex code IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" is not portable in practice. POSIX does specify interval expressions such as {1,3} for awk, but many awk implementations do not support them by default; GNU Awk v.3.1.8 (on my system), for instance, requires the --re-interval option to allow their use. I have simply used an extra sed replace expression to remove all number sequences greater than 3 digits in length to get around this.
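
An alternative to the extra sed pass is to spell the interval out with optional digits, which keeps the one-to-three-digit limit inside awk. A minimal sketch, not the code used in this thread; extract_ip_like is a made-up name.

```shell
# [0-9][0-9]?[0-9]? matches one to three digits, the same as [0-9]{1,3},
# but uses only regex operators that awks without interval support also accept.
extract_ip_like() {
    awk '
    BEGIN {
        oct = "[0-9][0-9]?[0-9]?"
        re = "[0-9]+\\." oct "\\." oct "\\.[0-9]+"
    }
    {
        line = $0
        while (match(line, re)) {
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf 'a 10.0.0.1 b 1.2345.6.7 c\n' | extract_ip_like   # prints 10.0.0.1
```

The second sequence is skipped because its second field has four digits, so no separate digit-count cleanup is needed for the middle fields.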

Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.

I also spotted a potential problem with the removal of IP-like addresses enclosed by slashes using sed. Consider the following url:

http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js

The old sed expression (not the one below) would simply remove this bit: /15.5.2.1/

Which would leave behind this: http://web.com/libs/v.15.5.23.12/file.js

Inadvertently, a valid IP address of 15.5.23.12 has been created from the digits on either side of the removed section. Okay, so it's not all that likely to happen regularly, but using 'xxx' as the replacement string, instead of an empty string, in the sed expressions makes sure it won't happen.
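
The effect is easy to reproduce on the command line. A minimal sketch; strip_enclosed is a made-up helper that takes the replacement text as its only argument, and it assumes that text contains no ":" or "&".

```shell
# Hypothetical helper: replace slash-enclosed IP-like sequences read from
# stdin with the text given in $1.
strip_enclosed() {
    sed "s:/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*/:$1:g"
}

url='http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js'

# Empty replacement: the neighbours join up into 15.5.23.12.
printf '%s\n' "$url" | strip_enclosed ''      # http://web.com/libs/v.15.5.23.12/file.js

# 'xxx' replacement: the neighbours stay apart.
printf '%s\n' "$url" | strip_enclosed 'xxx'   # http://web.com/libs/v.15.5xxx.23.12/file.js
```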

I think the code below is now fully Posix compatible, the question is: does it get the thumbs up from Don?

Thanks again for all the help I've been given.

Code:

# This sed expression removes IP-like sequences if they are surrounded by a / at each end.
# e.g. src="http://web.com/libs/1.6.1.0/file.js" would have the "/1.6.1.0/" bit removed.
# A basic regex is used to remain POSIX compatible and because the FreeBSD / Apple OS X
# seds use -E to specify extended regex, while GNU sed uses -r (both may accept both?).
# The use of 'xxx' as the replacement string, instead of using the empty string, is to
# prevent inadvertently creating a valid IP address with digits on either side of the
# removed section. e.g. http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js
# The slashes are left unescaped: ':' is the delimiter, so '/' is an ordinary character.

local sedRemoveEncInSlashes="s:/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*/:xxx:g"

# This sed expression removes number sequences longer than 3 digits. Interval
# expressions, e.g. {1,3}, are not supported by every awk; GNU Awk 3.1.8, for instance,
# requires the --re-interval command option to allow their use. By removing any number
# sequences longer than 3 digits with sed, it is not necessary to check for them in the
# awk code below. 'xxx' is used as the replacement string for the same reason as above.

local sedRemoveNumDigitsOutOfRange="s:[0-9]\{4,\}:xxx:g"

# Combine the two sed removal expressions, so sed can do them with one call.
# Clarity of what the expressions do above, efficiency in processing them below.

local sedRemove="$sedRemoveEncInSlashes;$sedRemoveNumDigitsOutOfRange"

# This awk code will print all IP-like sequences which match the ipRegEx expression. It
# does not check that the numbers within any IP-like sequences are in the valid range of
# an IP address, 0..255, the function Is_Ip_Address(), called below, will do so.
# Note: RSTART and RLENGTH are awk built-in variables, after the match() function has
# been called they will contain the start position and length of the matched string.

local awkExtract='BEGIN { ipRegEx = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }    \
                  {                                                          \
                       line = $0;                                            \
                       while (match(line, ipRegEx))                          \
                       {                                                     \
                            ip = substr(line, RSTART, RLENGTH);              \
                            line = substr(line, RSTART + RLENGTH + 1);       \
                            print ip;                                        \
                       }                                                     \
                  }'

# Extract IP-like sequences from the temp file. sort -u removes duplicates.

local ipLikeAddressMatches=$(sed "$sedRemove" < "$tempFileName" | \
                             awk "$awkExtract" | sort -u)

Last edited by gencon; 12-03-2013 at 02:01 PM..

gencon


12-03-2013


Quote:

Originally Posted by gencon

Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.

No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Perhaps it would be clearer if we replaced it with something like gsub("[/]"IP_RE"[/]","xxx"), making the slashes more distinct from the gsub() form with slash delimiters.
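
The behaviour can be checked on a line that has one slash-enclosed sequence and one plain address. A minimal sketch, with the interval expressions left out so that it also runs on awks without interval support; mask_enclosed is a made-up name.

```shell
mask_enclosed() {
    awk '
    BEGIN { IP_RE = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }
    {
        # "[/]" IP_RE "[/]" is string concatenation: the resulting dynamic
        # regex only matches an IP-like sequence with a slash on each side.
        gsub("[/]" IP_RE "[/]", "xxx")
        print
    }'
}

printf 'src=http://web.com/libs/1.6.1.0/file.js from 11.11.11.11\n' | mask_enclosed
# prints: src=http://web.com/libsxxxfile.js from 11.11.11.11
```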

Also the line if(!have[fnd]++) print fnd; is designed to remove the need for the sort -u as I don't think the -u option is POSIX.
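
The have[] idiom can be tried on its own. A minimal sketch, assuming nothing beyond a standard awk; dedupe is a made-up name. Unlike sort -u, it keeps the first occurrence of each line in input order rather than sorting the result.

```shell
# Print each input line the first time it is seen, keeping input order.
# have[$0]++ evaluates to 0 (false) on first sight, so !have[$0]++ is true once.
dedupe() {
    awk '!have[$0]++'
}

printf '11.11.11.11\n10.0.0.1\n11.11.11.11\n' | dedupe
# prints:
# 11.11.11.11
# 10.0.0.1
```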

Chubler_XL


12-03-2013


Hi gencon and Chubler_XL,
I agree that Chubler_XL is taking the better approach. There is no need to fire up both sed and awk. You just need to fix the gsub() that is accidentally deleting all IP addresses instead of just replacing those surrounded by slashes. I was getting ready to test out the suggestion of using "[/]"IP_RE"[/]" as a fix for the problem you found, but it looks like the two of you beat me to it.

The sort -u option has been in POSIX from the beginning. It was in the first POSIX shell and utilities standard when it was adopted by IEEE in 1992 and by ISO/IEC in 1993. But, since it seems that the need for sort was to remove duplicates and Chubler_XL's script already does that, sort -u isn't needed. In fact, the sort in that pipeline isn't needed unless gencon wants the list in sorted order.

It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).

Don Cragun


12-04-2013


Hi Chubler_XL and Don,

Thanks again guys, your support is appreciated.

Quote:

Originally Posted by Chubler_XL

No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.

Yes, my apologies. I misread it as having the double quotes escaped. Oops. That's now back in the code.

I've also used the if(!have[fnd]++) print fnd; code (conceptually anyway), and removed the pipe to: sort -u

Quote:

Originally Posted by Don Cragun

I agree that Chubler_XL is taking the better approach.

No doubt at all.

I've now had a few mins to alter the code to perform the whole operation with awk, including IP number range checking which is another sensible suggestion of Chubler_XL's.

It also seemed sensible to add a regex to spot (obvious) version numbers and remove them as well; see the regex in versioningNotIP. After all, if I'm ignoring version numbers in URLs then I may as well ignore IP-like sequences if they follow Version, Ver, V. and so on. The input line is therefore converted to lower case so that the regex works whatever the case.

Quote:

Originally Posted by Don Cragun

It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).

I hope so.

Have a look at what may be the finished, fully Posix compliant, article. Thumbs up if all okay please guys. If not, I will persevere and fix anything that needs fixing.

Just remembered one thing I'm not 100% sure about. In the versioningNotIP regex I've enclosed the OR variations in (). I couldn't find online whether that is acceptable with Posix awk, is it?
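
Grouping with parentheses and alternation with | are part of POSIX extended regular expressions, which is the syntax awk regexes use, so the construct should be acceptable. A quick way to check it on a particular awk is sketched below; mask_versions is a made-up name and the pattern is a cut-down variant of versioningNotIP, not the exact one from the script.

```shell
mask_versions() {
    awk '
    BEGIN { re = "(version|ver|v)[ .]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }
    {
        # Lower-case the line first so that Version, VER and v all match.
        line = tolower($0)
        gsub(re, "xxx", line)
        print line
    }'
}

printf 'Version 1.6.1.0 on host 10.0.0.1\n' | mask_versions
# prints: xxx on host 10.0.0.1
```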

Code:

    local awkExtractIPAddresses='                                                        \
    BEGIN                                                                                \
    {                                                                                    \
        ipSequence = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";                                \
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9][0-9]*";                        \
        encInFwdSlashesNotIP = "[/][0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+[/]";                \
        versioningNotIP = "(version|ver|v)+[ \\.]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";    \
    }                                                                                    \
    {                                                                                    \
        line = tolower($0);                                                              \
        gsub(digitSequenceTooLongNotIP, "xxx", line);                                    \
        gsub(encInFwdSlashesNotIP, "xxx", line);                                         \
        gsub(versioningNotIP, "xxx", line);                                              \
        while (match(line, ipSequence))                                                  \
        {                                                                                \
            ip = substr(line, RSTART, RLENGTH);                                          \
            ipUnique[ip] = ip;                                                           \
            line = substr(line, RSTART + RLENGTH + 1);                                   \
        }                                                                                \
    }                                                                                    \
    END                                                                                  \
    {                                                                                    \
        ipRangeMin = 0;                                                                  \
        ipRangeMax = 255;                                                                \
        ipNumSegments = 4;                                                               \
        ipDelimiter = ".";                                                               \
        for (ip in ipUnique)                                                             \
        {                                                                                \
            numSegments = split(ip, ipSegments, ipDelimiter);                            \
            if (numSegments == ipNumSegments &&                                          \
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&            \
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&            \
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&            \
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)              \
            {                                                                            \
                print ip;                                                                \
            }                                                                            \
        }                                                                                \
    }'

    local ipAddressMatches=$(awk "$awkExtractIPAddresses" < "$tempFileName")

Cheers.

gencon
