Grep regex to ignore sequence only if surrounded by fwd-slashes


 
# 8  
Old 12-02-2013
Quote:
Originally Posted by gencon
Thanks. That is an ever so slightly more elegant solution and the one which I shall in fact use. Don will get over it eventually, I'm sure it's nothing that some intensive therapy won't cure.

Of course Don might point out that yours will not actually work at the moment due to the mysterious disappearance of any actual input.

Thanks all.
I'm over it. No intensive therapy required.

Note that I clearly stated that my suggestion had a limitation because I knew it didn't work with one of your sample lines of input. (It seemed to work because the sample input used the ip address 11.11.11.11 multiple times.) I see no reason to believe that the removal of the ip-like strings between slashes should have any bad effect and agree with using that concept to improve my code.

Do note, however, that although grep -E is required by the standards and grep -E (or egrep or both) is available on any UNIX or Linux implementation, the -E option to sed is not required by the standards and is not always available. But, if this is a concern, the -o option to grep is not required by the standards either and is not always available.

This could be rewritten using options only available in the standards, but if the systems you care about have sed -E and grep -o there isn't any reason to spend the time to work it out.
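
If it is unclear whether a target system has these two options, one way to find out is to probe for them at run time. The following is only a sketch; the helper names has_sed_E and has_grep_o are made up for illustration and do not come from the thread.

```shell
# Probe for the two non-standard options discussed above.
# has_sed_E and has_grep_o are hypothetical helper names.
has_sed_E() {
    # Succeeds only if sed accepts -E and applies an extended regex.
    [ "$(printf 'a1\n' | sed -E 's/[0-9]+/N/' 2>/dev/null)" = "aN" ]
}

has_grep_o() {
    # Succeeds only if grep accepts -o and prints just the matched text.
    [ "$(printf 'a1b\n' | grep -o '[0-9]' 2>/dev/null)" = "1" ]
}

if has_sed_E && has_grep_o; then
    echo "sed -E and grep -o are both available"
else
    echo "fall back to basic regexes or awk"
fi
```

A script could use the result to choose between the sed -E / grep -o pipeline and a version restricted to standard options.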

Cheers,
Don
# 9  
Old 12-02-2013
[EDIT: Forget the below - I'll re-write without using -o, may as well get it right. I wrote the below before noticing Don's latest reply, having spotted the possible problem with sed -E myself. I didn't however realize grep -o is not in the standard.]

Thanks for your help everyone.

The final code is:
Code:
# This sed expression removes IP-like sequences if they are surrounded by a / at each end.
# e.g. src="http://web.com/libs/1.6.1.0/file.js" would have the "/1.6.1.0/" bit removed.
# A basic regex is used to remain Posix compatible and because the FreeBSD / Apple OS X
# sed uses -E to specify extended regex, while GNU sed uses -r (both may accept both?).
local sedRemoveExpr="s:\/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*\/::g"

# This grep extended regex will match all remaining IP-like sequences. -E to specify
# extended regex is specified by Posix and is used in all versions of grep.
local grepExtractExpr="[0-9]+\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]+"

# grep options: -E = extended regex, -o = only matching. sort -u removes duplicates.
local ipLikeAddressMatches=$(sed "$sedRemoveExpr" < "$tempFileName" | \
                             grep -Eo "$grepExtractExpr" | sort -u)

Cheers.

Last edited by gencon; 12-02-2013 at 02:40 PM..
# 10  
Old 12-02-2013
Perhaps you could get some more portability out of an awk script (you could also test the 0-255 limit on the octets with this awk script if you liked):

Code:
local ipLikeAddressMatches=$(awk '
BEGIN { IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" }
{
 gsub("/"IP_RE"/","")
 while(match($0, IP_RE)) {
    fnd=substr($0,RSTART,RLENGTH);
    if(!have[fnd]++) print fnd;
    $0=substr($0,RSTART+RLENGTH+1);
 }
}' "$tempFileName")

# 11  
Old 12-03-2013
Thanks very much Chubler_XL. That's an excellent way to do it and what I've used with some changes.

Your IP regex code IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" is not portable. Although POSIX specifies extended regular expressions for awk, which include interval expressions such as {1,3}, not every awk implementation supports them by default. GNU Awk v.3.1.8 (on my system), for instance, requires the --re-interval option to allow their use. I have simply used an extra sed replace expression to remove all number sequences greater than 3 digits in length to get around this.
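
Another way around the interval-expression problem, not used in this thread, is to spell the interval out by hand: [0-9]{1,3} accepts the same strings as [0-9][0-9]?[0-9]?. The sketch below assumes that substitution; extract_ip_like is a hypothetical helper name.

```shell
# Sketch: build the IP-like regex without interval expressions.
# extract_ip_like is a hypothetical helper name, not from the thread.
extract_ip_like() {
    awk '
    BEGIN {
        # Hand-expanded form of [0-9]{1,3}
        octet = "[0-9][0-9]?[0-9]?"
        ip_re = octet "\\." octet "\\." octet "\\." octet
    }
    {
        line = $0
        # Print every IP-like match on the line, left to right.
        while (match(line, ip_re)) {
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf 'host 192.168.1.10 up\n' | extract_ip_like
# prints 192.168.1.10
```

One caveat: unlike replacing long digit runs with 'xxx' first, this pattern will happily match the tail of a longer run, so 12345.1.1.1 yields 345.1.1.1.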

Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.

I also spotted a potential problem with the removal of IP-like addresses enclosed by slashes using sed. Consider the following url:

http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js

The old sed expression (not the one below) would simply remove this bit: /15.5.2.1/

Which would leave behind this: http://web.com/libs/v.15.5.23.12/file.js

Inadvertently, a valid IP address of 15.5.23.12 has been created from the digits on either side of the removed section. Okay, so it's not all that likely to happen regularly, but using 'xxx' as the replacement string, instead of an empty string, in the sed expressions makes sure it won't happen.
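
The effect is easy to reproduce on the sample URL. In this sketch the variable and function names (url, slash_ip, strip_slash_ip) are illustrative only, and the slashes in the pattern are written unescaped because ':' is the sed delimiter.

```shell
# Sketch: compare an empty replacement with 'xxx' on the sample URL.
# url, slash_ip and strip_slash_ip are illustrative names.
url='http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js'
slash_ip='/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*/'

strip_slash_ip() {
    # $1 is the replacement string for each slash-enclosed IP-like sequence.
    sed "s:$slash_ip:$1:g"
}

# Empty replacement: the neighbours of the removed section fuse together.
printf '%s\n' "$url" | strip_slash_ip ''
# prints http://web.com/libs/v.15.5.23.12/file.js

# 'xxx' replacement: the neighbours stay apart.
printf '%s\n' "$url" | strip_slash_ip 'xxx'
# prints http://web.com/libs/v.15.5xxx.23.12/file.js
```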

I think the code below is now fully Posix compatible, the question is: does it get the thumbs up from Don?

Thanks again for all the help I've been given.

Code:
# This sed expression removes IP-like sequences if they are surrounded by a / at each end.
# e.g. src="http://web.com/libs/1.6.1.0/file.js" would have the "/1.6.1.0/" bit removed.
# A basic regex is used to remain Posix compatible and because the FreeBSD / Apple OS X
# seds use -E to specify extended regex, while GNU sed uses -r (both may accept both?).
# The use of 'xxx' as the replacement string, instead of using the empty string, is to
# prevent inadvertently creating a valid IP address with digits on either side of the
# removed section. e.g. http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js

local sedRemoveEncInSlashes="s:\/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*\/:xxx:g"

# This sed expression removes number sequences longer than 3 digits. Interval
# expressions, e.g. {1,3}, are not supported by default in every awk; GNU Awk 3.1.8,
# for instance, requires the --re-interval command option to allow their use. By removing
# any number sequences longer than 3 digits with sed, it is not necessary to check for
# them in the awk code below. 'xxx' is used as the replacement string for the same reason as above.

local sedRemoveNumDigitsOutOfRange="s:[0-9]\{4,\}:xxx:g"

# Combine the two sed removal expressions, so sed can do them with one call.
# Clarity of what the expressions do above, efficiency in processing them below.

local sedRemove="$sedRemoveEncInSlashes;$sedRemoveNumDigitsOutOfRange"

# This awk code will print all IP-like sequences which match the ipRegEx expression. It
# does not check that the numbers within any IP-like sequences are in the valid range of
# an IP address, 0..255, the function Is_Ip_Address(), called below, will do so.
# Note: RSTART and RLENGTH are awk built-in variables, after the match() function has
# been called they will contain the start position and length of the matched string.

local awkExtract='BEGIN { ipRegEx = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }    \
                  {                                                          \
                       line = $0;                                            \
                       while (match(line, ipRegEx))                          \
                       {                                                     \
                            ip = substr(line, RSTART, RLENGTH);              \
                            line = substr(line, RSTART + RLENGTH + 1);       \
                            print ip;                                        \
                       }                                                     \
                  }'

# Extract IP-like sequences from the temp file. sort -u removes duplicates.

local ipLikeAddressMatches=$(sed "$sedRemove" < "$tempFileName" | \
                             awk "$awkExtract" | sort -u)


Last edited by gencon; 12-03-2013 at 02:01 PM..
# 12  
Old 12-03-2013
Quote:
Originally Posted by gencon
Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Perhaps it would be clearer if we replaced it with something like gsub("[/]"IP_RE"[/]","xxx") making the slashes more distinct from the gsub() form with slash delimiters.

Also the line if(!have[fnd]++) print fnd; is designed to remove the need for the sort -u as I don't think the -u option is POSIX.
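
Put together, the two points above might look like the following single awk pass. This is a sketch rather than the exact script from the thread: the function name extract_unique_ip_like is hypothetical, and interval expressions are left out of the regex so that it runs on awks without them.

```shell
# Sketch: one awk pass that masks slash-enclosed sequences and prints
# each remaining IP-like sequence once, in first-seen order.
# extract_unique_ip_like is a hypothetical name.
extract_unique_ip_like() {
    awk '
    BEGIN { ip_re = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }
    {
        line = $0
        # Match slash + IP-like sequence + slash and mask it with xxx.
        gsub("[/]" ip_re "[/]", "xxx", line)
        while (match(line, ip_re)) {
            found = substr(line, RSTART, RLENGTH)
            # have[] counts sightings; print only on the first one.
            if (!have[found]++) print found
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

printf '%s\n' \
    'src="http://web.com/libs/1.6.1.0/file.js" from 11.11.11.11' \
    'again 11.11.11.11 and 10.0.0.1' | extract_unique_ip_like
# prints 11.11.11.11 then 10.0.0.1
```

Because the have[] array does the de-duplication, the output keeps first-seen order instead of sorted order.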
# 13  
Old 12-03-2013
Quote:
Originally Posted by Chubler_XL
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Perhaps it would be clearer if we replaced it with something like gsub("[/]"IP_RE"[/]","xxx") making the slashes more distinct from the gsub() form with slash delimiters.

Also the line if(!have[fnd]++) print fnd; is designed to remove the need for the sort -u as I don't think the -u option is POSIX.
Hi gencon and Chubler_XL,
I agree that Chubler_XL is taking the better approach. There is no need to fire up both sed and awk. You just need to fix the gsub() that is accidentally deleting all IP addresses instead of just replacing those surrounded by slashes. I was getting ready to test out the suggestion of using "[/]"IP_RE"[/]" as a fix for the problem you found, but it looks like the two of you beat me to it.

The sort -u option has been in POSIX from the beginning. It was in the first POSIX shell and utilities standard when it was adopted by IEEE in 1992 and by ISO/IEC in 1993. But, since it seems that the need for sort was to remove duplicates and Chubler_XL's script already does that, sort -u isn't needed. In fact, the sort in that pipeline isn't needed unless gencon wants the list in sorted order.

It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).
# 14  
Old 12-04-2013
Hi Chubler_XL and Don,

Thanks again guys, your support is appreciated.

Quote:
Originally Posted by Chubler_XL
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Yes, my apologies. I misread it as having the double quotes escaped. Oops. That's now back in the code.

I've also used the if(!have[fnd]++) print fnd; code (conceptually anyway), and removed the pipe to: sort -u

Quote:
Originally Posted by Don Cragun
I agree that Chubler_XL is taking the better approach.
No doubt at all.

I've now had a few mins to alter the code to perform the whole operation with awk, including IP number range checking which is another sensible suggestion of Chubler_XL's.

It also seemed sensible to add a regex to spot (obvious) version numbers and remove them as well; see the regex in versioningNotIP. After all, if I'm ignoring version numbers in URLs then I may as well ignore IP-like sequences if they follow Version, Ver, V. and so on. The input line is therefore converted to lower case so that the regex works whatever the case.

Quote:
Originally Posted by Don Cragun
It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).
I hope so.

Have a look at what may be the finished, fully Posix compliant, article. Thumbs up if all okay please guys. If not, I will persevere and fix anything that needs fixing.

Just remembered one thing I'm not 100% sure about. In the versioningNotIP regex I've enclosed the OR variations in (). I couldn't find online whether that is acceptable with Posix awk, is it?
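
For what it's worth, grouping with ( ) and alternation with | are part of POSIX extended regular expressions, which is the regex flavour specified for awk, so the construct should be acceptable. A quick way to check it on a given awk is sketched below; strip_versions is a hypothetical name and the pattern is a slightly simplified form of versioningNotIP.

```shell
# Sketch: check that grouping and alternation work in this awk.
# strip_versions is a hypothetical name; ver_re is a simplified
# form of the versioningNotIP pattern.
strip_versions() {
    awk '
    BEGIN { ver_re = "(version|ver|v)[ .]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }
    {
        # Lower-case first so Version, VER and v. all match.
        line = tolower($0)
        gsub(ver_re, "xxx", line)
        print line
    }'
}

printf 'Version 1.2.3.4 on host 10.0.0.1\n' | strip_versions
# prints xxx on host 10.0.0.1
```

Note that the pattern also fires on any word ending in v directly before the digits, so srv10.0.0.1 would be masked as well.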

Code:
    local awkExtractIPAddresses='                                                        \
    BEGIN                                                                                \
    {                                                                                    \
        ipSequence = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";                                \
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9][0-9]*";                        \
        encInFwdSlashesNotIP = "[/][0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+[/]";                \
        versioningNotIP = "(version|ver|v)+[ \\.]*[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+";    \
    }                                                                                    \
    {                                                                                    \
        line = tolower($0);                                                              \
        gsub(digitSequenceTooLongNotIP, "xxx", line);                                    \
        gsub(encInFwdSlashesNotIP, "xxx", line);                                         \
        gsub(versioningNotIP, "xxx", line);                                              \
        while (match(line, ipSequence))                                                  \
        {                                                                                \
            ip = substr(line, RSTART, RLENGTH);                                          \
            ipUnique[ip] = ip;                                                           \
            line = substr(line, RSTART + RLENGTH + 1);                                   \
        }                                                                                \
    }                                                                                    \
    END                                                                                  \
    {                                                                                    \
        ipRangeMin = 0;                                                                  \
        ipRangeMax = 255;                                                                \
        ipNumSegments = 4;                                                               \
        ipDelimiter = ".";                                                               \
        for (ip in ipUnique)                                                             \
        {                                                                                \
            numSegments = split(ip, ipSegments, ipDelimiter);                            \
            if (numSegments == ipNumSegments &&                                          \
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&            \
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&            \
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&            \
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)              \
            {                                                                            \
                print ip;                                                                \
            }                                                                            \
        }                                                                                \
    }'

    local ipAddressMatches=$(awk "$awkExtractIPAddresses" < "$tempFileName")
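
The range check from the END block can also be tried in isolation. The sketch below is a compact variant, not the code above: it reads one candidate per line, and the names filter_valid_ipv4 and in_range are hypothetical.

```shell
# Sketch: keep only candidates with four segments, each in 0..255.
# filter_valid_ipv4 and in_range are hypothetical names.
filter_valid_ipv4() {
    awk '
    # True when s is all digits and its numeric value is in 0..255.
    function in_range(s) { return s ~ /^[0-9]+$/ && s + 0 >= 0 && s + 0 <= 255 }
    {
        n = split($0, seg, ".")
        if (n == 4 && in_range(seg[1]) && in_range(seg[2]) &&
            in_range(seg[3]) && in_range(seg[4]))
            print
    }'
}

printf '%s\n' 10.0.0.1 256.1.1.1 1.2.3 192.168.1.255 | filter_valid_ipv4
# prints 10.0.0.1 then 192.168.1.255
```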

Cheers.