Grep regex to ignore sequence only if surrounded by fwd-slashes Post: 302877823

Post 302877823 by gencon on Tuesday 3rd of December 2013 12:49:32 PM

Thanks very much Chubler_XL. That's an excellent way to do it and what I've used with some changes.

Your IP regex code IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" is not portable across awk implementations. POSIX does specify interval expressions, e.g. {1,3}, for awk's extended regular expressions, but many awks do not enable them by default. GNU Awk 3.1.8 (on my system), for instance, requires the --re-interval option to allow their use. To get around this, I have simply added an extra sed replace expression that replaces every number sequence longer than 3 digits, so the awk regex needs no interval expressions.
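A quick way to check how a given awk treats interval expressions is to feed it a pattern that can only match if {3} is read as a repetition count. This is only a sketch: the helper name awk_supports_intervals is hypothetical and not part of the script under discussion.

```shell
# Hypothetical helper: prints "yes" if the awk on PATH treats {3} as an
# interval expression, "no" if it treats the braces as literal characters.
awk_supports_intervals() {
    # 'aaa' matches /^a{3}$/ only when {3} means "exactly three times".
    if printf 'aaa\n' | awk '/^a{3}$/ { found = 1 } END { exit !found }'; then
        echo yes
    else
        echo no
    fi
}

awk_supports_intervals
```

If this prints "no", the awk regex has to avoid {m,n} altogether, which is what the sed pre-processing step is for.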

Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.

I also spotted a potential problem with the removal of IP-like addresses enclosed by slashes using sed. Consider the following URL:

http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js

The old sed expression (not the one below) would simply remove this bit: /15.5.2.1/

Which would leave behind this: http://web.com/libs/v.15.5.23.12/file.js

Inadvertently, a valid IP address of 15.5.23.12 has been created from the digits on either side of the removed section. It's not likely to happen often, but using 'xxx' as the replacement string in the sed expressions, instead of an empty string, makes sure it can't happen.
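To see the difference, here is a minimal sketch that runs the example URL through both variants. The variable names are made up for this illustration; the two sed commands differ only in their replacement string.

```shell
url='http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js'

# Basic regex for an IP-like sequence enclosed in slashes. ':' is used as
# the sed delimiter, so the slashes need no escaping here.
slashIpRe='/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*/'

# Empty replacement: the digits on either side fuse into 15.5.23.12.
withEmpty=$(printf '%s\n' "$url" | sed "s:$slashIpRe::g")

# 'xxx' replacement: the neighbouring digits stay separated.
withXxx=$(printf '%s\n' "$url" | sed "s:$slashIpRe:xxx:g")

printf '%s\n%s\n' "$withEmpty" "$withXxx"
# http://web.com/libs/v.15.5.23.12/file.js
# http://web.com/libs/v.15.5xxx.23.12/file.js
```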

I think the code below is now fully POSIX compatible. The question is: does it get the thumbs up from Don?

Thanks again for all the help I've been given.

Code:

# This sed expression removes IP-like sequences if they are surrounded by a / at each end.
# e.g. src="http://web.com/libs/1.6.1.0/file.js" would have the "/1.6.1.0/" bit removed.
# A basic regex is used to remain POSIX compatible and because the FreeBSD / Apple OS X
# seds use -E to specify extended regex, while GNU sed uses -r (both may accept both?).
# The use of 'xxx' as the replacement string, instead of using the empty string, is to
# prevent inadvertently creating a valid IP address with digits on either side of the
# removed section. e.g. http://web.com/libs/v.15.5/15.5.2.1/.23.12/file.js

local sedRemoveEncInSlashes="s:\/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*\/:xxx:g"

# This sed expression removes number sequences longer than 3 digits. POSIX specifies
# interval expressions, e.g. {1,3}, for awk, but not every awk enables them by default:
# GNU Awk 3.1.8, for instance, requires the --re-interval command option to allow their
# use. By removing any number sequences longer than 3 digits with sed, it is not
# necessary to check for them in the awk code below. 'xxx' is used as the replacement
# string for the same reason as above.

local sedRemoveNumDigitsOutOfRange="s:[0-9]\{4,\}:xxx:g"

# Combine the two sed removal expressions, so sed can do them with one call.
# Clarity of what the expressions do above, efficiency in processing them below.

local sedRemove="$sedRemoveEncInSlashes;$sedRemoveNumDigitsOutOfRange"

# This awk code will print all IP-like sequences which match the ipRegEx expression. It
# does not check that the numbers within any IP-like sequences are in the valid range of
# an IP address, 0..255, the function Is_Ip_Address(), called below, will do so.
# Note: RSTART and RLENGTH are awk built-in variables, after the match() function has
# been called they will contain the start position and length of the matched string.

local awkExtract='BEGIN { ipRegEx = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }    \
                  {                                                          \
                       line = $0;                                            \
                       while (match(line, ipRegEx))                          \
                       {                                                     \
                            ip = substr(line, RSTART, RLENGTH);              \
                            line = substr(line, RSTART + RLENGTH + 1);       \
                            print ip;                                        \
                       }                                                     \
                  }'

# Extract IP-like sequences from the temp file. sort -u removes duplicates.

local ipLikeAddressMatches=$(sed "$sedRemove" < "$tempFileName" | \
                             awk "$awkExtract" | sort -u)
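For anyone who wants to try the pipeline outside the function, here is a stand-alone sketch. The sed and awk expressions are the same as above, minus `local` and the line continuations. The sample input is invented, and the final awk filter is only a hypothetical stand-in for the 0..255 range check that Is_Ip_Address() performs; it is not that function.

```shell
#!/bin/sh
# Both sed substitutions in one script, identical to sedRemove above.
sedRemove='s:\/[0-9]*\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]*\/:xxx:g;s:[0-9]\{4,\}:xxx:g'

# Print every IP-like sequence found on each input line.
awkExtract='BEGIN { ipRegEx = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" }
{
    line = $0
    while (match(line, ipRegEx)) {
        print substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH + 1)
    }
}'

# Invented sample input standing in for the downloaded page.
tempFileName=$(mktemp)
cat > "$tempFileName" <<'EOF'
<script src="http://web.com/libs/1.6.1.0/file.js"></script>
Your IP address is 203.0.113.7
Build 12345.1.2.3 from 192.0.2.300
Again: 203.0.113.7
EOF

ipLikeAddressMatches=$(sed "$sedRemove" < "$tempFileName" | awk "$awkExtract" | sort -u)

# Hypothetical range check (NOT the real Is_Ip_Address): keep only
# candidates whose four fields are all 255 or less.
validIps=$(printf '%s\n' "$ipLikeAddressMatches" |
           awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255')

rm -f "$tempFileName"
printf '%s\n' "$validIps"
```

With this input, the slash-enclosed 1.6.1.0 and the 5-digit build number are neutralised by sed, the awk stage yields 192.0.2.300 and 203.0.113.7, and only 203.0.113.7 survives the range filter.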

Last edited by gencon; 12-03-2013 at 02:01 PM..
