Grep regex to ignore sequence only if surrounded by fwd-slashes Post: 302879424

Post 302879424 by gencon on Thursday 12th of December 2013 11:07:14 AM

Hi Don,

Quote:

Originally Posted by Don Cragun

While doing some further testing, I came up with a few questions. If you had the following input file:

Code:

1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28
11.22.33.44version 55.66.77.99.100.110.120.130.140.150.160.170.180.190
99.88.77.66/55.44.33.22.11/111.112.113.114

what, if any, valid IP addresses would you like your script to report? I'm guessing that none should be found here, but one of the scripts you posted early in this thread will come up with something like the following:

Code:

21.22.23.24
55.44.33.22
17.18.19.20
9.10.11.12
13.14.15.16
25.26.27.28
1.2.3.4
111.112.113.114
100.110.120.130
5.6.7.8
99.88.77.66
11.22.33.44
140.150.160.170

The current script would not match all of those, and has not since I moved to an Awk-only solution. I covered this in post number 17; see the paragraph beginning "The code has failed to handle one simple limitation." Post number 17 is here: https://www.unix.com/showpost.php?p=3...9&postcount=17

BUT the key point was changing:

ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"

to:

ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";

This matches digit-and-dot sequences more greedily and leaves it to the conditional in the END section to determine actual IP validity.
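To see the difference between the two patterns in isolation, here is a minimal sketch. It assumes a POSIX awk on the PATH; the wrapper function first_match is a hypothetical name used only for this illustration and is not part of the script.

```shell
#!/bin/sh
# first_match is a hypothetical helper: it prints the first substring of
# each input line that matches the regex given as $1.
first_match() {
    awk -v re="$1" '{ if (match($0, re)) print substr($0, RSTART, RLENGTH) }'
}

# Old pattern: stops after four digit groups, so a fragment of a longer
# run looks like a plausible IP address.
printf '%s\n' '1.2.3.4.5.6.7.8' |
    first_match '[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+'          # prints 1.2.3.4

# New pattern: swallows the whole run, so the segment count can be
# checked (and the run rejected) afterwards.
printf '%s\n' '1.2.3.4.5.6.7.8' |
    first_match '[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*'   # prints 1.2.3.4.5.6.7.8
```

With the old pattern the eight-segment run yields a four-segment fragment that passes the range test; with the new pattern the whole run is captured and fails the "exactly four segments" test.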

Putting your input file (above) into the script should have given just this IP address: 11.22.33.44 - that being the only valid IP address in your data (as far as my intentions are concerned).

What it actually gave was this:

Code:

99.88.77.66
111.112.113.114
11.22.33.44

Well done, and thanks: you've spotted a bug. What happened was this:

Code:

Before and after the gsub() calls:
Line In:  99.88.77.66/55.44.33.22.11/111.112.113.114
Line Out: 99.88.77.66x111.112.113.114

The regex encInFwdSlashesNotIP = "[/]" ipLikeSequence "[/]"; replaced the IP-like sequence surrounded by forward slashes with an x, leaving two valid IP addresses, one on either side of the x. Both should have been removed because, in the input line, one begins with a forward slash while the other ends with one.

I've modified the code. I had already introduced the self-explanatory beginsWithFwdSlashNotIP and endsWithFwdSlashNotIP regexes back in post number 20 to handle version numbers in URLs (which look like IPs) more robustly. I've now removed enclosedInFwdSlashesNotIP, having realized that it is redundant (which also makes the thread title redundant), and solved the issue by using '/' instead of 'x' as the replacement character in the 'begins with fwd slash' and 'ends with fwd slash' gsub() calls. Because the slash stays in the line, the second gsub() can still see it. So now I have:

Code:

    BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+";
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence;
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence;
        endsWithFwdSlashNotIP = ipLikeSequence "[/]";
    }
    {
        line = $0;

        gsub(digitSequenceTooLongNotIP, "x", line);
        gsub(versioningNotIP, "x", line);
        gsub(beginsWithFwdSlashNotIP, "/", line);
        gsub(endsWithFwdSlashNotIP, "/", line);

        snippity...snip

With your input file I now get just 11.22.33.44, which is what I want. The problem line now has all three IP-like sequences that begin or end with a forward slash removed:

Code:

Before and after the gsub() calls:
Line In:  99.88.77.66/55.44.33.22.11/111.112.113.114
Line Out: //
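The effect of the replacement character can be reproduced on its own. This is a minimal sketch, assuming a POSIX awk and using only the two slash regexes from the code above; the wrapper strip_slashed and its rep variable are hypothetical names for this illustration.

```shell
#!/bin/sh
# strip_slashed is a hypothetical helper: it applies only the two
# forward-slash gsub() calls, using $1 as the replacement string.
strip_slashed() {
    awk -v rep="$1" '
    BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*"
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence
        endsWithFwdSlashNotIP = ipLikeSequence "[/]"
    }
    {
        line = $0
        gsub(beginsWithFwdSlashNotIP, rep, line)
        gsub(endsWithFwdSlashNotIP, rep, line)
        print line
    }'
}

input='99.88.77.66/55.44.33.22.11/111.112.113.114'

# Replacing with x consumes the slash, so the second gsub() has nothing
# left to anchor on and 99.88.77.66 survives.
printf '%s\n' "$input" | strip_slashed x    # prints 99.88.77.66xx

# Replacing with / keeps the slash, so the second gsub() removes
# 99.88.77.66 as well.
printf '%s\n' "$input" | strip_slashed /    # prints //
```

The first gsub() turns the line into 99.88.77.66// when the replacement is a slash, which is exactly what the second pattern needs in order to match the leading address.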

Quote:

Originally Posted by Don Cragun

I'm looking at a different way to evaluate possible IP addresses, but I need to know what you want to be required to appear before and after a valid IP address. Am I correct in assuming that a valid IP address should appear at the start of a line or be preceded by a white-space character, be followed by a white-space character or appear at the end of a line, and contain four 1 to 3 digit numbers separated by single occurrences of a period where the values of the numbers are 0 <= number <= 255?

Note that if my assumption is correct, an IP address surrounded by alphabetic or punctuation characters (in addition to slashes) should also be rejected. If my assumption is correct, should an exception be made allowing commas (or comma followed by space) to separate IP addresses?

No, your assumptions are not correct. Typical input is HTML, though sometimes it is just plain text, and sometimes a file containing nothing at all except a single IP address. The script as a whole retrieves a computer's WAN IP address from behind a router by downloading web pages which display a user's IP address. The script randomly downloads two or more such pages to use as verification. Do you want a copy of the whole thing to have a look at?

You are, however, correct in thinking I want (quoting Don): "four 1 to 3 digit numbers separated by single occurrences of a period where the values of the numbers are 0 <= number <= 255". [I know that that notation means the inclusive range 0..255, but I've never understood why the accepted notation is not "0 >= number <= 255", since what is wanted is greater than or equal to 0 and not less than or equal to 0, which is how it reads to me.]
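For what it's worth, Awk has no chained comparison, so "0 <= number <= 255" has to be spelled out as two tests joined by &&, which also shows how the notation is meant to be read: 0 is less than or equal to the number, and the number is less than or equal to 255. Below is a minimal sketch of the quoted rule, assuming a POSIX awk; the function name is_valid_ip is hypothetical and the check is stricter than the END block in that it also enforces the 1 to 3 digit limit.

```shell
#!/bin/sh
# is_valid_ip is a hypothetical helper: exit status 0 if $1 is four
# 1-3 digit numbers separated by single periods, each in 0..255.
is_valid_ip() {
    awk -v ip="$1" 'BEGIN {
        n = split(ip, seg, ".")
        if (n != 4) exit 1
        for (i = 1; i <= 4; i++) {
            # One to three digits, nothing else.
            if (seg[i] !~ /^[0-9][0-9]?[0-9]?$/) exit 1
            # "0 <= number <= 255" written as two comparisons.
            if (!(0 <= seg[i] + 0 && seg[i] + 0 <= 255)) exit 1
        }
        exit 0
    }'
}

for candidate in 64.15.115.103 256.1.1.1 1.2.3.4.5; do
    if is_valid_ip "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```

Only the first candidate is reported as valid; the second fails the range test and the third has too many segments.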

All of the below are real-world examples of valid IPs that should be accepted (many of these are fragments of a line rather than the whole line, since the lines are often quite long):

Code:

Note: My IP address replaced with one of Google's.

<textarea id="do" class="ip">64.15.115.103</textarea>
X-Real-Ip: 64.15.115.103<br>
<input type="text" id="ip" name="ip" value="64.15.115.103"
<b>Your IP: 64.15.115.103&nbsp;</b>
<b><img src="flags/gb.gif"> 64.15.115.103</b><br />
Show Me My IP! - Your IP address is: 64.15.115.103
<tr class="l2"><td>64.15.115.103</td></tr>
<h1 id="current-ip">Current IP: <em>64.15.115.103</em></h1>
var ips = $H({"ip0":"64.15.115.103"});
<h2>Your current IP address is: <span style="color:black">64.15.115.103</span></h2>
        64.15.115.103        <br>
<meta name="DESCRIPTION" content="View my IP information: 64.15.115.103">
<TITLE>Your Ip Address 64.15.115.103</TITLE>
<a href='ipwhois-64.15.115.103.html'>whois 64.15.115.103</a>
'http://www.geobytes.com/IpLocator.htm?GetLocation&amp;template=php3.txt&amp;IpAddress=64.15.115.103');<br />
<p style="text-align: center;"><span style="font-size:24pt"><B>&nbsp;64.15.115.103</B>&nbsp;</p>
<script type="text/javascript" src="http://api.my-proxy.com/ip.js.php?ip=64.15.115.103"></script>
<input name="ip" type="text" value="64.15.115.103"> <input type="submit" value="Get location">
infowindow.setContent('64.15.115.103');

Quote:

Originally Posted by Don Cragun

Are we having fun yet?

Yes Sir.

The current Awk code with some helpful debugging print statements is:

Code:

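    BEGIN {
        # Four dot-separated digit groups plus any trailing digits or dots, so
        # over-long runs are captured whole and rejected later in END.
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";
        # A run of four or more digits is too long to be an IP segment.
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+";
        # "v", "ver" or "version" (any case) followed by an IP-like sequence.
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence;
        # IP-like sequences directly preceded or followed by a forward slash.
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence;
        endsWithFwdSlashNotIP = ipLikeSequence "[/]";
    }
    {
        line = $0;

        printf("Line<%s\n", line);    # debug: line before filtering
        gsub(digitSequenceTooLongNotIP, "x", line);
        gsub(versioningNotIP, "x", line);
        # Replace with "/" (not "x") so the second call can still see the slash.
        gsub(beginsWithFwdSlashNotIP, "/", line);
        gsub(endsWithFwdSlashNotIP, "/", line);
        printf("Line>%s\n", line);    # debug: line after filtering

        # Collect every remaining IP-like sequence; array keys give uniqueness.
        while (match(line, ipLikeSequence))
        {
            ip = substr(line, RSTART, RLENGTH);
            ipUnique[ip];
            # Continue scanning after the match (the next character cannot
            # be a digit or dot, so skipping it loses nothing).
            line = substr(line, RSTART + RLENGTH + 1);
            printf("Storing possible IP: %s\n", ip);
        }
    }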
    END {
        ipRangeMin = 0;
        ipRangeMax = 255;
        ipNumSegments = 4;
        ipDelimiter = ".";

        for (ipIndex in ipUnique)
        {
            numSegments = split(ipIndex, ipSegments, ipDelimiter);
            if (numSegments == ipNumSegments &&
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)
            {
                printf("Valid IP:   %s\n", ipIndex);
            }
            else
            {
                printf("Invalid IP: %s\n", ipIndex);
            }
        }
    }
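For anyone who wants to try this without the debugging output, here is a self-contained sketch wrapped in a shell function, assuming a POSIX awk. It is a variation rather than the script itself: the regexes and gsub() calls are the ones above, but the wrapper extract_ips, the isValid function and the seen array are hypothetical names, and validation happens while scanning (so addresses print in order of first appearance) instead of in END.

```shell
#!/bin/sh
# extract_ips is a hypothetical wrapper: it reads text on stdin and
# prints each distinct valid IP address once.
extract_ips() {
    awk '
    BEGIN {
        ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*"
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+"
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence
        beginsWithFwdSlashNotIP = "[/]" ipLikeSequence
        endsWithFwdSlashNotIP = ipLikeSequence "[/]"
    }
    # Same test as the END block above, written as a loop.
    function isValid(ip,    seg, n, i) {
        n = split(ip, seg, ".")
        if (n != 4) return 0
        for (i = 1; i <= 4; i++)
            if (seg[i] + 0 < 0 || seg[i] + 0 > 255) return 0
        return 1
    }
    {
        line = $0
        gsub(digitSequenceTooLongNotIP, "x", line)
        gsub(versioningNotIP, "x", line)
        gsub(beginsWithFwdSlashNotIP, "/", line)
        gsub(endsWithFwdSlashNotIP, "/", line)
        while (match(line, ipLikeSequence)) {
            ip = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH + 1)
            if (isValid(ip) && !(ip in seen)) {
                seen[ip] = 1
                print ip
            }
        }
    }'
}

# The three-line test file quoted at the top of this post.
extract_ips <<'EOF'
1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28
11.22.33.44version 55.66.77.99.100.110.120.130.140.150.160.170.180.190
99.88.77.66/55.44.33.22.11/111.112.113.114
EOF
# prints: 11.22.33.44
```

On the 'ipwhois-64.15.115.103.html' example, the first occurrence is captured together with its trailing dot and rejected as five segments, while the second occurrence on the same line is accepted.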

Thanks again,

Gencon

Last edited by gencon; 12-12-2013 at 02:03 PM..
