Grep regex to ignore sequence only if surrounded by fwd-slashes
Post 302878845 by gencon on Monday 9th of December 2013 08:13:36 AM
NOTE: Scrutinizer - see bottom of this post.

Hi again and thanks Chubler_XL,

Quote:
Originally Posted by Chubler_XL
The code ipUnique[ip] = ip; and ipUnique[ip]; are not equivalent.

The first creates array element ip with a null value (if it doesn't already exist) and then sets its value to ip.

The second creates array element ip with a value of null only.
That's exactly what I thought/think too. BUT strange things are going on, read on...
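
That difference can be seen directly with a few lines of awk. This is a minimal sketch; the array names byRef and byAssign and the sample key are invented for illustration:

```shell
#!/bin/sh
# Sketch: compare a bare reference with an explicit assignment.
# "byRef", "byAssign" and the key "10.0.0.1" are illustrative names only.
result=$(awk 'BEGIN {
    key = "10.0.0.1"
    byRef[key]              # bare reference: creates the element, value is ""
    byAssign[key] = key     # assignment: creates the element, value is the key

    # Membership tests: both arrays now contain the index.
    refHas = (key in byRef)
    asgHas = (key in byAssign)
    printf("byRef has key: %d, byAssign has key: %d\n", refHas, asgHas)

    # Stored values: only the assigned element is non-empty.
    printf("byRef value: [%s], byAssign value: [%s]\n", byRef[key], byAssign[key])
}')
printf '%s\n' "$result"
```

Both arrays contain the index, but only the assigned one holds a non-empty value.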

What I don't understand is why both ipUnique[ip] = ip; and ipUnique[ip]; appear to function equivalently BECAUSE please note that the bit I've made bold in what you wrote below is not correct...

Quote:
Originally Posted by Chubler_XL
The reason the replacement works is the code never uses the value of the array element and this is irrelevant. It could be 1, NULL or equal to the index and the code still works as intended.
Below I am reposting Don's code from the end of his most recent post, which is here: https://www.unix.com/showpost.php?p=3...8&postcount=15

In his code he uses this line ipUnique[ip] and not ipUnique[ip] = ip; but then in the END section he is able to access the values of the array (which 'should be' the null string but aren't) as if he had used ipUnique[ip] = ip;. The relevant bits are highlighted in red in the code below.

Code:
#!/bin/bash
tempFileName=${1:-file}

    awkExtractIPAddresses='
    BEGIN {
        ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"
        digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+"
        encInFwdSlashesNotIP = "[/]" ipSequence "[/]"
        versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .]*" ipSequence
    }
    {
        line = $0
        gsub(digitSequenceTooLongNotIP, "x", line)
        gsub(encInFwdSlashesNotIP, "x", line)
        gsub(versioningNotIP, "x", line)
        while (match(line, ipSequence)) {
            ip = substr(line, RSTART, RLENGTH)
            ipUnique[ip]
            line = substr(line, RSTART + RLENGTH + 1)
        }
    }
    END {
        ipRangeMin = 0
        ipRangeMax = 255
        ipNumSegments = 4
        ipDelimiter = "."
        for (ip in ipUnique) {
            numSegments = split(ip, ipSegments, ipDelimiter)
            if (numSegments == ipNumSegments &&
                ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
                ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
                ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
                ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax) {
                    print ip
            }
        }
    }'

    ipAddressMatches=$(awk "$awkExtractIPAddresses" "$tempFileName")

printf "%s\n" $ipAddressMatches

Do you need convincing? I certainly did!

Save this code snippet as your test input data:

Code:
# test_data
This is a valid IP address 192.168.1.1
This is not valid 111.222.333.444 some values are out of range.
This is way off: 1.2.3.4.5.6
This is a version 1.2.1.2 number in the same form as an IP.
This is a version number inside an url http://web.com/lib/1.2.3.4/file.js
This is another valid IP but in some CSS <ip>192.168.1.2</ip>
This is one with a too long digit sequence 192.16888.1.3
This is a final valid IP 192.168.1.3 just for luck.

Now save this awk program (ditching the Bash 'wrapper' is long overdue):

Code:
BEGIN {
    ipLikeSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[0-9.]*";
    digitSequenceTooLongNotIP = "[0-9][0-9][0-9][0-9]+";
    versioningNotIP = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)?[ .:]*" ipLikeSequence;
    enclosedInFwdSlashesNotIP = "[/]" ipLikeSequence "[/]";
    beginsWithFwdSlashNotIP = "[/]" ipLikeSequence;
    endsWithFwdSlashNotIP = ipLikeSequence "[/]";
}
{
    line = $0;

    gsub(digitSequenceTooLongNotIP, "x", line);
    gsub(versioningNotIP, "x", line);
    gsub(enclosedInFwdSlashesNotIP, "x", line);
    gsub(beginsWithFwdSlashNotIP, "x", line);
    gsub(endsWithFwdSlashNotIP, "x", line);

    while (match(line, ipLikeSequence))
    {
        ip = substr(line, RSTART, RLENGTH);
        ipUnique[ip] = ip;
        # ipUnique[ip];
        line = substr(line, RSTART + RLENGTH + 1);
        printf("Storing possible IP address: %s\n", ip);
    }
}
END {
    ipRangeMin = 0;
    ipRangeMax = 255;
    ipNumSegments = 4;
    ipDelimiter = ".";

    for (ip in ipUnique)
    {
        numSegments = split(ip, ipSegments, ipDelimiter);
        if (numSegments == ipNumSegments &&
            ipSegments[1] >= ipRangeMin && ipSegments[1] <= ipRangeMax &&
            ipSegments[2] >= ipRangeMin && ipSegments[2] <= ipRangeMax &&
            ipSegments[3] >= ipRangeMin && ipSegments[3] <= ipRangeMax &&
            ipSegments[4] >= ipRangeMin && ipSegments[4] <= ipRangeMax)
        {
            printf("Valid IP:   %s\n", ip);
        }
        else
        {
            printf("Invalid IP: %s\n", ip);
        }
    }
}

Please note that the while loop has both ipUnique[ip] = ip; and ipUnique[ip]; but that ipUnique[ip]; is commented out.

Now run it twice: once with ipUnique[ip] = ip;, and once with the comment swapped so that ipUnique[ip]; is used instead. You should get something like this:

Code:
$ awk -f test.awk < test_data
Storing possible IP address: 192.168.1.1
Storing possible IP address: 111.222.333.444
Storing possible IP address: 1.2.3.4.5.6
Storing possible IP address: 192.168.1.2
Storing possible IP address: 192.168.1.3
Invalid IP: 1.2.3.4.5.6
Valid IP:   192.168.1.1
Valid IP:   192.168.1.2
Valid IP:   192.168.1.3
Invalid IP: 111.222.333.444

# Swap "ipUnique[ip] = ip;" for "ipUnique[ip];" in test.awk

$ awk -f test.awk < test_data
Storing possible IP address: 192.168.1.1
Storing possible IP address: 111.222.333.444
Storing possible IP address: 1.2.3.4.5.6
Storing possible IP address: 192.168.1.2
Storing possible IP address: 192.168.1.3
Invalid IP: 1.2.3.4.5.6
Valid IP:   192.168.1.1
Valid IP:   192.168.1.2
Valid IP:   192.168.1.3
Invalid IP: 111.222.333.444

IDENTICAL - use diff as well if you want, I did.

So I repeat myself (at least in essence): ipUnique[ip] = ip; and ipUnique[ip]; function equivalently in the code above, and I do not understand why ipUnique[ip]; works at all. As I said in my reply to Don, my best guess is that it has something to do with stack manipulation because, as you pointed out and the manual clearly says, when an array element is referenced (with no assignment) it is created with the null string as its value.
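
A small experiment may help narrow this down. It is a sketch, not a definitive answer, and the array name seen and the sample addresses are made up. In awk, for (k in arr) loops over the array's indices rather than its stored values, so a loop like the one in the END block would behave the same whether or not anything was assigned. The following prints both the index and the stored value for elements created by a bare reference:

```shell
#!/bin/sh
# Sketch: for (k in arr) yields indices; values are read only via arr[k].
# "seen" and the sample addresses are illustrative, not from the script above.
loopOutput=$(printf '%s\n' 192.168.1.1 192.168.1.2 192.168.1.1 |
awk '{
    seen[$1]                # bare reference, as in ipUnique[ip];
}
END {
    for (k in seen) {
        # k is the index; seen[k] is the (empty) stored value.
        printf("index=%s value=[%s]\n", k, seen[k])
    }
}' | sort)                  # sort because for-in order is unspecified
printf '%s\n' "$loopOutput"
```

If the values print as empty brackets while the indices are the addresses, then the END block is reading indices, which would account for the identical output.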

Here's hoping the Don Craguneleone will get back into the action; if ever I needed The Godfather it's now. Cue a (somewhat slimmer) Marlon Brandoesque figure in the heavily shaded study of his mansion, with a blinking cursor whizzing across the line like a speeding bullet and wedding guests waiting patiently with their own coding problems.

All the best, thanks for taking the time to read this,

Gencon

---------- Post updated at 01:13 PM ---------- Previous update was at 01:12 PM ----------

Thanks for the info, Scrutinizer.

Quote:
Originally Posted by Scrutinizer
The --re-interval option used by gawk 3 is automatically switched on by the --posix option. In gawk 4 the --re-interval option is on by default. So it may be a good idea to use gawk with the --posix option.
Please note that neither the --re-interval nor the --posix options are actually defined by POSIX.

And there lies the problem. I'd like my script to run on any UNIX/Linux system. Since different awks/gawks handle enabling interval expressions in different ways (including possibly requiring non-POSIX command line options), it seems simplest to me to avoid interval expressions altogether, as I have done in the code. That is easily accomplished in this particular case by using gsub() to replace every digit sequence longer than 3 digits.
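
To illustrate the workaround, here is a minimal sketch of masking long digit runs without an interval expression. The sample input line is made up, and the interval form [0-9]{4,} appears only in a comment for comparison:

```shell
#!/bin/sh
# Sketch: mask runs of four or more digits without interval expressions.
# The interval form "[0-9]{4,}" is mentioned only here for comparison;
# whether an awk accepts it depends on the implementation and its options.
masked=$(printf '%s\n' 'addr 192.16888.1.3 and 192.168.1.3' |
awk '{
    line = $0
    # Three mandatory digits plus "one or more" gives four or more digits.
    gsub("[0-9][0-9][0-9][0-9]+", "x", line)
    print line
}')
printf '%s\n' "$masked"
```

The five-digit run is replaced by x, so the first address no longer looks like four dotted numbers, while the second is left untouched.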

Cheers.
 
