I didn't notice when you slipped in local in your variable definitions; local isn't in the POSIX standards. (A proposal to add local to a future revision of the POSIX standards is being discussed, but the shells that provide local variables disagree on both the syntax and the semantics of how it should be done. I'm not confident at this point that this proposal will make it into the standards.)
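For anyone who wants function-private variables without local, one portable option is to give the function a subshell body by writing it with parentheses instead of braces. This is a minimal sketch; sum_args and its variables are hypothetical names, not part of the script under discussion.

```shell
# Hypothetical example: sum_args is not from the original script.
# The ( ... ) body runs in a subshell, so "total" and "n" cannot leak
# into (or overwrite) variables of the same name in the caller.
sum_args() (
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    printf '%s\n' "$total"
)

total="caller value"
sum_args 1 2 3              # prints 6
printf '%s\n' "$total"      # prints: caller value
```

The trade-off is that a subshell function cannot set variables in the caller at all and costs an extra process, so it only suits functions that report their result on standard output.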
After adding:
to the start of the script, changing:
to:
and saving the script in an executable file named tester, and putting the following sample data in a file named file:
neither ksh nor bash (on Mac OS X version 10.7.5) accepted local in this context. ksh said:
and bash said:
After removing both occurrences of local, running tester produced:
which contains exactly the output I expected (in seemingly random order).
Your ERE definitions:
all conform to POSIX ERE requirements. But, I would make several tweaks:
As Chubler_XL suggested earlier, I would change all occurrences of "\\." to "[.]" (except for the one marked in red above). Using the backslash escapes instead of the matching list bracket expression keeps you from reusing the common parts of three of these expressions. The backslashes in the occurrence of "\\." marked in red can just be removed. (The period is just a period in a bracket expression and doesn't need to be escaped.)
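To illustrate the reuse point, here is a small sketch in which the pieces of an ERE are built up as awk strings. Because "[.]" needs no backslashes, the same fragment can be concatenated into larger expressions without worrying about how many levels of escaping apply. The names num, dot, quad, and first_dotted_quad are made up for this example.

```shell
# Sketch only: the variable names are illustrative, not the original script's.
first_dotted_quad() {
    awk '
    BEGIN {
        num  = "[0-9]+"
        dot  = "[.]"                     # a literal period, no backslashes needed
        quad = num dot num dot num dot num
    }
    match($0, quad) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' 'host 192.168.0.1 up' 'no address here' | first_dotted_quad
# prints: 192.168.0.1
```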
Why did you use [0-9][0-9]* at the end of digitSequenceTooLongNotIP instead of using [0-9]+?
Why did you use (version|ver|v)+ instead of just (version|ver|v) at the start of versioningNotIP? (Anything matching "v", "ver", or "version" at the end of a string once will also match any string that ends in one or more of those strings.) Note that if you change that expression to:
it will match exactly the same strings, but you won't need to translate all of your input records to lowercase with:
You could get rid of the line variable completely and just use $0, but changing the above line to:
rather than just using $0 feels like it would be more efficient (since every update to $0 forces awk to re-evaluate the current line).
On a completely different note: why did you choose to define the awk script as a single line awk script using backslashes at the end of the awkExtractIPAddresses variable assignment to denote line continuation? If you delete the trailing backslashes and all of the spaces and/or tabs that come just before them, the size of your awk script drops from 3,373 bytes to just over 1,000 bytes, requiring only two other changes in your script:
and:
have to be changed to:
and:
respectively. If your code had tabs rather than spaces at the ends of lines before the backslashes, the space savings won't be as drastic, but may still be significant (and makes it easier to make changes to the script without worrying about keeping the backslashes lined up). If you do this, you could (but would not have to) also remove a lot of semicolons from your code.
If you like (or at least would like to further investigate these ideas), the following script incorporates these suggestions (and a few tiny changes not worth discussing) and produces exactly the same output:
I hope this helps...
Thanks for such a comprehensive reply. Some points for me to answer and some questions to ask.
On the use of local variables it simply didn't occur to me that this was not standardized. I don't need to use them and will remove them. I have never tried the script with any shell other than Bash but will do so as there may be other issues.
Quote:
Originally Posted by Don Cragun
and bash said:
My code is inside a function so no complaint at my end... and in my test file, where the code isn't in a function, I'd removed the 2 local keywords. When posting I've been copy'n'pasting from the actual source code file that it's part of.
BUT on that subject, am I correct in thinking that, even if Bash is not the shell a user has chosen to use, /bin/bash will still be present on all (modern) Unix/Linux systems? Ignoring the fact that, of course, a user could remove it themselves. Bash has certainly been available on every Unix/Linux system that I've used - almost always it being the default shell as well (though I seem to remember changing my own default shell to Bash from sh on a Solaris system circa 1995).
Quote:
Originally Posted by Don Cragun
which contains exactly the output I expected (in seemingly random order).
If there is a maximum of 1 valid IP address on each line then they get printed in reverse order, if more than 1 valid IP on a single line then the ordering is a mystery to me, but I've not devoted a lot of attention to it as ordering is not an issue for me.
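The order in which awk's for (index in array) loop visits elements is unspecified, which is why the output looks shuffled. If a stable order were ever wanted, one approach is to pipe the result through sort. The following is only a sketch with a hypothetical helper name, unique_ips_sorted, and it assumes one address per input line.

```shell
# Hypothetical helper: de-duplicate addresses, then sort numerically per octet.
unique_ips_sorted() {
    awk '{ seen[$0] } END { for (ip in seen) print ip }' |
        sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n
}

printf '%s\n' 10.0.0.2 9.1.1.1 10.0.0.1 10.0.0.2 | unique_ips_sorted
# prints:
# 9.1.1.1
# 10.0.0.1
# 10.0.0.2
```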
Quote:
Originally Posted by Don Cragun
As Chubler_XL suggested earlier, I would change all occurrences of "\\." to "[.]" (except for the one marked in red above). Using the backslash escapes instead of the matching list bracket expression keeps you from reusing the common parts of three of these expressions. The backslashes in the occurrence of "\\." marked in red can just be removed. (The period is just a period in a bracket expression and doesn't need to be escaped.)
I must have overlooked that suggestion, but agreed that's better.
Quote:
Originally Posted by Don Cragun
Why did you use [0-9][0-9]* at the end of digitSequenceTooLongNotIP instead of using [0-9]+?
I'm far from being a regex pro; my thought process went like this: I need to match a minimum of 4 digits in a row, so [0-9][0-9][0-9][0-9], then optionally a 5th or more digit, so I need [0-9][0-9][0-9][0-9][0-9]*. [0-9][0-9][0-9][0-9]+ is more concise, but is it any more efficient?
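Whatever the answer on efficiency, the two spellings accept the same strings: both require at least four consecutive digits. A quick sketch to confirm that (compare_digit_runs is a hypothetical name, and "x" is just a placeholder replacement):

```shell
# Replace every run of four or more digits using each spelling, then compare.
compare_digit_runs() {
    awk '{
        a = $0; b = $0
        gsub(/[0-9][0-9][0-9][0-9][0-9]*/, "x", a)   # four digits, then zero or more
        gsub(/[0-9][0-9][0-9][0-9]+/, "x", b)        # three digits, then one or more
        verdict = (a == b) ? "same" : "different"
        print a
        print b
        print verdict
    }'
}

printf '%s\n' '123 1234 12345' | compare_digit_runs
# prints:
# 123 x x
# 123 x x
# same
```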
Quote:
Originally Posted by Don Cragun
Why did you use (version|ver|v)+ instead of just (version|ver|v) at the start of versioningNotIP?
That was a mistake and shouldn't be there as you have spotted.
[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)? along with skipping the tolower() bit of line = tolower($0), is a better solution (though the nested brackets make it hard for me to read, no probs a comment to explain it will help me understand it if I revisit the code years down the line).
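As a reading aid for the nested groups: the outer group makes "er" optional after the "v", and the inner group makes "sion" optional after "er", so the prefix matches v, ver, or version in any mix of upper and lower case. The sketch below uses that prefix in a made-up pattern (an optional space and a digit are appended purely for the demonstration; this is not the full versioningNotIP expression).

```shell
# Hypothetical demonstration of the case-insensitive prefix without tolower().
classify_lines() {
    awk '
    BEGIN {
        # Assumed pattern for illustration: v/ver/version in any case,
        # an optional space, then a digit.
        versionPrefix = "[Vv]([Ee][Rr]([Ss][Ii][Oo][Nn])?)? ?[0-9]"
    }
    {
        label = ($0 ~ versionPrefix) ? "versioning" : "candidate"
        print label ": " $0
    }'
}

printf '%s\n' 'Version 1.2.3.4' 'VER 1.2.3.4' 'v1.2.3.4' 'host 1.2.3.4' | classify_lines
# prints:
# versioning: Version 1.2.3.4
# versioning: VER 1.2.3.4
# versioning: v1.2.3.4
# candidate: host 1.2.3.4
```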
It doesn't save much time though. I had to create a 1 MB file from 50 real world datasets to work it out but the low-down is that it saves approx. 0.6 milliseconds in a typical real world sized dataset. 30 milliseconds when processing the 1 MB file. Not exactly scientific but close enough. What I can tell you as a result is that calls to tolower() in awk are cheap. Discounting whatever other processes were running on my PC and the speed difference between the 2 regexes, 30 ms per 1 MB ain't at all bad IMO. [Note: median value from 5 runs used for both, mine was 115 ms, yours 85 ms.]
Quote:
Originally Posted by Don Cragun
On a completely different note; why did you choose to define the awk script as a single line awk script using backslashes at the end of the awkExtractIPAddresses variable assignment to denote line continuation?
Only because when I tried the code without them it failed, and I did not know that to fix it I just needed to keep the '{' of BEGIN { and END { on one line. Thanks for that, keeping the backslashes nice and neat was a pain - I'm a bit OCD when it comes to things like that.
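For reference, a minimal sketch of the layout being described: a multi-line awk program stored in a single-quoted shell variable with no trailing backslashes. The only layout rule that matters is that the opening brace stays on the same line as BEGIN and END. The variable awkCountLines and the function count_lines are hypothetical stand-ins, not the real awkExtractIPAddresses script.

```shell
# Hypothetical stand-in for the awkExtractIPAddresses assignment.
awkCountLines='
BEGIN {
    count = 0;
}
{
    count++;
}
END {
    print count;
}'

count_lines() { awk "$awkCountLines"; }

printf 'a\nb\nc\n' | count_lines    # prints 3
```

The semicolons are kept here to show that they remain legal at the ends of lines; awk treats them as empty statement terminators.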
Quote:
Originally Posted by Don Cragun
If you do this, you could (but would not have to) also remove a lot of semicolons from your code.
Every fibre of my being revolts at the mere suggestion of that. It's the C/C++ programmer in me, I just can't do it. If I'm allowed to use ';' at the end of a line then I will do so until the end of time, even under threat of execution or the Spanish Inquisition... well, maybe not then. [Please note I know you wrote: (but would not have to).]
Quote:
Originally Posted by Don Cragun
the following script incorporates these suggestions (and a few tiny changes not worth discussing)
Well one of them does and one of them doesn't.
The one that doesn't is the replacement of 'xxx' with 'x', which was always my intention, but when testing 'xxx' is so much easier to spot when scanning the test output.
The one that does has got me flummoxed. Your line ipUnique[ip] in:
Leaving aside my feeling of "every fibre of my being revolts" at an array assignment with no assignment, I am willing to somewhat unhappily concede that it works, as no doubt you are aware. BUT I am hoping that you will tell me why it works, because the GNU Awk manual tells me it shouldn't (my awk is 'GNU Awk 3.1.8').
See section "8.1.2 Referring to an Array Element and also 8.1.3 Assigning Array Elements", direct link here:
"NOTE: A reference to an element that does not exist automatically creates that array element, with the null string as its value."
and...
"Array elements can be assigned values just like awk variables: array[index-expression] = value"
Nothing saying: "if no assignment is made then the value of index-expression is assigned as the value of the element..."
and...
Is the reason it actually works in your code something to do with manipulating the stack? If so, I don't see how, but that's all I got.
Back to the code debugging - I have spotted another 'bug' / overlooked possibility:
The code has failed to handle one simple limitation. My original (not fully posted) code did actually handle it but not since the move to an awk only solution and stupidly I forgot about it.
Consider (Phlebas) this, no slashes, no out of range numbers, no versioning:
A random number and dot sequence that happens to exist 1.2.3.4.5 but is too long to be an IP address.
ipSequence = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+" will match just the 1.2.3.4 part of it, ignoring the final ".5", and the checks on the number of segments and the number range will confirm it. A number and dot sequence which is definitely not an IP address has been identified as a valid one. Ouch!
Easily solved by simply making a more greedy version of ipSequence:
With that the whole of 1.2.3.4.5 (and anything too long to be an IP) gets matched, it gets added to the ipUnique array and the code below does not print it because of the if (numSegments == ipNumSegments &&... section which was totally unnecessary before but was added due to my OCD need to check array bounds before evaluation slash good programming practice.
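A sketch of how that can fit together, under stated assumptions: the pattern dottedRun below is one possible greedier expression (not necessarily the one used in the script), the 0 to 255 range check is assumed, and valid_ips is a hypothetical wrapper. The names numSegments and ipNumSegments follow the ones mentioned above.

```shell
# Sketch: match the whole digit-and-dot run, then accept it only if it has
# exactly four segments and every segment is at most 255.
valid_ips() {
    awk '
    BEGIN {
        dottedRun = "[0-9]+([.][0-9]+)+"   # assumed pattern: swallows every trailing .N
        ipNumSegments = 4
    }
    {
        line = $0
        while (match(line, dottedRun)) {
            candidate = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
            numSegments = split(candidate, seg, ".")
            ok = (numSegments == ipNumSegments)
            for (i = 1; ok && i <= numSegments; i++)
                if (seg[i] + 0 > 255) ok = 0
            if (ok) print candidate
        }
    }'
}

printf '%s\n' 'a 1.2.3.4.5 b' 'c 10.0.0.1 d' 'e 300.1.1.1 f' | valid_ips
# prints: 10.0.0.1
```

Because the match is leftmost-longest, 1.2.3.4.5 is consumed as a single five-segment candidate and rejected, rather than being truncated to 1.2.3.4.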
So now we must be at the finished article, or very near it. Wow this has ended up being a long post.
[..]
Your IP regex code IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" is not POSIX compatible. POSIX awk does not include interval expressions, e.g. {1,3}. The GNU Awk v.3.1.8 (on my system) for instance requires the --re-interval option to allow their use.
Actually, interval expressions for awk are part of POSIX (at least since SUS v3):
Quote:
When an ERE matching a single character or an ERE enclosed in parentheses is followed by an interval expression of the format "{m}", "{m,}", or "{m,n}", together with that interval expression it shall match what repeated consecutive occurrences of the ERE would match. The values of m and n are decimal integers in the range 0 <= m <= n <= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n specifies the maximum number of occurrences. The expression "{m}" matches exactly m occurrences of the preceding ERE, "{m,}" matches at least m occurrences, and "{m,n}" matches any number of occurrences between m and n, inclusive.
Regular expressions in awk have been extended somewhat from historical implementations to make them a pure superset of extended regular expressions, as defined by POSIX.1-2008 (see XBD Extended Regular Expressions). The main extensions are internationalization features and interval expressions.
The --re-interval option used by gawk 3 is automatically switched on by the --posix option. In gawk 4 the --re-interval option is on by default. So it may be a good idea to use gawk with the --posix option.
S.
Last edited by Scrutinizer; 12-07-2013 at 07:58 AM..
"NOTE: A reference to an element that does not exist automatically creates that array element, with the null string as its value."
and...
"Array elements can be assigned values just like awk variables: array[index-expression] = value"
Nothing saying: "if no assignment is made then the value of index-expression is assigned as the value of the element..."
The code ipUnique[ip] = ip; and ipUnique[ip]; are not equivalent.
The first creates array element ip with a null value (if it doesn't already exist) and then sets its value to ip.
The second creates array element ip with a value of null only.
The reason the replacement works is the code never uses the value of the array element and this is irrelevant. It could be 1, NULL or equal to the index and the code still works as intended.
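A short sketch of that point: de-duplication depends only on which indices exist, so a bare reference that leaves every value as the null string is enough. The names seen and count_unique are hypothetical.

```shell
# Count distinct input lines using bare array references only.
count_unique() {
    awk '
    { seen[$0] }                 # bare reference: creates the element, value is ""
    END {
        n = 0
        for (key in seen) n++    # iterates over indices only
        print n
    }'
}

printf '%s\n' b a b c a | count_unique    # prints 3
```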
Here is a small example that might help to illustrate.
The code ipUnique[ip] = ip; and ipUnique[ip]; are not equivalent.
The first creates array element ip with a null value (if it doesn't already exist) and then sets its value to ip.
The second creates array element ip with a value of null only.
That's exactly what I thought/think too. BUT strange things are going on, read on...
What I don't understand is why both ipUnique[ip] = ip; and ipUnique[ip]; appear to function equivalently BECAUSE please note that the bit I've made bold in what you wrote below is not correct...
Quote:
Originally Posted by Chubler_XL
The reason the replacement works is the code never uses the value of the array element and this is irrelevant. It could be 1, NULL or equal to the index and the code still works as intended.
In his code he uses this line ipUnique[ip] and not ipUnique[ip] = ip; but then in the END section he is able to access the values of the array (which 'should be' the null string but aren't) as if he had used ipUnique[ip] = ip;. The relevant bits are highlighted in red in the code below.
Do you need convincing? I certainly did !!
Save this code snippet as your test input data:
Now save this awk program (ditching the Bash 'wrapper' is long overdue):
Please note that the while loop has both ipUnique[ip] = ip; and ipUnique[ip]; but that ipUnique[ip]; is commented out.
Now run it twice, once with ipUnique[ip] = ip; and then swapping the comment around so ipUnique[ip]; is used instead and you should get something like this:
IDENTICAL - use diff as well if you want, I did.
So I repeat myself (at least in essence): ipUnique[ip] = ip; and ipUnique[ip]; function equivalently in the code above. I do not understand why ipUnique[ip]; works at all. As I said in my reply to Don, my best guess is that it has something to do with stack manipulation because, as you pointed out and the manual clearly says, when an array is referenced (with no assignment) the null string is assigned to that array element's value.
Here's hoping the Don Craguneleone will get back into the action; if ever I needed The Godfather it's now. Cue a (somewhat slimmer) Marlon Brandoesque figure in the heavily shaded study of his mansion, with a blinking cursor whizzing across the line like a speeding bullet and wedding guests waiting patiently with their own coding problems.
All the best, thanks for taking the time to read this,
Gencon
---------- Post updated at 01:13 PM ---------- Previous update was at 01:12 PM ----------
Thanks for the info, Scrutinizer.
Quote:
Originally Posted by Scrutinizer
The --re-interval option used by gawk 3 is automatically switched on by the --posix option. In gawk 4 the --re-interval option is on by default. So it may be a good idea to use gawk with the --posix option.
Please note that neither the --re-interval nor the --posix options are actually defined by POSIX.
and there lies the problem. I'd like my script to run on any UNIX/Linux system. Since different awks/gawks handle enabling interval expressions in different ways (including possibly requiring non-POSIX command line options) it seems simplest to me to simply avoid their use as I have done in the code; especially as this is so easily accomplished in this particular case by doing a gsub() replace of all number sequences greater than 3 digits in length.
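As a side note for cases where a bounded repetition is still wanted, an interval such as [0-9]{1,3} can be spelled out with optional atoms, which every ERE-capable awk should accept without extra options. This is an alternative sketch, not the gsub() approach described above; octet, ipRe, and check_octets are hypothetical names.

```shell
# Interval-free equivalent of [0-9]{1,3}: one digit plus up to two optional digits.
check_octets() {
    awk '
    BEGIN {
        octet = "[0-9][0-9]?[0-9]?"
        ipRe  = "^" octet "[.]" octet "[.]" octet "[.]" octet "$"
    }
    {
        verdict = ($0 ~ ipRe) ? "match" : "no match"
        print $0 ": " verdict
    }'
}

printf '%s\n' '10.20.30.40' '1234.1.1.1' | check_octets
# prints:
# 10.20.30.40: match
# 1234.1.1.1: no match
```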
… … …
What I don't understand is why both ipUnique[ip] = ip; and ipUnique[ip]; appear to function equivalently BECAUSE please note that the bit I've made bold in what you wrote below is not correct...
In his code he uses this line ipUnique[ip] and not ipUnique[ip] = ip; but then in the END section he is able to access the values of the array (which 'should be' the null string but aren't) as if he had used ipUnique[ip] = ip;. The relevant bits are highlighted in red in the code below.
Do you need convincing? I certainly did !!
… … …
So I repeat myself (at least in essence): ipUnique[ip] = ip; and ipUnique[ip]; function equivalently in the code above. I do not understand why ipUnique[ip]; works at all. As I said in my reply to Don, my best guess is that it has something to do with stack manipulation because, as you pointed out and the manual clearly says, when an array is referenced (with no assignment) the null string is assigned to that array element's value.
Here's hoping the Don Craguneleone will get back into the action; if ever I needed The Godfather it's now. Cue a (somewhat slimmer) Marlon Brandoesque figure in the heavily shaded study of his mansion, with a blinking cursor whizzing across the line like a speeding bullet and wedding guests waiting patiently with their own coding problems.
All the best, thanks for taking the time to read this,
Gencon
… … …
I apologize for taking so long to get back to you. But when I have a choice between spending some time with the grandkids or evaluating an awk script, the grandkids are going to win every time.
There is no stack manipulation going on… I am not referencing the value of any ipUnique[] array elements in the END clause.
I'm working on a much lengthier response to message #17 in this thread, but I may not be ready to post it for a couple of days (while I get caught up on other things). But, this point seems to be bothering you and (I hope) will be easy to explain. As you have said, the command ipUnique[ip] creates an element in the array ipUnique with index ip and assigns a null value to it. But the command
never looks at the value assigned to any element in the array; it only looks at the indices of the elements in the array. Perhaps a simpler example will help:
which produces the output:
Your full script (and mine) never use ipUnique[ip] (which is the value of an array element) in the END clause; they only reference ip (which is the index of an array element).
We would need to use:
instead of:
if we used:
instead of:
Do you understand now why we don't need to waste time or space assigning the index of each array element as the value of the array element as well instead of just using the index itself?
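The distinction between an element's index and its value can be seen in a few lines. This is a minimal sketch with a hypothetical wrapper name, show_index_and_value; it creates one element by bare reference and prints both the loop variable and the stored value.

```shell
# The loop variable holds the index; the stored value is still the null string.
show_index_and_value() {
    awk 'BEGIN {
        ipUnique["10.0.0.1"]
        for (ip in ipUnique)
            printf "index=<%s> value=<%s>\n", ip, ipUnique[ip]
    }'
}

show_index_and_value
# prints: index=<10.0.0.1> value=<>
```

Printing ip gives the address because ip is the index supplied by the for loop, while ipUnique[ip] is empty because nothing was ever assigned to it.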
Last edited by Don Cragun; 12-10-2013 at 04:57 AM..
Reason: Remove extraneous end code tag.