Thanks. That is an ever so slightly more elegant solution and the one which I shall in fact use. Don will get over it eventually; I'm sure it's nothing that some intensive therapy won't cure.
Of course Don might point out that yours will not actually work at the moment due to the mysterious disappearance of any actual input.
Thanks all.
I'm over it. No intensive therapy required.
Note that I clearly stated that my suggestion had a limitation because I knew it didn't work with one of your sample lines of input. (It seemed to work because the sample input used the ip address 11.11.11.11 multiple times.) I see no reason to believe that the removal of the ip-like strings between slashes should have any bad effect and agree with using that concept to improve my code.
Do note, however, that although grep -E is required by the standards and grep -E (or egrep or both) is available on any UNIX or Linux implementation, the -E option to sed is not required by the standards and is not always available. But, if this is a concern, the -o option to grep is not required by the standards either and is not always available.
This could be rewritten using options only available in the standards, but if the systems you care about have sed -E and grep -o there isn't any reason to spend the time to work it out.
Cheers,
Don
[EDIT: Forget the below - I'll re-write without using -o, may as well get it right. I wrote the below before noticing Don's latest reply, having spotted the possible problem with sed -E myself. I didn't however realize grep -o is not in the standard.]
Perhaps you could get some more portability out of an awk script (you could also test the 0-255 limit on the octets with this awk script if you liked):
Thanks very much Chubler_XL. That's an excellent way to do it and what I've used with some changes.
Your IP regex code IP_RE="[0-9]+\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]+" is not Posix compatible. Posix awk does not include interval expressions, e.g. {1,3}. The GNU Awk v.3.1.8 (on my system) for instance requires the --re-interval option to allow their use. I have simply used an extra sed replace expression to remove all number sequences greater than 3 digits in length to get around this.
Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.
I also spotted a potential problem with the removal of IP-like addresses enclosed by slashes using sed. Consider the following url:
The old sed expression (not the one below) would simply remove this bit: /15.5.2.1/
Which would leave behind this: http://web.com/libs/v.15.5.23.12/file.js
Inadvertently, a valid IP address of 15.5.23.12 has been created from the digits on either side of the removed section. Okay, so it's not all that likely to happen regularly, but using 'xxx' as the replacement string, instead of an empty string, in the sed expressions makes sure it won't happen.
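The fusion problem described above can be reproduced with a small sketch. The URL here is hypothetical, picked only because deleting /15.5.2.1/ from it yields the result quoted above; the function names are also made up for illustration, and the expression is a plain BRE so that it does not rely on sed -E.

```shell
#!/bin/sh
# Hypothetical URL, chosen so that deleting /15.5.2.1/ fuses the neighbours.
url='http://web.com/libs/v.15.5./15.5.2.1/23.12/file.js'

# BRE for an IP-like string between slashes; [0-9][0-9]* stands in for [0-9]+.
ip_between_slashes='/[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*/'

# Replace with the empty string: the digits on either side join up.
strip_empty() { printf '%s\n' "$1" | sed "s|$ip_between_slashes||g"; }

# Replace with xxx: the digits on either side stay separated.
strip_xxx() { printf '%s\n' "$1" | sed "s|$ip_between_slashes|xxx|g"; }

strip_empty "$url"   # http://web.com/libs/v.15.5.23.12/file.js
strip_xxx "$url"     # http://web.com/libs/v.15.5.xxx23.12/file.js
```

Any placeholder that cannot be part of an IP address would do in place of xxx.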
I think the code below is now fully Posix compatible, the question is: does it get the thumbs up from Don?
Quote:
Originally Posted by gencon
Also the gsub("/"IP_RE"/","") line clearly prevents the rest of the code from working by replacing all the IP regex matches in each line with the empty string, and I assume that the line only made it into your post by accident.
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Perhaps it would be clearer if we replaced it with something like gsub("[/]"IP_RE"[/]","xxx") making the slashes more distinct from the gsub() form with slash delimiters.
Also the line if(!have[fnd]++) print fnd; is designed to remove the need for the sort -u as I don't think the -u option is POSIX.
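The two ideas above (masking slash-enclosed matches with gsub() and de-duplicating with the have[] array) might fit together roughly as follows. This is only a sketch, not the script posted in the thread: the function name extract_ips and the sample input are made up, and the IP_RE here deliberately avoids {1,3} intervals so it does not depend on interval support in the local awk.

```shell
#!/bin/sh
# extract_ips: print each distinct IP-like string on stdin once, in first-seen order.
extract_ips() {
    awk '
    BEGIN { IP_RE = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+" }   # no {1,3} intervals
    {
        # Mask IP-like strings enclosed in slashes (e.g. version directories in URLs).
        gsub("[/]" IP_RE "[/]", "xxx")
        # Walk the rest of the line, one match at a time.
        while (match($0, IP_RE)) {
            fnd = substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + RLENGTH)
            if (!have[fnd]++) print fnd      # takes the place of sort -u
        }
    }'
}

printf '%s\n' \
    'GET http://web.com/libs/15.5.2.1/file.js from 11.11.11.11' \
    'src=11.11.11.11 dst=10.0.0.7' | extract_ips
```

With the sample input, 11.11.11.11 is printed once and 10.0.0.7 once; the slash-enclosed 15.5.2.1 is masked before matching starts. Output is in first-seen order, not sorted.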
Quote:
Originally Posted by Chubler_XL
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Perhaps it would be clearer if we replaced it with something like gsub("[/]"IP_RE"[/]","xxx") making the slashes more distinct from the gsub() form with slash delimiters.
Also the line if(!have[fnd]++) print fnd; is designed to remove the need for the sort -u as I don't think the -u option is POSIX.
Hi gencon and Chubler_XL,
I agree that Chubler_XL is taking the better approach. There is no need to fire up both sed and awk. You just need to fix the gsub() that is accidentally deleting all ip-addresses instead of just replacing those surrounded by slashes. I was getting ready to test out the suggestion of using "[/]"IP_RE"[/]" as a fix for the problem you found, but it looks like the two of you beat me to it.
The sort -u option has been in POSIX from the beginning. It was in the first POSIX shell and utilities standard when it was adopted by IEEE in 1992 and by ISO/IEC in 1993. But, since it seems that the need for sort was to remove duplicates and Chubler_XL's script already does that, sort -u isn't needed. In fact the sort in that pipeline isn't needed unless gencon wants the list in sorted order.
It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).
Quote:
Originally Posted by Chubler_XL
No, that is an attempt to remove the need for the sed replace call; it is supposed to match slash + ipRegEx + slash.
Yes, my apologies. I misread it as having the double quotes escaped. Oops. That's now back in the code.
I've also used the if(!have[fnd]++) print fnd; code (conceptually anyway), and removed the pipe to: sort -u
Quote:
Originally Posted by Don Cragun
I agree that Chubler_XL is taking the better approach.
No doubt at all.
I've now had a few mins to alter the code to perform the whole operation with awk, including IP number range checking which is another sensible suggestion of Chubler_XL's.
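The IP number range checking mentioned above could look something like this. It is a minimal sketch under assumptions: the names valid_ips and octets_ok are hypothetical, and the input is taken to be one candidate address per line.

```shell
#!/bin/sh
# valid_ips: read one candidate per line, print those whose four octets are all 0-255.
valid_ips() {
    awk '
    function octets_ok(ip,    n, part, i) {
        n = split(ip, part, ".")
        if (n != 4) return 0
        for (i = 1; i <= 4; i++)
            if (part[i] !~ /^[0-9]+$/ || part[i] + 0 > 255) return 0
        return 1
    }
    octets_ok($0) { print }'
}

printf '%s\n' 11.11.11.11 256.1.1.1 15.5.23.12 1.2.3 | valid_ips
```

Of the four sample lines, only 11.11.11.11 and 15.5.23.12 pass; 256.1.1.1 fails the range test and 1.2.3 has too few octets.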
It also seemed sensible to add a regex to spot (obvious) version numbers and remove them as well; see the regex in versioningNotIP. After all, if I'm ignoring version numbers in urls then I may as well ignore IP-like sequences if they follow Version, Ver, V. and so on. The input line is therefore converted to lower case so that the regex works whatever the case.
Quote:
Originally Posted by Don Cragun
It looks to me like you two have almost completed debugging a very efficient script that will handle any number of ip addresses and ip-like addresses between slashes on a single line (as long as your input doesn't exceed LINE_MAX limits).
I hope so.
Have a look at what may be the finished, fully Posix compliant, article. Thumbs up if all okay please guys. If not, I will persevere and fix anything that needs fixing.
Just remembered one thing I'm not 100% sure about. In the versioningNotIP regex I've enclosed the OR variations in (). I couldn't find online whether that is acceptable with Posix awk, is it?
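For what it's worth, POSIX awk uses extended regular expressions, and EREs include grouping with () and alternation with |. A sketch of the idea follows; the pattern shown is a hypothetical stand-in, not the actual versioningNotIP regex from the script, and the function name mask_versions is made up.

```shell
#!/bin/sh
# mask_versions: lower-case each line, then mask IP-like strings that follow a version marker.
mask_versions() {
    awk '
    BEGIN {
        IP_RE = "[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+"
        # Hypothetical pattern: version, ver or v, optional dot/colon/space, then the digits.
        versioningNotIP = "(version|ver|v)[.: ]*" IP_RE
    }
    {
        $0 = tolower($0)
        gsub(versioningNotIP, "xxx")
        print
    }'
}

printf '%s\n' 'Firmware Version 2.4.10.3 on host 10.0.0.7' | mask_versions
```

The sample line comes out as "firmware xxx on host 10.0.0.7". Note that a bare v alternative will also match the last letter of an ordinary word that ends in v, so a real pattern would probably want a stricter left-hand boundary.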