    #1  
Old 06-10-2012
sudon't
Registered User
 
Join Date: May 2012
Location: The Cape Fear...ooooh!
Posts: 75
Thanks: 46
Thanked 0 Times in 0 Posts
What's the Diff Between These Two Regexes?

Trying to understand what's happening here, but I cannot figure it out.
I'm reading Mastering Regular Expressions, by Friedl, and he uses this as an example of how to grab quoted text:

Code:
egrep -o '"[^"]*"' ~/File.txt

...should pull in any quoted phrases. Match a literal double-quote, match anything not a double-quote until you hit the next literal double-quote.
But, he says [^"]* can match a newline, thereby returning quoted text even if it crosses lines. If you want to keep it from crossing lines, you should use this:

Code:
egrep -o '"[^"\n]*"' ~/File.txt

But this is where my head starts to hurt because a star should never fail, right? In other words, if it hits a newline, isn't a newline zero (or more) occurrences of 'not a newline', thereby allowing the regex to keep chugging along?
But looking at the difference between what each regex returns, the differences don't seem to have anything to do with newlines. Check out the different results each regex pulls from this snippet:
Quote:
those with whom he was most pleased. Having asked one Zeno, upon his
using some far-fetched phrases, "What uncouth dialect is that?" he
replied, "The Doric." For this answer he banished him to Cinara [354],

Code:
egrep -o '"[^"]*"' ~/File.txt
"What uncouth dialect is that?"
"The Doric."

egrep -o '"[^"\n]*"' ~/File.txt
"The Doric."

Why does the second regex miss the first quote? The first regex returns about twice as many hits as the second one, and they all appear to be valid, single line quotes.
I should also mention that I'm not trying to accomplish anything. My interest is purely academic, and I'm a total noob.

GNU grep
OS X 10.6.8
Original file is plain vanilla ASCII, each line ends in a newline.
    #2  
Old 06-10-2012
Registered User
 
Join Date: Dec 2011
Posts: 18
Thanks: 0
Thanked 8 Times in 7 Posts
Regexes are greedy, so only the last line matches the class of no double quotes and no newlines.
Try grep's -P or -z option to match across newlines.

B.t.w.: this is an area where the regex functionality differs a bit between grep, awk, Perl, etc.
    #3  
Old 06-10-2012
alister (Forum Advisor)
Registered User
 
Join Date: Dec 2009
Posts: 2,601
Thanks: 123
Thanked 717 Times in 600 Posts
Quote:
Originally Posted by sudon't
Code:
egrep -o '"[^"]*"' ~/File.txt
"What uncouth dialect is that?"
"The Doric."

egrep -o '"[^"\n]*"' ~/File.txt
"The Doric."

egrep/fgrep/grep is a line-oriented tool. The regular expression will NEVER span across lines because grep operates one line at a time.
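The line-at-a-time behaviour is easy to check. This is only a rough sketch with made-up input (the two printf strings and the variable names are not from the thread): a quoted phrase that opens on one line and closes on the next is never matched, while the same phrase on a single line is.

```shell
# Hypothetical input: the quote opens on line 1 and closes on line 2.
if printf 'he said "one\ntwo" and left\n' | grep -E -q '"[^"]*"'; then
    spans_lines=yes
else
    spans_lines=no
fi

# The same phrase on a single line is matched as usual.
same_line=$(printf 'he said "one two" and left\n' | grep -E -o '"[^"]*"')

printf 'matched across lines: %s\n' "$spans_lines"
printf 'matched on one line:  %s\n' "$same_line"
```

Each input line is tested separately, so neither half of the split phrase contains both a closing and an opening quote for the pattern to work with.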

The regular expression with the bracket expression [^"\n] does not match "What uncouth dialect is that?" because, within a bracket expression, the backslash ceases to be a special character. The sequence \n in [^"\n] does not represent a newline character but two separate characters, a backslash and an n. Since this translates to any character that is not a quote, backslash, or n, the n in "uncouth" prevents the match from occurring.

For the nitty gritty on bracket expressions, refer to Regular Expressions, from which the following is extracted:
Quote:
Originally Posted by POSIX
The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
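The bracket rule can be confirmed against the snippet from the first post. A minimal sketch, assuming a grep that follows the POSIX bracket-expression rules; the temporary file, the variable names, and the extra "a\b" test line are made up for illustration.

```shell
# Scratch copy of the snippet from the first post (temporary file, removed at the end).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
those with whom he was most pleased. Having asked one Zeno, upon his
using some far-fetched phrases, "What uncouth dialect is that?" he
replied, "The Doric." For this answer he banished him to Cinara [354],
EOF

# Bracket expression excludes only the double quote.
plain=$(grep -E -o '"[^"]*"' "$tmp")

# Bracket expression excludes the double quote, the backslash, and the letter n.
with_bs_n=$(grep -E -o '"[^"\n]*"' "$tmp")

# Two made-up lines: a backslash between the quotes blocks the match just as an n does.
bs_hit=$(printf '%s\n' '"a\b"' '"ab"' | grep -E -o '"[^"\n]*"')

printf '%s\n---\n%s\n---\n%s\n' "$plain" "$with_bs_n" "$bs_hit"
rm -f "$tmp"
```

The first pattern returns both quoted phrases, the second returns only "The Doric.", and of the two made-up lines only "ab" survives.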
Regards,
Alister
The Following User Says Thank You to alister For This Useful Post:
sudon't (06-11-2012)
    #4  
Old 06-11-2012
Registered User
 
Join Date: Dec 2011
Posts: 18
Thanks: 0
Thanked 8 Times in 7 Posts
@alister: My bad! Greediness indeed has nothing to do with this problem. Thanks for putting that right.
    #5  
Old 06-11-2012
Scrutinizer
Moderator
 
Join Date: Nov 2008
Location: Amsterdam
Posts: 7,346
Thanks: 144
Thanked 1,755 Times in 1,592 Posts
@alister, it may be interesting to add that sed, like grep, is line-oriented, but it does have the ability to match \n, since an embedded newline can arise through the use of the N or H commands. But indeed, \n loses its meaning within square brackets. \n is not part of POSIX regular expressions, but POSIX sed has an addition for this:
Quote:
The escape sequence '\n' shall match a <newline> embedded in the pattern space
sed: regular expressions
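As a small illustration of the sed rule (made-up input and variable names, so treat it as a sketch rather than a recipe): N pulls the next line into the pattern space, and the \n in the s command then matches the embedded newline, after which grep sees the quoted phrase on a single line.

```shell
# N appends the next input line to the pattern space, separated by an embedded newline;
# the \n in the regular expression of the s command matches that newline.
joined=$(printf 'he said "one\ntwo" and left\n' | sed 'N; s/\n/ /')

# With the two lines joined, the line-oriented grep can see the whole quoted phrase.
quoted=$(printf '%s\n' "$joined" | grep -E -o '"[^"]*"')

printf '%s\n%s\n' "$joined" "$quoted"
```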

POSIX awk goes even further: it extends POSIX regular expressions with the C-language escape sequences, and they are valid within bracket expressions...

Quote:
The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions ) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD File Format Notation ( '\\' , '\a' , '\b' , '\f' , '\n' , '\r' , '\t' , '\v' ) and the following table; these escape sequences shall be recognized both inside and outside bracket expressions. Note that records need not be separated by <newline> characters and string constants can contain <newline> characters, so even the "\n" sequence is valid in awk EREs.
awk: regular expressions
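To see the awk behaviour with the pattern from the first post, here is a minimal sketch (the sample variable and the extraction loop are illustrative, not from the thread). Because awk reads \n inside the brackets as a newline, the n in "uncouth" no longer blocks the match.

```shell
sample='using some far-fetched phrases, "What uncouth dialect is that?" he
replied, "The Doric." For this answer he banished him to Cinara [354],'

awk_hits=$(printf '%s\n' "$sample" | awk '{
    line = $0
    # match() sets RSTART and RLENGTH; loop to collect every quoted phrase on the line.
    while (match(line, /"[^"\n]*"/)) {
        print substr(line, RSTART, RLENGTH)
        line = substr(line, RSTART + RLENGTH)
    }
}')

printf '%s\n' "$awk_hits"
```

Both quoted phrases are printed, unlike with the same bracket expression in egrep.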
    #6  
Old 06-11-2012
alister (Forum Advisor)
Registered User
 
Join Date: Dec 2009
Posts: 2,601
Thanks: 123
Thanked 717 Times in 600 Posts
Quote:
Originally Posted by Scrutinizer
sed is also line oriented, but that it does have the ability to match \n which can occur through the use of N or H commands.
To that list of sed commands which can inject a newline you can also add y, G, and s.


Quote:
Originally Posted by Scrutinizer
POSIX awk goes even further
ex also has its own flavor.

Even within the confines of the POSIX standard, there are quite a few RE flavors to master: basic, extended, sed, AWK, and also ex. Add to that proprietary extensions by implementations of the standard tools and the dynamic programming languages and you have quite a melange.

Regards,
Alister
    #7  
Old 06-11-2012
sudon't
Registered User
 
Join Date: May 2012
Location: The Cape Fear...ooooh!
Posts: 75
Thanks: 46
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by alister

the n in "uncouth" prevents the match from occurring.

Regards,
Alister
Wow, yes, I did know about the bracket rule, and should've thought of that. On the other hand, he's using it as an example in the book. Perhaps it's meant for Perl (which would explain why the -P flag was suggested)? That's the thing I'm having the most trouble with: understanding what works with what.
It even turns out that there are different greps that behave differently! A lifetime of Mac OS use has not prepared me for Unix.
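The Perl guess can be tested with grep's -P option, where \n inside brackets does mean a newline. A rough sketch, assuming a GNU grep built with PCRE support; the sample variable and the fallback message are made up, and the probe is there because -P is not available in every grep.

```shell
sample='using some far-fetched phrases, "What uncouth dialect is that?" he
replied, "The Doric." For this answer he banished him to Cinara [354],'

# -P is a GNU grep extension and is missing from some builds, so probe for it first.
if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
    # In Perl-compatible syntax, \n inside a bracket expression is a newline.
    pcre_hits=$(printf '%s\n' "$sample" | grep -P -o '"[^"\n]*"')
else
    pcre_hits='grep -P not available'
fi

printf '%s\n' "$pcre_hits"
```

Where -P is supported, both quoted phrases come back, which fits the idea that the book's pattern is written in Perl-style syntax.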