What's the Diff Between These Two Regexes?


 
# 1  
Old 06-10-2012

Trying to understand what's happening here, but I cannot figure it out.
I'm reading Mastering Regular Expressions, by Friedl, and he uses this as an example of how to grab quoted text:
Code:
egrep -o '"[^"]*"' ~/File.txt

...should pull in any quoted phrases. Match a literal double-quote, match anything not a double-quote until you hit the next literal double-quote.
But, he says [^"]* can match a newline, thereby returning quoted text even if it crosses lines. If you want to keep it from crossing lines, you should use this:
Code:
egrep -o '"[^"\n]*"' ~/File.txt

But this is where my head starts to hurt because a star should never fail, right? In other words, if it hits a newline, isn't a newline zero (or more) occurrences of 'not a newline', thereby allowing the regex to keep chugging along?
But looking at the difference between what each regex returns, the differences don't seem to have anything to do with newlines. Check out the different results each regex pulls from this snippet:
Quote:
those with whom he was most pleased. Having asked one Zeno, upon his
using some far-fetched phrases, "What uncouth dialect is that?" he
replied, "The Doric." For this answer he banished him to Cinara [354],
Code:
egrep -o '"[^"]*"' ~/File.txt
"What uncouth dialect is that?"
"The Doric."

egrep -o '"[^"\n]*"' ~/File.txt
"The Doric."

Why does the second regex miss the first quote? The first regex returns about twice as many hits as the second one, and they all appear to be valid, single line quotes.
I should also mention that I'm not trying to accomplish anything. My interest is purely academic, and I'm a total noob.

GNU grep
OS X 10.6.8
Original file is plain vanilla ASCII, each line ends in a newline.
# 2  
Old 06-10-2012
Regexes are greedy, so only the last line matches the class that excludes double quotes and newlines.
Try options -P or -z of grep to match across newlines.

B.t.w.: this is an area where the regex functionality differs a bit between grep, awk, Perl, etc.
# 3  
Old 06-10-2012
Quote:
Originally Posted by sudon't
Code:
egrep -o '"[^"]*"' ~/File.txt
"What uncouth dialect is that?"
"The Doric."

egrep -o '"[^"\n]*"' ~/File.txt
"The Doric."

egrep/fgrep/grep is a line-oriented tool. The regular expression will NEVER span across lines because grep operates one line at a time.
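
A minimal sketch of that line-at-a-time behaviour, assuming a POSIX shell and a grep that supports -o (GNU and BSD grep both do). The two-line sample is made up for illustration. The quotation straddles a newline, so neither line on its own holds a complete pair of quotes and nothing is printed, even though the bracket expression contains no \n at all:

```shell
# Hypothetical two-line input: the quoted phrase is split across lines.
sample=$(printf '%s\n' 'he said "hello' 'world" and left')

# grep tests each line separately, so the opening and closing quotes
# are never seen together. `|| true` hides grep's exit status of 1,
# which it returns when nothing matches.
spanning=$(printf '%s\n' "$sample" | grep -Eo '"[^"]*"' || true)

printf 'matches: [%s]\n' "$spanning"
```

The brackets in the output come out empty: there is nothing for [^"]* to cross, because grep never hands the regex more than one line.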

The reason the regular expression with the bracket expression [^"\n] does not match "What uncouth dialect is that?" is that within a bracket expression the backslash ceases to be a special character; the sequence \n in [^"\n] does not represent a newline character, but a backslash and an n, two separate characters. Since this translates to any character that is not a quote, backslash, or n, the n in "uncouth" prevents the match from occurring.
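
The effect can be reproduced on the three lines from the original post. This is only a sketch, assuming a POSIX shell and a grep with -E and -o; the variable names are mine:

```shell
# The three lines quoted in the original post.
text=$(printf '%s\n' \
  'those with whom he was most pleased. Having asked one Zeno, upon his' \
  'using some far-fetched phrases, "What uncouth dialect is that?" he' \
  'replied, "The Doric." For this answer he banished him to Cinara [354],')

# [^"] excludes only the double quote: both quotations are found.
plain=$(printf '%s\n' "$text" | grep -Eo '"[^"]*"')

# [^"\n] excludes the quote, the backslash and the letter n,
# so the quotation containing "uncouth" is lost.
with_bs_n=$(printf '%s\n' "$text" | grep -Eo '"[^"\n]*"')

printf '%s\n---\n%s\n' "$plain" "$with_bs_n"
```

Above the --- line both quotations appear; below it only "The Doric." survives, because it happens to contain neither an n nor a backslash.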

For the nitty gritty on bracket expressions, refer to Regular Expressions, from which the following is extracted:
Quote:
Originally Posted by POSIX
The special characters '.', '*', '[', and '\' (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.
Regards,
Alister
# 4  
Old 06-11-2012
@alister: My bad! Greediness indeed has nothing to do with this problem. Thanks for putting that right.
# 5  
Old 06-11-2012
@alister, it may be interesting to add that like grep, sed is also line oriented, but that it does have the ability to match \n which can occur through the use of N or H commands. But indeed \n loses its meaning within square brackets. \n is not part of POSIX regular expressions, but POSIX sed adds it for this purpose:
Quote:
The escape sequence '\n' shall match a <newline> embedded in the pattern space
sed: regular expressions
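
A small sketch of the sed case, assuming a POSIX sed and a grep with -o. The input is the Zeno quotation re-broken across two lines purely for illustration. N pulls the second line into the pattern space, the \n in the s command matches the embedded newline, and the joined line can then go through the original grep pattern:

```shell
# N appends the next input line to the pattern space, separated by an
# embedded newline; the \n in the s command then matches that newline.
joined=$(printf '%s\n' \
  'using some far-fetched phrases, "What uncouth' \
  'dialect is that?" he' | sed 'N;s/\n/ /')

# With the newline gone, the whole quotation sits on a single line.
quote=$(printf '%s\n' "$joined" | grep -Eo '"[^"]*"')

printf '%s\n%s\n' "$joined" "$quote"
```

Note that the \n here sits outside any bracket expression; that is the only place the sed addition applies.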

POSIX awk goes even further, as it extends POSIX regular expressions to include the C-language escape conventions, which are also valid within bracket expressions...

Quote:
The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions ) except that it shall allow the use of C-language conventions for escaping special characters within the EREs, as specified in the table in XBD File Format Notation ( '\\' , '\a' , '\b' , '\f' , '\n' , '\r' , '\t' , '\v' ) and the following table; these escape sequences shall be recognized both inside and outside bracket expressions. Note that records need not be separated by <newline> characters and string constants can contain <newline> characters, so even the "\n" sequence is valid in awk EREs.
awk: regular expressions
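
A sketch of how this plays out, assuming an awk that follows the POSIX wording above (gawk and mawk should both qualify). The second input string and the choice of ";" as record separator are made up for the demonstration:

```shell
# In awk, \n inside a bracket expression means newline, so the letter n
# is still allowed and the "uncouth" quotation matches.
awk_quote=$(printf '%s\n' 'phrases, "What uncouth dialect is that?" he' |
  awk 'match($0, /"[^"\n]*"/) { print substr($0, RSTART, RLENGTH) }')

# With RS set to ";" the whole input is one record that contains a
# newline; the same pattern refuses to cross it, so match() returns 0.
awk_pos=$(printf '"split\nquote"' |
  awk 'BEGIN { RS = ";" } { print match($0, /"[^"\n]*"/) }')

printf '%s\n%s\n' "$awk_quote" "$awk_pos"
```

So the very pattern that drops the first quotation under grep finds it under awk, and in awk it really does stop at a newline.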
# 6  
Old 06-11-2012
Quote:
Originally Posted by Scrutinizer
sed is also line oriented, but that it does have the ability to match \n which can occur through the use of N or H commands.
To that list of sed commands which can inject a newline you can also add y, G, and s.
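
Two of those, sketched under the assumption of a POSIX sed; the <NL> marker is arbitrary and only makes the newline visible:

```shell
# G appends a newline plus the (here empty) hold space to the pattern
# space, so there is an embedded newline for \n to match.
marked=$(printf '%s\n' 'The Doric.' | sed 'G;s/\n/<NL>/')

# y transliterates characters; mapping the space to \n puts a newline
# into the pattern space, so one input line comes out as two.
split=$(printf '%s\n' 'The Doric.' | sed 'y/ /\n/')

printf '%s\n%s\n' "$marked" "$split"
```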


Quote:
Originally Posted by Scrutinizer
POSIX awk goes even further
ex also has its own flavor.

Even within the confines of the POSIX standard, there are quite a few RE flavors to master: basic, extended, sed, AWK, and also ex. Add to that proprietary extensions by implementations of the standard tools and the dynamic programming languages and you have quite a melange.

Regards,
Alister
# 7  
Old 06-11-2012
Quote:
Originally Posted by alister

the n in "uncouth" prevents the match from occurring.

Regards,
Alister
Wow, yes, I did know about the bracket rule, and should've thought of that. On the other hand, he's using it as an example in the book. Perhaps it's meant for Perl (which would be why the -P flag was suggested)? The thing I'm having the most trouble with is understanding what works with what.
It even turns out that there are different greps, which behave differently! A lifetime of Mac OS use has not prepared me for Unix.
 