A bug in Perl regex


 
# 1  
Old 02-18-2011

The regex
Code:
'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;

outputs:
Code:
$2=ab
$2 not defined
$2=b

Why is $2 not defined? I think the regex should print $2=a here. Is it a bug?
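A minimal sketch for reproducing the test on any Perl: it records what the code block sees in an array instead of printing from inside the regex, and reports the interpreter version next to it. The names @seen and $matched and the 'undef' placeholder are only illustrative, and the recorded values may differ between Perl versions, so no expected output is given.

```perl
use strict;
use warnings;

# Package variable: older Perls handle lexicals inside (?{...}) poorly.
our @seen;

# Same pattern as above, but the code block records $2 instead of printing it.
my $matched = 'ab' =~ /((\w+)(?{ push @seen, defined $2 ? $2 : 'undef' })){2}/;

# $] is the version of the running interpreter, useful when comparing results.
printf "perl %s, match %s\n", $], $matched ? 'succeeded' : 'failed';
print "group 2 seen in code block: $_\n" for @seen;
```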

Last edited by pludi; 02-18-2011 at 05:18 AM..
# 2  
Old 02-18-2011
I'm not completely sure, but I think it has to do with the regex being greedy AND required to run twice, although I can't explain why. Maybe the gurus at the monastery can help.
# 3  
Old 02-18-2011
Quote:
Originally Posted by pludi
Maybe the gurus at the PerlMonks - The Monastery Gates can help.
Thanks, pludi. I have both emailed Jeffrey Friedl and posted on PerlMonks.

Here's my reply to the Perl gurus at perl.org:

===
Let's write the regex

'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;

as

((\w+)(?{print...}))((\w+)(?{print...}))

\w{2} is equivalent to \w\w, right? But we assume that the second copy of the
regex also sets the same $1 and $2 (not $3 and $4). The current position in the
regex is marked with |.

1. The first (\w+) captures all the text:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 receives the value 'ab', and the eval prints $2=ab.

2. Then we enter the second copy of (\w+):
((\w+)(?{print...}))(( | \w+)(?{print...}))
$2 (and also $+, $^N, \2) becomes undefined.

3. We see that \w does not match, so we backtrack:
((\w+ | )(?{print...}))((\w+)(?{print...}))
We enter the first copy of (\w+) from right to left, and $2 again becomes undefined.

4. (\w+) captures the letter a:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 should receive the value a, but in the current version of Perl $2 is
undefined. Why? Perhaps the two undefined values are stored for $2 as on a stack,
then the last value is popped from the stack, and $2 is again undefined?
Here the eval should print $2=a.

5. The second copy of (\w+) captures the letter b:
((\w+)(?{print...}))((\w+) | (?{print...}))
The eval prints $2=b. The match succeeds.

Do you see any mistake in this reasoning?
===
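For comparison, here is a sketch of the hand-unrolled form from the reasoning above, written out as a real regex. In this form the second copy has its own groups $3 and $4, so it is not the same pattern as the {2} version; it only shows what the code blocks see when nothing is shared between the two copies. The @trace array and the "group N:" labels are made up for illustration.

```perl
use strict;
use warnings;

our @trace;   # package variable, to stay safe with older (?{...}) implementations

# Hand-unrolled form: the second copy gets its own groups $3 and $4.
my $matched = 'ab' =~ /
    ((\w+)(?{ push @trace, 'group 2: ' . (defined $2 ? $2 : 'undef') }))
    ((\w+)(?{ push @trace, 'group 4: ' . (defined $4 ? $4 : 'undef') }))
/x;

print "$_\n" for @trace;
```

Here the first code block should run twice (once after \w+ takes 'ab', once after it backs off to 'a') and see a defined $2 both times, which is the behaviour I expected from the {2} form as well.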

Latest edit:

It seems I was mistaken. Here's a correction to my previous reasoning.
Let's write the regex

'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;

as

((\w+)(?{print...}))((\w+)(?{print...}))

\w{2} is equivalent to \w\w, right? But we assume that the second copy of the
regex also sets the same $1 and $2 (not $3 and $4). The current position in the
regex is marked with |.

1. The first (\w+) captures all the text:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 receives the value 'ab', and the eval prints $2=ab.

2. Then we enter the second copy of (\w+):
((\w+)(?{print...}))(( | \w+)(?{print...}))
$2 (and also $+, $^N, \2) becomes undefined.

3. We see that \w does not match, so we backtrack:
((\w+ | )(?{print...}))((\w+)(?{print...}))
We enter the first copy of (\w+) from right to left, and $2 again becomes undefined.

4. \w+ gives back the letter b (but $2 remains undefined, because we did not move left of the opening parenthesis for $2):
(( | \w+)(?{print...}))((\w+)(?{print...}))
$2 remains undefined.

5. (\w+) captures nothing, because we did not move left of the opening parenthesis for $2:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 remains undefined. The eval prints "$2 not defined".

6. The second copy of (\w+) captures the letter b:
((\w+)(?{print...}))((\w+) | (?{print...}))
The eval prints $2=b. The match succeeds.
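To check the positions used in these steps against the real engine, the code block can also record pos(), which inside (?{...}) gives the current match position in the target string. This is only a sketch: @trace is an illustrative name, and the value recorded for $2 on the second call is exactly the thing in question, so it may differ between Perl versions.

```perl
use strict;
use warnings;

our @trace;

# Record the current match position together with what $2 holds at that moment.
my $matched = 'ab' =~ /((\w+)(?{ push @trace, [ pos(), defined $2 ? $2 : 'undef' ] })){2}/;

printf "pos=%d  group 2: %s\n", @$_ for @trace;
```

If the second call reports pos=1, the engine really is standing just after 'a' with the group closed at that point, whatever value it shows for $2.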

Sorry for my poor English.

===

After the previous post I thought about it again, and now I think that intuitively $2=undefined is incorrect and $2=a is correct.
After that I received an email from the regex guru Jeffrey Friedl (regex.info):
---
Hi Serge,
I've been thinking about this for a while, and as far as I can tell it does seem
to be a bug. By definition, $2 must be defined before the (?{...}) can run.

It's probably a problem with how it backtracks. I'd suggest filing a bug report..
---
Splitting the regex into
((\w+)(?{print...}))((\w+)(?{print...}))
is wrong; in reality the regex is not split.
After (\w+) captures the whole string:
(\w+)) | {2}
we see that the second repetition of \w does not match. We backtrack and enter the second pair of parentheses going from right to left:
(\w | )+
In this case the regex engine (I think) sets $2=undefined, but why? Intuitively, it seems that $2 should be set to undefined only after we leave the opening second parenthesis going from right to left.
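Rather than guessing what the engine does while backtracking, one can ask Perl to show it. A minimal sketch, assuming a Perl new enough that use re 'debug' is lexically scoped (5.10 or later): it dumps the compiled program and the execution trace for this one match to STDERR. The variables $matched and @seen are illustrative names, and the trace format differs between versions.

```perl
use strict;
use warnings;

our (@seen, $matched);
{
    # Lexically scoped: only the regex compiled and run inside this block is traced.
    use re 'debug';
    $matched = 'ab' =~ /((\w+)(?{ push @seen, defined $2 ? $2 : 'undef' })){2}/;
}

print $matched ? "matched\n" : "no match\n";
print "group 2 seen in code block: $_\n" for @seen;
```

The trace on STDERR, together with the output of perl -v, would probably also be useful material for the bug report.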

Last edited by cronc; 02-19-2011 at 11:38 AM..