q with Perl Regex

07-24-2008


For a programming exercise, I am meant to design a Perl script that detects double letters in a text file.

I tried the following expressions:

Code:

# Check for any double letter within the alphabet

/[a-zA-Z]+/

# Check for any repetition of an alphanumeric character

/\w+/

I'm aware that + means one or more occurrences of the preceding item, but neither of these expressions met the requirements of my program.

Also,

Code:

/[a-zA-Z]{1}/

did not prove to be helpful either.

After doing some searching, I stumbled across the correct form of the regex for the double letter case. It turned out to be

Code:

/(.)\1/

Now I know that . refers to any single character and the \1 refers to the first character in the line being read (if s/..../.... is being used), but I'm still puzzled as to why /(.)\1/ works instead of /[a-zA-Z]+/ for the case of double letters?

Many thanks,
James

JamesGoh


07-24-2008


Quote:

Originally Posted by JamesGoh

Now I know that . refers to any single character and the \1 refers to the first character in the line being read (if s/..../.... is being used), but I'm still puzzled as to why /(.)\1/ works instead of /[a-zA-Z]+/ for the case of double letters?

* Incorrect text removed *

/[a-zA-Z]+/ only matches a contiguous run of one or more letters, so not only will 'AA' or 'zz' match, but 'Az' and even a single 'A' will match too. Nothing in it requires the letters to be the same.
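To see the difference for yourself, here is a small sketch. The sample words and the helper name `matches` are made up for illustration; they are not from the original exercise.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen for illustration only.
my @words = ('Az', 'A', 'zz', 'hello');

# Returns 1 if $word matches the compiled pattern $re, otherwise 0.
sub matches {
    my ($re, $word) = @_;
    return $word =~ $re ? 1 : 0;
}

for my $word (@words) {
    # Left column: character class with +; right column: backreference.
    printf "%-6s class+ : %d   backref : %d\n",
        $word,
        matches(qr/[a-zA-Z]+/, $word),
        matches(qr/(.)\1/,     $word);
}
```

The class pattern reports a match for every word here, while the backreference pattern only reports one for 'zz' and 'hello'.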

Last edited by cbkihong; 07-24-2008 at 02:28 AM. Reason: Incorrect text removed

cbkihong


07-24-2008


\1 is a backreference to whatever was matched by the first pair of parentheses in the regexp. So /(.)\1/ finds a double occurrence of whatever (.) matched. It is similar to $1, but \1 is used inside the regexp, while $1 is used after the match. It is discussed in some detail here:

perlretut - perldoc.perl.org
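As a rough sketch of how the two fit together: \1 does the work inside the pattern, and $1 lets you report what was found afterwards. The helper name `first_double` and the sample strings are hypothetical, only there to make the example runnable.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first doubled character in $text,
# or undef if no character appears twice in a row.
sub first_double {
    my ($text) = @_;
    # Inside the pattern, \1 must match whatever group 1 just captured.
    if ($text =~ /(.)\1/) {
        return $1;    # after the match, $1 holds what group 1 captured
    }
    return;
}

for my $text ('foobar', 'perl', 'aardvark') {
    my $c = first_double($text);
    my $report = defined($c) ? "doubled '$c'" : "no doubled character";
    print "$text: $report\n";
}
```

For a text file you would run the same check on each line as you read it.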

KevinADC


07-24-2008


Quote:

Originally Posted by cbkihong

Actually, not even /(.)\1/ is correct. In Perl, you should use /(.)$1/. The former syntax is there for compatibility with I think awk or sed but that should in general not be used in Perl, because Perl has more uses of backslash that may interfere with backtracking.

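That is not correct. Using \1 is perfectly good Perl code. \1 and $1 really have two separate uses: \1 is a backreference inside the pattern, while $1 holds the captured text after a successful match. If you put $1 inside the pattern, it is interpolated before matching starts, with whatever an earlier match left in it (nothing, in the test below). See the link I posted in my previous post. A short test shows they do not do the same thing: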

Code:

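$_ = 'foobar';
# $1 is empty at this point, so this pattern is effectively /(.)/
if (/(.)$1/) {
   print "\$1 = $1\n";
}
# \1 is a backreference: the same character must appear twice in a row
if (/(.)\1/) {
   print "\\1 = $1\n";
}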

output:

Code:

$1 = f
\1 = o

KevinADC


07-24-2008


Thanks everyone for your messages.

Also, I found that re-reading my notes in more detail was very helpful!

JamesGoh


07-24-2008


This does not work:

/[a-zA-Z]+/

because it means one or more characters from inside the square brackets: any of those characters, in any order. You want to find the same character twice in a row in a string, not one or more of any character inside the [] brackets.
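If the exercise really means letters only, one way to combine the two ideas is to put the character class inside the capturing group, since /(.)\1/ will also match things like two spaces or two digits. This is just a sketch; the helper name `has_double_letter` and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text contains the same letter
# twice in a row, otherwise 0.
sub has_double_letter {
    my ($text) = @_;
    # The class limits what group 1 can capture; \1 then requires a repeat.
    return $text =~ /([a-zA-Z])\1/ ? 1 : 0;
}

# Sample strings made up for illustration.
for my $text ('balloon', 'a  b', 'route 66', 'Aa') {
    print "'$text' => ", has_double_letter($text), "\n";
}
```

Only 'balloon' matches here. The double space and '66' are not letters, and 'Aa' fails because the backreference has to match the exact character that was captured.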

KevinADC


07-24-2008


Interesting and thoughtful question. You use "(" and ")" to mark (remember) part of a pattern, and you recall the text it matched with "\" followed by a single digit (a backreference).

In your particular case, "(.)\1" means remember a character and recall the character.

You can extend this method to find words with multiple double letters. '(.)\1(.)\2(.)\3' will match any word with three consecutive double letters, e.g. bookkeeper.
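Here is a small sketch of that, plus a loop with the /g modifier to list every double in a word rather than only consecutive ones. The helper name `doubled_chars` and the word list are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every doubled character found in $word,
# scanning left to right.
sub doubled_chars {
    my ($word) = @_;
    my @found;
    # With /g in scalar context, each iteration resumes after the last match.
    while ($word =~ /(.)\1/g) {
        push @found, $1;
    }
    return @found;
}

for my $word ('bookkeeper', 'balloon', 'letter') {
    # Three capture groups, each immediately followed by its own backreference.
    my $three = $word =~ /(.)\1(.)\2(.)\3/ ? 'yes' : 'no';
    print "$word: doubles = ", join(',', doubled_chars($word)),
          "; three in a row = $three\n";
}
```

Note that 'balloon' has two doubles but does not match the three-pair pattern, because that pattern needs all three pairs back to back.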

fpmurphy
