debian regexp::optimizer man page on unix.com

Regexp::Optimizer(3pm)					User Contributed Perl Documentation				    Regexp::Optimizer(3pm)

NAME
       Regexp::Optimizer - optimizes regular expressions

SYNOPSIS
	 use Regexp::Optimizer;
	 my $o	= Regexp::Optimizer->new;
	 my $re = $o->optimize(qr/foobar|fooxar|foozap/);
	 # $re is now qr/foo(?:[bx]ar|zap)/

ABSTRACT
       This module does, ahem, attempts to, optimize regular expressions.

INSTALLATION
       To install this module type the following:

	  perl Makefile.PL
	  make
	  make test
	  make install

DESCRIPTION
       Here is a quote from perltodo.

	   Factoring out common suffices/prefices in regexps (trie optimization)

	   Currently, the user has to optimize "foo|far" and "foo|goo" into "f(?:oo|ar)" and "[fg]oo" by hand; this could be done automatically.

       This module implements just that.

   EXPORT
       Since this is an OO module there is no symbol exported.

METHODS
       This module is implemented as a subclass of Regexp::List.  For methods not listed here, see Regexp::List.

       $o  = Regexp::Optimizer->new;
       $o->set(key => value, ...)
	   Just the same us Regexp::List except for the attribute below;

	   unexpand
	       When set to one, $o->optimize() tries to $o->expand before actually starting the operation.

		 # cases you need to set expand => 1
		 $o->set(expand => 1)->optimize(qr/
						  foobar|
						  fooxar|
						  foozar
						  /x);

       $re = $o->optimize(regexp);
	   Does the job.  Note that unlike "->list2re()" in Regexp::List, the argument is the regular expression itself.  What it basically does
	   is to find groups will alterations and replace it with the result of "$o->list2re".

       $re = $o->list2re(list of words ...)
	   Same as "list2re()" in Regexp::List in terms of functionality but how it tokenize "atoms" is different since the arguments can be
	   regular expressions, not just strings.  Here is a brief example.

	     my @expr = qw/foobar fooba+/;
	     Regexp::List->new->list2re(@expr) eq qr/fooba[+r]/;
	     Regexp::Optimizer->new->list2re(@expr) eq qr/foob(?:a+ar)/;

CAVEATS
       This module is still experimental.  Do not assume that the result is the same as the unoptimized version.

       o   When you just want a regular expression which matches normal words with not metacharacters, use <Regexp::List>.  It's more robus and
	   much faster.

       o   When you have a list of regular expessions which you want to aggregate, use "list2re" of THIS MODULE.

       o   Use "->optimize()" when and only when you already have a big regular expression with alterations therein.

	   "->optimize()" does support nested groups but its parser is not tested very well.

BUGS
       o   Regex parser in this module (which itself is implemented by regular expression) is not as thoroughly tested as Regexp::List

       o   May still fall into deep recursion when you attempt to optimize deeply nested regexp.  See "PRACTICALITY".

       o   Does not grok (?{expression}) and (?(cond)yes|no) constructs yet

       o   You need to escape characters in character classes.

	     $o->optimize(qr/[a-z()]|[A-Z]/);		   # wrong
	     $o->optimize(qr/[a-z()]|[A-Z]/); 	   # right
	     $o->optimize(qr/[0-9A-Za-z]|[Q-_.!~*"'()E]/ # right, too.

       o   When character(?: class(?:es)?)? are aggregated, duplicate ranges are left as is.  Though functionally OK, it is cosmetically ugly.

	     $o->optimize(qr/[0-5]|[5-9]|0123456789/);
	     # simply turns into [0-5][5-9]0123456789] not [0-9]

	   I left it that way because marking-rearranging approach can result a humongous result when unicode characters are concerned (and
	   p{Properties}).

PRACTICALITY
       Though this module is still experimental, It is still good enough even for such deeply nested regexes as the followng.

	 # See 3.2.2 of  http://www.ietf.org/rfc/rfc2616.txt
	 # BNF faithfully turned into a regex
	 http://(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|(?:(?:[a-z]|[A-Z])|[0-9])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9])).)*(?:(?:[a-z]|[A-Z])|(?:[a-z]|[A-Z])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9])).?|[0-9]+.[0-9]+.[0-9]+.[0-9]+)(?::[0-9]*)?(?:/(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[-\_.!~*'()])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*(?:;(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[-\_.!~*'()])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*)*(?:/(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[-\_.!~*'()])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*(?:;(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[-\_.!~*'()])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*)*)*(?:\?(?:[;/?:@&=+$,]|(?:(?:(?:[a-z]|[A-Z])|[0-9])|[-\_.!~*'()])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f]))*)?)?

	 # and optimized
	 http://(?::?[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?.[a-zA-Z]*(?:[a-zA-Z0-9-]*[a-zA-Z0-9])?.?|[0-9]+.[0-9]+.[0-9]+.[0-9]+)(?::[0-9]*)?(?:/(?:(?:(?:[a-zA-Z0-9-\_.!~*'x28x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*(?:;(?:(?:(?:[a-zA-Z0-9-\_.!~*'x28x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*)*(?:/(?:(?:(?:[a-zA-Z0-9-\_.!~*'x28x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*(?:;(?:(?:(?:[a-zA-Z0-9-\_.!~*'x28x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*)*)*(?:\?(?:(?:[;/?:@&=+$,a-zA-Z0-9-\_.!~*'x28x29]|%[0-9A-Fa-f][0-9A-Fa-f]))*)?)?

       By carefully examine both you can find that character classes are properly aggregated.

SEE ALSO
       Regexp::List -- upon which this module is based

       "eg/" directory in this package contains example scripts.

       Perl standard documents
	    L<perltodo>, L<perlre>

       CPAN Modules
	   Regexp::Presuf, Text::Trie

       Books
	   Mastering Regular Expressions  <http://www.oreilly.com/catalog/regex2/>

AUTHOR
       Dan Kogai <dankogai@dan.co.jp>

COPYRIGHT AND LICENSE
       Copyright 2003 by Dan Kogai

       This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

perl v5.14.2							    2004-12-05						    Regexp::Optimizer(3pm)
debian man page for regexp::optimizer