Query: perlretut
OS: centos
Section: 1
PERLRETUT(1)           Perl Programmers Reference Guide           PERLRETUT(1)

NAME
       perlretut - Perl regular expressions tutorial

DESCRIPTION
       This page provides a basic tutorial on understanding, creating and
       using regular expressions in Perl. It serves as a complement to the
       reference page on regular expressions perlre. Regular expressions are
       an integral part of the "m//", "s///", "qr//" and "split" operators
       and so this tutorial also overlaps with "Regexp Quote-Like Operators"
       in perlop and "split" in perlfunc.

       Perl is widely renowned for excellence in text processing, and regular
       expressions are one of the big factors behind this fame. Perl regular
       expressions display an efficiency and flexibility unknown in most
       other computer languages. Mastering even the basics of regular
       expressions will allow you to manipulate text with surprising ease.

       What is a regular expression? A regular expression is simply a string
       that describes a pattern. Patterns are in common use these days;
       examples are the patterns typed into a search engine to find web pages
       and the patterns used to list files in a directory, e.g., "ls *.txt"
       or "dir *.*". In Perl, the patterns described by regular expressions
       are used to search strings, extract desired parts of strings, and to
       do search and replace operations.

       Regular expressions have the undeserved reputation of being abstract
       and difficult to understand. Regular expressions are constructed using
       simple concepts like conditionals and loops and are no more difficult
       to understand than the corresponding "if" conditionals and "while"
       loops in the Perl language itself. In fact, the main challenge in
       learning regular expressions is just getting used to the terse
       notation used to express these concepts.

       This tutorial flattens the learning curve by discussing regular
       expression concepts, along with their notation, one at a time and with
       many examples. The first part of the tutorial will progress from the
       simplest word searches to the basic regular expression concepts.
       If you master the first part, you will have all the tools needed to
       solve about 98% of your needs. The second part of the tutorial is for
       those comfortable with the basics and hungry for more power tools. It
       discusses the more advanced regular expression operators and
       introduces the latest cutting-edge innovations.

       A note: to save time, 'regular expression' is often abbreviated as
       regexp or regex. Regexp is a more natural abbreviation than regex, but
       is harder to pronounce. The Perl pod documentation is evenly split on
       regexp vs regex; in Perl, there is more than one way to abbreviate it.
       We'll use regexp in this tutorial.

Part 1: The basics
   Simple word matching
       The simplest regexp is simply a word, or more generally, a string of
       characters. A regexp consisting of a word matches any string that
       contains that word:

           "Hello World" =~ /World/;  # matches

       What is this Perl statement all about? "Hello World" is a simple
       double-quoted string. "World" is the regular expression and the "//"
       enclosing "/World/" tells Perl to search a string for a match. The
       operator "=~" associates the string with the regexp match and produces
       a true value if the regexp matched, or false if the regexp did not
       match. In our case, "World" matches the second word in "Hello World",
       so the expression is true. Expressions like this are useful in
       conditionals:

           if ("Hello World" =~ /World/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       There are useful variations on this theme.
       The sense of the match can be reversed by using the "!~" operator:

           if ("Hello World" !~ /World/) {
               print "It doesn't match\n";
           }
           else {
               print "It matches\n";
           }

       The literal string in the regexp can be replaced by a variable:

           $greeting = "World";
           if ("Hello World" =~ /$greeting/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       If you're matching against the special default variable $_, the
       "$_ =~" part can be omitted:

           $_ = "Hello World";
           if (/World/) {
               print "It matches\n";
           }
           else {
               print "It doesn't match\n";
           }

       And finally, the "//" default delimiters for a match can be changed to
       arbitrary delimiters by putting an 'm' out front:

           "Hello World" =~ m!World!;   # matches, delimited by '!'
           "Hello World" =~ m{World};   # matches, note the matching '{}'
           "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin',
                                        # '/' becomes an ordinary char

       "/World/", "m!World!", and "m{World}" all represent the same thing.
       When, e.g., the quote ('"') is used as a delimiter, the forward slash
       '/' becomes an ordinary character and can be used in this regexp
       without trouble.

       Let's consider how different regexps would match "Hello World":

           "Hello World" =~ /world/;  # doesn't match
           "Hello World" =~ /o W/;    # matches
           "Hello World" =~ /oW/;     # doesn't match
           "Hello World" =~ /World /; # doesn't match

       The first regexp "world" doesn't match because regexps are
       case-sensitive. The second regexp matches because the substring 'o W'
       occurs in the string "Hello World". The space character ' ' is treated
       like any other character in a regexp and is needed to match in this
       case. The lack of a space character is the reason the third regexp
       'oW' doesn't match. The fourth regexp 'World ' doesn't match because
       there is a space at the end of the regexp, but not at the end of the
       string. The lesson here is that regexps must match a part of the
       string exactly in order for the statement to be true.
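       The four attempts against "Hello World" can also be checked in one
       small script. This is only a sketch: the helper name first_match is
       hypothetical and not part of this tutorial, and it relies on Perl's
       special array @-, whose first element holds the start offset of the
       last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the tutorial): return the offset at which
# the pattern first matches in the string, or -1 if it does not match.
sub first_match {
    my ($string, $pattern) = @_;
    # $-[0] is the start offset of the last successful match
    return $string =~ $pattern ? $-[0] : -1;
}

my $text = "Hello World";

# The same four patterns discussed above, precompiled with qr//
for my $re (qr/world/, qr/o W/, qr/oW/, qr/World /) {
    my $pos = first_match($text, $re);
    print "$re: ", ($pos >= 0 ? "matches at offset $pos" : "doesn't match"), "\n";
}
```

       Only the second pattern, 'o W', should report a match (at offset 4,
       the 'o' of "Hello"); the other three fail for the reasons given
       above: case, the missing space, and the extra trailing space.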
       If a regexp matches in more than one place in the string, Perl will
       always match at the earliest possible point in the string:

           "Hello World" =~ /o/;       # matches 'o' in 'Hello'
           "That hat is red" =~ /hat/; # matches 'hat' in 'That'

       With respect to character matching, there are a few more points you
       need to know about. First of all, not all characters can be used 'as
       is' in a match. Some characters, called metacharacters, are reserved
       for use in regexp notation. The metacharacters are

           {}[]()^$.|*+?\

       The significance of each of these will be explained in the rest of the
       tutorial, but for now, it is important only to know that a
       metacharacter can be matched by putting a backslash before it:

           "2+2=4" =~ /2+2/;    # doesn't match, + is a metacharacter
           "2+2=4" =~ /2\+2/;   # matches, \+ is treated like an ordinary +
           "The interval is [0,1)." =~ /[0,1)./     # is a syntax error!
           "The interval is [0,1)." =~ /\[0,1\)\./  # matches
           "#!/usr/bin/perl" =~ /#!\/usr\/bin\/perl/;  # matches

       In the last regexp, the forward slash '/' is also backslashed, because
       it is used to delimit the regexp. This can lead to LTS (leaning
       toothpick syndrome), however, and it is often more readable to change
       delimiters.

           "#!/usr/bin/perl" =~ m!#\!/usr/bin/perl!;  # easier to read

       The backslash character '\' is a metacharacter itself and needs to be
       backslashed:

           'C:\WIN32' =~ /C:\\WIN/;   # matches

       In addition to the metacharacters, there are some ASCII characters
       which don't have printable character equivalents and are instead
       represented by escape sequences. Common examples are "\t" for a tab,
       "\n" for a newline, "\r" for a carriage return and "\a" for a bell (or
       alert). If your string is better thought of as a sequence of arbitrary
       bytes, the octal escape sequence, e.g., "