Pattern Matching bakunin - Shell Programming and Scripting - Unix Linux Forums

Posted 04-12-2009 by bakunin (Forum Staff)
Quote:
Originally Posted by Aveltium
Hi, I'm very new to Linux and I'm sorry if this question is too dumb.
There are no dumb questions. Welcome aboard. ;-)

Quote:
Originally Posted by Aveltium
If it is ok, please link me to some beginner guides for questions like this one.
We have a special "Tips and Tutorials" board here, and if you search for "book recommendation" you will find a lot of threads.

My personal favourite regarding regular expressions is "sed & awk" by Dale Dougherty, published by O'Reilly. It is well written with a good dose of humor and it covers everything there is to know about these two regex-based programs. There is a specialized book about regular expressions from the same publisher too. It is well written, but I didn't like it as much as the aforementioned book.

Ok, having said this, here is a

Short (very short!) Introduction to Regular Expressions

As soon as you deal with text documents you will, sooner or later, need to search for some content. It is easy to search for strings, but in most cases fixed strings either fail to match everything they are supposed to match or match things they are not supposed to match. Regexps do not search for strings but for patterns, and the regexp language is about describing these patterns.

Suppose you have a long text and want to find the word "colour".

(We will use a small Unix program called "grep" for the examples. It is given an expression to search for and a file in which it carries out the search. It will return all the lines containing the expression. The calling convention is "grep <expr> <file>".)

Ok, here is your first regular expression:


Code:
grep "colour" /path/to/file

That wasn't too hard, was it? Well, no, but it isn't too useful either. We are just searching for a fixed string. Anyway, a fixed string is the simplest, most basic form of a regular expression.

Now suppose that the text was written by several people, some of whom speak English and some of whom are American *), so the word is sometimes written "colour" and sometimes "color". Of course we would like to find both versions, and we have to tell the program somehow that the "u" we are looking for is optional. We want to find "color" as well as "colour" but we wouldn't want to find words like "colonel-major", where something other than a "u" is between the "colo-" and the "-r". Here we go:


Code:
grep "colou*r" /path/to/file

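The asterisk ("*") tells the regexp program that the character preceding it may occur any number of times, including not at all. That makes the "u" optional, although strictly speaking the expression would match "colouur" too.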

We call this a "metacharacter". Most characters only match themselves: an "a" will match an "a" and nothing else (not even the "A", because regexps are case-sensitive). But some special characters do not match anything directly but change the way other characters are matched. A regular expression is usually a mixture of characters and metacharacters.
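If you want to try this without a real document at hand, here is a minimal sketch. The sample file and its contents are made up for the demonstration; only the expression itself comes from the text above.

```shell
# Build a small throwaway sample file (its contents are invented for this demo).
sample=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'colonel-major' 'Colour' > "$sample"

# "u*" matches zero or more "u" characters, so the first three lines match.
# "colonel-major" has other characters between "colo" and "r", and
# "Colour" starts with an upper-case "C", so neither of them matches.
matches=$(grep 'colou*r' "$sample")
echo "$matches"

rm -f "$sample"
```

The "Colour" line is in there only to show the case-sensitivity mentioned above.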

Looking at the output of the last command we see that it also matched lines containing words like "colourful" or "water-color". We might want to match only the word "colour" (however it is written) but not any compound words.

We do this by requiring whitespace (a blank or a tab) directly before and after the word, which excludes any other character there. We use a "character set" for this. It says "one of the following characters" (note that I use "<b>" for a blank and "<tab>" for a tab here because they are non-printing characters; enter a literal space and a literal tab instead when you type that in):


Code:
grep "[<b><tab>]colou*r[<b><tab>]" /path/to/file

Any ONE character inside "[...]" is matched, but not several! Therefore "d[ae]n" will match "dan" and "den" but not "dean".
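Both uses of a character set can be tried out with the following sketch. The sample lines and variable names are invented for illustration, and the tab is generated with printf so you don't have to type it literally.

```shell
# Throwaway sample data, invented for this demo.
words=$(mktemp)
printf '%s\n' 'dan' 'den' 'dean' 'din' > "$words"

# [ae] stands for exactly one character, either "a" or "e".
set_matches=$(grep 'd[ae]n' "$words")
echo "$set_matches"

text=$(mktemp)
printf '%s\n' 'the colour red' 'a colourful day' 'water-color paint' 'one color only' > "$text"

# The set holds a literal blank and a tab; the tab is produced by printf.
tab=$(printf '\t')
word_matches=$(grep "[ $tab]colou*r[ $tab]" "$text")
echo "$word_matches"

rm -f "$words" "$text"
```

Note that a line which begins or ends with the word is not found by this pattern, because there is no blank or tab before or after it.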

-*-

Ok, so far. My time is limited today and I can't explain in a few words what others write books about. I hope you got an impression of how regular expressions work, and upon request I might expand this text a little.



____________________
*) sorry - I just can't resist these opportunities ;-))






Quote:
Originally Posted by Aveltium
I want to check if the entered string is a number and has 4 digits.
The regex - without further explanation, but parts of it you will recognize - is:

"^[0-9]\{4\}$"

"^" used this way is the beginning of a line, so the expression will only be found if it starts at the beginning

"$" is analogously the end of a line - together with "^" it makes sure the string contains exactly 4 digits and nothing else

"[0-9]" is short for "[0123456789]"; it is possible to use ranges instead of single characters to form sets

"\{n\}" matches the previous expression (here the bracketed set) exactly n times

Here is the whole script:


Code:
printf "Input: " ; read -r x
# grep -q prints nothing; its exit status tells us whether the input matched
if printf '%s\n' "$x" | grep -q '^[0-9]\{4\}$' ; then
     echo "Valid"
else
     echo "Invalid"
fi
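To see what the two anchors buy you, here is a small sketch that runs the same expression over a handful of sample inputs instead of reading from the keyboard. The function name "check" and the sample values are my own invention for this demonstration.

```shell
# Hypothetical helper: wraps the expression from the script above so it
# can be called with an argument instead of reading from the keyboard.
check() {
    if printf '%s\n' "$1" | grep -q '^[0-9]\{4\}$'; then
        echo "valid"
    else
        echo "invalid"
    fi
}

# Too short, too long, a letter in between and a leading blank all fail,
# because "^" and "$" leave no room for anything but the four digits.
for candidate in '1234' '123' '12345' '12a4' ' 1234'; do
    echo "[$candidate] $(check "$candidate")"
done
```

Without the "^" and the "$" the input "12345" would be accepted too, because it contains four digits in a row.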

I hope this helps.

bakunin

Last edited by bakunin; 04-12-2009 at 08:25 AM..