Pattern Matching

04-12-2009


Quote:

Originally Posted by Aveltium

Hi, I'm very new to Linux and I'm sorry if this question is too dumb.

There are no dumb questions. Welcome on board. ;-)

Quote:

Originally Posted by Aveltium

If it is ok, please link me to some beginner guides for questions like this one.

We have a special "Tips and Tutorials" board here and if you use the search feature on "book recommendation" you will find a lot of threads.

My personal favourite regarding regular expressions is "sed & awk" by Dale Dougherty, published by O'Reilly. It is well written, with a good dose of humor, and it covers everything there is to know about these two regex-based programs. There is also a specialized book about regular expressions from the same publisher. It is well written too, but I didn't like it as much as the aforementioned book.

Ok, having said this, here is a

Short (very short!) Introduction to Regular Expressions

As soon as you deal with text documents you will, sooner or later, need to search for some content. It is easy to search for fixed strings, but in most cases a fixed string either fails to match everything it is supposed to match or matches things it is not supposed to match. Regexps do not search for strings but for patterns, and the regexp language is about describing these patterns.

Suppose you have a long text and want to find the word "colour".

(We will use a small Unix program called "grep" for the examples. It is given an expression to search for and a file in which it carries out the search. It will return all the lines containing the expression. The calling convention is "grep <expr> <file>".)

Ok, here is your first regular expression:

Code:

grep "colour" /path/to/file

That wasn't too hard, was it? Well, yes, but it isn't too useful either. We are just searching for a fixed string. Anyway, a fixed string is the simplest, most basic form of a regular expression.

Now suppose that the text was written by several people, some of whom speak English and some of whom are American *), and therefore the word is sometimes written "colour" and sometimes "color". Of course we would like to find both versions, and we have to tell the program somehow that the "u" we are looking for is optional. We want to find "color" as well as "colour", but we wouldn't want to find words like "colonel-major", where something other than a "u" is between the "colo-" and the "-r". Here we go:

Code:

grep "colou*r" /path/to/file

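The asterisk ("*") tells the regexp program that the character preceding it may occur any number of times, including not at all. That makes the "u" optional, although strictly speaking the expression would also match a misspelling like "colouur".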

We call this a "metacharacter". Most characters only match themselves: an "a" will match an "a" and nothing else (not even the "A", because regexps are case-sensitive). But some special characters do not match anything directly but change the way other characters are matched. A regular expression is usually a mixture of characters and metacharacters.
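
To see the asterisk at work without needing a file, here is a minimal sketch that pipes a few words into grep. The word list is made up for this illustration and is not part of the original example.

```shell
# Sample words, one per line (made up for this illustration).
words='color
colour
colouur
colonel-major
colr'

# "u*" means: zero or more "u" between "colo" and "r".
matches=$(printf '%s\n' "$words" | grep 'colou*r')
printf '%s\n' "$matches"
```

This should print "color", "colour" and "colouur". "colonel-major" is rejected because an "n" follows "colo", and "colr" is rejected because the second "o" is missing.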

Looking at the output of the last command we see that it also matched lines containing words like "colourful" or "water-color". We might want to match only the word "colour" (however it is spelled) but not any compound words.

We do this by requiring whitespace (blanks and tabs) directly before and after the word, which excludes any other character there. We use a "character set" for this. It says "one of the following characters" (note that I use "<b>" for a blank and "<tab>" for a tab here because they are non-printing characters. Enter literal spaces and tabs instead when you type that in):

Code:

grep "[<b><tab>]colou*r[<b><tab>]" /path/to/file

Any ONE character inside "[...]" is matched, but not several! Therefore "d[ae]n" will match "dan" and "den" but not "dean".
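
Both points can be tried out with a short sketch. The sample data and the variable names (names, lines, tab and so on) are assumptions made for this illustration. The tab is produced with printf, so nothing non-printing has to be typed.

```shell
# Sample data below is made up for this illustration.
names='dan
den
dean
din'

# [ae] stands for exactly one character, either "a" or "e".
set_matches=$(printf '%s\n' "$names" | grep 'd[ae]n')
printf '%s\n' "$set_matches"

# A literal tab, so that the set below contains a blank and a tab.
tab=$(printf '\t')
lines='the color red
water-colour paint
colour'

# Only the first line has whitespace on both sides of the word.
word_matches=$(printf '%s\n' "$lines" | grep "[ ${tab}]colou*r[ ${tab}]")
printf '%s\n' "$word_matches"
```

The first grep should print "dan" and "den", the second only "the color red". Note that the last sample line is not found either: a word at the very beginning or end of a line has no whitespace character next to it, so this simple pattern misses it.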

-*-

Ok, so far. My time is limited today and I can't explain in a few words something others write books about. I hope you got an impression of how regular expressions work, and upon request I might expand this text a little.

____________________
*) sorry - I just can't resist these opportunities ;-))

Quote:

Originally Posted by Aveltium

I want to check if the entered string is a number and has 4 digits.

The regex - without further explanation, but parts of it you will recognize - is:

"^[0-9]\{4\}$"

"^" used this way is the beginning of a line, so the expression will only be found if it starts at the beginning of the line

"$" is analogously the end of the line - together with "^" it makes sure the string contains the 4 digits and nothing else

"[0-9]" is short for "[0123456789]", it is possible to use ranges instead of single characters to form sets

"\{n\}" matches the previous expression (here the bracketed set) exactly n times

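The pieces above can be checked against a few inputs before they go into a script. This is only a sketch: the sample inputs and the variable names are chosen for this illustration, and it uses grep's "-q" option, which prints nothing and reports a match through the exit status.

```shell
# Hypothetical sample inputs, chosen for this illustration.
results=""
for input in 1234 123 12345 12a4 " 1234"; do
    # -q: no output, exit status 0 on a match and 1 otherwise
    if printf '%s\n' "$input" | grep -q '^[0-9]\{4\}$'; then
        verdict="valid"
    else
        verdict="invalid"
    fi
    results="${results}[${input}]=${verdict} "
    printf '%-8s %s\n' "[$input]" "$verdict"
done
```

Only "1234" should come out as valid. Too few digits, too many digits, a letter in between and a leading blank are all rejected because "^" and "$" tie the four digits to both ends of the line.
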
Here is the whole script:

Code:

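# Prompt without a trailing newline; printf is more portable than "echo -n"
printf "Input: " ; read -r x
# grep -c prints the number of matching lines: 1 if the input is exactly 4 digits, else 0
if [ "$(echo "$x" | grep -c '^[0-9]\{4\}$')" -eq 1 ] ; then
     echo "Valid"
else
     echo "Invalid"
fi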
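
As a possible variation (not the script above, just a sketch), the check can be wrapped in a function that relies on grep's exit status instead of counting lines. The function name is_four_digits and the fixed value of x are assumptions made for this example.

```shell
# is_four_digits is a hypothetical helper name, not part of the original script.
is_four_digits() {
    # grep -q prints nothing; its exit status is 0 on a match, 1 otherwise.
    printf '%s\n' "$1" | grep -q '^[0-9]\{4\}$'
}

x="2009"   # stands in for the value read from the user
if is_four_digits "$x"; then
    answer="Valid"
else
    answer="Invalid"
fi
echo "$answer"
```

The function can then be reused wherever the same check is needed, for example in a loop that keeps asking until the input is valid.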

I hope this helps.

bakunin

Last edited by bakunin; 04-12-2009 at 09:25 AM..


04-12-2009


Wow holycow! Thanks a bunch!! I never got such nice answer like that! Very informative and helpful! Thanks again!!

Aveltium

