Full Discussion: Noob trying to improve
Operating Systems OS X (Apple) Noob trying to improve Post 302990471 by bakunin on Thursday 26th of January 2017 04:43:25 PM
Quote:
Originally Posted by Ardzii
Since there are quite a few examples and tutos to use SED on the web I started with that.
I understand now very basic concepts of the tool such as the "-n" and "-i"/"-i.bak" options or the p/s/d commands.
Very good! sed might be "love at third sight", but it is an immensely mighty tool.

Here is a very short introduction to my favourite topic:

Regular Expressions

Regular expressions are basically a way to describe text patterns. Because they describe text by (other) text they consist of two classes of characters: "characters" and "metacharacters". Characters are only stand-ins for themselves:
Code:
bla-foo

is a valid regular expression - one that matches the string "bla-foo" and nothing else. That is not very helpful in itself, but it is already enough to make sed behave like grep. The following two commands do the same thing:

Code:
grep "bla-foo" /some/file
sed -n '/bla-foo/p' /some/file

To describe more than fixed strings there is the second class of characters, the so-called "metacharacters". These are expressions that either modify other characters (or groups thereof - it is possible to group) or match classes of characters. For modifiers we have two:

* - the character before may be repeated zero or more times
\{m,n\} - the character before has to be repeated between m and n times (m and n being integers)

Let us try an example: we look for the word "colour". The regexp for this (regexps are traditionally enclosed in slashes, which are not part of the regexp):

Code:
/colour/

Now suppose several people have written this text, some British, some American, so "colour" is sometimes written "color" and sometimes "colour". The regexp for this is:

Code:
/colou*r/

The asterisk makes the "u" optional (zero or more). The downside is that the hypothetical word "colouuuuur" would also be matched, but more on that later. Whenever you construct regular expressions you need to answer several questions:

1. will it match the lines I want matched?
2. will it fail to match lines I want matched? (false negatives)
3. will it match lines I don't want matched? (false positives)
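
The "colouuuuur" problem can be checked against these three questions with the \{m,n\} modifier from above. This is only a small sketch: the word list is made up for illustration and is not part of the original question.

```shell
# Hypothetical sample words, one per line (made up for this sketch).
words='color
colour
colouuuuur
collar'

# "u*" = zero or more "u": the nonsense word slips through (false positive).
loose=$(printf '%s\n' "$words" | sed -n '/colou*r/p')

# "u\{0,1\}" = zero or one "u": only the two real spellings remain.
strict=$(printf '%s\n' "$words" | sed -n '/colou\{0,1\}r/p')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

Running both variants over the same list is a cheap way to see false positives and false negatives before a regexp goes into a real script.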

Now, suppose we would want to match not only "colour" and "color" but every word starting with a "c" and ending with an "r". For this we use another metacharacter:

. - matches any one character

This is the biggest problem for beginners, btw., especially when they come from DOS-derived systems or have only used "file-globs": "/some/file*" means any file whose name starts with "file", whatever trails it. In regexps "*" has a different meaning: it only repeats the character before it. The "." corresponds to the glob "?", and what comes closest to the glob "*" is the "." in conjunction with "*", which matches strings of unknown composition:

Code:
/c.*r/

This will match: a "c", followed by any amount ("*") of any character(s) ("."), followed by an "r". Will this be a solution for our example?

Sorry: no. Yes it will match "color" and "colour" and "colonel-major" and "conquistador" but - because "any character" also includes a space - it will also produce matches for "chicken hawk breeder" and the like. We would need to limit our wildcard to only non-whitespace.
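
To see the false positive for yourself, here is a minimal sketch. The sample lines are made up; any text with blanks between a "c" and a later "r" would do.

```shell
# Hypothetical sample lines, made up for this sketch.
lines='color
conquistador
chicken hawk breeder
table'

# "." also matches a blank, so the three-word line is printed as well.
matched=$(printf '%s\n' "$lines" | sed -n '/c.*r/p')
printf '%s\n' "$matched"
```

Only "table" is dropped, because it contains no "c" at all.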

For this there is the "character-class" and its inverted counterpart:

[a1x] - will match exactly one occurrence of either "a", "1" or "x"
[^a1x] - negation - will match any character except "a", "1" or "x"

Note that for the "^" to work as negation it has to be the first character inside the brackets: [^^] is "anything other than a caret sign" and [X^] is "either "X" or a caret sign". There are also predefined classes: [[:upper:]] (all upper-case letters), [[:lower:]] (all lower-case letters) and so on. You can also specify ranges: [A-D1-3] matches one upper-case letter from "A" through "D" or one digit from 1 to 3, that is: one of "A", "B", "C", "D", "1", "2" or "3".
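
A quick way to get a feeling for bracket expressions is to feed sed a few short lines. The input below is made up for this sketch. Setting LC_ALL=C is an assumption on my part to keep ranges like A-D independent of the locale's sorting order, which can differ between systems.

```shell
# Assumption: use the C locale so that ranges follow plain ASCII order.
LC_ALL=C
export LC_ALL

# Hypothetical input lines, made up for this sketch.
input='A1
d4
X^
^'

# [A-D1-3]: lines containing one of A, B, C, D, 1, 2 or 3
range=$(printf '%s\n' "$input" | sed -n '/[A-D1-3]/p')

# [[:lower:]]: lines containing at least one lower-case letter
lower=$(printf '%s\n' "$input" | sed -n '/[[:lower:]]/p')

# [^^]: lines containing at least one character that is not a caret
notcaret=$(printf '%s\n' "$input" | sed -n '/[^^]/p')

printf 'range: %s\nlower: %s\nnot only carets:\n%s\n' "$range" "$lower" "$notcaret"
```

Note that [^^] does not mean "a line without a caret": "X^" is printed because the "X" alone satisfies the class.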

With this we stand a better chance of constructing our regexp:

Code:
/c[a-z]*r/

This is coming closer, but now "colonel-major" is not matched any more. It is a matter of definition whether hyphenated words should count as "one word", but suppose they do: you can surely change the regexp yourself now to include hyphenated words, no? Like this:

Code:
/c[a-z-]*r/

Alas, there is another sort of false positive we haven't touched upon yet. How about the word "escrow": would it be matched by our regexp or not? Unfortunately: yes. Its middle part consists of "c" followed by zero or more non-capitalized characters including the hyphen, followed by an "r": "escrow". We would now have to make sure that the "c" is indeed at the beginning of the word and the "r" is at its end. This is in fact quite complicated because naively adding leading and trailing blanks:

Code:
/ c[a-z-]*r /

would help for words in the middle of the line, but fail for words at the line end or the line start. Furthermore, a word might be followed by punctuation: in "colour, intended as an example" the match would fail because we look for trailing blanks only.
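
The shortcomings of the blank-padded regexp show up quickly in a test. A small sketch, with sample lines I made up to cover each case (word in the middle, at the line start, at the line end, followed by a comma, and the "escrow" case):

```shell
# Hypothetical sample lines, made up for this sketch.
lines='the color red
color is first
last comes color
a colour, followed by a comma
put it in escrow now'

# Leading and trailing blank: only a word in the middle of a line qualifies.
matched=$(printf '%s\n' "$lines" | sed -n '/ c[a-z-]*r /p')
printf '%s\n' "$matched"
```

"escrow" is correctly rejected now, but three lines that do contain the word we are after are lost as false negatives.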

But let us make it simple: suppose all our text consists of one word per line. We could use the line start and the line end as a sort-of "anchor" for our regexp then. Fortunately this is possible:

^ - beginning of line
$ - end of line

Notice that "^" has two different functions: inside the brackets it means negate the class and at the beginning of a regexp it means beginning of line. Now we finally can construct our regexp:

Code:
/^c[a-z-]*r$/

This means (you can already read that yourself, but let me prove that I can read it too, for the record): beginning of line, followed by a "c", followed by any number of any non-capitalized characters or hyphens, followed by an "r" and the line end.
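
Here is the anchored regexp at work on a one-word-per-line list. The list is made up from the words used as examples so far, plus a capitalized "Color" and the three-word phrase as counter-examples:

```shell
# Hypothetical word list, one entry per line (made up for this sketch).
words='color
colour
colonel-major
conquistador
escrow
chicken hawk breeder
Color'

# ^ and $ pin the "c" to the line start and the "r" to the line end.
matched=$(printf '%s\n' "$words" | sed -n '/^c[a-z-]*r$/p')
printf '%s\n' "$matched"
```

"escrow" fails at the line start, the phrase fails because of its blanks, and "Color" fails because the regexp asks for a lower-case "c".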

One last thing you need to know: regexps are greedy! That means: if there is more than one possible match, a regexp will always take the longest possible one (non-greedy would be the shortest possible). For instance, consider the following regexp: /a.*b/. Here is some text, the matched part is underlined with carets:

Code:
this is a blatant example of how greedy matches will be for beginners
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If this regexp were used to change the matched text and you only wanted to match the "a b" at the beginning, you would need to use a negated character class, which cannot run past the first "b" (the carets mark what is matched now):

Code:
/a[^b]*b/
this is a blatant example of how greedy matches will be for beginners
        ^^^
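
The difference becomes obvious once the match is actually replaced. A minimal sketch using the s-command; the replacement "X" is just a placeholder chosen for this illustration:

```shell
text='this is a blatant example of how greedy matches will be for beginners'

# Greedy: everything from the first "a" up to the LAST "b" is replaced.
greedy=$(printf '%s\n' "$text" | sed 's/a.*b/X/')

# Negated class: the match has to stop at the FIRST "b" after the "a".
minimal=$(printf '%s\n' "$text" | sed 's/a[^b]*b/X/')

printf '%s\n%s\n' "$greedy" "$minimal"
```

The first command leaves only "this is Xeginners", the second one replaces just the "a b" and keeps the rest of the line.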

I hope this helps.

bakunin