Noob trying to improve


 
# 29  
Old 01-26-2017
[^"] is a character that is not a quote
[^"]* is any consecutive number of non-quote characters
\( \) does not mean a character but is a group mark, for later reference
Code:
s/.*href="\([^"]*\).*/\1/p

\1 is the reference. It becomes the string that matched within the \( \). The leading and trailing .* ensure that the entire line is matched, i.e. is deleted+substituted by the back-reference.
\1 actually refers to the 1st \( \); \2 would refer to the 2nd, and so on.
The -n sed option suppresses the default print. The /p at the end of the substitution prints the line if there was a match, so non-matching lines are not printed.
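
A quick way to try the command above. This is only a sketch: the first sample line is the anchor tag discussed in this thread, the second line is made up to show that non-matching lines stay silent.

```shell
# Sample input: one line with a link, one without (the second is illustrative).
input='<a href="/listing/magnet/ge/ramp-shim/2322185"> view more </a>
<p>no link on this line</p>'

# -n turns off the automatic print; the p flag prints only substituted lines.
url=$(printf '%s\n' "$input" | sed -n 's/.*href="\([^"]*\).*/\1/p')
printf '%s\n' "$url"
# prints: /listing/magnet/ge/ramp-shim/2322185
```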

Last edited by MadeInGermany; 01-26-2017 at 12:13 PM..
# 30  
Old 01-26-2017
OK! Thanks MadeInGermany!
This changes the deal quite a bit! But it gives me a better view of the substitution being made!

I got:
Code:
substitution command / text that is going to be substituted / substitution / print

Now what I'm not sure I grasp is how it manages to stop at the " character. Is that thanks to the [^"]* ("is any consecutive number of non-quote characters") thingy? Does the deal go like: start at href=" and go up to the next quote character?

Also, why are there the .* .... .* in the structure?
Code:
s/.*href="\([^"]*\).*/\1/p
# 31  
Old 01-26-2017
Exactly. The first character that matches in the trailing .* is a quote.
As I said, the leading and trailing .* are needed to "match away" the entire line. Otherwise only the matching portion would be substituted.
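
To see what "only the matching portion would be substituted" means, here is a small comparison. It is a sketch using the anchor tag from this thread as input; the variable names are just for illustration.

```shell
line='<a href="/listing/magnet/ge/ramp-shim/2322185"> view more </a>'

# No leading/trailing .*: only the text from href=" up to the closing quote
# is replaced, so the rest of the line survives.
partial=$(printf '%s\n' "$line" | sed -n 's/href="\([^"]*\)/\1/p')
printf '%s\n' "$partial"
# prints: <a /listing/magnet/ge/ramp-shim/2322185"> view more </a>

# With both .*: the whole line is matched and replaced by \1.
whole=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\).*/\1/p')
printf '%s\n' "$whole"
# prints: /listing/magnet/ge/ramp-shim/2322185
```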

---------- Post updated at 12:15 ---------- Previous update was at 11:44 ----------

Now to your second requirement. Can give a headache even for experienced guys.
In your example the ' is a problem for the shell, in which you call
Code:
sed -n '...'

There is no problem if you save the sed code in a separate file and run it with
Code:
sed -n -f sed-script result2.txt

And the contents of the sed-script
Code:
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p

You can add another match in a second line
Code:
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

but it won't match if the first match was successful and the input line was substituted.
It is necessary to save and restore the line.
Code:
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
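
A sketch of the difference the h and g commands make. The input line is invented for illustration (the real result2.txt is not shown in this thread), and the temporary files stand in for the sed-script file.

```shell
# Hypothetical input line carrying both properties.
line="<span itemprop='brand'>Acme</span><span itemprop='name'>Widget</span>"

with_save=$(mktemp)
without_save=$(mktemp)

# Script with save (h) and restore (g) around the first substitution.
cat > "$with_save" <<'EOF'
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
EOF

# Same script without h and g: the second address sees the already
# substituted line ("Acme") and no longer matches.
cat > "$without_save" <<'EOF'
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
EOF

both=$(printf '%s\n' "$line" | sed -n -f "$with_save")
only_first=$(printf '%s\n' "$line" | sed -n -f "$without_save")
rm -f "$with_save" "$without_save"

printf '%s\n' "$both"        # prints Acme, then Widget
printf '%s\n' "$only_first"  # prints Acme only
```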

Another aspect is greediness. The * wants to match as much as it can. A leftmost * is most greedy.
That means /.*'branch'/ matches the rightmost 'branch'.
--
Last but not least, the shell method to print a ' within a ' ' string goes like this
Code:
 echo 'left'\''right'

Actually it is a concatenation of 'left' and 'right' with a \' in between.
For an embedded sed script it is enough to remember to exchange each literal ' by '\''.
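
The following sketch shows what sed actually receives in each case. printf is used here only to make the shell's quote handling visible, and the sample line is made up.

```shell
# Naive quoting: each inner ' ends or reopens the shell string, so the
# quotes vanish before sed ever sees the script.
naive=$(printf '%s\n' '/itemprop='brand'/p')
printf '%s\n' "$naive"
# prints: /itemprop=brand/p

# With '\'' in place of each literal quote, the quotes survive.
fixed=$(printf '%s\n' '/itemprop='\''brand'\''/p')
printf '%s\n' "$fixed"
# prints: /itemprop='brand'/p

# The same trick applied to the embedded sed script (illustrative input).
line="<span itemprop='brand'>Acme</span>"
brand=$(printf '%s\n' "$line" |
    sed -n '/itemprop='\''brand'\''/ s/.*'\''brand'\''>\([^<]*\).*/\1/p')
printf '%s\n' "$brand"
# prints: Acme
```

The first case is why a naively quoted command prints nothing: sed looks for itemprop=brand without quotes, which never occurs in the input.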

Last edited by MadeInGermany; 01-26-2017 at 01:25 PM..
# 32  
Old 01-26-2017
Quote:
Originally Posted by Ardzii
Since there are quite a few examples and tutos to use SED on the web I started with that.
I understand now very basic concepts of the tool such as the "-n" and "-i"/"-i.bak" options or the p/s/d commands.
Very good! sed might be "love at third sight", but it is an immensely mighty tool.

Here is a very short introduction to my favourite topic:

Regular Expressions

Regular expressions are basically a way to describe text patterns. Because they describe text by (other) text they consist of two classes of characters: "characters" and "metacharacters". Characters are only stand-ins for themselves:
Code:
bla-foo

is a valid regular expression - one that matches the string "bla-foo" and nothing else. Now that is not very helpful in itself, but we could use this to use sed as a grep-equivalent. The following commands will do the same:

Code:
grep "bla-foo" /some/file
sed -n '/bla-foo/p' /some/file
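
A small check that the two commands really select the same lines. It is a sketch that feeds made-up text through a pipe instead of reading /some/file.

```shell
# Three illustrative input lines; only the middle one contains "bla-foo".
text='foo
bla-foo bar
baz'

with_grep=$(printf '%s\n' "$text" | grep 'bla-foo')
with_sed=$(printf '%s\n' "$text" | sed -n '/bla-foo/p')

printf '%s\n' "$with_grep"   # prints: bla-foo bar
printf '%s\n' "$with_sed"    # prints: bla-foo bar
```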

Therefore there is another class of characters, so-called "metacharacters". These are expressions that either modify other characters (or groups thereof - it is possible to group) or match classes of characters. For modifiers we have two:

* - the character before may be repeated zero or more times
\{m,n\} - the character before has to be repeated between m and n times (m and n being integers)

Let us try an example: we look for the word "colour". The regexp for this (regexps are traditionally enclosed in slashes, which are not part of the regexp):

Code:
/colour/

Now suppose several people have written this text, some British, some American, so "colour" is sometimes written "color" and sometimes "colour". The regexp for this is:

Code:
/colou*r/

The asterisk makes the "u" optional (zero or more). The downside is that the hypothetical word "colouuuuur" would also be matched, but more on that later. Whenever you construct regular expressions you need to answer several questions:

1. will it match the lines i want matched?
2. will it not match lines i want to be matched too? (false negatives)
3. will it match lines i don't want to be matched? (false positives)
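
To check the colour/color pattern against those three questions, something like the following can help. It is a sketch with an invented word list; the second command uses the \{m,n\} modifier from the list above to rule out the "colouuur" false positive.

```shell
# Illustrative word list, one word per line.
words='color
colour
colouuur
collar'

# u* allows any number of u's, so "colouuur" slips through.
star=$(printf '%s\n' "$words" | sed -n '/colou*r/p')
printf '%s\n' "$star"
# prints: color, colour, colouuur (one per line)

# u\{0,1\} allows zero or one u only.
interval=$(printf '%s\n' "$words" | sed -n '/colou\{0,1\}r/p')
printf '%s\n' "$interval"
# prints: color, colour (one per line)
```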

Now, suppose we would want to match not only "colour" and "color" but every word starting with a "c" and ending with an "r". For this we use another metacharacter:

. - matches any one character

This is the biggest problem for beginners, btw., especially when they come from DOS-derived systems or have only used "file-globs": "/some/file*" means any file whose name starts with "file", whatever trails it. In regexps "*" has a different meaning: it only repeats what stands before it. What comes closest to the glob "?" is ".", and you can use it in conjunction with "*" to match strings of unknown composition, the way a glob "*" would:

Code:
/c.*r/

This will match: a "c", followed by any amount ("*") of any character(s) ("."), followed by an "r". Will this be a solution for our example?

Sorry: no. Yes it will match "color" and "colour" and "colonel-major" and "conquistador" but - because "any character" also includes a space - it will also produce matches for "chicken hawk breeder" and the like. We would need to limit our wildcard to only non-whitespace.

For this there is the "character-class" and its inverted counterpart:

[a1x] - will match exactly one occurrence of either "a", "1" or "x"
[^a1x] - negation - will match any character except "a", "1" or "x"

Note that for the "^" to be used as negation it has to be the first character inside the brackets. [^^] is "anything other than a caret sign" and [X^] is "either "X" or a caret sign". There are also predefined classes: [[:upper:]] (all capitalized characters) and [[:lower:]] (all non-capitalized characters) and so on. You can also specify sequences: [A-D1-3] (all capitalized characters from "A" through "D" or a number from 1 to 3, that is: one of "A", "B", "C", "D", "1", "2" or "3").
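
A short sketch of these bracket expressions in action, using a few invented input lines:

```shell
# Illustrative input lines.
lines='Abc
abc
^caret
X-ray'

# Lines containing at least one capitalized character.
has_upper=$(printf '%s\n' "$lines" | sed -n '/[[:upper:]]/p')
printf '%s\n' "$has_upper"
# prints: Abc, X-ray (one per line)

# Lines containing an "X" or a literal caret (the ^ is not first, so no negation).
x_or_caret=$(printf '%s\n' "$lines" | sed -n '/[X^]/p')
printf '%s\n' "$x_or_caret"
# prints: ^caret, X-ray (one per line)
```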

With this we stand better chance of constructing our regexp:

Code:
/c[a-z]*r/

This is coming closer but now "colonel-major" is not matched any more. It is a matter of definition whether hyphenated words should count as "one word", but suppose they do: you can surely change the regexp yourself now to include hyphenated words, no? Like this:

Code:
/c[a-z-]*r/

Alas, there is another sort of false positive we haven't touched upon yet. How about the word "escrow": would it be matched by our regexp or not? Unfortunately: yes. Its middle part consists of "c" followed by zero or more non-capitalized characters including the hyphen, followed by an "r": "escrow". We would now have to make sure that the "c" is indeed at the beginning of the word and the "r" is at its end. This is in fact quite complicated because naively adding leading and trailing blanks:

Code:
/ c[a-z-]*r /

Would help for words in the middle of the line, but fail for words at the line end or the line start. Furthermore it might happen that a word is followed by punctuation like here where "colour, intended as an example" would fail because we look for trailing blanks only.

But let us make it simple: suppose all our text consists of one word per line. We could use the line start and the line end as a sort-of "anchor" for our regexp then. Fortunately this is possible:

^ - beginning of line
$ - end of line

Notice that "^" has two different functions: inside the brackets it means negate the class and at the beginning of a regexp it means beginning of line. Now we finally can construct our regexp:

Code:
/^c[a-z-]*r$/

This means (you can already read that yourself, but let me prove that i can read it too, for the record): beginning of line, followed by a "c", followed by any number of any non-capitalized characters or hyphens, followed by a "r" and the line end.
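
Here is a sketch that runs the first attempt and the final regexp over the example words from this post, one word (or phrase) per line, so the false positives become visible:

```shell
# The example words used in the text above.
words='color
colour
colonel-major
conquistador
escrow
chicken hawk breeder'

# First attempt: matches every line, including the two false positives.
loose=$(printf '%s\n' "$words" | sed -n '/c.*r/p')
printf '%s\n' "$loose"

# Anchored version with the restricted character class.
strict=$(printf '%s\n' "$words" | sed -n '/^c[a-z-]*r$/p')
printf '%s\n' "$strict"
# prints: color, colour, colonel-major, conquistador (one per line)
```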

One last thing you need to know: regexps are greedy! That means: if there is more than one possible match, regexps will always take the longest possible match (non-greedy would be the shortest possible). For instance, consider the following regexp: /a.*b/. Here is some text; the match runs from the first "a" to the last "b", that is from "a blatant" all the way to the "b" of "beginners":

Code:
this is a blatant example of how greedy matches will be for beginners

If this regexp were used to change the matched text and you only wanted to match the "a b" at the beginning, you would need to use a negated character class (now the match is just "a b": the "a", the blank and the first letter of "blatant"):

Code:
/a[^b]*b/
this is a blatant example of how greedy matches will be for beginners
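
Since the highlighting does not survive in plain text, a substitution can make the matched part visible instead. This is a sketch: the square brackets are just markers, and the "&" in the replacement stands for whatever the regexp matched.

```shell
text='this is a blatant example of how greedy matches will be for beginners'

# Greedy: .* runs to the last "b" in the line.
greedy=$(printf '%s\n' "$text" | sed 's/a.*b/[&]/')
printf '%s\n' "$greedy"
# prints: this is [a blatant example of how greedy matches will be for b]eginners

# Negated class: [^b]* cannot run past the first "b".
short=$(printf '%s\n' "$text" | sed 's/a[^b]*b/[&]/')
printf '%s\n' "$short"
# prints: this is [a b]latant example of how greedy matches will be for beginners
```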

I hope this helps.

bakunin
# 33  
Old 01-30-2017
My god Bakunin, you're a master in sed!
Thank you so much for taking the time to write these lines!

OK then, let me try on your sed to see if I understood:

Code:
sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p'

I'll stay with my understanding of the first part of the command. You're actually not passing any command yet to sed. So what you're looking for in:
Code:
'/href.*view more/

is the line that matches "href [any kind of character in between]view more" or to put it another way:
Code:
SED, find me the line that has "href" + some string and "view more" in it.

you get that line:
Code:
<a href="/listing/magnet/ge/ramp-shim/2322185"> view more </a>

Now comes the good part:
Code:
s/.*href="\([^"]*\).*/\1/p'

Within that line, substitute: "[any kind of character before] href=" [following string omitting the possible " characters within the string] by [this same string without " characters that you just found] and print.

But how come the "> view more </a>" portion of the line was left out of the sed output? From what I understand you're including .* which should still include all the characters at the end of the line, shouldn't it?

Thanks as usual!

Best!

EDIT-----

I just tried:
Code:
sed -n '/href.*view more/ s/.*href="\(.*\)/\1/p'

and it gave me:
Code:
/listing/magnet/ge/ramp-shim/2322185"> view more </a>

So I guess that what's happening with your code is that when you tell sed to exclude the " it simply stops at it and does not go on with the rest of the line.

---------- Post updated at 06:08 PM ---------- Previous update was at 04:43 PM ----------

Hey MadeinGermany!

Bakunin's explanation helped me a lot to get through your answer but I still have a few questions:

Quote:
In your example the ' is a problem for the shell, in which you call
Code:
sed -n '...'

There is no problem if you save the sed code in a separate file and run it with
Code:
sed -n -f sed-script result2.txt

Why is it a problem for the shell?
When I paste the previous "PHP" (I assume it's PHP) code into a txt (examplesed.txt) for testing the command:
Code:
sed -n '/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p' examplesed.txt

it yields nothing. There's no output whatsoever...

Quote:
Code:
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p

You can add another match in a second line
Code:
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

but it won't match if the first match was successful and the input line was substituted.
It is necessary to save and restore the line.
Code:
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

On that one, I'm not sure to follow either... My objective is to integrate these commands within a loop. So I will have the first iteration and it'll write the output to a file, then the second command (with 'name' for instance) > echo to a file and go to the third iteration etc...
Wouldn't that work under that setting?

Quote:
Another aspect is greediness. The * wants to match as much as it can. A leftmost * is most greedy.
That means /.*'branch'/ matches the rightmost 'branch'.
I guess that in my case, my problem child would be 'name' that has a first appearance at the beginning of the line.
But in that case couldn't I use 2 right before the 'p' (print command).
I learnt on the web that putting a 1 or a 2 before the p would yield the first or second appearance of the term I'm looking for... wouldn't that work?

All the best!

Last edited by Ardzii; 01-30-2017 at 11:54 AM.. Reason: More discoveries!!
# 34  
Old 01-30-2017
The /2 option does not work if the .* has already matched too much. For example
Code:
echo "name something name something" | sed -n 's/.*name/XXXX/p'
XXXX something
echo "name something name something" | sed -n 's/.*name/XXXX/2p'

There is no 2nd match.
But it does work without the .*
Code:
echo "name something name something" | sed -n 's/name/XXXX/p'
XXXX something name something
echo "name something name something" | sed -n 's/name/XXXX/2p'
name something XXXX something

This User Gave Thanks to MadeInGermany For This Post:
# 35  
Old 01-30-2017
Quote:
Originally Posted by Ardzii
My god Bakunin, you're a master in sed!
The force flows strong in me, LOL!

Quote:
Originally Posted by Ardzii
OK then, let me try on your sed to see if I understood:
Actually you came very close. What you didn't get was the part i left out in my little introduction, so here is part two:

Grouping
To combine several characters or metacharacters into a single expression which you can handle together there is grouping: it works like grouping in mathematical expressions:

Code:
(x+y+z) * 3 =

The * 3 affects all that is inside the brackets as a single entity. The same works for regular expressions, just that the brackets are "escaped" (you put a backslash in front of them, otherwise they would be simple characters) and you can do really cool things with it:

Code:
/\(aa\)*/

Because the asterisk now addresses what is inside the brackets this matches any even number of a's (zero, two, four, ...), but not an odd number. Try the following file:

Code:
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

and apply this sed-command to it. Watch the output:

Code:
sed -n '/x\(aa\)*y/p' /your/file
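
For comparison, here is the same experiment as a self-contained sketch. It pipes the seven lines in directly instead of reading /your/file; only the lines with an even number of a's between "x" and "y" come out.

```shell
# The seven test lines from above, fed through a pipe.
even=$(printf '%s\n' xy xay xaay xaaay xaaaay xaaaaay xaaaaaay |
    sed -n '/x\(aa\)*y/p')
printf '%s\n' "$even"
# prints: xy, xaay, xaaaay, xaaaaaay (one per line)
```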

Always keep in mind, btw.: i told you regexps are greedy in nature ("greedy" is really the term for it. The opposite is "non-greedy [matching]". More often than not, if regexps do not do what you expect them to do, this is the problem: they are matching more than you expect them to match.) This means e.g. that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, i warned you!

Grouping also has another use: you can use it for so-called backreferences. Backreferences are parts of the matched line which you can use in a substitution command to put the matched part back into the substituted portion.

The most basic backreference is the &, but let us first examine the "s"-command of sed:

Code:
sed 's/<regexp1>/<regexp2>/'

This will scan the text (line by line) and try to match <regexp1>. Whenever it does, it substitutes <regexp2> for it, then the line is shipped to output. (Despite the name i gave it, <regexp2> is not really a pattern but plain replacement text, in which "&" and the numbered backreferences explained below have a special meaning.)

"&" now can be used in <regexp2> to put there everything <regexp1> has matched. Let's try something very simple: the regexp to match everything in a line is /^.*$/. We want to output all the input but put => and <= around every line. Here it is:

Code:
sed 's/^.*$/=> & <=/' /some/file

Cool, no?
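
A sketch of what the "&" command above produces, using two made-up input lines instead of /some/file:

```shell
# & is replaced by the whole matched line.
wrapped=$(printf '%s\n' 'one' 'two' | sed 's/^.*$/=> & <=/')
printf '%s\n' "$wrapped"
# prints:
# => one <=
# => two <=
```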

Another form of backreference is "\n" where "n" is a number: 1, 2, 3, ... It will signify the portion of the <regexp1>, which is surrounded by the first (second, third, ...) pair of brackets. Suppose the input file from above with the "xa*y"-lines. Suppose we would want to exchange the first and last characters (and suppose they weren't fixed "x"s and "y"s). Here it is:

Code:
sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/' /path/to/file

We use the grouping here only to fill our various backreferences: first, we split the input into three parts: ^\(.\) (beginning of line, followed by a single character), \(a*\) (any number of a's) and \(.\)$ (again a single character, followed by the line end). In the substitution part we put them together reversed, first the third part, then the second one (the a's), then the former first part.
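
The swap can be tried like this. It is a sketch with three sample lines; "baad" is added to show that the outer characters need not be "x" and "y".

```shell
# \1 = first character, \2 = the run of a's, \3 = last character.
swapped=$(printf '%s\n' 'xaay' 'xy' 'baad' |
    sed 's/^\(.\)\(a*\)\(.\)$/\3\2\1/')
printf '%s\n' "$swapped"
# prints: yaax, yx, daab (one per line)
```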

Most of the original sed-script should be clear by now, but we need to establish a few more things for the last bits:

When you write a substitute-command like above it is implied that it should be applied to every line. In fact, sed works like this:

- read the first/next line of input and put it into the so-called "pattern space"
- apply the first command of the script to this pattern space, it might change it (or not)
- apply the next command of the script to the changed pattern space, changing it further (or not)
- and so on, until the last command. If sed was started without the "-n" option print the pattern space now to output
- if this was not the last line of input, go to the start again and repeat
- if it was the last line, exit.
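
The cycle described in the steps above can be observed with a few tiny commands. This is a sketch; the input is made up, and "p" is the plain print command.

```shell
# Without -n: p prints the pattern space, then the end-of-cycle print
# outputs it again, so every line appears twice.
twice=$(printf '%s\n' 'a' 'b' | sed 'p')
printf '%s\n' "$twice"      # prints: a, a, b, b (one per line)

# With -n: only the explicit p produces output.
once=$(printf '%s\n' 'a' 'b' | sed -n 'p')
printf '%s\n' "$once"       # prints: a, b (one per line)

# Each command works on what the previous command left in the pattern space.
chained=$(printf '%s\n' 'white' | sed -e 's/white/blue/' -e 's/blue/green/')
printf '%s\n' "$chained"    # prints: green
```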

Ranges
Coming back to the substitute-commands: in their simplest form they are applied to every line. Here is some input file:

Code:
old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5

The following will change all the "old" strings to "NEW":

Code:
sed 's/old/NEW/' /path/to/file
NEW
= NEW1
== Start ==
= NEW2
NEW3
== End ==
NEW4
= NEW5

But we could limit this command to only take place on lines starting with a "=":

Code:
sed '/^=/ s/old/NEW/' /path/to/file
old
= NEW1
== Start ==
= NEW2
old3
== End ==
old4
= NEW5

The first regexp /^=/ works like an "if"-statement: if the line (or something in it) matches the expression, then the substitute-command is applied, otherwise not.

There is also another form, where you can define a range of lines where the following command(s) are applied:

Code:
sed '/^== Start.*$/,/^== End.*$/ s/old/NEW/' /path/to/file
old
= old1
== Start ==
= NEW2
NEW3
== End ==
old4
= old5

Instead of regexps you can also use line numbers. This will apply the substitute-command only on lines 1,2 and 3:

Code:
sed '1,3 s/old/NEW/' /path/to/file
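
For completeness, a sketch of what the line-number form does with the sample input from above (held in a shell variable here instead of /path/to/file):

```shell
# The sample input file from above.
sample='old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5'

# Only lines 1 to 3 are candidates; line 3 contains no "old", so it stays.
first_three=$(printf '%s\n' "$sample" | sed '1,3 s/old/NEW/')
printf '%s\n' "$first_three"
# prints:
# NEW
# = NEW1
# == Start ==
# = old2
# old3
# == End ==
# old4
# = old5
```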


Was that all? No! One last thing: modifiers. Per default a substitute-command only changes the FIRST occurrence of a pattern:

Code:
$ echo "old old old" | sed 's/old/NEW/'
NEW old old

If you add a number at the end, it selects which matching instance will be changed. If you add a "g" (global), all occurrences will be changed:

Code:
$ echo "old old old" | sed 's/old/NEW/'
NEW old old

$ echo "old old old" | sed 's/old/NEW/2'
old NEW old

$ echo "old old old" | sed 's/old/NEW/g'
NEW NEW NEW

Finally, there is one more modifier: "p". It prints the result of the substitution to the output. So far we have only had scripts consisting of only one command so that hasn't affected us but look above how sed works: what a command gets is basically what the command before has produced:

Code:
echo "white white white" | sed 's/white/blue/g
                                s/blue/green/g
                                s/green/red/g'
red red red

The second command would do nothing if it got the input text without the first command having already processed it, and the same goes for the third command. But suppose you want to have the intermediary steps displayed: you can use the p-modifier for that (note that for the last line the "p" is implied):

Code:
echo "white white white" | sed 's/white/blue/gp
                                s/blue/green/gp
                                s/green/red/g'
blue blue blue
green green green
red red red

The p-modifier comes in especially handy when you switch off the automatically implied printing at the end with the "-n" switch for sed: this way you do not need to filter out lines you do not want, you just print explicitly the ones you are interested in - a technique we used to filter out all the uninteresting lines in your text.

OK, was that all? No, not even close! sed is such a mighty tool i still am finding new ways to use it every day.

But - hey, in for a penny, in for a pound - here is a last one: you can use the ranges i talked about above and apply more than one command to them by using curly braces:

Code:
sed '/<regex1>/,/<regex2>/ {
                 s/<regex3>/<regex4>/
                 s/<regex5>/<regex6>/
                 s/<regex7>/<regex8>/
             }' /path/to/file

Now, the three substitutions will only be applied to a range of lines starting with <regex1> and ending with <regex2>. You can also negate/invert that:

Code:
sed '/<regex1>/,/<regex2>/ ! {
                 s/<regex3>/<regex4>/
                 s/<regex5>/<regex6>/
                 s/<regex7>/<regex8>/
             }' /path/to/file

Apply the three substitutions to all lines except for a range of lines starting ..... Of course the same goes for the other forms of range specifications i showed you above.
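
A concrete sketch of both forms, using the Start/End sample input from earlier. The two substitutions (change "old" to "NEW", then prefix the line with "> ") are chosen only to make the affected lines easy to spot.

```shell
# The sample input file from the Ranges section.
sample='old
= old1
== Start ==
= old2
old3
== End ==
old4
= old5'

# Both substitutions are applied only inside the Start..End range.
inside=$(printf '%s\n' "$sample" | sed '/^== Start/,/^== End/ {
    s/old/NEW/
    s/^/> /
}')
printf '%s\n' "$inside"

# Negated: both substitutions are applied only outside the range.
outside=$(printf '%s\n' "$sample" | sed '/^== Start/,/^== End/ !{
    s/old/NEW/
    s/^/> /
}')
printf '%s\n' "$outside"
```

In the first output the four lines from "== Start ==" to "== End ==" carry the "> " prefix; in the second it is the other four lines.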

I hope this helps.

bakunin