Noob trying to improve

01-16-2017

grep stands for 'g/re/p', a command from the ed editor, on which sed is based [where g is Global, re is Regular Expression and p is Print]
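As a small illustration of that naming, the same "regexp, then print" idea can be written in sed with -n and the p command. The three sample lines are made up for the demonstration.

```shell
# Hypothetical sample input, three lines.
lines='apple
banana
cherry'

# grep prints every line that matches the regular expression.
with_grep=$(printf '%s\n' "$lines" | grep 'an')

# sed -n '/re/p' does the same: -n turns off the default print,
# /an/ selects the matching lines and p prints them.
with_sed=$(printf '%s\n' "$lines" | sed -n '/an/p')

echo "grep: $with_grep"
echo "sed:  $with_sed"
```

Both commands should select the same line here.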

vgersh99

01-26-2017

[^"] is a character that is not a quote
[^"]* is any consecutive number of non-quote characters
\( \) does not mean a character but is a group mark, for later reference

Code:

s/.*href="\([^"]*\).*/\1/p

\1 is the reference. It becomes the string that matched within the \( \). The leading and trailing .* ensure that the entire line is matched, i.e. the whole line is deleted and replaced by the back-reference.
\1 actually refers to the 1st \( \); \2 would refer to the 2nd, and so on.
The -n sed option suppresses the default print. The p flag at the end of the substitution prints the line if there was a match, so non-matching lines are not printed.
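To see the pieces working together, here is a minimal sketch. The two input lines and the URL are invented for the example; only the sed command comes from the post.

```shell
# Hypothetical input: one line with a link, one without.
html='<a href="http://example.com/page">link</a>
<p>no link here</p>'

# -n suppresses the default print; the p flag prints only the lines
# on which the substitution actually happened.
url=$(printf '%s\n' "$html" | sed -n 's/.*href="\([^"]*\).*/\1/p')

echo "$url"
```

The second line never matches href=", so with -n it produces no output at all.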

Last edited by MadeInGermany; 01-26-2017 at 12:13 PM..

MadeInGermany

01-26-2017

OK! Thanks MadeInGermany!
This changes the deal quite a bit! But it gives me a better view of the substitution being made!

I got:

Code:

substitution command / text that is going to be substituted / substitution / print

Now what I'm not sure I grasp is how it manages to stop at the " character. Is that thanks to the [^"]* ("any consecutive number of non-quote characters") thingy? Does the deal go like: start at href=" and carry on up to the next quote character?

Also, why are there .* ... .* in the structure?
s/.*href="\([^"]*\).*/\1/p

Ardzii

01-26-2017

Exactly. The first character that matches in the trailing .* is a quote.
As I said, the leading and trailing .* are needed to "match away" the entire line. Otherwise only the matching portion would be substituted.
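A quick way to see the difference is to run the substitution with and without the surrounding .* on the same line. The sample line is an assumption made up for this sketch.

```shell
# Hypothetical input line with some text before and after the URL.
line='<a class="x" href="http://example.com/page">link</a>'

# With leading and trailing .* the whole line is "matched away".
with_wildcards=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\).*/\1/p')

# Without them only the matching portion (href=" plus the URL) is replaced;
# everything else on the line stays where it was.
without_wildcards=$(printf '%s\n' "$line" | sed -n 's/href="\([^"]*\)/\1/p')

echo "$with_wildcards"
echo "$without_wildcards"
```

The first result is the bare URL; the second still carries the rest of the tag around it.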

---------- Post updated at 12:15 ---------- Previous update was at 11:44 ----------

Now to your second requirement. It can give a headache even to experienced people.
In your example the ' is a problem for the shell, in which you call

Code:

sed -n '...'

There is no problem if you save the sed code in a separate file and run it with

Code:

sed -n -f sed-script result2.txt

And the contents of the sed-script

Code:

/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p

You can add another match in a second line

Code:

/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

but it won't match if the first match was successful and the input line was substituted.
It is necessary to save and restore the line.

Code:

h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
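The save-and-restore script can be tried end to end like this. It is only a sketch: the temporary file name is an arbitrary choice, and the input is the single h1 line from the example file.

```shell
# The h1 line from the example file, fed in directly for the test.
line="<h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>"

# Write the sed script to a temporary file (the path is an assumption).
# The quoted 'EOF' keeps the shell from touching the quotes and backslashes.
script="${TMPDIR:-/tmp}/sed-script.$$"
cat > "$script" <<'EOF'
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
EOF

# h copies the line to the hold space, g copies it back before the 2nd match.
out=$(printf '%s\n' "$line" | sed -n -f "$script")
rm -f "$script"

echo "$out"
```

This should print the brand and then the name, each on its own line. The name comes from the last itemprop='name' on the line because the leading .* takes as much as it can.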

Another aspect is greediness. The * wants to match as much as it can, and the leftmost * is the greediest.
That means /.*'branch'/ matches up to the rightmost 'branch'.
--
Last but not least, the shell method to print a ' within a ' ' string goes like this

Code:

 echo 'left'\''right'

Actually it is a concatenation of 'left' and 'right' with a \' in between.
For an embedded sed script it is enough to remember to replace each literal ' with '\''.
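Applied to an embedded sed script, the rule looks roughly like this. The shortened input line is made up for the sketch; the second command shows what goes wrong without the '\'' replacement.

```shell
# Hypothetical, shortened input line containing single quotes.
line="<h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span></h1>"

# Each literal ' inside the '...' script is written as '\''.
brand=$(printf '%s\n' "$line" | sed -n 's/.*itemprop='\''brand'\''>\([^<]*\).*/\1/p')

# Without that, the shell removes the inner quotes, so sed searches for
# itemprop=brand> (no quotes), which does not occur in the line.
broken=$(printf '%s\n' "$line" | sed -n 's/.*itemprop='brand'>\([^<]*\).*/\1/p')

echo "quoted correctly: [$brand]"
echo "quotes eaten:     [$broken]"
```

The first command extracts the brand; the second prints nothing because its pattern never matches.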

Last edited by MadeInGermany; 01-26-2017 at 01:25 PM..

MadeInGermany

01-30-2017

The numeric flag 2 (substitute only the 2nd match) does not work if the .* has already matched too much. For example

Code:

echo "name something name something" | sed -n 's/.*name/XXXX/p'
XXXX something
echo "name something name something" | sed -n 's/.*name/XXXX/2p'

There is no 2nd match.
But it does work without the .*

Code:

echo "name something name something" | sed -n 's/name/XXXX/p'
XXXX something name something
echo "name something name something" | sed -n 's/name/XXXX/2p'
name something XXXX something

MadeInGermany

01-31-2017

Hey Bakunin!

Thanks for the followup on your tuto! Again, I know it takes a lot of your time to write everything down so thank you very very much for that!

I tried out almost all of your explanations (except for the last multicommand part)!

The portion on sed greediness:

Quote:

Always keep in mind, btw.: i told you regexps are greedy in nature ("greedy" is really the term for it. The opposite is "non-greedy [matching]". More often than not if regexps do not do what you expect them to do this is the problem - they are matching more than you expect them to match.) This means i.e. that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, i warned you!

As far as I understand: The more I detail what I'm looking for to the command, the more I will be able to extract what I really want.
As you said: I tried with the \(aa\)* alone on your text and indeed I got more things than I really wished for:

Code:

sed -n '/\(aa\)*/p' sedgroupingtest.txt 
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

However, I'm not sure to understand why both xy and xay were matched as well? From what I understood, \(aa\) looks for at least 2 "a"s in each line doesn't it?
I also found a way to get what I was looking for ie. each line that has at least 2 "a"s:

Code:

ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt 
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

The "selection" portion was particularly interesting:

Quote:

/^== Start.*$/,/^== End.*$/

If I read it correctly and with my sed knowledge now :P it goes:

Code:

the portion of text that is located in between the lines that start with "== Start + anything else to the end of the line ($)" and "== End + anything else to the end of the line ($)"
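That reading can be checked on a small made-up text. The section markers and the lines around them are hypothetical; note that the two boundary lines are part of the selected range as well.

```shell
# Hypothetical document with one marked section.
doc='intro
== Start section
inside 1
inside 2
== End section
outro'

# addr1,addr2 selects from the first line matching addr1 through the
# next line matching addr2, both included; p prints that range.
selected=$(printf '%s\n' "$doc" | sed -n '/^== Start.*$/,/^== End.*$/p')

echo "$selected"
```

Only "intro" and "outro" are left out of the output.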

Now, why doesn't my command work?
I've got a text file (which I personally called "examplesed.txt") which contains:

PHP Code:


<div id="category_listing" itemscope itemtype="http://data-vocabulary.org/Product">
        
        <div id="category_bg">
        <div class="title">
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
            <meta itemprop="category" content="Business &amp; Industrial>Medical Medical Equipment" />
        <!-- end div title -->
                <div class="listing_num">LISTING #2229540</div>
           </div> 
        <div style='border-bottom: dotted 1px #666' class="clr"></div>
        <div id="category_listing_body">
            
<div id="list_detail">

Now it seems that sed doesn't find for some reason the line I'm looking for:

Code:

> sed -n '/^<h1 itemprop='name'>For Sale.*$/p' examplesed.txt
>

so obviously when I try to do:

Code:

sed -n '/^<h1 itemprop='name'>For Sale.*$/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p' examplesed.txt

The same happens: ie. NOTHING Hahahaha!

Why doesn't sed find this line correctly?
I thought that maybe the command was treating the tabs that exist before the "<h1 itemprop='name'>For Sale" as a bunch of spaces and therefore I tried:

Code:

sed -n '/.*<h1 itemprop='name'>For Sale.*/p' examplesed.txt

But still nothing...

Thanks for your much appreciated help yall!

Best!

ardzii

Last edited by Ardzii; 01-31-2017 at 09:11 AM.. Reason: copy-paste error :)

Ardzii

01-31-2017

Quote:

Originally Posted by Ardzii

The portion on sed greediness:
As far as I understand: The more I detail what I'm looking for to the command, the more I will be able to extract what I really want.

Yes - and no. Yes, the better you define what you want, the better the results you will get. No, this has nothing to do with greediness. Greediness is the fact that if there are several possible matches for a certain regexp, always the LONGEST POSSIBLE one will be used.

In a regexp like /xa*y/ the a* will match all a's there are, regardless of how many there are. This is sometimes a desired effect and sometimes not. Here is an example for when it is not desired. Consider this text:

Code:

<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>

The task is to remove all the tags and just leave the text. The end result is like this:

Code:

bla foo more text
happy text just to fill in

Let's see: a "tag" is basically a "<", followed by text, followed by ">". Hold on, there is an optional "/" after the opening "<" for the ending tag, but that is it, yes? OK, this regexp will match that (the slash ("/") has to be escaped here, so that it is not confused with the "/" delimiting the regexp):

Code:

/<\/*.*>/

OK? Now let us try a simple sed-command. We will - for testing purposes - not delete the tags but overwrite them with "BLOB" to make sure we got everything right:

Code:

sed 's/<\/*.*>/BLOB/g' /path/to/file

That really worked well, didn't it? ;-)

Question: why were both lines changed to a single "BLOB"? Answer: because of the greediness of regexps! What is the longest possible match for <\/*.*> in the first line?

The "<" matches the "<" at the beginning of the line.
The "\/*" matches nothing, but it is optional, so that doesn't matter.
The ".*" matches everything up to the penultimate character of the line. This is the longest possible match, and it is the problem.
And the ">" matches - again, longest possible - the last ">" in the line, which happens to be at the line's end.

Solution? Instead of ".", which matches everything, match only non-">" characters with a negated character-class:

Code:

sed 's/<\/*[^>]*>/BLOB/g' /path/to/file

Now the character class "[^>]" (everything except ">") cannot extend past the first ">" it encounters, and therefore the longest possible match ends at the first ">", not the last one.
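For comparison, here are both variants run on the two sample lines, plus a third command that actually deletes the tags. The third command is an addition for this sketch; it drops the \/* because [^>]* already covers the slash.

```shell
# The two sample lines from the example text.
text='<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>'

# Greedy .* swallows everything from the first "<" to the last ">".
greedy=$(printf '%s\n' "$text" | sed 's/<\/*.*>/BLOB/g')

# [^>]* stops in front of the first ">", so each tag is matched separately.
bounded=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>/BLOB/g')

# Replacing with nothing removes the tags and leaves only the text.
stripped=$(printf '%s\n' "$text" | sed 's/<[^>]*>//g')

printf '%s\n---\n%s\n---\n%s\n' "$greedy" "$bounded" "$stripped"
```

The greedy version collapses each line to a single BLOB, the bounded version marks every tag, and the last command gives the wanted end result.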

Quote:

Originally Posted by Ardzii

However, I'm not sure to understand why both xy and xay were matched as well? From what I understood, \(aa\) looks for at least 2 "a"s in each line doesn't it?

No. As i said at the beginning "*" means "zero or more of what is before". Before that are two a's, hence the string "aa". This string, zero times, is? ;-))

In fact, the regexp would match absolutely everything, because it effectively matches the empty string.
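A small check of that claim, on made-up input that includes a line without any "a" and even an empty line:

```shell
# Hypothetical input: four lines, the last one empty.
# /\(aa\)*/ can match the empty string, so every line is selected.
matched=$(printf 'xy\nxay\nhello\n\n' | sed -n '/\(aa\)*/p' | wc -l | tr -d ' ')

echo "lines printed: $matched"
```

All four input lines come out again, so the pattern filters nothing.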

If you want to match at least one instance of something, you write it two times and make one optional:

Code:

/x\(aa\)*y/            # any even number of a's, including 0
/xaa\(aa\)*y/          # any even number of a's, starting with 2
/xaa*y/                # any number of a's but at least one
/xa*y/                 # any number of a's, even none at all
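The four patterns can be compared side by side on a short test file. The five sample lines below are an assumption, modelled on the test file used earlier in the thread.

```shell
# Hypothetical sample: x, then 0 to 4 a's, then y.
sample='xy
xay
xaay
xaaay
xaaaay'

even=$(printf '%s\n' "$sample" | sed -n '/x\(aa\)*y/p')          # 0, 2, 4 a's
even_min2=$(printf '%s\n' "$sample" | sed -n '/xaa\(aa\)*y/p')   # 2, 4 a's
at_least_one=$(printf '%s\n' "$sample" | sed -n '/xaa*y/p')      # 1 or more a's
any=$(printf '%s\n' "$sample" | sed -n '/xa*y/p')                # 0 or more a's

printf 'even:\n%s\n\neven, 2 or more:\n%s\n\n1 or more:\n%s\n\nany:\n%s\n' \
    "$even" "$even_min2" "$at_least_one" "$any"
```

Having the x and the y around the group is what makes the count matter: without them the pattern is free to match anywhere, including nowhere.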

Quote:

Originally Posted by Ardzii

I also found a way to get what I was looking for ie. each line that has at least 2 "a"s:

Code:

ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt

Yes, but the reason why this worked is not what you probably believe it to be: you search for 2 a's in a row (grouped, but you could leave out the grouping here, it serves no purpose), followed by any number ("*") of any character ("."). You could have left out the .* and gotten the same result.
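To convince yourself, compare the original command with the stripped-down pattern on the same input. The four sample lines are again made up for the sketch.

```shell
# Hypothetical sample lines with 0 to 3 a's.
sample='xy
xay
xaay
xaaay'

# Original: grouped "aa" followed by "any number of any character".
with_group=$(printf '%s\n' "$sample" | sed -n '/\(aa\).*/p')

# Reduced: just two a's in a row. The group and the .* add nothing here.
plain=$(printf '%s\n' "$sample" | sed -n '/aa/p')

echo "$with_group"
echo "---"
echo "$plain"
```

Both commands select exactly the lines that contain "aa" somewhere.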

I hope this helps.

bakunin

PS: if you are discouraged now and think "i'll never get that damn thing into my head" - don't be! It took all of us weeks and months to bend our brains hard enough to finally get them around thinking in sed-terms. That you don't get it in days is, in fact, expected. Just keep trying and you will soon be able to finish my little tutorial for the next newbie for me.

Last edited by bakunin; 01-31-2017 at 05:37 PM..

bakunin
