Noob trying to improve


 
# 1  
Old 01-16-2017
grep is named after the ed command 'g/re/p' [where g is Global, re is Regular Expression and p is Print]; sed inherited its /re/p syntax from ed.
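
To see the connection in practice, here is a small sketch: with -n and the p command, sed prints only the lines that match the regular expression, which is what grep does. The sample lines are made up for illustration.

```shell
# Made-up sample input (an assumption, not taken from the thread).
sample=$(printf '%s\n' 'apple pie' 'banana split' 'apple tart')

# sed: -n suppresses the default print, /re/p prints matching lines only.
with_sed=$(printf '%s\n' "$sample" | sed -n '/apple/p')

# grep: prints the lines that match the same regular expression.
with_grep=$(printf '%s\n' "$sample" | grep 'apple')

printf '%s\n' "$with_sed"
```

Both commands should select the same two lines here.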
These 2 Users Gave Thanks to vgersh99 For This Post:
# 2  
Old 01-26-2017
[^"] is a character that is not a quote
[^"]* is any consecutive number of non-quote characters
\( \) does not mean a character but is a group mark, for later reference
Code:
s/.*href="\([^"]*\).*/\1/p

\1 is the reference. It becomes the string that matched within the \( \). The leading and trailing .* ensure that the entire line is matched, i.e. the whole line is deleted and replaced by the back-reference.
\1 actually refers to the 1st \( \); \2 would refer to the 2nd...
The -n sed option suppresses the default print. The p at the end of the substitution prints the line if there was a match. So non-matching lines are not printed.
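
A minimal sketch of the command in action, assuming a made-up input (the URL and the second line are placeholders, not from the original page):

```shell
# Two hypothetical input lines: one with an href, one without.
url=$(printf '%s\n' '<a href="http://example.com/page">link</a>' 'no link here' |
  sed -n 's/.*href="\([^"]*\).*/\1/p')

# Only the first line matches, so only its captured group is printed.
printf '%s\n' "$url"
```

The line without href=" is dropped because -n suppresses the default print and the p flag fires only after a successful substitution.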

Last edited by MadeInGermany; 01-26-2017 at 12:13 PM..
This User Gave Thanks to MadeInGermany For This Post:
# 3  
Old 01-26-2017
OK! Thanks MadeInGermany!
This changes the deal quite a bit! But it gives me a better view of the substitution being made!

I got:
Code:
substitution command / text that is going to be substituted / substitution / print

Now what I'm not sure I grasp is how it manages to stop at the " character. Is that thanks to the [^"]* ("is any consecutive number of non-quote characters") thingy? Does the deal go like: start at href=" and go up to the next quote character?

Also, why are there .* .... .* in the structure s/.*href="\([^"]*\).*/\1/p ?
# 4  
Old 01-26-2017
Exactly. The first character that matches in the trailing .* is a quote.
As I said, the leading and trailing .* are needed to "match away" the entire line. Otherwise only the matching portion would be substituted.
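
To make the difference visible, here is a small sketch that runs the substitution with and without the surrounding .* on the same made-up line (the URL is a placeholder):

```shell
# Hypothetical input line.
line='<a href="http://example.com/page">link</a>'

# With leading and trailing .* the whole line is replaced by the group.
with_anchors=$(printf '%s\n' "$line" | sed -n 's/.*href="\([^"]*\).*/\1/p')

# Without them only the matched portion is replaced; the rest stays.
without_anchors=$(printf '%s\n' "$line" | sed -n 's/href="\([^"]*\)/\1/p')

printf '%s\n' "$with_anchors" "$without_anchors"
```

In the second case the text before href=" and the text from the closing quote onwards survive, because the pattern never covered them.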

---------- Post updated at 12:15 ---------- Previous update was at 11:44 ----------

Now to your second requirement. It can give a headache even to experienced users.
In your example the ' is a problem for the shell in which you call
Code:
sed -n '...'

There is no problem if you save the sed code in a separate file and run it with
Code:
sed -n -f sed-script result2.txt

And the contents of the sed-script
Code:
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p

You can add another match in a second line
Code:
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p

but it won't match if the first match was successful and the input line was substituted.
It is necessary to save and restore the line.
Code:
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
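
Here is a sketch of the whole procedure, assuming a temporary script file and a simplified one-line input (both are stand-ins for sed-script and result2.txt):

```shell
# Write the four-line sed script to a temporary file (hypothetical name).
script=$(mktemp "${TMPDIR:-/tmp}/sed-script.XXXXXX")
cat > "$script" <<'EOF'
h
/itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p
g
/itemprop='name'/ s/.*'name'>\([^<]*\).*/\1/p
EOF

# Simplified sample line, modelled on the page in question.
line="<h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI</span> <span itemprop='name'>AIRIS</span></h1>"

# h saves the line, the first s prints the brand, g restores the
# original line, and the second s prints the name.
fields=$(printf '%s\n' "$line" | sed -n -f "$script")
rm -f "$script"

printf '%s\n' "$fields"
```

Because the script lives in a file (written through a quoted here-document), the single quotes need no shell escaping at all.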

Another aspect is greediness. The * wants to match as much as it can. A leftmost * is most greedy.
That means /.*'brand'/ matches up to the rightmost 'brand'.
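
A short sketch of that effect, using a made-up line that contains the quoted word twice:

```shell
# The leading .* swallows everything up to the LAST 'brand'.
rightmost=$(printf '%s\n' "one 'brand' two 'brand' three" |
  sed "s/.*'brand'/X/")

printf '%s\n' "$rightmost"
```

Only the text after the second occurrence is left, which shows that the first occurrence was consumed by the .* rather than matched by the literal part.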
--
Last but not least, the shell method to print a ' within a ' ' string goes like this
Code:
 echo 'left'\''right'

Actually it is a concatenation of 'left' and 'right' with a \' in between.
For an embedded sed script it is enough to remember to exchange each literal ' by '\''.
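
As a sketch, this is what the exchange looks like when the brand script from above is embedded in a single-quoted sed call (the input line is a simplified stand-in):

```shell
# The plain echo case: prints left'right
quoted=$(echo 'left'\''right')

# The same trick inside a sed script: every literal ' became '\''.
brand=$(printf '%s\n' "<span itemprop='brand'>HITACHI</span>" |
  sed -n '/itemprop='\''brand'\''/ s/.*'\''brand'\''>\([^<]*\).*/\1/p')

printf '%s\n' "$quoted" "$brand"
```

After the shell has removed its quoting, sed receives exactly the script /itemprop='brand'/ s/.*'brand'>\([^<]*\).*/\1/p.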

Last edited by MadeInGermany; 01-26-2017 at 01:25 PM..
This User Gave Thanks to MadeInGermany For This Post:
# 5  
Old 01-30-2017
The /2 option does not work if the .* has already matched too much. For example
Code:
echo "name something name something" | sed -n 's/.*name/XXXX/p'
XXXX something
echo "name something name something" | sed -n 's/.*name/XXXX/2p'

There is no 2nd match.
But it does work without the .*
Code:
echo "name something name something" | sed -n 's/name/XXXX/p'
XXXX something name something
echo "name something name something" | sed -n 's/name/XXXX/2p'
name something XXXX something

This User Gave Thanks to MadeInGermany For This Post:
# 6  
Old 01-31-2017
Hey Bakunin!

Thanks for the followup on your tuto! Again, I know it takes a lot of your time to write everything down so thank you very very much for that!

I tried out almost all of your explanations (except for the last multicommand part)!

The portion on sed greediness:
Quote:
Always keep in mind, btw.: I told you regexps are greedy in nature ("greedy" is really the term for it. The opposite is "non-greedy [matching]". More often than not, if regexps do not do what you expect them to do, this is the problem: they are matching more than you expect them to match.) This means, e.g., that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, I warned you!
As far as I understand: The more I detail what I'm looking for to the command, the more I will be able to extract what I really want.
As you said: I tried \(aa\)* alone on your text and indeed I got more than I really wished for:
Code:
sed -n '/\(aa\)*/p' sedgroupingtest.txt 
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

However, I'm not sure I understand why both xy and xay were matched as well. From what I understood, \(aa\) looks for at least 2 "a"s in each line, doesn't it?
I also found a way to get what I was looking for ie. each line that has at least 2 "a"s:
Code:
ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt 
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

The "selection" portion was particularly interesting:
Quote:
/^== Start.*$/,/^== End.*$/
If I read it correctly and with my sed knowledge now :P it goes:
Code:
the portion of text that is located in between the lines that start with "== Start + anything else to the end of the line ($)" and "== End + anything else to the end of the line ($)"
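
A small sketch of such a range address, assuming made-up input lines. Note that the range includes the "== Start" and "== End" lines themselves:

```shell
# Hypothetical input with one block delimited by Start/End marker lines.
selected=$(printf '%s\n' 'before' '== Start block' 'inside 1' 'inside 2' \
    '== End block' 'after' |
  sed -n '/^== Start.*$/,/^== End.*$/p')

printf '%s\n' "$selected"
```

The lines before the start marker and after the end marker are not printed.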

Now, why doesn't my command work?
I've got a text file (which I called "examplesed.txt") that contains:
PHP Code:
<div id="category_listing" itemscope itemtype="http://data-vocabulary.org/Product">

        <div id="category_bg">
        <div class="title">
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
            <meta itemprop="category" content="Business &amp; Industrial>Medical Medical Equipment" />
        <!-- end div title -->
                <div class="listing_num">LISTING #2229540</div>

</div
        <div style='border-bottom: dotted 1px #666' class="clr"></div>
        <div id="category_listing_body">

<div id="list_detail"
Now it seems that sed doesn't find for some reason the line I'm looking for:
Code:
> sed -n '/^<h1 itemprop='name'>For Sale.*$/p' examplesed.txt
>

so obviously when I try to do:
Code:
sed -n '/^<h1 itemprop='name'>For Sale.*$/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p' examplesed.txt

The same happens: ie. NOTHING Hahahaha!

Why doesn't sed find this line correctly?
I thought that maybe the command was considering the tabs that exist before the "<h1 itemprop='name'>For Sale" as a bunch of spaces and therefore I tried:
Code:
sed -n '/.*<h1 itemprop='name'>For Sale.*/p' examplesed.txt

But still nothing...

Thanks for your much appreciated help yall!

Best!

ardzii

Last edited by Ardzii; 01-31-2017 at 09:11 AM.. Reason: copy-paste error :)
# 7  
Old 01-31-2017
Quote:
Originally Posted by Ardzii
The portion on sed greediness:
As far as I understand: The more I detail what I'm looking for to the command, the more I will be able to extract what I really want.
Yes - and no. Yes, the better you define what you want, the better results you will get. No, this has nothing to do with greediness. Greediness is the fact that if there are several possible matches for a certain regexp, always the LONGEST POSSIBLE one will be used.

In a regexp like /xa*y/ the a* will match all a's there are, regardless of how many there are. This is sometimes a desired effect and sometimes not. Here is an example for when it is not desired. Consider this text:

Code:
<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>

The task is to remove all the tags and just leave the text. The end result is like this:

Code:
bla foo more text
happy text just to fill in

Let's see: a "tag" is basically: a "<", followed by text, followed by ">". Hold on, there is an optional "/" after the opening "<" for the ending tag, but that is it, yes? Ok, this regexp will match that (the slash ("/") has to be escaped here, so that it is not confused with the "/" delimiting the regexp):

Code:
/<\/*.*>/

OK? Now let us try a simple sed-command. We will - for testing purposes - not delete the tags but overwrite them with "BLOB" to make sure we got everything right:

Code:
sed 's/<\/*.*>/BLOB/g' /path/to/file

That didn't really work well, did it? ;-)

Question: why were both lines changed to a single "BLOB"? Answer: because of the greediness of regexps! What is the longest possible match for <\/*.*> in the first line?

The "<" matches the "<" at the beginning of the line.
The "\/*" matches nothing, but it is optional, so that doesn't matter.
The ".*" matches everything, until the penultimate character of the line. This is the longest possible match and the problem.
And the ">" matches - again, longest possible - the last ">" in the line, which happens to be at the line's end.

Solution? Instead of ".", which matches everything, match only non-">" characters with a negated character-class:

Code:
sed 's/<\/*[^>]*>/BLOB/g' /path/to/file

Now the character class "[^>]" (everything except ">") cannot extend past the first ">", and therefore the longest possible match ends at the first ">", not the last one.
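
Here is a sketch that runs both variants on the two sample lines from above, plus a third call with an empty replacement that actually removes the tags (the variable names are only for illustration):

```shell
# The two sample lines from the example text.
text=$(printf '%s\n' \
  '<tag>bla foo</tag> <othertag>more text</othertag>' \
  '<newtag>happy text</newtag> <moretag>just to fill in</moretag>')

# Greedy version: .* runs to the last ">" on each line.
greedy=$(printf '%s\n' "$text" | sed 's/<\/*.*>/BLOB/g')

# Negated character class: each tag is matched on its own.
replaced=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>/BLOB/g')

# Empty replacement: the tags are deleted and only the text remains.
stripped=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>//g')

printf '%s\n' "$greedy" "$replaced" "$stripped"
```

The greedy call collapses each line into one BLOB, while the other two touch only the individual tags.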

Quote:
Originally Posted by Ardzii
However, I'm not sure I understand why both xy and xay were matched as well. From what I understood, \(aa\) looks for at least 2 "a"s in each line, doesn't it?
No. As I said at the beginning, "*" means "zero or more of what is before". Before that are two a's, hence the string "aa". This string, zero times, is? ;-))

In fact, the regexp would match absolutely everything, because it effectively matches the empty string.

If you want to match at least one instance of something, you write it two times and make one optional:

Code:
/x\(aa\)*y/            # any even number of a's, including 0
/xaa\(aa\)*y/          # any even number of a's, starting with 2
/xaa*y/                # any number of a's but at least one
/xa*y/                 # any number of a's, even none at all
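
These patterns can be tried out with a sketch like the following, assuming a shortened version of the test file (five lines instead of seven) fed in directly; paste only joins the matching lines for display:

```shell
# Shortened stand-in for sedgroupingtest.txt.
sample=$(printf '%s\n' xy xay xaay xaaay xaaaay)

# Run one sed address on the sample and join the hits with spaces.
matches() { printf '%s\n' "$sample" | sed -n "$1" | paste -s -d ' ' -; }

even_or_none=$(matches '/x\(aa\)*y/p')     # even number of a's, including 0
even_min_two=$(matches '/xaa\(aa\)*y/p')   # even number of a's, at least 2
one_or_more=$(matches '/xaa*y/p')          # at least one a
any_number=$(matches '/xa*y/p')            # any number of a's, even none
unanchored=$(matches '/\(aa\)*/p')         # matches the empty string too

printf '%s\n' "$even_or_none" "$even_min_two" "$one_or_more" \
  "$any_number" "$unanchored"
```

The last pattern selects every line, because without the surrounding x and y it is satisfied by zero repetitions anywhere.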

Quote:
Originally Posted by Ardzii
I also found a way to get what I was looking for ie. each line that has at least 2 "a"s:
Code:
ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt

Yes, but the reason why this worked is not what you probably believe it to be: you search for 2 a's in a row (grouped, but you could leave out the grouping here, it serves no purpose), followed by any number ("*") of any character ("."). You could have left out the .* and get the same.
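
As a quick check, this sketch compares the grouped pattern with a plain /aa/ on a shortened stand-in for the test file:

```shell
# Shortened stand-in for sedgroupingtest.txt.
sample=$(printf '%s\n' xy xay xaay xaaay xaaaay)

# Grouping plus trailing .* as in the original command.
with_group=$(printf '%s\n' "$sample" | sed -n '/\(aa\).*/p')

# The same selection without grouping and without .*
plain=$(printf '%s\n' "$sample" | sed -n '/aa/p')

printf '%s\n' "$plain"
```

Both addresses select exactly the lines that contain two a's in a row.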

I hope this helps.

bakunin

PS: if you are discouraged now and think "I'll never get that damn thing into my head" - don't be! It took all of us weeks and months to bend our brains hard enough to finally get them around thinking in sed-terms. That you don't get it in days is, in fact, expected. Just keep trying and you will soon be able to finish my little tutorial for the next newbie for me.

Last edited by bakunin; 01-31-2017 at 05:37 PM..
This User Gave Thanks to bakunin For This Post: