Noob trying to improve

01-31-2017

Hey Bakunin!

Thanks for the follow-up to your tutorial! Again, I know it takes a lot of your time to write everything down, so thank you very, very much for that!

I tried out almost all of your explanations (except for the last multicommand part)!

The portion on sed greediness:

Quote:

Always keep in mind, btw.: I told you regexps are greedy in nature ("greedy" is really the term for it; the opposite is "non-greedy [matching]"). More often than not, if regexps do not do what you expect them to do, this is the problem: they are matching more than you expect them to match. This means, e.g., that /\(aa\)*/ on its own would also match a line with 3 a's - it would match the 2 a's and just ignore the third one -> false positives, I warned you!

As far as I understand: the more precisely I describe what I'm looking for, the more accurately the command will extract what I really want.
As you said, I tried \(aa\)* alone on your text and indeed I got more than I really wished for:

Code:

sed -n '/\(aa\)*/p' sedgroupingtest.txt 
xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

However, I'm not sure I understand why both xy and xay were matched as well. From what I understood, \(aa\) looks for at least 2 "a"s in each line, doesn't it?
I also found a way to get what I was looking for, i.e. each line that has at least 2 "a"s:

Code:

ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt 
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay

The "selection" portion was particularly interesting:

Quote:

/^== Start.*$/,/^== End.*$/

If I read it correctly and with my sed knowledge now :P it goes:

Code:

the portion of text that is located in between the lines that start with "== Start + anything else to the end of the line ($)" and "== End + anything else to the end of the line ($)"
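A small sketch to check that reading, assuming a made-up sample text (the marker lines and the payload between them are hypothetical, only the two addresses come from the tutorial). One detail worth noticing: the "== Start" and "== End" lines themselves belong to the selected range, not just the text in between.

```shell
# Hypothetical sample text; only the "== Start" / "== End" marker lines matter
# to the two range addresses.
sample='header line
== Start of block
first payload line
second payload line
== End of block
trailer line'

# Print every line from the first address match through the second address match.
selected=$(printf '%s\n' "$sample" | sed -n '/^== Start.*$/,/^== End.*$/p')
printf '%s\n' "$selected"
```

The header and trailer lines are dropped, while both marker lines are printed along with the payload.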

Now, why doesn't my command work?
I've got a text file (which I called "examplesed.txt") that contains:

PHP Code:


<div id="category_listing" itemscope itemtype="http://data-vocabulary.org/Product">
        
        <div id="category_bg">
        <div class="title">
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
            <meta itemprop="category" content="Business &amp; Industrial>Medical Medical Equipment" />
        <!-- end div title -->
                <div class="listing_num">LISTING #2229540</div>
           </div> 
        <div style='border-bottom: dotted 1px #666' class="clr"></div>
        <div id="category_listing_body">
            
<div id="list_detail">

Now it seems that sed doesn't find for some reason the line I'm looking for:

Code:

> sed -n '/^<h1 itemprop='name'>For Sale.*$/p' examplesed.txt
>

so obviously when I try to do:

Code:

sed -n '/^<h1 itemprop='name'>For Sale.*$/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p' examplesed.txt

The same happens: ie. NOTHING Hahahaha!

Why doesn't sed find this line correctly?
I thought that maybe the command was treating the tabs that exist before the "<h1 itemprop='name'>For Sale" as a bunch of spaces, and therefore I tried:

Code:

sed -n '/.*<h1 itemprop='name'>For Sale.*/p' examplesed.txt

But still nothing...

Thanks for your much appreciated help yall!

Best!

ardzii

Last edited by Ardzii; 01-31-2017 at 09:11 AM.. Reason: copy-paste error :)

01-31-2017

Your first pattern doesn't match for two reasons, of which you found and (roughly) eliminated one (congrats!): as the pattern is anchored at the beginning of the line with ^, you need to allow for the leading white space in front of the <h1 sub-pattern. While your .* matches any characters (and would allow matches further along the line as well), an exact match with a character class like [[:blank:]]* should be preferred, as it matches spaces and <TAB>s only.
The other reason your pattern fails is quoting. The first parameter to sed, the script, is enclosed in single quotes, so the quotes around 'name' end and reopen the quoted string, effectively removing those quotes from what sed receives. Either allow for one wildcard character on each side (".name.") if you're sure no other lines will match, or use double quotes around the script including the pattern (with possible side effects, since the shell then expands $ and backticks inside it). Like:

Code:

sed -n "/^[[:blank:]]*<h1 itemprop='name'>For Sale.*$/p" file
            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span> <span itemprop='name'>AIRIS 1  Magnet</span></h1></div>
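To see the quoting effect without sed in the way, the script string can simply be printed. This is only a sketch: the variable names are made up, and the sample line is a shortened copy of the one in examplesed.txt.

```shell
# What the shell hands to sed when the script is wrapped in single quotes:
# the inner quotes close and reopen the string, so they disappear.
single=$(printf '%s\n' '/^<h1 itemprop='name'>For Sale.*$/p')

# Wrapped in double quotes, the inner single quotes survive.
# (\$ keeps the dollar sign literal inside double quotes.)
double=$(printf '%s\n' "/^<h1 itemprop='name'>For Sale.*\$/p")

printf 'single-quoted script: %s\n' "$single"
printf 'double-quoted script: %s\n' "$double"

# Shortened copy of the line from examplesed.txt, leading blanks included.
line="            <h1 itemprop='name'>For Sale <span itemprop='brand'>HITACHI </span>"

# Address and substitution together, both inside one double-quoted script.
brand=$(printf '%s\n' "$line" |
    sed -n "/^[[:blank:]]*<h1 itemprop='name'>For Sale/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p")
printf 'brand: [%s]\n' "$brand"
```

The first printed script shows itemprop=name without quotes, which is why the single-quoted version never matched the file. The extracted brand keeps the trailing blank that sits in front of </span>.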

RudiC

01-31-2017

Great!! Thanks RudiC!!

Ardzii

01-31-2017

Quote:

Originally Posted by Ardzii

The portion on sed greediness:
As far as I understand: The more I detail what I'm looking for to the command, the more I will be able to extract what I really want.

Yes - and no. Yes, the better you define what you want, the better the results you will get. No, this has nothing to do with greediness. Greediness is the fact that if there are several possible matches for a certain regexp, always the LONGEST POSSIBLE one will be used.

In a regexp like /xa*y/ the a* will match all the a's there are, regardless of how many. This is sometimes a desired effect and sometimes not. Here is an example of when it is not desired. Consider this text:

Code:

<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>

The task is to remove all the tags and just leave the text. The end result should look like this:

Code:

bla foo more text
happy text just to fill in

Let's see: a "tag" is basically: a "<", followed by text, followed by ">". Hold on, there is an optional "/" after the opening "<" for the ending tag, but that is it, yes? OK, this regexp will match that (the slash ("/") has to be escaped here, so that it is not confused with the "/" delimiting the regexp):

Code:

/<\/*.*>/

OK? Now let us try a simple sed-command. We will - for testing purposes - not delete the tags but overwrite them with "BLOB" to make sure we got everything right:

Code:

sed 's/<\/*.*>/BLOB/g' /path/to/file

That worked really well, didn't it? ;-)

Question: why was each of the two lines changed to a single "BLOB"? Answer: because of the greediness of regexps! What is the longest possible match for <\/*.*> in the first line?

The "<" matches the "<" at the beginning of the line.
The "\/*" matches nothing, but it is optional, so that doesn't matter.
The ".*" matches everything up to and including the penultimate character of the line. This is the longest possible match, and the problem.
And the ">" matches - again, longest possible - the last ">" in the line, which happens to be at the line's end.

Solution? Instead of ".", which matches everything, match only non-">" characters with a negated character-class:

Code:

sed 's/<\/*[^>]*>/BLOB/g' /path/to/file

Now the character class "[^>]" (everything except ">") cannot cover the first ">" it encounters, and therefore the longest possible match ends at the first ">", not the last one.
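Both variants can be run side by side on the two sample lines. This is a sketch that feeds the text through a pipe instead of a file; the variable names are only there for illustration.

```shell
# The two sample lines from above.
text='<tag>bla foo</tag> <othertag>more text</othertag>
<newtag>happy text</newtag> <moretag>just to fill in</moretag>'

# Greedy ".*": one match per line, running from the first "<" to the last ">".
greedy=$(printf '%s\n' "$text" | sed 's/<\/*.*>/BLOB/g')

# Negated class "[^>]*": each match stops at the first ">" it meets.
bounded=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>/BLOB/g')

# Same pattern with an empty replacement, i.e. the actual task.
stripped=$(printf '%s\n' "$text" | sed 's/<\/*[^>]*>//g')

printf 'greedy:\n%s\n\n' "$greedy"
printf 'bounded:\n%s\n\n' "$bounded"
printf 'stripped:\n%s\n' "$stripped"
```

The greedy version leaves nothing but one BLOB per line, the bounded version puts a BLOB where each tag used to be, and the stripped version yields the two plain-text lines the task asked for.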

Quote:

Originally Posted by Ardzii

However, I'm not sure to understand why both xy and xay were matched as well? From what I understood, \(aa\) looks for at least 2 "a"s in each line doesn't it?

No. As I said at the beginning, "*" means "zero or more of what is before". Before it is the group of two a's, hence the string "aa". This string, zero times, is? ;-))

In fact, the regexp would match absolutely every line, because it also matches the empty string.

If you want to match at least one instance of something, you write it twice and make the second copy optional with "*":

Code:

/x\(aa\)*y/            # any even number of a's, including 0
/xaa\(aa\)*y/          # any even number of a's, starting with 2
/xaa*y/                # any number of a's but at least one
/xa*y/                 # any number of a's, even none at all
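The difference between these patterns shows up when they are run against the seven test lines from sedgroupingtest.txt. This is a sketch with the lines fed in through a pipe and made-up variable names:

```shell
# The seven test lines (0 to 6 a's between x and y).
lines='xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay'

# Even number of a's, zero included.
even_or_none=$(printf '%s\n' "$lines" | sed -n '/x\(aa\)*y/p')
# Even number of a's, at least two.
even_min_two=$(printf '%s\n' "$lines" | sed -n '/xaa\(aa\)*y/p')
# Any number of a's, at least one.
at_least_one=$(printf '%s\n' "$lines" | sed -n '/xaa*y/p')
# No x/y around the group: the empty match succeeds on every line.
unanchored=$(printf '%s\n' "$lines" | sed -n '/\(aa\)*/p')

printf 'x(aa)*y:\n%s\n\n'   "$even_or_none"
printf 'xaa(aa)*y:\n%s\n\n' "$even_min_two"
printf 'xaa*y:\n%s\n\n'     "$at_least_one"
printf '(aa)* alone:\n%s\n' "$unanchored"
```

The surrounding x and y are what force the group to account for every a in between. Without them the pattern is satisfied by zero repetitions anywhere in the line.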

Quote:

Originally Posted by Ardzii

I also found a way to get what I was looking for ie. each line that has at least 2 "a"s:

Code:

ardzii@debian:~$ sed -n '/\(aa\).*/p' sedgroupingtest.txt

Yes, but the reason why this worked is not what you probably believe it to be: you search for 2 a's in a row (grouped, but you could leave out the grouping here, it serves no purpose), followed by any number ("*") of any character ("."). You could have left out the .* and gotten the same result.
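A quick check of that claim, again as a sketch with the test lines piped in rather than read from the file:

```shell
# The seven test lines (0 to 6 a's between x and y).
lines='xy
xay
xaay
xaaay
xaaaay
xaaaaay
xaaaaaay'

# The pattern as posted: grouped "aa" followed by ".*".
with_tail=$(printf '%s\n' "$lines" | sed -n '/\(aa\).*/p')
# The bare pattern: two a's in a row anywhere in the line.
plain=$(printf '%s\n' "$lines" | sed -n '/aa/p')

printf '%s\n' "$plain"
if [ "$with_tail" = "$plain" ]; then
    echo "both patterns select the same lines"
fi
```

Neither the grouping nor the trailing .* changes which lines are selected. They would only matter in a substitution, where the extent of the match decides what gets replaced.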

I hope this helps.

bakunin

PS: if you are discouraged now and think "I'll never get that damn thing into my head" - don't be! It took all of us weeks and months to bend our brains hard enough to finally start thinking in sed terms. That you don't get it in days is, in fact, expected. Just keep trying and you will soon be able to finish my little tutorial for the next newbie for me.

Last edited by bakunin; 01-31-2017 at 05:37 PM..

02-08-2017

Thank you all! Noob Improved!! :P

Hey guys!

I wanted to let you know that I was able to finish my script. It gives less information than I really need, but I'm amazed at what I was able to do all by myself.

Let's be realistic though: I couldn't have done it without you!

I learned so much this is crazy... from 0 to *not the best but something at least*!

So I guess: THANK YOU for your patience, your support and all the time you invested in showing me the way!!!

Obviously you can use the script. All you have to do is get a proper link. Here is an example you can use:
https://www.dotmed.com/equipment/2/26/2974/all
with no more than 116 as the total number of listings (figure as of 8.2.17) and, as it says in the script, you will need to create a dir called "DotMedListings" in:
~/

I guess that you will find it sort of messy but it works for now and it's a good basis!

I'm definitely open to your comments and suggestions, as you can imagine! (For instance, my progress display is not very user-friendly.)

here goes the script:

Code:

#!/bin/bash
#
#
#
# For this script to work, you will first need to create a DotMedListings dir in your /home/XXXX/ directory.
#
#
#

declare link=""                #Will store the link for each iteration
declare linkStart=""            #Defines the type of equipment to crawl. To be found in Find Listings For Sale or Wanted On DOTmed.com
declare brand=""            #output
declare price=""            #output
declare currency=""            #output
declare condition=""            #output
declare dateListing=""            #output
declare country=""            #output
declare title=""            #output
declare description=""            #output
declare equipment=""            #output
declare yom=""                #output
declare -i totalCrawl=1            #Variable to define the scope of the crawl (total number of listing to crawl)
declare fileNameBase=""            #Used for the name of the output file via curl: Corresponds to the name of the equipment
declare fileName=""            #Definitive name of the output file: dateCrawl + fileNameBase
declare dateCrawl=$(date +"%d-%m-%y")    #Date of the crawl used for the name
declare -i offset=0            #Base iteration of the offset. Gets +1 after each iteration
declare -i firstIndex=1            #index for the while - Gets +1 after each iteration but starts on 1 instead of 0 (for the offset).
declare nameToHome=$(cd ~ ; pwd | sed -n 's/.home.\([^\/]*\).*/\1/p')    #name for the path of the file to search if already created

echo
echo 
echo "************* Give the link to the equipment type, from https:// to /all included (last '/' excluded): *************"
read -r linkStart
echo
echo "************* Now, the total number of listings for the equipment: *************"
read -r totalCrawl
echo 
echo

#
# Naming the output file
#

fileNameBase=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/^[[:blank:]]*<li itemscope itemtype="http:..data-vocabulary.org.Breadcrumb"><span itemprop="title">.*$/ s/^.*itemprop="title">\([^<]*\).*$/\1/p')

fileName=$dateCrawl"-"$fileNameBase".csv"

#
# Looking if it already exists
#

if test -f "/home/$nameToHome/DotMedListings/$fileName"
then
    echo
    echo "************* WARNING ************* WARNING *************"
    echo "************* You already crawled that today! *************"
    echo "************* Delete the file or try another *************"
    echo "************* WARNING ************* WARNING *************"
    echo
#
# If not, starting the script
#

else

    echo
    echo
    echo
    echo "************* You will find your result in ~/DotMedListings/$fileName *************"
    echo
    echo
    echo

    echo "brand;equipment;title;description;price;currency;condition;dateListing;country;YoM" >> ~/DotMedListings/"$fileName"    #defining each category for the crawl.


    while [ $firstIndex -le $totalCrawl ]        #Starting the crawling loop
    do

        awk -v t1="$firstIndex" -v t2="$totalCrawl" 'BEGIN{print (t1/t2) * 100}'        # Prints the percentage advancement instead of having the curl info.

        link=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p')    # Get the corresponding listing. If it's the first iteration then it will get the first listing for the equipment.

        curl -s "https://www.dotmed.com$link" -o ~/curl"$totalCrawl".xml        #Saves one curl for the first listing to avoid various curls for the same listing

#
# Getting the info out of the curl
#

        brand=$(sed -n "/^[[:blank:]]*<h1 itemprop='name'.*/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p" ~/curl"$totalCrawl".xml)
        equipment=$(sed -n '/^[[:blank:]]*<meta property="og:url".*$/ s/.*"https:\/\/www.dotmed.com\/listing\/\([^\/]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        price=$(sed -n "/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*amount=\([^&]*\).*/\1/p" ~/curl"$totalCrawl".xml)
        currency=$(sed -n '/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*currency_from=\([^"]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        condition=$(sed -n '/^[[:blank:]]*<ul><li class="left">Condition:.*$/ s/^.*content=.used.>\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        dateListing=$(sed -n '/^[[:blank:]]*<ul><li class="left">Date updated.*$/ s/^.*id="date_updated">\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        country=$(sed -n "/^[[:blank:]]*<p class=.nation.>.*$/ s/^.*'This listing comes from \([^']*\).*/\1/p" ~/curl"$totalCrawl".xml)
        title=$(sed -n '/^[[:blank:]]*<meta property="og:title".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        description=$(sed -n '/^[[:blank:]]*<meta property="og:description".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
        yom=$(sed -n '/^.*Specifications: Year of Manufacture.*$/ s/^.*Specifications: Year of Manufacture,\([^,]*\).*/\1/p' ~/curl"$totalCrawl".xml)

# 
# Sending the info to the output file
#


        echo $brand";"$equipment";"$title";"$description";"$price";"$currency";"$condition";"$dateListing";"$country";"$yom >> ~/DotMedListings/"$fileName"

        rm ~/curl"$totalCrawl".xml    # Deleting the curl file to make room for the next iteration. FYI, I named the curl file after the number of crawls to be done so that the script can be launched several times simultaneously to crawl various equipment types at once.

#
# Resetting for the next iteration.
#

        link=""    
        brand=""
        price=""
        currency=""
        condition=""
        dateListing=""
        country=""
        title=""    
        description=""
        equipment=""
        yom=""

        (( firstIndex++ ))
        (( offset++ ))
    done

    echo
    echo
    echo 
    echo "************* Done! Again, you will find the result in ~/DotMedListings/$fileName *************"
    echo
    echo
    echo
fi

Thanks again to you guys!

All the best!

Ardzii
