Noob trying to improve


 
# 1  
Old 01-12-2017
Back from the dead

Hey RudiC!

It's been a while, I know, but as I said, I was busy learning bash.

I'm not saying I've got it all; I still have a long way to go...
I just wanted to post what I've been able to do on my own so far.
It will definitely seem barbaric to you, and less elegant than what you did earlier with the awk command, but as I'm not sure how to control that one, I'm taking another road:

Code:
#!/bin/bash

#Setting the variable for the link construction: the part appended to
#www.dotmed.com ($link) in the second curl
link=""

#Setting the index for the while loop. The limit (the constant in the while condition) defines how many listings to "crawl"
i=1

#Setting the offset variable that steps from one listing to the next. This variable is used in the first curl link
offset=0

#Starting the loop for the crawl
while [ $i -lt 5 ]
do

#Getting the listing and assigning each listing to the variable "link"
        link=$(curl "https://www.dotmed.com/equipment/2/5/2693/all/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | egrep "href.*view more" | sed -n 's/.*href="\([^"]*\).*/\1/p')

#Getting information from each listing
        curl "https://www.dotmed.com$link" | fgrep -e "id=\"price"

#Resetting for next iteration
        unset link
        (( i++ ))
        (( offset++ ))
done

The great thing is that I can run it on any Linux machine, and with this script I'm getting into each listing to pull information from there...
Now I have to learn more about sed and grep to extract the information I need automatically, and I'll be done.
Easy, right? Hopefully I'll be able to do it soon.

If you have any comments on the script, please be my guest! I'm still trying to learn!

All the best!

Last edited by Ardzii; 01-12-2017 at 02:29 PM.. Reason: English
# 2  
Old 01-12-2017
Even if I am not RudiC: you are doing quite fine.

Quote:
Originally Posted by Ardzii
It will definitely seem barbaric to you, and less elegant than what you did earlier with the awk command, but as I'm not sure how to control that one, I'm taking another road:
In (German) medicine there is a proverb: he who heals is right. In programming the same holds true: as long as a program does what it is supposed to do, it is kind of hard to argue ... ;-)

A few suggestions, though:

Code:
#setting variable for the link construction.
declare link=""

There is a difference between an unset variable and one that has a value of "" (an empty string) or zero (for numbers). What you want is to declare the variable, so that you can later give some (meaningful) value to it, which is - as long as that value is yet to be determined - an empty value. In bash the keyword to declare variables is "declare" (or "typeset", perhaps in an effort to be compatible with the Korn shell); inside functions you can also use "local".
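You can see the difference right in the shell: the `${var+word}` expansion produces "word" only if the variable is set at all, even if it is set to an empty string.

```shell
#!/bin/bash
# ${var+set} expands to "set" only if var is declared (even as empty)

unset var
echo "unset: '${var+set}'"      # prints: unset: ''

var=""
echo "empty: '${var+set}'"      # prints: empty: 'set'
```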

Code:
declare -i i=1
declare -i offset=0

See above. As a suggestion: always give variables meaningful names. Once your script grows to some length and you juggle several indexes at the same time, you will want to have, e.g., a "fooidx" and a "baridx" instead of an "i" and a "j".

Code:
#Resetting for next iteration
        link=""
        (( i++ ))
        (( offset++ ))

You don't want to unset the variable (that is: the opposite of defining it), just clear its content. So, as in the declaration, you simply assign an empty string instead of unsetting it.

As a suggestion: I put comments on the same line as the code they belong to, and always at a fixed horizontal position. Hence, instead of your loop, I'd write:

Code:
#Starting the loop for the crawl
while [ $i -lt 5 ] ; do                     # crawling loop
                                            # getting the link
     link=$( curl "your-link-here" |\
             egrep "href.*view more" |\
             sed -n 's/.*href="\([^"]*\).*/\1/p' \
           )
                                            # extract link
     curl "https://www.dotmed.com$link" | fgrep -e "id=\"price"

     link=""                                # Reset for next iteration
     (( i++ ))
     (( offset++ ))
done

To my eyes this is easier to read, but again: whatever helps you, you should do. In the pipeline:

Code:
     link=$( curl "your-link-here" |\
             egrep "href.*view more" |\
             sed -n 's/.*href="\([^"]*\).*/\1/p' \
           )

You can do it all in sed, without the additional egrep:

Code:
     link=$( curl "your-link-here" |\
             sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p' \
           )

As a rule of thumb: grep/sed/awk | grep/sed/awk is almost always wrong, because whatever the extra stage does can be done inside the one tool chosen.
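You can check the rule offline; the sample HTML line here is made up, not real DOTmed output, but the sed expression is the one from above:

```shell
#!/bin/bash
sample='<td><a href="/listing/2299124">view more</a></td>'

# two processes: egrep filters, then sed extracts
two=$(printf '%s\n' "$sample" | egrep "href.*view more" |
      sed -n 's/.*href="\([^"]*\).*/\1/p')

# one process: sed filters with an address and extracts in the same call
one=$(printf '%s\n' "$sample" |
      sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p')

echo "$two"     # /listing/2299124
echo "$one"     # /listing/2299124 - same result, one process fewer
```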

I hope this helps and have (more) fun programming.

bakunin
# 3  
Old 12-26-2016
This may serve as a starting point (file contains the web content downloaded before):

Code:
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
' file | sh | awk '
/<\/*title>/ ||
/id=\"price/ ||
/id=\"condition/ ||
/id=\"date_updated/     {gsub (/<[^>]*>/, _)
                         if (length) print
                        }
' 

 
Used GE Lunar DPX Bone Densitometer For Sale - DOTmed Listing #2299124: 
			Price:$20,000.00 USD [convert]
			Condition:Used - Excellent
			Date updated:December  18, 2016
 
New OSTEOSYS DEXXUM T Bone Densitometer For Sale - DOTmed Listing #2299556: 
			Price:$19,990.00 USD [convert]
			Condition:New
			Date updated:December  09, 2016
 
Used HOLOGIC DISCOVERY C Bone Densitometer For Sale - DOTmed Listing #1184884: 
			Price:$19,000.00 USD [convert]
			Condition:Used - Good
			Date updated:December  07, 2016
.
.
.
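The tag stripping in the second awk can be tried offline on a made-up line (only the `gsub` pattern is taken from the script above):

```shell
#!/bin/bash
# gsub(/<[^>]*>/, "") deletes every HTML tag, leaving only the text content;
# "if (length)" suppresses lines that were nothing but tags
printf '<h2 id="price">Price:$20,000.00 USD</h2>\n' |
awk '/id="price/ { gsub (/<[^>]*>/, ""); if (length) print }'
# prints: Price:$20,000.00 USD
```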


Last edited by RudiC; 12-26-2016 at 02:41 PM..
# 4  
Old 12-26-2016
Hey RudiC!

Thanks for that! It looks really great! I have a lot of work ahead of me to understand your lines, though.

I'll let you know once I'm able to get the results as you did!

Best!

Ardzii
# 5  
Old 12-26-2016
A wee bit improved, so you can add the search words at the end as parameters separated by pipe symbols:


Code:
curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
' |
sh |
awk '
match ($0, "id=\"(" IDS ")\"")  ||
/<\/*title>/    {gsub (/<[^>]*>/, _)
                 print
                }
' IDS="price|condition|date_updated" 

Used GE Lunar DPX Bone Densitometer For Sale - DOTmed Listing #2299124: 
			Price:$20,000.00 USD [convert]
			Condition:Used - Excellent
			Date updated:December  18, 2016
 
New OSTEOSYS DEXXUM T Bone Densitometer For Sale - DOTmed Listing #2299556: 
			Price:$19,990.00 USD [convert]
			Condition:New
			Date updated:December  09, 2016
 
Used HOLOGIC DISCOVERY C Bone Densitometer For Sale - DOTmed Listing #1184884: 
			Price:$19,000.00 USD [convert]
			Condition:Used - Good
			Date updated:December  07, 2016
.
.
.
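The `match ($0, "id=\"(" IDS ")\"")` trick builds the regular expression from the IDS variable at run time, so one awk call can filter any set of ids passed in from the command line. An offline illustration (the sample spans are made up):

```shell
#!/bin/bash
# match() with a string regex: IDS is concatenated into the pattern,
# so only lines whose id is in the IDS alternation survive
printf '<span id="condition">Condition:New</span>\n<span id="vendor">Acme</span>\n' |
awk 'match ($0, "id=\"(" IDS ")\"") { gsub (/<[^>]*>/, ""); print }' \
    IDS="price|condition|date_updated"
# prints only: Condition:New
```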


Last edited by RudiC; 12-26-2016 at 03:01 PM..
# 6  
Old 12-26-2016
OK! I got most of it...

In the first step you get the listing:
Code:
curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=5&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
'
In the second, you created a variable IDS that looks for the price, condition and date_updated, and prints the results.
Code:
curl -s "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" |
awk '/href.*view more/ {sub (/^[^<]*<a href="/, "curl -s https://www.dotmed.com")
                        sub (/">.*$/, "")
                        print}
' |
sh |
awk '
match ($0, "id=\"(" IDS ")\"")  ||
/<\/*title>/    {gsub (/<[^>]*>/, _)
                 print
                }
' IDS="price|condition|date_updated"

I added a >> "/Users/myuser/Desktop/test.csv" to get the printed output exported to a CSV file.
I've been looking around for the past hour and I can't find how to put each listing on a single line, with a ";" dividing the "description" (or "title") from the price, condition and date_updated, instead of having 4 lines created per entry.

I know that something has to change between the "||" after the match and the print, but I have no idea where and how...
Could you help me once more?

Thanks as usual!

Ardzii
# 7  
Old 12-27-2016
If you'd accept a trailing comma (removing it would need additional measures), set the output record separator to a comma: ORS=",". As ALL the info would then come on one long line, we need a way to separate one machine's data from the next. I used the beginning of an HTML doc for this. Try adding the following to your script:
Code:
.
.
.
/^<!DOCTYPE/    {printf RS
                }
END             {printf RS
                }
' IDS="price|condition|date_updated|in_stock" ORS=","

Please be aware that any comma INSIDE a field will lead to misinterpretation if the result is later read elsewhere as comma-separated fields.
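The effect of ORS can be seen offline with made-up records (the field names only mimic the real output): with ORS="," every print ends in a comma instead of a newline, and a final printf adds the line break, like the printf RS above.

```shell
#!/bin/bash
# ORS replaces the newline that print normally appends;
# printf is not affected by ORS, so it supplies the final newline
printf 'Title X\nPrice:1\nCondition:New\n' |
awk '{ print } END { printf "\n" }' ORS=","
# prints: Title X,Price:1,Condition:New,
```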