This is my first post here, and I hope I'm not already violating any rules! I'd also like to apologize in advance, as this will definitely be a noob post... please have patience and faith!
Now that I have set the ground rules, my objective is to understand how to write a sort of script that can search the web, extract information, and put that information into a CSV file.
To be completely transparent, I already have a fairly specific idea of the kind of information I'd like to get: the prices and details of some asset listings on the web.
I have checked the web and found out that a while loop with a bunch of curl and grep commands could do the trick. Can anyone help me build this sort of "web-crawler"?
It is always beneficial to post the OS and shell version you are using, as well as the tools (e.g. awk, sed, ...) available and their versions.
This is not a new request; it might be worthwhile to search these fora for similar problems to get a starting point for your particular solution. Any attempts/ideas/thoughts from your side? Do you have any preferences as to the tools to be deployed? Sample input and output data would help as well!
Thanks for your quick reply!
Indeed, with this little information, it is hard to help, right?
OK, then let's make this a little more precise :P
First off, I'm using macOS Sierra V.10.12.2 and regarding the Shell version, I seem to be working on:
For the tools I'm currently trying to use, I've got:
I can't seem to find my "sed" version (the command "sed --version" apparently doesn't exist), but I'm using the "basic" one that comes with macOS. From what I've read, macOS ships the BSD flavour of sed, which simply has no --version flag.
For now, I was only able to fetch and "echo" a listing in my terminal (for 20 entries) by executing the following:
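Roughly this shape; the URL below is a placeholder for the real search link, and the grep pattern is simplified:
Code:
curl -s "https://www.example-site.com/equipment?limit=20" |
grep -Eo 'href="[^"]*"'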
To tell you the truth, a friend helped me with the commands. I'm not even really sure I understand the difference between egrep and grep, as I get the same results using "grep"... I mean, I've read the manual and I understand that egrep can handle more complex expressions than grep, but I'm not sure what an "extended regular expression" actually is... :S
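From the manual, the one concrete difference I could reproduce is with the | (alternation) operator, which only works as an "or" in the extended flavour; at least that's my understanding:
Code:
echo "price cost" | grep -E 'price|cost'   # matches: | means "or" in an extended regex
echo "price cost" | grep 'price|cost'      # no match: | is just a literal character in a basic regex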
I will definitely search the forum for more info, but what I was thinking of was a while loop, so that I can get the price (and more details) of each listing that my command already provides.
I was thinking of first writing the output into an XML file, but I can't seem to find how to send my grep output to a file.
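Or maybe a plain shell redirection is all that's needed to get the grep output into a file? Something like this, if I'm not mistaken (the file would just contain the raw matches, one per line):
Code:
curl -s "$URL" | grep -E 'pattern' > listing.xml
EDIT: I forgot, I also thought of using "sed" to extract only the info that I'm looking for.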
Thanks for the system details. I'm not familiar with macOS, and that bash version is somewhat aged, but you should get along.
So, now we have a text to work on. What do you want to extract from it? The above just lists several URLs, but no technical details or prices.
Extending the grep regex with an alternation, like the sketch below, gives you a link and a price.
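Treat both branches of the alternation as placeholders to be adapted to the page's actual HTML:
Code:
curl -s "$URL" |
grep -Eo 'href="[^"]*"|[0-9][0-9,.]* *(EUR|USD)'
The -o flag prints every match on its own line, so the links and the prices come out interleaved in page order.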
I'm not sure this would suffice to fulfill your needs; you might need to dive into the next level URLs, now.
While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification, giving the framework of what you really want, in which shape, where and how to get it, and ideas about what to do to transform the input data into the output. We can then help to implement and/or improve that.
This is indeed a great step! So now I can get the make, model and price from the search directly. But you're right:
Quote:
While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification, giving the framework of what you really want, in which shape, where and how to get it, and ideas about what to do to transform the input data into the output. We can then help to implement and/or improve that.
To take a step back, then: my objective is to get, for each type of medical equipment on the website:
The make and model (which apparently appear in the HTTP link)
The asking price (if available)
but also some information that is detailed once we get into each listing:
Year of manufacture
Country
Publication date
Availability (if available)
Condition
And send all this information to a CSV or any other kind of database file (even a spreadsheet).
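Something with one header row matching those fields is what I picture; the column names below are just taken from the list above:
Code:
echo 'make,model,asking_price,year,country,publication_date,availability,condition' > listings.csv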
All the best!
Ardzii
Please back these details up with input samples, with how and where to get those, and with how to identify the data you need. You don't expect people in here to crawl through all those sites, do you?
Hey there RudiC!
Sorry for not answering earlier, and, as you'll see, I deleted all of the "https" from my reply, as the forum doesn't let me post URLs until I have at least 5 posts.
You're right, I don't expect people to crawl through the site. I'm not sure I understand what you mean, though.
To get the data, you need to generate a listing through this link:
The URL is pretty easy to adapt, and I think I've adapted it to my current needs. This will get a densitometer equipment listing, and afterwards I can easily adapt the URL myself to get to the other equipment types (the structure is the same across all of them).
A few comments on the link itself though:
This obviously limits the output to 20 items. I'm using 20 right now so that the requests are fast and easy, but I'll change it to 200 afterwards to get many more listings.
I'm mostly interested in listings where the price is mentioned, so I decided to sort by descending price; that way the listings with prices come first (more relevant to me).
I chose Spain as a filter, but that's not especially relevant; I'd simply rather have EU listings first, which is why I picked Spain.
Now back to the command:
With "curl" I'm getting the listing (I could import that listing locally into an HTML file but since that's not the objective, I get right away with the grep command).
The grep then lists the links available for the listing I specified and that's it for now.
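For reference, the local-copy variant I mentioned would be something like this (the URL variable is a placeholder):
Code:
# save the search page locally once, then grep the listing links out of the copy
curl -s "$LIST_URL" -o listing.html
grep -Eo 'href="[^"]*"' listing.html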
The part that's still missing:
The rest of the info I mentioned earlier is located inside each individual listing.
What I need to do now, based on the previous "grep", is, for each link (for instance, starting with the first one in my list), go and get:
The price
The condition
The date_updated
Obviously, my objective is to generate a loop that will get me this info for each link in the listing, then see how I can clean up the info and send it to a CSV or any other similar file to store the information.
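A minimal sketch of the loop I have in mind; the grep patterns for the three fields are pure guesses, as the real ones depend on each page's HTML:
Code:
#!/bin/bash
# Read one listing URL per line from links.txt, pull three fields out of
# each page, and append them as one CSV row. All patterns are placeholders.
out=listings.csv
echo 'url,price,condition,date_updated' > "$out"
while read -r url; do
    page=$(curl -s "$url")
    price=$(printf '%s\n' "$page" | grep -Eo '[0-9][0-9,.]* *(EUR|USD)' | head -n 1)
    condition=$(printf '%s\n' "$page" | grep -Eo 'Condition:[^<]*' | head -n 1)
    updated=$(printf '%s\n' "$page" | grep -Eo 'Date updated:[^<]*' | head -n 1)
    printf '%s,"%s","%s","%s"\n' "$url" "$price" "$condition" "$updated" >> "$out"
    sleep 1    # be polite to the server
done < links.txt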
I hope this long post contains the info you were looking for? If not, I apologize; if you could explain in a little more detail what you need, that'd be great!