Noob trying to improve


 
# 1  
Old 12-22-2016

Hi everyone!

This is my first post here, and I hope I'm not already violating any rules! I'd also like to apologize in advance, as this will definitely be a noob post... please have patience and faith!

Now that I have set the ground rules, my objective is to understand how to write a sort of script that could search and extract information from the web and put that information into a CSV file.

To be completely transparent, I sort of already have a specific idea of what kind of information I'd like to get: The prices and details of some asset listings on the web.

I have checked the web and found out that a while loop with a bunch of curl and grep commands could do the trick. Is there anyone who can help me build this sort of "web crawler"?

Thanks in advance to you all!

Ardzii
# 2  
Old 12-22-2016
Welcome to the forum.

It is always beneficial to post the OS and shell version you are using, as well as the tools (e.g. awk, sed, ...) and their versions.

This is not a new request; it might be worthwhile to search these fora for similar problems to get a starting point for your special solution. Any attempts/ideas/thoughts from your side? Do you have any preferences as for the tools to be deployed? Sample input and output data would help as well!
# 3  
Old 12-23-2016
Hey RudiC!

Thanks for your quick reply!
Indeed, with this little information it is hard to help, right?
OK, let's make this a bit more precise then :P

First off, I'm using macOS Sierra V.10.12.2 and regarding the Shell version, I seem to be working on:
Code:
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin16)
Copyright (C) 2007 Free Software Foundation, Inc.

For the tools I'm currently trying to use, I've got:
Code:
$ grep --version
grep (BSD grep) 2.5.1-FreeBSD
$ curl --version
curl 7.51.0 (x86_64-apple-darwin16.0) libcurl/7.51.0 SecureTransport zlib/1.2.8
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz UnixSockets

I can't seem to find my sed version (the command "sed --version" apparently doesn't exist), but I'm using the "basic" one that comes with macOS.

For now, I was only able to get and "echo" a listing in my terminal (20 entries) by executing the following:
Code:
$ curl "https://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | egrep "href.*view more"

To tell you the truth, a friend helped me with the commands; I'm not even really sure I understand the difference between egrep and grep, as I get the same results using "grep"... I mean, I've read the manual and I understand that egrep can handle more complex expressions than grep, but I'm not sure what an "extended regular expression" is... :S
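A quick way to see the difference: in a basic regular expression (plain grep), operators like `|` and `+` are literal characters, while egrep (or grep -E) treats `|` as alternation ("or"):

```shell
# ERE: '|' means "or", so both lines match
printf 'cat\ndog\n' | grep -E 'cat|dog'
# prints: cat
#         dog

# BRE: '|' is a literal character, so nothing matches here
printf 'cat\ndog\n' | grep 'cat|dog'
# prints nothing
```

A pattern like `href.*view more` only uses `.` and `*`, which both flavors share, which is why grep and egrep give the same results for it.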

I will definitely search the forum for more info, but what I was thinking was to set up a while loop so that I can get the price (and more details) of each listing that my command already provides.
I was thinking of first printing the output into an XML file, but I can't seem to find the command to save my grep output to a file.

EDIT: I forgot, I also thought of using "sed" to extract only the info that I'm looking for.
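For the record, saving the grep output needs no separate command: the shell's `>` redirection writes stdout to a file (and `curl -s` would hide the progress meter). A tiny illustration with a dummy line standing in for the real curl call:

```shell
# '>' redirects stdout into a file ('>>' would append instead of overwrite)
printf '<a href="/listing/x/1"> view more </a>\n' | grep "view more" > listing.html
cat listing.html
# prints: <a href="/listing/x/1"> view more </a>
```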

Thanks again!

Last edited by Ardzii; 12-23-2016 at 06:55 AM..
# 4  
Old 12-23-2016
Thanks for the system details. I'm not familiar with macOS, and that bash version is somewhat aged, but you should get along.

So, now we have a text to work upon. What do you want to extract from it? The above just lists several URLs, but no technical details or prices.

Extending the grep regex with an alternation like
Code:
egrep "href.*view more|Asking Price"

gives you a link and a price:

Code:
			<a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>
			<p style="margin-top:11px"><span style="font-size:11px;">Asking Price:<br /></span>$20,000 USD

I'm not sure this would suffice to fulfill your needs; you might need to dive into the next level URLs, now.
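Diving into the next level could start from the anchor lines themselves; a sketch using sed to cut the quoted path out of one line and prefix the site's domain (the domain is taken from the search URL earlier in the thread):

```shell
# pull the quoted path out of the anchor and build the detail-page URL
line='<a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>'
path=$(printf '%s\n' "$line" | sed 's/.*href="\([^"]*\)".*/\1/')
url="https://www.dotmed.com$path"
echo "$url"
# prints: https://www.dotmed.com/listing/bone-densitometer/ge/lunar-dpx/2299124

# curl -s "$url" | egrep "Asking Price"   # would then fetch that listing's page
```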

While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification: give the framework of what you really want, in which shape, where and how to get it, and your ideas about how to transform the input data into the output. We can then help to implement and/or improve that.
# 5  
Old 12-23-2016
Hey there RudiC!

This is indeed a great step! So now I can get the make, model and price from the search directly. But you're right:
Quote:
While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification: give the framework of what you really want, in which shape, where and how to get it, and your ideas about how to connect the input data with the output. We can then help to implement and/or improve that.
To take a step back then: my objective is to get for each type of medical equipment within the website:
  • The make and model (which apparently appear in the URL)
  • The asking price (if available)
but also some information that is detailed once we get into each listing:
  • Year of manufacture
  • Country
  • Publication date
  • Availability (if available)
  • Condition

And send all this information to a CSV or any kind of database file (even a spreadsheet).

All the best!

Ardzii

Last edited by rbatte1; 12-23-2016 at 09:48 AM.. Reason: Converted textual lists to formatted lists
# 6  
Old 12-23-2016
Please back these details up with input samples, and with how and where to get them! How do we identify the data you need? You don't expect people in here to crawl through all those sites, do you?
# 7  
Old 12-26-2016
Quote:
Originally Posted by RudiC
Please back these details up with input samples, and with how and where to get them! How do we identify the data you need? You don't expect people in here to crawl through all those sites, do you?
Hey there RudiC!

Sorry for not answering earlier; as you'll see, I deleted all of the "https" prefixes from my reply, as the forum doesn't let me post URLs until I have at least 5 posts.
You're right, I don't expect people to crawl through the site, and I'm sure I understand what you mean.

To get the data, you need to generate a listing through this link:
Code:
://www.dotmed.com/equipment/2/92/1209/all/offset/0/all?key=&limit=20&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES

The URL is pretty easy to adapt, and I think I've adapted it to my current needs. This will get a bone densitometer equipment listing, and afterwards I can easily adapt the URL myself to get to the other equipment types (as the structure is the same across all of them).

A few comments on the link itself though:
Code:
&limit=20

This obviously limits the output to 20 items. I am using 20 right now so that the requests are fast and easy, but I'll change it to 200 later to get many more listings.

Code:
&price_sort=descending

I'm mostly interested in listings where the price is mentioned, so I decided to sort by descending price so that I get the listings with prices first (more relevant to me).

Code:
&country=ES

I chose Spain as a filter, but it doesn't matter much. I'd rather have EU listings first, which is why I chose Spain.

Now back to the command:

With "curl" I'm getting the listing (I could save that listing locally into an HTML file, but since that's not the objective, I go straight to the grep command).
The grep then lists the links available for the listing I specified, and that's it for now.
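If only the paths are wanted rather than the whole anchor lines, grep's `-o` flag (print only the part that matched, one match per line) trims the output, for example:

```shell
# -o prints just the matched substring instead of the whole line
printf '<a href="/listing/a/1"> view more </a>\n<a href="/listing/b/2"> view more </a>\n' \
    | grep -o '/listing/[^"]*'
# prints: /listing/a/1
#         /listing/b/2
```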

The expected part:
The rest of the info I mentioned earlier is located in each URL.
What I need to do now, based on the previous "grep", is:
Code:
            <a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>
            <a href="/listing/bone-densitometer/osteosys/dexxum-t/2299556"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-c/1184884"> view more </a>
            <a href="/listing/bone-densitometer/ge/prodigy/1184904"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-idxa/2246457"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-prodigy/1668884"> view more </a>
            <a href="/listing/bone-densitometer/hologic/qdr-4500-elite/1738541"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-c/1405820"> view more </a>
            <a href="/listing/bone-densitometer/alara/metriscan/653936"> view more </a>
            <a href="/listing/bone-densitometer/sunlight/omnisense-7000s/470081"> view more </a>
            <a href="/listing/bone-densitometer/hologic/delphi-c/99115"> view more </a>
            <a href="/listing/bone-densitometer/lunar/dpx-nt/2310470"> view more </a>
            <a href="/listing/bone-densitometer/hologic/qdr-4500/2219929"> view more </a>
            <a href="/listing/bone-densitometer/norland/excell/1184892"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-duo/875678"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-nt/2284643"> view more </a>
            <a href="/listing/bone-densitometer/hologic/discovery-qdr-10041/2257994"> view more </a>
            <a href="/listing/bone-densitometer/sunlight/mini-omni-por/2183339"> view more </a>
            <a href="/listing/bone-densitometer/ge/lunar-dpx-bravo/2225055"> view more </a>

for each link (for instance, starting with the first one in my list):
Code:
<a href="/listing/bone-densitometer/ge/lunar-dpx/2299124"> view more </a>

go and get:

The price:
Code:
$ curl ://www.dotmed.com//listing/bone-densitometer/osteosys/dexxum-t/2299556 | fgrep -e "id=\"price"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 38179   43 16384    0     0  16698      0  0:00:02 --:--:--  0:00:02 16684"<ul><li class="left">Price:</li><li class="right" id="price"><span itemprop='price' content='19990.00'>$19,990.00 <span itemprop='currency'>USD</span> <a style='font-size: 5pt' href='#' title='Convert the Currency' onClick='javascript:window.open("/listings/currency.html?amount=19990.00&currency_from=USD", "listing", config="height=200,width=500,toolbar=no,menubar=no,scrollbars=yes,resizable=no,location=no,directories=no,status=yes"); return false;'>[convert]</a></span></li></ul>

The condition:
Code:
$ curl ://www.dotmed.com//listing/bone-densitometer/osteosys/dexxum-t/2299556 | fgrep -e "id=\"condition"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 43 38179   43 16384    0     0   9241      0  0:00:04  0:00:01  0:00:03  9240    <ul><li class="left">Condition:</li><li class="right" id="condition"><span itemprop='condition' content='new'>New</span></li></ul>

The date_updated:
Code:
$ curl ://www.dotmed.com//listing/bone-densitometer/osteosys/dexxum-t/2299556 | fgrep -e "id=\"date_updated"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 38179    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0    <ul><li class="left">Date updated:</li><li class="right" id="date_updated">December  09, 2016</li></ul>

Obviously, my objective is to generate a loop that will get me this info for each link in the listing, then see how I can clean up the info and send it to a CSV or any other similar file to store the information.
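A minimal sketch of that loop, assuming the page fragments quoted above are representative (the sed patterns are guesses built from those fragments and are untested against the live site; `$search_url` stands for the search link above):

```shell
#!/bin/sh
# Sketch: append one CSV row per listing page.
base="https://www.dotmed.com"
csv="listings.csv"
echo 'path,price,condition,date_updated' > "$csv"

scrape_one() {   # $1 = listing path, $2 = page HTML
    price=$(printf '%s\n' "$2" | grep 'id="price"' \
            | sed "s/.*itemprop='price' content='\([^']*\)'.*/\1/")
    cond=$(printf '%s\n' "$2" | grep 'id="condition"' \
            | sed "s/.*itemprop='condition' content='\([^']*\)'.*/\1/")
    updated=$(printf '%s\n' "$2" | grep 'id="date_updated"' \
            | sed 's/.*id="date_updated">\([^<]*\)<.*/\1/')
    printf '%s,%s,%s,"%s"\n' "$1" "$price" "$cond" "$updated" >> "$csv"
}

# The real run would hit the network, so it is only sketched here:
# curl -s "$search_url" | grep -o '/listing/[^"]*' |
#     while read -r path; do scrape_one "$path" "$(curl -s "$base$path")"; done

# Demo on the fragments quoted earlier in this thread:
html="<li class=\"right\" id=\"price\"><span itemprop='price' content='19990.00'>
<li class=\"right\" id=\"condition\"><span itemprop='condition' content='new'>New</span>
<li class=\"right\" id=\"date_updated\">December  09, 2016</li>"
scrape_one "/listing/bone-densitometer/osteosys/dexxum-t/2299556" "$html"
cat "$csv"
```

Running the demo part appends one row like `/listing/bone-densitometer/osteosys/dexxum-t/2299556,19990.00,new,"December  09, 2016"` under the header. The date is quoted in the CSV because it contains a comma.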

I hope that some of this long post contains the info you were looking for! If not, I apologize; if you could give a little more detail, that'd be great!

Thanks again and as usual!

Ardzii