This is my first post here, and I hope I'm not already violating any rules! I'd also like to apologize in advance, as this will definitely be a noob post... please have patience and faith!
Now that I have set the ground rules, my objective is to understand how to write a sort of script that can search the web, extract information, and put that information into a CSV file.
To be completely transparent, I already have a fairly specific idea of the kind of information I'd like to get: the prices and details of some asset listings on the web.
I have checked the web and found out that a while loop with a bunch of curl and grep commands could do the trick. Can anyone help me build this sort of "web-crawler"?
It is always beneficial to post the OS and shell version you are using, as well as the tools (e.g. awk, sed, ...) available and their versions.
This is not a new request; it might be worthwhile to search these fora for similar problems to get a starting point for your particular solution. Any attempts/ideas/thoughts from your side? Do you have any preferences as to the tools to be deployed? Sample input and output data would help as well!
Thanks for your quick reply!
Indeed, with this little information, it is hard to help, right?
OK, then let's make this a little more precise :P
First off, I'm using macOS Sierra V.10.12.2 and regarding the Shell version, I seem to be working on:
For the tools I'm currently trying to use, I've got:
I can't seem to find my "sed" version (the command "sed --version" apparently doesn't exist), but I'm using the "basic" one that comes with macOS. From what I've read, macOS ships the BSD flavour of sed, which simply has no --version flag.
For now, I was only able to fetch and "echo" a listing in my terminal (for 20 entries) by executing the following:
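Roughly this shape; the URL below is a placeholder for the real search link, and the grep pattern is simplified:
Code:
curl -s "https://www.example-site.com/equipment?limit=20" |
grep -Eo 'href="[^"]*"'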
To tell you the truth, a friend helped me with the commands. I'm not even really sure I understand the difference between egrep and grep, as I get the same results using "grep"... I mean, I've read the manual and I understand that egrep can handle more complex expressions than grep, but I'm not sure what an "extended regular expression" actually is... :S
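From the manual, the one concrete difference I could reproduce is with the | (alternation) operator, which only works as an "or" in the extended flavour; at least that's my understanding:
Code:
echo "price cost" | grep -E 'price|cost'   # matches: | means "or" in an extended regex
echo "price cost" | grep 'price|cost'      # no match: | is just a literal character in a basic regex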
I will definitely search the forum for more info, but what I was thinking of was a while loop, so that I can get the price (and more details) of each listing that my command already provides.
I was thinking of first writing the output into an XML file, but I can't seem to find how to send my grep output to a file.
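Or maybe a plain shell redirection is all that's needed to get the grep output into a file? Something like this, if I'm not mistaken (the file would just contain the raw matches, one per line):
Code:
curl -s "$URL" | grep -E 'pattern' > listing.xml
EDIT: I forgot, I also thought of using "sed" to extract only the info that I'm looking for.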
Thanks for the system details. I'm not familiar with macOS, and that bash version is somewhat aged, but you should get along.
So, now we have a text to work on. What do you want to extract from it? The above just lists several URLs, but no technical details or prices.
Extending the grep regex with an alternation, like the sketch below, gives you a link and a price.
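Treat both branches of the alternation as placeholders to be adapted to the page's actual HTML:
Code:
curl -s "$URL" |
grep -Eo 'href="[^"]*"|[0-9][0-9,.]* *(EUR|USD)'
The -o flag prints every match on its own line, so the links and the prices come out interleaved in page order.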
I'm not sure this would suffice to fulfill your needs; you might need to dive into the next level URLs, now.
While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification, giving the framework of what you really want, in which shape, where and how to get it, and ideas about what to do to transform the input data into the output. We can then help to implement and/or improve that.
This is indeed a great step! So now I can get the make, model and price from the search directly. But you're right:
Quote:
While we could start a nice ping-pong conversation about every single detail needed and how to get at it, I firmly believe it would be better if you took a step back and rephrased the specification, giving the framework of what you really want, in which shape, where and how to get it, and ideas about what to do to transform the input data into the output. We can then help to implement and/or improve that.
To take a step back, then: my objective is to get, for each type of medical equipment on the website:
The make and model (which apparently appear in the HTTP link)
The asking price (if available)
but also some information that is detailed once we get into each listing:
Year of manufacture
Country
Publication date
Availability (if available)
Condition
And send all this information to a CSV or any other kind of database file (even a spreadsheet).
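Something with one header row matching those fields is what I picture; the column names below are just taken from the list above:
Code:
echo 'make,model,asking_price,year,country,publication_date,availability,condition' > listings.csv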
All the best!
Ardzii
Please back these details up with input samples, with how and where to get those, and with how to identify the data you need. You don't expect people in here to crawl through all those sites, do you?
Hey there RudiC!
Sorry for not answering earlier, and, as you'll see, I deleted all of the "https" from my reply, as the forum doesn't let me post URLs until I have at least 5 posts.
You're right, I don't expect people to crawl through the site. I'm not sure I understand what you mean, though.
To get the data, you need to generate a listing through this link:
The URL is pretty easy to adapt, and I think I've adapted it to my current needs. This will get a densitometer equipment listing, and afterwards I can easily adapt the URL myself to get to the other equipment types (the structure is the same across all of them).
A few comments on the link itself though:
This obviously limits the output to 20 items. I'm using 20 right now so that the requests are fast and easy, but I'll change it to 200 afterwards to get many more listings.
I'm mostly interested in listings where the price is mentioned, so I decided to sort by descending price; that way the listings with prices come first (more relevant to me).
I chose Spain as a filter, but that's not especially relevant; I'd simply rather have EU listings first, which is why I picked Spain.
Now back to the command:
With "curl" I'm getting the listing (I could import that listing locally into an HTML file but since that's not the objective, I get right away with the grep command).
The grep then lists the links available for the listing I specified and that's it for now.
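For reference, the local-copy variant I mentioned would be something like this (the URL variable is a placeholder):
Code:
# save the search page locally once, then grep the listing links out of the copy
curl -s "$LIST_URL" -o listing.html
grep -Eo 'href="[^"]*"' listing.html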
The part that's still missing:
The rest of the info I mentioned earlier is located inside each individual listing.
What I need to do now, based on the previous "grep", is, for each link (for instance, starting with the first one in my list), go and get:
The price
The condition
The date_updated
Obviously, my objective is to generate a loop that will get me this info for each link in the listing, then see how I can clean up the info and send it to a CSV or any other similar file to store the information.
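A minimal sketch of the loop I have in mind; the grep patterns for the three fields are pure guesses, as the real ones depend on each page's HTML:
Code:
#!/bin/bash
# Read one listing URL per line from links.txt, pull three fields out of
# each page, and append them as one CSV row. All patterns are placeholders.
out=listings.csv
echo 'url,price,condition,date_updated' > "$out"
while read -r url; do
    page=$(curl -s "$url")
    price=$(printf '%s\n' "$page" | grep -Eo '[0-9][0-9,.]* *(EUR|USD)' | head -n 1)
    condition=$(printf '%s\n' "$page" | grep -Eo 'Condition:[^<]*' | head -n 1)
    updated=$(printf '%s\n' "$page" | grep -Eo 'Date updated:[^<]*' | head -n 1)
    printf '%s,"%s","%s","%s"\n' "$url" "$price" "$condition" "$updated" >> "$out"
    sleep 1    # be polite to the server
done < links.txt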
I hope this long post contains the info you were looking for? If not, I apologize; if you could explain in a little more detail what you need, that'd be great!