I wanted to let you know that I was able to finish up my script. It gives less information than I really need, but I'm amazed at what I was able to do all by myself.
Let's be realistic though: I couldn't have done it without you! I learned so much it's crazy... from 0 to *not the best, but something at least*!
So I guess: THANK YOU for your patience, your support and all the time you invested in showing me the way!!!
Obviously you can use the script. All you have to do is get a proper link; as an example, you can use: https://www.dotmed.com/equipment/2/26/2974/all
with no more than 116 listings (figure as of 8.2.17) and, as it says in the script, you will need to create a dir called "DotMedListings" in ~/
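In case it saves someone a step, the whole setup boils down to this (the script file name is whatever you saved it as; "dotmed.sh" below is just a placeholder):

```shell
# One-time setup: the script appends to this directory but never creates it,
# so make it first (mkdir -p is a no-op if it already exists):
mkdir -p "$HOME/DotMedListings"
# Then run the script (hypothetical file name) and answer the two prompts
# (the equipment link, then the total number of listings):
# bash dotmed.sh
```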
I guess that you will find it sort of messy but it works for now and it's a good basis!
I'm definitely open to your comments and suggestions, as you can imagine! (for instance, my progress display is not very friendly)
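One nice thing about the semicolon-separated output is that awk can pull columns back out afterwards. A self-contained sketch (the sample rows and values below are made up, just mimicking the script's format):

```shell
# Two hypothetical rows in the same semicolon-separated format the script writes:
sample=$(mktemp)
printf 'brand;equipment;title;description;price;currency;condition;dateListing;country;YoM\n' > "$sample"
printf 'GE;ultrasound;Logiq P5;Demo unit;5000;EUR;Used;08/02/2017;Spain;2010\n' >> "$sample"
# Skip the header (NR>1) and print brand plus price, i.e. fields 1 and 5:
out=$(awk -F';' 'NR>1 {print $1, $5}' "$sample")
echo "$out"
# prints: GE 5000
rm -f "$sample"
```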
here goes the script:
Code:
#!/bin/bash
#
#
#
# For this script to work, you will first need to create a DotMedListings dir in your /home/XXXX/ directory.
#
#
#
declare link="" #Will store the link for each iteration
declare linkStart="" #Defines the type of equipment to crawl. To be found in Find Listings For Sale or Wanted On DOTmed.com
declare brand="" #output
declare price="" #output
declare currency="" #output
declare condition="" #output
declare dateListing="" #output
declare country="" #output
declare title="" #output
declare description="" #output
declare equipment="" #output
declare yom="" #output
declare -i totalCrawl=1 #Defines the scope of the crawl (total number of listings to crawl)
declare fileNameBase="" #Used for the name of the output file via curl: Corresponds to the name of the equipment
declare fileName="" #Definitive name of the output file: dateCrawl + fileNameBase
declare dateCrawl=$(date +"%d-%m-%y") #Date of the crawl used for the name
declare -i offset=0 #Base iteration of the offset. Gets +1 after each iteration
declare -i firstIndex=1 #Index for the while loop - gets +1 after each iteration but starts at 1 instead of 0 (unlike the offset).
declare nameToHome=$(cd ~ ; pwd | sed -n 's/.home.\([^\/]*\).*/\1/p') #Login name, extracted from the home directory path; used below to check whether the output file already exists
echo
echo
echo "************* Give the link to the equipment type, from https:// to /all included (last '/' excluded): *************"
read linkStart
echo
echo "************* Now, the total number of listings for the equipment: *************"
read totalCrawl
echo
echo
#
# Naming the output file
#
fileNameBase=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/^[[:blank:]]*<li itemscope itemtype="http:..data-vocabulary.org.Breadcrumb"><span itemprop="title">.*$/ s/^.*itemprop="title">\([^<]*\).*$/\1/p')
fileName="$dateCrawl-$fileNameBase.csv"
#
# Looking if it already exists
#
if test -f "/home/$nameToHome/DotMedListings/$fileName"
then
echo
echo "************* WARNING ************* WARNING *************"
echo "************* You already crawled that today! *************"
echo "************* Delete the file or try another *************"
echo "************* WARNING ************* WARNING *************"
echo
#
# If not, starting the script
#
else
echo
echo
echo
echo "************* You will find your result in ~/DotMedListings/$fileName *************"
echo
echo
echo
echo "brand;equipment;title;description;price;currency;condition;dateListing;country;YoM" >> ~/DotMedListings/"$fileName" #Writes the header row naming each category of the crawl.
while [ $firstIndex -le $totalCrawl ] #Starting the crawling loop
do
awk -v t1="$firstIndex" -v t2="$totalCrawl" 'BEGIN{printf "%.1f%%\n", (t1/t2) * 100}' # Prints the percentage of progress instead of showing the curl info.
link=$(curl -s "$linkStart/offset/$offset/all?key=&limit=1&price_sort=descending&cond=all&continent_filter=0&zip=&distance=5&att_1=0&att_row_num=1&additionalkeywords=&country=ES" | sed -n '/href.*view more/ s/.*href="\([^"]*\).*/\1/p') # Gets the corresponding listing. On the first iteration it gets the first listing for the equipment.
curl -s "https://www.dotmed.com$link" -o ~/curl"$totalCrawl".xml #Saves the listing page once so every field below is extracted from the same local file instead of re-curling it
#
# Getting the info out of the curl
#
brand=$(sed -n "/^[[:blank:]]*<h1 itemprop='name'.*/ s/^.*itemprop='brand'>\([^<]*\).*/\1/p" ~/curl"$totalCrawl".xml)
equipment=$(sed -n '/^[[:blank:]]*<meta property="og:url".*$/ s/.*"https:\/\/www.dotmed.com\/listing\/\([^\/]*\).*/\1/p' ~/curl"$totalCrawl".xml)
price=$(sed -n "/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*amount=\([^&]*\).*/\1/p" ~/curl"$totalCrawl".xml)
currency=$(sed -n '/^[[:blank:]]*<ul><li class=.left.>Price.*$/ s/^.*currency_from=\([^"]*\).*/\1/p' ~/curl"$totalCrawl".xml)
condition=$(sed -n '/^[[:blank:]]*<ul><li class="left">Condition:.*$/ s/^.*content=.used.>\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
dateListing=$(sed -n '/^[[:blank:]]*<ul><li class="left">Date updated.*$/ s/^.*id="date_updated">\([^<]*\).*/\1/p' ~/curl"$totalCrawl".xml)
country=$(sed -n "/^[[:blank:]]*<p class=.nation.>.*$/ s/^.*'This listing comes from \([^']*\).*/\1/p" ~/curl"$totalCrawl".xml)
title=$(sed -n '/^[[:blank:]]*<meta property="og:title".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
description=$(sed -n '/^[[:blank:]]*<meta property="og:description".*$/ s/.*content="\([^-]*\).*/\1/p' ~/curl"$totalCrawl".xml)
yom=$(sed -n '/^.*Specifications: Year of Manufacture.*$/ s/^.*Specifications: Year of Manufacture,\([^,]*\).*/\1/p' ~/curl"$totalCrawl".xml)
#
# Sending the info to the output file
#
echo "$brand;$equipment;$title;$description;$price;$currency;$condition;$dateListing;$country;$yom" >> ~/DotMedListings/"$fileName" #Quoted so spacing inside the fields is preserved
rm ~/curl"$totalCrawl".xml # Deletes the curl file to leave space for the next iteration. FYI, I named the curl file after the number of crawls to be done so I can launch the script several times simultaneously and crawl various equipments at once.
#
# Resetting for the next iteration.
#
link=""
brand=""
price=""
currency=""
condition=""
dateListing=""
country=""
title=""
description=""
equipment=""
yom=""
(( firstIndex++ ))
(( offset++ ))
done
echo
echo
echo
echo "************* Done! Again, you will find the result in ~/DotMedListings/$fileName *************"
echo
echo
echo
fi
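For anyone curious, here is the sed idiom the script leans on, in isolation: -n suppresses automatic printing, the /…/ address selects matching lines, and s/…/\1/p prints only the captured group. The HTML line below is a made-up example in DOTmed's breadcrumb style:

```shell
# A single breadcrumb line like the one fileNameBase is extracted from (fabricated for the demo):
html='<li itemscope itemtype="http://data-vocabulary.org/Breadcrumb"><span itemprop="title">Ultrasound Machines</span></li>'
# Match the line, then keep only the text between itemprop="title"> and the next <:
name=$(printf '%s\n' "$html" | sed -n '/itemprop="title"/ s/.*itemprop="title">\([^<]*\).*/\1/p')
echo "$name"
# prints: Ultrasound Machines
```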