So you think scraping 34 pages, using random request times, isn't that bad.
I think it's OK to do that once. If a determined human is allowed to do it by hand, a standalone program is no different. You hadn't mentioned doing it periodically though -- for that, there are standards and rules.
Your program is nothing more or less than an ordinary, ubiquitous web-spider. Does wordpress.org militate against them? Check http://wordpress.org/robots.txt :
They have places spiders clearly aren't supposed to go, mostly their search functions (probably slow and expensive). Their flat listing of popular plugins is not excluded -- if it were, wget would have flat-out refused to fetch it; it's polite like that.
They could also have refused wget's very well-known user-agent value, and haven't. They don't militate against referer values either, and haven't obfuscated anything with logins, javascript, or cookie nonsense. They don't even have advertising to be annoyed about humans never seeing. As far as I'm concerned it's in the clear. You are inside the marked areas, not doing anything obviously wrong, on a site with no ads or interactive restrictions. If you don't impact their performance and don't do anything suspicious, why would they care? Downloading information is what their web site is for.
The rules for polite spiders, as I see them:
Obey The Signs: If robots.txt says go away, go away. wget obeys this.
Don't Abuse It: Don't connect too often, don't open too many connections at once, and don't consume traffic too fast. Again, wget is helpful here: --wait=seconds pauses between retrievals, --limit-rate=50k caps download speed at 50 kilobytes per second, etc. wget also reuses its connection for retrievals to the same domain name: 34 noisy retrievals become one long one. I bet the wget version is already far more polite than your Python one.
Reduce The Load: Only download what you need, as politely as you can. Avoid recursive or open-ended retrievals. Since we're only downloading plain text, we can get a huge improvement from compression -- each 100-kilobyte page is reduced to about 15 kilobytes. This makes an entire scrape about the size of five ordinary pageviews and saves almost 3 megabytes of their traffic. That's worth doing if you consider viewing wordpress.org five times per day a reasonable amount of traffic -- it's small enough to be no worse than one human casually browsing. Again, wget can do this relatively simply: --header="Accept-Encoding: gzip"
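You can sanity-check that saving locally. The sample "page" below is synthetic and highly repetitive, so it compresses far better than real HTML would, but it shows the mechanism:

```shell
# Generate ~100 KB of repetitive HTML-ish text and gzip it. This is a
# synthetic sample, not a real wordpress.org page; real pages compress
# less dramatically, but the principle is the same.
tmp=$(mktemp -d)
yes '<li><a href="/plugins/example/">Example Plugin</a> 1,000+ installs</li>' \
  | head -c 100000 > "$tmp/page.html"
gzip -c "$tmp/page.html" > "$tmp/page.html.gz"
orig=$(wc -c < "$tmp/page.html")
comp=$(wc -c < "$tmp/page.html.gz")
echo "uncompressed: $orig bytes, gzipped: $comp bytes"
rm -r "$tmp"
```

One caveat: when you set the Accept-Encoding header by hand, wget saves the response body exactly as it was received, so you'll likely need to gunzip the downloaded files afterwards.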
Don't Lie: Don't conceal who you are. Don't obfuscate your traffic patterns. Don't coerce it into working via false referer values, request times, or any other sort of forgery. That is alarming behavior -- if they catch it they should be suspicious. Do the opposite instead --
Tell Them What You're Doing: Set your user-agent to something they can look up. Seeing it in the logs will prompt them to search for your 'metallica-spider' or whatever it is, and they'll discover its limited scope and harmless intent.
A more polite spider which flags itself, limits impact and bandwidth, and obeys robots.txt:
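Something along these lines would do it -- the spider name and target URL below are illustrative assumptions, not anything specific from this thread, so substitute your own:

```shell
# Sketch of a polite, self-identifying retrieval following the rules above.
# The user-agent string and URL are examples only. The command is echoed
# rather than executed so it can be inspected first.
set -- wget \
  --wait=2 \
  --limit-rate=50k \
  --user-agent="metallica-spider/1.0 (+https://example.com/spider-info)" \
  --header="Accept-Encoding: gzip" \
  "https://wordpress.org/plugins/browse/popular/"

printf 'would run: %s\n' "$*"
# To actually fetch, replace the printf line with:  "$@"
```

Note that wget only consults robots.txt during recursive retrievals; for a fixed list of URLs like this, obeying the exclusions is on you.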
Last edited by Corona688; 10-08-2015 at 04:14 PM..