Popular Plugins && SVN
Posted by Corona688 on Thursday, October 8th 2015, 02:24 PM
Quote:
Originally Posted by metallica1973
As always, thank you for your reply.

So scraping 34 pages, using random request times, you think isn't that bad.
I think it's OK to do that once. If a determined human is allowed to do it by hand, a one-off script is no different. You hadn't mentioned doing it periodically, though -- for that, there are standards and rules.

Your program is nothing more or less than an ordinary, ubiquitous web spider. Does wordpress.org object to them? Check http://wordpress.org/robots.txt:

Code:
User-agent: *
Disallow: /search
Disallow: /support/search.php
Disallow: /extend/plugins/search.php
Disallow: /plugins/search.php
Disallow: /extend/themes/search.php
Disallow: /themes/search.php
Disallow: /support/rss
Disallow: /archive/

They have places spiders clearly aren't supposed to go, mostly their search functions (probably slow and expensive). Their flat listing of popular plugins is not excluded -- a recursive wget would have flat-out refused to fetch it if it were; it's polite like that.
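
If you want to double-check a path yourself rather than trusting the tool, here is a rough sketch. It only does prefix matching against Disallow lines and ignores User-agent sections and wildcards, so treat it as a quick sanity check, not a real robots.txt parser:

Code:
# Rough check: does any Disallow line prefix-match the path we plan to fetch?
# (Ignores User-agent grouping and wildcard rules -- not a full parser.)
wget -qO- https://wordpress.org/robots.txt | awk -v path="/plugins/browse/popular/" '
        /^Disallow:/ && index(path, $2) == 1 { print "excluded by: " $0; found = 1 }
        END { if (!found) print "path not excluded" }'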

They could have also refused wget's very well-known user-agent value and haven't. They don't object to referer values either, and haven't obfuscated anything with logins, JavaScript, or cookie nonsense. They don't even have advertising to be annoyed about humans never seeing. As far as I'm concerned it's in the clear. You are inside the marked areas, not doing anything obviously wrong, on a site with no ads or interactive restrictions. If you don't impact their performance and don't do anything suspicious, why would they care? Downloading information is what their web site is for.

The rules for polite spiders, as I see them:
  1. Obey The Signs: If robots.txt says go away, go away. wget obeys this when crawling recursively.
  2. Don't Abuse It: Don't connect too often, connect too many times at once, or consume traffic too fast. Again, wget is helpful here: --wait=seconds to wait between retrievals, --limit-rate=50k to limit download speed to 50 kilobytes per second, etc. wget also reuses the same connection for retrievals going to the same domain name: 34 noisy retrievals become one long one. I bet the wget version is already far more polite than your Python one.
  3. Reduce The Load: Only download what you need, as politely as you can. Avoid recursive or open-ended retrievals. Since we're only downloading plain text, we can get a huge improvement from compression -- each 100-kilobyte page shrinks to about 15 kilobytes. That makes an entire scrape roughly the size of five ordinary pageviews and saves close to 3 megabytes of their traffic. That's worth doing if you consider viewing wordpress.org five times per day a reasonable amount of traffic -- it's small enough it's no worse than one human casually browsing. Again, wget can do this relatively simply: --header="Accept-Encoding: gzip" (a quick way to measure the savings yourself is sketched just after this list).
  4. Don't Lie: Don't conceal who you are. Don't obfuscate your traffic patterns. Don't coerce it into working via false referer values, request times, or any other sort of forgery. That is alarming behavior -- if they catch it, they should be suspicious. Do the opposite instead --
  5. Tell Them What You're Doing: Set your user-agent to something they can look up. Seeing it in the logs will prompt them to search for your 'metallica-spider' or whatever it is, and they will discover its limited scope and harmless intent.
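
If you want to see the compression savings for yourself, here is a quick before-and-after comparison of a single page (the numbers will vary, and if the server ignores the Accept-Encoding header both counts come out the same):

Code:
# Fetch one listing page plain and gzip-compressed, then compare byte counts.
URL="https://wordpress.org/plugins/browse/popular/"
plain=$(wget -qO- "$URL" | wc -c)
gzipped=$(wget -qO- --header="Accept-Encoding: gzip" "$URL" | wc -c)
echo "plain: $plain bytes, gzipped: $gzipped bytes"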

A more polite spider which identifies itself, limits its impact and bandwidth, and stays within what robots.txt allows:

Code:
# Fetch all 34 listing pages in one polite wget run (identified user-agent,
# 1-second waits, 50 KB/s cap, gzip transfer), then extract unique plugin slugs.
wget -U popular-checker --wait=1 --limit-rate=50K --header="Accept-Encoding: gzip" \
        "https://wordpress.org/plugins/browse/popular/" https://wordpress.org/plugins/browse/popular/page/{2..34}/ -O - 2>/dev/null |
        gunzip |
        grep -o "wordpress.org/plugins/[^/\"']*" |
        awk '{ sub(/wordpress[.]org[/]plugins[/]/, ""); } $1 && !/[.]php$/' | sort -u > plugins.txt
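
If you do end up running this on a schedule, keep it modest -- once a day is plenty for a "popular plugins" list that barely changes hour to hour. A minimal sketch, assuming you save the pipeline above as a script; the path /usr/local/bin/popular-plugins.sh is just an example name:

Code:
# Example crontab entry: run the scrape once a day at 03:17 local time.
# Assumes the wget pipeline above was saved as /usr/local/bin/popular-plugins.sh
# and writes plugins.txt itself.
17 3 * * * /usr/local/bin/popular-plugins.sh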

