Forgive me if this is a rather stupid question, but I am writing some code in Python for a project at work and was wondering if there is a way to download the entire "Popular" plugins directory in a single shot. I currently have to look at 34 pages:
But is there a better way that I can accomplish my goal?
Thank you in advance!
---------- Post updated 10-08-15 at 10:03 AM ---------- Previous update was 10-07-15 at 03:44 PM ----------
I guess I should clarify what I am asking. I have already written my Python code to scrape what I need from the plugin pages, and it works great, but at my job they think it's inefficient to send that many HTTP GET requests to the site. They would prefer that I obtain all the plugin information another way, then simply search through the results and parse the data that I need. So I can download all the plugins via SVN:
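For reference, a single checkout looks something like this (plugins.svn.wordpress.org is the public plugin repository root; the plugin slug here is just a placeholder):

# check out the current code for one plugin; "some-plugin-slug" is hypothetical
svn co http://plugins.svn.wordpress.org/some-plugin-slug/trunk/ some-plugin-slug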
I see there are 34 pages of 'popular' plugins. If all you get is the page itself, 34 downloads isn't that excessive; some pages have that many sub-items to load anyway.
I don't know the details of SVN, but I can help you get data from those pages...
So scraping 34 pages, using random request times, you think isn't that bad. I thought the same, but management didn't like the idea of creating noisy traffic to their website periodically and possibly being viewed as suspicious and blacklisted. Your wget approach is interesting, but in a nutshell it is doing what my Python script does. I was hoping that wordpress had a popular-plugins directory that I could download in one shot and work with locally.
---------- Post updated at 01:14 PM ---------- Previous update was at 12:42 PM ----------
Another site suggested downloading the popular plugins locally:
but ultimately, after I did the math, that would require 1,000 svn co requests up front (one per popular plugin) and then checking for diffs thereafter, which isn't as efficient as just scraping the 34 popular-plugin pages for this project. A single location where all the popular plugins reside would be the holy grail.
So scraping 34 pages, using random request times, you think isn't that bad.
I think it's OK to do that once. If a determined human is allowed to do it by hand, a standalone program is no different. You hadn't mentioned doing it periodically though -- for that, there are standards and rules.
Your program is nothing more or less than an ordinary, ubiquitous web spider. Does wordpress.org militate against them? Check http://wordpress.org/robots.txt:
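You can pull it down yourself to see what it allows:

# fetch robots.txt and print it to the terminal
wget -q -O - http://wordpress.org/robots.txt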
They have places spiders clearly aren't supposed to go, mostly their search functions (probably slow and expensive). Their flat listing of popular plugins is not excluded -- wget would have flat-out refused to work if it was; it's polite like that.
They could also have refused wget's very well-known user-agent value, and haven't. They don't militate against referer values either, and haven't obfuscated anything with logins, javascript, or cookie nonsense. They don't even have advertising to be annoyed about humans never seeing. As far as I'm concerned it's in the clear: you are inside the marked areas, not doing anything obviously wrong, on a site with no ads or interactive restrictions. If you don't impact their performance and don't do anything suspicious, why would they care? Downloading information is what their web site is for.
The rules for polite spiders, as I see them:
Obey The Signs: If robots.txt says go away, go away. wget obeys this.
Don't Abuse It: Don't connect too often, don't open too many connections at once, and don't consume traffic too fast. Again, wget is helpful here: --wait=seconds waits between retrievals, --limit-rate=50k caps the download speed at 50 kilobytes per second, and so on. wget also bundles retrievals going to the same domain name into one connection: 34 noisy retrievals become one long one. I bet the wget version is already far more polite than your Python one.
Reduce The Load: Only download what you need, as politely as you can. Avoid recursive or open-ended retrievals. Since we're only downloading plain text, we can get a huge improvement from compression -- each 100-kilobyte page shrinks to about 15 kilobytes. That makes an entire scrape about the size of five ordinary pageviews and saves nearly 3 megabytes of their traffic. It's small enough that it's no worse than one human casually browsing wordpress.org a few times a day. Again, wget can do this relatively simply: --header="Accept-Encoding: gzip"
Don't Lie: Don't conceal who you are. Don't obfuscate your traffic patterns. Don't coerce the site into working via false referer values, faked request timing, or any other sort of forgery. That is alarming behavior -- if they catch it, they should be suspicious. Do the opposite instead --
Tell Them What You're Doing: Set your user-agent to something they can look up. Seeing it in the logs will prompt them to search for your 'metallica-spider' or whatever it is, and they will discover its limited scope and harmless intent.
A more polite spider which flags itself, limits impact and bandwidth, and obeys robots.txt:
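Something along these lines -- the /browse/popular/page/N/ URL layout and the user-agent string are guesses on my part, so check them against the real listing:

# 34 pages over one throttled, compressed, clearly labeled connection
# ({1..34} is bash brace expansion; the URL pattern is an assumption)
wget --wait=2 --random-wait \
     --limit-rate=50k \
     --header="Accept-Encoding: gzip" \
     --user-agent="metallica-spider (contact: you@example.com)" \
     "http://wordpress.org/plugins/browse/popular/page/"{1..34}"/"

Note that when you force Accept-Encoding like this, wget saves the gzipped bytes as-is, so run the results through zcat before parsing them.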
I don't see any API on the site itself to directly get this information, no. The only thing close is the "popular import plugin" list, which retrieves the small fixed list wordpress uses internally, not the great big 34-page one.