Forgive me if this is a rather stupid question, but I am writing some code in Python for a project at work and was wondering if there is a way to download the entire "Popular" plugins directory in a single shot. I currently have to look at 34 pages:
But is there a better way that I can accomplish my goal?
Thank you in advance!
---------- Post updated 10-08-15 at 10:03 AM ---------- Previous update was 10-07-15 at 03:44 PM ----------
I guess I should clarify what I am asking. I have already written my Python code to scrape what I need from the plugin pages, and it works great, but at my job they think it's inefficient to send that many HTTP GET requests to the site. They would prefer that I obtain all the plugin information another way, then simply search through the results and parse the data that I need. So I can download all the plugins via SVN:
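For reference, a single checkout looks something like this (plugins.svn.wordpress.org is the public plugin repository root; the plugin slug here is just a placeholder):

# check out the current code for one plugin; "some-plugin-slug" is hypothetical
svn co http://plugins.svn.wordpress.org/some-plugin-slug/trunk/ some-plugin-slug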
I see there are 34 pages of 'popular' plugins. If all you get is the page itself, 34 downloads isn't that excessive; some pages have that many sub-items to load anyway.
I don't know the details of SVN, but I can help you get data from those pages...
So scraping 34 pages, using random request times, you think isn't that bad. I thought the same, but management didn't like the idea of creating noisy traffic to their website periodically and possibly being viewed as suspicious and blacklisted. Your wget approach is interesting, but in a nutshell it is doing what my Python script does. I was hoping that wordpress had a popular-plugins directory that I could download in one shot and work with locally.
---------- Post updated at 01:14 PM ---------- Previous update was at 12:42 PM ----------
Another site suggested downloading the popular plugins locally:
but ultimately, after I did the math, that would require 1,000 svn co requests up front (one per popular plugin) and then checking for diffs thereafter, which isn't as efficient as just scraping the 34 popular-plugin pages for this project. A single location where all the popular plugins reside would be the holy grail.
So scraping 34 pages, using random request times, you think isn't that bad.
I think it's OK to do that once. If a determined human is allowed to do it by hand, a standalone program is no different. You hadn't mentioned doing it periodically though -- for that, there are standards and rules.
Your program is nothing more or less than an ordinary, ubiquitous web spider. Does wordpress.org militate against them? Check http://wordpress.org/robots.txt:
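You can pull it down yourself to see what it allows:

# fetch robots.txt and print it to the terminal
wget -q -O - http://wordpress.org/robots.txt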
They have places spiders clearly aren't supposed to go, mostly their search functions (probably slow and expensive). Their flat listing of popular plugins is not excluded -- wget would have flat-out refused to work if it was; it's polite like that.
They could also have refused wget's very well-known user-agent value, and haven't. They don't militate against referer values either, and haven't obfuscated anything with logins, javascript, or cookie nonsense. They don't even have advertising to be annoyed about humans never seeing. As far as I'm concerned it's in the clear: you are inside the marked areas, not doing anything obviously wrong, on a site with no ads or interactive restrictions. If you don't impact their performance and don't do anything suspicious, why would they care? Downloading information is what their web site is for.
The rules for polite spiders, as I see them:
Obey The Signs: If robots.txt says go away, go away. wget obeys this.
Don't Abuse It: Don't connect too often, don't open too many connections at once, and don't consume traffic too fast. Again, wget is helpful here: --wait=seconds waits between retrievals, --limit-rate=50k caps the download speed at 50 kilobytes per second, and so on. wget also bundles retrievals going to the same domain name into one connection: 34 noisy retrievals become one long one. I bet the wget version is already far more polite than your Python one.
Reduce The Load: Only download what you need, as politely as you can. Avoid recursive or open-ended retrievals. Since we're only downloading plain text, we can get a huge improvement from compression -- each 100-kilobyte page shrinks to about 15 kilobytes. That makes an entire scrape about the size of five ordinary pageviews and saves nearly 3 megabytes of their traffic. It's small enough that it's no worse than one human casually browsing wordpress.org a few times a day. Again, wget can do this relatively simply: --header="Accept-Encoding: gzip"
Don't Lie: Don't conceal who you are. Don't obfuscate your traffic patterns. Don't coerce the site into working via false referer values, faked request timing, or any other sort of forgery. That is alarming behavior -- if they catch it, they should be suspicious. Do the opposite instead --
Tell Them What You're Doing: Set your user-agent to something they can look up. Seeing it in the logs will prompt them to search for your 'metallica-spider' or whatever it is, and they will discover its limited scope and harmless intent.
A more polite spider which flags itself, limits impact and bandwidth, and obeys robots.txt:
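Something along these lines -- the /browse/popular/page/N/ URL layout and the user-agent string are guesses on my part, so check them against the real listing:

# 34 pages over one throttled, compressed, clearly labeled connection
# ({1..34} is bash brace expansion; the URL pattern is an assumption)
wget --wait=2 --random-wait \
     --limit-rate=50k \
     --header="Accept-Encoding: gzip" \
     --user-agent="metallica-spider (contact: you@example.com)" \
     "http://wordpress.org/plugins/browse/popular/page/"{1..34}"/"

Note that when you force Accept-Encoding like this, wget saves the gzipped bytes as-is, so run the results through zcat before parsing them.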
I don't see any API on the site itself to directly get this information, no. The only thing close is the "popular import plugin" list, which retrieves the small fixed list wordpress uses internally, not the great big 34-page one.