Popular Plugins && SVN


 
# 1  
Old 10-08-2015

Forgive me if this is a rather stupid question, but I am writing some Python code for a project at work and was wondering whether there is a way to download the entire "Popular" plugins directory in a single shot. I currently have to look at 34 pages:

https://wordpress.org/plugins/browse/popular/
https://wordpress.org/plugins/browse/popular/2
...

and parse the data that way, but that is really inefficient. I see that I can check things out via SVN:

https://wordpress.org/plugins/about/svn/

and I can see all the plugins as a whole via:

https://plugins.svn.wordpress.org/

But is there a better way that I can accomplish my goal?

Thank you in advance!

---------- Post updated 10-08-15 at 10:03 AM ---------- Previous update was 10-07-15 at 03:44 PM ----------

I guess I should clarify what I am asking. I have already written my Python code to scrape what I need from the plugin pages, and it works great, but at my job they think it's inefficient to send that many HTTP GET requests to the site. They would prefer that I obtain all of the plugin information some other way and then simply parse what I need out of the results. I can download all the plugins via SVN:

svn co https://plugins.svn.wordpress.org/

but I would prefer to download only the plugins that are considered popular, i.e. what you see at:

https://wordpress.org/plugins/browse/popular/

Can I do an svn checkout of just the popular directory? For example:

svn co https://popular.plugins.svn.wordpress.org

I know the above URL doesn't exist, but is there a real one like it?
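For reference, here is a stripped-down sketch of what my current scraper does (simplified and illustrative -- the URL pattern and the regex are approximations, and the real script does more):

Code:
import re
import time
import random
import urllib.request

BASE = "https://wordpress.org/plugins/browse/popular/"

slugs = set()
for page in range(1, 35):  # the 34 "popular" listing pages
    url = BASE if page == 1 else "%spage/%d/" % (BASE, page)
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", "replace")
    # collect plugin slugs from links like wordpress.org/plugins/<slug>/
    slugs.update(re.findall(r"wordpress\.org/plugins/([^/\"']+)", html))
    time.sleep(random.uniform(1, 5))  # random delay between requests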

Thank you in advance!

# 2  
Old 10-08-2015
Thank you for the clarification, that helps.

I see there are 34 pages of 'popular' plugins. If all you fetch is each page itself, 34 downloads isn't that excessive; some single pages have that many sub-items to load anyway.

I don't know the details of SVN, but I can help you get data out of those pages...

To get you started:

Code:
# fetch the 34 'popular' listing pages and reduce them to a sorted, unique list of plugin slugs
$ wget "https://wordpress.org/plugins/browse/popular/" https://wordpress.org/plugins/browse/popular/page/{2..34}/ -O - 2>/dev/null |
        grep -o "wordpress.org/plugins/[^/\"']*" |
        awk '{ sub(/wordpress[.]org[/]plugins[/]/, ""); } $1 && !/[.]php$/' | sort -u > plugins.txt

# 3  
Old 10-08-2015
As always, thank you for your reply.

So you think scraping 34 pages, using random request times, isn't that bad. I thought the same, but management didn't like the idea of creating noisy traffic to their website periodically and possibly being viewed as suspicious and blacklisted. Your wget command is interesting, but in a nutshell it is doing essentially what my Python script does. I was hoping that WordPress had a popular-plugins directory that I could download in one shot and then work with locally.

---------- Post updated at 01:14 PM ---------- Previous update was at 12:42 PM ----------

Another site suggested downloading the popular plugins locally:

svn co https://plugins.svn.wordpress.org/your-plugin-name my-local-dir

but after I did the math, that would require roughly 1,000 svn co requests up front (one per popular plugin), plus checking for diffs thereafter, which isn't any more efficient than just scraping the 34 popular-plugin pages for this project. The holy grail would be a single location where the popular plugins reside.
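To spell out why that math doesn't work, the SVN route would look roughly like this sketch (illustrative -- it assumes plugins.txt holds the slug list from the scrape):

Code:
import subprocess

# assumed input: plugins.txt with one popular-plugin slug per line
with open("plugins.txt") as f:
    slugs = [line.strip() for line in f if line.strip()]

# ~1,000 checkouts on the first run alone...
for slug in slugs:
    subprocess.run(
        ["svn", "co", "https://plugins.svn.wordpress.org/" + slug, slug],
        check=True,
    )
# ...then an 'svn up' and a diff in every working copy on each later run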
# 4  
Old 10-08-2015
Quote:
Originally Posted by metallica1973
As always, thank you for your reply.

So you think scraping 34 pages, using random request times, isn't that bad.
I think it's OK to do that once. If a determined human is allowed to do it by hand, a standalone script is no different. You hadn't mentioned doing it periodically, though -- for that, there are standards and rules.

Your program is nothing more or less than an ordinary, ubiquitous web spider. Does wordpress.org militate against them? Check http://wordpress.org/robots.txt:

Code:
User-agent: *
Disallow: /search
Disallow: /support/search.php
Disallow: /extend/plugins/search.php
Disallow: /plugins/search.php
Disallow: /extend/themes/search.php
Disallow: /themes/search.php
Disallow: /support/rss
Disallow: /archive/

They have places spiders clearly aren't supposed to go, mostly their search functions (probably slow and expensive). Their flat listing of popular plugins is not excluded -- wget, which consults robots.txt when it crawls, would have refused to touch it otherwise; it's polite like that.

They could also have refused wget's very well-known user-agent value, and haven't. They don't check referer values either, and haven't obfuscated anything with logins, javascript, or cookie nonsense. They don't even have advertising to be annoyed about humans never seeing. As far as I'm concerned it's in the clear: you are inside the marked areas, not doing anything obviously wrong, on a site with no ads or interactive restrictions. If you don't impact their performance and don't do anything suspicious, why would they care? Handing out information is what their web site is for.

The rules for polite spiders, as I see them:
  1. Obey The Signs: If robots.txt says go away, go away. wget obeys this when it crawls.
  2. Don't Abuse It: Don't connect too often, open too many connections at once, or consume traffic too fast. Again, wget is helpful here: --wait=seconds to pause between retrievals, --limit-rate=50k to cap download speed at 50 kilobytes per second, etc. wget also bundles retrievals going to the same domain name: 34 noisy connections become one long one. I bet the wget version is already far more polite than your python one.
  3. Reduce The Load: Only download what you need, as politely as you can. Avoid recursive or open-ended retrievals. Since we're only downloading plain text, we can get a huge improvement from compression -- each 100-kilobyte page shrinks to about 15 kilobytes. That makes an entire scrape as small as four or five ordinary pageviews and saves nearly 3 megabytes of their traffic. That's small enough to be no worse than one human casually browsing the site, even a few times per day. Again, wget can do this relatively simply: --header="Accept-Encoding: gzip"
  4. Don't Lie: Don't conceal who you are. Don't obfuscate your traffic patterns. Don't coerce the site into working via false referer values, faked request times, or any other sort of forgery. That is alarming behavior -- if they catch it, they should be suspicious. Do the opposite instead --
  5. Tell Them What You're Doing: Set your user-agent to something they can look up. Seeing it in the logs will prompt them to search for 'metallica-spider' or whatever you call it, and they will discover its limited scope and harmless intent.

A more polite spider which flags itself, limits impact and bandwidth, and obeys robots.txt:

Code:
# identify ourselves, pause between pages, cap bandwidth, and request compressed pages
wget -U popular-checker --wait=1 --limit-rate=50K --header="Accept-Encoding: gzip" \
        "https://wordpress.org/plugins/browse/popular/" https://wordpress.org/plugins/browse/popular/page/{2..34}/ -O - 2>/dev/null |
        gunzip |
        grep -o "wordpress.org/plugins/[^/\"']*" |
        awk '{ sub(/wordpress[.]org[/]plugins[/]/, ""); } $1 && !/[.]php$/' | sort -u > plugins.txt
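If you'd rather keep it in python, the same manners might look roughly like this sketch (the helper name and structure are mine, not any official API -- adapt it to your script):

Code:
import gzip
import time
import urllib.request
import urllib.robotparser

UA = "popular-checker"  # rule 5: an identifiable user-agent

# rule 1: read the signs before fetching anything
robots = urllib.robotparser.RobotFileParser("https://wordpress.org/robots.txt")
robots.read()

def polite_get(url):
    # refuse outright if robots.txt disallows this path
    if not robots.can_fetch(UA, url):
        raise RuntimeError("robots.txt disallows " + url)
    req = urllib.request.Request(url, headers={
        "User-Agent": UA,
        "Accept-Encoding": "gzip",  # rule 3: reduce the load
    })
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
        gzipped = resp.headers.get("Content-Encoding") == "gzip"
    if gzipped:
        data = gzip.decompress(data)
    time.sleep(1)  # rule 2: wait between retrievals
    return data.decode("utf-8", "replace")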


# 5  
Old 10-08-2015
I don't see any API on the site itself to get this information directly, no. The only thing close is the "popular import plugin" list, which retrieves the small fixed list WordPress uses internally, not the great big 34-page one.


9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

SFTP Shell Script Get & Delete && Upload & Delete

Hi All, Do you have any sample script, - auto get file from SFTP remote server and delete file in remove server after downloaded. - only download specify filename - auto upload file from local to SFTP remote server and delete local folder file after uploaded - only upload specify filename ... (3 Replies)
Discussion started by: weesiong
3 Replies

2. Shell Programming and Scripting

GNU & BSD Makefile Directives & Conditions Compatibility

Firstly, I would like to apologize if this is not the appropriate sub-forum to post about GNU/BSD makefile scripting. Though my code is in C++, because I am focusing on the makefile I thought it would go better in shell scripting. Please correct me if I am wrong. Secondly, I am not interested in... (0 Replies)
Discussion started by: AntumDeluge
0 Replies

3. Shell Programming and Scripting

Sort a the file & refine data column & row format

cat file1.txt field1 "user1": field2:"data-cde" field3:"data-pqr" field4:"data-mno" field1 "user1": field2:"data-dcb" field3:"data-mxz" field4:"data-zul" field1 "user2": field2:"data-cqz" field3:"data-xoq" field4:"data-pos" Now i need to have the date like below. i have just... (7 Replies)
Discussion started by: ckaramsetty
7 Replies

4. Shell Programming and Scripting

Replace & sign to &amp word

Hi, I have text file abc.txt. In this file, I have the following data. Input: Mr Smith &amp Mrs Smith Mr Smith &apos Mrs Smith Mr Smith & Mrs Smith Mr Smith& Mrs Smith Mr Smith &Mrs Smith Output: Mr Smith &amp Mrs Smith Mr Smith &apos Mrs Smith Mr Smith &amp Mrs Smith Mr Smith&amp... (4 Replies)
Discussion started by: naveed
4 Replies

5. Shell Programming and Scripting

replace & with & xml file

Hello All I have a xml file with many sets of records like this <mytag>mydata</mytag> <tag2>data&</tag2> also same file can be like this <mytag>mydata</mytag> <tag2>data&</tag2> <tag3>data2&amp;data3</tag3> Now i can grep & and replace with &amp; for whole file but it will replace all... (4 Replies)
Discussion started by: lokaish23
4 Replies

6. Shell Programming and Scripting

PHP read large string & split in multidimensional arrays & assign fieldnames & write into MYSQL

Hi, I hope the title does not scare people to look into this thread but it describes roughly what I'm trying to do. I need a solution in PHP. I'm a programming beginner, so it might be that the approach to solve this, might be easier to solve with an other approach of someone else, so if you... (0 Replies)
Discussion started by: lowmaster
0 Replies

7. Shell Programming and Scripting

Find & Replace string in multiple files & folders using perl

find . -type f -name "*.sql" -print|xargs perl -i -pe 's/pattern/replaced/g' this is simple logic to find and replace in multiple files & folders Hope this helps. Thanks Zaheer (0 Replies)
Discussion started by: Zaheer.mic
0 Replies

8. UNIX for Dummies Questions & Answers

Problem with xterm & tcsh & sourcing a script in a single command

Hi friends, I have a script that sets the env variable path based on different conditions. Now the new path variable setting should not done in the same terminal or same shell. Only a new terminal or new shell should have the new path env variable set. I am able to do this only as follows: >cd... (1 Reply)
Discussion started by: sowmya005
1 Replies

9. UNIX for Dummies Questions & Answers

Search for & edit rows & columns in data file and pipe

Dear unix gurus, I have a data file with header information about a subject and also 3 columns of n rows of data on various items he owns. The data file looks something like this: adam peter blah blah blah blah blah blah car 01 30 200 02 31 400 03 57 121 .. .. .. .. .. .. n y... (8 Replies)
Discussion started by: tintin72
8 Replies