wget crawl website by extracting links


 
# 1  
Old 03-16-2011

I am using wget to crawl a website using the following command:

Code:
wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com

What I have found is that after two days of crawling, some links are still not downloaded. For example, if a page has 10 anchor links in it, some of them are downloaded but most are not. What I want is for wget to crawl by actually extracting the links from each page and then following them. To put it clearly:
I start with page1.html, which has 10 hyperlinks. I save page1.html locally, extract all of its hyperlinks, then visit those hyperlinks one by one and keep downloading pages based on the hyperlinks found in each new page. I want to limit the crawl to one external site, or else I'll run out of disk space. Is there any way of doing this? Hope this makes sense.
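What I have in mind is something along these lines; this is only a rough sketch (the URLs are placeholders for my site), assuming GNU grep/sed and pages whose links are absolute http:// URLs:

Code:
#!/bin/sh
# Fetch the start page, pull out its <a href> targets, and download each
# one with the same politeness settings, staying on a single host.
START="http://www.stupidsite.com/page1.html"   # placeholder start page
DOMAIN="www.stupidsite.com"                    # only follow links on this host

wget -q -O page1.html "$START"

grep -o 'href="[^"]*"' page1.html |
  sed 's/^href="//; s/"$//' |
  grep "^http://$DOMAIN" |
  while read -r url; do
      wget --wait=20 --limit-rate=20K -p -U Mozilla "$url"
  done

That only goes one level deep, though; I'd have to repeat the extraction on every downloaded page, which is why I'm hoping wget can do it for me.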
# 2  
Old 03-16-2011
Look at the links that are crawled and the links that aren't; maybe there's a pattern.

You may also want to use wget's -k option, which converts the links in downloaded pages to point at the local copies, so they don't end up as dead references after retrieval.
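Something along these lines might work as a starting point (a rough sketch only, with your URL plugged in; the -H/-D pair is only needed if you want wget to span to one named extra site, since plain -r already stays on the starting host):

Code:
# Recursive, depth-limited crawl that converts links for local browsing.
# -r recurse, -l 5 cap the depth, -k convert links, -p grab page requisites,
# -np don't climb above the starting directory.
wget --wait=20 --limit-rate=20K -r -l 5 -k -p -np \
     -U Mozilla http://www.stupidsite.com/

# Same thing, but also allowed to follow links into exactly one other site:
wget --wait=20 --limit-rate=20K -r -l 5 -k -p -np \
     -H -D stupidsite.com,othersite.com \
     -U Mozilla http://www.stupidsite.com/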