wget crawl website by extracting links


 
# 1  
Old 03-16-2011

I am using wget to crawl a website using the following command:

Code:
wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupidsite.com

What I have found is that after two days of crawling, some links are still not downloaded. For example, if a page has 10 anchor links in it, some of them are downloaded but most are not. What I want is for wget to crawl by actually extracting the links from each page and then following them. To put it clearly:
I start with page1.html, which has 10 hyperlinks. I save page1.html locally, extract all of its hyperlinks, then visit those hyperlinks one by one and keep downloading pages based on the hyperlinks found in each new page. I want to limit the crawl to one external site, or else I'll run out of disk space. Is there any way of doing this? Hope this makes sense.
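What I have in mind is something along these lines; this is only a rough sketch (the URLs are placeholders for my site), assuming GNU grep/sed and pages whose links are absolute http:// URLs:

Code:
#!/bin/sh
# Fetch the start page, pull out its <a href> targets, and download each
# one with the same politeness settings, staying on a single host.
START="http://www.stupidsite.com/page1.html"   # placeholder start page
DOMAIN="www.stupidsite.com"                    # only follow links on this host

wget -q -O page1.html "$START"

grep -o 'href="[^"]*"' page1.html |
  sed 's/^href="//; s/"$//' |
  grep "^http://$DOMAIN" |
  while read -r url; do
      wget --wait=20 --limit-rate=20K -p -U Mozilla "$url"
  done

That only goes one level deep, though; I'd have to repeat the extraction on every downloaded page, which is why I'm hoping wget can do it for me.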
# 2  
Old 03-16-2011
Look at the links that are crawled and the links that aren't; maybe there's a pattern.

You may also want to use wget's -k option, which converts the links in downloaded pages to point at the local copies, so they don't end up as dead references after retrieval.
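Something along these lines might work as a starting point (a rough sketch only, with your URL plugged in; the -H/-D pair is only needed if you want wget to span to one named extra site, since plain -r already stays on the starting host):

Code:
# Recursive, depth-limited crawl that converts links for local browsing.
# -r recurse, -l 5 cap the depth, -k convert links, -p grab page requisites,
# -np don't climb above the starting directory.
wget --wait=20 --limit-rate=20K -r -l 5 -k -p -np \
     -U Mozilla http://www.stupidsite.com/

# Same thing, but also allowed to follow links into exactly one other site:
wget --wait=20 --limit-rate=20K -r -l 5 -k -p -np \
     -H -D stupidsite.com,othersite.com \
     -U Mozilla http://www.stupidsite.com/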