webcheck(1) [debian man page]

webcheck(1)							   User Commands						       webcheck(1)

NAME

       webcheck - website link checker

SYNOPSIS

       webcheck [OPTION]...  URL

DESCRIPTION

       webcheck  will  check  the  document at the specified URL for links to other documents, follow these links recursively and generate an HTML
       report.

       -i,  --internal=PATTERN
	      Mark URLs matching the PATTERN (perl-type regular expression) as an internal link.  Can be  used	multiple  times.   Note  that  the
	      PATTERN  is  matched  against  the  full URL.  URLs matching this PATTERN will be considered internal, even if they match one of the
	      --external PATTERNs.

       -x,  --external=PATTERN
	      Mark URLs matching the PATTERN (perl-type regular expression) as an external link.  Can be  used	multiple  times.   Note  that  the
	      PATTERN is matched against the full URL.

       -y, --yank=PATTERN
	      Do not check URLs matching the PATTERN (perl-type regular expression).  This is similar to the -x flag, except that a yanked link
	      is not checked at all, whereas -x checks the link itself but not the links it contains.  Can be used multiple times.  Note that the
	      PATTERN is matched against the full URL.

       -b, --base-only
	      Consider any URL not starting with the base URL to be external.  For example, if you run
		  webcheck -b http://www.example.com/foo
	      then  http://www.example.com/foo/bar  will  be  considered internal whereas http://www.example.com/ will be considered external.	By
	      default all the pages on the site will be considered internal.
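
	      The prefix rule can be sketched in Python as follows.  This is only an illustration of the rule as worded above, not webcheck's
	      actual code; the function name is hypothetical.

```python
def is_internal_base_only(url: str, base_url: str) -> bool:
    """Illustrative -b rule: a URL is internal only if it starts with the base URL."""
    return url.startswith(base_url)


base = "http://www.example.com/foo"
print(is_internal_base_only("http://www.example.com/foo/bar", base))  # True
print(is_internal_base_only("http://www.example.com/", base))         # False
```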

       -a, --avoid-external
	      Avoid external links.  Normally if webcheck is examining an HTML page and it finds a link that points to an  external  document,	it
	      will check to see if that external document exists.  This flag disables that action.

       --ignore-robots
	      Do not retrieve and parse robots.txt files.  By default robots.txt files are retrieved and honored.  Use this option only if you
	      are sure you want to override the webmaster's decision.
	      For more information on robots.txt handling see the NOTES section below.

       -q, --quiet, --silent
	      Do not print out progress as webcheck traverses a site.

       -d, --debug
	      Print debugging information while crawling the site.  This option is mainly useful for developers.

       -o, --output=DIRECTORY
	      Write the reports to DIRECTORY.  The default is the current directory, or the directory specified in config.py.  If the directory
	      does not exist, webcheck will try to create it.

       -c, --continue
	      Try  to  continue  from  a previous run. When using this option webcheck will look for a webcheck.dat in the output directory.  This
	      file is read to restore the state from the previous run.	This allows webcheck to continue a previously interrupted run.	When  this
	      option  is  used,  the  --internal, --external and --yank options will be ignored as well as any URL arguments.  The --base-only and
	      --avoid-external options should be the same as the previous run.
	      Note that this option is experimental and its semantics may change in coming releases (especially in relation to other options).
	      Also note that the stored files are not guaranteed to be compatible between releases.

       -f, --force
	      Overwrite files without asking.  This option is required for running webcheck non-interactively.

       -r, --redirects=N
	      Redirect depth: the maximum number of redirects webcheck should follow for a link.  A value of 0 means all redirects are followed.
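
	      A minimal Python sketch of this limit, assuming redirects are given as a simple mapping from URL to target.  The function name
	      and the loop guard are assumptions made for the illustration; how webcheck itself handles redirect loops is not documented here.

```python
def follow_redirects(url, redirects, limit):
    """Follow entries of the `redirects` mapping for at most `limit` hops.

    A limit of 0 means "no limit", mirroring the -r description.
    Returns the final URL and the number of hops taken.
    """
    seen = {url}
    hops = 0
    while url in redirects and (limit == 0 or hops < limit):
        url = redirects[url]
        hops += 1
        if url in seen:  # Assumption: stop when a redirect loop is detected.
            break
        seen.add(url)
    return url, hops


chain = {"/a": "/b", "/b": "/c", "/c": "/d"}
print(follow_redirects("/a", chain, 2))  # ('/c', 2)
print(follow_redirects("/a", chain, 0))  # ('/d', 3)
```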

       -u, --userpass=URL
	      Specify a URL with username and password information to use for basic authentication when visiting the site.
	      e.g. http://test:secret@example.com/
	      This option may be specified multiple times.
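
	      As an illustration of what such a URL carries, the following Python sketch splits it into host and credentials and builds the
	      corresponding HTTP basic authentication header.  The helper name is hypothetical and this is not necessarily how webcheck
	      processes the option.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(userpass_url):
    """Return (hostname, Authorization header value) for a user:password URL."""
    parts = urlsplit(userpass_url)
    if parts.username is None:
        raise ValueError("URL carries no user name")
    credentials = f"{parts.username}:{parts.password or ''}"
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return parts.hostname, f"Basic {token}"


print(basic_auth_header("http://test:secret@example.com/"))
# ('example.com', 'Basic dGVzdDpzZWNyZXQ=')
```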

       -w, --wait=SECONDS
	      Wait SECONDS between document retrievals.  Usually webcheck will process a URL and immediately move on to the next.  However, on
	      heavily loaded systems it may be desirable to have webcheck pause between requests.  SECONDS can be any non-negative number.

       -v, --version
	      Show version of program.

       -h, --help
	      Show short summary of options.

URL CLASSES

       URLs are divided into two classes:

       Internal URLs are retrieved and the retrieved item is checked for syntax.  Also, the retrieved item is searched for links  to  other  items
       (of any class) and these links are followed.

       External  URLs are only retrieved to test whether they are valid and to gather some basic information from them (title, size, content-type,
       etc).  The retrieved items are not inspected for links to other items.

       Apart from their class, URLs can also be considered yanked (as specified with the --yank or --avoid-external options).  Yanked URLs can
       be either internal or external and will not be retrieved or checked at all.  URLs of unsupported schemes are also considered yanked.
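
       The classification rules can be sketched in Python as follows.  This is an illustration under stated assumptions, not webcheck's
       implementation: the function name is hypothetical, Python's re module stands in for perl-type regular expressions, patterns are
       searched anywhere in the full URL, and "the site" is taken to mean the same host as the base URL.

```python
import re
from urllib.parse import urlsplit


def classify(url, base_url, internal=(), external=(), yank=()):
    """Return (url_class, yanked) for `url`.

    An --internal match wins over an --external match, as stated in the
    option descriptions. Yanking is tracked separately from the class.
    """
    yanked = any(re.search(p, url) for p in yank)
    if any(re.search(p, url) for p in internal):
        return "internal", yanked
    if any(re.search(p, url) for p in external):
        return "external", yanked
    # Assumption: without patterns, same host as the base URL means internal.
    same_site = urlsplit(url).netloc == urlsplit(base_url).netloc
    return ("internal" if same_site else "external"), yanked


base = "http://www.example.com/"
print(classify("http://www.example.com/webcheck/x", base, external=["/webcheck"]))
# ('external', False)
print(classify("http://www.example.com/doc.pdf", base, yank=[r"\.pdf$"]))
# ('internal', True)
```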

EXAMPLES

       Check the site www.example.com but consider any path with "/webcheck" in it to be external.
	   webcheck http://www.example.com/ -x /webcheck

NOTES

       When checking internal URLs webcheck honors the robots.txt file, identifying itself as user-agent webcheck.  Disallowed links will not
       be checked at all, as if the -y option had been specified for that URL.  To allow webcheck to crawl parts of a site that are disallowed
       for other robots, use something like the following in robots.txt:
	   User-agent: *
	   Disallow: /foo

	   User-agent: webcheck
	   Allow: /foo
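
       The effect of such a file can be checked with Python's standard urllib.robotparser module.  This sketch only shows how a
       standards-following parser reads the rules above; webcheck's own parser may differ in details, and the helper name and the
       "otherbot" user-agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /foo

User-agent: webcheck
Allow: /foo
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("webcheck", "http://www.example.com/foo/bar"))  # True
print(allowed("otherbot", "http://www.example.com/foo/bar"))  # False
```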

ENVIRONMENT

       <scheme>_proxy
	      Proxy URL to use for URLs of the given <scheme>.
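
	      A minimal Python sketch of the naming convention, assuming the variable name is simply the URL scheme followed by _proxy.  The
	      helper name and the proxy address are hypothetical.

```python
import os


def proxy_for(url_scheme, environ=os.environ):
    """Look up the <scheme>_proxy variable for a URL scheme, or None if unset."""
    return environ.get(f"{url_scheme}_proxy")


env = {"http_proxy": "http://proxy.example.com:3128"}
print(proxy_for("http", env))  # http://proxy.example.com:3128
print(proxy_for("ftp", env))   # None
```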

REPORTING BUGS

       Bug reports should be sent to the mailing list <webcheck-users@lists.arthurdejong.org>.  More information on reporting bugs can be found on
       the webcheck homepage:
       http://arthurdejong.org/webcheck/

COPYRIGHT

       Copyright (C) 1998, 1999 Albert Hopkins (marduk)
       Copyright (C) 2002 Mike W. Meyer
       Copyright (C) 2005, 2006, 2007, 2008, 2009, 2010 Arthur de Jong
       webcheck is free software; see the source for copying conditions.  There is NO warranty; not even for  MERCHANTABILITY  or  FITNESS  FOR  A
       PARTICULAR PURPOSE.
       The  files  produced  as  output  from the software do not automatically fall under the copyright of the software, unless explicitly stated
       otherwise.

Version 1.10.4							     Sep 2010							       webcheck(1)