RedHat 9 (Linux i386) - man page for htmlparse (redhat section n)

htmlparse(n)				   HTML Parser				     htmlparse(n)

       htmlparse - Procedures to parse HTML strings

       package require Tcl 8.2

       package require struct 1

       package require cmdline 1.1

       package require htmlparse ?0.3?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

       The htmlparse package provides commands that allow libraries and applications to parse
       HTML in a string into a representation of their choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
	      This command is the basic parser for HTML. It takes an HTML string, parses it and
	      invokes a command prefix for every tag encountered. The HTML does not have to be
	      valid for the parser to function; checking validity is the responsibility of the
	      command invoked for every tag. That command is also responsible for handling tag
	      attributes and character entities (escaped characters). The parser passes the
	      un-interpreted tag attributes to the invoked command to aid in the former, and the
	      package provides a helper command, ::htmlparse::mapEscapes, to aid in the latter.
	      The parser ignores leading DOCTYPE declarations and all valid HTML comments it
	      encounters.

	      All information beyond the HTML string itself is specified via options, which are
	      explained below.

	      Some background information about the parser helps in understanding the options.

	      It is capable of detecting incomplete tags in the HTML string given to it. Under
	      normal circumstances this will cause the parser to throw an error, but if the
	      option -incvar is used to specify a global (or namespace) variable, the parser will
	      store the incomplete part of the input into this variable instead. This aids
	      greatly in the handling of incrementally arriving HTML, as the parser will handle
	      whatever it can and defer the handling of the incomplete part until more data has
	      arrived.
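
	      As an illustration, here is a minimal sketch of feeding HTML in two chunks with
	      -incvar, assuming the tcllib htmlparse package is installed. The callback
	      ::collectTag and the variables ::tags and ::pending are hypothetical names, not
	      part of the package. The sketch puts the saved remainder back in front of the next
	      chunk itself and clears the variable before each call, so it does not depend on
	      whether the parser does this on its own.

```tcl
package require htmlparse

# Hypothetical callback: collect the name of every tag the parser reports.
proc ::collectTag {tag slash param textBehindTheTag} {
    lappend ::tags $slash$tag
}

set ::tags    {}
set ::pending {}   ;# receives the incomplete tail of each chunk

# Two chunks of one document; the first ends in the middle of the <b> tag.
foreach chunk [list {<p>Hello <b} {>world</b></p>}] {
    # Put the saved tail back in front of the new data and reset the
    # variable before parsing, so nothing is lost or processed twice.
    set input $::pending$chunk
    set ::pending {}
    ::htmlparse::parse -cmd ::collectTag -incvar ::pending $input
}

puts $::tags
```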

	      Another feature of the parser is its two possible modes of operation. The normal
	      mode is active if the option -queue is not present on the command line invoking
	      the parser. If it is present, the parser goes into incremental mode instead.

	      The  main  difference  is  that a parser in normal mode will immediately invoke the
	      command prefix for each tag it encounters. In incremental mode however  the  parser
	      will  generate  a  number  of scripts which invoke the command prefix for groups of
	      tags in the HTML string and then store these scripts in the specified queue. It  is
	      then  the responsibility of the caller of the parser to ensure the execution of the
	      scripts in the queue.

	      Note: The queue object given to the parser has to provide the same interface as the
	      queue defined in tcllib -> struct. This means, for example, that all queues created
	      via that tcllib module can be immediately used here. Still, the queue doesn't  have
	      to come from tcllib -> struct as long as the same interface is provided.

	      In both modes the parser will return an empty string to the caller.

	      The  -split option may be given to a parser in incremental mode to specify the size
	      of the groups it creates. In other words, -split 5 means that each of the generated
	      scripts will invoke the command prefix for 5 consecutive tags in the HTML string. A
	      parser in normal mode will ignore this option and its value.

	      The option -vroot specifies a virtual root tag. A parser in normal mode will invoke
	      the command prefix for it immediately before and after it processes the tags in the
	      HTML, thus simulating that the HTML string is enclosed in a <vroot> </vroot>
	      combination. In incremental mode, however, the parser is unable to provide the
	      closing virtual root as it never knows when the input is complete. In this case the
	      first script generated by each invocation of the parser will contain an invocation
	      of the command prefix for the virtual root as its first command.

	      The following options are available:

	      -cmd cmd
		     The command prefix to invoke for every tag in the HTML string. Defaults to
		     ::htmlparse::debugCallback.

	      -vroot tag
		     The virtual root tag to add around the HTML in normal mode.  In  incremental
		     mode  it  is  the first tag in each chunk processed by the parser, but there
		     will be no closing tags. Defaults to hmstart.

	      -split n
		     The size of the groups produced by an incremental mode parser. Ignored  when
		     in normal mode. Defaults to 10. Values <= 0 are not allowed.

	      -incvar var
		     The name of the variable in which to store any incomplete HTML. This makes
		     most sense in incremental mode. The parser will throw an error if it sees
		     incomplete HTML and has no place to store it, which is the sensible behavior
		     in normal mode. Only incomplete tags are detected, not missing tags.
		     Optional, defaults to 'no variable'.
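
	      The effect of -vroot in normal mode can be sketched as follows, assuming the tcllib
	      htmlparse package is installed. The callback ::traceTag, the variable ::trace and
	      the root tag name doc are hypothetical choices made for this illustration. Going
	      by the description above, the first and last recorded entries should belong to the
	      virtual root.

```tcl
package require htmlparse

# Hypothetical callback: record each tag, prefixed with "/" for closing tags.
proc ::traceTag {tag slash param textBehindTheTag} {
    lappend ::trace $slash$tag
}

set ::trace {}

# Normal mode (no -queue): the callback runs while parse executes.
::htmlparse::parse -cmd ::traceTag -vroot doc {<p>text</p>}

puts $::trace
```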

	      Interface to the command prefix
		     In normal mode the parser will invoke the command prefix with four arguments
		     appended. See ::htmlparse::debugCallback for a description.

		     In incremental mode, however, the generated scripts will invoke the command
		     prefix with five arguments appended. The last four of these are the same as
		     those mentioned above. The first is a placeholder string (\win\) for a
		     clientdata value to be supplied later, during the actual execution of the
		     generated scripts. This could be a Tk window path, for example. This allows
		     the user of this package to preprocess HTML strings without committing them
		     to a specific window or other object during parsing; this connection can
		     be made later. It also means that it is possible to cache preprocessed
		     HTML. Of course, nothing prevents the user of the parser from replacing the
		     placeholder with an empty string.
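
	      A minimal sketch of incremental mode, assuming the tcllib struct and htmlparse
	      packages are installed. The callback name ::myRender is hypothetical and is never
	      called here: the sketch only prints the queued scripts, so that the exact form of
	      the generated commands and of the placeholder can be inspected before deciding how
	      to substitute a clientdata value and evaluate them.

```tcl
package require struct      ;# provides ::struct::queue
package require htmlparse

set q [::struct::queue]

# In incremental mode the parser only stores scripts in the queue.
# ::myRender is a hypothetical callback name; nothing invokes it in this sketch.
::htmlparse::parse -cmd ::myRender -queue $q -split 2 {<p>Hello <b>world</b></p>}

set count [$q size]
puts "queued scripts: $count"

# Drain the queue and print each script instead of evaluating it.
while {[$q size] > 0} {
    puts "----"
    puts [$q get]
}

$q destroy
```

	      A real application would replace the placeholder in each script with its own
	      clientdata value and then evaluate the script instead of printing it.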

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
	      This command is the standard callback used by the parser in ::htmlparse::parse if
	      none was specified by the user. It simply dumps its arguments to stdout. This
	      callback can be used for both normal and incremental mode of the calling parser. In
	      other words, it accepts four or five arguments. The last four arguments are
	      described below. The optional first argument contains the clientdata value passed
	      to the callback by a parser in incremental mode. All callbacks have to follow the
	      signature of this command in the last four arguments, and callbacks used in
	      incremental parsing have to follow this signature in all five arguments.

	      The first argument, clientdata, is optional and present only if this command is
	      invoked by a parser in incremental mode. It contains whatever the user of this
	      package wishes.

	      The second argument, tag, contains the name of the tag which is currently processed
	      by the parser.

	      The third argument, slash, is either empty or contains a slash character. It allows
	      the callback to distinguish between opening  (slash  is  empty)  and  closing  tags
	      (slash contains a slash character).

	      The fourth argument, param, contains the un-interpreted list of parameters to the
	      tag.

	      The fifth and last argument, textBehindTheTag, contains the text found by the
	      parser behind the tag named in tag.
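
	      To illustrate the four-argument form used in normal mode, here is a minimal sketch
	      of a custom callback that collects link targets, assuming the tcllib htmlparse
	      package is installed. The callback ::collectLinks and the variable ::links are
	      hypothetical names. The regular expression is a simplification that only handles
	      double-quoted href values; it is not a general attribute parser.

```tcl
package require htmlparse

# Hypothetical normal-mode callback: four arguments, no clientdata.
proc ::collectLinks {tag slash param textBehindTheTag} {
    # Only opening <a> tags are of interest (slash is empty for opening tags).
    if {[string equal $tag a] && [string equal $slash ""]} {
        # param holds the raw, un-interpreted attribute text of the tag.
        if {[regexp {href\s*=\s*"([^"]*)"} $param -> url]} {
            lappend ::links [list $url [string trim $textBehindTheTag]]
        }
    }
}

set ::links {}
::htmlparse::parse -cmd ::collectLinks \
    {<p>See <a href="manual.html">the manual</a>.</p>}

puts $::links
```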

       ::htmlparse::mapEscapes html
	      This command takes an HTML string, substitutes all escape sequences with their
	      actual characters and then returns the resulting string. HTML strings which do not
	      contain escape sequences are returned unchanged.
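
	      A short usage sketch, assuming the tcllib htmlparse package is installed. The
	      variable names raw and plain are chosen for this illustration only.

```tcl
package require htmlparse

# A string containing three common character entities.
set raw   {Tom &amp; Jerry &lt;cartoon&gt;}
set plain [::htmlparse::mapEscapes $raw]

puts $plain
```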

       ::htmlparse::2tree html tree
	      This command is a wrapper around ::htmlparse::parse which takes an HTML string (in
	      html) and converts it into a tree containing the logical structure of the parsed
	      document. The name of the tree is given to the command as its second argument
	      (tree). The command does not generate the tree by itself but expects that the
	      caller provided it with an existing and empty tree. It also expects that the
	      specified tree object follows the same interface as the tree object in tcllib ->
	      struct. It doesn't have to be from tcllib -> struct, but it must provide the same
	      interface.

	      The  internal  callback  does  some  basic  checking  of HTML validity and tries to
	      recover from the most basic errors. The command returns the contents of its  second
	      argument. Side effects are the creation and manipulation of a tree object.

       ::htmlparse::removeVisualFluff tree
	      This command walks a tree as generated by ::htmlparse::2tree and removes all the
	      nodes which represent visual tags and not structural ones. The purpose of the
	      command is to make the tree easier to navigate without getting bogged down in
	      visual information not relevant to the search. Its only argument is the name of the
	      tree to cut down.

       ::htmlparse::removeFormDefs tree
	      Like ::htmlparse::removeVisualFluff, this command is here to cut down on the size of
	      the tree as generated by ::htmlparse::2tree. It removes all nodes representing
	      forms and form elements. Its only argument is the name of the tree to cut down.
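
	      The three tree commands can be combined as in the following minimal sketch, which
	      assumes the tcllib struct and htmlparse packages are installed. The variable names
	      t, full and pruned are hypothetical. Only the node count is compared, because the
	      way node data is read differs between versions of the tcllib tree interface.

```tcl
package require struct      ;# provides ::struct::tree
package require htmlparse

# 2tree expects an existing, empty tree object.
set t [::struct::tree]

::htmlparse::2tree \
    {<html><body><p>Hi <b>there</b></p><form action="search"></form></body></html>} $t
set full [$t size]

# Prune visual tags first, then form definitions.
::htmlparse::removeVisualFluff $t
::htmlparse::removeFormDefs    $t
set pruned [$t size]

puts "nodes before: $full, after: $pruned"

$t destroy
```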

       Keywords: html, parsing

htmlparse				       0.3				     htmlparse(n)