kinosearch1::analysis::token(3pm) [debian man page]

KinoSearch1::Analysis::Token(3pm)			User Contributed Perl Documentation			 KinoSearch1::Analysis::Token(3pm)

NAME

       KinoSearch1::Analysis::Token - unit of text

SYNOPSIS

	   # private class - no public API

PRIVATE CLASS

       You can't actually instantiate a Token object at the Perl level -- however, you can affect individual Tokens within a TokenBatch by way of
       TokenBatch's (experimental) API.

DESCRIPTION

       Token is the fundamental unit used by KinoSearch1's Analyzer subclasses.  Each Token has four attributes: text, start_offset, end_offset, and
       pos_inc (for position increment).

       The text of a token is a string.

       A Token's start_offset and end_offset locate it within a larger text, even if the Token's text attribute gets modified -- by stemming, for
       instance.  The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after
       stemming, the text is "beat", but the end_offset is still 7.
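
       The offset behavior described above can be sketched in plain Perl.  This is only an illustration: Token is a private class, so the
       hash layout and the helpers make_token() and toy_stem() are hypothetical stand-ins, and the "stemmer" simply strips a trailing "ing".

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Token: a plain hash, not the real private class.
sub make_token {
    my ( $source, $start, $end ) = @_;
    return {
        text         => substr( $source, $start, $end - $start ),
        start_offset => $start,
        end_offset   => $end,
        pos_inc      => 1,
    };
}

# Toy stemmer for illustration only: strips a trailing "ing" from the text.
# It deliberately leaves start_offset and end_offset alone.
sub toy_stem {
    my ($token) = @_;
    $token->{text} =~ s/ing\z//;
    return $token;
}

my $offset_source = "beating a dead horse";
my $offset_token  = toy_stem( make_token( $offset_source, 0, 7 ) );

# The text has changed, but the offsets still cover the original word.
my $offset_original = substr(
    $offset_source,
    $offset_token->{start_offset},
    $offset_token->{end_offset} - $offset_token->{start_offset},
);
print "$offset_token->{text} <- $offset_original\n";    # beat <- beating
```

       Because the offsets survive stemming, they can still be used to locate the original word in the source text, for example when
       highlighting a match.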

       The position increment, which defaults to 1, is an advanced tool for manipulating phrase matching.  Ordinarily, Tokens are assigned
       consecutive position numbers: 0, 1, and 2 for "three blind mice".  However, if you set the position increment for "blind" to, say, 1000,
       then the three tokens will end up assigned to positions 0, 1, and 1001 -- and will no longer produce a phrase match for the query '"three
       blind mice"'.
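
       The position arithmetic in this example can be reproduced with a short plain-Perl sketch.  The rule it uses is an assumption inferred
       from the numbers above rather than taken from the KinoSearch1 source: each token takes the current position, and the counter then
       advances by that token's own pos_inc.  The helpers assign_positions() and is_phrase() are hypothetical.

```perl
use strict;
use warnings;

# Assumed rule, inferred from the example in the text: a token is given the
# current position, then the counter advances by that token's pos_inc.
sub assign_positions {
    my @tokens = @_;
    my $pos    = 0;
    my @positions;
    for my $token (@tokens) {
        push @positions, $pos;
        $pos += $token->{pos_inc};
    }
    return @positions;
}

# A phrase can only match if the positions are consecutive.
sub is_phrase {
    my @positions = @_;
    for my $i ( 1 .. $#positions ) {
        return 0 if $positions[$i] != $positions[ $i - 1 ] + 1;
    }
    return 1;
}

my @default_positions = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1 },
    { text => 'mice',  pos_inc => 1 },
);
my @gapped_positions = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1000 },
    { text => 'mice',  pos_inc => 1 },
);

print "@default_positions\n";    # 0 1 2
print "@gapped_positions\n";     # 0 1 1001
```

       Under this rule the gap opens after "blind", so "three blind" remains adjacent while "blind mice" does not, which is enough to defeat
       the three-word phrase query.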

COPYRIGHT

       Copyright 2006-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15					 KinoSearch1::Analysis::Token(3pm)

KinoSearch1::Analysis::PolyAnalyzer(3pm)		User Contributed Perl Documentation		  KinoSearch1::Analysis::PolyAnalyzer(3pm)

NAME

       KinoSearch1::Analysis::PolyAnalyzer - multiple analyzers in series

SYNOPSIS

	   my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       language  => 'es',
	   );

	   # or...
	   my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       analyzers => [
		   $lc_normalizer,
		   $custom_tokenizer,
		   $snowball_stemmer,
	       ],
	   );

DESCRIPTION

       A PolyAnalyzer is a series of Analyzers -- objects which inherit from KinoSearch1::Analysis::Analyzer -- each of which will be called upon
       to "analyze" text in turn.  You can either provide the Analyzers yourself, or you can specify a supported language, in which case a
       PolyAnalyzer consisting of an LCNormalizer, a Tokenizer, and a Stemmer will be generated for you.

       Supported languages:

	   en => English,
	   da => Danish,
	   de => German,
	   es => Spanish,
	   fi => Finnish,
	   fr => French,
	   it => Italian,
	   nl => Dutch,
	   no => Norwegian,
	   pt => Portuguese,
	   ru => Russian,
	   sv => Swedish,

CONSTRUCTOR

   new()
	   my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       language   => 'en',
	   );

       Construct a PolyAnalyzer object.  If the parameter "analyzers" is specified, it will override "language" and no attempt will be made to
       generate a default set of Analyzers.

       o   language - Must be an ISO code from the list of supported languages.

       o   analyzers - Must be an arrayref.  Each element in the array must inherit from KinoSearch1::Analysis::Analyzer.  The order of the
	   analyzers matters.  Don't put a Stemmer before a Tokenizer (a Stemmer works on individual words, not whole documents or paragraphs),
	   or a Stopalizer after a Stemmer (stemmed words, e.g. "themselv", will not appear in a stoplist).  In general, the sequence should be:
	   normalize, tokenize, stopalize, stem.
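
       Why the order matters can be seen in a plain-Perl sketch.  None of this is KinoSearch1 code: the four toy_* subs and the three-word
       stoplist are hypothetical stand-ins for LCNormalizer, Tokenizer, Stopalizer, and Stemmer, and the toy stemmer only strips a trailing
       "es" or "s".

```perl
use strict;
use warnings;

# Toy stoplist for illustration; real stoplists are much longer.
my %toy_stoplist = map { $_ => 1 } qw( they themselves the );

# Toy stand-ins for the four analysis stages.
sub toy_normalize { my ($text) = @_; return lc $text }
sub toy_tokenize  { my ($text) = @_; my @words = $text =~ /(\w+)/g; return @words }
sub toy_stopalize { return grep { !$toy_stoplist{$_} } @_ }
sub toy_stem_all  { return map { my $w = $_; $w =~ s/(?:es|s)\z//; $w } @_ }

my $pipeline_text = "They wash themselves";
my @pipeline_tokens = toy_tokenize( toy_normalize($pipeline_text) );

# Recommended order: stopalize, then stem.
my @stop_then_stem = toy_stem_all( toy_stopalize(@pipeline_tokens) );

# Wrong order: stemming first turns "themselves" into "themselv",
# which is not in the stoplist and so slips through.
my @stem_then_stop = toy_stopalize( toy_stem_all(@pipeline_tokens) );

print "stopalize, stem: @stop_then_stem\n";    # wash
print "stem, stopalize: @stem_then_stop\n";    # wash themselv
```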

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15				  KinoSearch1::Analysis::PolyAnalyzer(3pm)
