KinoSearch1::Analysis::Tokenizer(3pm)    User Contributed Perl Documentation    KinoSearch1::Analysis::Tokenizer(3pm)

NAME
KinoSearch1::Analysis::Tokenizer - customizable tokenizing
SYNOPSIS
my $whitespace_tokenizer
    = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );
# or...
my $word_char_tokenizer
    = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );
# or...
my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;
# then... once you have a tokenizer, put it into a PolyAnalyzer
my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );
DESCRIPTION
Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".
# before:
my $string = "three blind mice";
# after:
@tokens = qw( three blind mice );
KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of "token_re".
# before:
my $string = "Eats, Shoots and Leaves.";
# tokenized by $whitespace_tokenizer
@tokens = qw( Eats, Shoots and Leaves. );
# tokenized by $word_char_tokenizer
@tokens = qw( Eats Shoots and Leaves );
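The effect of "token_re" can be reproduced with a plain global match, since a
tokenizer simply collects every substring the pattern matches. This sketch uses
no KinoSearch1 code, only the same two patterns from the SYNOPSIS:

```perl
use strict;
use warnings;

my $string = "Eats, Shoots and Leaves.";

# Emulate the whitespace tokenizer: each run of non-space chars is a token.
my @ws_tokens = $string =~ /\S+/g;
# ( 'Eats,', 'Shoots', 'and', 'Leaves.' )

# Emulate the word-char tokenizer: each run of word chars is a token.
my @wc_tokens = $string =~ /\w+/g;
# ( 'Eats', 'Shoots', 'and', 'Leaves' )
```

Note that the whitespace pattern keeps the punctuation attached to the words,
while the word-char pattern discards it.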
METHODS
new
# match "O'Henry" as well as "Henry" and "it's" as well as "it"
my $token_re = qr/
    \b        # start with a word boundary
    \w+       # Match word chars.
    (?:       # Group, but don't capture...
        '\w+  # ... an apostrophe plus word chars.
    )?        # Matching the apostrophe group is optional.
    \b        # end with a word boundary
/xsm;
my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
token_re => $token_re, # default: what you see above
);
Constructor. Takes one hash-style parameter.
o token_re - must be a pre-compiled regular expression matching one token.
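To see what the default pattern matches, it can be exercised outside the class
with an ordinary global match. This sketch reproduces the default "token_re"
verbatim and applies it to a string containing apostrophes:

```perl
use strict;
use warnings;

# The default token_re, reproduced here purely for illustration.
my $token_re = qr/
    \b        # start with a word boundary
    \w+       # match word chars
    (?:       # group, but don't capture...
        '\w+  # ... an apostrophe plus word chars
    )?        # the apostrophe group is optional
    \b        # end with a word boundary
/xsm;

my @tokens = "O'Henry said it's fine" =~ /$token_re/g;
# ( "O'Henry", 'said', "it's", 'fine' )
```

A bare `qr/\w+/` would instead split "O'Henry" into "O" and "Henry"; the
optional apostrophe group is what keeps contractions and such names whole.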
COPYRIGHT
Copyright 2005-2010 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch1 version 1.00.
perl v5.14.2 2011-11-15 KinoSearch1::Analysis::Tokenizer(3pm)
KinoSearch1::Analysis::PolyAnalyzer(3pm)    User Contributed Perl Documentation    KinoSearch1::Analysis::PolyAnalyzer(3pm)

NAME
KinoSearch1::Analysis::PolyAnalyzer - multiple analyzers in series
SYNOPSIS
my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
language => 'es',
);
# or...
my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
analyzers => [
$lc_normalizer,
$custom_tokenizer,
$snowball_stemmer,
],
);
DESCRIPTION
A PolyAnalyzer is a series of Analyzers -- objects which inherit from KinoSearch1::Analysis::Analyzer -- each of which will be called upon
to "analyze" text in turn. You can either provide the Analyzers yourself, or you can specify a supported language, in which case a
PolyAnalyzer consisting of an LCNormalizer, a Tokenizer, and a Stemmer will be generated for you.
Supported languages:
en => English,
da => Danish,
de => German,
es => Spanish,
fi => Finnish,
fr => French,
it => Italian,
nl => Dutch,
no => Norwegian,
pt => Portuguese,
ru => Russian,
sv => Swedish,
CONSTRUCTOR
new()
my $analyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
language => 'en',
);
Construct a PolyAnalyzer object. If the parameter "analyzers" is specified, it will override "language" and no attempt will be made to
generate a default set of Analyzers.
o language - Must be an ISO code from the list of supported languages.
o analyzers - Must be an arrayref. Each element in the array must inherit from KinoSearch1::Analysis::Analyzer. The order of the
analyzers matters. Don't put a Stemmer before a Tokenizer (can't stem whole documents or paragraphs -- just individual words), or a
Stopalizer after a Stemmer (stemmed words, e.g. "themselv", will not appear in a stoplist). In general, the sequence should be:
normalize, tokenize, stopalize, stem.
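The ordering constraint can be illustrated with a toy pipeline of code refs,
each transforming the output of the previous stage. These subs are simplified
stand-ins, not KinoSearch1 classes, but they show why stopalizing must follow
tokenizing: the stoplist is checked against individual tokens, not raw text.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for LCNormalizer, Tokenizer, and Stopalizer.
my $normalize = sub { [ map { lc } @{ $_[0] } ] };
my $tokenize  = sub { [ map { split /\W+/ } @{ $_[0] } ] };
my %stoplist  = map { $_ => 1 } qw( the and );
my $stopalize = sub { [ grep { !$stoplist{$_} } @{ $_[0] } ] };

# Apply the stages in sequence: normalize, tokenize, stopalize.
my @pipeline = ( $normalize, $tokenize, $stopalize );
my $tokens   = ["The cat and the hat"];
$tokens = $_->($tokens) for @pipeline;
# ( 'cat', 'hat' )
```

Swapping $stopalize ahead of $normalize would leave "The" in the output, since
the stoplist holds lowercased entries; the same class of bug is what the
ordering advice above guards against.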
COPYRIGHT
Copyright 2005-2010 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch1 version 1.00.
perl v5.14.2 2011-11-15 KinoSearch1::Analysis::PolyAnalyzer(3pm)