kinosearch1::analysis::stopalizer(3pm) [debian man page]

KinoSearch1::Analysis::Stopalizer(3pm)			User Contributed Perl Documentation		    KinoSearch1::Analysis::Stopalizer(3pm)

NAME

       KinoSearch1::Analysis::Stopalizer - suppress a "stoplist" of common words

SYNOPSIS

	   my $stopalizer = KinoSearch1::Analysis::Stopalizer->new(
	       language => 'fr',
	   );
	   my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       analyzers => [ $lc_normalizer, $tokenizer, $stopalizer, $stemmer ],
	   );

DESCRIPTION

       A "stoplist" is a collection of "stopwords": words which are common enough to be of little value when determining search results.  For
       example, so many documents in English contain "the", "if", and "maybe" that it may improve both performance and relevance to block them.

	   # before
	   @token_texts = ('i', 'am', 'the', 'walrus');

	   # after
	   @token_texts = ('',	'',   '',    'walrus');
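
       The before/after transformation above can be imitated in plain Perl without loading KinoSearch1.  This is only a sketch of the
       visible effect, not the module's implementation: suppress_stopwords() is a hypothetical helper, and the three-word stoplist is
       made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch1): replace every token that
# appears as a key in the stoplist hashref with an empty string.
sub suppress_stopwords {
    my ( $stoplist, @token_texts ) = @_;
    my @result;
    for my $text (@token_texts) {
        push @result, exists $stoplist->{$text} ? '' : $text;
    }
    return @result;
}

# Illustrative stoplist: stopwords as keys, values set to 1.
my %demo_stoplist = ( i => 1, am => 1, the => 1 );

my @suppressed = suppress_stopwords( \%demo_stoplist, qw( i am the walrus ) );
print join( '|', @suppressed ), "\n";    # prints "|||walrus"
```

       Note that the token positions are preserved: stopwords are blanked rather than removed from the list.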

CONSTRUCTOR

   new
	   my $stopalizer = KinoSearch1::Analysis::Stopalizer->new(
	       language => 'de',
	   );

	   # or...
	   my $stopalizer = KinoSearch1::Analysis::Stopalizer->new(
	       stoplist => \%stoplist,
	   );

       new() takes two possible parameters, "language" and "stoplist".	If "stoplist" is supplied, it will be used, overriding the behavior
       indicated by the value of "language".

       o   stoplist - must be a hashref, with stopwords as the keys of the hash and values set to 1.

       o   language - must be the ISO code for a language, such as 'fr' or 'de'.  Loads a default stoplist supplied by Lingua::StopWords.
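
       A custom stoplist in the shape described above (stopwords as keys, values set to 1) might be built as follows.  make_stoplist()
       is a hypothetical helper; lowercasing the keys is an assumption that suits a chain where an LCNormalizer runs before the
       Stopalizer, as in the SYNOPSIS.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of words into a stoplist hashref,
# with each (lowercased) word as a key and every value set to 1.
sub make_stoplist {
    my @words = @_;
    my %stoplist;
    $stoplist{ lc($_) } = 1 for @words;
    return \%stoplist;
}

my $custom_stoplist = make_stoplist(qw( The If Maybe ));

# The hashref would then be handed to the constructor shown above:
#   my $stopalizer = KinoSearch1::Analysis::Stopalizer->new(
#       stoplist => $custom_stoplist,
#   );
print join( ' ', sort keys %$custom_stoplist ), "\n";    # prints "if maybe the"
```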

SEE ALSO

       Lingua::StopWords

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15				    KinoSearch1::Analysis::Stopalizer(3pm)

KinoSearch1::Analysis::Tokenizer(3pm)			User Contributed Perl Documentation		     KinoSearch1::Analysis::Tokenizer(3pm)

NAME

       KinoSearch1::Analysis::Tokenizer - customizable tokenizing

SYNOPSIS

	   my $whitespace_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );

	   # or...
	   my $word_char_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );

	   # or...
	   my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

	   # then... once you have a tokenizer, put it into a PolyAnalyzer
	   my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );

DESCRIPTION

       Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".

	   # before:
	   my $string = "three blind mice";

	   # after:
	   @tokens = qw( three blind mice );

       KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of "token_re".

	   # before:
	   my $string = "Eats, Shoots and Leaves.";

	   # tokenized by $whitespace_tokenizer
	   @tokens = qw( Eats, Shoots and Leaves. );

	   # tokenized by $word_char_tokenizer
	   @tokens = qw( Eats Shoots and Leaves   );
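
       The difference between the two tokenizers can be reproduced with ordinary Perl pattern matching.  The sketch below does not use
       KinoSearch1; tokenize_with() is a hypothetical helper that collects every match of a pattern, assumed here to mirror how
       "token_re" selects tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: return each substring of $text matched by $token_re.
sub tokenize_with {
    my ( $token_re, $text ) = @_;
    my @tokens;
    while ( $text =~ /$token_re/g ) {
        # @- and @+ hold the start and end offsets of the last match.
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

my $sample = "Eats, Shoots and Leaves.";

# Runs of non-whitespace keep the punctuation attached.
print join( '|', tokenize_with( qr/\S+/, $sample ) ), "\n";    # Eats,|Shoots|and|Leaves.

# Runs of word characters drop the punctuation.
print join( '|', tokenize_with( qr/\w+/, $sample ) ), "\n";    # Eats|Shoots|and|Leaves
```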

METHODS

   new
	   # match "O'Henry" as well as "Henry" and "it's" as well as "it"
	   my $token_re = qr/
		   \b	     # start with a word boundary
		   \w+	     # Match word chars.
		   (?:	     # Group, but don't capture...
		      '\w+   # ... an apostrophe plus word chars.
		   )?	     # Matching the apostrophe group is optional.
		   \b	     # end with a word boundary
	       /xsm;
	   my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
	       token_re => $token_re, # default: what you see above
	   );

       Constructor.  Takes one hash-style parameter.

       o   token_re - must be a pre-compiled regular expression matching one token.
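
       To see what the apostrophe-aware pattern described above matches, it can be applied directly with Perl's global match operator.
       This is a sketch in plain Perl, independent of KinoSearch1; the sample sentence and variable names are made up for the
       illustration.

```perl
use strict;
use warnings;

# Word chars, optionally followed by an apostrophe plus more word chars,
# anchored at word boundaries on both sides.
my $apostrophe_re = qr/
        \b
        \w+
        (?: '\w+ )?
        \b
    /xsm;

my $sentence = "O'Henry said it's fine";

# In list context, a global match with no capture groups returns every match.
my @apostrophe_tokens = $sentence =~ /$apostrophe_re/g;
print join( '|', @apostrophe_tokens ), "\n";    # O'Henry|said|it's|fine

# For comparison, plain word-character matching splits at the apostrophe.
my @plain_tokens = $sentence =~ /\w+/g;
print join( '|', @plain_tokens ), "\n";         # O|Henry|said|it|s|fine
```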

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15				     KinoSearch1::Analysis::Tokenizer(3pm)
