KinoSearch1::Analysis::Tokenizer(3pm)			User Contributed Perl Documentation		     KinoSearch1::Analysis::Tokenizer(3pm)

NAME
KinoSearch1::Analysis::Tokenizer - customizable tokenizing
SYNOPSIS
    my $whitespace_tokenizer
        = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );

    # or...
    my $word_char_tokenizer
        = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );

    # or...
    my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

    # then... once you have a tokenizer, put it into a PolyAnalyzer
    my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
        analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );
DESCRIPTION
Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".

    # before:
    my $string = "three blind mice";

    # after:
    @tokens = qw( three blind mice );

KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of "token_re".

    # before:
    my $string = "Eats, Shoots and Leaves.";

    # tokenized by $whitespace_tokenizer
    @tokens = qw( Eats, Shoots and Leaves. );

    # tokenized by $word_char_tokenizer
    @tokens = qw( Eats Shoots and Leaves );
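The effect of "token_re" can be reproduced with core Perl alone. The following is a minimal sketch, not the module's implementation: tokenize() is a hypothetical helper that collects every match of a pre-compiled regular expression, and nothing from KinoSearch1 is loaded.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch1): return every substring
# of $text that matches $token_re, scanning from left to right.
sub tokenize {
    my ( $text, $token_re ) = @_;
    my @tokens;
    while ( $text =~ /$token_re/g ) {
        # @- and @+ hold the start and end offsets of the last match.
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

my $string = "Eats, Shoots and Leaves.";

# \S+ keeps punctuation attached; \w+ drops it.
my @by_whitespace = tokenize( $string, qr/\S+/ );
my @by_word_chars = tokenize( $string, qr/\w+/ );

print join( "|", @by_whitespace ), "\n";    # Eats,|Shoots|and|Leaves.
print join( "|", @by_word_chars ), "\n";    # Eats|Shoots|and|Leaves
```

The two result lists correspond to the $whitespace_tokenizer and $word_char_tokenizer outputs shown above.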
METHODS
new

    # match "O'Henry" as well as "Henry" and "it's" as well as "it"
    my $token_re = qr/
            \b        # start with a word boundary
            \w+       # Match word chars.
            (?:       # Group, but don't capture...
               '\w+   # ... an apostrophe plus word chars.
            )?        # Matching the apostrophe group is optional.
            \b        # end with a word boundary
        /xsm;
    my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
        token_re => $token_re, # default: what you see above
    );

Constructor. Takes one hash-style parameter.

o   token_re - must be a pre-compiled regular expression matching one token.
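To see what the apostrophe group buys, the default pattern can be compared against plain \w+ on the same input. This is an illustrative sketch in core Perl only. The sample sentence and variable names are assumptions made for the example, and the matching loop stands in for whatever the module does internally.

```perl
use strict;
use warnings;

# The default pattern described above, written on one line.
my $default_re = qr/\b\w+(?:'\w+)?\b/xsm;

# Plain word-character pattern, for comparison.
my $word_re = qr/\w+/;

my $text = "O'Henry said it's fine";    # sample input (assumption)

my ( @default_tokens, @word_tokens );
while ( $text =~ /$default_re/g ) {
    push @default_tokens, substr( $text, $-[0], $+[0] - $-[0] );
}
while ( $text =~ /$word_re/g ) {
    push @word_tokens, substr( $text, $-[0], $+[0] - $-[0] );
}

print join( "|", @default_tokens ), "\n";    # O'Henry|said|it's|fine
print join( "|", @word_tokens ),    "\n";    # O|Henry|said|it|s|fine
```

With plain \w+, the apostrophe splits "O'Henry" and "it's" into separate fragments. The default pattern keeps each as a single token.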
COPYRIGHT
Copyright 2005-2010 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch1 version 1.00.

perl v5.14.2					    2011-11-15			     KinoSearch1::Analysis::Tokenizer(3pm)