KinoSearch1::Analysis::Tokenizer(3pm)    User Contributed Perl Documentation    KinoSearch1::Analysis::Tokenizer(3pm)

NAME
       KinoSearch1::Analysis::Tokenizer - customizable tokenizing

SYNOPSIS
           my $whitespace_tokenizer
               = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );

           # or...
           my $word_char_tokenizer
               = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );

           # or...
           my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

           # then... once you have a tokenizer, put it into a PolyAnalyzer
           my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
               analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ],
           );

DESCRIPTION
       Generically, "tokenizing" is a process of breaking up a string into an
       array of "tokens".

           # before:
           my $string = "three blind mice";

           # after:
           @tokens = qw( three blind mice );

       KinoSearch1::Analysis::Tokenizer decides where it should break up the
       text based on the value of "token_re".

           # before:
           my $string = "Eats, Shoots and Leaves.";

           # tokenized by $whitespace_tokenizer
           @tokens = qw( Eats, Shoots and Leaves. );

           # tokenized by $word_char_tokenizer
           @tokens = qw( Eats Shoots and Leaves );

METHODS
   new
           # match "O'Henry" as well as "Henry" and "it's" as well as "it"
           my $token_re = qr/
                   \b        # start with a word boundary
                   \w+       # Match word chars.
                   (?:       # Group, but don't capture...
                      '\w+   # ... an apostrophe plus word chars.
                   )?        # Matching the apostrophe group is optional.
                   \b        # end with a word boundary
               /xsm;
           my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
               token_re => $token_re, # default: what you see above
           );

       Constructor.  Takes one hash style parameter.

       o   token_re - must be a pre-compiled regular expression matching one
           token.

COPYRIGHT
       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2                          2011-11-15          KinoSearch1::Analysis::Tokenizer(3pm)
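
The effect of different "token_re" values can be previewed with plain Perl before building a tokenizer. The following is only a sketch: the tokenize() helper is hypothetical, is not part of KinoSearch1, and simply collects every match of the pattern. It assumes, as the DESCRIPTION suggests, that each match of "token_re" becomes one token; the real Tokenizer may differ in details.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch1: return every substring
# of $text that matches the pre-compiled pattern $token_re.
sub tokenize {
    my ( $token_re, $text ) = @_;
    my @tokens;
    while ( $text =~ /$token_re/g ) {
        # @- and @+ hold the start and end offsets of the last match.
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

my $string = "Eats, Shoots and Leaves.";

# Whitespace-delimited tokens keep their punctuation.
print join( '|', tokenize( qr/\S+/, $string ) ), "\n";
# prints: Eats,|Shoots|and|Leaves.

# Word-character tokens drop the punctuation.
print join( '|', tokenize( qr/\w+/, $string ) ), "\n";
# prints: Eats|Shoots|and|Leaves

# The documented default pattern keeps an internal apostrophe.
my $default_re = qr/\b\w+(?:'\w+)?\b/;
print join( '|', tokenize( $default_re, "O'Henry said it's fine" ) ), "\n";
# prints: O'Henry|said|it's|fine
```

Because "token_re" must be pre-compiled, each pattern above is built with qr// rather than passed as a string.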