KinoSearch1::Analysis::Tokenizer(3pm)			User Contributed Perl Documentation		     KinoSearch1::Analysis::Tokenizer(3pm)

NAME

       KinoSearch1::Analysis::Tokenizer - customizable tokenizing

SYNOPSIS

	   my $whitespace_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\S+/, );

	   # or...
	   my $word_char_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/\w+/, );

	   # or...
	   my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

	   # then... once you have a tokenizer, put it into a PolyAnalyzer
	   my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );

DESCRIPTION

       Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".

	   # before:
	   my $string = "three blind mice";

	   # after:
	   @tokens = qw( three blind mice );

       KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of "token_re".

	   # before:
	   my $string = "Eats, Shoots and Leaves.";

	   # tokenized by $whitespace_tokenizer
	   @tokens = qw( Eats, Shoots and Leaves. );

	   # tokenized by $word_char_tokenizer
	   @tokens = qw( Eats Shoots and Leaves   );
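
       The effect of different "token_re" values can be reproduced in plain Perl without KinoSearch1 installed. The following is only a sketch: the tokenize() helper is hypothetical and is not part of the KinoSearch1 API, and it is not meant to reflect how the module works internally. It simply collects every non-overlapping match of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of KinoSearch1: return every
# non-overlapping match of $token_re found in $text.
# Assumes $token_re contains no capturing groups.
sub tokenize {
    my ( $text, $token_re ) = @_;
    my @tokens = $text =~ /$token_re/g;    # list-context global match
    return @tokens;
}

my $string = "Eats, Shoots and Leaves.";

# Runs of non-whitespace keep the punctuation attached.
my @by_whitespace = tokenize( $string, qr/\S+/ );

# Runs of word characters drop the punctuation.
my @by_word_chars = tokenize( $string, qr/\w+/ );

print join( '|', @by_whitespace ), "\n";    # Eats,|Shoots|and|Leaves.
print join( '|', @by_word_chars ), "\n";    # Eats|Shoots|and|Leaves
```

       Note that the pattern describes what a token looks like, not what separates tokens, so anything that fails to match is discarded.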

METHODS

   new
	   # match "O'Henry" as well as "Henry" and "it's" as well as "it"
	   my $token_re = qr/
		   \b	     # start with a word boundary
		   \w+	     # Match word chars.
		   (?:	     # Group, but don't capture...
		      '\w+   # ... an apostrophe plus word chars.
		   )?	     # Matching the apostrophe group is optional.
		   \b	     # end with a word boundary
	       /xsm;
	   my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
	       token_re => $token_re, # default: what you see above
	   );

       Constructor.  Takes one hash-style parameter.

       o   token_re - must be a pre-compiled regular expression (created with qr//) matching one token.  Defaults to the apostrophe-aware pattern shown above.
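
       To see what the apostrophe-aware pattern buys over a bare run of word characters, the two can be compared in plain Perl. This is a sketch only: the sample sentence is made up for illustration, and the pattern is written out by hand here under the assumption that it matches the documented default.

```perl
use strict;
use warnings;

# Hand-written copy of the documented pattern (assumed to match the default).
my $apostrophe_re = qr/
    \b          # start with a word boundary
    \w+         # word chars
    (?:'\w+)?   # optionally, an apostrophe plus more word chars
    \b          # end with a word boundary
/xsm;

# Illustrative input, not taken from the KinoSearch1 documentation.
my $text = "O'Henry said it's Henry's";

my @with_apostrophes = $text =~ /$apostrophe_re/g;
my @word_chars_only  = $text =~ /\w+/g;

print join( '|', @with_apostrophes ), "\n";    # O'Henry|said|it's|Henry's
print join( '|', @word_chars_only ),  "\n";    # O|Henry|said|it|s|Henry|s
```

       The non-capturing group matters in a sketch like this: with a capturing group, a list-context global match would return the captured pieces rather than the whole tokens.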

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15				     KinoSearch1::Analysis::Tokenizer(3pm)