kinosearch1::analysis::tokenizer(3pm) [debian man page]

KinoSearch1::Analysis::Tokenizer(3pm)			User Contributed Perl Documentation		     KinoSearch1::Analysis::Tokenizer(3pm)

NAME

       KinoSearch1::Analysis::Tokenizer - customizable tokenizing

SYNOPSIS

	   my $whitespace_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/S+/, );

	   # or...
	   my $word_char_tokenizer
	       = KinoSearch1::Analysis::Tokenizer->new( token_re => qr/w+/, );

	   # or...
	   my $apostrophising_tokenizer = KinoSearch1::Analysis::Tokenizer->new;

	   # then... once you have a tokenizer, put it into a PolyAnalyzer
	   my $polyanalyzer = KinoSearch1::Analysis::PolyAnalyzer->new(
	       analyzers => [ $lc_normalizer, $word_char_tokenizer, $stemmer ], );

DESCRIPTION

       Generically, "tokenizing" is a process of breaking up a string into an array of "tokens".

	   # before:
	   my $string = "three blind mice";

	   # after:
	   @tokens = qw( three blind mice );

       KinoSearch1::Analysis::Tokenizer decides where it should break up the text based on the value of "token_re".

	   # before:
	   my $string = "Eats, Shoots and Leaves.";

	   # tokenized by $whitespace_tokenizer
	   @tokens = qw( Eats, Shoots and Leaves. );

	   # tokenized by $word_char_tokenizer
	   @tokens = qw( Eats Shoots and Leaves   );

METHODS

   new
	   # match "O'Henry" as well as "Henry" and "it's" as well as "it"
	   my $token_re = qr/
		   	     # start with a word boundary
		   w+	     # Match word chars.
		   (?:	     # Group, but don't capture...
		      'w+   # ... an apostrophe plus word chars.
		   )?	     # Matching the apostrophe group is optional.
		   	     # end with a word boundary
	       /xsm;
	   my $tokenizer = KinoSearch1::Analysis::Tokenizer->new(
	       token_re => $token_re, # default: what you see above
	   );

       Constructor.  Takes one hash style parameter.

       o   token_re - must be a pre-compiled regular expression matching one token.

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15				     KinoSearch1::Analysis::Tokenizer(3pm)

Check Out this Related Man Page

KinoSearch1::InvIndexer(3pm)				User Contributed Perl Documentation			      KinoSearch1::InvIndexer(3pm)

NAME

       KinoSearch1::InvIndexer - build inverted indexes

SYNOPSIS

	   use KinoSearch1::InvIndexer;
	   use KinoSearch1::Analysis::PolyAnalyzer;

	   my $analyzer
	       = KinoSearch1::Analysis::PolyAnalyzer->new( language => 'en' );

	   my $invindexer = KinoSearch1::InvIndexer->new(
	       invindex => '/path/to/invindex',
	       create	=> 1,
	       analyzer => $analyzer,
	   );

	   $invindexer->spec_field(
	       name  => 'title'
	       boost => 3,
	   );
	   $invindexer->spec_field( name => 'bodytext' );

	   while ( my ( $title, $bodytext ) = each %source_documents ) {
	       my $doc = $invindexer->new_doc($title);

	       $doc->set_value( title	 => $title );
	       $doc->set_value( bodytext => $bodytext );

	       $invindexer->add_doc($doc);
	   }

	   $invindexer->finish;

DESCRIPTION

       The InvIndexer class is KinoSearch1's primary tool for creating and modifying inverted indexes, which may be searched using
       KinoSearch1::Searcher.

METHODS

   new
	   my $invindexer = KinoSearch1::InvIndexer->new(
	       invindex => '/path/to/invindex',  # required
	       create	=> 1,			 # default: 0
	       analyzer => $analyzer,		 # default: no-op Analyzer
	   );

       Create an InvIndexer object.

       o   invindex - can be either a filepath, or an InvIndex subclass such as KinoSearch1::Store::FSInvIndex or KinoSearch1::Store::RAMInvIndex.

       o   create - create a new invindex, clobbering an existing one if necessary.

       o   analyzer - an object which subclasses KinoSearch1::Analysis::Analyzer, such as a PolyAnalyzer.

   spec_field
	   $invindexer->spec_field(
	       name	  => 'url',	 # required
	       boost	  => 1, 	 # default: 1,
	       analyzer   => undef,	 # default: analyzer spec'd in new()
	       indexed	  => 0, 	 # default: 1
	       analyzed   => 0, 	 # default: 1
	       stored	  => 1, 	 # default: 1
	       compressed => 0, 	 # default: 0
	       vectorized => 0, 	 # default: 1
	   );

       Define a field.

       o   name - the field's name.

       o   boost - A multiplier which determines how much a field contributes to a document's score.

       o   analyzer - By default, all indexed fields are analyzed using the analyzer that was supplied to new().  Supplying an alternate for a
	   given field overrides the primary analyzer.

       o   indexed - index the field, so that it can be searched later.

       o   analyzed - analyze the field, using the relevant Analyzer.  Fields such as "category" or "product_number" might be indexed but not
	   analyzed.

       o   stored - store the field, so that it can be retrieved when the document turns up in a search.

       o   compressed - compress the stored field, using the zlib compression algorithm.

       o   vectorized - store the field's "term vectors", which are required by KinoSearch1::Highlight::Highlighter for excerpt selection and
	   search term highlighting.

   new_doc
	   my $doc = $invindexer->new_doc;

       Spawn an empty KinoSearch1::Document::Doc object, primed to accept values for the fields spec'd by spec_field.

   add_doc
	   $invindexer->add_doc($doc);

       Add a document to the invindex.

   add_invindexes
	   my $invindexer = KinoSearch1::InvIndexer->new(
	       invindex => $invindex,
	       analyzer => $analyzer,
	   );
	   $invindexer->add_invindexes( $another_invindex, $yet_another_invindex );
	   $invindexer->finish;

       Absorb existing invindexes into this one.  May only be called once per InvIndexer.  add_invindexes() and add_doc() cannot be called on the
       same InvIndexer.

   delete_docs_by_term
	   my $term = KinoSearch1::Index::Term->new( 'id', $unique_id );
	   $invindexer->delete_docs_by_term($term);

       Mark any document which contains the supplied term as deleted, so that it will be excluded from search results.	For more info, see
       Deletions in KinoSearch1::Docs::FileFormat.

   finish
	   $invindexer->finish(
	       optimize => 1, # default: 0
	   );

       Finish the invindex.  Invalidates the InvIndexer.  Takes one hash-style parameter.

       o   optimize - If optimize is set to 1, the invindex will be collapsed to its most compact form, which will yield the fastest queries.

COPYRIGHT

       Copyright 2005-2010 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.
       See KinoSearch1 version 1.00.

perl v5.14.2							    2011-11-15					      KinoSearch1::InvIndexer(3pm)

Linux and UNIX Man Pages

kinosearch1::analysis::tokenizer(3pm) [debian man page]

Check Out this Related Man Page