Lingua::StopWords(3pm)   User Contributed Perl Documentation   Lingua::StopWords(3pm)

NAME
    Lingua::StopWords - Stop words for several languages.

SYNOPSIS
        use Lingua::StopWords qw( getStopWords );
        my $stopwords = getStopWords('en');

        my @words = qw( i am the walrus goo goo g'joob );

        # prints "walrus goo goo g'joob"
        print join ' ', grep { !$stopwords->{$_} } @words;

DESCRIPTION
    In keyword search, it is common practice to suppress a collection of
    "stopwords": words such as "the", "and", "maybe", etc. which occur in
    a large number of documents and do not tell you anything important
    about any document which contains them. This module provides such
    "stoplists" in several languages.

  Supported Languages
        |-----------------------------------------------------------|
        | Language   | ISO code | default encoding | also available |
        |-----------------------------------------------------------|
        | Danish     | da       | ISO-8859-1       | UTF-8          |
        | Dutch      | nl       | ISO-8859-1       | UTF-8          |
        | English    | en       | ISO-8859-1       | UTF-8          |
        | Finnish    | fi       | ISO-8859-1       | UTF-8          |
        | French     | fr       | ISO-8859-1       | UTF-8          |
        | German     | de       | ISO-8859-1       | UTF-8          |
        | Hungarian  | hu       | ISO-8859-1       | UTF-8          |
        | Italian    | it       | ISO-8859-1       | UTF-8          |
        | Norwegian  | no       | ISO-8859-1       | UTF-8          |
        | Portuguese | pt       | ISO-8859-1       | UTF-8          |
        | Spanish    | es       | ISO-8859-1       | UTF-8          |
        | Swedish    | sv       | ISO-8859-1       | UTF-8          |
        | Russian    | ru       | KOI8-R           | UTF-8          |
        |-----------------------------------------------------------|

FUNCTIONS
  getStopWords
        my $stoplist = getStopWords('en');
        my $utf8_stoplist = getStopWords('en', 'UTF-8');

    Retrieve a stoplist in the form of a hashref where the keys are all
    stopwords and the values are all 1.

        $stoplist = {
            and => 1,
            if  => 1,
            # ...
        };

    getStopWords() expects one or two arguments. The first, which is
    required, is an ISO code representing a supported language. If the
    ISO code cannot be found, getStopWords returns undef.

    The second argument should be 'UTF-8' if you want the stopwords
    encoded in UTF-8.
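
    Because getStopWords returns undef for an unsupported ISO code, a
    caller that dereferences the result directly can fail under "use
    strict". The following is a minimal sketch of one way to guard against
    that. The helper remove_stopwords and the code 'xx' are hypothetical
    names used only for illustration; they are not part of
    Lingua::StopWords.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

# Hypothetical helper (not provided by Lingua::StopWords): return the
# words from @words that are not stopwords in the language $lang.
sub remove_stopwords {
    my ( $lang, @words ) = @_;

    # getStopWords returns undef when the ISO code is not supported.
    my $stoplist = getStopWords($lang);

    # Design choice of this sketch: pass the words through unchanged
    # when no stoplist is available.
    return @words unless defined $stoplist;

    # Every stopword is a key with the value 1, so a plain truth test
    # on the hash entry is enough.
    return grep { !$stoplist->{$_} } @words;
}

# Same input as in the SYNOPSIS.
my @kept = remove_stopwords( 'en', qw( i am the walrus goo goo g'joob ) );
print join( ' ', @kept ), "\n";

# 'xx' is assumed to be an unsupported code; the words come back as-is.
my @same = remove_stopwords( 'xx', qw( i am the walrus ) );
print join( ' ', @same ), "\n";
```

    Passing the words through unchanged is only one possible policy. A
    caller that treats an unknown language as an error could die or warn
    at that point instead.
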
    When 'UTF-8' is requested, the UTF-8 flag will be turned on for the
    returned stopwords, so make sure you understand all the implications
    of that.

SEE ALSO
    The stoplists supplied by this module were created as part of the
    Snowball project (see <http://snowball.tartarus.org>,
    Lingua::Stem::Snowball).

    Lingua::EN::StopWords provides a different stoplist for English.

AUTHOR
    Maintained by Marvin Humphrey <marvin at rectangular dot com>.
    Original author Fabien Potencier, <fabpot at cpan dot org>.

COPYRIGHT AND LICENSE
    Copyright 2004-2008 Fabien Potencier, Marvin Humphrey

    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself, either Perl version 5.8.3 or,
    at your option, any later version of Perl 5 you may have available.

perl v5.10.0                     2009-02-23              Lingua::StopWords(3pm)