slmseg(1) debian man page | unix.com

Man Page: slmseg

Operating Environment: debian

Section: 1

SLMSEG(1)						User Contributed Perl Documentation						 SLMSEG(1)

NAME
slmseg - segment Chinese text using maximum matching.
SYNOPSIS
slmseg -d dict_file [option]... [corpus_file]...
DESCRIPTION
slmseg is a tool for segmenting Chinese text into words using the maximum matching algorithm. It segments corpus_file, or standard input if no file name is specified, and writes the segmented result to standard output.
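The idea behind maximum matching can be sketched in a few lines of Python. This is only an illustration of the general greedy forward technique, not slmseg's actual implementation, which may differ in matching direction and in how it uses a language model. The function name max_match and the toy lexicon are hypothetical and are not taken from dict.utf8.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that starts there.
    A single character is emitted as a fallback when nothing matches.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"研究", "研究生", "生命", "起源"}

print(max_match("研究生命起源", LEXICON))  # ['研究生', '命', '起源']
```

The sample also shows the known weakness of a purely greedy match: the longest first word wins even when a shorter one would give a better overall segmentation.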
OPTIONS
-d dict_file
    Use dict_file as the lexicon. A default lexicon can be found at /usr/share/sunpinyin-slm/dict.utf8.

-f, --format (text|bin)
    Output format, either 'text' or 'bin'. The default is 'bin'. Normally, in text mode the text of each word is output, while in binary mode the word ids are written to stdout as binary short integers.

-s, --stok STOK_ID
    Sentence token id. The default is 10. In binary mode, it is written to the output after every sentence.

-i, --show-id
    Show id information. In text mode, attach the id after each known word. In binary mode, print the ids as text.

-m, --model language-model-file
    Specify the language model file. This file is always generated by slmthread.
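As a rough sketch of how the binary output might be consumed, the following Python splits a stream of word ids into sentences on the sentence token. It rests on assumptions this page does not confirm: that a "short integer" is an unsigned 16-bit value and that the byte order is little-endian. Check both against real output before relying on it. The function name split_sentences and the sample ids are hypothetical.

```python
import struct


def split_sentences(data, stok_id=10, fmt="<H"):
    """Decode fixed-width word ids and split them on the sentence token.

    fmt="<H" (unsigned 16-bit, little-endian) is an assumption; the man
    page only says "binary short integer".
    """
    width = struct.calcsize(fmt)
    usable = len(data) - len(data) % width  # ignore a trailing partial id
    ids = [wid for (wid,) in struct.iter_unpack(fmt, data[:usable])]

    sentences, current = [], []
    for wid in ids:
        if wid == stok_id:
            # The sentence token closes the current sentence.
            sentences.append(current)
            current = []
        else:
            current.append(wid)
    if current:
        sentences.append(current)
    return sentences


# Made-up ids standing in for data captured from slmseg's stdout.
sample = struct.pack("<7H", 35, 0, 72, 10, 41, 58, 10)
print(split_sentences(sample))  # [[35, 0, 72], [41, 58]]
```

If -s is given a different STOK_ID, the same value would be passed as stok_id here.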
NOTES
In binary mode, consecutive ids of 0 are merged into a single 0. In text mode, no spaces are inserted between unknown words.
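The two rules above can be illustrated with a small Python sketch. It assumes that id 0 is the id given to unknown words and that known words are separated by a single space in text mode; neither is stated explicitly on this page. The helper names binary_ids and text_line and the sample tokens are hypothetical.

```python
UNKNOWN_ID = 0  # assumption: id 0 marks an unknown word


def binary_ids(tokens):
    """Return the id sequence with runs of UNKNOWN_ID collapsed to one."""
    out = []
    for _word, wid in tokens:
        if wid == UNKNOWN_ID and out and out[-1] == UNKNOWN_ID:
            continue  # skip the repeated 0
        out.append(wid)
    return out


def text_line(tokens):
    """Join words with spaces, except between two adjacent unknown words."""
    parts = []
    prev_unknown = False
    for word, wid in tokens:
        unknown = wid == UNKNOWN_ID
        if parts and not (unknown and prev_unknown):
            parts.append(" ")
        parts.append(word)
        prev_unknown = unknown
    return "".join(parts)


# Made-up (word, id) pairs for illustration.
tokens = [("研究", 35), ("甲", 0), ("乙", 0), ("起源", 72)]
print(binary_ids(tokens))  # [35, 0, 72]
print(text_line(tokens))   # 研究 甲乙 起源
```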
AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>.
SEE ALSO
mmseg(1), ids2ngram(1).

perl v5.14.2 2012-06-09 SLMSEG(1)