mmseg(1) [debian man page]

MMSEG(1)						User Contributed Perl Documentation						  MMSEG(1)

NAME

       mmseg - segment Chinese text into words using maximum matching

SYNOPSIS

       mmseg -d dict_file [option]... [corpus_file]...

DESCRIPTION

       mmseg is a tool for segmenting Chinese text into words using the maximum matching algorithm. It segments each corpus_file, or standard
       input if no file name is given, and writes the segmented result to standard output.
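
       The idea behind maximum matching can be sketched in a few lines of Python. This is only an illustration of the general technique
       (greedy, left to right, longest lexicon entry first); it is not taken from the mmseg source, and the function name, the toy lexicon
       and the single-character fallback for unknown text are all assumptions made for the example.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching (illustrative sketch, not mmseg's code)."""
    if max_len is None:
        # Longest entry in the lexicon bounds the candidate length.
        max_len = max((len(w) for w in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word matches.
        # Assumption: a single character is emitted when nothing matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
    return words


# Hypothetical toy lexicon, not the contents of dict.utf8.
lexicon = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", lexicon))
```

       With this toy lexicon the greedy pass takes the three-character entry first, so the text is not split as 研究 / 生命 / 起源. Overlaps
       of this kind are what the -a option below is concerned with.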

OPTIONS

       -d dict_file
	   Use dict_file as the lexicon. A default lexicon can be found at /usr/share/sunpinyin-slm/dict.utf8.

       -f, --format (text|bin)
	   Output format, either 'text' or 'bin'. The default is 'bin'. In text mode, the words themselves are written as text; in binary mode,
	   the word ids are written to standard output as binary short integers.

       -s, --stok STOK_ID
	   Sentence token id. The default is 10. In binary mode, it is written to the output after every sentence.

       -i, --show-id
	   Show id information. In text mode, attach the id after each known word. In binary mode, print the id(s) as text.

       -a, --ambiguious-id AMBI-ID
	   A sequence ABC is ambiguous when it can be segmented as either A BC or AB C. If AMBI-ID is nonzero, such a sequence is not segmented:
	   in binary mode, AMBI-ID is written out; in text mode, "<ambi>ABC</ambi>" is output. The default is 0.
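
       As a rough illustration of how binary output might be consumed, the following Python sketch decodes a stream of word ids and splits
       it into sentences at the sentence token id (10 by default, see -s). The id width and byte order are assumptions: this page only says
       "short integer", so the struct format string is left as a parameter that should be checked against the mmseg build in use. The
       function names and sample ids are hypothetical.

```python
import struct


def decode_ids(data, id_format):
    """Unpack a byte string into word ids.

    id_format is a struct format such as "<H" (2-byte little-endian).
    ASSUMPTION: the real width and byte order are not stated in this page.
    """
    return [wid for (wid,) in struct.iter_unpack(id_format, data)]


def split_sentences(ids, stok=10):
    """Group ids into sentences, dropping the sentence token itself."""
    sentences, current = [], []
    for wid in ids:
        if wid == stok:
            sentences.append(current)
            current = []
        else:
            current.append(wid)
    if current:
        # Trailing ids with no closing sentence token.
        sentences.append(current)
    return sentences


# Made-up ids standing in for real mmseg output.
sample = struct.pack("<6H", 101, 102, 10, 0, 103, 10)
print(split_sentences(decode_ids(sample, "<H")))
```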

NOTES

       In binary mode, consecutive ids of 0 are merged into a single 0. In text mode, no spaces are inserted between unknown words.
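
       The merging rule for binary mode can be pictured with a short Python sketch. It assumes that id 0 stands for text not found in the
       lexicon, which the note above implies but does not state outright; the function name and sample ids are made up for the example.

```python
from itertools import groupby


def merge_unknown_ids(ids, unknown_id=0):
    """Collapse each run of unknown_id into one occurrence; keep other ids as-is."""
    merged = []
    for wid, run in groupby(ids):
        if wid == unknown_id:
            merged.append(wid)   # one 0 for the whole run
        else:
            merged.extend(run)   # repeated known ids are preserved
    return merged


print(merge_unknown_ids([101, 0, 0, 0, 102, 102, 0, 103]))
```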

AUTHOR

       Originally written by Phill.Zhang <phill.zhang@sun.com>.  Currently maintained by Kov.Chai <tchaikov@gmail.com>.

SEE ALSO

       slmseg(1), ids2ngram(1).

perl v5.14.2							    2012-06-09								  MMSEG(1)

CATOD(1)						      General Commands Manual							  CATOD(1)

NAME

       catod - convert a text-format dictionary to binary format

SYNOPSIS

       catod  [-s maxword ] [-R] [-r] [-e] [-S] [-U]
	      [-P dicpasswd ]	 [-p frepasswd ]
	      [-h cixingfile ] outfilename

DEFAULT PATH

       /usr/local/bin/cWnn4/catod

DESCRIPTION

       This command converts a dictionary from text format into binary format.

       outfilename is the name of the binary-format dictionary. If outfilename is not given, the output is written to standard output (stdout).

       The input file may be supplied by redirecting standard input with the shell's "<" operator. For example,
	    catod  basic.dic  <  basic.u
       Here "basic.dic" is the output binary-format dictionary and "basic.u" is the input text-format dictionary.

       If no input text dictionary is given, the input is read from standard input (stdin). To end input typed at the terminal, press ^D.
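
       When catod is driven from a script rather than a shell, the same redirection can be set up by hand. The Python sketch below builds
       the argument list from the synopsis above and feeds the text dictionary on standard input. The helper names are hypothetical, only
       the -s and -h options described under OPTIONS are used, and it assumes catod is on the PATH.

```python
import subprocess


def catod_argv(outfile, maxword=None, cixing_file=None):
    """Build a catod command line; options precede the output file name."""
    argv = ["catod"]
    if maxword is not None:
        argv += ["-s", str(maxword)]
    if cixing_file is not None:
        argv += ["-h", cixing_file]
    argv.append(outfile)
    return argv


def run_catod(text_dict, outfile, **options):
    """Equivalent of `catod outfile < text_dict` (requires catod to be installed)."""
    with open(text_dict, "rb") as src:
        subprocess.run(catod_argv(outfile, **options), stdin=src, check=True)


# Only prints the command line; nothing is executed here.
print(catod_argv("basic.dic", maxword=70000))
```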

OPTIONS

       -s     maxword
	      To specify the maximum number of words.  Default is 70000.

       -R     To create a dictionary for both forward and reverse conversion.  (Default).

       -r     To create a reverse format dictionary only for reverse conversion.

       -e     If the Hanzi in the text dictionary contain characters such as spaces and tabs, they are compacted to a special format.
	      (Default).

       -S     To create a static dictionary.

       -U     To create a dynamic dictionary.

       -P     dicpasswd
	      To specify the password for the dictionary.
	      If "-N" is used instead, the password of the dictionary will be set to "*".

       -p     frepasswd
	      To specify the password for the usage frequency file.
	      If "-n" is used instead, the password of the frequency file will be set to "*".

       -h     cixingfile
	      To specify the Cixing definition file.

NOTE

       1.     The parts in [ ] are optional and may be omitted.

       2.     The Pinyin and Zhuyin dictionaries have the same format.

       3.     For details of the dictionary structure, refer to the cWnn manual.

								    13 May 1992 							  CATOD(1)
