ids2ngram(1) [debian man page]

IDS2NGRAM(1)						User Contributed Perl Documentation					      IDS2NGRAM(1)

NAME
ids2ngram - generate n-gram data file from ids file

SYNOPSIS
ids2ngram [option]... ids_file...

DESCRIPTION
ids2ngram generates an idngram file, which is a sorted array of [id1,..,idN,freq] records, from binary id stream files. The id stream files are always generated by mmseg or slmseg. ids2ngram finds all occurrences of N-word tuples (i.e. tuples of the form (id1,..,idN)), sorts these tuples in the lexicographic order of the ids that make them up, and then writes them to the specified output file.

INPUT
The input file is a binary id stream, which looks like: [id0,...,idX]

OPTIONS
All of the following options are mandatory.

-n, --NMax N
       Generate N-gram results. ids2ngram supports only unigrams, bigrams, and trigrams, so any number outside the range 1..3 is invalid.

-s, --swap swap-file
       Specify the temporary intermediate file.

-o, --out output-file
       Specify the resulting idngram file, i.e. the array of [id1, ..., idN, freq].

-p, --para N
       Specify the maximum number of n-gram items per paragraph. ids2ngram writes to the temporary file on a per-paragraph basis. Each time it writes a paragraph out, it frees the memory allocated for that paragraph. If your system has enough memory, a higher N is suggested, because it reduces I/O and so speeds up processing.

EXAMPLE
The following example uses three input id stream files, idsfile[1,2,3], to generate the idngram file all.id3gram. Each paragraph (internal map size or hash size) holds 1024000 items, and a swap file stores the temporary results. All temporary paragraph results are eventually merged to produce the final result.

       ids2ngram -n 3 -s /tmp/swap -o all.id3gram -p 1024000 idsfile1 idsfile2 idsfile3

AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>.

SEE ALSO
mmseg(1), slmseg(1), slmbuild(1).

perl v5.14.2 2012-06-09 IDS2NGRAM(1)
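
The counting, per-paragraph flushing, and final merge that the ids2ngram page describes can be sketched roughly as follows. This is an illustration only, not the tool's actual implementation: the function name count_ngrams is hypothetical, the ids are taken from a Python list because the page does not specify the binary layout of the id stream, and an in-memory list of sorted runs stands in for the swap file.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_ngrams(ids, n, para):
    """Return sorted ((id1, ..., idN), freq) records for one id sequence.

    `para` caps the number of distinct n-grams held in memory at once,
    loosely mirroring the -p option. Flushed runs are kept in a list here
    (an assumption for illustration); the real tool uses a swap file.
    """
    if n not in (1, 2, 3):
        raise ValueError("n must be 1, 2 or 3")

    runs = []            # hypothetical stand-in for the swap file
    current = Counter()  # the "paragraph" currently being filled
    for i in range(len(ids) - n + 1):
        current[tuple(ids[i:i + n])] += 1
        if len(current) >= para:
            # Paragraph is full: store it as a sorted run and free it.
            runs.append(sorted(current.items()))
            current = Counter()
    if current:
        runs.append(sorted(current.items()))

    # Merge the sorted runs; equal tuples become adjacent, so their
    # frequencies can be summed in a single pass.
    merged = heapq.merge(*runs)
    return [(gram, sum(freq for _, freq in group))
            for gram, group in groupby(merged, key=itemgetter(0))]


if __name__ == "__main__":
    for gram, freq in count_ngrams([1, 2, 3, 1, 2, 3, 1], n=2, para=2):
        print(*gram, freq)
```

A small para forces several flushes and a larger merge, while a large para keeps everything in one run. That is the trade-off behind the advice to raise -p when memory allows.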

SLMBUILD(1)						User Contributed Perl Documentation					       SLMBUILD(1)

NAME
slmbuild - generate language model from idngram file

SYNOPSIS
slmbuild [option]... idngram_file...

DESCRIPTION
slmbuild generates a back-off smoothing language model from a given idngram file. Generally, the idngram_file is created by ids2ngram.

OPTIONS
All of the following options are mandatory.

-n, --NMax N
       1 for unigram, 2 for bigram, 3 for trigram. Any number outside the range 1..3 is invalid.

-o, --out output-file
       Specify the output file name.

-l, --log
       Use -log(pr). By default, pr is used directly.

-w, --wordcount N
       Lexicon size, i.e. the number of distinct words.

-b, --brk id...
       Set the ids which should be treated as breakers.

-e, --e id...
       Set the ids which should not be put into the LM.

-c, --cut c...
       k-grams whose freq <= c[k] are dropped.

-d, --discount method,param...
       The k-th -d parameter specifies the discount method for k-grams. Possible values for method/param are:

       GT,R,dis
              GT discount for r <= R, where r is the frequency of an n-gram. Linear discount for r > R, i.e. r'=r*dis, with 0 << dis < 1.0, for example 0.999.

       ABS,[dis]
              Absolute discount r'=r-dis. dis is optional, with 0 << dis < cut[k]+1.0; normally dis < 1.0.

       LIN,[dis]
              Linear discount r'=r*dis. dis is optional, with 0 < dis < 1.0.

NOTE
-n must be given before -c and -b. -c must be given the right number of cut-offs, and -d must appear exactly N times, specifying the discounts for 1-grams, 2-grams, ..., respectively.

BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these ids have no meaning when they appear in the middle of an n-gram.

EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams which contain those ids are meaningless.

We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly from the IDNGRAM file, because some of the low-level information in it is still useful.

EXAMPLE
The following example reads 'all.id3gram' and writes the trigram model 'all.slm'. At the 1-gram level, it uses Good-Turing discount with cut-off 0, R=8, dis=0.9995. At the 2-gram level, it uses Absolute discount with cut-off 3, dis auto-calculated. At the 3-gram level, it uses Absolute discount with cut-off 2, dis auto-calculated. Word ids 10, 11, and 12 are breakers (sentence/paragraph/paper breakers, etc.). The exclude id is 9. The lexicon contains 200000 words. The resulting language model uses -log(pr).

       slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram

AUTHOR
Originally written by Phill.Zhang <phill.zhang@sun.com>. Currently maintained by Kov.Chai <tchaikov@gmail.com>.

SEE ALSO
ids2ngram(1), slmprune(1).

perl v5.14.2 2012-06-09 SLMBUILD(1)
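
The cut-off rule and the ABS and LIN formulas from the slmbuild options can be illustrated with a few lines of Python. This is a sketch of the arithmetic only, under stated assumptions: the function names and the sample frequencies are made up, dis is passed explicitly because the page does not say how slmbuild auto-calculates it, and GT is left out because the page does not give its formula.

```python
def apply_cutoff(counts, cut):
    """Drop k-grams whose frequency is <= cut (the -c rule for one level)."""
    return {gram: r for gram, r in counts.items() if r > cut}


def discount_abs(r, dis):
    """Absolute discount: r' = r - dis."""
    return r - dis


def discount_lin(r, dis):
    """Linear discount: r' = r * dis, with 0 < dis < 1.0."""
    if not 0.0 < dis < 1.0:
        raise ValueError("dis must lie strictly between 0 and 1.0")
    return r * dis


# Made-up bigram frequencies, for illustration only.
bigrams = {(1, 2): 5, (2, 3): 3, (3, 1): 4}

# With cut-off 3, the bigram (2, 3) (freq 3 <= 3) is dropped.
kept = apply_cutoff(bigrams, cut=3)

# Absolute discount with a hand-picked dis of 0.5.
discounted = {gram: discount_abs(r, 0.5) for gram, r in kept.items()}

if __name__ == "__main__":
    print(kept)
    print(discounted)
```

The discounted counts are smaller than the observed ones. In a back-off model, the probability mass freed this way is what gets redistributed to n-grams that were never seen.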