ucto(1)                     General Commands Manual                    ucto(1)

NAME
       ucto - Unicode Tokenizer

SYNOPSIS
       ucto [[options]] [input-file] [[output-file]]

DESCRIPTION
       ucto tokenizes text files: it separates words from punctuation,
       splits sentences (and optionally paragraphs), and finds paired
       quotes. Ucto is preconfigured with tokenisation rules for several
       languages.

OPTIONS
       -c configfile
              Read settings from a file.

       -d value
              Set debug mode to 'value'.

       -e value
              Set input encoding. (Default: UTF8)

       -f     Disable filtering of special characters.

       -L language
              Automatically select a configuration file by language code.
              For example, 'fr' will select the file tokconfig-fr from the
              installation directory.

       -l     Convert to all lowercase.

       -u     Convert to all uppercase.

       -n     Assume one sentence per line on input.

       -m     Emit one sentence per line on output.

       --passthru
              Don't tokenize, but perform input decoding and simple token
              role detection.

       -P     Disable paragraph detection.

       -Q     Enable quote detection. (This is experimental and may lead to
              unexpected results.)

       -S     Disable sentence detection.

       -s <string>
              Set the end-of-sentence marker. (Default: <utt>)

       -V     Show version information.

       -v     Set verbose mode.

       -x <DocId>
              Output FoLiA XML, using the specified document ID. (This
              disables usage of most other options: -nulPQvsS)

       -F     Read a FoLiA XML document, tokenize it, and output the
              modified document. (This disables usage of most other
              options: -nulPQvsS)

BUGS
       likely

AUTHORS
       Maarten van Gompel proycon@anaproy.nl
       Ko van der Sloot Timbl@uvt.nl

2011 november 28                                                       ucto(1)
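
A minimal sketch of how the options above combine into a command line that follows the SYNOPSIS order (options, then input file, then output file). The helper name ucto_cmd and the file names input.txt and output.txt are hypothetical and used only for illustration. The function prints the command rather than running it, so it makes no assumption about whether ucto is installed.

```shell
#!/bin/sh
# ucto_cmd is a hypothetical helper: it only prints the ucto command line
# it would run, built from options documented in this manual page.
ucto_cmd() {
    lang=$1      # language code handed to -L, e.g. "fr"
    input=$2     # input text file (illustrative name)
    output=$3    # output file (illustrative name)
    # -L selects the configuration file by language code,
    # -n assumes one sentence per line on input,
    # -m emits one sentence per line on output.
    printf 'ucto -L %s -n -m %s %s\n' "$lang" "$input" "$output"
}

ucto_cmd fr input.txt output.txt
```

To run the tokenizer instead of printing the command, the printed line can be executed directly in a shell where ucto is available.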