ucto(1) 						      General Commands Manual							   ucto(1)

NAME
ucto - Unicode Tokenizer
SYNOPSIS
ucto [options] [input-file] [output-file]
DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenization rules for several languages.
OPTIONS
-c configfile
       read settings from a file

-d value
       set debug mode to 'value'

-e value
       set input encoding. (default UTF8)

-f
       disable filtering of special characters

-L language
       Automatically selects a configuration file by language code. e.g. 'fr' will select the file tokconfig-fr from the installation directory

-l
       Convert to all lowercase

-u
       Convert to all uppercase

-n
       Assume one sentence per line on input

-m
       Emit one sentence per line on output

--passthru
       Don't tokenize, but perform input decoding and simple token role detection

-P
       Disable Paragraph Detection

-Q
       Enable Quote Detection. (this is experimental and may lead to unexpected results)

-S
       Disable Sentence Detection

-s <string>
       Set End-of-sentence marker. (Default <utt>)

-V
       Show version information

-v
       set Verbose mode

-x <DocId>
       Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: -nulPQvsS)

-F
       Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nulPQvsS)
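EXAMPLES
The following is a minimal sketch of two invocations built only from the options documented in this page. The file names (input.txt, tokens.txt, tokens.xml) and the document ID (mydoc) are hypothetical placeholders, and the script assumes a French configuration (tokconfig-fr) is present in the installation directory. The commands are run only when ucto is installed and the input file exists.

```shell
#!/bin/sh
# Hypothetical file names (no spaces); replace with your own.
input="input.txt"

# Plain-text tokenization:
#   -L fr  select the French configuration by language code
#   -n     assume one sentence per line on input
#   -m     emit one sentence per line on output
#   -l     convert to all lowercase
tok_cmd="ucto -L fr -n -m -l $input tokens.txt"

# FoLiA XML output with the (hypothetical) document ID "mydoc".
# -x disables most other options (-nulPQvsS), so none of them are given here.
folia_cmd="ucto -L fr -x mydoc $input tokens.xml"

# Show what would be run.
echo "$tok_cmd"
echo "$folia_cmd"

# Run only when ucto is available and the input file exists.
if command -v ucto >/dev/null 2>&1 && [ -f "$input" ]; then
    $tok_cmd
    $folia_cmd
fi
```

Keeping the -x invocation free of -n, -m and -l reflects the note in OPTIONS that FoLiA output disables those options.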
BUGS
likely
AUTHORS
Maarten van Gompel proycon@anaproy.nl
Ko van der Sloot Timbl@uvt.nl

2011 november 28							   ucto(1)