ucto(1) 						      General Commands Manual							   ucto(1)

NAME

       ucto - Unicode Tokenizer

SYNOPSIS

       ucto [options] [input-file] [output-file]

DESCRIPTION

       ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes.
       It is preconfigured with tokenization rules for several languages.
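
       When ucto is driven from a script, it can help to assemble the command line in one place. The following is a minimal Python sketch, assuming only the options documented under OPTIONS and the argument order shown in the SYNOPSIS. The helper name build_ucto_args and the file names are hypothetical and not part of ucto.

```python
import shlex


def build_ucto_args(input_file, output_file=None, language=None,
                    lowercase=False, sentence_per_line=False):
    """Assemble an ucto argument list (hypothetical helper, not part of ucto)."""
    args = ["ucto"]
    if language is not None:
        args += ["-L", language]   # -L: select a configuration by language code
    if lowercase:
        args.append("-l")          # -l: convert to all lowercase
    if sentence_per_line:
        args.append("-m")          # -m: emit one sentence per line on output
    args.append(input_file)
    if output_file is not None:
        args.append(output_file)
    return args


if __name__ == "__main__":
    # Placeholder file names; this only prints the command, it does not run it.
    cmd = build_ucto_args("input.txt", "output.txt", language="fr",
                          sentence_per_line=True)
    print(shlex.join(cmd))  # prints: ucto -L fr -m input.txt output.txt
```

       The returned list can be passed to subprocess.run on a system where ucto is installed. Building a list rather than a single string avoids shell quoting problems with unusual file names.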

OPTIONS

       -c configfile
	      Read settings from a file

       -d value
	      Set debug mode to 'value'

       -e value
	      Set input encoding. (Default: UTF8)

       -f
	      disable filtering of special characters

       -L language
	      Automatically select a configuration file by language code, e.g. 'fr' selects the file tokconfig-fr from the installation
	      directory

       -l
	      Convert to all lowercase

       -u
	      Convert to all uppercase

       -n
	      Assume one sentence per line on input

       -m
	      Emit one sentence per line on output

       --passthru
	      Don't tokenize, but perform input decoding and simple token role detection

       -P
	      Disable Paragraph Detection

       -Q
	      Enable Quote Detection. (This is experimental and may lead to unexpected results.)

       -S
	      Disable Sentence Detection

       -s <string>
	      Set End-of-sentence marker. (Default <utt>)

       -V
	      Show version information

       -v
	      Set verbose mode

       -x <DocId>
	      Output FoLiA XML, use the specified Document ID. (this disables usage of most other options: -nulPQvsS)

       -F
	      Read a FoLiA XML document, tokenize it, and output the modified doc. (this disables usage of most other options: -nulPQvsS)
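
EXAMPLES

       The end-of-sentence marker set with -s (default <utt>) can be used to regroup tokenized output into sentences. The following is a minimal Python sketch, assuming the output is plain text in which tokens are separated by whitespace and the marker appears as a token of its own. That layout is an assumption here, the function name split_sentences is hypothetical, and the sample text is made up for illustration rather than taken from a real ucto run.

```python
def split_sentences(tokenized_text, eos_marker="<utt>"):
    """Group whitespace-separated tokens into sentences at each marker."""
    sentences, current = [], []
    for token in tokenized_text.split():
        if token == eos_marker:
            if current:              # skip empty sentences between markers
                sentences.append(current)
            current = []
        else:
            current.append(token)
    if current:                      # trailing tokens with no closing marker
        sentences.append(current)
    return sentences


# Made-up sample in the assumed layout, not actual ucto output.
sample = "Hello , world ! <utt> This is a test . <utt>"
for sentence in split_sentences(sample):
    print(" ".join(sentence))
```

       If a different marker was chosen with -s, pass the same string as eos_marker.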

BUGS

       likely

AUTHORS

       Maarten van Gompel proycon@anaproy.nl

       Ko van der Sloot Timbl@uvt.nl

								 2011 november 28							   ucto(1)