wordlist2dawg(1) [debian man page]

WORDLIST2DAWG(1)														  WORDLIST2DAWG(1)

NAME

       wordlist2dawg - convert a wordlist to a DAWG for Tesseract

SYNOPSIS

       wordlist2dawg WORDLIST DAWG lang.unicharset

       wordlist2dawg -t WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

       wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

DESCRIPTION

       wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
       time-efficient representation of a word list.
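
       The node sharing behind that compression can be illustrated with a minimal Python sketch. It is not Tesseract's implementation or file
       format: the names Node, build_trie, minimize, count_nodes and contains are hypothetical, and the sketch only assumes that a DAWG can be
       obtained by building a trie and then merging identical subtrees.

```python
from typing import Dict, Iterable, Tuple


class Node:
    """One state of the word graph: outgoing edges plus an end-of-word flag."""

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> Node:
    """Insert every word character by character; prefixes are shared."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.is_word = True
    return root


def minimize(node: Node, registry: Dict[Tuple, Node]) -> Node:
    """Merge identical subtrees bottom-up so that suffixes are shared too."""
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    # Children are canonical at this point, so two nodes are equivalent when
    # they have the same flag and the same (label, canonical child) pairs.
    key = (
        node.is_word,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(key, node)


def count_nodes(root: Node) -> int:
    """Count distinct nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def contains(root: Node, word: str) -> bool:
    """Follow the edges for word; lookup time depends only on word length."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.is_word


if __name__ == "__main__":
    trie = build_trie(["tap", "taps", "top", "tops"])
    print("trie nodes:", count_nodes(trie))
    dawg = minimize(trie, {})  # minimize rewires the trie in place
    print("dawg nodes:", count_nodes(dawg))
```

       For the four words in the demo, the trie needs 8 nodes and the merged graph 5, because the shared endings "p" and "ps" are stored once.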

OPTIONS

       -t Verify that a given dawg file is equivalent to a given wordlist.

       -r 1 Reverse a word if it contains an RTL character.

       -r 2 Reverse all words.

       -l <short> <long> Produce a file containing several DAWGs, one for each word length from <short> to <long> (<short>, <short+1>, ..., <long>).
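
       The effect of the -r and -l options on the input words can be sketched in Python. This is an illustration under stated assumptions, not
       the tool's code: the function names are hypothetical, policy 0 stands for running without -r, "RTL character" is approximated by the
       Unicode bidirectional classes R and AL, and word length is counted in code points, whereas Tesseract works on entries of the unicharset.

```python
import unicodedata
from typing import Dict, Iterable, List

# Assumption: characters in these Unicode bidi classes count as RTL.
RTL_BIDI_CLASSES = {"R", "AL"}


def has_rtl(word: str) -> bool:
    """True if any character of word is right-to-left under the assumption above."""
    return any(unicodedata.bidirectional(ch) in RTL_BIDI_CLASSES for ch in word)


def apply_reverse_policy(word: str, policy: int) -> str:
    """Mimic -r: 0 leaves the word alone, 1 reverses words with an RTL
    character, 2 reverses every word. Reversal here is by code point."""
    if policy == 2 or (policy == 1 and has_rtl(word)):
        return word[::-1]
    return word


def bucket_by_length(words: Iterable[str], short: int, long: int) -> Dict[int, List[str]]:
    """Mimic -l: group words by length, one bucket per length in short..long.
    Words outside that range are dropped in this sketch."""
    buckets: Dict[int, List[str]] = {n: [] for n in range(short, long + 1)}
    for word in words:
        if short <= len(word) <= long:
            buckets[len(word)].append(word)
    return buckets


if __name__ == "__main__":
    for w in ["abc", "\u05e9\u05dc\u05d5\u05dd"]:
        print([apply_reverse_policy(w, p) for p in (0, 1, 2)])
    print(bucket_by_length(["a", "to", "cat", "dog", "house"], 2, 3))
```

       Each bucket returned by bucket_by_length corresponds to the word set of one of the several DAWGs that -l writes into the output file.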

ARGUMENTS

       WORDLIST A plain text file in UTF-8, one word per line.

       DAWG The output DAWG to write.

       lang.unicharset The unicharset of the language. This is the unicharset generated by mftraining(1).
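
       A WORDLIST in the expected form (UTF-8, one word per line) can be prepared with a short Python helper, sketched below. The helper names
       and the file names in the demo are hypothetical, and sorting and de-duplicating the words is an optional tidy-up chosen for this sketch,
       not a requirement stated on this page. The argument order follows the first form in the SYNOPSIS.

```python
from pathlib import Path
from typing import Iterable, List


def write_wordlist(words: Iterable[str], path: str) -> int:
    """Write words as UTF-8 text, one per line; return how many were written."""
    # Strip surrounding whitespace, drop empty entries, remove duplicates.
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return len(unique)


def wordlist2dawg_argv(wordlist: str, dawg: str, unicharset: str) -> List[str]:
    """Build the command line: wordlist2dawg WORDLIST DAWG lang.unicharset."""
    return ["wordlist2dawg", str(wordlist), str(dawg), str(unicharset)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    count = write_wordlist(["cat", "dog", "cat", " bird "], "words.txt")
    print(count, "words written")
    print(" ".join(wordlist2dawg_argv("words.txt", "words.dawg", "lang.unicharset")))
```

       Where the tool is installed, the list returned by wordlist2dawg_argv can be passed to subprocess.run(argv, check=True) to run the
       conversion.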

SEE ALSO

       tesseract(1), combine_tessdata(1), dawg2wordlist(1)

       http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

COPYING

       Copyright (C) 2006 Google, Inc. Licensed under the Apache License, Version 2.0

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

								    02/09/2012							  WORDLIST2DAWG(1)


DAWG2WORDLIST(1)														  DAWG2WORDLIST(1)

NAME

       dawg2wordlist - convert a Tesseract DAWG to a wordlist

SYNOPSIS

       dawg2wordlist UNICHARSET DAWG WORDLIST

DESCRIPTION

       dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key.

OPTIONS

       UNICHARSET The unicharset of the language. This is the unicharset generated by mftraining(1).

       DAWG The input DAWG, created by wordlist2dawg(1).

       WORDLIST The output file: plain text in UTF-8, one word per line.
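
       One way to check a round trip through wordlist2dawg(1) and dawg2wordlist(1) is to compare the original and recovered wordlists as sets.
       The Python sketch below uses hypothetical helper names and assumes both files follow the one-word-per-line UTF-8 format described above;
       it does not account for words that were reversed with -r when the DAWG was built.

```python
from pathlib import Path
from typing import Set, Tuple


def read_wordlist(path: str) -> Set[str]:
    """Read a UTF-8 wordlist (one word per line) into a set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_wordlists(original: str, recovered: str) -> Tuple[Set[str], Set[str]]:
    """Return (words missing from recovered, words only in recovered)."""
    a = read_wordlist(original)
    b = read_wordlist(recovered)
    return a - b, b - a
```

       Two empty sets mean that both files contain the same words, regardless of line order or duplicates.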

SEE ALSO

       tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1)

       http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

COPYING

       Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).

								    02/09/2012							  DAWG2WORDLIST(1)
