WORDLIST2DAWG(1)    WORDLIST2DAWG(1)
NAME
wordlist2dawg - convert a wordlist to a DAWG for Tesseract
SYNOPSIS
wordlist2dawg WORDLIST DAWG lang.unicharset
wordlist2dawg -t WORDLIST DAWG lang.unicharset
wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset
wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset
wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset
DESCRIPTION
wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph (DAWG) for use with Tesseract. A DAWG is a compressed, space- and time-efficient representation of a word list.
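To illustrate why a DAWG is more compact than a plain word list or trie, the following is a minimal Python sketch, not Tesseract's implementation. It builds a trie and then merges identical subtrees so that shared suffixes are stored once. The node layout and the function names (build_trie, minimize, count_nodes, contains) are assumptions made for this example only.

```python
def build_trie(words):
    """Build a plain trie; each node is {'end': bool, 'kids': {char: node}}."""
    root = {"end": False, "kids": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["kids"].setdefault(ch, {"end": False, "kids": {}})
        node["end"] = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up and return the canonical node."""
    for ch, child in list(node["kids"].items()):
        node["kids"][ch] = minimize(child, registry)
    # Two nodes are equivalent when they agree on 'end' and on every
    # (character, canonical child) pair.
    signature = (
        node["end"],
        tuple(sorted((ch, kid["id"]) for ch, kid in node["kids"].items())),
    )
    if signature not in registry:
        node["id"] = len(registry)
        registry[signature] = node
    return registry[signature]


def count_nodes(node, seen=None):
    """Count the distinct nodes reachable from `node`."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(kid, seen) for kid in node["kids"].values())


def contains(root, word):
    """Return True if `word` is stored in the graph rooted at `root`."""
    node = root
    for ch in word:
        node = node["kids"].get(ch)
        if node is None:
            return False
    return node["end"]


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
trie_size = count_nodes(trie)   # counted before minimize() rewires the trie
dawg = minimize(trie, {})
dawg_size = count_nodes(dawg)
print(trie_size, dawg_size)     # 8 5
```

In this sample the branches under "ta" and "to" are identical, so the minimized graph keeps a single copy of them while still accepting exactly the same four words.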
OPTIONS
-t Verify that a given DAWG file is equivalent to a given wordlist.
-r 1 Reverse a word if it contains an RTL (right-to-left) character.
-r 2 Reverse all words.
-l <short> <long> Produce a file with several DAWGs in it, one each for words of length <short>, <short+1>, ..., <long>.
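The effect of the -r and -l options on a wordlist can be previewed with a short Python sketch. This is an illustration under stated assumptions, not the tool's code: an "RTL character" is approximated here by the Unicode bidirectional classes R and AL, words are reversed and measured by code point (Tesseract may work on unichars instead), and words outside the <short>..<long> range are simply dropped, which this page does not specify. The function names are hypothetical.

```python
import unicodedata

# Assumption: "RTL character" means Unicode bidi class R or AL.
RTL_CLASSES = {"R", "AL"}


def has_rtl(word):
    """Return True if any character of `word` has an RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def apply_reverse_policy(words, policy):
    """Policy 1 mirrors '-r 1', policy 2 mirrors '-r 2'; other values change nothing."""
    if policy == 2:
        return [w[::-1] for w in words]
    if policy == 1:
        return [w[::-1] if has_rtl(w) else w for w in words]
    return list(words)


def group_by_length(words, short, long):
    """Bucket words by length for short..long inclusive, as '-l' describes."""
    buckets = {n: [] for n in range(short, long + 1)}
    for w in words:
        if len(w) in buckets:   # assumption: out-of-range words are dropped
            buckets[len(w)].append(w)
    return buckets


hebrew = "\u05e9\u05dc\u05d5\u05dd"          # a four-letter Hebrew word
sample = ["abc", hebrew]
rtl_only = apply_reverse_policy(sample, 1)     # only the Hebrew word is reversed
all_reversed = apply_reverse_policy(sample, 2)
buckets = group_by_length(["a", "to", "tap", "taps"], 2, 3)
```

Running a wordlist through such a preview shows which entries -r 1 would touch and how many words each length bucket of -l would receive.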
ARGUMENTS
WORDLIST A plain text file in UTF-8, one word per line.
DAWG The output DAWG to write.
lang.unicharset The unicharset of the language. This is the unicharset generated by mftraining(1).
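A WORDLIST in the expected shape (UTF-8, one word per line) can be produced with a few lines of Python. The sketch below is one possible approach. Trimming whitespace, dropping empty entries, removing duplicates and sorting are assumptions of this example rather than requirements stated on this page, and the file name eng.wordlist is hypothetical.

```python
import os
import tempfile


def write_wordlist(words, path):
    """Write unique, non-empty words to `path` as UTF-8, one word per line."""
    cleaned = sorted({w.strip() for w in words if w.strip()})
    # newline="\n" keeps line endings identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in cleaned:
            handle.write(word + "\n")
    return cleaned


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eng.wordlist")   # hypothetical file name
    written = write_wordlist(["tap", " top ", "", "tap", "caf\u00e9"], path)
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
```

The resulting file can then be passed as the WORDLIST argument shown in the SYNOPSIS.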
SEE ALSO
tesseract(1), combine_tessdata(1), dawg2wordlist(1)
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
COPYING
Copyright (C) 2006 Google, Inc. Licensed under the Apache License, Version 2.0
AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).
02/09/2012 WORDLIST2DAWG(1)
DAWG2WORDLIST(1)    DAWG2WORDLIST(1)
NAME
dawg2wordlist - convert a Tesseract DAWG to a wordlist
SYNOPSIS
dawg2wordlist UNICHARSET DAWG WORDLIST
DESCRIPTION
dawg2wordlist(1) converts a Tesseract Directed Acyclic Word Graph (DAWG) to a list of words using a unicharset as key.
OPTIONS
UNICHARSET The unicharset of the language. This is the unicharset generated by mftraining(1).
DAWG The input DAWG, created by wordlist2dawg(1).
WORDLIST Plain text (output) file in UTF-8, one word per line.
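Because the output WORDLIST is plain UTF-8 text, it can be compared against the wordlist that was originally given to wordlist2dawg(1). The Python sketch below is a hedged illustration of such a round-trip check. It assumes words are compared as exact strings with order and duplicates ignored, and the function and file names are hypothetical.

```python
import os
import tempfile


def read_wordlist(path):
    """Read a UTF-8 wordlist (one word per line) into a set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_wordlists(original_path, roundtrip_path):
    """Return (missing, extra): words lost and words gained in the round trip."""
    original = read_wordlist(original_path)
    roundtrip = read_wordlist(roundtrip_path)
    return sorted(original - roundtrip), sorted(roundtrip - original)


# Self-contained demonstration with two small stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    original_path = os.path.join(tmp, "original.txt")     # hypothetical name
    roundtrip_path = os.path.join(tmp, "roundtrip.txt")   # hypothetical name
    with open(original_path, "w", encoding="utf-8") as handle:
        handle.write("tap\ntaps\ntop\n")
    with open(roundtrip_path, "w", encoding="utf-8") as handle:
        handle.write("top\ntap\n")
    missing, extra = compare_wordlists(original_path, roundtrip_path)
```

Two empty lists would indicate that both files contain the same set of words.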
SEE ALSO
tesseract(1), mftraining(1), wordlist2dawg(1), unicharset(5), combine_tessdata(1)
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
COPYING
Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0
AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and Google (2006-present).
02/09/2012 DAWG2WORDLIST(1)