djvu2hocr(1) [debian man page]

DJVU2HOCR(1)							 djvu2hocr manual						      DJVU2HOCR(1)

NAME

       djvu2hocr - DjVu to hOCR converter

SYNOPSIS

       djvu2hocr [option...] djvu-file

       djvu2hocr {--version | --help | -h}

DESCRIPTION

       djvu2hocr converts hidden text from a DjVu file to the hOCR[1] format.

OPTIONS

   Text segmentation options
       --word-segmentation=simple
	   Use the same word segmentation as found in the DjVu file.

	   This is the default.

       --word-segmentation=uax29
	   Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file.

   Other options
       --version
	   Output version information and exit.

       -h, --help
	   Display help and exit.

PORTABILITY

       djvu2hocr uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML document. For example,
       the control character BEL (^G, U+0007) is converted into the following HTML chunk: <span class="djvu_char" title="#x07"> </span>
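
       The escaping described above can be sketched in Python as follows. This is an illustration only, not djvu2hocr's actual code: the
       function names are hypothetical, the set of escaped characters is assumed to be everything outside the XML 1.0 Char production, and
       the two-digit lowercase hex title is extrapolated from the single BEL example.

```python
from html import escape


def is_xml_char(ch):
    """Return True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def encode_text(text):
    """Escape text for HTML/XML, wrapping unrepresentable characters.

    Assumption: the title attribute is "#x" plus at least two lowercase
    hex digits, as in the BEL example (title="#x07").
    """
    parts = []
    for ch in text:
        if is_xml_char(ch):
            # Ordinary character: only &, < and > need escaping.
            parts.append(escape(ch, quote=False))
        else:
            # Unrepresentable character: emit the djvu_char placeholder.
            parts.append(
                '<span class="djvu_char" title="#x%02x"> </span>' % ord(ch)
            )
    return "".join(parts)
```

       A consumer of the hOCR output could reverse the mapping by looking for spans with class djvu_char and decoding the title attribute.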

SEE ALSO

       djvu(1)

AUTHOR

       Jakub Wilk <jwilk@jwilk.net>
	   Author.

NOTES

	1. hOCR
	   http://docs.google.com/View?docid=dfxcv4vc_67g844kf

	2. Unicode Text Segmentation
	   http://unicode.org/reports/tr29/

djvu2hocr 0.7.9 						    03/10/2012							      DJVU2HOCR(1)

OCRODJVU(1)							  ocrodjvu manual						       OCRODJVU(1)

NAME

       ocrodjvu - OCR for DjVu files

SYNOPSIS

       ocrodjvu {-o | --save-bundled} output-djvu-file [option...] djvu-file

       ocrodjvu {-i | --save-indirect} index-djvu-file [option...] djvu-file

       ocrodjvu --save-script script-file [option...] djvu-file

       ocrodjvu --in-place [option...] djvu-file

       ocrodjvu --dry-run [option...] djvu-file

       ocrodjvu {--version | --help | -h | --list-engines | --list-languages}

DESCRIPTION

       ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files.

       The following OCR engines are supported:

       o   OCRopus[1] (internally, ocrodjvu calls ocroscript's recognize (or rec-tess) command, so that ultimately Tesseract acts as the OCR
	   backend).

       o   Cuneiform for Linux[2].

       o   Ocrad[3].

       o   GOCR[4].

       o   Stand-alone Tesseract[5].

OPTIONS

   OCR engine options
       -e, --engine=engine-id
	   Use this OCR engine. The default is 'ocropus' (OCRopus).

       --list-engines
	   Print list of available OCR engines.

   Options controlling output
       It is mandatory to use exactly one of the following options:

       -o, --save-bundled=output-djvu-file
	   Save OCR results as a bundled multi-page document into output-djvu-file.

       -i, --save-indirect=index-djvu-file
	   Save OCR results as an indirect multi-page document. Use index-djvu-file as the index file name; put the component files into the same
	   directory. The directory must exist and be writable.

       --save-script=script-file
	   Save a djvused script with OCR results into script-file.

       --in-place
	   Save OCR results in place.

	   (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --dry-run
	   Don't change any files; discard the OCR results.
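
       The rule that exactly one output option must be given can be checked before ocrodjvu is launched. The following Python sketch builds
       an argument list from the options documented above; the function name and its keyword parameters are hypothetical and are not part
       of ocrodjvu.

```python
def build_ocrodjvu_argv(djvu_file, *, save_bundled=None, save_indirect=None,
                        save_script=None, in_place=False, dry_run=False):
    """Build an ocrodjvu command line with exactly one output option.

    Hypothetical helper: only the option names come from the manual page.
    """
    chosen = []
    if save_bundled is not None:
        chosen.append("--save-bundled=" + save_bundled)
    if save_indirect is not None:
        chosen.append("--save-indirect=" + save_indirect)
    if save_script is not None:
        chosen.append("--save-script=" + save_script)
    if in_place:
        chosen.append("--in-place")
    if dry_run:
        chosen.append("--dry-run")
    # The manual requires exactly one of the output options.
    if len(chosen) != 1:
        raise ValueError(
            "exactly one output option is required, got %d" % len(chosen)
        )
    return ["ocrodjvu", chosen[0], djvu_file]
```

       The resulting list can be passed to subprocess.run() from the Python standard library.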

   Text segmentation options
       -t lines, --details=lines
	   Record location of every line. Don't record locations of particular words or characters.

	   This is the default for OCRopus 0.2. The option is ineffective with stand-alone Tesseract 2.0.

       -t words, --details=words
	   Record location of every line and every word. Don't record locations of particular characters.

	   This is the default for most OCR engines.

	   This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.

       -t chars, --details=chars
	   Record location of every line, every word and every character.

	   This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.

       --word-segmentation=simple
	   Consider each non-empty sequence of non-whitespace characters a single word.

	   This is the default, despite being linguistically incorrect.
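
	   A minimal Python sketch of this rule, assuming Python's notion of whitespace (the regular expression class \S) matches the one
	   ocrodjvu uses; the function name is hypothetical:

```python
import re


def simple_words(line):
    """Return (start, end, word) for each maximal run of non-whitespace."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]


# Punctuation stays attached to the neighbouring word ("stop,"), which is
# the kind of result that a linguistically aware segmenter would avoid.
print(simple_words("don't  stop, now"))
```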

       --word-segmentation=uax29
	   Use the Unicode Text Segmentation[6] algorithm to break lines into words.

	   This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended.

   Other options
       --clear-text
	   Remove existing hidden text if present in the pages not selected for OCR.

	   (Use this option to retain compatibility with ocrodjvu < 0.2.)

       --ocr-only
	   Don't save pages that were not processed.

       -l, --language=language-id
	   Set recognition language.  language-id is typically an ISO 639-2/T three-letter code.

	   For OCRopus, the default is 'eng' (English), unless the tesslanguage environment variable is set. For other OCR engines, the default is
	   always 'eng'.

       --list-languages
	   Print list of available languages for the currently selected OCR engine.

       --render=mask
	   Render only masks of page images.

	   This is the default.

       --render=foreground
	   Render only foreground layers of page images.

       --render=all
	   Render all layers of page images.

	   This option is necessary to OCR DjVu files with invalid foreground/background separation.

       -p, --pages=page-range
	   Specifies pages to process. page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a
	   contiguous range of pages (e.g. 37-42). Pages are numbered from 1.

	   The default is to process all pages.
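
	   The page-range syntax can be illustrated with a short Python parser. This is a sketch of the grammar described above, not
	   ocrodjvu's own parser; the function name is hypothetical, and keeping duplicates and input order is an assumption.

```python
def parse_page_range(spec):
    """Expand a page-range such as "17,37-42" into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        first, sep, last = part.partition("-")
        start = int(first)                 # raises ValueError on bad input
        end = int(last) if sep else start  # single page: end equals start
        # Pages are numbered from 1, and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError("invalid sub-range: %r" % part)
        pages.extend(range(start, end + 1))
    return pages
```

	   For example, parse_page_range("17,37-42") yields pages 17 and 37 through 42.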

       -j, --jobs=n
	   Start up to n OCR processes.

       --version
	   Output version information and exit.

       -h, --help
	   Display help and exit.

   Advanced options
       -D, --debug
	   To ease debugging, don't delete intermediate files.

       -X key=value
	   This option allows you to control some details of how ocrodjvu operates.

       --on-error=abort
	   Stop program execution when an exceptional situation (e.g., malformed output from the OCR engine, internal ocrodjvu error, etc.) occurs.

	   This is the default.

       --on-error=resume
	   Attempt to recover from exceptional situations.

	   This option is strongly discouraged.

       --html5
	   Use an HTML5 parser[7], which is more robust but slower than the default parser.

ENVIRONMENT

       The following environment variables affect ocrodjvu:

       tesslanguage
	   Recognition language for Tesseract.

	   (Use of this variable is deprecated in favor of the --language option.)

       TMPDIR
	   ocrodjvu makes heavy use of temporary files. It will store them in a directory specified by this variable. The default is /tmp.

BUGS

       Tesseract 3.00 is affected by a bug[8] that makes it produce invalid hOCR output in certain circumstances. ocrodjvu does not try to recover
       from this fault (which couldn't be done reliably anyway) unless you pass the -X fix-html=1 option.

       When using Tesseract >= 3.00, extracting bounding boxes of particular characters (which happens when either --details=chars or
       --word-segmentation=uax29 is used) is inefficient. This is due to limitations of the Tesseract command-line interface.

SEE ALSO

       djvu(1), ocroscript(1), tesseract(1), cuneiform(1), ocrad(1), gocr(1)

AUTHOR

       Jakub Wilk <jwilk@jwilk.net>
	   Author.

NOTES

	1. OCRopus
	   http://ocropus.googlecode.com/

	2. Cuneiform for Linux
	   http://launchpad.net/cuneiform-linux

	3. Ocrad
	   http://www.gnu.org/software/ocrad/

	4. GOCR
	   http://jocr.sourceforge.net/

	5. Tesseract
	   http://code.google.com/p/tesseract-ocr/

	6. Unicode Text Segmentation
	   http://unicode.org/reports/tr29/

	7. HTML5 parser
	   http://www.whatwg.org/specs/web-apps/current-work/#html-parser

	8. http://code.google.com/p/tesseract-ocr/issues/detail?id=376

ocrodjvu 0.7.9							    03/10/2012							       OCRODJVU(1)
