Working with OCR text inside PDF files
Post 302277217 by dorcas on Thursday 15th of January 2009 05:14:09 PM

I'm trying to find a way to automate cleanup of OCR for a large number of scanned pages. Because of limitations in the access mechanism where these will end up, I need to create PDF files that include the background text for searching.

Going in, I have TIFF images too dirty to OCR and re-keyed text that matches them page for page. From reading here I can see plenty of ways to turn the TIFF files into PDF; what I can't find is a way to put this text into the PDF file. I'm guessing this calls for some reverse-engineering of whatever mapping scheme PDF uses for the coordinates of words or characters. Does anyone know of a tool for getting access to this text, for writing as well as reading? I'm looking at pdftk, but so far all I can get is a dump of the "metadata" fields, not the text with position mapping...
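
For the reading half of that question, one possible starting point is a rough sketch like the one below. It assumes poppler's `pdftotext` is installed and that its `-bbox` mode prints one `<word xMin=... yMin=... xMax=... yMax=...>` element per line; the function name `pdf_word_boxes` is made up for illustration, and this does nothing for the writing half.

```shell
# Print "xMin yMin xMax yMax word" for each word found in a PDF.
# Assumption: poppler's pdftotext, whose -bbox mode emits lines such as
#   <word xMin=".." yMin=".." xMax=".." yMax="..">text</word>
# pdf_word_boxes is a hypothetical helper name, not part of any tool.
pdf_word_boxes() {
    # "-" as the output name sends the result to standard output
    pdftotext -bbox "$1" - |
    sed -n 's|.*<word xMin="\([^"]*\)" yMin="\([^"]*\)" xMax="\([^"]*\)" yMax="\([^"]*\)">\(.*\)</word>.*|\1 \2 \3 \4 \5|p'
}
```

It would be called as `pdf_word_boxes scan.pdf`, where `scan.pdf` is a placeholder. The words come out as they appear in the XHTML, so characters such as `&` will still be entity-encoded.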
 

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

looping a array inside inside ssh is not working, pls help

set -A arr a1 a2 a3 a4
# START
ssh -xq $Server1 -l $Username /usr/bin/ksh <<-EOS
integer j=0
for loop in ${arr}
do
printf "array - ${arr}\n"
(( j = j + 1 ))
j=`expr j+1`
done
EOS
# END
=========
This is not giving me correct output. I... (5 Replies)
Discussion started by: reldb
5 Replies

2. Homework & Coursework Questions

copy files inside a text file

Hi guys, I am new to this and hi to all. I need your help: I am trying to copy the files that are listed inside file.txt. The files inside file.txt are in the order below: file1.log file2.log file3.log ....... I want to copy these files to an output directory. Please help (1 Reply)
Discussion started by: hc17972
1 Replies
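
A minimal sketch of one way to do this, assuming file.txt holds one path per line; the function name `copy_listed_files` and the directory name used in the usage line are placeholders, not anything from the thread.

```shell
# Copy every file named in a list (one path per line) into a target directory.
# copy_listed_files is a hypothetical helper name used only for this sketch.
copy_listed_files() {
    list=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "|| [ -n "$name" ]" also handles a last line without a trailing newline
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue          # ignore blank lines
        if [ -f "$name" ]; then
            cp -- "$name" "$dest"/
        else
            printf 'skipping missing file: %s\n' "$name" >&2
        fi
    done < "$list"
}
```

Usage would look like `copy_listed_files file.txt output_dir`. Reading the list with `while read` instead of expanding it on a command line keeps names containing spaces intact.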

4. Shell Programming and Scripting

Searching for a string in .PDF files inside .RAR & .ZIP archives.

Hi, I have got a large number of .PDF files that are archived in .RAR & ZIP files in various directories and I would like to search for strings inside the PDF files. I would think you would need something that can recursively read directories, extract the .RAR/.ZIP file in memory, read the... (3 Replies)
Discussion started by: lewk
3 Replies

5. Programming

Is it possible to change search inside .pdf or .doc files?

The title was wrong ... the correct one is: Is it possible to search for words inside .pdf or .doc files? Is it possible if I change the word into its binary representation? That way takes a huge amount of code, the most I have ever seen, just to read one word, so are there any other ways? ... (1 Reply)
Discussion started by: fwrlfo
1 Replies

6. UNIX for Dummies Questions & Answers

Pdftotext from multiple pdf files to a single text file

I have a directory containing a number of PDF files. I want to convert all the files to text, stored in a single text file. The following creates multiple text files:
ls *.pdf | xargs -n1 pdftotext
(1 Reply)
Discussion started by: kristinu
1 Replies
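
One possible approach, sketched under the assumption that the installed `pdftotext` accepts `-` as the output name and then writes to standard output: loop over the PDFs and append everything to one file. The function name `pdfs_to_one_text` and the output name `all.txt` are placeholders.

```shell
# Concatenate the text of every PDF in the current directory into one file.
# Assumes pdftotext (poppler-utils or xpdf) is on PATH.
# pdfs_to_one_text is a hypothetical helper name used only for this sketch.
pdfs_to_one_text() {
    out=$1
    : > "$out"                      # start with an empty output file
    for f in ./*.pdf; do
        [ -e "$f" ] || continue     # skip the literal pattern when nothing matches
        pdftotext "$f" - >> "$out"  # "-" sends the extracted text to stdout
    done
}
```

Called as `pdfs_to_one_text all.txt`, this appends the files in the shell's glob order, which is normally alphabetical.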

7. Shell Programming and Scripting

Converting secured pdf files to pdf using acroread

Does anybody have an idea how to convert secured PDF files to PDF using acroread?
---------- Post updated at 04:49 PM ---------- Previous update was at 04:44 PM ----------
This file is not password protected. (4 Replies)
Discussion started by: Soham
4 Replies

8. Shell Programming and Scripting

Check if files inside a text file are found under a directory

Hi all, Please somebody help me with this: I want to check whether the files listed in a text file are found under a directory or not. For example, the file is list_of_files.txt, which contains these rows:
# cat list_of_files
logs
errors
paths
debug
#
I want to check if these... (3 Replies)
Discussion started by: arrals_vl
3 Replies
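
A rough sketch of such a check, assuming the list holds one bare file name per line and that a match anywhere below the directory counts as found; `check_listed_files` is a made-up name for illustration.

```shell
# Report, for each name in a list file, whether it exists anywhere under a directory.
# check_listed_files is a hypothetical helper name used only for this sketch.
# Note: find -name treats the name as a glob pattern, so names containing
# characters such as * or ? would need escaping.
check_listed_files() {
    list=$1
    dir=$2
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue          # ignore blank lines
        if [ -n "$(find "$dir" -name "$name" -print 2>/dev/null | head -n 1)" ]; then
            printf '%s: found\n' "$name"
        else
            printf '%s: missing\n' "$name"
        fi
    done < "$list"
}
```

Usage would be along the lines of `check_listed_files list_of_files.txt /some/dir`, with the directory being whatever tree needs checking.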

9. Shell Programming and Scripting

OCR text that needs cleaning

Hi, I have OCR'ed text that needs cleaning. Lines are delimited by parts of speech (POS); for example, each line will have either an adj. OR s. f. OR s. m. etc. I need to uppercase all text before the POS, but all text within parentheses should be lowercase. Text after (and including) the POS... (6 Replies)
Discussion started by: safran
6 Replies
PS2PDF(1)                          Ghostscript                          PS2PDF(1)

NAME
       ps2pdf - Convert PostScript to PDF using ghostscript
       ps2pdf12 - Convert PostScript to PDF 1.2 (Acrobat 3-and-later compatible) using ghostscript
       ps2pdf13 - Convert PostScript to PDF 1.3 (Acrobat 4-and-later compatible) using ghostscript

SYNOPSIS
       ps2pdf [options...] {input.[e]ps|-} [output.pdf|-]
       ps2pdf12 [options...] {input.[e]ps|-} [output.pdf|-]
       ps2pdf13 [options...] {input.[e]ps|-} [output.pdf|-]

DESCRIPTION
       The ps2pdf scripts are work-alikes for nearly all the functionality (but not the user interface) of Adobe's Acrobat(TM) Distiller(TM) product: they convert PostScript files to Portable Document Format (PDF) files.

       If the output filename is not specified, the output is placed in a file of the same name with a '.pdf' extension. Either the input filename or the output filename can be '-' to request reading from stdin or writing to stdout, respectively, when used as a filter.

       The three scripts differ as follows:

       - ps2pdf12 will always produce PDF 1.2 output (Acrobat 3-and-later compatible).

       - ps2pdf13 will always produce PDF 1.3 output (Acrobat 4-and-later compatible).

       - ps2pdf per se currently produces PDF 1.4 output. However, this may change in the future. If you care about the compatibility level of the output, use ps2pdf12 or ps2pdf13, or use the -dCompatibilityLevel=1.x switch in the command line.

       There are some limitations in ps2pdf's conversion. See the HTML documentation for more information. A large number of Adobe Distiller(TM) parameters which can be used to control the conversion are also documented there, including instructions for generating PDF/X and PDF/A documents.

OPTIONS
       The ps2pdf scripts use the same options as gs(1).

EXAMPLES
       Converting a figure.ps to figure.pdf:

              ps2pdf figure.ps

       A conversion with more specifics:

              ps2pdf -dPDFSETTINGS=/prepress figure.ps proof.pdf

       Converting as part of a pipe:

              make_report.pl -t ps | ps2pdf -dCompatibilityLevel=1.3 - - | lpr

SEE ALSO
       gs(1), ps2pdfwr(1), Ps2pdf.htm in the Ghostscript documentation

BUGS
       See http://bugs.ghostscript.com/ and the Usenet news group comp.lang.postscript.

VERSION
       This document was last revised for Ghostscript version 8.70.

AUTHOR
       Artifex Software, Inc. are the primary maintainers of Ghostscript. This manpage by George Ferguson.

8.70                               31 July 2009                         PS2PDF(1)
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.