Working with OCR text inside PDF files


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Working with OCR text inside PDF files
# 1  
Old 01-15-2009
Working with OCR text inside PDF files

I'm trying to find a way to automate cleanup of OCR for a large number of scanned pages - due to limitations of the access mechanism where these are to end up, I need to create pdf files that include the background text for searching.

Going in I have Tif images too dirty to OCR and re-keyed text that matches page for page. I can see from reading here plenty of ways to turn the Tif files into pdf, what I can't find is a way to stick this text into the pdf file - I'm guessing this calls for some reverse-engineering of what ever mapping scheme pdf uses for the coordinates of words or characters. Does anyone know of a tool for getting access to this text - writing as well as reading. I'm looking at pdftk but so far all I can get is a dump of the "metadata" fields, but not the text with position mapping...
# 2  
Old 01-16-2009
Are you looking to map each word in the manually generated page text to its corresponding position in the OCR image of the page?
# 3  
Old 01-16-2009
Yes

There is an xml based metadata standard for this called METS-ALTO but what I'm trying to get at is the proprietary one that is inside of a pdf - the piece of the pdf file that is created when you run OCR within Adobe Acrobat.
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

OCR text that needs cleaning

Hi, I have OCR'ed text that needs cleaning. Lines are delimited by parts of speech (POS), for example, each line will have either an adj. OR s. f. OR s. m. etc I need to uppercase all text before the POS but all text within parentheses to be lowercase Text after (and including) the POS... (6 Replies)
Discussion started by: safran
6 Replies

2. Shell Programming and Scripting

Check if files inside a text file are found under a directory

Hi all, Please somebody help me with this: I want to check if the files listed in a text file, are found under a directory or not. For example: the file is list_of_files.txt, which contains inside this rows: # cat list_of_files logs errors paths debug # I want to check if these... (3 Replies)
Discussion started by: arrals_vl
3 Replies

3. Shell Programming and Scripting

Converting secured pdf files to pdf using acroread

Does anybody have idea of Converting secured pdf files to pdf using acroread ? ---------- Post updated at 04:49 PM ---------- Previous update was at 04:44 PM ---------- This file is not password protected. (4 Replies)
Discussion started by: Soham
4 Replies

4. UNIX for Dummies Questions & Answers

Pdftotext from multiple pdf files to a single text file

I have a directory having a number of pdf files. I want to convert all the files to text, stored in a single text file The following creates multiple text files ls *.pdf | xargs -n1 pdftotext (1 Reply)
Discussion started by: kristinu
1 Replies

5. Programming

Is it possible to change search inside .pdf or .doc files?

the titele was wrong ... the true one is: Is it possible to search words inside .pdf or .doc files? is it possible if i changed the word into binary combination:eek:? and this way is super too hyper huge of greatest codes i ever seen:D to read only 1 word so is there any other ways:confused:? ... (1 Reply)
Discussion started by: fwrlfo
1 Replies

6. Shell Programming and Scripting

Searching for a string in .PDF files inside .RAR & .ZIP archives.

Hi, I have got a large number of .PDF files that are archived in .RAR & ZIP files in various directories and I would like to search for strings inside the PDF files. I would think you would need something that can recursively read directories, extract the .RAR/.ZIP file in memory, read the... (3 Replies)
Discussion started by: lewk
3 Replies

7. Homework & Coursework Questions

copy files inside a text file

Hi Guys , I am new to this and Hi to all ,Need your help I am trying to copy Files which are inside file.txt The files inside file.txt are inthe below order file1.log file2.log file3.log ....... I want to copy these files to an output Directory , Please help (1 Reply)
Discussion started by: hc17972
1 Replies

8. Homework & Coursework Questions

copy files inside a text file

Hi Guys , I am new to this and Hi to all ,Need your help I am trying to copy Files which are inside file.txt The files inside file.txt are inthe below order file1.log file2.log file3.log ....... I want to copy these files to an output Directory , Please help (1 Reply)
Discussion started by: hc17972
1 Replies

9. Shell Programming and Scripting

looping a array inside inside ssh is not working, pls help

set -A arr a1 a2 a3 a4 # START ssh -xq $Server1 -l $Username /usr/bin/ksh <<-EOS integer j=0 for loop in ${arr} do printf "array - ${arr}\n" (( j = j + 1 )) j=`expr j+1` done EOS # END ========= this is not giving me correct output. I... (5 Replies)
Discussion started by: reldb
5 Replies
Login or Register to Ask a Question