01-15-2009
Working with OCR text inside PDF files
I'm trying to automate OCR cleanup for a large number of scanned pages. Because of limitations in the access mechanism where these are to end up, I need to create PDF files that include the background text for searching.
Going in, I have TIFF images that are too dirty to OCR, plus re-keyed text that matches them page for page. From reading here I can see plenty of ways to turn the TIFF files into PDF; what I can't find is a way to put this text into the PDF file. I'm guessing this calls for some reverse-engineering of whatever mapping scheme PDF uses for the coordinates of words or characters. Does anyone know of a tool that gives access to this text, for writing as well as reading? I'm looking at pdftk, but so far all I can get is a dump of the "metadata" fields, not the text with its position mapping...
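For reference, the "mapping scheme" is fairly plain: text in a PDF page is placed by operators in the page's content stream, with positions given in points (1/72 inch) measured from the bottom-left corner of the page, and text rendering mode 3 (`3 Tr`) places text without painting it, which is how a searchable background layer is usually hidden. Below is a minimal sketch, not a finished tool, that writes a one-page PDF containing only invisible text. The function names, the fixed starting position, and the page size are all assumptions made for illustration; it does not embed the TIFF image, does not position individual words, and only handles plain Latin-1 text.

```python
def pdf_escape(text: str) -> bytes:
    """Escape the characters that are special inside a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return escaped.encode("latin-1", errors="replace")


def invisible_text_stream(lines, x=72, y=720, font_size=10, leading=12) -> bytes:
    """Build a page content stream that places text without painting it."""
    ops = [
        b"BT",                      # begin text object
        b"/F1 %d Tf" % font_size,   # select font resource F1
        b"3 Tr",                    # rendering mode 3: neither fill nor stroke
        b"%d TL" % leading,         # line spacing used by T*
        b"%d %d Td" % (x, y),       # start position in points from bottom-left
    ]
    for line in lines:
        ops.append(b"(" + pdf_escape(line) + b") Tj T*")  # show line, move down
    ops.append(b"ET")               # end text object
    return b"\n".join(ops)


def build_pdf(lines, width=612, height=792) -> bytes:
    """Assemble a minimal single-page PDF (hypothetical helper, text layer only)."""
    content = invisible_text_stream(lines)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>" % (width, height),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))    # byte offset of each object for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result of `build_pdf(page_text.splitlines())` to a file in binary mode gives a page that looks blank but whose text can be searched and copied. It would still need to be laid over the PDF made from the matching TIFF; pdftk's `stamp` or `background` operations might serve for that step, though that combination is untested here.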