Limitations of 'pdftotext' in Linux...


 
# 15  
Old 11-21-2019
If it helps, I can run the PDF_Checker on a PDF from this same trading partner that actually processes properly. For that file, pdftotext creates a readable text file, and FILE -> SAVE AS TEXT inside Acrobat works as well. The trading partner has since updated their PDF, and this latest one is the result. Maybe a comparison between old and new would point to the answer. Thanks again for the help.
# 16  
Old 11-21-2019
Thanks for the update. Did you try Jim's suggestion in this post?

Code:
https://www.unix.com/303041312-post2.html

# 17  
Old 11-22-2019
Also, according to the pdftotext man page:

https://www.unix.com/man-page/linux/1/pdftotext/

Code:
BUGS

       Some  PDF  files  contain  fonts whose encodings have been mangled beyond recognition.  There is no way (short of OCR) to extract text from
       these files.

Code:
EXIT CODES

       The Xpdf tools use the following exit codes:

       0      No error.

       1      Error opening a PDF file.

       2      Error opening an output file.

       3      Error related to PDF permissions.

       99     Other error.
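If you script the conversion, it may help to record the exit status along with anything pdftotext writes to stderr. Here is a minimal sketch, assuming `pdftotext` is on the PATH; the helper names `describe_exit` and `convert` are made up for illustration, and the table only repeats the man page wording above.

```python
import subprocess

# Exit codes exactly as listed in the man page excerpt above.
EXIT_MEANINGS = {
    0: "No error.",
    1: "Error opening a PDF file.",
    2: "Error opening an output file.",
    3: "Error related to PDF permissions.",
    99: "Other error.",
}


def describe_exit(code: int) -> str:
    """Translate a pdftotext exit status into the man page wording."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit code {code}.")


def convert(pdf_path: str, txt_path: str) -> tuple[int, str]:
    """Run pdftotext (assumed to be installed) and return (exit code, stderr)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, txt_path],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr.strip()
```

Calling `convert("bad.pdf", "bad.txt")` (hypothetical file names) and passing the first value to `describe_exit` gives a line you can log. Keep in mind that an exit code of 0 only says the tool ran without a reported error; the text it wrote may still be empty or unreadable.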

This suggests that the fonts are the first place to look, since the BUGS section quoted above says that some PDF files contain fonts whose encodings are too mangled for text extraction (short of OCR).

Did you check the file and list all the fonts and compare that list of fonts to a working PDF file (which converts to text properly)?
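One way to do that comparison is with `pdffonts`, which ships alongside pdftotext in the Xpdf and Poppler packages. The sketch below is only an illustration: the function names are invented, and the parser assumes the usual layout of a header line, a line of dashes, and then one font per line with the name in the first column (font names containing spaces would be cut short).

```python
import subprocess


def parse_font_names(pdffonts_output: str) -> set[str]:
    """Pull font names out of pdffonts output (layout assumed, see above)."""
    names = set()
    # Skip the column-header line and the line of dashes.
    for line in pdffonts_output.splitlines()[2:]:
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        # Subset fonts carry a tag prefix such as "ABCDEF+Helvetica".
        if "+" in name:
            name = name.split("+", 1)[1]
        names.add(name)
    return names


def list_fonts(pdf_path: str) -> set[str]:
    """Run pdffonts (assumed to be installed) on one PDF."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_font_names(result.stdout)


def compare_fonts(good: set[str], bad: set[str]) -> dict[str, set[str]]:
    """Show which fonts appear in only one of the two files."""
    return {
        "only_in_good": good - bad,
        "only_in_bad": bad - good,
        "shared": good & bad,
    }
```

For example, `compare_fonts(list_fonts("old.pdf"), list_fonts("new.pdf"))` with hypothetical file names. If the failing file lists no fonts at all, that would be consistent with a scanned, image-only PDF rather than a font problem.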
# 18  
Old 11-22-2019
Quote:
Originally Posted by Neo
Thanks for the update. Did you try Jim's suggestion in this post?

Code:
https://www.unix.com/303041312-post2.html

I did. I reported back that my version of Acrobat (apparently) does not have the accessibility tool. When I click on it, it shows that it's a "pro" feature that I do not have. But I have tried saving the document in Acrobat, and it saves under another filename without issue.

--- Post updated at 05:14 AM ---

Quote:
Originally Posted by Neo
Did you check the file and list all the fonts and compare that list of fonts to a working PDF file (which converts to text properly)?
I was just looking at that and comparing the old version to the new version. The PDF_checker for the old version (which DOES convert) says that there are font errors...

Code:
Fonts Results
    Errors:
        Uses Base 14 fonts not embedded in document: 
            Helvetica (1 instance)
            Helvetica-Bold (1 instance)

I'm in a bit of deep water here because I'm an application programmer and rarely lift the hood on PDF structure. On this project where I use 'pdftotext', I simply use the command line instructions, take my text file and move on. Once the utility doesn't work (for whatever reason), I'm at a loss. Given the size of the PDF (425 KB for the bad one compared to about 17 KB for the ones that work properly), my guess is that it's actually an image. Does the PDF_Checker information I posted earlier tell us that or not? Thanks again.
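A rough way to tell the two cases apart is to look at what pdftotext actually produced: almost nothing (which points toward a page image with no text layer) or plenty of characters that are not readable (which points toward an encoding problem). The function below is only a heuristic sketch; its name and thresholds are arbitrary assumptions, and it presumes the documents are mostly ASCII text.

```python
def classify_text(text: str, min_chars: int = 20, min_readable: float = 0.7) -> str:
    """Rough triage of pdftotext output.

    Returns "empty" (likely no text layer), "garbled" (likely an
    encoding problem), or "ok". Thresholds are arbitrary assumptions.
    """
    # Form feeds and newlines count as whitespace and are ignored here.
    visible = [ch for ch in text if not ch.isspace()]
    if len(visible) < min_chars:
        return "empty"
    punctuation = ".,;:'\"()-/$%#&@"
    readable = sum(
        1 for ch in visible if ch.isascii() and (ch.isalnum() or ch in punctuation)
    )
    if readable / len(visible) < min_readable:
        return "garbled"
    return "ok"
```

Feed it the contents of the text file that pdftotext wrote. As a cross-check, Poppler's `pdfimages -list` prints the images embedded in a PDF; a single large image per page would fit the scanned-document guess.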
# 19  
Old 11-22-2019
I have not looked into it, but I doubt that particular PDF checker checks for fonts that are incompatible with the Linux pdftotext utility.

My guess is that you will need to preprocess your PDF files and strip out any fonts which are causing issues or are not compatible with pdftotext.

Or... less likely,

You could instruct everyone who provides PDFs not to use unsupported fonts. LOL, but controlling users usually does not work, so that "administrative" option may not help and you will need a technical solution to preprocess.

What do you think?
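If preprocessing turns out to be impractical, the fallback the man page mentions is OCR. Note that this is a different technique from stripping fonts. Here is a minimal sketch, assuming Poppler's `pdftoppm` and the `tesseract` command line tool are installed; the function names and the `page` file prefix are made up for illustration.

```python
import subprocess
from pathlib import Path


def render_command(pdf_path: str, work_dir: str, dpi: int = 300) -> list[str]:
    """Build the pdftoppm call that renders each page to a PNG file."""
    prefix = str(Path(work_dir) / "page")
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def ocr_pdf(pdf_path: str, work_dir: str, dpi: int = 300) -> str:
    """Render the pages, then OCR each one with tesseract."""
    subprocess.run(render_command(pdf_path, work_dir, dpi), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for png in sorted(Path(work_dir).glob("page-*.png")):
        result = subprocess.run(
            ["tesseract", str(png), "stdout"],
            capture_output=True,
            text=True,
            check=True,
        )
        texts.append(result.stdout)
    # Join pages with a form feed, as pdftotext does between pages.
    return "\f".join(texts)
```

OCR output is rarely as clean as real text extraction, so anything that parses the result downstream would probably need to tolerate small recognition errors.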
# 20  
Old 11-22-2019
Also, I saw that bug report about font encodings being mangled beyond recognition. What does that suggest? That the fonts are unusual and unable to be picked up? I have seen that statement on a number of 'pdftotext' websites but I'm not sure what they're trying to say unless it just comes down to some fonts being unusable by the utility. The font in this particular PDF does not seem to be unusual but I have no real reference.
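For what it's worth, here is a toy illustration of what a "mangled encoding" can mean. A PDF stores text as character codes that the font turns into glyph shapes; to get real characters back, an extractor needs a table (such as a ToUnicode map) that says which character each code stands for. Everything below is a simplified, hypothetical model and not how pdftotext is implemented.

```python
# Hypothetical subset font: the page stores small character codes,
# and the font draws the right glyph shape for each one.
CODES_ON_PAGE = [1, 2, 3, 3, 4]

# The table an extractor needs in order to recover real characters.
GOOD_TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}


def extract(codes, to_unicode=None):
    """Mimic text extraction with or without a usable mapping table."""
    if to_unicode is None:
        # No table: all the extractor can do is emit the raw codes.
        return "".join(chr(c) for c in codes)
    # Codes missing from the table become the replacement character.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)
```

With the table, `extract(CODES_ON_PAGE, GOOD_TO_UNICODE)` gives back "Hello". Without it, the result is a run of control characters, even though a viewer would still draw the word correctly on screen. So the font can look perfectly ordinary and still be unusable for extraction.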

--- Post updated at 05:41 AM ---

I think we posted at the same time there. Yeah, asking the trading partner to conform to something is dicey to say the least. What I find unusual is that this structure has been in place for quite a while and, AFAIK, this is the first time that a PDF simply will not process using 'pdftotext'. That, along with the size, suggests that this particular PDF was created under unusual circumstances. What I need to do is tell my customer that this PDF is incompatible, but I would like to tell them WHY so that the trading partner might be able to do something different. I dislike mysteries and I don't like to say that something doesn't work without understanding why. It's definitely mysterious. Thanks again for the help. I appreciate it.
# 21  
Old 11-22-2019
Well, as you know, sometimes people find fancy fonts they like, and then they want to use them.

One approach is to extract / list the fonts in the PDF files and log them.

Then, over time, you can see which fonts are the offenders (assuming that is the cause).

Then you can find a way to preprocess the PDF to strip, change, or remove the fonts that pdftotext cannot handle.

Or you can get the source code for pdftotext and try to recompile it to support these new font families.

Naturally, the first step toward solving any problem is knowing what the problem is, and it sounds like you may have isolated it to fonts that pdftotext does not support.
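The logging idea could be sketched like this: keep one record per processed PDF with its font names and whether the conversion gave usable text, then count the fonts that only ever show up in failures. The record format and the name `tally_fonts` are assumptions for illustration, and how the font names get collected is left open.

```python
from collections import Counter


def tally_fonts(records):
    """Count fonts that appear only in PDFs that failed to convert.

    `records` is an iterable of (font_names, converted_ok) pairs,
    a hypothetical log format.
    """
    ok_fonts = set()
    failed_counts = Counter()
    for fonts, converted_ok in records:
        if converted_ok:
            ok_fonts.update(fonts)
        else:
            # Count each font once per failed document.
            failed_counts.update(set(fonts))
    # A font that also appears in a working PDF is probably not the culprit.
    return {
        font: count
        for font, count in failed_counts.items()
        if font not in ok_fonts
    }
```

With only one failing PDF so far this will not say much, but after a few months of logs the suspects should stand out, if fonts really are the cause.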