Limitations of 'pdftotext' in Linux... Post: 303041366

Post 303041366 by kenlenard on Thursday 21st of November 2019 11:14:45 PM

11-22-2019

Registered User

Quote:

Originally Posted by Neo

Thanks for the update. Did you try Jim's suggestion (linked below)?

Code:

https://www.unix.com/303041312-post2.html

I did. I reported back that my version of Acrobat apparently does not have the accessibility tool. When I click on it, it shows that it's a "Pro" feature that I do not have. I did try saving the document in Acrobat, though, and it saves under another filename without issue.

--- Post updated at 05:14 AM ---

Quote:

Originally Posted by Neo

Also, according to the pdftotext man page:

https://www.unix.com/man-page/linux/1/pdftotext/

Code:

BUGS

       Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from
       these files.

Code:

EXIT CODES

       The Xpdf tools use the following exit codes:

       0      No error.

       1      Error opening a PDF file.

       2      Error opening an output file.

       3      Error related to PDF permissions.

       99     Other error.
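
The exit codes listed above can be checked from a script instead of by eye. The following is a minimal sketch, assuming `pdftotext` is installed and on the PATH; the helper names `describe_exit_code` and `convert` are made up for illustration and are not part of Xpdf. Note that an exit code of 0 only means the tool reported no error, so the output file may still be empty.

```python
import subprocess
from typing import Tuple

# Exit codes exactly as listed in the man page excerpt above.
XPDF_EXIT_CODES = {
    0: "No error.",
    1: "Error opening a PDF file.",
    2: "Error opening an output file.",
    3: "Error related to PDF permissions.",
    99: "Other error.",
}


def describe_exit_code(code: int) -> str:
    """Translate an exit code into the man page wording (hypothetical helper)."""
    return XPDF_EXIT_CODES.get(code, f"Unrecognized exit code {code}.")


def convert(pdf_path: str, txt_path: str) -> Tuple[int, str]:
    """Run pdftotext on one file and return (exit code, description).

    Assumes the pdftotext binary is available on the PATH.
    """
    result = subprocess.run(["pdftotext", pdf_path, txt_path])
    return result.returncode, describe_exit_code(result.returncode)
```

Calling `convert("bad.pdf", "bad.txt")` (file names are placeholders) on both the failing and a working file would show whether the tool itself reports a problem or silently produces no text.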

This would indicate that the first place to look would be at the fonts, since the man page says:

BUGS -- Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.

Did you list all the fonts in the failing file and compare that list with the fonts in a working PDF file (one that converts to text properly)?
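
One possible way to do that comparison is with `pdffonts`, which ships alongside `pdftotext` in the Xpdf and Poppler tool sets. This is only a sketch: it assumes `pdffonts` is on the PATH, that its output starts with a header row followed by a row of dashes, and that the font name is the first whitespace-separated field on each line. The column layout differs between versions, so the parsing may need adjusting. The function names are made up for illustration.

```python
import subprocess
from typing import Set


def font_names(pdffonts_output: str) -> Set[str]:
    """Collect font names from pdffonts output (hypothetical helper).

    Skips everything up to and including the dashed rule under the header,
    then takes the first whitespace-separated field of each remaining line.
    """
    names = set()
    seen_rule = False
    for line in pdffonts_output.splitlines():
        if not seen_rule:
            seen_rule = line.startswith("---")
            continue
        if line.strip():
            names.add(line.split()[0])
    return names


def list_fonts(pdf_path: str) -> Set[str]:
    """Run pdffonts on a file and return the set of font names it reports."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, errors="replace"
    )
    return font_names(result.stdout)


def compare_fonts(working_pdf: str, failing_pdf: str) -> None:
    """Print the fonts that appear in only one of the two files."""
    good, bad = list_fonts(working_pdf), list_fonts(failing_pdf)
    print("Only in working file:", sorted(good - bad))
    print("Only in failing file:", sorted(bad - good))
```

If the failing file reports no fonts at all, that would be a hint that it contains no text objects to extract.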

I was just looking at that and comparing the old version to the new version. The PDF_Checker report for the old version (which DOES convert) says that there are font errors...

Code:

Fonts Results
    Errors:
        Uses Base 14 fonts not embedded in document: 
            Helvetica (1 instance)
            Helvetica-Bold (1 instance)

I'm in a bit of deep water here because I'm an application programmer and rarely lift the hood on PDF structure. On this project, where I use 'pdftotext', I simply run it from the command line, take my text file and move on. Once the utility doesn't work (for whatever reason), I'm at a loss. My guess is that the size of the PDF (425 KB for the bad one, compared to about 17 KB for the ones that convert properly) suggests that it's actually an image. Does the PDF_Checker information I posted earlier confirm that or not? Thanks again.
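
File size alone is not conclusive, so one rough way to test the "it's actually an image" guess is to check whether `pdftotext` produces any visible characters at all. The sketch below assumes `pdftotext` is on the PATH and uses `-` as the output file name to send the text to standard output; the function names and the `min_chars` threshold are made up for illustration. An empty result is consistent with a scanned, image-only PDF, but it could also come from the mangled font encodings the man page mentions.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return whatever pdftotext writes for the file ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        errors="replace",
    )
    return result.stdout


def has_extractable_text(text: str, min_chars: int = 1) -> bool:
    """True if the text has at least min_chars non-whitespace characters.

    Page breaks (form feeds) and newlines are ignored, since pdftotext can
    emit them even when no text was found on a page.
    """
    visible = "".join(text.split())
    return len(visible) >= min_chars
```

For example, `has_extractable_text(extract_text("bad.pdf"))` (the file name is a placeholder) returning `False` would point towards OCR as the next step.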

kenlenard

