How to read pdf file in UNIX environment?
Post 302291086 by stanleypane on Tuesday 24th of February 2009 04:10:04 PM
A better question would be:

Why would you routinely use PDF formatted files as input into a script?

If you're spending time trying to automate some process, then you should look at replacing PDF with some other kind of document. Or possibly generate an additional file alongside the PDFs for use by your script.

Sure, you can read PDF files from a command line, but it's rarely a good solution. If you must read PDFs from a Unix command line, see if your system has these commands:

pdf2txt
pdf2ps
ps2ascii

pdf2txt - converts from PDF to text
pdf2ps - converts from PDF to PostScript
ps2ascii - converts from PostScript to ASCII text

If you can't find pdf2txt, then you could try using the other two to do the same thing.
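
The fallback described above could be sketched roughly as follows. This is only an illustration, not a tested recipe: the function name pdf_to_text is made up for this example, it assumes mktemp is available on your system, and it assumes pdf2ps and ps2ascii behave as the Ghostscript versions do (pdf2ps input.pdf output.ps, ps2ascii input.ps).

```shell
# pdf_to_text FILE.pdf
# Prints the text of FILE.pdf on standard output.
# Hypothetical helper: prefers pdf2txt, otherwise chains pdf2ps and ps2ascii.
pdf_to_text() {
    pdf=$1
    if command -v pdf2txt >/dev/null 2>&1; then
        # Direct route: PDF to text in one step.
        pdf2txt "$pdf"
    elif command -v pdf2ps >/dev/null 2>&1 &&
         command -v ps2ascii >/dev/null 2>&1; then
        # Two-step route: PDF to PostScript, then PostScript to ASCII.
        tmp=$(mktemp) || return 1
        pdf2ps "$pdf" "$tmp" && ps2ascii "$tmp"
        status=$?
        rm -f "$tmp"
        return $status
    else
        echo "pdf_to_text: no converter found (pdf2txt, pdf2ps, ps2ascii)" >&2
        return 1
    fi
}
```

You would call it as pdf_to_text report.pdf > report.txt (report.pdf being whatever file you have). The two-step route usually loses more layout than pdf2txt does, so check the output before feeding it to a script.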
 

PDF2TXT(1)							  PDFMiner Manual							PDF2TXT(1)

NAME
    pdf2txt - extracts text contents of PDF files

SYNOPSIS
    pdf2txt [option...] file...

DESCRIPTION
    pdf2txt extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images that would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents when their access is restricted. You cannot extract any text from a PDF document which does not have extraction permission.

OPTIONS
    -o file
        Specifies the output file name. The default is to print the extracted contents to standard output in text format.

    -p pageno[,pageno,...]
        Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages.

    -c codec
        Specifies the output codec.

    -t type
        Specifies the output format. The following formats are currently supported:

        text    Text format. This is the default.
        html    HTML format. It is not recommended.
        xml     XML format. It provides the most information.
        tag     "Tagged PDF" format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF Reference, Sixth Edition[1] (§10.7 "Tagged PDF").

    -D writing-mode
        Specifies the writing mode of text outputs:

        lr-tb   Left-to-right, top-to-bottom.
        tb-rl   Top-to-bottom, right-to-left.
        auto    Determine writing mode automatically.

    -M char-margin, -L line-margin, -W word-margin
        These are the parameters used for layout analysis. In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks. Two text chunks whose distance is closer than the char-margin are considered continuous and are grouped into one. Also, two lines whose distance is closer than the line-margin are grouped as a text box, which is a rectangular area that contains a "cluster" of text portions. Furthermore, it may be required to insert blank characters (spaces) as necessary if the distance between two words is greater than the word-margin, as a blank between words might not be represented as a space, but indicated by the positioning of each word.

        Each value is specified not as an actual length, but as a proportion of the length to the size of each character in question. The default values are char-margin = 1.0, line-margin = 0.3, and word-margin = 0.2, respectively.

    -n
        Suppress layout analysis.

    -A
        Force layout analysis for all the text strings, including text contained in figures.

    -V
        Enable detection of vertical writing.

    -s scale
        Specifies the output scale. This option can be used in HTML format only.

    -m n
        Specifies the maximum number of pages to extract. By default, all the pages in a document are extracted.

    -P password
        Provides the user password to access PDF contents.

    -d
        Increase the debug level.

EXAMPLES
    Extract text as an HTML file whose filename is output.html:

        $ pdf2txt -o output.html samples/naacl06-shinyama.pdf

    Extract a Japanese HTML file in vertical writing:

        $ pdf2txt -c euc-jp -D tb-rl -o output.html samples/jo.pdf

    Extract text from an encrypted PDF file:

        $ pdf2txt -P mypassword -o output.txt secret.pdf

SEE ALSO
    dumppdf(1)

AUTHORS
    Jakub Wilk <jwilk@debian.org>
        Wrote this manual page for the Debian system.

    Yusuke Shinyama <yusuke@cs.nyu.edu>
        Author of PDFMiner and its original HTML documentation.

NOTES
    1. PDF Reference, Sixth Edition
       http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

pdf2txt                           08/24/2011                           PDF2TXT(1)
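
For scripted use, the page-selection and format options listed in this manual page can be wrapped in small helpers. The sketch below uses only flags documented above (-p, -t, -o, -m); the function names and file names are invented for illustration, and other PDFMiner releases may spell these options differently, so check the manual installed on your system first.

```shell
# extract_pages_xml FILE.pdf PAGES OUT.xml
# Writes the listed pages (comma-separated, numbered from one) as XML,
# the format the manual says carries the most information.
extract_pages_xml() {
    pdf2txt -p "$2" -t xml -o "$3" "$1"
}

# first_pages_text FILE.pdf N
# Prints plain text for at most the first N pages on standard output.
first_pages_text() {
    pdf2txt -m "$2" "$1"
}
```

For example, extract_pages_xml invoice.pdf 1,3 invoice.xml would run pdf2txt on pages 1 and 3 only, which keeps a script from processing a whole document when it needs just a header page.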