Pdf to xls or csv Linux

I want to convert a PDF to XLS or CSV. I used pdftotext to convert the PDF to text. A complication is that the original dataset contains embedded spaces and blank values. For example, there is a space between Gala and Apple in the Product field, and the Quantity field is sometimes empty (I put double quotes there to mark it, but in the real data it is just blank space with no characters).
Date Product Quantity Price
1/1/2013 Gala Apple 100 $1.00
1/2/2013 Gala Apple " " $1.00
1/3/2013 Gala Apple 200 $1.00

I want the final product to be a properly aligned Excel file. Therefore, I would need 'Gala Apple' to be in one cell and the blank cell to be preserved (or filled with a unique character signifying there was no data to begin with).

Anyone have a simple fix for this?

By the way, I'm not a sophisticated programmer. I mostly write shell scripts with AWK.



A free utility for converting PDF to text is certainly a useful step toward solving your problem. Have you actually tried it yet? Does it produce text like you have shown? Are you able to install or build it on your system? In short, what have you tried?

There's little point going further if it does not work or you're unable to do anything. It's far too easy for us to craft solutions that don't work given this minimal information, and we are not a discount coding warehouse. One step at a time.

In order to use awk, the source file cannot be a PDF; it has to be a text file. That is step 1, and you cannot do anything until that happens. There is no pdfawk-like software. You can buy Nitro or some other PDF editor, or you can use the Poppler API if you can write C code. Apparently those do not apply to you.
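For what it's worth, once the PDF is plain text, the awk step could look something like the sketch below. It assumes the text came from pdftotext run with its -layout option, that the columns line up under their header words, that header names contain no spaces, and that values are left-aligned under them; none of that is confirmed for your file. The function name layout_to_csv is made up for illustration. It reads the column start positions from the header line and cuts every row at those positions, so 'Gala Apple' stays in one cell and a blank Quantity comes out as an empty cell.

```sh
# Hypothetical helper: convert column-aligned text to CSV.
# Possible usage (file names are placeholders):
#   pdftotext -layout input.pdf - | layout_to_csv > output.csv
layout_to_csv() {
  awk '
    # Strip leading and trailing blanks from one cell.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

    # Skip lines that contain only whitespace.
    NF == 0 { next }

    # First non-blank line is taken to be the header: record the
    # 1-based position where each header word starts.
    n == 0 {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        p = (i == 1) ? " " : substr($0, i - 1, 1)
        if (c != " " && p == " ") start[++n] = i
      }
    }

    # Cut every line (header included) at the recorded positions.
    {
      line = ""
      for (k = 1; k <= n; k++) {
        if (k < n) cell = substr($0, start[k], start[k + 1] - start[k])
        else       cell = substr($0, start[k])
        cell = trim(cell)
        gsub(/"/, "\"\"", cell)   # double any quotes inside the cell
        line = line (k > 1 ? "," : "") "\"" cell "\""
      }
      print line
    }
  ' "$@"
}
```

Every cell is quoted, so a spreadsheet program should import the result without splitting 'Gala Apple'. If the real columns are right-aligned or shift from page to page, the cut positions will be wrong and the sketch would need adjusting.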

If you want real help, give a simple example of the input and the expected output. We already have what I think is the input.

PDF really isn't made to allow this to happen. A PDF can contain many types of content. Shoot, the spreadsheet data could be inside of an image. Attempting pdftotext or another program is probably your best bet, but only as a starting point, and even then, as I mentioned, it is not necessarily a foolproof solution.

With all of that said, if the pdf file is something that is regularly generated in the same way, maybe if a sample were posted somewhere, something could be created (maybe by someone here) to extract the data as a csv.

Recommendation: upload the sample PDF somewhere (or provide a link), and then let's see what is possible.

