Scanning a pdf file in Linux shell


 
# 1  
Old 09-12-2015
Scanning a pdf file in Linux shell

I want to search for a keyword in a set of PDF files and, whenever a file matches, write the title and author of that PDF to another file. How can I do this with a Linux shell script?
# 2  
Old 09-12-2015
Hello sk33,

Welcome to the forums. The following may help.
I- To search for a word, say test, in the current directory and everything below it:
Code:
 find . -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null

II- To search under a specific path instead:
Code:
 find /tmp/test/Singh/weekend -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null

Hope this helps. Welcome to the forum again, and have a nice weekend.

Thanks,
R. Singh
# 3  
Old 09-12-2015
Scanning a pdf file in Linux shell

But how do I print the title and author of the matched PDF files?

# 4  
Old 09-12-2015
Convert your PDF file to text with the command:

Code:
pdftotext

and then parse the title and author's name from the output.
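A minimal sketch of that idea, under some assumptions: it uses pdftotext for the keyword search, but reads the title and author from pdfinfo (shipped alongside pdftotext in poppler-utils) rather than parsing them out of the page text, since the document metadata is usually easier to read reliably. The function names, the keyword "test", and the output file matches.tsv are placeholders, not anything from this thread.

```shell
#!/bin/sh
# Sketch only: assumes pdftotext and pdfinfo (poppler-utils) are installed.
# Function names, "test", and matches.tsv are illustrative placeholders.

# Print the value of one "Key:   value" line from pdfinfo-style text on stdin.
info_field() {
    awk -v key="$1" -F': *' '$1 == key { sub(/^[^:]*: */, ""); print; exit }'
}

# Succeed if the text extracted from PDF $2 contains the fixed string $1
# (case-insensitive).
pdf_contains() {
    pdftotext "$2" - 2>/dev/null | grep -q -i -F -- "$1"
}

# Append "file<TAB>title<TAB>author" to $3 for each matching PDF under $2.
# File names containing newlines are not handled.
report_matches() {
    find "$2" -type f -name '*.pdf' | while IFS= read -r f; do
        pdf_contains "$1" "$f" || continue
        info=$(pdfinfo "$f" 2>/dev/null)
        title=$(printf '%s\n' "$info" | info_field Title)
        author=$(printf '%s\n' "$info" | info_field Author)
        printf '%s\t%s\t%s\n' "$f" "$title" "$author"
    done >> "$3"
}

# Example: report_matches "test" . matches.tsv
```

PDFs that carry no Title or Author metadata will simply produce empty columns; for those, parsing the extracted text by hand is the only option.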
# 5  
Old 09-12-2015
Hi.

Possibly:
Code:
pdfgrep search in pdf files for strings matching a regular expression
 Pdfgrep is a tool to search text in PDF files. It works similar to
 `grep'.
 .
 Features:
  - search for regular expressions.
  - support for some important grep options, including:
    + filename output.
    + page number output.
    + optional case insensitivity.
    + count occurrences.
  - and the most important feature: color output!

Seen in the repository for:
Code:
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.1 (jessie)

See also: https://pdfgrep.org/

Good luck ... cheers, drl

Last edited by drl; 09-13-2015 at 11:22 AM..
# 6  
Old 09-13-2015
Hi.

Here is a demonstration of pdfgrep:
Code:
#!/usr/bin/env bash

# @(#) s1	Demonstrate search PDF, regular expressions, pdfgrep.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C pdfgrep

FILE=${1-pdfgrep.pdf}

pl " Input data file $FILE (a sample pdf file, as created by pandoc):"
file "$FILE"

pl " Results:"
pdfgrep --color never "AUTHOR|NAME" "$FILE"

exit 0

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.1 (jessie) 
bash GNU bash 4.3.30
pdfgrep (local) 1.3.1

-----
 Input data file pdfgrep.pdf (a sample pdf file, as created by pandoc):
pdfgrep.pdf: PDF document, version 1.5

-----
 Results:
NAME pdfgrep - search pdf files for a regular expression
AUTHOR Hans-Peter Deifel

Of minor importance: the PDF is the man page for pdfgrep itself. It was created by having man write a text file and then having pandoc convert that file to PDF.

Note that protocomm's suggestion would allow you to use the full power of your native grep, which may be a significant advantage in some cases.
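For comparison, the pdftotext route might look roughly like the sketch below. It assumes pdftotext from poppler-utils is installed; the function name grep_pdf_text is made up for illustration. Because the match is done by the ordinary grep, any of its options (here -n for line numbers and -E for extended regular expressions) can be used.

```shell
#!/bin/sh
# Sketch only: assumes pdftotext (poppler-utils) is installed.
# grep_pdf_text is a hypothetical helper name.

# Print numbered lines of the text extracted from PDF $2 that match the
# extended regular expression $1. -layout asks pdftotext to keep the
# original physical layout of the page as far as it can.
grep_pdf_text() {
    pdftotext -layout "$2" - 2>/dev/null | grep -n -E -- "$1"
}

# Example: grep_pdf_text 'AUTHOR|NAME' pdfgrep.pdf
```

The line numbers refer to the extracted text, not to pages of the PDF.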

Best wishes ... cheers, drl

Last edited by drl; 09-13-2015 at 02:49 PM..
# 7  
Old 09-13-2015
Hi.

In reviewing this, I wonder whether the OP was interested in the PDF meta-information. The first and last lines of one rendition of a PDF look like:
Code:
%PDF-1.1
1 0 obj
<<
/CreationDate (D:20150913125458)
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)
   ---
/Root 2 0 R
/Info 1 0 R
>>
startxref
6452
%%EOF

In which case, a simple grep would probably suffice:
Code:
$ egrep 'Producer|Title' pdf-from-text2pdf.pdf
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)

as several responders here have suggested. I don't know enough about PDFs to say whether Producer is, or might be, the same as Author. However, some PDFs contain binary data, so grep might not work as desired on those.
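On the Producer question: in the PDF Info dictionary, /Producer names the software that generated the PDF, while /Author is a separate, optional entry for the person who wrote the document, so the two are not interchangeable. A rough sketch for pulling such entries out of the raw file follows. It assumes GNU or BSD grep (for -a, which treats binary input as text), and it only works when the Info dictionary is stored uncompressed with each value as a plain (...) string on its own line, as in the listing above. The function name raw_info_entry is hypothetical.

```shell
#!/bin/sh
# Sketch only: works for uncompressed Info dictionaries whose values are
# plain (...) strings on a single line. raw_info_entry is a made-up name.

# Print the first value of entry $1 (e.g. Title, Author, Producer) found in
# raw PDF bytes on stdin. grep -a keeps grep from bailing out on binary data.
raw_info_entry() {
    LC_ALL=C grep -a "^/$1 (" |
        LC_ALL=C sed -e "s|^/$1 (||" -e 's|)[[:space:]]*$||' |
        head -n 1
}

# Example: raw_info_entry Title < pdf-from-text2pdf.pdf
```

Many PDFs store this dictionary in compressed or encoded form, in which case nothing is printed and a metadata-aware tool is needed instead.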

Best wishes ... cheers, drl