Script to batch pdfjoin based on pdfgrep output

03-06-2012

Registered User

3, 0

Join Date: Mar 2012

Last Activity: 9 March 2012, 9:25 PM EST

Posts: 3

Thanks Given: 0

Thanked 0 Times in 0 Posts

Script to batch pdfjoin based on pdfgrep output

I have a situation in which I'm given a bunch of pdf files which are all single pages with employee ID's on an independent line. I need to collate all of the pages by employee ID.

Piecemeal, I can find a particular employee ID by just using pdfgrep.

I could also do something like this:

find . -name "*.pdf" -print0 | xargs -0 -I FILENAME bash -c "if { pdftotext FILENAME - | grep -q <IDnumberHere>; } ; then echo FILENAME; fi"

Searching for an employee ID with pdfgrep will list all files containing the employee ID as well as the text content of the whole line which contains the employee ID (This is the same for each employee; the ID number is all that changes.)

The task of joining these files together with pdfjoin is quite simple.

However, I'm a novice at writing bash scripts. Doing all of this piecemeal takes longer than just shuffling actual pages! I need to know how to automate the joining of files that have identical employee ID lines output by pdfgrep.

The number of pages per employee ID varies but is, currently, a maximum of six.

Pseudocode

would be something like:

Filename = pg_0001.pdf

do until [ Filename = pg_<Lastpage>.pdf ]
Filename2= <somehow increment Filename by 1>
x= `pdfgrep "Employee ID" $Filename` #Not sure how to insert variable to be read as filename for pdfgrep
y= `pdfgrep "Employee ID" $Filename2`

if [ "$x" == "$y" ]; then
pdfjoin $Filename $Filename2 --outfile $Filename #Intended to join the two files under the name of Filename, i.e. replacing the first file with the joined file in the same directory.
fi

Filename = $Filename2

Thanks for any help you can offer.

Last edited by nopposan; 03-06-2012 at 01:29 PM..

nopposan

View Public Profile for nopposan

Find all posts by nopposan

03-08-2012

Registered User

3, 0

Join Date: Mar 2012

Last Activity: 9 March 2012, 9:25 PM EST

Posts: 3

Thanks Given: 0

Thanked 0 Times in 0 Posts

Mostly solved. A little advice needed to perfect.

Hello again, everybody. I've got a script that works pretty well, but only for files that are named without leading zeros. Our files are named with leading zeros. Please see the #comments# in the script which follows.

Code:

#!bin/bash
# This script works in its current form, but it's written for files that are named with the pattern pg_1.pdf, without leading zeros. Our files are named with four digits, the leading zeros being present if the page number is less than 1000. 'Need a script that will work with files named with the pattern pg_0001.pdf

a=1 #Sets the starting page equal to 1. 'Could set with user input.
b=$(($a+1))#Sets the next page equal to one more than the starting page.

Filename="pg_"$a".pdf" # This is the filename of the starting page.

until [ $b == 6 ]; do # The number sets the maximum pages to be considered for concatenation. Again, it could be set using user input.

Filename2="pg_"$b".pdf" # This is the filename of the next file to be considered for concatenation to the current file.

x=`pdfgrep -C 0 [0-9]\{7\} $Filename | head -n 1` # pdfgrep with option to capture only zero characters other than the seven digits in the employee ID; this is piped to head in order to get just the first occurrence.
y=`pdfgrep -C 0 [0-9]\{7\} $Filename2 | head -n 1`

if [ "$x" == "$y" ]; then 
    pdfjoin --rotateoversize 'false' $Filename $Filename2 --outfile $Filename # If the employee ID's are equal, then the pdf files are concatenated into a new file, which is given the name of the first file that's added to.
    rm $Filename2 # If the file is concatenated to a previous file, it is removed.
else
Filename=$Filename2 # If no match is found and the file is not concatenated, then advance the current origination file, Filename, to the name of the non-concatenated/non-matching file.

fi

b=$(($b+1)) # Advance the page number of the next file, Filename2.

done
exit

Last edited by Corona688; 03-08-2012 at 03:46 PM.. Reason: Code tags for code, please.

nopposan

View Public Profile for nopposan

Find all posts by nopposan

03-08-2012

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

You can use printf to create strings of numbers beginning with zero:

Code:

ZNUM=$(printf "%04d" $NUM)

%04d gives you a 4-digit number padded with zeroes, for instance -- 0001, 0002, ... 0011, ...

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

03-09-2012

Registered User

3, 0

Join Date: Mar 2012

Last Activity: 9 March 2012, 9:25 PM EST

Posts: 3

Thanks Given: 0

Thanked 0 Times in 0 Posts

printf is nifty

Thanks!

Here's my updated script. I haven't tested it yet as I don't have any practice material at the moment.

Code:

#!bin/bash
# This script is written for files that are named with the pattern
#pg_0001.pdf, with leading zeros.

a=1 #Sets the starting page. 'Could set with user input.
b=$(($a+1))#Sets the next page equal to one more than the starting page.

Filename="pg_"`printf "%04d" $a`".pdf" # This is the filename of the
#starting page. The printf command is used to format the number with
#up to three leading zeros.

until [ $b == 680 ]; do # This line sets the maximum pages to be
#considered for concatenation.
Filename2="pg_"`printf "%04d" $b`".pdf" # This is the filename of the
#next file to be considered for concatenation to the current file.
x=`pdfgrep -C 0 [0-9]\{7\} $Filename | head -n 1` # pdfgrep with
#option to, -C, capture 0 characters other than the 7 digits in the
#employee ID; this is piped to "head" in order to get just the first
#occurrence.
y=`pdfgrep -C 0 [0-9]\{7\} $Filename2 | head -n 1`

if [ "$x" == "$y" ]; then 
    pdfjoin --rotateoversize 'false' $Filename $Filename2 --outfile $Filename # If the employee ID's are equal, then the pdf files
#are concatenated into a new file, which is given the name of the
#first file that's added to.
    rm $Filename2 # If the file is concatenated to a previous
#file, it is removed.
else
cp $Filename Empl_ID_"$x".pdf # Replace page number name with name
#based on Empl_ID.
#rm $Filename # Uncomment to remove the original file.
Filename=$Filename2 # If no match is found and the file is not
#concatenated, then advance the current origination file, Filename, to
#the name of the non-concatenated/non-matching file.
fi

b=$(($b+1)) # Advance the page number of the next file, Filename2.

done

cp $Filename Empl_ID_"$x".pdf # Finally, when all else is done,
#replace page number name with name of last file with one based on
#Empl_ID.

#rm $Filename # Uncomment to remove the original last file.

exit

Last edited by Corona688; 03-09-2012 at 06:32 PM.. Reason: Code tags, please.

nopposan

View Public Profile for nopposan

Find all posts by nopposan

Shell Programming and Scripting

Script to batch pdfjoin based on pdfgrep output

9 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Finding The Complete SQL statement Using PDFGREP Or Grep

Discussion started by: metallica1973

2. Shell Programming and Scripting

SFTP or scp with password in a batch script without using SSH keys and expect script

Discussion started by: ss112233

3. Shell Programming and Scripting

Extract batch based on condition

Discussion started by: abhi.mit32

4. UNIX for Advanced & Expert Users

Limiting size of rsync batch output

Discussion started by: dfbills

5. Shell Programming and Scripting

Executing a batch of files within a shell script with option to refire the individual files in batch

Discussion started by: goddevil

6. Shell Programming and Scripting

help with email to be triggered based on fatal error detection from batch run log file neded

Discussion started by: zico1986

7. Shell Programming and Scripting

Converting line output to column based output

Discussion started by: npatwardhan

8. Shell Programming and Scripting

Automatically Rerun a script based on previous execution output

Discussion started by: forumthreads

9. Shell Programming and Scripting

Shell script to email based on flat file output

Discussion started by: apoorva