I have a situation in which I'm given a bunch of PDF files, each a single page with an employee ID on a line of its own. I need to collate all of the pages by employee ID.
Piecemeal, I can find a particular employee ID by just using pdfgrep.
I could also do something like this:
find . -name "*.pdf" -print0 | xargs -0 -I FILENAME bash -c "if { pdftotext FILENAME - | grep -q <IDnumberHere>; } ; then echo FILENAME; fi"
Searching for an employee ID with pdfgrep lists every file containing that ID, along with the full text of the line containing it. (That line is the same for every employee; only the ID number changes.)
The task of joining these files together with pdfjoin is quite simple.
However, I'm a novice at writing bash scripts, and doing all of this piecemeal takes longer than just shuffling the actual paper pages! I need to know how to automate joining the files whose employee ID lines, as output by pdfgrep, are identical.
The number of pages per employee ID varies but is, currently, a maximum of six.
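To show the shape of what I'm after, here is a rough sketch (untested on the real files): map each employee ID to the list of pages mentioning it, then join each group. The function name `collate`, the `employee_<ID>.pdf` output naming, and the assumption that the ID is the last whitespace-separated field on the "Employee ID" line are all mine, not anything the files guarantee; it also assumes no filename contains spaces.

```shell
#!/usr/bin/env bash
# Sketch: group single-page PDFs by employee ID, then join each group.
declare -A pages                       # employee ID -> space-separated file list

collate() {
    local f id
    for f in "$@"; do
        # -h suppresses the filename prefix, as with grep; the ID is
        # assumed to be the last field on the matched line
        id=$(pdfgrep -h 'Employee ID' "$f" | awk '{print $NF; exit}')
        pages[$id]+="$f "
    done
    for id in "${!pages[@]}"; do
        # Unquoted on purpose: split the list back into separate arguments
        pdfjoin ${pages[$id]} --outfile "employee_${id}.pdf"
    done
}

# Usage: collate ./*.pdf
```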
Pseudocode would be something like:
filename=pg_0001.pdf
until [ "$filename" = "pg_<Lastpage>.pdf" ]; do
    # Increment the zero-padded page counter to get the next filename
    num=${filename#pg_}; num=${num%.pdf}
    filename2=$(printf 'pg_%04d.pdf' $((10#$num + 1)))
    x=$(pdfgrep "Employee ID" "$filename")    # quoting lets the variable be read as the filename
    y=$(pdfgrep "Employee ID" "$filename2")
    if [ "$x" = "$y" ]; then
        # Join the two files under the name of the first, i.e. replace the
        # first file with the joined file in the same directory
        pdfjoin "$filename" "$filename2" --outfile "$filename"
    fi
    filename=$filename2
done
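The part I'm least sure how to write is incrementing the filename. One way to do it in pure bash (assuming four-digit, zero-padded page numbers; `next_page` is just a name I made up) is:

```shell
#!/usr/bin/env bash
# Compute the next page's filename: pg_NNNN.pdf -> pg_NNNN+1.pdf
next_page() {
    local num=${1#pg_}                    # strip the "pg_" prefix
    num=${num%.pdf}                       # strip the ".pdf" suffix
    # 10# forces base-10 so leading zeros aren't read as octal
    printf 'pg_%04d.pdf' $((10#$num + 1))
}

next_page pg_0009.pdf   # -> pg_0010.pdf
```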
Thanks for any help you can offer.