Identify file pattern, take count of pattern, then act

05-13-2015


Guys -
Need your ideas on a section of code to finish something up. To make a long story short, I'm parsing a print output file that goes to pre-printed forms. I'm intercepting it, parsing it, formatting it, cutting it up into individual pages, grabbing the text I want in zones, building an .fdf (declares the zones), then populating (pdftk).

All of this is working fine. In the end, I move the docs that have multiple pages (as opposed to those that don't) into a separate directory, where I want to slam them together, in order, into a single .pdf per document. The pages of each multipage document all share a common "ID" name.

No problem getting the concatenation to work manually - I need your help with the code at the end of this bash script.

I've got a directory ($DIRECTORY) that has files that look like this:

Code:

368363.pdf
368363-2.pdf
368363-3.pdf
368373.pdf
368373-2.pdf
368373-3.pdf
368389.pdf
368389-2.pdf

The beauty is, this directory will ONLY contain files that have other files matching their first six characters as I move through line by line. And it WILL be in order, with the first file being 'filename.pdf' and everything following, line by line, being 'filename-2(++).pdf'.

What I want to do is simple: read the directory, take in one file at a time, store all its counterparts (the -'X'.pdf files) in a variable, then slam them all together with pdftk before I get to the next line, i.e. pdftk $ALL $pdform cat output $linem ('m' to indicate multipage, since pdftk is picky about using the same input and output names). I can put in a line at the end to move it all and clean up.

Something like:

Code:

for f in $DIRECTORY; do
         FILE=$(echo ${f##*/})         #Cut out the path leaving filename.pdf
        CATEM=$(ls -1f|grep $FILE|sort|uniq|tr '\r\n' ' ') # Gives me each file, separated by space (rather than line) to input to pdftk as a variable...

I'm lost after this. I don't think my plan is going to work because the original file will contain EVERYTHING and won't be re-read as I 'do stuff'....

Need some fresh eyes/ideas on this...this script is HUGE and does a ton of processing and I think I'm just getting tired of looking at it! Lol...

Thanks, fellas.

ampsys


05-13-2015


You can skip further processing for the files that are "counterparts", that is, files whose name contains a "-" (or whatever other check you consider safe).

Something like..

Code:

for f in "$DIRECTORY"/*.pdf; do
  FILE=${f##*/}                  # strip the path, leaving filename.pdf
  if printf '%s\n' "$FILE" | grep -q -e '-'; then
    continue                     # a counterpart (name contains "-"): pick the next file
  else
    BASE=${FILE%.pdf}            # e.g. 368363
    CATEM=$(ls -1 "$DIRECTORY" | grep "^$BASE" | sort -t- -k2n | tr '\n' ' ')
    # any other stuff
  fi
done
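
The same test can also be written without spawning grep, using a case pattern. This is only a minimal sketch, assuming bash; is_counterpart is a hypothetical helper name made up for the illustration, and the filenames are the ones from the question:

```shell
#!/bin/bash
# is_counterpart NAME: succeed if NAME looks like a follow-on page,
# i.e. its filename contains a "-". (Hypothetical helper, not from the thread.)
is_counterpart() {
  case ${1##*/} in      # look only at the filename, not the path
    *-*) return 0 ;;
    *)   return 1 ;;
  esac
}

for f in 368363.pdf 368363-2.pdf 368363-3.pdf 368373.pdf; do
  if is_counterpart "$f"; then
    continue            # skip counterparts; only main files get processed
  fi
  echo "main file: $f"
done
```

Only the two names without a "-" reach the echo.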

clx


05-13-2015


With extended pattern matching (available in recent bash once shopt -s extglob is set), you could try

Code:

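shopt -s extglob                # needed for the !(...) pattern
for f in !(*-*).pdf
   do CATEM=$(ls "${f%.*}"*)
      echo $CATEM               # unquoted on purpose: joins the names with spaces
   done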
368363-2.pdf 368363-3.pdf 368363.pdf
368373-2.pdf 368373-3.pdf 368373.pdf
368389-2.pdf 368389.pdf

If the order is relevant (note that the main file is listed last here), you'll need another step.
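
One possible version of that extra step is to print the main file first and sort its counterparts by page number. A minimal sketch, assuming the NAME-N.pdf naming from the question and filenames without newlines; ordered_group is a hypothetical helper name, and the scratch directory holds empty stand-in files:

```shell
#!/bin/bash
# ordered_group MAIN.pdf: print MAIN.pdf, then MAIN-2.pdf, MAIN-3.pdf, ...
# in numeric page order, one name per line. Run it from the directory
# that holds the files. (Hypothetical helper, not from the thread.)
ordered_group() {
  local main="$1" base="${1%.pdf}" p
  printf '%s\n' "$main"
  for p in "$base"-*.pdf; do
    if [ -e "$p" ]; then printf '%s\n' "$p"; fi   # skip an unmatched glob
  done | sort -t- -k2,2n     # numeric sort on the page number after "-"
}

# Demonstration in a scratch directory with empty stand-in files.
demo=$(mktemp -d) || exit 1
cd "$demo" || exit 1
touch 368363.pdf 368363-2.pdf 368363-3.pdf 368363-10.pdf
CATEM=$(ordered_group 368363.pdf | tr '\n' ' ')
echo "$CATEM"
```

The numeric sort matters once a document reaches ten pages, because a plain alphabetical listing would place -10 ahead of -2.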

Last edited by RudiC; 05-13-2015 at 05:33 AM..

RudiC


05-13-2015


Quote:

Originally Posted by ampsys

Code:

368363.pdf
368363-2.pdf
368363-3.pdf
368373.pdf
368373-2.pdf
368373-3.pdf
368389.pdf
368389-2.pdf

Code:

for f in $DIRECTORY; do
         FILE=$(echo ${f##*/})         #Cut out the path leaving filename.pdf
        CATEM=$(ls -1f|grep $FILE|sort|uniq|tr '\r\n' ' ') # Gives me each file, separated by space (rather than line) to input to pdftk as a variable...

The above code snippet doesn't even come close to doing what the comments indicate it should do. The for loop will execute once with f set to the directory named by $DIRECTORY. It will set FILE to the last component of that directory; not the name of a file in that directory.
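
To see the difference, compare looping over the directory name with looping over a glob of its contents. A small sketch; the scratch directory and the two empty files are made up for the demonstration:

```shell
#!/bin/bash
DIRECTORY=$(mktemp -d) || exit 1
touch "$DIRECTORY/368363.pdf" "$DIRECTORY/368363-2.pdf"

# Looping over the variable itself: f is the directory, not a file in it.
once=0
for f in $DIRECTORY; do
  once=$((once + 1))
  echo "directory loop sees: ${f##*/}"   # last component of the directory path
done

# Looping over a glob: one iteration per matching file.
each=0
for f in "$DIRECTORY"/*.pdf; do
  each=$((each + 1))
  echo "glob loop sees: ${f##*/}"
done

echo "iterations: $once vs $each"
rm -r "$DIRECTORY"
```

With the two files created here, the first loop runs once and the second runs twice.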

Note also that with filenames like 123456.pdf and 123456-2.pdf, ls or anything else that sorts its output alphabetically will sort 123456-2.pdf before 123456.pdf since - sorts before . in ASCII and in the C Locale sort order.
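
This ordering is easy to check directly. A short sketch that forces the C locale so the result does not depend on the local collation settings:

```shell
#!/bin/bash
# In the C locale "-" (0x2D) collates before "." (0x2E), so the
# counterpart sorts ahead of the main file.
sorted=$(printf '%s\n' 123456.pdf 123456-2.pdf | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

The output lists 123456-2.pdf first and 123456.pdf second.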

There is no end to your sample for loop, so I don't know what you're planning to do after you get the list of files. And, I'm not sure whether you are trying to create a list of filenames (with $DIRECTORY stripped off) that can be used in the directory where the files are located, or a list of pathnames (including $DIRECTORY) so the list can be used in another directory. It is also not clear whether $DIRECTORY is an absolute pathname for that directory or a relative pathname for that directory. The following code will only work if $DIRECTORY expands to an absolute pathname for a directory.

Rather than creating a list of files as a scalar variable, with bash or ksh it would be much easier to do this with an array (especially if your directory structure or filenames might ever contain any whitespace characters). Perhaps the following will give you something you can build on to get what you want:

Code:

#!/bin/bash
DIRECTORY="$PWD/dir"	# must be an absolute pathname
cd "$DIRECTORY" || exit 1
for pdf in ??????.pdf	# main files only: six characters plus .pdf
do	base=${pdf%.pdf}
	# Main file first, then its counterparts in glob (alphabetical) order.
	FILES=("$pdf" "$base-"*.pdf)
	PATHS=("$DIRECTORY/$pdf" "$DIRECTORY/$base-"*.pdf)
	echo "FILES (${#FILES[@]} elements):"
	printf '\t"%s"\n' "${FILES[@]}"
	echo "PATHS (${#PATHS[@]} elements):"
	printf '\t"%s"\n' "${PATHS[@]}"
	echo
done

This loop runs in the directory specified by $DIRECTORY and creates a list of your desired filenames and a list of pathnames for those filenames. I assume you'll want one of those and can delete the code for the one you don't want. I also assume that you'll want to replace one of those printf commands with a pdftk command, but I don't see how $ALL, $linem, or $pdform from your description relate to the list of filenames or pathnames you want to use; so I'm leaving that as an exercise for the reader.
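
As one possible way to approach that exercise, the FILES array can be handed straight to pdftk. This is only a sketch under stated assumptions: the output name (base name plus "m", following the "m for multipage" idea in the question), the output directory argument, the merge_groups name, and the DRYRUN switch are all made up here, and $pdform is left out because the thread does not say what it holds.

```shell
#!/bin/bash
# merge_groups DIR OUTDIR: for every main file (six characters + .pdf) in DIR,
# concatenate it and its -N counterparts into OUTDIR/<base>m.pdf.
# OUTDIR should be an absolute pathname, because the function changes to DIR.
# With DRYRUN=1 (the default in this sketch) the command is only printed.
merge_groups() (
  dir=$1 outdir=$2
  shopt -s nullglob                   # unmatched globs expand to nothing
  cd "$dir" || exit 1
  for pdf in ??????.pdf; do
    base=${pdf%.pdf}
    # Main file first, then counterparts in glob order
    # (alphabetical, so -10 would come before -2).
    files=("$pdf" "$base"-*.pdf)
    if [ "${DRYRUN:-1}" = 1 ]; then
      printf 'pdftk'
      printf ' %q' "${files[@]}" cat output "$outdir/${base}m.pdf"
      printf '\n'
    else
      pdftk "${files[@]}" cat output "$outdir/${base}m.pdf"
    fi
  done
)

# Demonstration with empty stand-in files; /tmp/merged is a made-up target.
src=$(mktemp -d) || exit 1
touch "$src"/368363.pdf "$src"/368363-2.pdf "$src"/368389.pdf "$src"/368389-2.pdf
DRYRUN=1
merge_groups "$src" /tmp/merged
```

Running the function with DRYRUN=0 would call pdftk for real, so it is worth checking the printed commands first.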

In a directory that contains the files:

Code:

1 2 3 -2.pdf
1 2 3 -3.pdf
1 2 3 -4.pdf
1 2 3 .pdf
368363-2.pdf
368363-3.pdf
368363.pdf
368389-2.pdf
368389.pdf

it produces the output:

Code:

FILES (4 elements):
	"1 2 3 .pdf"
	"1 2 3 -2.pdf"
	"1 2 3 -3.pdf"
	"1 2 3 -4.pdf"
PATHS (4 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 .pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -2.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -3.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/1 2 3 -4.pdf"

FILES (3 elements):
	"368363.pdf"
	"368363-2.pdf"
	"368363-3.pdf"
PATHS (3 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363-2.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368363-3.pdf"

FILES (2 elements):
	"368389.pdf"
	"368389-2.pdf"
PATHS (2 elements):
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368389.pdf"
	"/Users/dwc/test/unix.com/shell/Identify_file_pattern,.../dir/368389-2.pdf"
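
The names containing spaces in the listing above are where a scalar list breaks down. A minimal sketch of the difference, assuming bash; count_args is a hypothetical helper that just reports how many arguments it received:

```shell
#!/bin/bash
# count_args: print the number of arguments received (illustration only).
count_args() { echo "$#"; }

# Two filenames, each containing spaces.
scalar='1 2 3 .pdf 1 2 3 -2.pdf'             # names flattened into one string
array=('1 2 3 .pdf' '1 2 3 -2.pdf')          # one array element per name

scalar_count=$(count_args $scalar)           # unquoted: split on every space
array_count=$(count_args "${array[@]}")      # quoted: one word per element

echo "scalar expands to $scalar_count words, array to $array_count"
```

A command such as pdftk would be handed eight names that do not exist in the first case and the two real filenames in the second.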

Don Cragun
