Speeding up a Shell Script (find, grep and a for loop)


 
# 1  
Old 08-05-2008

Hi all,

I'm having some trouble with a shell script that I have put together to search our web pages for links to PDFs.

The first thing I did was:

Code:
ls -R | grep .pdf > /tmp/dave_pdfs.out

This generates a list of all of the PDFs on the server. For the sake of argument, say it looks like this:

file1.pdf
file2.pdf
file3.pdf
file4.pdf

I then put this info into an array in a shell script, and loop through the array, searching all .htm and .html files in the site
for the value:

Code:
# The Array
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

# Just a counter that gets incremented for each iteration
counter=1

# For every value in the array
for value in "${pdfs[@]}"
do

# Tell the user which file is being searched for, and how far along in the overall process we are.
echo "Working on $value..."
echo "($counter of ${#pdfs[*]})"

# Add what is being searched for to the output file
echo "$value is linked to from" >> /tmp/dave_locations.out

# Find all .htm and .html files with the filename we are looking for in, and add it to the output file
find . -name "*.htm*" -exec grep -l "$value" {} \; >> /tmp/dave_locations.out

# Adding a space afterwards
echo " " >> /tmp/dave_locations.out

# Increment the counter.
counter=`expr $counter + 1`

done

This does work.

However, our site is huge (1491 PDFs, and a whole lot of .htm and .html pages). Each iteration through the loop
takes about 55 seconds, so I've calculated that this shell script will take six days to complete.

Does anyone know of a better (and significantly faster) way of doing this?

Any help would be greatly appreciated. I'm a bit of a unix newbie, and it took me hours just to get this far.
# 2  
Old 08-05-2008
Couldn't you just run the find once, then grep out the matches against the list of PDFs?
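
A minimal sketch of that idea (the temp-file names are illustrative, and a tiny demo tree stands in for the real site): the expensive find runs exactly once, and the per-PDF work becomes a single grep against its output.

```shell
# Demo setup: a tiny site with two PDFs, only one of which is linked to.
rm -rf /tmp/demo_site
mkdir -p /tmp/demo_site && cd /tmp/demo_site
touch linked.pdf orphan.pdf
echo '<a href="linked.pdf">doc</a>' > index.html
printf 'linked.pdf\norphan.pdf\n' > /tmp/dave_pdfs.out

# ONE pass over the HTML collects every .pdf name that is linked to.
find . -name '*.htm*' -exec grep -o '[a-zA-Z0-9_]\{1,\}\.pdf' {} \; | sort -u > /tmp/linked.out

# Any PDF on disk that never appears in the linked list is an orphan.
grep -Fvx -f /tmp/linked.out /tmp/dave_pdfs.out
```

The loop over 1491 PDFs disappears entirely; the comparison at the end is one grep invocation instead of 1491 finds.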
# 3  
Old 08-05-2008
The find -exec grep seems kind of redundant.

find . -name "*$value*.htm*"

That should produce the same result without loading another binary into memory.
# 4  
Old 08-05-2008
Thanks for your reply, era.

I think I've got my list together now.

I established the list of all PDFs linked to using:

Code:
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/dave.out
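
That grep -o prints only the matched text, one match per line, rather than the whole matching line; a quick illustration (made-up HTML):

```shell
# -o extracts just the part of the line that matched the pattern
echo '<a href="report_2008.pdf">Annual report</a>' | grep -o '[a-zA-Z0-9_]\{1,\}\.pdf'
```
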

And compared that to my list of existing PDFs as you suggested using:

Code:
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

for value in ${pdfs[@]}
do

set | grep $value /tmp/dave.out

# If grep finds nothing...
if [ $? == 1 ]
then
  echo "$value" >> /tmp/dave_files.out
fi

done

Turns out we have 3145 PDFs not linked to... Of course I still want to double- and triple-check this list, but I should be able to come up with some kind of shell script to remove all of these as well.

Thanks again for your help.
# 5  
Old 08-06-2008
I don't understand the construction set | grep $value /tmp/dave.out -- as far as I can tell, the output from set is not used for anything.

Also, the conditional is a Useless Use of Test, and the grep will print any matches to stdout; I imagine that's undesirable. The following avoids both problems.

Code:
if ! grep "$value" /tmp/dave.out >/dev/null
then
  echo "$value" >>/tmp/dave_files.out
fi

Still, if you had your list of PDF files in another file, one PDF per line, it could be as simple as

Code:
fgrep -vxf pdfs.txt /tmp/dave.out >/tmp/dave_files.out

fgrep matches fixed strings rather than regular expressions (a dot in a regex matches any character, for example), -x requires the match to span the whole line, and -v causes only the lines in /tmp/dave.out which do not appear anywhere in pdfs.txt to be printed.
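
A quick illustration of those three flags together (the file names and contents here are made up):

```shell
# Patterns, one fixed string per line
printf 'a.pdf\nb.pdf\n' > /tmp/pdfs_demo.txt
# Input: one exact match, one superstring, one unrelated line
printf 'a.pdf\nxa.pdf\nc.pdf\n' > /tmp/found_demo.txt

# -v inverts, -x demands a whole-line match, f reads patterns from a file;
# xa.pdf survives because it is not a whole-line match for a.pdf
fgrep -vxf /tmp/pdfs_demo.txt /tmp/found_demo.txt
```

Without -x, the pattern a.pdf would also knock out xa.pdf, which is exactly the kind of false positive -x prevents here.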
# 6  
Old 08-06-2008
Hi,

I've solved the problem, more or less. This is what I went with for my shell script. It is probably far more complicated than it needs to be, but for my first shell script that isn't a basic option list, I'm quite pleased with it.

Thanks for your help era.

Code:
# Establishing a list of ALL PDFs

echo "Finding All PDFs..."
ls -R | grep .pdf > /tmp/pdfs/all_pdfs.out
echo "Done."

# Remove rubbish from list

echo "Removing Rubbish From List..."
sed 's|^\./[a-zA-Z0-9_ &./:]*$||g' /tmp/pdfs/all_pdfs.out > /tmp/pdfs/all_pdfs2.out
sed '/^$/d' /tmp/pdfs/all_pdfs2.out > /tmp/pdfs/all_pdfs.out
echo "Done."

# List all PDFs Linked to

echo "Gathering List of PDF Links..."
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/pdfs/all_links.out
find . -name "*.php" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; >> /tmp/pdfs/all_links.out
echo "Done."

# Compare the two lists to see which aren't linked to (Thanks era)

echo "Establishing PDFs not Linked to..."
fgrep -vxf /tmp/pdfs/all_links.out /tmp/pdfs/all_pdfs.out > /tmp/pdfs/not_linked.out
echo "Done."

# Get full paths relative to the site root

echo "Gathering Full Paths Now..."
while read Value
do
  find . -name "$Value" >> /tmp/pdfs/not_linked_paths.out
done < /tmp/pdfs/not_linked.out
echo "Done."

# Attaching Real Full Path
sed 's/^\./\/home\/httpd\/site/g' /tmp/pdfs/not_linked_paths.out > /tmp/pdfs/not_linked_full_paths.out

# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."

# mv is commented out until I've confirmed that it is ok with my manager. Just copying for now.

# Getting rid of the .out files.

echo "Cleaning Up..."
rm /tmp/pdfs/*.out
echo "Finished!"

# Finished.

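One caveat with the script above (an observation, not something raised in the thread): a bare while read will strip leading/trailing whitespace and interpret backslashes in file names, so while IFS= read -r is the safer idiom for reading paths line by line. A small self-contained demo, with throwaway /tmp paths:

```shell
# Demo: a file name containing a space, listed one path per line
mkdir -p /tmp/pdfs_demo_dir /tmp/pdfs_demo_backups
touch '/tmp/pdfs_demo_dir/a file.pdf'
printf '/tmp/pdfs_demo_dir/a file.pdf\n' > /tmp/paths_demo.out

# IFS= preserves leading/trailing spaces; -r keeps backslashes literal
while IFS= read -r Value2
do
  cp "$Value2" /tmp/pdfs_demo_backups/
done < /tmp/paths_demo.out
```

The quoted "$Value2" in the cp is what keeps the embedded space from splitting the path into two arguments.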
# 7  
Old 08-06-2008
Quote:
Originally Posted by Dave Stockdale
Code:
# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."


If there aren't too many lines in the file, you can do it without a loop:

Code:
IFS=$'\n'
list=$( < /tmp/pdfs/not_linked_full_paths.out )
cp $list /tmp/pdfs/backups/
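
If the list is too long for a single command line, xargs is another loop-free option (a sketch; the -d flag and cp's -t are GNU extensions, so this assumes GNU findutils/coreutils):

```shell
# Demo files to copy, listed one path per line
mkdir -p /tmp/xargs_demo /tmp/xargs_backups
touch /tmp/xargs_demo/a.pdf /tmp/xargs_demo/b.pdf
printf '/tmp/xargs_demo/a.pdf\n/tmp/xargs_demo/b.pdf\n' > /tmp/xargs_list.out

# -d '\n' treats each line as one argument (spaces in names stay intact);
# xargs batches arguments, so very long lists never overflow the command line
xargs -d '\n' cp -t /tmp/xargs_backups/ < /tmp/xargs_list.out
```
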

 