The UNIX and Linux Forums  


#1
08-05-2008
Dave Stockdale

Speeding up a Shell Script (find, grep and a for loop)

Hi all,

I'm having some trouble with a shell script that I have put together to search our web pages for links to PDFs.

The first thing I did was:


Code:
ls -R | grep '\.pdf$' > /tmp/dave_pdfs.out

This generates a list of all of the PDFs on the server. For the sake of argument, say it looks like this:

file1.pdf
file2.pdf
file3.pdf
file4.pdf

I then put this info into an array in a shell script, and loop through the array, searching all .htm and .html files in the site for the value:


Code:
# The array of PDF names
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

# A counter that gets incremented on each iteration
counter=1

# For every value in the array (quoted so names containing spaces stay intact)
for value in "${pdfs[@]}"
do

# Tell the user which file is being searched for, and how far along in the overall process we are.
echo "Working on $value..."
echo "($counter of ${#pdfs[@]})"

# Add what is being searched for to the output file
echo "$value is linked to from" >> /tmp/dave_locations.out

# Find all .htm and .html files that contain the filename we are looking for, and add them to the output file
find . -name "*.htm*" -exec grep -l "$value" {} \; >> /tmp/dave_locations.out

# Add a blank line afterwards
echo " " >> /tmp/dave_locations.out

# Increment the counter.
counter=$((counter + 1))

done

This does work.

However, our site is huge (1491 PDFs, and a whole lot of .htm and .html pages). Each iteration through the loop takes about 55 seconds. I've calculated that this shell script will take 6 days to complete.

Does anyone please know of a better (and significantly faster) way of doing this?

Any help would be greatly appreciated. I'm a bit of a unix newbie, and it took me hours just to get this far.
#2
08-05-2008
era

Couldn't you just run the find once, then grep out the matches against the list of PDFs?
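A rough sketch of that idea, assuming GNU grep (for -o and -H) and a hypothetical pdfs.txt holding one PDF name per line. The demo tree under a temporary directory is made up; on the real site you would cd to the web root instead.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for the real site.
site=$(mktemp -d)
mkdir -p "$site/docs"
printf '<a href="file1.pdf">one</a>\n' > "$site/index.html"
printf '<a href="../file2.pdf">two</a> <a href="../file1.pdf">one</a>\n' > "$site/docs/page.htm"
printf 'file1.pdf\nfile2.pdf\nfile3.pdf\n' > "$site/pdfs.txt"

cd "$site" || exit 1

# One traversal of the tree: grep reads every name from pdfs.txt (-f) as a
# fixed string (-F), prints only the matched text (-o) and prefixes it with
# the page it was found in (-H). "{} +" passes many pages to each grep.
find . -type f -name '*.htm*' -exec grep -oHF -f pdfs.txt {} + | sort -u > locations.out

cat locations.out
```

Each output line has the form page:pdf, so the pages linking to a given PDF can be pulled out afterwards with one more grep. Because the names are matched as substrings, file1.pdf would also be reported for a link to myfile1.pdf.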
#3
08-05-2008
broli

the find -exec grep seems kind of redundant

find . -name "*$value*.htm*"

that should make the same result without loading another binary to memory
#4
08-05-2008
Dave Stockdale

Thanks for your reply Era.

I think I've got my list together now.

I established the list of all PDFs linked to using:


Code:
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/dave.out

And compared that to my list of existing PDFs as you suggested using:


Code:
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

for value in "${pdfs[@]}"
do

set | grep $value /tmp/dave.out

# If grep finds nothing...
if [ $? == 1 ]
then
  echo "$value" >> /tmp/dave_files.out
fi

done

Turns out we have 3145 PDFs not linked to... Of course I still want to double and triple check this list, but I suppose I should be able to come up with some kind of shell script to remove all of these as well.

Thanks again for your help.
#5
08-06-2008
era

I don't understand the construction set | grep $value /tmp/dave.out; as far as I can tell, the output from set will not be used for anything.

Also, the conditional is a Useless Use of Test, and it will print to stdout any matches; I imagine that's undesirable. The following avoids those problems.


Code:
if ! grep "$value" /tmp/dave.out >/dev/null
then
  echo "$value" >>/tmp/dave_files.out
fi
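For reference, grep's -q option suppresses the output itself, so the redirect can be dropped, and adding -x and -F makes the comparison literal and whole-line. A minimal sketch with made-up data (the temporary directory and the names links.txt and unlinked.txt are illustrative only):

```shell
#!/bin/sh
work=$(mktemp -d)
# Pretend only file2.pdf was found in the pages.
printf 'file2.pdf\n' > "$work/links.txt"

for value in file1.pdf file2.pdf
do
  # -q: no output, the exit status alone says whether a line matched
  # -x: the whole line must match; -F: treat the name as a fixed string
  if ! grep -qxF "$value" "$work/links.txt"
  then
    echo "$value" >> "$work/unlinked.txt"
  fi
done

cat "$work/unlinked.txt"
```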

Still, if you had your list of PDF files in another file, one PDF per line, it could be as simple as


Code:
fgrep -vxf /tmp/dave.out pdfs.txt >/tmp/dave_files.out

fgrep matches fixed strings rather than regexes (you know that dot in a regex matches any character, for example), and -x requires each match to span the whole line. The -v causes only lines in pdfs.txt which are not anywhere in /tmp/dave.out to be printed.
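To make the direction of the comparison concrete, here is a small sketch with made-up file names: links.txt stands for the names found in the pages and pdfs.txt for the names on disk. grep -F is the POSIX spelling of fgrep. The command prints the lines of the last file that do not appear anywhere in the file given to -f.

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'file1.pdf\nfile2.pdf\nfile3.pdf\nfile4.pdf\n' > "$work/pdfs.txt"
printf 'file2.pdf\nfile4.pdf\nfile2.pdf\n' > "$work/links.txt"

# Patterns come from links.txt (-f); lines of pdfs.txt that equal none of
# them (-x whole line, -F fixed string, -v inverted) are printed.
grep -Fvxf "$work/links.txt" "$work/pdfs.txt" > "$work/not_linked.txt"

cat "$work/not_linked.txt"
```

Swapping the two file arguments answers the opposite question: which linked names have no PDF on disk.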
#6
08-06-2008
Dave Stockdale

Hi,

I've solved the problem. More or less. This is what I went with for my shell script. It is probably far more complicated than it needs to be, but for my first shell script that isn't a basic option list, I'm quite pleased with it.

Thanks for your help era.


Code:
# Establishing a list of ALL PDFs

echo "Finding All PDFs..."
ls -R | grep '\.pdf$' > /tmp/pdfs/all_pdfs.out
echo "Done."

# Remove rubbish from list

echo "Removing Rubbish From List..."
sed 's|^\./[a-zA-Z0-9_ &./:]*$||g' /tmp/pdfs/all_pdfs.out > /tmp/pdfs/all_pdfs2.out
sed '/^$/d' /tmp/pdfs/all_pdfs2.out > /tmp/pdfs/all_pdfs.out
echo "Done."

# List all PDFs Linked to

echo "Gathering List of PDF Links..."
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/pdfs/all_links.out
find . -name "*.php" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; >> /tmp/pdfs/all_links.out
echo "Done."

# Compare the two lists to see which aren't linked to (Thanks era)

echo "Establishing PDFs not Linked to..."
fgrep -vxf /tmp/pdfs/all_links.out /tmp/pdfs/all_pdfs.out > /tmp/pdfs/not_linked.out
echo "Done."

# Get full paths relative to the site root

echo "Gathering Full Paths Now..."
while read Value
do
  find . -name "$Value" >> /tmp/pdfs/not_linked_paths.out
done < /tmp/pdfs/not_linked.out
echo "Done."

# Attaching Real Full Path
sed 's|^\.|/home/httpd/site|' /tmp/pdfs/not_linked_paths.out > /tmp/pdfs/not_linked_full_paths.out

# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."

# mv is commented out until I've confirmed that it is ok with my manager. Just copying for now.

# Getting rid of the .out files.

echo "Cleaning Up..."
rm /tmp/pdfs/*.out
echo "Finished!"

# Finished.
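The per-name find in the "full paths" step walks the whole tree once for every unlinked PDF. A possible alternative, sketched here with a made-up demo tree and a hypothetical links.txt standing in for all_links.out (assumed to be non-empty): run find once, keep the full paths, and let awk drop every path whose last component appears in the list of linked names.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for the real site.
site=$(mktemp -d)
mkdir -p "$site/docs/old"
: > "$site/docs/file1.pdf"
: > "$site/docs/old/file2.pdf"
: > "$site/file3.pdf"
printf 'file1.pdf\n' > "$site/links.txt"

cd "$site" || exit 1

# One traversal that already yields paths relative to the site root.
find . -type f -name '*.pdf' | sort > all_paths.txt

# First file (NR == FNR): remember every linked name.
# Second file: print a path only if its last /-separated field is not linked.
# Note: this idiom assumes links.txt has at least one line.
awk -F/ 'NR == FNR { linked[$0]; next } !($NF in linked)' links.txt all_paths.txt > not_linked_paths.txt

cat not_linked_paths.txt
```

This also replaces the ls -R listing and the clean-up sed, since find only prints real files.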

#7
08-06-2008
cfajohnson

Quote:
Originally Posted by Dave Stockdale

Code:
# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."

If there aren't too many lines in the file, you can do it without a loop:


Code:
IFS=$'\n'
list=$( < /tmp/pdfs/not_linked_full_paths.out )
cp $list /tmp/pdfs/backups/
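One caveat with the unquoted $list: the shell also applies pathname expansion to each line, so a name containing * or ? could expand unexpectedly. A sketch that turns globbing off with set -f and uses a literal newline for IFS so that it also runs in plain sh. The directory and file names are made up for the demo, and the list is assumed to be non-empty:

```shell
#!/bin/sh
work=$(mktemp -d)
mkdir -p "$work/site" "$work/backups"
: > "$work/site/annual report.pdf"
: > "$work/site/file3.pdf"
printf '%s\n' "$work/site/annual report.pdf" "$work/site/file3.pdf" > "$work/paths.txt"

old_ifs=$IFS
# Split on newlines only, so spaces inside a file name are preserved.
IFS='
'
set -f                      # no pathname expansion on the unquoted $list
list=$(cat "$work/paths.txt")
cp $list "$work/backups/"   # one cp call for the whole list
set +f
IFS=$old_ifs

ls "$work/backups"
```

The "not too many lines" condition still applies, because the whole list has to fit on a single command line.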

The UNIX and Linux Forums Content Copyright ©1993-2009. All Rights Reserved.