unix.com: UNIX for Dummies Questions & Answers
Speeding up a Shell Script (find, grep and a for loop)
Hi all, I'm having some trouble with a shell script that I have put together to search our web pages for links to PDFs. The first thing I did was: Code:
ls -R | grep .pdf > /tmp/dave_pdfs.out
This generates a list of all of the PDFs on the server. For the sake of argument, say it looks like this:
file1.pdf
file2.pdf
file3.pdf
file4.pdf
I then put this info into an array in a shell script, and loop through the array, searching all .htm and .html files in the site for the value: Code:
# The Array
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")
# Just a counter that gets incremented for each iteration
counter=1
# For every value in the array
for value in ${pdfs[@]}
do
    # Tell the user which file is being searched for, and how far along in the overall process we are.
    echo "Working on $value..."
    echo "($counter of ${#pdfs[*]})"
    # Add what is being searched for to the output file
    echo "$value is linked to from" >> /tmp/dave_locations.out
    # Find all .htm and .html files with the filename we are looking for in, and add it to the output file
    find . -name "*.htm*" -exec grep -l $value {} \; >> /tmp/dave_locations.out
    # Adding a space afterwards
    echo " " >> /tmp/dave_locations.out
    # Increment the counter.
    counter=`expr $counter + 1`
done
This does work. However, our site is huge (1491 PDFs, and a whole lot of .htm and .html pages). Each iteration through the loop takes about 55 seconds. I've calculated that this shell script will take 6 days to complete. Does anyone know of a better (and significantly faster) way of doing this? Any help would be greatly appreciated. I'm a bit of a Unix newbie, and it took me hours just to get this far.
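Most of the time per iteration likely goes to walking the whole tree again for every PDF and starting a separate grep for every page, which is what -exec ... \; does. A minimal sketch of one way around that, assuming GNU or BSD grep (for -o and -h): scan the pages once, save the PDF names found, and check each PDF against that short list. The demo site, the file names, and the links and report variables are made up for illustration.

```shell
# Throwaway demo site; every name here is made up for illustration.
site=$(mktemp -d)
links=$(mktemp)
printf '<a href="file1.pdf">one</a>\n' > "$site/index.html"
printf '<a href="file3.pdf">three</a> <a href="file1.pdf">one</a>\n' > "$site/about.htm"

# Scan the pages once and keep each PDF name a single time.
# "{} +" passes many files to one grep; -h drops the file-name prefix.
find "$site" -name '*.htm*' -exec grep -oh '[a-zA-Z0-9_]\{1,\}\.pdf' {} + | sort -u > "$links"

# Each lookup now reads the short list instead of the whole site.
report=$(
    for value in file1.pdf file2.pdf file3.pdf file4.pdf
    do
        if grep -Fxq "$value" "$links"
        then
            echo "$value is linked"
        else
            echo "$value is NOT linked"
        fi
    done
)
printf '%s\n' "$report"
```

With this layout the expensive part (reading every page) happens once, however many PDFs are in the list.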
|
||||
|
Thanks for your reply Era. I think I've got my list together now. I established the list of all PDFs linked to using: Code:
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/dave.out
And compared that to my list of existing PDFs as you suggested using: Code:
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")
for value in ${pdfs[@]}
do
    set | grep $value /tmp/dave.out
    # If grep finds nothing...
    if [ $? == 1 ]
    then
        echo "$value" >> /tmp/dave_files.out
    fi
done
Turns out we have 3145 PDFs not linked to... Of course I still want to double and triple check this list, but I suppose I should be able to come up with some kind of shell script to remove all of these as well. Thanks again for your help.
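When double-checking the list, it may help to see what the link pattern from the find command actually extracts. The sample HTML line below is invented; the pattern is the one used above, and grep -o is assumed to be available (GNU and BSD grep have it). Because the bracket expression contains no hyphen, slash or space, only the tail of a name such as my-report.pdf is captured, so a hyphenated PDF could show up as unlinked when it is not.

```shell
# The pattern used with grep -o above.
pattern='[a-zA-Z0-9_]\{1,\}\.pdf'

# Invented sample line with one plain name and one hyphenated name.
sample='<a href="/docs/annual_report.pdf">A</a> <a href="my-report.pdf">B</a>'

# grep -o prints each match on its own line.
matches=$(printf '%s\n' "$sample" | grep -o "$pattern")
printf '%s\n' "$matches"
# prints:
#   annual_report.pdf
#   report.pdf
```

The second match is report.pdf, not my-report.pdf, which is worth keeping in mind before removing anything.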
|
|
||||
|
I don't understand the construction set | grep $value /tmp/dave.out -- as far as I can tell, the output from set will not be used for anything. Also, the conditional is a Useless Use of Test, and it will print any matches to stdout; I imagine that's undesirable. The following avoids those problems. Code:
if ! grep "$value" /tmp/dave.out >/dev/null
then
    echo "$value" >>/tmp/dave_files.out
fi
Still, if you had your list of PDF files in another file, one PDF per line, it could be as simple as Code:
fgrep -vxf /tmp/dave.out pdfs.txt >/tmp/dave_files.out
The use of fgrep -x requires an exact match (not a regex match; you know that dot in a regex matches any character, for example) spanning the whole line (that's the -x). The -v causes only lines in pdfs.txt which are not anywhere in /tmp/dave.out to be printed.
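A small worked example of the -v -x -f comparison may make the direction of the match clearer. It uses grep -F, which behaves like fgrep. The file names and contents are hypothetical sample data. The pattern file (given to -f) is the list of link names, and the file being searched is the list of existing PDFs, so the output is the PDFs that nothing links to.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)

# Every PDF on the server, one per line.
printf '%s\n' file1.pdf file2.pdf file3.pdf file4.pdf > "$workdir/pdfs.txt"

# Every PDF name found in the pages (duplicates are harmless).
printf '%s\n' file1.pdf file3.pdf file1.pdf > "$workdir/links.txt"

# -F: fixed strings, -x: whole-line match, -v: invert, -f: read patterns from a file.
# Prints the lines of pdfs.txt that appear nowhere in links.txt.
unlinked=$(grep -F -v -x -f "$workdir/links.txt" "$workdir/pdfs.txt")
printf '%s\n' "$unlinked"
# prints:
#   file2.pdf
#   file4.pdf
```

Swapping the two file arguments answers the opposite question: which linked names do not exist as PDFs.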
|
||||
|
Hi, I've solved the problem. More or less. This is what I went with for my shell script. It is probably far more complicated than it needs to be, but for my first shell script that isn't a basic option list, I'm quite pleased with it. Thanks for your help era. Code:
# Establishing a list of ALL PDFs
echo "Finding All PDFs..."
ls -R | grep '\.pdf$' > /tmp/pdfs/all_pdfs.out
echo "Done."
# Remove rubbish from list
echo "Removing Rubbish From List..."
sed 's|^\./[a-zA-Z0-9_ &./:]*$||g' /tmp/pdfs/all_pdfs.out > /tmp/pdfs/all_pdfs2.out
sed '/^$/d' /tmp/pdfs/all_pdfs2.out > /tmp/pdfs/all_pdfs.out
echo "Done."
# List all PDFs Linked to
echo "Gathering List of PDF Links..."
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/pdfs/all_links.out
find . -name "*.php" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; >> /tmp/pdfs/all_links.out
echo "Done."
# Compare the two lists to see which aren't linked to (Thanks era)
echo "Establishing PDFs not Linked to..."
fgrep -vxf /tmp/pdfs/all_links.out /tmp/pdfs/all_pdfs.out > /tmp/pdfs/not_linked.out
echo "Done."
# Get full paths relative to the site root
echo "Gathering Full Paths Now..."
while read Value
do
    find . -name "$Value" >> /tmp/pdfs/not_linked_paths.out
done < /tmp/pdfs/not_linked.out
echo "Done."
# Attaching Real Full Path
sed 's/^\./\/home\/httpd\/site/g' /tmp/pdfs/not_linked_paths.out > /tmp/pdfs/not_linked_full_paths.out
# Archiving Files...
while read Value2
do
    cp "$Value2" /tmp/pdfs/backups/
    # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."
# mv is commented out until I've confirmed that it is ok with my manager. Just copying for now.
# Getting rid of the .out files.
echo "Cleaning Up..."
rm /tmp/pdfs/*.out
echo "Finished!"
# Finished.
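For what it's worth, the full paths can probably be collected without running find once per unlinked file. This is only a sketch under assumptions: the demo tree, the links and not_linked variables and the awk filter are illustrative and not part of the script above, file names contain no newlines, and grep supports -o and -h (GNU and BSD grep do).

```shell
# Throwaway demo tree; all names are invented for illustration.
site=$(mktemp -d)
links=$(mktemp)
mkdir -p "$site/a" "$site/b"
: > "$site/a/file1.pdf"
: > "$site/b/file2.pdf"
: > "$site/b/file3.pdf"
printf '<a href="a/file1.pdf">x</a>\n' > "$site/index.html"
printf '<a href=file3.pdf>y</a>\n' > "$site/b/page.php"

# One pass over .htm/.html and .php pages to collect the linked PDF names.
find "$site" \( -name '*.htm*' -o -name '*.php' \) \
    -exec grep -oh '[a-zA-Z0-9_]\{1,\}\.pdf' {} + | sort -u > "$links"

# One pass over the PDFs: awk loads the link names, then prints every path
# whose last component (the file name) is not among them.
not_linked=$(
    find "$site" -name '*.pdf' |
    awk -F/ -v list="$links" '
        BEGIN { while ((getline line < list) > 0) linked[line] = 1 }
        !($NF in linked)
    ' | sort
)
printf '%s\n' "$not_linked"
```

Because find already prints paths relative to its starting directory, pointing it at the real site root would give full paths directly, with no sed step afterwards.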