Speeding up a Shell Script (find, grep and a for loop)


 
# 1  
Old 08-05-2008

Hi all,

I'm having some trouble with a shell script that I have put together to search our web pages for links to PDFs.

The first thing I did was:

Code:
ls -R | grep .pdf > /tmp/dave_pdfs.out

This generates a list of all of the PDFs on the server. For the sake of argument, say it looks like this:

file1.pdf
file2.pdf
file3.pdf
file4.pdf

I then put this info into an array in a shell script, and loop through the array, searching all .htm and .html files in the site
for the value:

Code:
# The Array
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

# Just a counter that gets incremented for each iteration
counter=1

# For every value in the array
for value in "${pdfs[@]}"
do

# Tell the user which file is being searched for, and how far along in the overall process we are.
echo "Working on $value..."
echo "($counter of ${#pdfs[*]})"

# Add what is being searched for to the output file
echo "$value is linked to from" >> /tmp/dave_locations.out

# Find all .htm and .html files with the filename we are looking for in, and add it to the output file
find . -name "*.htm*" -exec grep -l "$value" {} \; >> /tmp/dave_locations.out

# Adding a space afterwards
echo " " >> /tmp/dave_locations.out

# Increment the counter.
counter=`expr $counter + 1`

done

This does work.

However, our site is huge (1491 PDFs, and a whole lot of .htm and .html pages). Each iteration through the loop
takes about 55 seconds, so I've calculated that this shell script will take six days to complete.

Does anyone know of a better (and significantly faster) way of doing this?

Any help would be greatly appreciated. I'm a bit of a unix newbie, and it took me hours just to get this far.
# 2  
Old 08-05-2008
Couldn't you just run the find once, then grep out the matches against the list of PDFs?
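
A minimal sketch of that idea (the temp-file names are illustrative, and a tiny demo tree stands in for the real site): the expensive find runs exactly once, and the per-PDF work becomes a single grep against its output.

```shell
# Demo setup: a tiny site with two PDFs, only one of which is linked to.
rm -rf /tmp/demo_site
mkdir -p /tmp/demo_site && cd /tmp/demo_site
touch linked.pdf orphan.pdf
echo '<a href="linked.pdf">doc</a>' > index.html
printf 'linked.pdf\norphan.pdf\n' > /tmp/dave_pdfs.out

# ONE pass over the HTML collects every .pdf name that is linked to.
find . -name '*.htm*' -exec grep -o '[a-zA-Z0-9_]\{1,\}\.pdf' {} \; | sort -u > /tmp/linked.out

# Any PDF on disk that never appears in the linked list is an orphan.
grep -Fvx -f /tmp/linked.out /tmp/dave_pdfs.out
```

The loop over 1491 PDFs disappears entirely; the comparison at the end is one grep invocation instead of 1491 finds.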
# 3  
Old 08-05-2008
The find -exec grep seems kind of redundant.

find . -name "*$value*.htm*"

That should produce the same result without loading another binary into memory.
# 4  
Old 08-05-2008
Thanks for your reply, era.

I think I've got my list together now.

I established the list of all PDFs linked to using:

Code:
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/dave.out
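
That grep -o prints only the matched text, one match per line, rather than the whole matching line; a quick illustration (made-up HTML):

```shell
# -o extracts just the part of the line that matched the pattern
echo '<a href="report_2008.pdf">Annual report</a>' | grep -o '[a-zA-Z0-9_]\{1,\}\.pdf'
```
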

And compared that to my list of existing PDFs as you suggested using:

Code:
pdfs=("file1.pdf" "file2.pdf" "file3.pdf" "file4.pdf")

for value in ${pdfs[@]}
do

set | grep $value /tmp/dave.out

# If grep finds nothing...
if [ $? == 1 ]
then
  echo "$value" >> /tmp/dave_files.out
fi

done

Turns out we have 3145 PDFs not linked to... Of course I still want to double- and triple-check this list, but I should be able to come up with some kind of shell script to remove all of these as well.

Thanks again for your help.
# 5  
Old 08-06-2008
I don't understand the construction set | grep $value /tmp/dave.out -- as far as I can tell, the output from set is not used for anything.

Also, the conditional is a Useless Use of Test, and the grep will print any matches to stdout; I imagine that's undesirable. The following avoids both problems.

Code:
if ! grep "$value" /tmp/dave.out >/dev/null
then
  echo "$value" >>/tmp/dave_files.out
fi

Still, if you had your list of PDF files in another file, one PDF per line, it could be as simple as

Code:
fgrep -vxf pdfs.txt /tmp/dave.out >/tmp/dave_files.out

fgrep matches fixed strings rather than regular expressions (a dot in a regex matches any character, for example), -x requires the match to span the whole line, and -v causes only the lines in /tmp/dave.out which do not appear anywhere in pdfs.txt to be printed.
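
A quick illustration of those three flags together (the file names and contents here are made up):

```shell
# Patterns, one fixed string per line
printf 'a.pdf\nb.pdf\n' > /tmp/pdfs_demo.txt
# Input: one exact match, one superstring, one unrelated line
printf 'a.pdf\nxa.pdf\nc.pdf\n' > /tmp/found_demo.txt

# -v inverts, -x demands a whole-line match, f reads patterns from a file;
# xa.pdf survives because it is not a whole-line match for a.pdf
fgrep -vxf /tmp/pdfs_demo.txt /tmp/found_demo.txt
```

Without -x, the pattern a.pdf would also knock out xa.pdf, which is exactly the kind of false positive -x prevents here.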
# 6  
Old 08-06-2008
Hi,

I've solved the problem, more or less. This is what I went with for my shell script. It is probably far more complicated than it needs to be, but for my first shell script that isn't a basic option list, I'm quite pleased with it.

Thanks for your help era.

Code:
# Establishing a list of ALL PDFs

echo "Finding All PDFs..."
ls -R | grep .pdf > /tmp/pdfs/all_pdfs.out
echo "Done."

# Remove rubbish from list

echo "Removing Rubbish From List..."
sed 's|^\./[a-zA-Z0-9_ &./:]*$||g' /tmp/pdfs/all_pdfs.out > /tmp/pdfs/all_pdfs2.out
sed '/^$/d' /tmp/pdfs/all_pdfs2.out > /tmp/pdfs/all_pdfs.out
echo "Done."

# List all PDFs Linked to

echo "Gathering List of PDF Links..."
find . -name "*.htm*" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; > /tmp/pdfs/all_links.out
find . -name "*.php" -exec grep -o "[a-zA-Z0-9_]\{1,\}\.pdf" {} \; >> /tmp/pdfs/all_links.out
echo "Done."

# Compare the two lists to see which aren't linked to (Thanks era)

echo "Establishing PDFs not Linked to..."
fgrep -vxf /tmp/pdfs/all_links.out /tmp/pdfs/all_pdfs.out > /tmp/pdfs/not_linked.out
echo "Done."

# Get full paths relative to the site root

echo "Gathering Full Paths Now..."
while read Value
do
  find . -name "$Value" >> /tmp/pdfs/not_linked_paths.out
done < /tmp/pdfs/not_linked.out
echo "Done."

# Attaching Real Full Path
sed 's/^\./\/home\/httpd\/site/g' /tmp/pdfs/not_linked_paths.out > /tmp/pdfs/not_linked_full_paths.out

# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."

# mv is commented out until I've confirmed that it is ok with my manager. Just copying for now.

# Getting rid of the .out files.

echo "Cleaning Up..."
rm /tmp/pdfs/*.out
echo "Finished!"

# Finished.

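One caveat with the script above (an observation, not something raised in the thread): a bare while read will strip leading/trailing whitespace and interpret backslashes in file names, so while IFS= read -r is the safer idiom for reading paths line by line. A small self-contained demo, with throwaway /tmp paths:

```shell
# Demo: a file name containing a space, listed one path per line
mkdir -p /tmp/pdfs_demo_dir /tmp/pdfs_demo_backups
touch '/tmp/pdfs_demo_dir/a file.pdf'
printf '/tmp/pdfs_demo_dir/a file.pdf\n' > /tmp/paths_demo.out

# IFS= preserves leading/trailing spaces; -r keeps backslashes literal
while IFS= read -r Value2
do
  cp "$Value2" /tmp/pdfs_demo_backups/
done < /tmp/paths_demo.out
```

The quoted "$Value2" in the cp is what keeps the embedded space from splitting the path into two arguments.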
# 7  
Old 08-06-2008
Quote:
Originally Posted by Dave Stockdale
Code:
# Archiving Files...
while read Value2
do
  cp "$Value2" /tmp/pdfs/backups/
  # mv "$Value2" /tmp/pdfs/backups/
done < /tmp/pdfs/not_linked_full_paths.out
echo "Done."


If there aren't too many lines in the file, you can do it without a loop:

Code:
IFS=$'\n'
list=$( < /tmp/pdfs/not_linked_full_paths.out )
cp $list /tmp/pdfs/backups/
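
If the list is too long for a single command line, xargs is another loop-free option (a sketch; the -d flag and cp's -t are GNU extensions, so this assumes GNU findutils/coreutils):

```shell
# Demo files to copy, listed one path per line
mkdir -p /tmp/xargs_demo /tmp/xargs_backups
touch /tmp/xargs_demo/a.pdf /tmp/xargs_demo/b.pdf
printf '/tmp/xargs_demo/a.pdf\n/tmp/xargs_demo/b.pdf\n' > /tmp/xargs_list.out

# -d '\n' treats each line as one argument (spaces in names stay intact);
# xargs batches arguments, so very long lists never overflow the command line
xargs -d '\n' cp -t /tmp/xargs_backups/ < /tmp/xargs_list.out
```
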

 