Recursive directory search using ls instead of find


 
# 1  
Old 07-06-2011

I was working on a shell script and found that the find command took too long, especially when I had to execute it multiple times. After some thought and research, I came up with two functions:
fileScan()
fileScan() will cd into a directory and perform any operations you would like from within it.
directoryScan()
directoryScan() will recursively cd into all directories beneath an initial provided root directory. Once in a new directory, the directory is sent to fileScan() so that other functions can be executed.

I found that this is blazing fast compared to find, especially when searching large directory trees or when having to run more than one find in a script or cron job.

Enjoy the code!
Code:
#!/bin/bash
# Directory scanner using recursive ls instead of find.
# Do not make any of the local variables into globals:
# folder, numdirectories, and x should not be used outside fileScan() and directoryScan().
# directoryScan() will cd into all directories below the "root" directory sent to it.
# fileScan() will perform operations on any directory sent to it.
# Note: the [ "$folder" = "$PWD" ] checks assume $folder is an absolute path.
fileScan()
{
    local folder=$1
    cd "$folder" || return
    if [ "$folder" = "$PWD" ]
    then
        # You are now inside the directory. Do any operations you need
        # on files that may exist here.
        :   # placeholder; an if/then whose body is only a comment is a syntax error
    fi
}
directoryScan()
{
    local folder=$1
    cd "$folder" || return
    if [ "$folder" = "$PWD" ]
    then
        local numdirectories=$(ls -lS | egrep '^d' | wc -l)
        fileScan "$folder"
        local x=1
        while [ "$x" -le "$numdirectories" ]
        do
            subdirectory=$(ls -lS | egrep '^d' | sed "s/[ \t][ \t]*/ /g" | cut -d" " -f9 | head -n "$x" | tail -n 1)
            subdirectory="${folder}/${subdirectory}"
            directoryScan "$subdirectory"
            x=$((x + 1))
            cd "$folder" || return
        done
    fi
}
# Sample call to directoryScan():
# directoryScan "$rootdirectory"
# Sample call to fileScan():
# fileScan "$scandirectory"

# 2  
Old 07-06-2011
Hi, newreverie:

Welcome to the forum.

I'd be curious to see what exactly that shell script is comparatively "blazing fast" against. I'm inclined to believe that your find solution was suboptimal if this script, executing those pipelines in every visited directory, is faster.

If you are not familiar with AWK, you might enjoy the challenge of learning enough of it to simplify the egrep|sed|cut|head|tail pipeline to one concise AWK invocation.
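For instance, the whole egrep|sed|cut|head|tail chain could collapse into a single awk invocation. Here is a sketch of that idea (the function name nth_subdirectory is made up for illustration; like the original pipeline, it takes the last ls -l field as the name, so names containing whitespace are mishandled):

```shell
# Print the n-th subdirectory name from a long listing of the
# current directory, using one awk process in place of the
# egrep | sed | cut | head | tail chain.
# $NF is the last field (the entry name), so directory names with
# spaces are truncated -- the same limitation as the original.
nth_subdirectory() {
    ls -l | awk -v n="$1" '/^d/ && ++count == n { print $NF }'
}
```

One process instead of five, and awk does the counting itself, so there is no need to re-run the listing for each index.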

Performance and efficiency aside, there are some potentially serious issues with that code. One that stands out: if a directory is deleted between the time $numdirectories is calculated and the end of the subsequent while loop, entire subtrees of the hierarchy will be visited more than once (a result of the input to head being shorter than expected). Depending on what's being done with each of the files, this could be a deal breaker.

Again, welcome to the forum and thanks for the contribution.

Regards,
Alister
# 3  
Old 07-06-2011
I am also quite sceptical.

By the way, ls performs an ASCII sort by default. If you are in a directory with several thousand files, that sorting operation can be costly and may slow down the processing.

To avoid it, you can use the -f option to get the entries in the order they come from the directory structure. This avoids a useless sort, especially when you are only piping the ls output into wc -l.
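A sketch of that tip (the helper name count_entries is made up; note that -f also implies -a, so '.' and '..' have to be filtered back out before counting):

```shell
# Count directory entries without making ls sort them.
# -f lists entries in directory order, but it also implies -a,
# so '.' and '..' are removed with a fixed-string, whole-line grep.
count_entries() {
    ls -f "$1" | grep -Fxv -e . -e .. | wc -l
}
```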

A lot of performance problems come from a weak algorithm or approach; going back through a redesign step can then speed up the processing considerably.

I am curious to see the code of the initial "poor performance" script that was using the find command.

Sharing your code is still a nice gesture.

Here are some examples of performance problems caused by flawed logic or misuse of the find command:

https://www.unix.com/shell-programmin...nce-issue.html

https://www.unix.com/unix-dummies-que...-txt-file.html

Last edited by ctsgnb; 07-06-2011 at 03:08 PM..
# 4  
Old 07-07-2011
find may have been faster if I had sent its results into an array or text file and then looped through those results in my program.

My issue with the find command had more to do with the time it took to run to completion. Given the large directory structure and the variety of file types I needed to search for, find took several minutes or more to finish.

The particular shell script I was writing has a UI, so the user was forced to wait several minutes or more between starting a search and being able to work with its results. This was deemed unacceptable, so a method was needed to execute searches closer to real time and let the user interact with files as they are found.

find could still be used inside the fileScan() function with the -prune option to search only within the current directory. But I left the options open in that function to suit your purposes.

So perhaps I overstated the net speed of the functions relative to find. find may be faster overall, but if the choice is between waiting for a find command to run to completion and interacting with the results of a search in near real time, I believe this is the better method.

As for the comment about directories being deleted while the script is running: I can see the pitfalls, but they can be avoided by storing the ls results in a local array of subdirectories instead of using the head and tail method. An attempt to cd into a nonexistent directory would then be handled by the if [ $folder = $PWD ] logic.
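A sketch of that array-based variant (bash arrays; it reuses fileScan() from the first post and, by globbing instead of parsing ls output, also copes with names containing spaces):

```shell
# Variant of directoryScan() that snapshots the subdirectory names into
# a local array once, instead of re-running ls | head | tail on every
# loop iteration. A directory deleted mid-run is simply skipped.
directoryScan()
{
    local folder=$1
    cd "$folder" || return
    local subdirs=() d
    for d in */; do                        # the glob expands only to directories
        [ -d "$d" ] && subdirs+=("${d%/}")
    done
    fileScan "$folder"
    for d in "${subdirs[@]}"; do
        directoryScan "$folder/$d"
        cd "$folder" || return
    done
}
```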
# 5  
Old 07-08-2011
Use a program rather than a shell script

One of the phenomena I have noticed over my years of being involved in Unix/Linux is that people tend to over-use shell scripting.

The problem with shell scripting is, simply, performance. It's one thing to accomplish small to medium tasks with a shell script. But once you begin doing serious processing work, involving tight loops of text processing, you will very quickly run into trouble. The reason is that most things done in a shell script are done by small programs: cut, sed, awk, head, tail, etc. When you combine dozens of these in a heavily run loop, the computer has to launch thousands of tiny programs to accomplish the overall task.

I have seen large powerful Unix systems brought to their knees by simple DB loader scripts done in ksh for this very reason.

The solution is to use a more appropriate software tool for the problem. If you really want to do it in shell-scripting style, why not try it in Perl or Python? These languages let you write a single program to accomplish the task, with no spawning of child processes required, which will vastly speed things up. I know this to be a fact, because I've had a simple Perl file and text search program in my toolbox since 1994.
# 6  
Old 07-08-2011
Quote:
Originally Posted by bearvarine
One of the phenomena I have noticed over my years of being involved in Unix/Linux is that people tend to over-use shell scripting.

The problem with shell scripting is, simply, performance.
One of the phenomena I have noticed over my years of being involved with UNIX/Linux is that people tend to blame poor shell scripts on the language.

The program above isn't slow because it's shell. It's slow because of things like these:
Code:
ls -lS | egrep '^d' | sed "s/[ \t][ \t]*/ /g" | cut -d" " -f9 | head -n $x | tail -n 1

Six programs and five pipes to do something you could have done with two processes or fewer! How find could be slower, I can't imagine -- perhaps he didn't realize that find is recursive?
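For example, iterating over subdirectories needs no external programs at all; a glob does it in zero child processes. A sketch (the function name list_subdirectories is made up for illustration):

```shell
# List the subdirectories of a given directory using only the shell's
# own globbing -- no ls, grep, sed, cut, head, or tail at all.
list_subdirectories() {
    local dir
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue    # skip the unexpanded pattern when there are none
        dir=${dir%/}                 # drop the trailing slash
        printf '%s\n' "${dir##*/}"   # print just the name
    done
}
```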
Quote:
Its one thing to accomplish small to medium tasks with a shell script. But once you begin doing serious processing work, involving tight loops of text processing, you will very quickly run into trouble. The reason is because most things done in a shell script are done by small programs - cut, sed, awk, head, tail, etc. When you combine dozens of these in a loop that will be running heavily - the computer has to launch THOUSANDS of tiny programs to accomplish the overall task.
If you'd been programming shell for years, you ought to know:

1) awk can make a decent replacement for all the tools you listed above -- even in combination -- being capable of quite complex programs in its own right. Putting it in the same class as head, tail, etc. is a bit of a mischaracterization, and jamming it into the middle of a long pipe chain is generally misuse: awk can often replace the entire chain, sometimes the entire script.

2) It's often not necessary to run thousands of tiny processes to accomplish a single task, yet people choose to do so. Piping and external programs are powerful features, but too often they're abused, causing terrible performance.

Quote:
I have seen large powerful Unix systems brought to their knees by simple DB loader scripts done in ksh for this very reason.
Funny thing -- I've done that with Perl. I've also done it in assembly language. It's possible to write terrible code in any language.
Quote:
The solution is to use a more appropriate software tool to solve the problem. If you really want to do it in shell scripting style, why not try it in perl or python?
Because those don't resemble shell languages? Someone who writes a shell script precisely the same way they'd write a Perl or Python one isn't using the shell's important features.
Quote:
I know this to be a fact, because I've had a simple perl file and text search program in my toolbox since 1994.
Did you know many modern shells have regular expressions, can do substrings and simple text replacement, can pipe text between entire code blocks, can read line by line or token by token and split lines on tokens, can open/close/seek in files, etc, etc, etc -- all as shell builtin features?

All too often, people don't, and use thousands of tiny external programs instead.

The trick is to do a large amount of work with each process you create, and never to spawn one for anything trivial.
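To illustrate, every operation below is a bash builtin; no external process is spawned (a small sketch; the file name and variables are made up):

```shell
# Substrings, text replacement, and regular expressions,
# all with bash builtins -- zero external processes.
path="/var/log/syslog.1"
file=${path##*/}            # strip the directory part: syslog.1
renamed=${file/.1/.old}     # replace text: syslog.old
if [[ $renamed =~ ^[a-z]+\.old$ ]]; then    # builtin regex match
    matched=yes
fi
```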

Last edited by Corona688; 07-08-2011 at 12:21 PM..
# 7  
Old 07-08-2011
@Corona688: You make very good points here, and I believe you are essentially affirming my main point - don't put lots of little programs together in tight loops and expect good performance from a shell script.

Honestly, though -- and I know this is just my personal opinion -- I think awk is a scourge upon our land. Like kudzu, it should be ripped out wherever it is found and replaced with something less inscrutable. I don't see any reason in 2011 to keep using such an ancient, arcane, difficult-to-debug tool as awk when there are so many better choices available. I'm just sayin'...
