Quote:
Originally posted by oombera
You're saying it's easier on Unix to remove a thousand files from a thousand different directories than it is to remove them from one directory?
I didn't say that. There is overhead in opening a directory. Even if each of the thousand directories contained only one file, that would not be a win.
Directories can grow in size but they cannot shrink. Consider a directory with 100,000 files in it. Now you want to unlink the very last file. This will take some time, because unix must scan all 100,000 entries looking for that directory entry. Now suppose you know which file is the very last entry in the directory, and you first delete the other 99,999 files. Now you go and unlink() that final entry. It still takes the same amount of time, because the directory has not shrunk: the 99,999 emptied slots still get scanned on the way to that last entry.
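To make that concrete, here is a minimal C sketch (mine, not anything from the original discussion) that walks a directory with readdir() the same way the kernel's linear scan does on a classic unix filesystem. The directory name BIGDIR and the file name "target-name" are made up; the point is just that finding any one entry means reading every entry stored ahead of it.

    /* Sketch only: count how many entries readdir() hands back before a
     * hypothetical "target-name" shows up in a hypothetical BIGDIR. */
    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    int main(void)
    {
        DIR *dp = opendir("BIGDIR");      /* hypothetical directory */
        struct dirent *de;
        long scanned = 0;

        if (dp == NULL) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(dp)) != NULL) {
            scanned++;                    /* every entry read before the hit */
            if (strcmp(de->d_name, "target-name") == 0)
                break;
        }
        printf("read %ld entries to find the target\n", scanned);
        closedir(dp);
        return 0;
    }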
If we delete each of 100,000 files from a single directory, we must scan the directory 100,000 times. On average, we will scan halfway before we find the entry we want. That is 100,000 * 100,000 / 2 directory entries read, otherwise known as 5,000,000,000. That is a lot. Now suppose the 100,000 files are evenly distributed across 10 directories. That is 10,000 * 10,000 / 2 directory entry scans, or 50,000,000 per directory. We need to do that 10 times, once for each directory. That brings us up to 500,000,000 directory entry scans, or one tenth of the total work. We pay for this improvement by needing to open 9 more directories, but that is a win: 9 directory opens beat 4,500,000,000 extra directory entry reads.
If we use 100 directories, that is one hundredth of the total directory entries to read, balanced against the need to open 100 times as many directories. And so on. And, yes, by the time you get to one file per directory, that is dumb. But so is 100,000 files per directory.
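For what it's worth, here is a throwaway C program that just runs the arithmetic above, assuming a plain linear-scan directory format and N files spread evenly over D directories: total entry reads come out to about D * (N/D)^2 / 2, against D directory opens.

    /* Back-of-the-envelope check of the scan counts above. */
    #include <stdio.h>

    int main(void)
    {
        const double n = 100000.0;        /* total files to delete */
        const int dirs[] = { 1, 10, 100 };

        for (int i = 0; i < 3; i++) {
            double d = dirs[i];
            double per_dir = n / d;
            double reads = d * (per_dir * per_dir / 2.0);
            printf("%4.0f dirs: %14.0f entry reads, %4.0f opens\n",
                   d, reads, d);
        }
        return 0;
    }

With N = 100,000 it prints 5,000,000,000 reads for one directory, 500,000,000 for ten, and 50,000,000 for a hundred, which is where the numbers above come from.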
I would not suggest that you try to find the exact optimum number of entries per directory and always go with that. The exact number will vary from filesystem to filesystem, and it wouldn't be convenient for users anyway. But a directory with 100,000 files is way over the top. Users will control-C out of an ls rather than let it finish, and they can't figure out how to prune the directory down. When a command like "wc -l *" fails because there are too many filenames, that's a good sign that things have gotten out of control; at that point, the directory is too large for the user to handle. And if, on a quiet system, "ls -l" takes more than 3 seconds to start printing, that's a good sign that the directory is too large for unix to handle efficiently.
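As an aside, the reason "wc -l *" blows up is the kernel's limit on how many bytes of arguments exec() will accept; the shell's expansion of * overflows it. A trivial sketch to see that limit on your own system (same thing as "getconf ARG_MAX" from the shell):

    /* Print the exec() argument-size limit that a huge "*" expansion hits. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long arg_max = sysconf(_SC_ARG_MAX);  /* bytes of argv + environ allowed */
        printf("ARG_MAX = %ld bytes\n", arg_max);
        return 0;
    }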