Performance Hit With Many Files


 
# 1  
Old 03-03-2009
Question Performance Hit With Many Files

Is it slower to open or create a file in a directory with 1 million files than a directory with 1000 files? How much slower? Where can I find information about this?

I'm mainly concerned about JFS on AIX, but also NTFS on Windows Server. Is there a difference?

I'm trying to determine a good way to store a large number of files (2 million, growing by about 200 K per year). Currently, I have them in six different directories, so it's about 300 K files per directory.

Thanks!
# 2  
Old 03-03-2009
Excellent question! It is indeed filesystem-dependent. Such performance metrics are hard to come by, since there are so many variables; to do a good apples-to-apples comparison, you would need to install multiple OSes on the same computer with the same disks. But it may not really be necessary to get such statistics. One can look at the filesystem architecture and directory-handling semantics and conclude that one might be better than the other.

Unfortunately, I cannot provide specifics on JFS or NTFS. However, on ReiserFS and on modern versions of ext2 (filesystems created with -O dir_index), file creation and lookup are very fast; both use a hashed directory index to find files. So if you know the name of the file, it can be found almost instantly (as I understand it and have experienced). On older versions of ext2, things really started to slow down once the directory entry itself extended into one or more indirect blocks -- at maybe 1,000 files or so.

You do know that one way to store a large number of files is to create a directory hierarchy keyed on the filenames themselves? So a file named "ergo1802.txt" might be stored in:
Code:
    data/er/go/18/ergo1802.txt
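
As a rough illustration (the two-character split, the three levels of depth, and the "data" root are just arbitrary choices for this sketch, not a prescribed layout), the target path can be derived from the filename like so:
Code:
import os

def hashed_path(root, filename, levels=3, width=2):
    # Split the leading characters of the name into fixed-width
    # subdirectory components; depth and width are arbitrary here.
    base = os.path.basename(filename)
    parts = [base[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *(parts + [base]))

# hashed_path("data", "ergo1802.txt") gives data/er/go/18/ergo1802.txt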

# 3  
Old 03-03-2009
Quote:
Originally Posted by otheus
You do know that one way to store a large number of files is to create a directory hierarchy keyed on the filenames themselves? So a file named "ergo1802.txt" might be stored in:
Code:
    data/er/go/18/ergo1802.txt

Yes, that would probably be the smartest design if the file names are randomized or evenly distributed. I know the user directories at SourceForge are organized like that (something like /home/u/us/username). Unfortunately, I've inherited a legacy system, and I first need to determine whether it's worth the trouble to change the design.

Maybe I should create a JFS partition on a virtual Linux machine and do some benchmarking. I don't have access to an idle AIX machine comparable to the one used in production. Of course, such a test would differ from "reality" in some ways: a different OS, CPU architecture, and storage solution (a local IDE drive compared to a SAN).
# 4  
Old 03-04-2009
Windows XP Test

I haven't had the time to do a test on Linux yet, but I just finished a test on my Windows XP desktop machine (NTFS). I'm not sure how valuable this test is, but it's very interesting... Please give your thoughts on this.

Opening and closing 100 selected files randomly 100,000 times from directories containing different numbers of files (relative times):

100 files: 100.0
1000 files: 100.4
10,000 files: 101.3
100,000 files: 109.6
1,000,000 files: 130.9

A performance hit of 30% when going from 100 to 1,000,000 files in a directory!

When I ran the tests again, they were not only faster, but the differences were almost zero:

100 files: 100.0
1000 files: 100.0
10,000 files: 100.6
100,000 files: 100.2
1,000,000 files: 100.3

Obviously, some caching is going on. So, if you open the same files over and over (and the number of files is small enough), it doesn't seem to matter how many files you keep in the directories.

This caching could suggest that the performance hit above would be larger if I had opened more than 100 distinct files. Another way of doing this test would be to read every single file in random order.
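
Something like this sketch would do it (just an illustration; the function name and reading the whole file contents are arbitrary choices):
Code:
import os
import random

def read_all_random(directory):
    # Read every file in the directory exactly once, in random order,
    # so that repeated opens of the same cached files matter less.
    names = os.listdir(directory)
    random.shuffle(names)
    for name in names:
        f = open(os.path.join(directory, name), "rb")
        f.read()
        f.close()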

Maybe I should have used the same 1,000,000 files in each test case and instead distributed them differently (100 files per directory, 1,000 files per directory, etc.). But then other variables would affect the results, such as how I distributed them -- path depth, number of directories, etc.

Details

I used a script to create files with random names of 10+3 characters (a 10-character name plus a 3-character extension). I copied the files from the "100" directory to the other directories, then added additional files. The files were almost empty (72 bytes).
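
A sketch of roughly how such a creation script could look (the character set, the 72-byte filler, and the target directory are assumptions for illustration, not the exact script I used):
Code:
import os
import random
import string

def create_test_files(directory, count):
    # Create 'count' files with random 10+3 character names, 72 bytes each.
    # Character set and contents are illustrative only.
    chars = string.ascii_lowercase + string.digits
    if not os.path.isdir(directory):
        os.makedirs(directory)
    for n in range(count):
        name = "".join(random.choice(chars) for c in range(10)) + ".txt"
        f = open(os.path.join(directory, name), "wb")
        f.write("x" * 72)
        f.close()

# create_test_files("c:\\temp\\test1000", 1000)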

Then I ran a Python script that opened and closed randomly selected files (from the 100 files above) in each directory. The source code is:

Code:
import datetime
import random

# Wall-clock time in milliseconds since midnight (assumes a run does not
# cross midnight).
def getMS():
    dt = datetime.datetime.now()
    ms = dt.microsecond / 1000
    ms += dt.second * 1000
    ms += dt.minute * 60000
    ms += dt.hour * 3600000
    return ms

# files.txt lists the 100 file names to open, one per line.
fh = open("files.txt", "r")
filenames = map(lambda fn: fn.strip(), fh.readlines())
fh.close()

random.seed()

NUMBER_OF_OPENS = 100000
TIMES_PER_CASE = 3

# Each test case is a directory named after the number of files it contains.
testcases = ["1000000", "100000", "10000", "1000", "100"]

for i in range(TIMES_PER_CASE):
    for testcase in testcases:
        starttime = getMS()
        for j in range(NUMBER_OF_OPENS):
            # Open and immediately close a randomly chosen file.
            filename = "c:\\temp\\test" + testcase + "\\" + random.choice(filenames)
            open(filename, "rb").close()
        endtime = getMS()

        print testcase, i, endtime - starttime

And the results:

Code:
C:\Temp>python -OO openfiles.py
1000000 0 16156
100000 0 13531
10000 0 12508
1000 0 12399
100 0 12346
1000000 1 12291
100000 1 12274
10000 1 11886
1000 1 11265
100 1 11117
1000000 2 11199
100000 2 11183
10000 2 11232
1000 2 11166
100 2 11166

Machine Specifications

I ran the tests on my old desktop, a DELL Optiplex 280 with a Pentium 4 CPU (2.8 GHz), 2 GB of DDR2 SDRAM, and an 80 GB Serial ATA-150, 7200 rpm hard drive (cache size unknown).

I'm using Windows XP SP3 with NTFS. I shut down all anti-virus, indexing and updating services and most programs before running the tests.

The hard drive was defragmented after creating the small files and before running the tests. I also rebooted before running the tests.
# 5  
Old 03-05-2009
We once did testing of this sort (with VMS, Novell, UNIX, MS...) and found that all truly preemptive, multi-process, multitasking OSes were outperformed by the others...
It's the price you pay for sharing your time equally between all the processes...

(For me it proved again that Windows servers (NT4, W2000) were still not completely preemptive multitasking...)
# 6  
Old 03-06-2009
I too have done similar testing, but slightly differently. Rather than simply timing the creation of 100 files, which can be very misleading, or timing the creation of 1M files, which is no better, I prefer to look at what is happening across the system's resources during the entire event.

I run collectl with a monitoring interval of 1 second, either logging to a file or simply watching the system in real time. When I create a million files I can watch the CPU load periodically increase. In fact, when getting to the higher file counts I can actually see spikes in CPU load. This is something you can't see when just looking at end-to-end numbers.

Another interesting test is to set up an alarm in your script that writes out the number of files created every 10th (or even hundredth) of a second. You'll be amazed to see how linearly the number of files created per second drops over time, as well as how things periodically slow down in ways that are not visible when only looking at second-level samples.
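
Something along these lines would do it (just a sketch; the 0.1-second interval and the output format are arbitrary, and setitimer is Unix-only):
Code:
import signal
import time

created = [0]        # running count of files created so far
start = time.time()

def report(signum, frame):
    # Fired every 0.1 s: print elapsed seconds and the running count.
    print "%.1f %d" % (time.time() - start, created[0])

signal.signal(signal.SIGALRM, report)
signal.setitimer(signal.ITIMER_REAL, 0.1, 0.1)

# ... inside the file-creation loop, do created[0] += 1 after each file ...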

You can also run collectl at a monitoring interval of 0.1 seconds and see micro-spikes in CPU load as well. This is something most people miss because none of the existing tools can deal with sub-second reporting.

-mark
# 7  
Old 03-06-2009
We were not talking about 100 files, but about tens of thousands of files...
Quote:
You'll be amazed to see how linearly the number of files created per second drops over time, as well as how things periodically slow down in ways that are not visible when only looking at second-level samples.
Doesn't that remind you of CPU scheduling priority over time?
