The UNIX and Linux Forums  
#1 (permalink)
03-03-2009
cyner
Registered User
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 3
Performance Hit With Many Files

Is it slower to open or create a file in a directory with 1 million files than a directory with 1000 files? How much slower? Where can I find information about this?

I'm mainly concerned about JFS on AIX, but also NTFS on Windows Server. Is there a difference?

I'm trying to determine a good way to store a large number of files (2 million, and growing by about 200 K per year). Currently, I have them in six different directories, so it's about 300 K files per directory.

Thanks!
#2 (permalink)
03-03-2009
otheus (Forum Staff)
Moderator ala Mode
Join Date: Feb 2007
Location: Innsbruck, Austria
Posts: 1,884
Excellent question! It is indeed filesystem-dependent. Such performance metrics are hard to come by, since there are so many variables, and to do a good apples-to-apples comparison, you need the same computer with the same disks and multiple OSes installed. But it may not really be necessary to get such statistics. One can look at the filesystem architecture and directory handling semantics, and conclude that one might be better than the other.

Unfortunately, I cannot provide specifics on either JFS or NTFS. However, on ReiserFS and modern versions of ext2 (on filesystems created with -O DIR_INDEX), file creation and lookup are very fast; they both use a hash index to find files. So if you know the name of the file, it can be found almost instantly (as I understand it and have experienced). On older versions of ext2, things really started to slow down after the directory entry itself extended to one or more indirect blocks -- maybe at 1000 files or so.

You do know one way to store large amounts of files is to create a directory hierarchy that is keyed on the filenames themselves? So files named "ergo1802.txt" might be stored in:
Code:
    data/er/go/18/ergo1802.txt
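
A minimal sketch of that layout in Python, assuming two-character path components taken from the start of the file name and three levels of nesting. The function name `sharded_path` and its parameters are only illustrative, not part of any existing system:

```python
import os


def sharded_path(filename, root="data", levels=3, width=2):
    """Return a nested path for filename keyed on its leading characters.

    Names shorter than levels * width characters simply get fewer
    directory levels.
    """
    stem = os.path.splitext(filename)[0]
    parts = [stem[i:i + width] for i in range(0, levels * width, width)]
    parts = [p for p in parts if p]  # drop empty chunks for short names
    # Joined with "/" to match the path notation used in this thread.
    return "/".join([root] + parts + [filename])


print(sharded_path("ergo1802.txt"))  # data/er/go/18/ergo1802.txt
```

With three levels, no single directory has to hold more than a small fraction of the files, provided the leading characters of the names are reasonably evenly distributed.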
#3 (permalink)
03-03-2009
cyner
Registered User
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 3
Quote:
Originally Posted by otheus
You do know one way to store large amounts of files is to create a directory hierarchy that is keyed on the filenames themselves? So files named "ergo1802.txt" might be stored in:
Code:
    data/er/go/18/ergo1802.txt
Yes, that would probably be the smartest design if the file names are randomized or evenly distributed. I know the user directories at Sourceforge are organized like that (something like /home/u/us/username). Unfortunately, I've inherited a legacy system, and first need to determine if it's worth the trouble to change the design.

Maybe I should create a JFS partition on a virtual Linux machine and do some benchmarking. I don't have access to an idle AIX machine comparable to the one used in production. Of course, such a test would differ from "reality" in some ways: different OS, CPU architecture and storage solutions (local IDE drive compared to SAN).
#4 (permalink)
03-04-2009
cyner
Registered User
Join Date: Mar 2009
Location: Stockholm, Sweden
Posts: 3
Windows XP Test

I haven't had the time to do a test on Linux yet, but I just finished a test on my Windows XP desktop machine (NTFS). I'm not sure how valuable this test is, but it's very interesting... Please give your thoughts on this.

Opening and closing 100 selected files, chosen at random, 100,000 times from directories containing different numbers of files (relative times):

100 files: 100.0
1000 files: 100.4
10,000 files: 101.3
100,000 files: 109.6
1,000,000 files: 130.9

A performance hit of 30% when going from 100 to 1,000,000 files in a directory!

When I ran the tests again, they were not only faster, but the differences were almost zero:

100 files: 100.0
1000 files: 100.0
10,000 files: 100.6
100,000 files: 100.2
1,000,000 files: 100.3

Obviously, some caching is going on. So, if you open the same files over and over (and the number of files is small enough), it doesn't seem to matter how many files you keep in the directories.

This caching suggests that the performance hit above might have been larger if I had opened more than 100 files. Another way of doing this test would be to read every single file in random order.
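
A rough sketch of that variant, assuming the directory holds only regular files. The function name `time_open_all` is made up for illustration, and this has not been run against the test directories described here:

```python
import os
import random
import time


def time_open_all(directory):
    """Open and close every file in directory once, in random order.

    Returns (number_of_files, elapsed_seconds). The directory listing
    is taken before the clock starts, so only the opens are timed.
    """
    names = os.listdir(directory)
    random.shuffle(names)
    start = time.perf_counter()
    for name in names:
        open(os.path.join(directory, name), "rb").close()
    return len(names), time.perf_counter() - start
```

One caveat: listing the directory first may itself pull some directory metadata into the cache, so a stricter test might read the names from a prepared list such as files.txt instead.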

Maybe I should have used the same 1,000,000 files in each test case and instead distributed them differently (100 files per directory, 1000 files per directory etc). But then other variables would affect the results, such as how I distributed them -- path depth, number of directories etc.

Details

I used a script to create files with random names of 10+3 characters. I copied the files from the "100 directory" to the other directories, then added additional files. The files were almost empty (72 bytes).
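
The creation script is not shown. A minimal sketch of what such a script could look like, assuming "10+3" means a 10-character base name plus a 3-character extension, and using arbitrary filler bytes for the 72-byte content (all names here are hypothetical):

```python
import os
import random
import string


def random_name(rng):
    """Return 10 random characters, a dot, and a 3-character extension."""
    # Lowercase only, so names cannot collide on a case-insensitive filesystem.
    alphabet = string.ascii_lowercase + string.digits
    base = "".join(rng.choice(alphabet) for _ in range(10))
    ext = "".join(rng.choice(alphabet) for _ in range(3))
    return base + "." + ext


def create_files(directory, count, size=72, seed=None):
    """Create count distinct files of size bytes each; return their names."""
    rng = random.Random(seed)
    os.makedirs(directory, exist_ok=True)
    names = set()
    while len(names) < count:
        names.add(random_name(rng))
    for name in names:
        with open(os.path.join(directory, name), "wb") as fh:
            fh.write(b"x" * size)  # filler content, an assumption
    return sorted(names)
```

Writing the returned names of the first 100 files to files.txt, one per line, would give the benchmark script below its input.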

Then I ran a Python script that opened and closed randomly selected files (from the 100 files above) in each directory. The source code is:

Code:
import random
import time

NUMBER_OF_OPENS = 100000
TIMES_PER_CASE = 3

# One directory per test case, named c:\temp\test<number of files>.
testcases = ["1000000", "100000", "10000", "1000", "100"]

# files.txt lists the 100 file names that exist in every test directory.
with open("files.txt", "r") as fh:
    filenames = [line.strip() for line in fh if line.strip()]

random.seed()

for i in range(TIMES_PER_CASE):
    for testcase in testcases:
        # perf_counter is a monotonic, high-resolution clock.
        starttime = time.perf_counter()
        for j in range(NUMBER_OF_OPENS):
            filename = "c:\\temp\\test" + testcase + "\\" + random.choice(filenames)
            open(filename, "rb").close()
        endtime = time.perf_counter()

        # Print: files in directory, run number, elapsed milliseconds.
        print(testcase, i, int((endtime - starttime) * 1000))
And the results:

Code:
C:\Temp>python -OO openfiles.py
1000000 0 16156
100000 0 13531
10000 0 12508
1000 0 12399
100 0 12346
1000000 1 12291
100000 1 12274
10000 1 11886
1000 1 11265
100 1 11117
1000000 2 11199
100000 2 11183
10000 2 11232
1000 2 11166
100 2 11166
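
The relative figures quoted earlier appear to be each run's timings divided by that run's 100-file timing: run 0 reproduces the first table and run 2 the second. A short sketch of that arithmetic, with the raw numbers copied from the output above (the helper name `relative_times` is just for illustration):

```python
# results[run][files_in_directory] = elapsed milliseconds
results = {
    0: {1000000: 16156, 100000: 13531, 10000: 12508, 1000: 12399, 100: 12346},
    1: {1000000: 12291, 100000: 12274, 10000: 11886, 1000: 11265, 100: 11117},
    2: {1000000: 11199, 100000: 11183, 10000: 11232, 1000: 11166, 100: 11166},
}


def relative_times(run):
    """Scale one run so that the 100-file directory equals 100.0."""
    baseline = run[100]
    return {size: round(100.0 * ms / baseline, 1)
            for size, ms in sorted(run.items())}


for run_number, run in sorted(results.items()):
    print(run_number, relative_times(run))
```

Note that the 1,000,000-file directory is always timed first within a run, so part of the gap in run 0 may come from the cold cache right after the reboot rather than from directory size alone.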
Machine Specifications

I ran the tests on my old Dell Optiplex 280 desktop with a Pentium 4 CPU (2.8 GHz), 2 GB of DDR2 SDRAM and an 80 GB Serial ATA-150, 7200 rpm hard drive (cache size unknown).

I'm using Windows XP SP3 with NTFS. I shut down all anti-virus, indexing and updating services and most programs before running the tests.

The hard drive was defragmented after creating the small files and before running the tests. I also rebooted before running the tests.
#5 (permalink)
03-05-2009
vbe (Forum Staff)
Moderator
Join Date: Sep 2005
Location: Switzerland - GE
Posts: 1,568
We once did testing of this sort (with VMS, Novell unix MS...) and found that all the true preemptive multi-process, multitasking OSes were outperformed by the others...
It's the price you pay for sharing time equally between all the processes...

(For me it proved again that Windows Server (NT4, W2000) was still not completely preemptive multitasking...)
#6 (permalink)
03-06-2009
MarkSeger
Registered User
Join Date: Oct 2007
Posts: 14
I too have done similar testing, but slightly differently. Rather than simply timing the creation of 100 files, which can be very misleading, or timing the creation of 1M files, which is no better, I prefer to look at what is happening across the system resources during the entire event.

I run collectl with a monitoring interval of 1 second, logging to a file or simply watching the system in real time. When I create a million files I can watch the CPU load periodically increase. In fact, when getting into the higher numbers of files I can actually see spikes in CPU load. This is something you can't see when just looking at end-to-end numbers.

Another interesting test is to set up an alarm in your script to write out the number of files created every 10th (or even hundredth) of a second. You'll be amazed to see how linearly the number of files created/second drops over time as well as how things periodically slow down but are not visible when only looking at second-level samples.
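
A minimal Python sketch of that kind of sub-second sampling. Instead of a real alarm signal it simply checks the clock after each file is created, which is a simplification; the function name, file naming scheme and file count are all assumptions:

```python
import os
import tempfile
import time


def sample_creation_rate(directory, total_files, interval=0.1):
    """Create total_files empty files in directory.

    Returns a list of (elapsed_seconds, files_created_so_far) samples,
    one roughly every `interval` seconds, plus a final sample.
    """
    samples = []
    start = time.perf_counter()
    next_sample = start + interval
    for n in range(total_files):
        # Hypothetical naming scheme: zero-padded sequence numbers.
        path = os.path.join(directory, "f%08d.dat" % n)
        open(path, "wb").close()
        now = time.perf_counter()
        if now >= next_sample:
            samples.append((now - start, n + 1))
            next_sample = now + interval
    samples.append((time.perf_counter() - start, total_files))
    return samples


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prev_t, prev_n = 0.0, 0
        for t, n in sample_creation_rate(tmp, 5000):
            rate = (n - prev_n) / max(t - prev_t, 1e-9)
            print("%8.3f s  %8d files  %10.0f files/s" % (t, n, rate))
            prev_t, prev_n = t, n
```

Plotting the files/s column against the running file count would show whether the creation rate falls as the directory grows, and whether there are short stalls that a one-second sample would average away.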

You can also run collectl at a monitoring interval of 0.1 seconds and see micro-spikes in CPU load as well. This is something most people miss because none of the existing tools can deal with sub-second reporting.

-mark
#7 (permalink)
03-06-2009
vbe (Forum Staff)
Moderator
Join Date: Sep 2005
Location: Switzerland - GE
Posts: 1,568
We were not talking about 100 files but about files by the tens of thousands...
Quote:
You'll be amazed to see how linearly the number of files created/second drops over time as well as how things periodically slow down but are not visible when only looking at second-level samples.
Doesn't that remind you of CPU scheduling priority over time?

Last edited by vbe; 03-06-2009 at 12:46 PM.. Reason: minor correction...

Tags
inodes, jfs, ntfs

All times are GMT -4.

