Is it slower to open or create a file in a directory with 1 million files than a directory with 1000 files? How much slower? Where can I find information about this?
I'm mainly concerned about JFS on AIX, but also NTFS on Windows Server. Is there a difference? I'm trying to determine a good way to store a large number of files (2 million, growing by about 200 K per year). Currently, I have them in six different directories, so it's about 300 K files per directory. Thanks!
Maybe I should create a JFS partition on a virtual Linux machine and do some benchmarking. I don't have access to an idle AIX machine comparable to the one used in production. Of course, such a test would differ from "reality" in some ways: different OS, CPU architecture and storage solutions (local IDE drive compared to SAN).
Windows XP Test
I haven't had the time to do a test on Linux yet, but I just finished a test on my Windows XP desktop machine (NTFS). I'm not sure how valuable this test is, but it's very interesting... Please give your thoughts on this.
Opening and closing 100 selected files randomly 100,000 times from directories containing different numbers of files (relative times):
100 files: 100.0
1000 files: 100.4
10,000 files: 101.3
100,000 files: 109.6
1,000,000 files: 130.9
A performance hit of 30% when going from 100 to 1,000,000 files in a directory!
When I ran the tests again, they were not only faster, but the differences were almost zero:
100 files: 100.0
1000 files: 100.0
10,000 files: 100.6
100,000 files: 100.2
1,000,000 files: 100.3
Obviously, some caching is going on. So, if you open the same files over and over (and the number of files is small enough), it doesn't seem to matter how many files you keep in the directories. This caching could suggest that the performance hit above would be larger if I had opened more than 100 files. Another way of doing this test would be to read every single file in random order.
Maybe I should have used the same 1,000,000 files in each test case and instead distributed them differently (100 files per directory, 1000 files per directory etc). But then other variables would affect the results, such as how I distributed them -- path depth, number of directories etc.
Details
I used a script to create files with random names of 10+3 characters. I copied the files from the "100 directory" to the other directories, then added additional files. The files were almost empty (72 bytes). Then I ran a Python script that opened and closed randomly selected files (from the 100 files above) in each directory. The source code is:
Code:
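# Python 2 script, as posted (note the print statement on the last line).
import datetime
import random

def getMS():
    # Milliseconds since midnight; a run that crosses midnight
    # would produce a negative difference.
    dt = datetime.datetime.now()
    ms = dt.microsecond / 1000
    ms += dt.second * 1000
    ms += dt.minute * 60000
    ms += dt.hour * 3600000
    return ms

# files.txt: one file name per line (presumably the 100 files
# that exist in every test directory).
fh = open("files.txt", "r")
filenames = map(lambda fn: fn.strip(), fh.readlines())
fh.close()

random.seed()
NUMBER_OF_OPENS = 100000
TIMES_PER_CASE = 3
testcases = ["1000000", "100000", "10000", "1000", "100"]

for i in range(TIMES_PER_CASE):
    for testcase in testcases:
        starttime = getMS()
        for j in range(NUMBER_OF_OPENS):
            # Pick one of the known files at random, open it and close it again.
            filename = "c:\\temp\\test" + testcase + "\\" + random.choice(filenames)
            open(filename, "rb").close()
        endtime = getMS()
        # Columns: files in directory, pass number, elapsed milliseconds.
        print testcase, i, endtime - starttime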
Code:
C:\Temp>python -OO openfiles.py
1000000 0 16156
100000 0 13531
10000 0 12508
1000 0 12399
100 0 12346
1000000 1 12291
100000 1 12274
10000 1 11886
1000 1 11265
100 1 11117
1000000 2 11199
100000 2 11183
10000 2 11232
1000 2 11166
100 2 11166
I ran the tests on my old desktop DELL Optiplex 280 with a Pentium 4 CPU (2.8 GHz), 2 GB DDR2 SDRAM and 80 GB Serial ATA-150, 7200 rpm hard drive (cache size unknown). I'm using Windows XP SP3 with NTFS. I shut down all anti-virus, indexing and updating services and most programs before running the tests. The hard drive was defragmented after creating the small files and before running the tests. I also rebooted before running the tests.
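For anyone wanting to check the relative figures, here is a small Python 3 sketch that derives them from the raw timings in the output above. The dictionary layout and the function name `relative_times` are just illustrative choices, not part of the original script. Judging by the arithmetic, the first set of relative times seems to match pass 0 and the second set seems to match pass 2.

```python
# Raw timings in milliseconds, copied from the posted output:
# {files in directory: [pass 0, pass 1, pass 2]}
timings_ms = {
    1000000: [16156, 12291, 11199],
    100000: [13531, 12274, 11183],
    10000: [12508, 11886, 11232],
    1000: [12399, 11265, 11166],
    100: [12346, 11117, 11166],
}


def relative_times(timings, run, baseline=100):
    """Return {file count: time as a percentage of the baseline directory}."""
    base = timings[baseline][run]
    return {n: round(100.0 * t[run] / base, 1) for n, t in sorted(timings.items())}


if __name__ == "__main__":
    for run in (0, 2):
        print("pass", run, relative_times(timings_ms, run))
```

Pass 1 is left out of the loop because it looks like a transition between the cold and the cached state, but it can be computed the same way.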
We once did testing of this sort (with VMS, Novell unix MS...) and found that all the true preemptive multi-process, multitasking OSes were outperformed by the others...
It's the price you pay for sharing your time equally between all the processes... (For me it proved again that Windows Server (NT4, W2000) was still not completely preemptive multitasking...)
I too have done similar testing, but slightly differently. Rather than simply timing the creation of 100 files, which can be very misleading, or timing the creation of 1M files, which is no better, I prefer to look at what is happening across the system resources during the entire event.
I run collectl with a monitoring interval of 1 second, logging to a file or simply watching the system in real time. When I create a million files I can watch the CPU load periodically increase. In fact, when the file count gets high I can actually see spikes in CPU load. This is something you can't see when just doing end-to-end numbers.
Another interesting test is to set up an alarm in your script to write out the number of files created every 10th (or even hundredth) of a second. You'll be amazed to see how linearly the number of files created per second drops over time, as well as how things periodically slow down in ways that are not visible when only looking at second-level samples. You can also run collectl at a monitoring interval of 0.1 seconds and see micro-spikes in CPU load as well. This is something most people miss because none of the existing tools can deal with sub-second reporting.
-mark
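The sub-second sampling idea could be sketched roughly like this in Python 3. Note that this is not an alarm-based script as described above: for portability it polls a monotonic clock after each file creation instead. The function name `create_files_sampled`, the file naming pattern and the file count are made up for illustration.

```python
import os
import tempfile
import time


def create_files_sampled(directory, total, interval=0.1):
    """Create `total` empty files; return (elapsed_seconds, files_created) samples."""
    samples = []
    start = time.perf_counter()
    next_sample = start + interval
    for i in range(total):
        # Create an empty file and close it immediately.
        with open(os.path.join(directory, "f%08d.dat" % i), "wb"):
            pass
        now = time.perf_counter()
        if now >= next_sample:
            samples.append((now - start, i + 1))
            # Skip ahead if a single create took longer than one interval.
            while next_sample <= now:
                next_sample += interval
    # Always record the final total so the last sample is complete.
    samples.append((time.perf_counter() - start, total))
    return samples


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        previous = 0
        for elapsed, count in create_files_sampled(tmp, 5000):
            print("%8.2f s  %8d files  (+%d)" % (elapsed, count, count - previous))
            previous = count
```

The "+N" column is the number of files created since the previous sample, which is the figure to watch for a gradual drop or periodic stalls. Polling adds a clock read per file, so the absolute numbers will be slightly pessimistic compared to a signal-driven version.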
We were not talking about 100 files, but about files by the tens of thousands...
Last edited by vbe; 03-06-2009 at 12:46 PM. Reason: minor correction...
Tags: inodes, jfs, ntfs