Copying Thousands of Tiny or Empty Files?


 
# 1  
Old 04-28-2009

There is a procedure I do here at work where I have to synchronize file systems. The source file system always has three or four directories of hundreds of thousands of tiny (1k or smaller) or empty files. Whenever my rsync command reaches these directories, I'm waiting for hours for those files to finish copying. Is there any way to decrease the time it takes for those files to be copied?

The files are generated by an application that definitely needs them, and I'm in no position to dispense with them. I wondered about tarring the directories first, but I suspect I'd merely be moving the time spent copying them during rsync to the time spent creating the archive in the first place.
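(One variation on the tar idea that wouldn't create an archive at all: piping one tar straight into another, so the data only streams through a pipe. It would only help the initial full-copy case, not a true sync, and I don't know yet whether it would actually beat rsync here.)

Code:
# stream the tree between filesystems; no intermediate archive is ever written
(cd /source_dir && tar cf - .) | (cd /dest_dir && tar xf -)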

My rsync command is pretty basic:

Code:
rsync -auvlxHS /source_dir/ /dest_dir/

Usually /dest_dir/ is a new, empty file system, so it really is a full copy, but sometimes an actual incremental sync is done. Either way, if there's a better approach than my rsync, I'd like to know.
# 2  
Old 04-28-2009
Can you run multiple 'threads' of rsync - divide up the source tree and dest tree among several rsync processes?

Code:
# trailing slashes copy each directory's contents, matching the original command's semantics
rsync -auvlxHS /source_dir/dir1/ /dest_dir/dir1/
rsync -auvlxHS /source_dir/dir2/ /dest_dir/dir2/
rsync -auvlxHS /source_dir/dir3/ /dest_dir/dir3/
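If there are more than a handful of top-level directories, the same idea is easy to script. A rough sketch (it assumes everything directly under /source_dir is a directory, and drops -v because interleaved output from concurrent runs is unreadable):

Code:
# launch one rsync per top-level directory, all running concurrently
for d in /source_dir/*/ ; do
    rsync -aulxHS "$d" "/dest_dir/$(basename "$d")/" &
done
wait    # don't return until every transfer has finished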

When you create lots of files and directories, there is substantially more filesystem overhead than when you just write to an existing file. You may want to do some serious filesystem tuning on the destination box, particularly the /dest_dir filesystem.

Also, having huge numbers of files in a single directory really bogs things down: readdir() takes far longer to complete a full scan of a large directory, for example.

What OS?
# 3  
Old 04-28-2009
I could try running multiple instances; that's a good idea, at least to test whether it gives any speed increase over the single rsync process. The OS itself is HP-UX 11.11, but we expect to be moving to 11.31 soon-ish. The filesystem is VxFS, and it was created with the 'largefiles' option because we also have files that are 8 to 12 GB in size.

The application uses the small/empty files as some kind of "label" for information in a database that needs to be changed in an indexing process. I'm not clear on it as that portion isn't my responsibility. I've been told that they're necessary. As such, I'm hoping to increase the speed of transfer. However, tuning the FS might not be workable since I need both large files and these small/empty ones.

To add to that, when it is a true sync instead of a full copy, these empty files are always different, so basically it winds up being a full copy anyway. The files are deleted and new ones created on a daily basis during the week.
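When I test, I'll probably just time the single-process run against a concurrent one over the same tree, recreating the empty destination between runs. Something like this, where dir1 through dir3 stand in for the real directory names:

Code:
# baseline: one rsync over the whole tree
time rsync -auvlxHS /source_dir/ /dest_dir/

# candidate: one rsync per problem directory, run concurrently
time sh -c 'rsync -aulxHS /source_dir/dir1/ /dest_dir/dir1/ &
    rsync -aulxHS /source_dir/dir2/ /dest_dir/dir2/ &
    rsync -aulxHS /source_dir/dir3/ /dest_dir/dir3/ &
    wait'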

# 4  
Old 04-28-2009
try man vxtunefs
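For example, it can show and adjust the live tunables on the destination filesystem. The parameter and value below are purely illustrative; pick them from the man page to suit your I/O sizes:

Code:
# print the current VxFS tunables for the filesystem
vxtunefs /dest_dir

# example of setting one tunable online (illustrative value only)
vxtunefs -o write_pref_io=65536 /dest_dir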