How to zip/tar millions of files?


 
# 1  
Old 12-08-2010

Hi guys,

I have an issue processing a large number of files. I have around 5 million files (some of them are actually directories) on a server.

I am unable to find out the exact number of files since it's taking "forever" to finish (See this thread for more on the issue).

Anyway, I now want to move these ~5 million files to a different location. My first thought was to tar/gzip the files and SCP them somewhere else as a single file. However, the tar process is also taking a very long time (in fact it never finished; I cancelled the job after 10 hours).

Basically I just want to build a single package containing the ~5 million files (zip, tar, cpio, raw data, whatever) so that I can easily move and transfer the files to a different location.

Any ideas?

Thank you.
# 2  
Old 12-08-2010
Here is your problem - you are reading millions of directory entries and writing to a tarfile.
Then you copy the tarfile somewhere, then extract. Tons of I/O writing the tarball, I/O copying it, I/O extracting it.

Eliminate the "middleman I/O".

FWIW:
Plus, assuming you actually want the data, you are perpetuating the problem: way too many file entries per directory. You really should reorganize the directory structure. Users are probably not reading those files very often, or you would be getting lots of complaints along the lines of 'It takes forever to read a file...'
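
One possible way to reorganize, sketched below, is to spread a flat directory into subdirectories keyed by the first two characters of each file name. This is only an illustration: the temporary directories, the demo file names and the two-character bucket rule are all assumptions, and it does not handle file names containing newlines.

```shell
#!/bin/sh
# Sketch: bucket a flat directory by the first two characters of each name.
# All paths and file names here are hypothetical demo values.
set -eu

src=$(mktemp -d)   # stands in for the overcrowded directory
dst=$(mktemp -d)   # stands in for the reorganized tree

# Demo data; a real run would skip this step.
for name in alpha.txt alps.txt beta.txt bravo.txt; do
    : > "$src/$name"
done

# find streams entries unsorted, so it starts emitting names right away
# instead of reading and sorting the whole directory first like "ls".
find "$src" -maxdepth 1 -type f | while IFS= read -r path; do
    name=${path##*/}                            # strip the directory part
    bucket=$(printf '%s' "$name" | cut -c1-2)   # first two characters
    mkdir -p "$dst/$bucket"
    mv "$path" "$dst/$bucket/"
done

# Show the resulting layout.
find "$dst" -type f | sort
```

With millions of files, a key taken from a hash of the name would probably spread entries more evenly than a name prefix; the prefix is used here only because it is easy to read.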

That said:
To move a directory tree from node to node without the middleman processing:
Code:
scp -r -p /my/path remotenode:/my/path

To relocate on the same box with tar, again without the middleman:
Code:
# "&&" keeps tar from unpacking into the current directory if cd fails
tar cf - /my/path | ( cd /new/target && tar xfp - )

BTW, don't kill this job before it is done, even though it could take a very long time. You will never make any progress if you keep killing off these jobs.
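
A variant of the same pipe, as a minimal sketch: it assumes a tar that supports -C (GNU tar, bsdtar and BusyBox tar do) and uses throwaway temporary directories in place of the real source and target.

```shell
#!/bin/sh
# Sketch: copy a tree through a tar pipe with no intermediate tarball.
# The temporary directories and file names are hypothetical demo values.
set -eu

src=$(mktemp -d)   # parent of the tree to copy
dst=$(mktemp -d)   # new location

mkdir -p "$src/data/sub"
echo one > "$src/data/a.txt"
echo two > "$src/data/sub/b.txt"

# -C changes directory before archiving or extracting, so member names
# stay relative ("data/...") and a bad target directory makes tar fail
# instead of unpacking somewhere unexpected.
tar -C "$src" -cf - data | tar -C "$dst" -xpf -

# List what arrived.
find "$dst" -type f | sort
```

Because the archive is never written to disk, each file is read once and written once, which is the point of dropping the middleman.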
This User Gave Thanks to jim mcnamara For This Post:
# 3  
Old 12-08-2010
When you say somewhere else, I guess you mean a different server?
I imagine the "gzip" and the SSH encryption in scp are both adding quite an overhead.

There is a tradeoff between network speed, CPU speed and disk speed. You may find it more efficient to leave out the gzip if you have a fast network. Also, if you are not worried about security, using "rsh" instead of "ssh/scp" will be quicker, since rsh does not encrypt the stream. Probably the quickest would be:
Code:
tar cf - my/path | rsh -l user host "cd /new/target && tar xfp -"

However if network speed is slower and you need security something like "rsync" with the "-z" option may be better for you.
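
To get a feel for the gzip side of that tradeoff before committing to a long transfer, one option is to compare the size of the raw and compressed tar streams on a small sample. The sketch below generates its own sample files (an assumption; a real test would point at a representative subdirectory) and only measures bytes, not CPU time.

```shell
#!/bin/sh
# Sketch: compare how many bytes a tar stream needs with and without gzip.
# The sample directory and its contents are hypothetical demo data.
set -eu

sample=$(mktemp -d)
i=0
while [ "$i" -lt 200 ]; do
    printf 'log line %s: status=ok user=demo\n' "$i" > "$sample/file_$i.txt"
    i=$((i + 1))
done

# tr strips the padding some wc implementations add around the count.
raw=$(tar -C "$sample" -cf - . | wc -c | tr -d ' ')
packed=$(tar -C "$sample" -cf - . | gzip -1 | wc -c | tr -d ' ')

echo "uncompressed stream: $raw bytes"
echo "gzip -1 stream:      $packed bytes"
```

Small text files tend to compress well, partly because every tar member carries a mostly empty 512-byte header. Whether that saving beats the extra CPU load depends on the network; prefixing each pipeline with "time" gives a rough idea of the CPU cost.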
This User Gave Thanks to citaylor For This Post:
# 4  
Old 12-08-2010
Please state what Operating System you have and describe your hardware configuration, including memory, discs and anything else relevant to performance.

Is it safe to assume that the filesystem will be quiescent?

Do you have spare discs equivalent to say twice the existing space? Personally I would copy the entire filesystem first to produce a defragmented filesystem which runs at a reasonable speed. This is also intended to prove that the original disc can be read from end-to-end.

Please post the current values of:
Code:
df -i
df -k

This User Gave Thanks to methyl For This Post:
# 5  
Old 12-08-2010
First of all, thanks all for your suggestions...

Well, even though I have a lot of files, each one of them is pretty small.

I'd say that the whole ~5 million text files do not take more than 50 GB of disk space*

Some details about my system:

Code:
RHEL 5.5 (Tikanga)
Kernel 2.6.18 x86_64
Local ext3 LVM (3 Physical Volumes)
4.5 GB RAM

My Disks:
Code:
SCSI device sdb: 31457280 512-byte hdwr sectors
sd 0:0:0:0: Attached scsi disk sdb
  Type:   Direct-Access                      ANSI SCSI revision: 02
 target0:0:1: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)

Code:
[root@atlas ~]# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/volAvg-A1lv
                     15466496 4464503 11001993   29% /export
					 
[root@atlas ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/volAvg-A1lv
                     121790564  89066984  26538244  78% /export

*According to df -k I'm using around 80 GB but that's because I have other files in the same filesystem that are eating up close to 30 GB.

Quote:
When you say somewhere else, I guess you mean a different server ?
A different server or a different filesystem within the same server; whatever approach is faster.

Quote:
Is it safe to assume that the filesystem will be quiescent?
Not exactly quiescent but with very little disk usage since I will run this process at night when nobody uses the server.

Quote:
Do you have spare discs equivalent to say twice the existing space?
Yes, I can attach more disks if necessary.

Last edited by verdepollo; 12-08-2010 at 03:33 PM..
# 6  
Old 12-08-2010
Copying between different filesystems on the same server will be much quicker than copying over the network.
Probably one of the fastest ways is the one mentioned previously:
Code:
tar cf - /my/path | ( cd /new/target && tar xfp - )

This has the added benefit of making the files contiguous on the new filesystem.
It also doesn't compress or encrypt the files, both of which would hit the CPU.
# 7  
Old 12-29-2010
Guys,

Just wanted to thank you.

Since the original request was creating more issues than expected I have opted for a whole disk backup (the disk is not that big)... Faster and less problematic.

Thanks for your valuable suggestions, though. :)