Decompress (with gunzip) recursively, but do not delete original gz file
Hi all,
I have a folder hierarchy with many gz files in them. I would like to recursively decompress them, but keep the original files. I would also like to move all the decompressed files (these are very large HDF5 files with .CP12 extension) to another data folder.
Currently I am using four steps to achieve this:
1. Make a copy of the source directory hierarchy:
2. Inside new_archive, gunzip recursively:
3. Move all decompressed files from new_archive hierarchy to a data folder:
4. Remove new_archive (empty hierarchy)
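For concreteness, the four steps above might look like this as shell commands (the directory names old_archive, new_archive and data are assumptions based on the description):

```shell
# Step 1: copy the whole source hierarchy (slow for terabyte-size data)
cp -r old_archive new_archive
# Step 2: decompress recursively inside the copy; -r walks subdirectories
gunzip -r new_archive
# Step 3: move every decompressed file into the flat data folder
find new_archive -type f -name '*.CP12' -exec mv {} data/ \;
# Step 4: remove the now-empty hierarchy
rm -r new_archive
```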
This works: old_archive keeps the gz files and data holds the decompressed versions. But it is time-consuming.
My question is: how can I perform this recursive extraction efficiently? I would like to avoid step 1, since it takes a very long time (terabyte-size datasets).
I need to prevent gunzip's default behavior of removing the original gz file. Since gunzip (i.e. gzip -d) has a "-c" option to extract to stdout, how can I recursively extract and put the results in data?
If I write a simple shell script for this, would running gunzip -c hdf-file.gz > hdf-file recursively be more efficient than doing cp and then gunzip? Note that the decompressed files can be very big (gigabytes each), so I also want to avoid size-related errors when redirecting to a file. Could someone comment on this? Thanks in advance!
---------- Post updated at 06:21 PM ---------- Previous update was at 06:03 PM ----------
Here is another script version for the same:
Is there a better way?
What Operating System and version are you running?
What Shell do you use?
Can you post a sample directory listing of a representative directory?
Are old_archive and new_archive on the same filesystem? This question is very important because of the way "mv" works.
Scheduling. How often do you run this job? There would appear to be opportunity to carry out the online backup of the original files in advance.
Quote:
But it is time-consuming.
How much time? Seconds, Minutes, Hours, Days, Weeks ?
Quote:
size-related errors during the pipe process
Not clear what this means. Please post what you typed, what you expected to happen, what actually happened. Don't forget to include any error messages and a "ls -la" directory listing of any files involved.
@Methyl: I am running Ubuntu Server 10.10 and using bash.
The directory structure is like this: /old_archive/
--/year2009/
----/001/
----/002/
.
.
----/300/
--/year2008/
----/001/
----/002/
.
.
(each of the 001/ to 300/ folders has 10+ large gz files).
Yes, currently both archive locations are on the same filesystem. BTW, the disks are set up as RAID 0.
And the entire process takes on the order of days. It's not run often (perhaps once a month), but I like your "online" backup idea. I will try that. That brings up another interesting idea: is there a way to parallelize the cp or mv operations? Can I break the work into execution threads running simultaneously?
About the piping error I mentioned: I did not actually see any such errors, but I was wondering whether sending a gigabyte-sized decompressed file to stdout (and redirecting it to a file) has a chance of generating errors. Has anyone seen such problems?
for dir in *
do
    if [ -d "$dir" ]
    then
        echo "--- Entering directory $dir ---"
        for file in "${dir}"/*.gz
        do
            [ -e "$file" ] || continue          # skip directories with no .gz files
            fname=$(basename "$file" .gz)
            echo "Now processing $fname ..."
            gunzip -cv "$file" > "$fname"       # -c keeps the original .gz
            mv -iv "$fname" ~/data
        done
    fi
done
Assuming that I have understood this correctly, I think that the script contains fundamental design errors which make it slow. Writing gigabytes through a shell redirect (">") is not a good idea.
It would be considerably faster to copy the zipped files directly to the target directory then unzip in the target directory using "gunzip" (not "gunzip -c") on the file copy. Maybe you had an issue copying the directory tree?
The original process describes copying the original tree, decompressing each file, then copying the decompressed files to the target tree. It is much easier to copy the whole tree of compressed files to the target directory using "find ." piped to "cpio -pdum /target_directory" then decompress in the target directory. This technique for copying files is described in the man pages for "find" and "cpio" - do read both manuals and try on a test system first. Not clear whether there is anything present in the target directories already.
My idea only makes sense if you are copying all files. If ".CP12" files are a selection then we need a different technique. It also matters if the various directories are on different filesystems (because "mv" becomes a copy rather than a rename if they are).