I have a 13G .gz archive. The problem is that when I expand it, it grows to 300G, and I don't have that much HDD space. It is one huge file: rrc.tar.gz. What I want to do is extract the archive, but gzip each resulting file as it is extracted.
So, if extracting the archive gives me an uncompressed directory, I want each of the files to be gzipped as and when it is extracted. For example, if the resulting directory is something like
2007/fileA.txt
2007/fileB.txt
I want fileA.txt to be gzipped into fileA.txt.gz before it goes on to extract fileB.txt. Is there a way to do this?
I was successful with this:
gunzip -c rrc.tar.gz | tar -tf - > contents
while read -r f; do gunzip -c rrc.tar.gz | tar -xf - "$f"; gzip "$f"; done < contents
but it has the huge drawback of decompressing the whole archive once for every single file it extracts.
It'd be much better to do "gunzip -c" once, and then parse the output
Google two interesting tar options: --to-stdout (-O) and --to-command=
I need to go now, but I'll be glad if you share the solution with us. It's interesting.
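For the record, GNU tar's --to-command option pipes each extracted member's contents to a program instead of writing the file to disk, and it exports the member's path in the TAR_FILENAME environment variable. A single-pass sketch (GNU tar only; the helper name gzip-member.sh is made up here, and it assumes the archive members are regular files) could be a small helper script:

```shell
#!/bin/sh
# gzip-member.sh -- invoked by GNU tar once per archive member;
# the member's bytes arrive on stdin, its path in $TAR_FILENAME.
mkdir -p "$(dirname "$TAR_FILENAME")"   # recreate the directory tree
exec gzip -c > "$TAR_FILENAME.gz"       # compress straight from the stream
```

driven by `tar -xzf rrc.tar.gz --to-command=./gzip-member.sh`: one decompression pass, and no uncompressed copy of any file ever touches the disk.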
Thank you for the advice. There is a heavy resource constraint, so I will try to explore more. I could think of one solution, and I would appreciate it if someone could provide a better one...
I would start a plain extraction in the background, and then set up a crontab to look for new files with a certain extension in the current directory. If one is found, I would gzip it. This is the simplest approach I could think of. Please let me know whether it is the best, though.
Come to think of it, I am now facing another problem. If a file is in the middle of extraction, there is a chance that the cron job will pick it up and run gzip on it, which could be a problem. Is there a way to tell the find command to find only those files which are not being accessed by any other process?
I never imagined I would face so many problems with a directory archived the wrong way. In any case, I was able to convert a complete directory archive into a directory of archives. Here's a solution for those interested:
Problem:
The whole directory was gzipped as a single archive, so it is almost impossible to extract particular files from it efficiently.
Constraint:
The archive is 13G and expands to 250G, but the disk capacity is only 50G.
Conventional Answer:
13G Directory Archive --> Expands to 250G --> Converted into 13G Directory of Archives
Answer:
Convert the directory archive into a directory of archives.
Solution:
Step 1:
Prepare a shell script, checkAndGzip.sh, and place it in the directory where the archive is to be extracted. Note: observe the use of lsof, a nice utility that tells you whether a file is in use.
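The script itself did not survive in the post; a minimal sketch of what checkAndGzip.sh could have looked like (the exact find filters are assumptions, chosen so the script skips already-compressed files and itself) is:

```shell
#!/bin/sh
# checkAndGzip.sh -- gzip every fully-extracted file in the tree.
# lsof succeeds for files some process still holds open (i.e. files
# tar is still writing); those are skipped until the next cron run.
find . -type f ! -name '*.gz' ! -name '*.sh' | while read -r f; do
    if lsof "$f" > /dev/null 2>&1; then
        continue    # still being extracted; try again next run
    fi
    gzip "$f"
done
```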
Step 2:
Set up a cron job for the script.
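The crontab line was lost from the post; given the note below, it presumably looked something like this (the path is hypothetical):

```shell
# crontab entry: run checkAndGzip.sh every two minutes
*/2 * * * * cd /path/to/extraction/dir && sh checkAndGzip.sh
```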
Note: Cron runs every two minutes
Step 3:
Start the extraction in the same directory.
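The extraction command itself was lost from the post; given the earlier steps it was presumably the plain streaming extraction, run in the background so the cron job can compress files while extraction continues:

```shell
# stream-decompress and extract; backgrounded so the cron job can
# gzip completed files as they appear
nohup sh -c 'gunzip -c rrc.tar.gz | tar -xf -' > extract.log 2>&1 &
```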
Logic: The logic is pretty simple. On one side, the extraction takes place; on the other, the cron job runs a shell script that checks whether a new file has been generated and, if so, gzips it. The reason we use lsof is to verify whether the file is still being extracted (gzip doesn't seem to care about partial files); if a file is in use, it is skipped during that run.
If anyone has a better solution, or an improvement to the one above, kindly suggest it.
and a few suggestions:
- You can consolidate all the commands into one script,
- and you can use the sleep command within the script instead of setting up a cron process that runs every 2 minutes.
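That suggestion could be sketched roughly as follows (a single hypothetical script; POLL_SECS is an assumed knob defaulting to the 2-minute cron interval, and the lsof guard is carried over from checkAndGzip.sh, so if lsof is unavailable the partial-file protection silently disappears):

```shell
#!/bin/sh
# One script instead of extraction + cron: start the extraction in
# the background, then poll and gzip files tar has finished writing.
gunzip -c rrc.tar.gz | tar -xf - &
tarpid=$!

while kill -0 "$tarpid" 2>/dev/null; do
    find . -type f ! -name '*.gz' ! -name '*.sh' | while read -r f; do
        # lsof succeeding means some process (tar) still has it open
        lsof "$f" > /dev/null 2>&1 || gzip "$f"
    done
    sleep "${POLL_SECS:-120}"
done
wait "$tarpid" 2>/dev/null

# final sweep: everything left is fully extracted
find . -type f ! -name '*.gz' ! -name '*.sh' -exec gzip {} +
```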
Thanks for the improvement. Actually, on my system, for some reason, the find command doesn't work: the extraction takes place, but the gzipping part doesn't seem to happen.
The first time find runs, it doesn't find any files (or finds only a few files that are still in use), so it exits out of the loop... could that be the cause, by any chance?