Converting Huge Archive into smaller ones


 
# 1  
Old 10-15-2008

I have a 13G gz archive... The problem is that when I expand it, it grows to 300G, and I don't have that much hdd space. It is one huge file: rrc.tar.gz. What I want to do is extract the archive but gzip each resulting file as it is extracted.

So, if

Code:
gunzip -c rrc00.tar.gz | tar -xvf -

gives me an uncompressed directory, I want each of the files to be gzipped as and when they are extracted. So for example, if the resulting directory is something like

2007/fileA.txt
2007/fileB.txt

I want fileA.txt to be gzipped into fileA.txt.gz before it goes and extracts fileB.txt. Is there a way that this is possible?
# 2  
Old 10-15-2008
I was successful with this:

Code:
gunzip -c rr.tar.gz | tar -tf - > contents
for f in `cat contents`; do gunzip -c rr.tar.gz | tar -xf - $f; gzip $f; done

but it has a huge drawback: it gunzips the whole archive again for each file, just to extract one. It would be much better to do "gunzip -c" once and parse the output.

Google for two interesting tar options: --to-stdout (-O) and --to-command=
I need to go now, but I'll be glad if you share the solution with us. It's interesting.
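For what it's worth, here is a minimal sketch of that --to-command idea, assuming GNU tar (the rrc_demo.tar.gz archive and its 2007/ contents are made up on the spot so the snippet is self-contained):

```shell
#!/bin/sh
# Build a toy archive just for the demo (stands in for rrc00.tar.gz).
mkdir -p demo/2007
echo A > demo/2007/fileA.txt
echo B > demo/2007/fileB.txt
tar -C demo -czf rrc_demo.tar.gz 2007
rm -rf demo

# The trick: GNU tar's --to-command pipes each member's bytes to a command
# instead of writing the plain file to disk; tar exports the member's path
# in $TAR_FILENAME. Each file is gzipped as it streams by, so the full
# uncompressed copy never lands on the disk.
gunzip -c rrc_demo.tar.gz | tar -xf - \
    --to-command='mkdir -p "$(dirname "$TAR_FILENAME")"; gzip -c > "$TAR_FILENAME.gz"'
```

On the real archive that would be `gunzip -c rrc00.tar.gz | tar -xf - --to-command=...`, with no cron job needed at all.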
# 3  
Old 10-15-2008
Thank you for the advice. There is a heavy resource constraint, so I will try to explore more. I could think of one solution, and I would appreciate it if someone could provide a better one...

I would do a

Code:
 gunzip -c rrc00.tar.gz | tar -xvf -

And then set up a crontab to look for new files with a certain extension in the current directory; whenever one appears, gzip it. This is the simplest approach I could think of. Please let me know whether there is a better one, though.
# 4  
Old 10-15-2008
Come to think of it, I am now facing another problem. If a file is still in the middle of extraction, there is a chance the cron job will pick it up and run gzip on it, which could be a problem. Is there a way to tell the find command to match only files that are not being accessed by any other process?
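Not find itself, but one possibility (a sketch, assuming the fuser utility from psmisc is available) is to test each candidate with fuser: its -s flag is silent and exits 0 only when some process has the file open. The sample file and the .extension suffix below are made up for illustration:

```shell
#!/bin/sh
# Sketch: gzip only files that no process currently holds open.
mkdir -p workdir && cd workdir
echo "sample data" > done_file.extension    # stands in for a finished file

find . -name '*.extension' -print | while read -r f; do
        # fuser -s exits 0 if any process is using the file; skip those.
        if ! fuser -s "$f" 2>/dev/null; then
                gzip "$f"    # nobody has it open: safe to compress
        fi
done
```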
# 5  
Old 10-16-2008
I never imagined I would face so many problems with a directory archived the wrong way. In any case, I was able to convert the complete directory archive into a directory of archives. Here's the solution for those interested:

Problem:
A directory was gzipped as a whole, so it is almost impossible to extract data from particular files efficiently.

Constraint:
The archive is 13G and expands into 250G but the disk capacity is 50G

Conventional Answer:
13G Directory Archive --> Expands to 250G --> Converted into 13G Directory of Archives

Answer:
Convert the directory archive into a directory of archives.

Solution:

Step 1:

Prepare a shell script and place it in the directory where the archive is to be extracted: checkAndGzip.sh
Code:
#!/bin/bash

for FILE in `find ./ -name "*.extension"`
do
        temp=`lsof $FILE | awk '{if(NR>1) if($4 ~ "w") print $4}'`;
        if [ "$temp" = "" ]; then
                #Implies that the file is not in use
                #Initiate gzip on file
                gzip $FILE;
        fi
done;

Note: Observe the usage of lsof, a handy utility that tells whether a file is in use.
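To illustrate what that lsof/awk pipeline is looking at: field 4 of lsof's output is the FD column, which ends in "w" (e.g. "3w") while some process holds the file open for writing. A standalone sketch (somefile.txt is a made-up name; with no writer present, $in_use stays empty):

```shell
#!/bin/sh
# Field 4 of lsof's output is the FD column; it ends in "w" (e.g. "3w")
# while a process has the file open for writing. NR>1 skips lsof's header.
in_use=`lsof "somefile.txt" 2>/dev/null | awk 'NR>1 && $4 ~ "w" { print $4 }'`

if [ -z "$in_use" ]; then
        echo "no writer -> safe to gzip"
fi
```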

Step 2:

Setup a cron as

Code:
*/2 * * * * /path/to/checkAndGzip.sh > output

Note: Cron runs every two minutes

Step 3:

Run this command in the directory:

Code:
gunzip -c archive.tar.gz | tar -xvf -

Logic: The logic is pretty simple. On one side the extraction takes place; on the other, the cron job runs a shell script that checks whether a new file has been generated and then gzips it. lsof is used to verify whether a file is still being extracted (gzip doesn't seem to care about partial files); if a file is in use, it is skipped during that run.

If anyone has a better solution, or an improvement to the one above, kindly suggest it.
# 6  
Old 10-16-2008
Nice solution ...

and a few suggestions:
- You can consolidate all the commands into one script,
- and you can use the sleep command within the script, instead of setting up a cron job that runs every 2 minutes.


checkAndGzip.sh

Code:
#!/bin/bash

#set -x


# Provide full path, so you can run the script from every dir.
cd /full/path/to/zipped_files

# Start unzipping the files, run it in the background so the files checking can start.
gunzip -c archive.tar.gz | tar -xvf -   &

# Start checking for the files while the unzipping is happening.
# Use a find ... | while read ... construction, because it doesn't
# break if a file name has white space in it.

find . -name '*.extension' 2>/dev/null | while read FILE
do
        temp=`lsof "$FILE" | awk 'NR>1 && $4 ~ "w" { print $4 }'`

        if [ "$temp" = "" ]; then
                # Implies that the file is not in use
                # Initiate gzip on file
                gzip "$FILE"
        fi

        # Wait for 2 minutes.
        sleep 120

done > output

Modify the code to fit any other requirement.
# 7  
Old 10-16-2008
Thanks for the improvement. Actually, on my system, for some reason, the find part doesn't work: the extraction takes place, but the gzipping doesn't seem to happen.

Could it be that the first time find runs, it finds no files (or only a few files still in use), and then simply exits the loop?
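One way around that might be to drive everything from a single script that knows the extractor's PID: keep rescanning while tar is still alive, then do one final sweep for files created after the last scan. A rough sketch (a toy archive stands in for the real one, and the 2-minute interval is shortened for the demo):

```shell
#!/bin/sh
# Toy archive so the sketch is runnable end-to-end.
mkdir -p src/2007
echo A > src/2007/fileA.txt
tar -C src -czf archive.tar.gz 2007
rm -rf src

# Run the extraction in the background and remember tar's PID.
gunzip -c archive.tar.gz | tar -xf - &
tar_pid=$!

# Keep rescanning while tar is still running (was: sleep 120).
while kill -0 "$tar_pid" 2>/dev/null; do
        find . -name '*.txt' -print | while read -r f; do
                lsof "$f" >/dev/null 2>&1 || gzip "$f"   # no open handles: compress
        done
        sleep 2
done
wait

# Final sweep: catch anything extracted after the last scan.
find . -name '*.txt' -print | while read -r f; do gzip "$f"; done
```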