Gzip behavior on open files?


 
# 1  
Old 04-23-2013

Just a quick question: how does gzip behave under Linux if its source is a file that is currently being written to by a different process? In the code below I want to make sure that there is ultimately no loss of data; i.e. the gzip command runs until it "catches up" to the end of the file while the file is expanding, then the cat /dev/null clears the file immediately, so the next write to the file happens when it is empty and all prior data is safely preserved in the gzip archive. How does my code look?

Code:
CAPDIR=/data/capture
KEEPDIR=/data/capture/keep

for FILE in `find $CAPDIR -maxdepth 1 -not -type d | awk -F/ '{print $NF}'`
do
   echo Processing $CAPDIR/$FILE --\> $KEEPDIR/$FILE.GZ
   gzip -c /$CAPDIR/$FILE  > $KEEPDIR/$FILE.GZ
   cat /dev/null > $CAPDIR/$FILE
done
echo
echo Done. 
echo

I know that in some OSes, when a file handle is locked for reading, you get the file contents up to the EOF at the time of the lock, not the EOF at the current time.

I guess another way to put my question would be is there a way to "atomize" these commands:
Code:
   gzip -c /$CAPDIR/$FILE  > $KEEPDIR/$FILE.GZ
   cat /dev/null > $CAPDIR/$FILE

...such that I can be guaranteed that no other process gets a chance to write data to $CAPDIR/$FILE in between the call to gzip and the call to cat /dev/null?

Last edited by dan-e; 04-23-2013 at 10:19 PM..
# 2  
Old 04-24-2013
Do you really need to blank out the file after archiving it? That seems a little dangerous: if something goes wrong with the archive step, you could lose data.

What if you:
1) rename the file first
2) touch the original file name and set permissions
3) archive the renamed file
4) delete the renamed file

That way the file you are archiving is not being appended to.
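The four steps above could be sketched roughly like this. This is a hedged sketch, not a tested solution: the rotate_one helper name, the .tmp suffix, and the 644 permissions are my assumptions, and they may need adjusting for the actual capture process.

```shell
#!/bin/sh
# Hypothetical sketch of the rename-first procedure described above.
# Usage: rotate_one /path/to/capturefile /path/to/keepdir
rotate_one() {
    src=$1
    keep=$2
    name=$(basename "$src")
    mv "$src" "$src.tmp" || return 1          # 1) rename the file first
    : > "$src"                                # 2) recreate the original name...
    chmod 644 "$src"                          #    ...and set its permissions
    gzip -c "$src.tmp" > "$keep/$name.GZ" &&  # 3) archive the renamed file
        rm "$src.tmp"                         # 4) delete it only if gzip succeeded
}
```

Note the caveat raised in the replies below: a writer that keeps its file descriptor open will simply follow the renamed file, so this only helps if the writer reopens the file by name between writes.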
# 3  
Old 04-24-2013
Thanks for your help -
That's a fair enough alternative approach, but wouldn't there be a potential problem if I renamed the capture file while it was being written to by another process? Would the process continue writing to the now-renamed file?
I've been told that I cannot remove the capture file (hence the cat /dev/null), because doing so would break the other process that writes to it. Given that, wouldn't renaming it have the same effect as removing it?

It seems that the best possible solution is to somehow prevent any data being flushed to the file between the completion of archiving and the emptying of the file, but I'm not sure that this is even possible?
# 4  
Old 04-24-2013
I agree there is a potential problem with renaming the logfile, in that some data might be lost or something might go wrong with the process writing to it. I was just suggesting an alternative to blanking out the file, but I would do neither.

What if you just gzip the file and let it keep growing? You get a gzip archive that is guaranteed valid, a log file that is guaranteed undamaged, and no data loss at all. It's blanking out or renaming the log file that introduces the potential for data loss.

The archive might not capture everything currently in the log file, but that doesn't matter: the data are still in the log file, and the archive is complete as of some point in time.

Maybe there is some reason you need to blank out (truncate) the log file?
# 5  
Old 04-24-2013
Quote:
Originally Posted by dan-e
Thanks for your help -
That's a fair enough alternative approach but wouldn't there be a potential problem if I renamed the capture file while it was being written to by another process? Would the process continue writing to the now-renamed file?
If all it's doing is writing to the file, it will continue seamlessly. You can rename a file that is in use and nothing happens, because the inode -- the file's unique ID on the filesystem -- stays the same. You can even delete it while in use and the program carries on; but once the name is gone, nothing except processes that already have the file open can access it.

You could move it out of the folder as long as it remains on the same partition. The process would continue writing uninterrupted because the file always exists somewhere; the unique ID of the file, its inode, would remain unchanged.

I'd use ln and rm instead of mv, to guarantee mv doesn't decide to create a new file for whatever reason, and to guarantee that you're not trying to move it to a different partition. If /path/to/dest is not on the same partition as /path/to/source, ln will fail.

Code:
if ! ln /path/to/source /path/to/dest
then
        echo "Couldn't link" >&2
        exit 1
fi

# They share the same inode -- they are literally the same file
# You can delete one of the names without deleting the file itself now.
ls -i /path/to/source /path/to/dest

# Delete the original location, and the new location still exists
rm /path/to/source

Quote:
I've been told that I cannot remove the capture file (hence the cat /dev/null) because otherwise it will break the other process that writes to the capture file.
Actually, it wouldn't affect the process writing the file -- but it would affect you. The data would still occupy disk space, but the file would no longer be listed in any folder until the process quits.

People often delete logfiles expecting to free up disk space, but because the files were still open for writing, no space was freed -- and since the files were no longer listed in any folder, they couldn't even be truncated anymore. You have to restart the log daemon, or the system itself, to free that space.
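That behaviour is easy to demonstrate: truncating via redirection works even while a writer holds the file open, which is exactly why the cat /dev/null approach frees the space where rm would not. A small illustration -- the timings and temp file are arbitrary, and a Linux/GNU userland is assumed:

```shell
#!/bin/sh
# Demonstration: truncate a file while a background writer holds it open.
f=$(mktemp)
(
  exec 3>>"$f"              # writer holds the file open in append mode
  printf 'before\n' >&3
  sleep 1
  printf 'after\n' >&3      # lands at the new EOF (offset 0 after truncation)
) &
sleep 0.5
: > "$f"                    # truncate while the writer still has it open
wait
remaining=$(wc -c < "$f")   # only the post-truncation write remains
echo "$remaining"
```

The writer never notices the truncation; its next append simply lands at the new end of the file, which is what makes in-place blanking viable at all.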

Quote:
It seems that the best possible solution is to somehow prevent any data being flushed to the file between the completion of archiving and the emptying of the file, but I'm not sure that this is even possible?
Indeed, that would be ideal. Cooperative locking is possible, but note the word 'cooperative': the writing process has to play along. If it never asks to lock the file, nothing will stop it from writing.

Check the writing program's options; you may be able to tell it to use locking.
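For completeness, on Linux the archiving side of such cooperative locking could look like this sketch using flock(1). The lock-file path is an assumption, and it only helps if the writing process takes the same lock around its writes:

```shell
#!/bin/sh
# Hypothetical sketch: archive and truncate under an exclusive flock(1) lock.
# Usage: archive_locked /path/to/capture /path/to/out.GZ /path/to/lockfile
archive_locked() {
    cap=$1
    out=$2
    lock=$3
    (
        flock -x 9                  # block until we hold the exclusive lock
        gzip -c "$cap" > "$out" &&  # archive...
            : > "$cap"              # ...then truncate, before releasing the lock
    ) 9> "$lock"
}
```

A cooperating writer would wrap its own appends in `flock -x` on the same lock file, guaranteeing no write can land between the gzip and the truncation.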

Last edited by Corona688; 04-24-2013 at 01:05 PM..
# 6  
Old 04-24-2013
Quote:
Originally Posted by dan-e
I guess another way to put my question would be is there a way to "atomize" these commands:
Code:
   gzip -c /$CAPDIR/$FILE  > $KEEPDIR/$FILE.GZ
   cat /dev/null > $CAPDIR/$FILE

...such that I can be guaranteed that no other process gets a chance to write data to $CAPDIR/$FILE in between the call to gzip and the call to cat /dev/null?
The way this is typically done during logrotation is to rename the logfile, then send a signal to the logging process to inform it that it needs to close its file descriptor and create a new logfile, and finally, compress.

Another safe alternative, though more brutish, is to shut down the logging process during rotation.

If your system has logrotate or similar tool, and if the logging process uses a signal for rotation, then you don't even need to write the shell script. Just add a section to the config file to handle your files.
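If logrotate is available, the whole procedure above fits in one config section. A hedged sketch follows; the file pattern, pid-file path, and HUP signal are assumptions here, since the capture process may use a different reopen mechanism:

```
/data/capture/* {
    rotate 7
    compress
    olddir /data/capture/keep
    missingok
    notifempty
    postrotate
        kill -HUP "$(cat /var/run/capture.pid)"
    endscript
}
```

Note that olddir must be on the same physical device as the rotated files, which matches this setup since keep/ lives under the capture directory.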

On an unrelated note, there's no need to use cat to truncate a file. The shell can do it with a simpler, less expensive redirection:
Code:
> "$CAPDIR/$FILE"

Regards,
Alister

Last edited by alister; 04-24-2013 at 01:34 PM..
# 7  
Old 04-24-2013
gzip will report an error if another process writes to the file while it is being compressed.
Chain the next command with && so it runs only if gzip succeeded.
Code:
find "$CAPDIR" -maxdepth 1 -type f |
awk -F/ '{print $NF}' |
while read FILE
do
   echo "Processing $CAPDIR/$FILE --> $KEEPDIR/$FILE.GZ"
   gzip -c "$CAPDIR/$FILE"  > "$KEEPDIR/$FILE.GZ" &&
   > "$CAPDIR/$FILE"
done

A while loop reading from a pipe is more appropriate than a for loop over backtick output, which would split filenames containing whitespace.
Variables in command arguments should be quoted.