I have been doing some investigation into a log file from one of my systems, and the method I currently use to compress and rotate it. I am looking for something smarter than gzip, faster than bzip2, and that can match or beat "my script" (which is slow as heck, but produces WAY better compression ratios).
If 23 bytes per line are timestamp, that is roughly 33% of the file. If another 45% can be saved by doing a dictionary-map/replace of the 25 most common phrases (I wrote a Python script to do my mapping), then surely there is a compression program out there that can compress my logs without me needing to do all this prior to a gzip... right? The bonus would be that I also wouldn't have to undo my changes on decompress.
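For context, the dictionary-map/replace step might look something like this sketch. The granularity (whole messages as "phrases") and the token format are my assumptions; the post doesn't show the actual mapping script:

```python
from collections import Counter

def build_phrase_map(lines, top_n=25):
    # Count phrase frequency; here a "phrase" is hypothetically a whole
    # message line, but any tokenization would work the same way.
    counts = Counter(lines)
    # Map each of the top_n most common phrases to a short token.
    # \x01...\x02 delimiters are an arbitrary choice, assumed not to
    # occur in the log text itself.
    return {phrase: "\x01%d\x02" % i
            for i, (phrase, _) in enumerate(counts.most_common(top_n))}

def substitute(lines, phrase_map):
    # Replace common phrases with their short tokens.
    return [phrase_map.get(line, line) for line in lines]

def undo(lines, phrase_map):
    # Reverse the substitution after decompression.
    reverse = {tok: phrase for phrase, tok in phrase_map.items()}
    return [reverse.get(line, line) for line in lines]
```

Note the map itself has to be stored (or regenerated deterministically) alongside the compressed file, which is exactly the kind of bookkeeping a smarter compressor would make unnecessary.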
"My Script" does the following:
- Find the first date/timestamped line and convert the timestamp to a number (e.g. 2008-03-06 11:24:36.123 becomes 20080306112436123)
- Each subsequent datestamp is replaced with the difference between it and the previous stamp (resulting in small numbers), converted into a "base 72" number-string
- Looking at everything on the line beyond the stamp, I check whether the message is repeating from the line above; if it is, I replace the entire message with a hyphen (so only the first occurrence is actually kept)
- Finally, I compress using "gzip --best", because it is 1000x faster than bzip2 (although bzip2 gives me a better ratio)
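The steps above (minus the final gzip) can be sketched roughly like this. Two assumptions on my part: the exact 72-character digit alphabet isn't specified in the post, and I assume each stamped line is a 23-character stamp followed by a single space:

```python
import re

# Hypothetical 72-character digit alphabet; the post doesn't say which
# characters its "base 72" actually uses.
DIGITS = ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          "abcdefghijklmnopqrstuvwxyz!#$%&()*+,")
assert len(DIGITS) == 72

def to_base72(n):
    # Convert a non-negative integer to a base-72 digit string.
    if n == 0:
        return DIGITS[0]
    out = []
    while n:
        n, r = divmod(n, 72)
        out.append(DIGITS[r])
    return "".join(reversed(out))

STAMP = re.compile(r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{3})")

def stamp_to_int(line):
    # "2008-03-06 11:24:36.123" -> 20080306112436123, or None if no stamp.
    m = STAMP.match(line)
    return int("".join(m.groups())) if m else None

def transform(lines):
    prev_stamp = prev_msg = None
    for line in lines:
        stamp = stamp_to_int(line)
        if stamp is None:
            yield line            # unstamped lines pass through untouched
            continue
        msg = line[24:]           # text after the 23-char stamp and a space
        if prev_stamp is None:
            head = str(stamp)     # first stamp is kept in full
        else:
            head = to_base72(stamp - prev_stamp)  # small delta as base 72
        prev_stamp = stamp
        # Repeated message collapses to a hyphen.
        yield head + (" -" if msg == prev_msg else " " + msg)
        prev_msg = msg
```

One caveat with the delta step as described: subtracting the concatenated digit strings is not a true time difference (e.g. crossing a minute boundary jumps the number by more than the elapsed milliseconds), but it still yields short, compressible strings, which is all that matters here.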
Any Ideas???