
Interestingly, since "recovery" is mentioned several times, I decided to test this myself.

I took a copy of a jpeg image, compressed it separately with gzip and with bzip2, then modified one byte with a hex editor.

The recovery instructions for gzip are to simply run "zcat corrupt_file.gz > corrupt_file", while for bzip2 they are to use the bzip2recover command, which just dumps the blocks out individually (corrupt ones and all).

Uncompressing the corrupt gzip jpeg via zcat always produced an image file the same size as the original, and it could be opened with any image viewer, although the colors were clearly off.

I never could recover the image compressed with bzip2. Trying to extract all the recovered blocks made by bzip2recover via bzcat would just choke on the single corrupted block. And the smallest you can make a bzip2 block is 100K (vs 32K for gzip?). Obviously pulling 100K out of a jpeg will not work.

Though I'm still confused as to how the corrupted gzip file extracted to a file the same size as the original. I guess gzip writes out the corrupted data as well instead of choking on it? Either way, gzip is the winner here: having a file with one corrupted byte is much better than having a file with 100K of data missing...
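For anyone who wants to reproduce the zcat-style behaviour without a hex editor, here is a rough sketch in Python (my own code, not from the original test): feed a gzip stream with one flipped byte through `zlib.decompressobj` in chunks, and keep whatever decoded before any error, instead of discarding everything.

```python
import gzip
import zlib

original = bytes(range(256)) * 64          # stand-in for the jpeg payload
compressed = gzip.compress(original)

corrupted = bytearray(compressed)
corrupted[len(corrupted) // 2] ^= 0xFF     # flip one byte mid-stream

# zcat-style recovery: decompress incrementally and keep the partial
# output from the chunks that decoded cleanly before the error.
d = zlib.decompressobj(wbits=31)           # 31 = expect a gzip wrapper
recovered = b""
try:
    for i in range(0, len(corrupted), 64):
        recovered += d.decompress(bytes(corrupted[i:i + 64]))
except zlib.error:
    pass                                   # stream broken past this point

print(f"original {len(original)} bytes, recovered {len(recovered)} bytes")
```

How much comes out depends entirely on where the flipped byte lands, which is exactly the point the replies below make.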



Your method is clearly flawed. Altering a single byte once is an insufficient test unless you first analyzed the structure of the compressed file to see where the really important information is stored. It may well be that you modified a verbatim string from the source data in the gzip case, but corrupted a bit of metadata about how the compressed data is structured in the bzip2 case. If you tried different random bytes, the results might be reversed.

The proper test would be to iterate over every bit in the compressed file, flip it, and try to recover. Then compare the number of successful recoveries against the number of bits tested. Compression algorithms that perform similarly should have similar likelihoods that a single bit flip corrupts the entirety of the data.
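That exhaustive test is easy to sketch with Python's stdlib (my own code; note that "recovery" here just means the library decompresses without raising, so the formats' built-in checksums make this a strict criterion):

```python
import bz2
import zlib

def count_surviving_flips(blob, decompress):
    """Flip every bit of `blob` once and count how many corrupted
    copies still decompress without raising an error."""
    ok = 0
    for i in range(len(blob)):
        for bit in range(8):
            corrupted = bytearray(blob)
            corrupted[i] ^= 1 << bit
            try:
                decompress(bytes(corrupted))
                ok += 1
            except Exception:
                pass
    return ok, len(blob) * 8

payload = bytes(range(256)) * 16          # stand-in for the jpeg
for name, blob, fn in [
    ("zlib (deflate)", zlib.compress(payload, 9), zlib.decompress),
    ("bzip2", bz2.compress(payload, 9), bz2.decompress),
]:
    ok, total = count_surviving_flips(blob, fn)
    print(f"{name}: {ok}/{total} single-bit flips still decompress")
```

The ratio is what matters for comparing the two formats; the absolute counts depend on the payload and on how strictly the checksum is enforced.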


I thought about that as well. I tried it three different times, all with the same results.


Three? Well then, case closed!


Did the poster imply that their test was the be-all and end-all of error tolerance in common-use compression systems? No. Then why did you assume that they did say that, and then write such a useless comment?


Whether recovery leads to (almost) useable data depends on what byte you modify. It's entirely possible that a single corrupt byte in the compressed data leads to a single corrupt byte when uncompressed. When you are dealing with images you may not even notice that a single pixel is wrong. But it's also possible that you completely destroy the data such that the decompression algorithm can't even deal with it and has to give up.
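The best case is easy to demonstrate (my own sketch, not from the thread): deflate at compression level 0 stores the input verbatim in "stored" blocks, so a bit flip landing in the payload corrupts exactly one output byte.

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 10

# Level 0 deflate emits "stored" (verbatim) blocks; wbits=-15 means a
# raw stream with no zlib/gzip wrapper and no checksum to trip over.
c = zlib.compressobj(0, zlib.DEFLATED, -15)
stream = bytearray(c.compress(data) + c.flush())

stream[20] ^= 0x01   # flip one bit well inside the verbatim payload

out = zlib.decompress(bytes(stream), wbits=-15)
wrong = sum(a != b for a, b in zip(out, data))
print(f"{len(out)} bytes out, {wrong} byte(s) differ from the original")
```

Flip a bit in the five-byte block header instead, and the same stream can fail to decode at all, which is the other outcome described above.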


A decade and a half ago, I wrote an Oracle archived log that I had compressed with bzip2 to a DLT40 tape.

I recovered and uncompressed the log (without error), then tried to apply it to a database recovery, which rejected it as corrupt.

After several attempts to read the tape (amounting to dozens of hours), I finally put it in the original drive that wrote it and pulled the file to the remote recovery system - this worked.

I immediately began including PAR2 files on the tapes, so the restored contents could be verified and corrected.

I have my doubts that bzip2 is as sensitive to corruption as the author asserts, but perhaps there have been improvements to the code since my misfortune.



