Which of the 3 standard compression algorithms on Unix (gz, xz, or bz2) is best for long term data archival at their highest compression? (lemmy.ml)

submitted 5 days ago* (last edited 5 days ago) by HiddenLayer555@lemmy.ml to c/linux@lemmy.ml


I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long term offline archival. I want to make the most of the drive's capacity, so I want to compress them at the highest ratio supported by standard tools. I've zeroed out the free space in my disk images so I can save the entire image while it only takes up about as much space as the actual files on it. Raw images, in my experience, can have their size reduced by a third or even half with max compression (and I would assume it's lossless, since file level compression can regenerate the original file in its entirety?).
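Both assumptions in the post (zeroed free space costs almost nothing once compressed, and the compression is lossless) are easy to check on a small scale. The following is a minimal sketch using Python's standard `gzip`, `bz2`, and `lzma` modules, which read and write the same formats as the command-line tools. The function name `compressed_sizes` and the 1 MiB buffer of zeros are illustrative assumptions, not something from the thread.

```python
import bz2
import gzip
import lzma


def compressed_sizes(data: bytes) -> dict[str, int]:
    """Compress data with each codec at its highest preset and verify the round trip."""
    codecs = {
        "gz": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "bz2": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
        "xz": (lambda d: lzma.compress(d, preset=9), lzma.decompress),
    }
    sizes = {}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Lossless means the round trip reproduces the input byte for byte.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip did not reproduce the input")
        sizes[name] = len(packed)
    return sizes


if __name__ == "__main__":
    # Hypothetical stand-in for 1 MiB of zeroed free space in a disk image.
    zeroed_free_space = bytes(1024 * 1024)
    for name, size in compressed_sizes(zeroed_free_space).items():
        print(f"{name}: {len(zeroed_free_space)} -> {size} bytes")
```

Real disk images and raw photos will compress far less than a buffer of zeros, so the numbers only illustrate why zeroing free space before imaging pays off.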

I've heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don't know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
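The single-bit scenario can be tried out directly. The sketch below compresses some sample bytes, flips one bit in the middle of the compressed stream, and reports whether decompression notices. The helper names (`flip_bit`, `outcome_after_flip`, `sample_data`) and the hash-derived sample data are assumptions made for illustration. It only shows detection; none of the three formats carries recovery data that could repair the damage.

```python
import bz2
import gzip
import hashlib
import lzma


def flip_bit(blob: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of blob with a single bit inverted."""
    damaged = bytearray(blob)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def outcome_after_flip(compress, decompress, data: bytes) -> str:
    """Flip one bit in the middle of the compressed stream and try to decompress it."""
    packed = compress(data)
    damaged = flip_bit(packed, len(packed) // 2)
    try:
        restored = decompress(damaged)
    except Exception:
        # The decoder or the format's built-in integrity check rejected the stream.
        return "detected"
    return "harmless" if restored == data else "silent corruption"


def sample_data() -> bytes:
    """Deterministic 8 KiB of hash output, used as illustrative payload."""
    return b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(256))


if __name__ == "__main__":
    data = sample_data()
    codecs = {
        "gz": (lambda d: gzip.compress(d, compresslevel=9), gzip.decompress),
        "bz2": (lambda d: bz2.compress(d, compresslevel=9), bz2.decompress),
        "xz": (lambda d: lzma.compress(d, preset=9), lzma.decompress),
    }
    for name, (compress, decompress) in codecs.items():
        print(name, outcome_after_flip(compress, decompress, data))
```

A flip inside the payload is what this exercises. A flip in a header field that the checksum does not cover could behave differently, so treat the result as an illustration rather than a guarantee.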

I also want the files to be extractable with just the standard tools that ship on a typical Linux/Unix system (gzip, bzip2, and xz), since this is my disaster recovery plan and I want to be able to work with it through a Linux live image without installing any extra packages when my server dies. Hence I'm only looking at gz, xz, or bz2.

So out of the three, which is generally considered the most stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip, or at the very least detect with certainty whether the data is corrupted when extracting? Additionally, should I be generating separate checksum files for the original data, or do the compressed formats include checksumming themselves?
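On the checksum question: gzip stores a CRC-32 of the uncompressed data, bzip2 stores a CRC per block, and xz supports CRC32, CRC64, or SHA-256 checks, so all three can detect damage but none can repair it. A separate checksum file for the original data is still a cheap extra layer. Here is a minimal sketch of one, assuming a directory of originals and a manifest written in the `<hash>  <path>` layout that `sha256sum` prints; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> None:
    """Write one '<hash>  <relative path>' line per file found under root."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.resolve() == manifest.resolve():
            continue  # do not hash the manifest itself
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}\n")
    manifest.write_text("".join(lines), encoding="utf-8")


def verify_manifest(root: Path, manifest: Path) -> list[str]:
    """Return the relative paths that are missing or no longer match their hash."""
    mismatches = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        expected, _, relative = line.partition("  ")
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(relative)
    return mismatches
```

Calling `write_manifest` before compressing and `verify_manifest` after a test extraction checks the whole chain end to end, independent of whichever compressed format ends up being used.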


[-] DasFaultier@sh.itjust.works 5 points 4 days ago

"and I figure, given the username, that you can speak German haha" Yep, that's right. :D

I'll stick with English anyway so it gets understood in the English thread.

ENGLISH: Yeah, you're right, I wasn't particularly on-topic there. :D I tried to address your underlying assumptions as well as the actual file format question, and it kinda derailed from there.

Sooo, file format... I think you're restricting yourself too much if you only use the formats that come preinstalled on a live image. Also, you have conflicting goals there: it's compression (make the most of your storage) vs. resilience (have a format that is stable in the long term). Someone here recommended lzip, which is definitely a good answer if you want a good compression ratio. The Wikipedia article I linked features a table that compares compressed archive formats, so that might be a good starting point to find resilient formats. Look out for formats with at least Integrity Check and possibly Recovery Record, as these seem to be more important than compression ratio. When you have settled on a format, run some tests to find the best compression algorithm for your material. You might also want to measure throughput/time while you're at it to find variants that offer a reasonable compromise between compression and performance. If you're so inclined, try to read a few format specs to find suitable candidates.
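The suggested test run (compression ratio plus throughput on your own material) could look roughly like the following sketch. It only covers the three formats from the original question, uses Python's standard library rather than the command-line tools, and the `benchmark` function and synthetic sample are assumptions for illustration. Real measurements should use a representative slice of the actual backups.

```python
import bz2
import gzip
import lzma
import time

# Highest standard level for each codec, matching "turned all the way up".
CODECS = {
    "gz -9": lambda d: gzip.compress(d, compresslevel=9),
    "bz2 -9": lambda d: bz2.compress(d, compresslevel=9),
    "xz -9": lambda d: lzma.compress(d, preset=9),
}


def benchmark(sample: bytes) -> list[tuple[str, float, float]]:
    """Return (codec, compressed/original ratio, MiB/s) rows, smallest output first."""
    if not sample:
        raise ValueError("sample must not be empty")
    rows = []
    for name, compress in CODECS.items():
        start = time.perf_counter()
        packed = compress(sample)
        elapsed = time.perf_counter() - start
        ratio = len(packed) / len(sample)
        mib = len(sample) / (1024 * 1024)
        throughput = mib / elapsed if elapsed > 0 else float("inf")
        rows.append((name, ratio, throughput))
    return sorted(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Synthetic placeholder; replace with bytes read from one of your own files.
    sample = b"2025-09-24 backup log line with some repetition\n" * 50_000
    for name, ratio, throughput in benchmark(sample):
        print(f"{name:7s} ratio={ratio:.4f} speed={throughput:.1f} MiB/s")
```

Timing a single pass like this is noisy, so repeating the run a few times and on several file types gives a fairer picture of the compression versus performance trade-off.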

You're generally looking for formats that:

are in widespread use
are specified/standardized publicly
are of a low complexity
don't have features like DRM/Encryption/anti-copy
are self-documenting
are robust
don't have external dependencies (e.g. for other file formats)
are free of any restrictive licensing/patents
can be validated.
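For the last criterion, "can be validated" in practice often means a full test decompression, which is what `gzip -t`, `bzip2 -t`, and `xz -t` do on the command line. A minimal sketch of the same idea in Python follows, assuming files are identified by their `.gz`, `.bz2`, or `.xz` suffix; the function name `passes_integrity_check` is hypothetical.

```python
import bz2
import gzip
import lzma
import zlib
from pathlib import Path

# Map file suffixes to the standard-library opener for that format.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def passes_integrity_check(path: Path, chunk_size: int = 1 << 20) -> bool:
    """Decompress the whole file to nowhere and report whether it decoded cleanly."""
    opener = OPENERS.get(path.suffix)
    if opener is None:
        raise ValueError(f"unsupported suffix: {path.suffix!r}")
    try:
        with opener(path, "rb") as stream:
            # Reading to the end forces the decoder to run its built-in checks.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, lzma.LZMAError, zlib.error):
        # Covers truncated streams, failed checksums, and unreadable files alike.
        return False
    return True
```

Running a check like this on a schedule, rather than only when disaster strikes, is one way to find out about a failing archive while another copy still exists.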

You might want to read up on more technical information about how an actual archive handles these challenges at https://slubarchiv.slub-dresden.de/technische-standards-fuer-die-ablieferung-von-digitalen-dokumenten and the PDF files with specifications linked there (all in German).

[-] Ferk@lemmy.ml 3 points 4 days ago* (last edited 4 days ago)

Just note that @RiverRabbits@lemmy.blahaj.zone wasn't the one who opened the thread, which is why they said they didn't ask the question (I get the feeling there might have been some confusion here :P ).

Still, very informative comment.

[-] RiverRabbits@lemmy.blahaj.zone 3 points 4 days ago

Haha, yeah I'm not the OP! But the way my German was phrased here, and how the replier interpreted it, it would read as super passive aggressive (think "I didn't ask that question but thanks"), and for that I apologize 😭 I just meant I'm not the OP😌

[-] DasFaultier@sh.itjust.works 1 points 4 days ago

Oh yeah, there really was, thank you. :)

this post was submitted on 24 Sep 2025

77 points (97.5% liked)
