Geekzone: technology news, blogs, forums
timmmay
12456 posts · Uber Geek · +1 received by user: 1995 · Trusted · Subscriber

Topic # 214568 · 18-May-2017 07:36

Task

 

I'm going to be archiving around 50GB of data to AWS Glacier. The data I'm archiving is images, video, and a few documents, so compression isn't really an issue.

 

Glacier

 

Glacier is cheaper for fewer, larger files rather than many small files, so I'd like to put these many small files into a container. I'd like to use a container format that is tolerant of errors, so that a single-byte error doesn't make the whole archive inaccessible. Corruption is probably unlikely with Glacier, because uploads are checksummed, three copies of each file are stored, and integrity is checked and corrected, but an error could still creep in somewhere.

 

I'll be making a dozen files in the 1-10GB range, rather than one huge file. That's mostly so recovery is easier.

 

Question

 

What's the best durable container format for this that's free or open source? RAR is probably the best, as you can add recovery information to the file, but it's not free or open.

 

Background

 

According to Wikipedia, few other formats have any kind of error correction, probably because it increases archive size. Zip and 7z can detect errors, but can't correct them.

 

Default

 

If none are great, I'd probably go with zip, since it's been around the longest and has the best recovery tools. I generally use 7z for most things because of its better compression ratio.





AWS Certified Solutions Architect Professional, SysOps Administrator Associate, and Developer Associate
TOGAF certified enterprise architect
Professional photographer


frankv
1637 posts · Uber Geek · +1 received by user: 762

Reply # 1783993 · 18-May-2017 08:34

I can't comment on the best format, but here's a bit of information theory about error correction.

 

Yes... error correction capability does add to the size of an archive; IIRC, you add about log2(filesize_in_bits) bits to be able to correct any single-bit error.
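As a rough worked example of that rule of thumb (my own illustration, not part of frankv's post): for a 10GB archive, the single-bit-correction overhead works out to only a few dozen bits.

    import math

    # Overhead for locating (and flipping back) a single-bit error,
    # using the log2(filesize_in_bits) rule of thumb mentioned above.
    archive_bytes = 10 * 1024**3                     # a 10GB archive
    archive_bits = archive_bytes * 8
    check_bits = math.ceil(math.log2(archive_bits))  # ~37 bits

    print(f"{archive_bits} data bits need roughly {check_bits} check bits")

As the rest of the post explains, though, that tiny overhead only protects against a single flipped bit, which isn't the failure mode you actually see.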

 

But beware...

 

1. There's a limit to what errors can be corrected (otherwise you wouldn't need to store the data, just the error correction information). For example, the Golay code encodes 12 bits of data in a 24-bit word in such a way that any error of up to 3 bits can be corrected, or any error of up to 7 bits can be detected. I.e. the file size is doubled and only errors of up to 3 bits can be corrected. Typically, errors happen in large blocks rather than singly.

 

2. If your data is stored on an HDD or DVD, then each sector is CRCed, and the whole sector (maybe even the whole file) is thrown away if the CRC fails. If it's stored in the cloud, you won't have any control over attempting to recover bad sectors. Even if it's stored somewhere where you can access the physical device, you may not be able to recover a bad sector. So you'll need to correct the missing 4096/32768/whatever bits.

 

Because of these things, forward error correction (i.e. including enough error correction information with the data to be able to correct errors) is not widely used. It's far cheaper and easier to save 2 or more copies of the data separately. The degree of physical separation depends on the scale of disaster you want to recover from... e.g. if your concern is hardware failure of a disk drive, then store on two separate disk drives, whereas if you want to be immune from an earthquake destroying your data, then store it in two separate cities a long way apart. When you have 2 separate physical copies, you need to periodically check that both are still recoverable (e.g. the lifetime of a writable CD/DVD is about 5 years?).

 

 


Dynamic
2161 posts · Uber Geek · +1 received by user: 594 · Trusted · Subscriber

Reply # 1783995 · 18-May-2017 08:41

An option you have is to upload the data, then download it again (in an ideal world using a completely different computer, etc.) and test that downloaded copy. If it checks out OK, then you can be pretty confident.
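A minimal sketch of that round-trip check, assuming you still have the original archive on hand to compare against (the file names are placeholders, not anything from this thread):

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
        """Hash a file in chunks so large archives don't need to fit in RAM."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    original = Path("archive-part01.zip")                 # placeholder names
    redownloaded = Path("archive-part01.downloaded.zip")

    if sha256_of(original) == sha256_of(redownloaded):
        print("Round trip OK: the downloaded copy matches byte for byte.")
    else:
        print("Mismatch: don't trust the uploaded archive.")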





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams

 

 





timmmay

Reply # 1783997 · 18-May-2017 08:46

Thanks Frank. I'm not particularly concerned about storage errors; Glacier is designed for 99.999999999% durability. Errors during transmission are possible, though: the upload is checksummed, but the download isn't as such, other than by TCP.

 

The data will be on one online local disk, one offsite local disk, and in the cloud, so I have redundancy. The cloud archive is the "last line". I'm storing photos at medium quality, because I have so many of them. 50GB of "last line" backups will cost me around $0.20/month.
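That figure is consistent with a Glacier storage price of roughly $0.004 per GB-month (the exact rate varies by region, so treat the rate used here as an assumption):

    # Rough Glacier storage cost, assuming ~$0.004 per GB-month (region dependent).
    size_gb = 50
    price_per_gb_month = 0.004
    print(f"~${size_gb * price_per_gb_month:.2f}/month")   # ~$0.20/month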

 

The main thing that would be useful is a file format that doesn't become inaccessible through small errors. So if I have a 10GB archive with many images in it and there's a single-byte error, I don't mind losing one image, but losing the whole archive would be annoying.









timmmay

Reply # 1783998 · 18-May-2017 08:47

Dynamic:

 

An option you have is to upload the data, then download it again (in an ideal world using a completely different computer, etc.) and test that downloaded copy. If it checks out OK, then you can be pretty confident.

 

 

Upload to Glacier is pretty safe, since it's checksummed by the uploading software. Then multiple copies are kept in multiple locations, integrity is checked regularly, and any failed blocks are replaced.
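For reference, the checksum Glacier uses on upload is a SHA-256 "tree hash" over 1 MiB chunks. Here's a minimal sketch of how one can be computed to compare against the value AWS reports (my own illustration based on the AWS documentation, not timmmay's tooling; the file name is a placeholder):

    import hashlib

    MIB = 1024 * 1024

    def glacier_tree_hash(path: str) -> str:
        """SHA-256 tree hash: hash each 1 MiB chunk, then pair hashes up the tree."""
        level = []
        with open(path, "rb") as f:
            while chunk := f.read(MIB):
                level.append(hashlib.sha256(chunk).digest())
        if not level:                                  # empty file edge case
            level = [hashlib.sha256(b"").digest()]
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [hashlib.sha256(b"".join(p)).digest() if len(p) == 2 else p[0]
                     for p in pairs]
        return level[0].hex()

    # print(glacier_tree_hash("archive-part01.zip"))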







frankv

Reply # 1784030 · 18-May-2017 09:56

timmmay:

 

The main thing that would be useful is a file format that doesn't become inaccessible through small errors. So if I have a 10GB archive with many images in it and there's a single-byte error, I don't mind losing one image, but losing the whole archive would be annoying.

 

 

Right. But I think you're overthinking this.

 

What I am saying is that the chances of having a single-byte error and not having the 4K (or more, depending on your HDD sector size) bytes around it in error as well are vanishingly small. I don't know what WinRAR means when it talks about a "Recovery Record", but if it is going to correct an arbitrarily sized error anywhere in the archive (as they imply), it would need to be orders of magnitude larger than the archive. It's apparently actually 3-10% of the archive size. At that size, it would at best only detect and fix single-bit errors, so I don't believe that's what it does; there would be no benefit from that kind of error correction. So I'm guessing the "Recovery Record" is some kind of index that allows you to recover *most* of your archive in the event of losing a chunk of it.

 

Given that your images are already compressed (I assume), there's no particular point in combining them into an archive, except to save on Amazon's charges. Most archiving programs (e.g. Zip) will not compress files that are already compressed; they will just store them. In that case, there's no reason why an error in one archived file should affect another.

 

Zip does have a single point of failure in that the only copy of the archive directory is stored at the end of the archive rather than the beginning. A bad sector in the directory, or truncation of the file, or inability to find the directory due to some other sector's error, could make the entire archive unreadable.
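A quick way to see that central-directory weakness for yourself (a throwaway sketch, not something from the thread): chop the tail off a zip and Python's zipfile can no longer open it at all, even though the member data is still present.

    import zipfile

    # Build a small archive with two members.
    with zipfile.ZipFile("demo.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"x" * 10_000)
        zf.writestr("b.txt", b"y" * 10_000)

    # Drop the last 100 bytes, which hold the end of the central directory.
    data = open("demo.zip", "rb").read()
    open("truncated.zip", "wb").write(data[:-100])

    try:
        zipfile.ZipFile("truncated.zip")
    except zipfile.BadZipFile as e:
        # The reader can't find the end-of-central-directory record,
        # so the whole archive is rejected.
        print("Unreadable:", e)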

 

But you're relying on the Internet to transport the file without error, and on Amazon to store and retrieve it without error (including keeping multiple copies, periodic validity checking, and so on). If you can't assume those things, then there's no point in doing this. How would you get an error in an archive in the first place? If there is a single-byte error on Amazon's storage device, that will cause a CRC error when the sector is retrieved. At that point, I expect that if Amazon doesn't have another copy of the file, it will just report that your file is corrupt and not deliver *any* of it to you (i.e. you'll lose the whole 10GB, regardless of whether it includes error correction or not).

 

 




timmmay

Reply # 1784049 · 18-May-2017 10:20

frankv: Right. But I think you're overthinking this. ......

I don't think I'm overthinking this. I'm just interested in whether anyone knows more about the durability of compression / container formats than I do.

 

Combining images into a container is both to reduce AWS charges and for convenience. Since this is an archive, I'm unlikely to want to download one image; I'm more likely to want to download them all.

 

It could be that some compression programs check file integrity and refuse to extract any files if any part of the archive is corrupt. That's probably unlikely, but you never know.

 

I guess I could try it. I could compress some files in a few formats, change a few random bytes, and see what happens.
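For the zip case, something like this is enough (a rough sketch of that test using only the standard library; the 7z and tar cases need their own tools):

    import random
    import zipfile

    # Build a test archive from a few dummy "image" files, store-only.
    with zipfile.ZipFile("test.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(5):
            zf.writestr(f"image_{i}.bin",
                        bytes(random.randrange(256) for _ in range(200_000)))

    # Flip one byte near the middle of the archive.
    data = bytearray(open("test.zip", "rb").read())
    data[len(data) // 2] ^= 0xFF
    open("corrupt.zip", "wb").write(bytes(data))

    # Try to salvage each member individually.
    with zipfile.ZipFile("corrupt.zip") as zf:
        for name in zf.namelist():
            try:
                zf.read(name)                     # read() verifies the CRC-32
                print(f"{name}: OK")
            except zipfile.BadZipFile as e:
                print(f"{name}: damaged ({e})")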







270 posts · Ultimate Geek · +1 received by user: 57 · Subscriber

Reply # 1784057 · 18-May-2017 10:38

timmmay:

 

 

 

I guess I could try it. I could compress some files in a few formats, change a few random bytes, and see what happens.

 

 

Would certainly be interested in your conclusions...





                                                                                                             

Dynamic

Reply # 1784060 · 18-May-2017 10:49

I'm pretty sure Zip or 7z will extract an archive and skip any faulty files in that archive (unless the header of the archive is faulty, of course). It's been a very long time since I've dealt with a faulty archive file.





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams



timmmay

Reply # 1784079 · 18-May-2017 11:11

I put a bunch of text into a few files, compressed it using zip/7z/tar on the minimum compression setting, changed a random byte near the middle of each archive, then extracted. This isn't a fair test, as the damage could've been different for each file.

 

7z: noted two damaged files, one was extracted with problems, one wasn't extracted

 

zip: noted one damaged file, extracted them all, one had problems

 

tar: did not note any damage (it doesn't have checksums), one file damaged

 

 

 

I redid this using store only, no compression. 7z and zip did the same thing, reported a problem but extracted the files.

 

Based on this exceptionally quick and poor-quality test, it seems like zip is a more robust file format than 7z.

 

I also note this answer on Ask Ubuntu.

 

 

 

PAR

 

I also found the Parchive (PAR) format, along with the MultiPar software. It's designed to add configurable amounts of redundancy to archives to help with recovery if the files are damaged.

 

I'm unlikely to use the PAR format, just because it seems niche. In the unlikely event of corruption, I think the zip file format is sufficiently robust, especially if it's used in 'store' mode rather than compress mode. Given that I'm packaging images and videos that are already compressed, using store is no problem.







270 posts · Ultimate Geek · +1 received by user: 57 · Subscriber

Reply # 1784086 · 18-May-2017 11:25

timmmay: I'm unlikely to use the PAR format, just because it seems niche. ......

Supposedly PAR just complements your existing archive; i.e. you can create a PAR file for a ZIP file or something like that...

 

 





                                                                                                             

Dynamic

Reply # 1784102 · 18-May-2017 11:59

timmmay: I put a bunch of text into a few files, compressed it using zip/7z/tar on the minimum compression setting, changed a random byte near the middle of each archive, then extracted. This isn't a fair test, as the damage could've been different for each file. ......

 

Nice test. Thanks for sharing the results. If it's potentially important to wring an extra 5-20% compression out of 7z, perhaps do another couple of tests with that format and some basic image files (like a 100x100 pixel PNG checkerboard, for example). You may have happened to scramble the header of the file in the first 7z test, and changing a byte a little further along might have given a different result, like a slightly corrupted PNG that was still viewable.

 

I'm pleasantly surprised that ZIP has come out on top for you, given the age of the format.





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams

647 posts · Ultimate Geek · +1 received by user: 131

Reply # 1784105 · 18-May-2017 12:05

I'd suggest PAR - in particular the par2 format, most likely using the QuickPar program - IIRC you use Windows.

 

The combination of par2/QuickPar is far from niche; I would suggest it is one of the most commonly used recovery formats out there.

 

As an example, suppose that you have 50GB of files that you wish to store on AWS in ten 5GB containers. You could generate a set of six PAR files, each 1GB in size. If you try to get your files back and retrieve all ten files but some of them are corrupted, then you would retrieve one of your PAR files, and that would most likely be enough to repair the damage. If you could only get nine of your main files back, then you would need five of your PAR files.
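For reference, here's roughly how that looks driven from Python with the par2cmdline tool (a sketch under the assumption that par2 is installed and that its create/verify/repair subcommands and -r redundancy option behave as described in its help; the file names are placeholders):

    import subprocess

    archive = "photos-part01.zip"    # placeholder archive name

    # Create recovery volumes worth ~10% of the archive.
    subprocess.run(["par2", "create", "-r10", f"{archive}.par2", archive], check=True)

    # Later, after pulling everything back from Glacier:
    subprocess.run(["par2", "verify", f"{archive}.par2"], check=True)   # raises if damage is found
    # subprocess.run(["par2", "repair", f"{archive}.par2"], check=True) # rebuild from the .par2 volumes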


'That VDSL Cat'
5974 posts · Uber Geek · +1 received by user: 1093 · Trusted · Spark · Subscriber

Reply # 1784170 · 18-May-2017 12:38

PAR files, now that goes back to the BBS days!

 

 

 

Given the situation, I don't think they would be that bad an idea.

 

Given that you are looking at a separate file for PAR, etc., though.





#include <std_disclaimer>

 

Any comments made are personal opinion and do not reflect directly on the position my current or past employers may have.




timmmay

Reply # 1784200 · 18-May-2017 13:09

Ah, if PAR goes alongside a zip/7z file I could do that. QuickPar is pretty old now; MultiPar is the more modern software, apparently.

 

@Dynamic - I won't be using compression; there's very little to gain compressing RAW/JPEG files, so I'll just use store mode so they're all in a single file. That's better for archiving to AWS Glacier.

 

I was hoping to find a format with storage and error correction built in, but this is almost as good :)
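For what it's worth, a store-only zip like that is only a few lines with Python's zipfile (my own sketch of the approach described above; the folder and archive names are placeholders):

    import zipfile
    from pathlib import Path

    source = Path("Photos/2017")            # placeholder source folder
    archive = "photos-2017-part01.zip"      # placeholder archive name

    # ZIP_STORED = no compression, just a container; fine for RAW/JPEG/video.
    # allowZip64 lets the archive grow past 4GB.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(source))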







IcI
296 posts · Ultimate Geek · +1 received by user: 61 · Trusted

Reply # 1784338 · 18-May-2017 16:21

timmmay:

 

frankv: ... Given that your images are already compressed (I assume), there's no particular point to combining them into an archive, except to save on Amazon's charges. Most archiving programs (e.g. Zip) will not compress files that are already compressed, they will just store them. In that case, there's no reason why an error in one archived file should affect another. ...

 

... Combining images into a container is both to reduce AWS charges, and also for convenience. Since this is an archive I'm unlikely to want to download one image, I'm more likely to want to download them all. ...

 

Have you considered the ISO format? It does what you want. Create virtual CD-ROMs of any required size, grouping the files that belong together.

 

  • Groups many files together into a single archive
  • No compression because files are already compressed
  • Well-known format, mountable by all modern OSes, and on older ones with utilities.
  • MD5 / SHA checksumming to confirm non-corruption.
  • You will always be able to "extract" the other files that are not corrupted.
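If you went this route, one way to build such an ISO is to shell out to genisoimage (a sketch assuming genisoimage, or the equivalent mkisofs, is installed; the names are placeholders):

    import subprocess

    source_dir = "Photos/2017"        # placeholder folder to package
    iso_path = "photos-2017.iso"      # placeholder output name

    # -r: Rock Ridge, -J: Joliet (long file names on Unix and Windows), -V: volume label.
    subprocess.run(
        ["genisoimage", "-o", iso_path, "-r", "-J", "-V", "PHOTOS_2017", source_dir],
        check=True,
    )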

 

