Geekzone: technology news, blogs, forums
timmmay
12456 posts · Uber Geek · +1 received by user: 1995 · Trusted · Subscriber

Topic # 214568 · 18-May-2017 07:36

Task

 

I'm going to be archiving around 50GB of data to AWS Glacier. The data I'm archiving is images, video, and a few documents, so compression isn't really an issue.

 

Glacier

 

Glacier is cheaper for fewer, larger files rather than many small files, so I'd like to put these many small files into a container. I'd like to use a container format that is tolerant of errors, so that a single-byte error doesn't make the whole archive inaccessible. Corruption is probably unlikely with Glacier, because uploads are checksummed, three copies of each file are stored, and integrity is checked and corrected, but an error could still creep in somewhere.

 

I'll be making a dozen files in the 1-10GB range, rather than one huge file. That's mostly so recovery is easier.

 

Question

 

What's the best durable container format for this that's free or open source? RAR is probably the best, as you can add recovery information to the file, but it's not free or open.

 

Background

 

According to Wikipedia, few other formats have any kind of error correction, probably because it increases archive size. Zip and 7z can detect errors, but can't correct them.

 

Default

 

If none are great, I'd probably go with zip, since it's been around the longest and has the best recovery tools. I generally use 7z for most things because of its better compression ratio.





AWS Certified Solutions Architect Professional, SysOps Administrator Associate, and Developer Associate
TOGAF certified enterprise architect
Professional photographer


frankv
1637 posts · Uber Geek · +1 received by user: 762

Reply # 1783993 · 18-May-2017 08:34

I can't comment on the best format, but here's a bit of information theory about error correction.

 

Yes... error correction capability does add to the size of an archive; IIRC, you add about log2(filesize_in_bits) bits to be able to correct any single-bit error.
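As a rough worked example of that rule of thumb (my own illustration, not part of frankv's post): for a 10GB archive, the single-bit-correction overhead works out to only a few dozen bits.

    import math

    # Overhead for locating (and flipping back) a single-bit error,
    # using the log2(filesize_in_bits) rule of thumb mentioned above.
    archive_bytes = 10 * 1024**3                     # a 10GB archive
    archive_bits = archive_bytes * 8
    check_bits = math.ceil(math.log2(archive_bits))  # ~37 bits

    print(f"{archive_bits} data bits need roughly {check_bits} check bits")

As the rest of the post explains, though, that tiny overhead only protects against a single flipped bit, which isn't the failure mode you actually see.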

 

But beware...

 

1. There's a limit to what errors can be corrected (otherwise you wouldn't need to store the data, just the error correction information). For example, the Golay code encodes 12 bits of data in a 24-bit word in such a way that any error of up to 3 bits can be corrected, or any error of up to 7 bits can be detected. I.e. the file size is doubled and only errors of up to 3 bits can be corrected. Typically, errors happen in large blocks rather than singly.

 

2. If your data is stored on an HDD or DVD, then each sector is CRCed, and the whole sector (maybe even the whole file) is thrown away if the CRC fails. If it's stored in the cloud, you won't have any control over attempting to recover bad sectors. Even if it's stored somewhere where you can access the physical device, you may not be able to recover a bad sector. So you'll need to correct the missing 4096/32768/whatever bits.

 

Because of these things, forward error correction (i.e. including enough error correction information with the data to be able to correct errors) is not widely used. It's far cheaper and easier to save 2 or more copies of the data separately. The degree of physical separation depends on the scale of disaster you want to recover from... e.g. if your concern is hardware failure of a disk drive, then store on two separate disk drives, whereas if you want to be immune from an earthquake destroying your data, then store it in two separate cities a long way apart. When you have 2 separate physical copies, you need to periodically check that both are still recoverable (e.g. the lifetime of a writable CD/DVD is about 5 years?).

 

 


Dynamic
2161 posts · Uber Geek · +1 received by user: 594 · Trusted · Subscriber

Reply # 1783995 · 18-May-2017 08:41

An option you have is to upload the data, then download it again (in an ideal world using a completely different computer, etc.) and test that downloaded copy. If it checks out OK, then you can be pretty confident.
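A minimal sketch of that round-trip check, assuming you still have the original archive on hand to compare against (the file names are placeholders, not anything from this thread):

    import hashlib
    from pathlib import Path

    def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
        """Hash a file in chunks so large archives don't need to fit in RAM."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    original = Path("archive-part01.zip")                 # placeholder names
    redownloaded = Path("archive-part01.downloaded.zip")

    if sha256_of(original) == sha256_of(redownloaded):
        print("Round trip OK: the downloaded copy matches byte for byte.")
    else:
        print("Mismatch: don't trust the uploaded archive.")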





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams

 

 





timmmay

Reply # 1783997 · 18-May-2017 08:46

Thanks Frank. I'm not particularly concerned about storage errors; Glacier is designed for 99.999999999% durability. Errors during transmission are possible, though: the upload is checksummed, but the download isn't as such, other than by TCP.

 

The data will be on one online local disk, one offsite local disk, and in the cloud, so I have redundancy. The cloud archive is the "last line". I'm storing photos at medium quality, because I have so many of them. 50GB of "last line" backups will cost me around $0.20/month.
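That figure is consistent with a Glacier storage price of roughly $0.004 per GB-month (the exact rate varies by region, so treat the rate used here as an assumption):

    # Rough Glacier storage cost, assuming ~$0.004 per GB-month (region dependent).
    size_gb = 50
    price_per_gb_month = 0.004
    print(f"~${size_gb * price_per_gb_month:.2f}/month")   # ~$0.20/month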

 

The main thing that would be useful is a file format that doesn't become inaccessible through small errors. So if I have a 10GB archive with many images in it and there's a single-byte error, I don't mind losing one image, but losing the whole archive would be annoying.









timmmay

Reply # 1783998 · 18-May-2017 08:47

Dynamic:

 

An option you have is to upload the data, then download it again (in an ideal world using a completely different computer, etc.) and test that downloaded copy. If it checks out OK, then you can be pretty confident.

 

 

Upload to Glacier is pretty safe, since it's checksummed by the uploading software. Then multiple copies are kept in multiple locations, integrity is checked regularly, and any failed blocks are replaced.
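For reference, the checksum Glacier uses on upload is a SHA-256 "tree hash" over 1 MiB chunks. Here's a minimal sketch of how one can be computed to compare against the value AWS reports (my own illustration based on the AWS documentation, not timmmay's tooling; the file name is a placeholder):

    import hashlib

    MIB = 1024 * 1024

    def glacier_tree_hash(path: str) -> str:
        """SHA-256 tree hash: hash each 1 MiB chunk, then pair hashes up the tree."""
        level = []
        with open(path, "rb") as f:
            while chunk := f.read(MIB):
                level.append(hashlib.sha256(chunk).digest())
        if not level:                                  # empty file edge case
            level = [hashlib.sha256(b"").digest()]
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [hashlib.sha256(b"".join(p)).digest() if len(p) == 2 else p[0]
                     for p in pairs]
        return level[0].hex()

    # print(glacier_tree_hash("archive-part01.zip"))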







frankv

Reply # 1784030 · 18-May-2017 09:56

timmmay:

 

The main thing that would be useful is a file format that doesn't become inaccessible through small errors. So if I have a 10GB archive with many images in it and there's a single-byte error, I don't mind losing one image, but losing the whole archive would be annoying.

 

 

Right. But I think you're overthinking this.

 

What I am saying is that the chances of having a single-byte error and not having the 4K (or more, depending on your HDD sector size) bytes around it in error as well are vanishingly small. I don't know what WinRAR means when it talks about a "Recovery Record", but if it is going to correct an arbitrarily sized error anywhere in the archive (as they imply), it would need to be orders of magnitude larger than the archive. It's apparently actually 3-10% of the archive size. At that size, it would at best only detect and fix single-bit errors, so I don't believe that's what it does; there would be no benefit from that kind of error correction. So I'm guessing the "Recovery Record" is some kind of index that allows you to recover *most* of your archive in the event of losing a chunk of it.

 

Given that your images are already compressed (I assume), there's no particular point in combining them into an archive, except to save on Amazon's charges. Most archiving programs (e.g. Zip) will not compress files that are already compressed; they will just store them. In that case, there's no reason why an error in one archived file should affect another.

 

Zip does have a single point of failure in that the only copy of the archive directory is stored at the end of the archive rather than the beginning. A bad sector in the directory, or truncation of the file, or inability to find the directory due to some other sector's error, could make the entire archive unreadable.
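A quick way to see that central-directory weakness for yourself (a throwaway sketch, not something from the thread): chop the tail off a zip and Python's zipfile can no longer open it at all, even though the member data is still present.

    import zipfile

    # Build a small archive with two members.
    with zipfile.ZipFile("demo.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("a.txt", b"x" * 10_000)
        zf.writestr("b.txt", b"y" * 10_000)

    # Drop the last 100 bytes, which hold the end of the central directory.
    data = open("demo.zip", "rb").read()
    open("truncated.zip", "wb").write(data[:-100])

    try:
        zipfile.ZipFile("truncated.zip")
    except zipfile.BadZipFile as e:
        # The reader can't find the end-of-central-directory record,
        # so the whole archive is rejected.
        print("Unreadable:", e)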

 

But you're relying on the Internet to transport the file without error, and on Amazon to store and retrieve it without error (including keeping multiple copies, periodic validity checking, and so on). If you can't assume those things, then there's no point in doing this. How would you get an error in an archive in the first place? If there is a single-byte error on Amazon's storage device, that will cause a CRC error when the sector is retrieved. At that point, I expect that if Amazon doesn't have another copy of the file, it will just report that your file is corrupt and not deliver *any* of it to you (i.e. you'll lose the whole 10GB, regardless of whether it includes error correction or not).

 

 




timmmay

Reply # 1784049 · 18-May-2017 10:20

frankv: Right. But I think you're overthinking this. ......

I don't think I'm overthinking this. I'm just interested in whether anyone knows more about the durability of compression / container formats than I do.

 

Combining images into a container is both to reduce AWS charges and for convenience. Since this is an archive, I'm unlikely to want to download one image; I'm more likely to want to download them all.

 

It could be that some compression programs check file integrity and refuse to extract any files if any part of the archive is corrupt. That's probably unlikely, but you never know.

 

I guess I could try it. I could compress some files in a few formats, change a few random bytes, and see what happens.
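For the zip case, something like this is enough (a rough sketch of that test using only the standard library; the 7z and tar cases need their own tools):

    import random
    import zipfile

    # Build a test archive from a few dummy "image" files, store-only.
    with zipfile.ZipFile("test.zip", "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(5):
            zf.writestr(f"image_{i}.bin",
                        bytes(random.randrange(256) for _ in range(200_000)))

    # Flip one byte near the middle of the archive.
    data = bytearray(open("test.zip", "rb").read())
    data[len(data) // 2] ^= 0xFF
    open("corrupt.zip", "wb").write(bytes(data))

    # Try to salvage each member individually.
    with zipfile.ZipFile("corrupt.zip") as zf:
        for name in zf.namelist():
            try:
                zf.read(name)                     # read() verifies the CRC-32
                print(f"{name}: OK")
            except zipfile.BadZipFile as e:
                print(f"{name}: damaged ({e})")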







270 posts · Ultimate Geek · +1 received by user: 57 · Subscriber

Reply # 1784057 · 18-May-2017 10:38

timmmay:

 

 

 

I guess I could try it. I could compress some files in a few formats, change a few random bytes, and see what happens.

 

 

Would certainly be interested in your conclusions...





                                                                                                             

Dynamic

Reply # 1784060 · 18-May-2017 10:49

I'm pretty sure Zip or 7z will extract an archive and skip any faulty files in that archive (unless the header of the archive is faulty, of course). It's been a very long time since I've dealt with a faulty archive file.





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams



timmmay

Reply # 1784079 · 18-May-2017 11:11

I put a bunch of text into a few files, compressed it using zip/7z/tar on the minimum compression setting, changed a random byte near the middle of each archive, then extracted. This isn't a fair test, as the damage could've been different for each file.

 

7z: noted two damaged files, one was extracted with problems, one wasn't extracted

 

zip: noted one damaged file, extracted them all, one had problems

 

tar: did not note any damage (it doesn't have checksums), one file damaged

 

 

 

I redid this using store only, no compression. 7z and zip did the same thing, reported a problem but extracted the files.

 

Based on this exceptionally quick and poor-quality test, it seems like zip is a more robust file format than 7z.

 

I also note this answer on Ask Ubuntu.

 

 

 

PAR

 

I also found the Parchive (PAR) format, along with the MultiPar software. It's designed to add configurable amounts of redundancy to archives to help with recovery if the files are damaged.

 

I'm unlikely to use the PAR format, just because it seems niche. In the unlikely event of corruption, I think the zip file format is sufficiently robust, especially if it's used in 'store' mode rather than compress mode. Given that I'm packaging images and videos that are already compressed, using store is no problem.







270 posts · Ultimate Geek · +1 received by user: 57 · Subscriber

Reply # 1784086 · 18-May-2017 11:25

timmmay: I'm unlikely to use the PAR format, just because it seems niche. ......

Supposedly PAR just complements your existing archive; i.e. you can create a PAR file for a ZIP file or something like that...

 

 





                                                                                                             

Dynamic

Reply # 1784102 · 18-May-2017 11:59

timmmay: I put a bunch of text into a few files, compressed it using zip/7z/tar on the minimum compression setting, changed a random byte near the middle of each archive, then extracted. This isn't a fair test, as the damage could've been different for each file. ......

 

Nice test. Thanks for sharing the results. If it's potentially important to wring an extra 5-20% compression out of 7z, perhaps do another couple of tests with that format and some basic image files (like a 100x100 pixel PNG checkerboard, for example). You may have happened to scramble the header of the file in the first 7z test, and changing a byte a little further along might have given a different result, like a slightly corrupted PNG that was still viewable.

 

I'm pleasantly surprised that ZIP has come out on top for you, given the age of the format.





"4 wheels move the body.  2 wheels move the soul."

“Don't believe anything you read on the net. Except this. Well, including this, I suppose.” Douglas Adams

647 posts · Ultimate Geek · +1 received by user: 131

Reply # 1784105 · 18-May-2017 12:05

I'd suggest PAR - in particular the par2 format, most likely using the QuickPar program - IIRC you use Windows.

 

The combination of par2/QuickPar is far from niche; I would suggest it is one of the most commonly used recovery formats out there.

 

As an example, suppose that you have 50GB of files that you wish to store on AWS in ten 5GB containers. You could generate a set of six PAR files, each 1GB in size. If you try to get your files back and retrieve all ten files but some of them are corrupted, then you would retrieve one of your PAR files, and that would most likely be enough to repair the damage. If you could only get nine of your main files back, then you would need five of your PAR files.
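For reference, here's roughly how that looks driven from Python with the par2cmdline tool (a sketch under the assumption that par2 is installed and that its create/verify/repair subcommands and -r redundancy option behave as described in its help; the file names are placeholders):

    import subprocess

    archive = "photos-part01.zip"    # placeholder archive name

    # Create recovery volumes worth ~10% of the archive.
    subprocess.run(["par2", "create", "-r10", f"{archive}.par2", archive], check=True)

    # Later, after pulling everything back from Glacier:
    subprocess.run(["par2", "verify", f"{archive}.par2"], check=True)   # raises if damage is found
    # subprocess.run(["par2", "repair", f"{archive}.par2"], check=True) # rebuild from the .par2 volumes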


'That VDSL Cat'
5974 posts · Uber Geek · +1 received by user: 1093 · Trusted · Spark · Subscriber

Reply # 1784170 · 18-May-2017 12:38

PAR files, now that goes back to the BBS days!

 

 

 

Given the situation, I don't think they would be that bad an idea.

 

Given that you are looking at a separate file for PAR, etc., though.





#include <std_disclaimer>

 

Any comments made are personal opinion and do not reflect directly on the position my current or past employers may have.




timmmay

Reply # 1784200 · 18-May-2017 13:09

Ah, if PAR goes alongside a zip/7z file I could do that. QuickPar is pretty old now; MultiPar is the more modern software, apparently.

 

@Dynamic - I won't be using compression; there's very little to gain compressing RAW/JPEG files, so I'll just use store mode so they're all in a single file. That's better for archiving to AWS Glacier.

 

I was hoping to find a format with storage and error correction built in, but this is almost as good :)
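For what it's worth, a store-only zip like that is only a few lines with Python's zipfile (my own sketch of the approach described above; the folder and archive names are placeholders):

    import zipfile
    from pathlib import Path

    source = Path("Photos/2017")            # placeholder source folder
    archive = "photos-2017-part01.zip"      # placeholder archive name

    # ZIP_STORED = no compression, just a container; fine for RAW/JPEG/video.
    # allowZip64 lets the archive grow past 4GB.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(source))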







IcI
296 posts · Ultimate Geek · +1 received by user: 61 · Trusted

Reply # 1784338 · 18-May-2017 16:21

timmmay:

 

frankv: ... Given that your images are already compressed (I assume), there's no particular point to combining them into an archive, except to save on Amazon's charges. Most archiving programs (e.g. Zip) will not compress files that are already compressed, they will just store them. In that case, there's no reason why an error in one archived file should affect another. ...

 

... Combining images into a container is both to reduce AWS charges, and also for convenience. Since this is an archive I'm unlikely to want to download one image, I'm more likely to want to download them all. ...

 

Have you considered the ISO format? It does what you want. Create virtual CD-ROMs of any required size, grouping the files that belong together.

 

  • Groups many files together into a single archive
  • No compression because files are already compressed
  • Well-known format, mountable by all modern OSes, and on older ones with utilities.
  • MD5 / SHA checksumming to confirm non-corruption.
  • You will always be able to "extract" the other files that are not corrupted.
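If you went this route, one way to build such an ISO is to shell out to genisoimage (a sketch assuming genisoimage, or the equivalent mkisofs, is installed; the names are placeholders):

    import subprocess

    source_dir = "Photos/2017"        # placeholder folder to package
    iso_path = "photos-2017.iso"      # placeholder output name

    # -r: Rock Ridge, -J: Joliet (long file names on Unix and Windows), -V: volume label.
    subprocess.run(
        ["genisoimage", "-o", iso_path, "-r", "-J", "-V", "PHOTOS_2017", source_dir],
        check=True,
    )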

 

