OT: Calling MGoNerds/CompSci people - Heavy Duty Data Backup

Submitted by MGoArchive

Hello - this post will probably read like Greek to nearly everyone here, but I'm hoping there are a few of you out there who work at Amazon AWS/Google GCE/Microsoft Azure or have a NAS/SAN at home.

The problem - I have a growing archive of Michigan Athletics video (4-5TB now). Eventually I can expand my NAS (Windows Server 2016 using ReFS - to hell with RAID card firmware bugs) to 32TB (8x8TB in RAID1), but I will run out of room at some point. I usually like to leave one spare 'dual bay' worth of free space open in case I need to pull two hard drives and replace them with two larger ones. For example, if I had 8x4TB (16TB usable), I'd like to leave 4TB free so I can swap out 2x4TB for 2x8TB - ending up with 2x8TB + 6x4TB - to give the NAS room to 'breathe' and grow.

The big problem I'm facing - cold storage. I'd ideally like to move older seasons onto pairs of cheap 2TB hard drives at $40 apiece (far cheaper in the long run than Amazon Glacier). Each file saved to disk would have a checksum (to detect bitrot), and then I'd have a program compare the checksums between the two disks for every file - or at least report when a file on one disk has a discrepancy compared to its copy on the other.
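To make that concrete, here's a rough sketch (in Python) of the comparison pass I have in mind - the drive paths are just placeholders for wherever the two cold-storage disks mount:

```python
# Rough sketch: walk two cold-storage disks that should hold identical
# copies and report any file whose SHA-256 digests disagree.
# D:/archive and E:/archive are placeholder mount points.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def compare_disks(disk_a: Path, disk_b: Path) -> None:
    for file_a in disk_a.rglob("*"):
        if not file_a.is_file():
            continue
        file_b = disk_b / file_a.relative_to(disk_a)
        if not file_b.exists():
            print(f"MISSING on {disk_b}: {file_b}")
        elif sha256_of(file_a) != sha256_of(file_b):
            print(f"MISMATCH: {file_a} vs {file_b}")

compare_disks(Path("D:/archive"), Path("E:/archive"))
```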

For the CompSci guys - apparently PAR3 uses Reed-Solomon error correction codes. Is it worth generating these recovery records for a single 15-20GB file (roughly the average size of a football game)?

Does anyone know of an application (freeware or otherwise) that creates a checksum of each file on a drive or in a directory, writes them to a text file, and can later re-scan and compare each file's current checksum against the one recorded earlier?
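If nothing off-the-shelf fits, the logic I'm after is simple enough to script myself - a rough sketch, where the manifest filename is just a placeholder:

```python
# Rough sketch: record a SHA-256 manifest for a directory tree, then
# re-scan later and flag files whose current hash differs from the
# recorded one. "checksums.txt" is a placeholder manifest name.
import hashlib
from pathlib import Path

MANIFEST = "checksums.txt"

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def write_manifest(root: Path) -> None:
    with (root / MANIFEST).open("w") as out:
        for p in sorted(root.rglob("*")):
            if p.is_file() and p.name != MANIFEST:
                out.write(f"{file_sha256(p)}  {p.relative_to(root)}\n")

def verify_manifest(root: Path) -> None:
    for line in (root / MANIFEST).read_text().splitlines():
        recorded, name = line.split("  ", 1)
        if file_sha256(root / name) != recorded:
            print(f"POSSIBLE BITROT: {name}")
```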

quiqsilver

July 6th, 2017 at 1:52 PM ^

I would generate par2 files with multipar rather than doing checksums. Choose the level of redundancy you feel is appropriate or just size it to fill the free space on the volume. It's probably worth generating enough to cover at least a small amount of corruption. The process is fast, especially with a decent GPU.
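If you'd rather script it than click through the MultiPar GUI, the open-source par2cmdline tool speaks the same PAR2 format. A minimal sketch - it assumes par2 is on your PATH, and the filename is just an example:

```python
# Minimal sketch: create and verify PAR2 recovery data by shelling out to
# the open-source par2cmdline tool (same PAR2 format MultiPar uses).
# -r10 requests 10% redundancy; tune it to the free space you have.
import subprocess

def create_par2(video: str, redundancy: int = 10) -> None:
    subprocess.run(
        ["par2", "create", f"-r{redundancy}", f"{video}.par2", video],
        check=True,
    )

def verify_par2(video: str) -> None:
    subprocess.run(["par2", "verify", f"{video}.par2"], check=True)

create_par2("2016-osu-game.mkv")  # hypothetical 15-20GB game file
```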

gbdub

July 6th, 2017 at 2:31 PM ^

What's a good consumer-grade network storage solution? Basically I just want to be able to do wireless backups of critical data on my home PCs / laptops, because I'm too lazy to do it reliably when I have to dig out a drive and physically hook it up. 

Don't need a ton of storage - something like 4TB would be plenty for now (actually 2 would probably be enough). RAID would be very good, but encryption, fancy features, vast customizability, etc. aren't really needed.

I'm tech-fluent enough to not need it to be strictly plug-and-play, but I'm also overscheduled and easily distracted, so straightforward is definitely preferred. 

mvp

July 6th, 2017 at 6:45 PM ^

There are a ton of options out there...

I have a Synology DS-412+, which has 4 hot-swappable bays for HDs, and I have four 3TB drives in it. You can use different RAID configurations, which affects your total storage capacity vs. safety and convenience.

The Synology platform supports tons of stuff including a real-time PC backup program, Apple Time Machine, and various media servers.  I'm also working up to ripping Blu-Rays and streaming them throughout the house using PLEX.

Like I said, there are many options.  My Synology has been running reliably for a few years now and I've been very pleased with the app support.  Your mileage may vary.

As I was reading the OP and the replies, I kept thinking one of the multi-bay NAS systems is the way to go. I don't know enough to be sure, and it may violate some of the original constraints, but it's pretty simple, and by the time you run out of space or need to swap out a dying drive, a bigger one will be cheaper. Just my $0.02.

ca_prophet

July 6th, 2017 at 5:44 PM ^

Backblaze stores unlimited backups for a monthly fee. That may still be outside your price range, but the other folks looking for backup/cloud data solutions might be interested.

For your specific case, if you do write your own and you're talking about ~200 files, consider generating the hashes for each file and printing them out. Verify you can scan the hardcopy back in (heck, you can do it with your phone these days), then run the program again and compare the new output against your scanned hardcopy (UNIX cmp(1) works well for this). Another option, probably also more costly, is DVD backup. (Truly cold storage should be write-once, right?) If you get it working, let us know what you picked - I'm curious to know your chosen solution.

MGoArchive

July 6th, 2017 at 7:00 PM ^

But unfortunately Backblaze won't run on Windows Server - only on desktop versions of Windows (7/8.1/10).

Supposedly there is a 'Workstation' edition of Windows 10 Pro coming that supports the ReFS filesystem, even as a bootable drive - right now ReFS is only really supported using Windows Server and the 'Storage Spaces' role, which is what I use it for.

But if I could run a version of Windows 10 Pro that supports ReFS, I would probably install it on a separate SSD (my OS footprint is small, maybe 40-50GB), buy a bunch of 4TB externals from Apple and format them as NTFS (no restocking fee if returned before 14 days), copy the data over, wipe the pools, reinstall the OS, recopy the data over, and install Backblaze ^_^

freejs

July 6th, 2017 at 6:27 PM ^

What's the best, reasonably priced external hard drive out there in terms of not losing data, and how many years should one expect a hard drive to stay good before the data starts to deteriorate?

mvp

July 6th, 2017 at 6:53 PM ^

I'm very far from an expert, but I have researched and set up my own backup system at home using a Synology DiskStation product.  

I googled "WD Red Blue Purple" and found this article which describes the different tiers Western Digital sells.  https://www.pugetsystems.com/labs/articles/Understanding-the-WD-Rainbow-674/

In short, you get what you pay for.  Some HDs are designed for personal vs. commercial use.  Some are rated for more or less up-time.  Some are for storage systems vs. individual computers.

A good NAS (network attached storage) housing/system will monitor the condition of your HDs and let you know if there's a problem. You can then hot-swap out the failing drives. I just looked on Amazon and a WD Red 3TB drive (designed for use in a NAS) is $109.

SJ Steve

July 7th, 2017 at 3:23 AM ^

Consumer-grade HDDs can start deteriorating right out of the box. The only way to reliably prevent data loss is double parity: with single parity, a single unrecoverable read error during parity reconstruction is enough to lose data. If you use consumer-grade drives, only double parity (RAID-6, or SHR-2 in Synology nomenclature) will protect your data in all but the most catastrophic scenarios.
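To put rough numbers on the single-parity risk, using the 1-per-10^14-bits URE rate that consumer drive spec sheets usually quote (a vendor figure, so treat this as an order-of-magnitude estimate):

```python
# Back-of-envelope estimate: odds of hitting at least one unrecoverable
# read error (URE) while rebuilding a degraded 8x4TB single-parity array.
# The 1-in-1e14-bits rate is the typical consumer-drive spec-sheet figure.
URE_PER_BIT = 1e-14
bits_read = 7 * 4e12 * 8       # a rebuild reads all 7 surviving 4TB drives
p_clean = (1 - URE_PER_BIT) ** bits_read
print(f"chance of at least one URE during rebuild: {1 - p_clean:.0%}")
```

With those assumptions it comes out to roughly a 9-in-10 chance of at least one URE somewhere during the rebuild, which is exactly why the second parity drive matters.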

johnthesavage

July 6th, 2017 at 11:33 PM ^

You can put 5 TB in Glacier for ~$20/month, which is a few hundred a year and seems reasonable. Remember, then you don't have to deal with any of the infrastructure, and they guarantee all the nines of durability for you. You are also protected against weird stuff like your house burning down. I'd just do this and forget about it; you could also keep some more recent stuff in S3 (more expensive, of course) if you want to get at that data quickly, from wherever.
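The back-of-envelope math, using Glacier's 2017 list price of $0.004/GB-month (check current pricing before committing):

```python
# Worked cost check against Glacier's 2017-era $0.004/GB-month list price.
GLACIER_PER_GB_MONTH = 0.004
gb = 5 * 1024                        # 5 TB
monthly = gb * GLACIER_PER_GB_MONTH
print(f"${monthly:.2f}/month, ${monthly * 12:.2f}/year")
# -> $20.48/month, $245.76/year, i.e. the ~$20/month figure above
```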

SJ Steve

July 7th, 2017 at 3:08 AM ^

Two suggestions. First, store the data locally using at least 5 drives and double parity, and scrub the array every month using a scheduled task. I use a Synology 1817+ with 4TB drives and two SSDs for caching.

Second, create a private tracker with one torrent per year of content. That way you can leverage the power of multiplicity to prevent bitrot, as every block is hashed and can be re-downloaded from a peer even in the case of an astronomically rare data loss event. I'd be willing to participate, and it's possible I have some content you may not have. FWIW, I used to be the main Michigan capper for TYT pre-HDTV.
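The bitrot protection in the torrent idea comes from piece hashing: the .torrent file stores a SHA-1 digest for every fixed-size piece, so a corrupted piece fails verification and can simply be re-fetched from a peer. A bare-bones sketch of that check for a single-file torrent - piece_length and piece_hashes are assumed to have been parsed out of the .torrent's info dictionary elsewhere:

```python
# Bare-bones sketch of BitTorrent-style piece verification: hash each
# fixed-size piece of the payload and compare against the digests stored
# in the .torrent (the last piece is allowed to be shorter).
import hashlib

def verify_pieces(path: str, piece_length: int,
                  piece_hashes: list) -> list:
    """Return indices of pieces whose SHA-1 doesn't match (i.e., bitrot)."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(piece_hashes):
            piece = f.read(piece_length)
            if hashlib.sha1(piece).digest() != expected:
                bad.append(i)
    return bad
```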

Oscar

July 8th, 2017 at 9:04 PM ^

Have you looked into FreeNAS?  It handles bitrot checking natively.  There is a steep learning curve, and it can be particular about its hardware.

MGoArchive

July 9th, 2017 at 11:14 PM ^

FreeNAS/ZFS is the gold standard when it comes to protecting against bitrot. But ultimately I decided to go with Windows Server 2016 (four-core Haswell Xeon + 16GB ECC memory) + Storage Spaces (mirror config - essentially software RAID1) with the ReFS filesystem.

I can get into it quickly from any device via Chrome Remote Desktop, and it runs Plex streaming software and qBittorrent. Yes, I'm aware there are OSS alternatives to all the items mentioned above, but it 'just works' for me. Maybe I'm just a n00b, but yeah - I'm very comfortable with it, and I've had zero issues so far running it in production for almost a year now.

The data storage issue is primarily about what happens when I max out the 'hot' storage on the server. Right now I have 4x4TB (in a single 4x3.5" HDD cage) for 8TB of usable space. I have another cage I'm going to install, along with another stick of ECC memory (bringing it up to 24GB), and I'll put another 2x4TB in. Then in 3-5 years I'll put in 2x8TB, and after that start swapping out 2x4TB at a time for 2x8TB. You get where I'm going with this.
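Spelling out the growth math (usable space in a two-way mirror is half the raw total; the stage labels are just shorthand):

```python
# Quick sketch of the expansion plan: two-way mirror, so usable = raw / 2.
stages = {
    "today (one cage)": [4, 4, 4, 4],
    "second cage + 2x4TB": [4, 4, 4, 4, 4, 4],
    "add 2x8TB": [4, 4, 4, 4, 4, 4, 8, 8],
    "swap 2x4TB for 2x8TB": [4, 4, 4, 4, 8, 8, 8, 8],
}
for name, drives_tb in stages.items():
    usable = sum(drives_tb) / 2
    print(f"{name}: {usable:.0f}TB usable from {len(drives_tb)} drives")
```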