OT: Calling MGoNerds/CompSci people - Heavy Duty Data Backup

Submitted by MGoArchive on July 6th, 2017 at 10:24 AM

Hello - this post will probably seem Greek to nearly everyone here, but I'm hoping there are a few of you out there who work at Amazon AWS/Google GCE/Microsoft Azure or have a NAS/SAN at home.

The problem - I have a growing archive of Michigan Athletics video (4-5TB now). Eventually I can expand my NAS (Windows Server 2016 using ReFS - to hell with RAID card firmware bugs) to 32TB (8x8TB in RAID1), but I will run out of room at some point. I usually like to leave one spare 'dual bay' worth of free space open in case I need to pull two hard drives and replace them with two larger ones. For example, with 8x4TB (16TB usable), I'd like to leave 4TB free so I can swap 2x4TB out for 2x8TB - leaving 2x8TB + 6x4TB - and give the NAS room to 'breathe' and grow.

The big problem I'm facing - cold storage. I'd ideally like to move older seasons onto cheap 2x2TB hard drives that are $40 apiece (far cheaper in the long run than Amazon Glacier). Each file saved on the disk would have a checksum (to detect bitrot), and then I'd have a program compare the checksums between the two disks for every file, or at least report that one of the files has a discrepancy compared to the other.

For the CompSci guys - apparently PAR3 (like PAR2 before it) uses Reed-Solomon error-correction codes - is it worth generating these recovery records for a single 15-20GB file (this is about the average size of a football game)?
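For what it's worth, PAR2 is the version that's actually widely deployed, via the `par2` command-line tool. A rough sketch of what that would look like, assuming par2cmdline is installed and `game.mkv` stands in for one of the archived games:

```shell
# Create recovery records with 5% redundancy - enough to repair
# scattered bitrot, though not a wholesale drive failure
par2 create -r5 game.mkv.par2 game.mkv

# Years later: check the file against the recovery set
par2 verify game.mkv.par2

# If verification fails, attempt a repair using the recovery blocks
par2 repair game.mkv.par2
```

At 5% redundancy that's roughly 1GB of .par2 files per 20GB game, which seems like cheap insurance on a $40 drive.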

Does anyone know of an application (freeware or otherwise) that creates checksums of each file on a drive or directory, writes them to a text file, and then re-scans and compares the historical checksums written previously with the new ones?
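Short of a dedicated application, the two-disk comparison part is only a few lines of shell. A sketch, assuming standard md5sum/find/diff utilities; `/mnt/driveA` and `/mnt/driveB` are hypothetical mount points for the two mirror drives:

```shell
# Build a checksum manifest for each mirror drive (paths made
# relative and sorted so the two manifests are directly comparable)
(cd /mnt/driveA && find . -type f -exec md5sum {} + | sort -k2) > driveA.md5
(cd /mnt/driveB && find . -type f -exec md5sum {} + | sort -k2) > driveB.md5

# Any line that appears in only one manifest is a mismatch or a
# missing file - silence means the mirrors agree
diff driveA.md5 driveB.md5
```

Keeping the manifests off the drives themselves also gives you the historical record to re-check against years later.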



July 6th, 2017 at 10:27 AM ^

This is pretty OT but I'm leaving it up because

a) It's probably not that Greek to most of you, and 

b) I'd like to know the answer.

Markley Mojo

July 6th, 2017 at 12:30 PM ^

I second the StackOverflow post, although I don't think Python or a database is needed. The script in the top answer there maps nicely to the OP's request.

Bash scripting is either in Windows Server 2016 now or soon to be there.

And make sure you have a copy of the data somewhere else. Preferably on the other side of the planet (in case of a large meteor strike).

EDIT: Actually, the StackOverflow post I was looking at was the following




July 6th, 2017 at 10:37 AM ^

How do you load a replacement file if you intend to maintain only a single archival original and the data corrupts slightly over time? Perhaps I'm losing sight of the forest for the trees.


July 6th, 2017 at 10:45 AM ^

The 2x2TB external drives would have identical, mirrored data on them. I'd 'check' each drive every few years, validating the new checksums against the historical ones. If there was a mismatch, hopefully there wasn't bitrot on the other drive too, and I could copy the file over.


For the first pass, the drive would be connected via a USB3 dock and the mythical application I described above would generate checksums. Then I'd pull the drive and let it sit on a shelf. Every few years I'd reconnect it, have the application generate fresh checksums, and compare those against the historical checksums to detect bitrot.

Essentially - 'cold storage' RAID, but the two drives are independent of each other, just a mirror image in case bitrot occurs on one of the drives.


I 'cold storage' the old seasons (I can fit two seasons onto a 2TB drive) to move them off the 'hot' NAS and free up space.


July 6th, 2017 at 10:54 AM ^

The issue is that I have (between football/basketball/hockey) about 150-200 files on disk. I'd probably need to write an application that generates MD5s for each file initially. Years later, I'd like the application to re-scan, generate a new MD5 for each file, and compare it against the old one. If the new and old MD5s match, great. If not, alert me that the file's MD5 has changed. Hopefully the MD5 on the 'mirror' drive is still the same.


July 6th, 2017 at 11:57 AM ^

Just verifying that in this plan, the storage of the hashes does not reside on the actual cold-stored drives.  If the "known good" hash record gets ruined, the checks don't matter.

This is obvious, I know; but I have seen some really dumb mistakes screw up otherwise awesome plans.  Didn't notice this detail listed elsewhere and decided to come say the obvious.



July 6th, 2017 at 12:39 PM ^

The md5sum utility from Linux can both create initial hashes and later verify those hashes.

$ md5sum -b * > archive_hashes.txt
$ md5sum -c archive_hashes.txt

The first line creates hashes for all files in the current directory and stores them in archive_hashes.txt. The second line reads hashes and file names from archive_hashes.txt and checks that the named files haven't changed.

Combine with the find utility or otherwise wrap in a bash script to recurse through your directory tree. If you are a Windows only person, I recommend installing cygwin to gain access to a mountain of Linux command line utilities including bash and md5sum.
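For example, a recursive version of the two commands above might look like this (a sketch; run it from the top of the archive, and note that the manifest itself is excluded from hashing):

```shell
# Hash every file under the current directory tree into one manifest
find . -type f ! -name archive_hashes.txt -exec md5sum -b {} + > archive_hashes.txt

# Later: re-read the manifest and flag any file whose hash changed
# (--quiet prints only failures, so silence means all is well)
md5sum -c --quiet archive_hashes.txt
```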



July 6th, 2017 at 10:37 AM ^

OP: I thought your post was at least a 4-star (out of 5) on the OT scale. Quite interesting.

Unsolicited advice: I'd give some thought to the size of your archive and its ultimate purpose before going hog-wild on technology specs. (Perhaps you've already done this.) Yeah, storage isn't all that expensive, and you seem to know what you're doing, but are all five terabytes worth keeping?


July 6th, 2017 at 10:43 AM ^

It wouldn't be more than $50-60 a year to have a really high-quality long-term archive.

I could ask the Bentley Library people - 'hey...do you guys want this stuff?' but then we'd probably get into a conversation of where I got this video and yeah, that would probably be where the conversation ends.


July 6th, 2017 at 10:51 AM ^

I feel that solution is designed primarily for backups that grow in small increments - the issue is I'm generating around 600-700GB of new data a year.

The technical problem I'm having is finding a solution that programmatically analyzes the integrity of my 'cold storage' data. Sure, I could generate an MD5 for each football game. But I'd like to point an application at a drive and tell it: the MD5 of each file you scanned a few years ago is in the same directory as the file itself. Re-scan the file - if the MD5 matches, continue with the next file; if not, alert me that this file's MD5 has changed.
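That 'point it at a drive' behavior can be sketched in a few lines of shell, assuming each file gets a sidecar manifest written next to it on the first scan (the `.md5` sidecar naming here is my own convention, not a standard):

```shell
#!/bin/sh
# First pass: write a FILE.md5 sidecar next to any file that lacks one.
# Later passes: re-hash each file and compare against its sidecar,
# alerting on any mismatch.
find . -type f ! -name '*.md5' | while read -r f; do
    if [ -f "$f.md5" ]; then
        md5sum -c --quiet "$f.md5" || echo "ALERT: checksum changed: $f"
    else
        md5sum -b "$f" > "$f.md5"
    fi
done
```

Run it from the root of the cold-storage drive; the first run seeds the sidecars, and every later run is the integrity check.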


July 6th, 2017 at 11:04 AM ^

It might not be the intended purpose, but it basically does what you are asking behind the scenes. When you run a backup, it hashes all the files and raises an error if a hash differs while the modify date is unchanged. There is also a verify command that will check the integrity of the backup. If nothing changed, nothing will be done.


July 6th, 2017 at 11:02 AM ^

Aside from the checksums, have you considered optical media for archival storage?  The new M-discs are supposed to last an extremely long time without degrading, and the cost isn't significantly different than blu ray.  At least something to consider.


July 6th, 2017 at 11:11 AM ^

I read that even the big guys (Facebook/Google) were still using optical media and programmatically loading/unloading disks out of cold storage when the data was needed.

The issue though is that a 25-pack spindle is $61...That's $61 for 625GB of storage. I could get 4TB (2x2TB) for $80 - http://www.ebay.com/itm/162441343250?ssPageName=STRK:MESINDXX:IT&_trksi…

And the drives would basically spin for a day or two every two years, so it's not like there's a huge load being placed on them. And I'd back them up to Crashplan, which supports external drives, which are then disconnected.


July 6th, 2017 at 11:50 AM ^

Are you more concerned about cost or data preservation?  I interpreted your post as being more concerned with the latter.  You can burn multiple copies on M-discs and store them offsite in case the building burns down. 


July 6th, 2017 at 11:08 AM ^

I was with you until you said you wanted to use a nascent, unproven Microsoft file system due to your fear of bugs in RAID card firmware. If there is any big software company out there right now who's entirely forgotten how to publish good quality code, it's MS.


July 6th, 2017 at 11:58 AM ^

Personally I'd take hardware RAID over software any day, but I work in IT.  Maybe budget precludes buying a decent card/motherboard, but running software RAID, especially from M$ would make me nervous if I cared about the data.


July 6th, 2017 at 11:28 AM ^

So I work for GCE, although what you want is GCS (Google Cloud Storage).  You should look at Coldline storage at $0.007/GB/month.  You get instant access to the data, and all of this backup/maintenance is managed for you.


I guess the question is whether this is worth $35/month.


July 6th, 2017 at 11:39 AM ^

Thanks, I just looked up the calculator - https://cloud.google.com/storage/pricing

$35/month is a bit out of my price range, but perhaps I can convince Brian to have this as a business expense? I'd be OK with having the Hoke/Rodriguez years live in cold storage.

I think I'm still leaning towards my 2x2TB sitting on my shelf, having mirrored data, with MD5s for each file on each drive. Pulling the drives off the shelf every few years during the offseason and running a scan/comparison on each drive.

Oh, and uploading the data from the external to Crashplan (they let you connect an external, upload the data, and disconnect the external). Crashplan has been around a while; who knows how long they'll last though - https://www.code42.com/about-code42/


July 6th, 2017 at 12:35 PM ^

I've rewatched the Wisconsin game, just to reinforce that UM did win and QB Threet did run through the Badger defense.
Oh, also rewatched the game for the jug; Nick Sheridan's moment in the spotlight.
Whole lot of pain in the rest of the season, worth watching only to remind one's self of what a dark year felt like.


July 6th, 2017 at 5:15 PM ^

I would assume that any small player in the backup space is going to die a slow death.  The economics of storage are brutal.  

If you are willing to do the work on your own storage media, that's the cheapest way to go.  You run the risk of a single-source catastrophic event (e.g. house fire) -- but Michigan football video from the Hoke era is the least of your concerns in that event.

I tried to rally Brian to the idea of cloud web servers a while back -- I don't think IT is his favorite topic.  I even used MgoBlog as my interview topic to get me a job at Google: it's a good case study for a site with spiky loads and weak underlying infrastructure.  


July 6th, 2017 at 6:55 PM ^

I just read on r/datahoarder that Crashplan is shutting down imminently. 

But yeah, I have my requirements sheet written out for the application I'm going to write: it will scan a directory/drive, placing a .txt file with both the MD5 and SHA256 of each file in the same directory. The GUI will have another button that scans a directory, looks up each file's MD5/SHA256, re-scans the file, and compares the results. Or you can feed the application a file/folder full of .txt files containing the filenames/directory structure of a scanned drive (in case the MD5/SHA256 records themselves get corrupted - it's text, so I'll e-mail it to myself and place it in a hundred other places) and it will tell you the results.
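FWIW, the 'verify against a restored copy of the manifest' mode falls out of sha256sum's check mode for free. A single-manifest sketch (the `MGoArk.sha256` name is just a made-up convention for this example):

```shell
# One manifest per drive, stored on the drive itself AND mailed offsite
find . -type f ! -name MGoArk.sha256 -exec sha256sum -b {} + > MGoArk.sha256

# Verify against either copy - the on-drive manifest, or one restored
# from e-mail if the on-drive copy itself got corrupted
sha256sum -c --quiet MGoArk.sha256
```

Since the manifest stores relative paths, a copy restored from e-mail works as long as you run the check from the drive root.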

I will call it MGoArk - I'll keep one of the hard drives, my MGoFriend (Hi Eric!) will get the other. 

Two by two!


July 6th, 2017 at 7:20 PM ^

You could upload the MD5/SHA256 to a free tier of storage and compare against that.  You can assume that the free tier is backed by triplicate storage with bit error correction, so it's more or less always a golden copy.

Alternatively, you can safely assume that no one is going to care about corrupted pixels of Sam McGuffie's umpteenth concussion.  Or M00N. Or Shane Morris vs. Minnesota. Or anything else that happened between Jan 2007 and Aug 2015 (except Denard videos -- those should be in the cloud right now).


July 6th, 2017 at 11:57 AM ^

I don't know any of the tech stuff, but I do know that if all your backups and your backup's backups are in the same place, then a fire, flood or a break-in could be catastrophic. If it's super critical to not lose anything you might want to put Copy 2 in a separate location.

And I don't know the answer to this, but does YouTube compress/alter the original uploaded video? If not, you could upload to YouTube, then download at some future point using one of the many browser extensions available for that purpose.


July 6th, 2017 at 11:58 AM ^

Your post reminded me of an interesting (to me, at least) article I read in IEEE Spectrum about the movie and TV industries' problems with storage. For those interested, I recommend checking out: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-a…

Money quote:

“There’s going to be a large dead period,” he told me, “from the late ’90s through 2020, where most media will be lost.”

Fight the good fight, my MGoBrother. Preserve that UofM history. This is the most on-topic post ever.


July 6th, 2017 at 12:24 PM ^


This may be simplistic for what you need but if you only need to create a checksum for each file and verify that checksum at a later date, this should do it.

I'm a bit busy, but I think I read your need correctly. This should be similar to a command line I used years ago that did something similar against a home network "cloud," before I moved to a true, off-site cloud. It's basically a command-line runtime that I edited slightly before use.

TheTeam x 3

July 6th, 2017 at 12:48 PM ^

Have you considered writing to optical discs?  Check out M-discs - blu rays designed for archiving. Can last 100-150 yrs.  Then you can just upload to SkyNet.

Even if you are using the NAS for fast access, disc backup wouldn't be a bad idea.  No need to worry about continual corruption and the checksum process. Granted, this will be more expensive than cheap HDDs.