Digital Preservation Prestidigitation

I realize I haven’t posted in a while, but I’ve been busy, what can I say? I’ve been continuing to ingest our preservation files into APTrust, creating workflows for collections with ongoing additions, and integrating our workflows with ArchivesSpace. A couple of weeks ago we even passed 1 TB ingested into APTrust!

With this milestone, I think it’s a good time to reflect on that number and how sure we are that it’s all there and accessible. While I can say with a good degree of certainty that this material is actually stored in all of its locations now, an event that occurred a few months ago made me less sure that things were as they seemed, and made me wonder whether maybe there was nothing there at all.

At that time, I was doing some due diligence by restoring files from APTrust to see if the things we uploaded retained fixity. This was to ensure that both our own processes and those used by APTrust were working correctly. What a restore does is request a bag containing a copy of one of our digital objects back from APTrust’s Amazon S3 long-term storage. When restored, you get something like this (once untarred), which you can then validate for fixity.

[Image: APTrust bag contents]
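
To make this concrete, here is a rough sketch of what that validation amounts to, assuming a BagIt-style bag with a manifest-md5.txt of checksum/path lines. The bag directory name is a placeholder, and this isn’t APTrust’s own validation code, just an illustration of checking each file against the manifest.

```go
// Rough sketch: verify the md5 checksums listed in a restored bag's manifest.
// Assumes a BagIt-style layout with a manifest-md5.txt of "<md5>  <path>" lines.
package main

import (
	"bufio"
	"crypto/md5"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	bagDir := "restored_bag" // placeholder path to the untarred bag

	manifest, err := os.Open(filepath.Join(bagDir, "manifest-md5.txt"))
	if err != nil {
		panic(err)
	}
	defer manifest.Close()

	scanner := bufio.NewScanner(manifest)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue // skip malformed lines
		}
		expected, relPath := fields[0], fields[1]

		f, err := os.Open(filepath.Join(bagDir, relPath))
		if err != nil {
			fmt.Printf("MISSING  %s\n", relPath)
			continue
		}
		h := md5.New()
		if _, err := io.Copy(h, f); err != nil {
			panic(err)
		}
		f.Close()

		actual := fmt.Sprintf("%x", h.Sum(nil))
		if actual == expected {
			fmt.Printf("OK       %s\n", relPath)
		} else {
			fmt.Printf("MISMATCH %s\n", relPath)
		}
	}
}
```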

The problem in this situation was that the restore never went through. It remained pending in the APTrust web interface until I received a fixity error.

[Image: fixity error message]

Can it be? A real fixity error? I thought it must be some kind of mistake. I couldn’t retry the restore on the same digital object while the first process was still running, so I tried to restore another item that I had recently uploaded. I received the same error. I immediately contacted the main APTrust developer, Andrew Diamond, to see if he could shed some light on the issue.

He got right back to me saying that he was shocked as well. Of the 1.2 million files that had been uploaded to APTrust by members since December 2014, none had ever returned a fixity error. This was literally a one-in-a-million situation. It was great timing too, because it was the week of the spring APTrust members’ meeting in Miami, Florida, so he couldn’t look at the issue until later in the week.

Once Andrew could take a serious look at the problem, he determined that a bug affected files after they had been uploaded to our APTrust receiving bucket and moved to APTrust’s Amazon Elastic File System. The first time a file was accessed there for copying to Amazon S3 (in Northern Virginia) for long-term storage, it was read as a 0 byte file. Subsequent reads were successful (the second read goes to S3 in Oregon and then to Glacier storage).

[Image: APTrust architecture]

But the 0 byte files uploaded successfully, without S3 sending back an error. APTrust had switched from the unofficial goamz S3 upload library to the official aws-sdk-go upload library, which only reports whether the file was uploaded completely without error, not whether the right file was uploaded. To it, a 0 byte file did not constitute an error: it read a 0 byte file, so writing a 0 byte file didn’t set off any alarms. The system was saying things were perfect while there was in fact nothing there! Quite a trick by Amazon S3.
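
Here’s a minimal sketch of that failure mode using the aws-sdk-go (v1) upload manager. The bucket and key names are placeholders and this isn’t APTrust’s ingest code; it just shows that handing the uploader an empty reader (as a file read back as 0 bytes would be) returns success with no error.

```go
// Minimal sketch of the failure mode: aws-sdk-go's upload manager reports
// success based only on whether the PUT completed, so an accidentally empty
// reader (like a file read as 0 bytes) uploads "successfully" with no error.
package main

import (
	"bytes"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	uploader := s3manager.NewUploader(sess)

	// Pretend this is the file read back from EFS -- but the read returned nothing.
	emptyBody := bytes.NewReader([]byte{})

	out, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("example-preservation-bucket"), // placeholder
		Key:    aws.String("example-object.tar"),          // placeholder
		Body:   emptyBody,
	})
	if err != nil {
		fmt.Println("upload failed:", err)
		return
	}
	// No error returned: from the library's point of view the upload is
	// complete, even though the stored object is 0 bytes.
	fmt.Println("upload reported success:", out.Location)
}
```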


This was the worst kind of silent error, where there is no indication that anything is wrong. It affected files over 100 MB uploaded to APTrust from late February to April. Smaller files are processed without untarring their bags, but doing that with the new library takes up too much system memory on larger files. This is why APTrust untarred and wrote larger files to Elastic File System before ingesting them to S3, which is where the bug mentioned above occurred. In our case at Georgetown, this amounted to 1500 files! I was worried that all my work from this time period would have been for nothing and would need to be redone.

Luckily, all these files were intact in Glacier storage and did not suffer from the 0 byte issue, because those copies were made after the first read of the file. There was still a cost to this: with the steep prices to access items in Glacier, we agreed to spread the copying of these items back into S3 over a week to avoid most of the cost to APTrust. The membership price fortunately gives APTrust reserve funds to use in unlikely incidents like this (or restores for institutions that lose their local copies).

In the end, this error would have been discovered anyway about a month later, during the regular 90-day fixity check on the items that first had the error. But that would have meant thousands more files to move and more money spent by APTrust on Glacier access. APTrust was fast acting and thorough in dealing with the issue. To ensure this doesn’t happen again, APTrust now explicitly checks the result of the S3 upload, even when the Amazon S3 library says it succeeded. If the check fails, the file is automatically re-uploaded until it is successful.
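
I don’t know exactly how APTrust implemented that check, but one way such a post-upload verification could look is a HeadObject call comparing the stored object’s size against the local file, wrapped in a retry loop. The bucket, key, and file names below are placeholders, not APTrust’s actual code.

```go
// Hedged sketch of a post-upload sanity check: after Upload returns, ask S3
// for the stored object's size and retry the upload if it doesn't match the
// local file.
package main

import (
	"fmt"
	"os"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func uploadAndVerify(sess *session.Session, bucket, key, localPath string, maxTries int) error {
	uploader := s3manager.NewUploader(sess)
	svc := s3.New(sess)

	info, err := os.Stat(localPath)
	if err != nil {
		return err
	}
	wantSize := info.Size()

	for attempt := 1; attempt <= maxTries; attempt++ {
		f, err := os.Open(localPath)
		if err != nil {
			return err
		}
		_, err = uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
			Body:   f,
		})
		f.Close()
		if err != nil {
			continue // the upload itself failed; try again
		}

		// Don't trust the success return alone -- check what S3 actually stored.
		head, err := svc.HeadObject(&s3.HeadObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
		})
		if err == nil && head.ContentLength != nil && *head.ContentLength == wantSize {
			return nil // stored size matches the local file
		}
		fmt.Printf("attempt %d: stored object did not match local size %d, retrying\n", attempt, wantSize)
	}
	return fmt.Errorf("failed to verify upload of %s after %d attempts", localPath, maxTries)
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
	if err := uploadAndVerify(sess, "example-preservation-bucket", "example-object.tar", "example-object.tar", 3); err != nil {
		fmt.Println(err)
	}
}
```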


It certainly could have been worse, like if the Glacier copy had been 0 bytes and we had deleted our local copies. I think this experience underscores that APTrust’s (and the digital preservation field’s) framework of bit-level digital preservation is sound for the most part. Keep lots of copies in multiple locations and on multiple media, perform regular fixity checks on those items to make sure they don’t change, and things will remain accessible (with big assumptions of continuous funding, no environmental catastrophe, etc.). We’re reminded that our skepticism of any one cloud solution to do it all is well founded, and of the need to diversify our storage options. In this situation another check needed to be put in place, but the other contingency methods, copies in another system and location, worked as designed.

Despite best efforts to follow digital preservation best practices, there can still be loss. No system is infallible, and I think we, as professionals, need to be OK with that. In most cases we’re going to be all right, and with the many costs of storage it will be hard for most to do better. Our 1 TB has already grown to 1.3 TB at the time of writing, and it will continue to grow in the future. With the safeguards in place, I believe that number is accurate and that the material will remain accessible for the long term.
