I was lucky enough to present at the professional poster session during the 2017 Society of American Archivists annual conference in Portland, OR last week on my NDSR project. My poster, entitled “Bridging the ‘Digital’ Divide: Implementing a Distributed Digital Preservation Program at Georgetown University Library,” is available below; it discusses and visualizes our workflows at Georgetown for moving digital objects into Academic Preservation Trust and for accessioning and processing digital objects held in the archives and special collections. The distributed approach described in this model enabled many staff in the library who may not traditionally be involved with the “digital” side of library or archives work to learn and practice digital preservation in areas related to their everyday work.
I realize I haven’t posted in a while, but I’ve been busy, what can I say? I’ve been continuing to ingest our preservation files into APTrust, creating workflows for collections with ongoing additions, and integrating our workflows with ArchivesSpace. A couple weeks ago we even passed 1 TB ingested into APTrust!
Woohoo! Up to our 1st terabyte in @APTrust and we keep rolling along
With this milestone, I think it’s a good time to reflect on this number and on how sure we are that it’s all there and accessible. While I can say with a good degree of certainty that this material is actually stored in all of its locations now, an event that occurred a few months ago made me less sure that things were as they seemed, and had me thinking that maybe there was nothing there at all.
At that time, I was doing some due diligence by restoring files from APTrust to see if the things we uploaded retained fixity. This was to ensure that both our own processes and those used by APTrust were working correctly. What a restore does is request a bag containing a copy of one of our digital objects back from APTrust’s Amazon S3 long-term storage. When restored, you get something like this (once untarred), which you can then validate for fixity.
The problem in this situation was that the restore never went through. It remained pending in the APTrust web interface until I received a fixity error.
Can it be? A real fixity error? I was thinking that it must be some kind of mistake. I couldn’t try to restore the same digital object with the process still running, so I tried to restore another item that I had recently uploaded. I received the same error. I immediately contacted the main APTrust developer, Andrew Diamond, to see if he could shed some light on the issue.
He got right back to me saying that he was shocked as well. Of the 1.2 million files that had been uploaded to APTrust by members since December 2014, none had ever returned a fixity error. This was literally a one-in-a-million situation. It was great timing, too: the spring APTrust members’ meeting was underway in Miami, Florida, so he couldn’t look at the issue until later in the week.
Once Andrew could take a serious look at the problem, he determined that a bug was affecting files after they had been moved from our APTrust receiving bucket to APTrust’s Amazon Elastic File System. The first time a file was accessed there for copying to Amazon S3 (in Northern Virginia) for long-term storage, it was read as a 0 byte file. Subsequent reads were successful (the second read copies the file to S3 in Oregon and then to Glacier storage).
But the 0 byte files uploaded successfully without S3 sending back an error. APTrust had switched from the unofficial goamz S3 upload library to the official aws-sdk-go library, which only reported whether the upload completed without error, not whether the right file was uploaded. To it, a 0 byte file did not constitute an error: it read a 0 byte file, so writing one didn’t set off any alarms. The system was saying things were perfect while there was in fact nothing there! Quite a trick by Amazon S3.
This was the worst kind of error, a silent one, with no indication that anything was wrong. It affected files over 100 MB uploaded to APTrust from late February to April. Smaller files are processed without untarring their bags, but the new library uses too much system memory when this is done on larger files. This is why APTrust untarred and wrote the larger files to Elastic File System before ingesting them to S3, which is where the bug occurred. In our case at Georgetown, this amounted to 1,500 files! I was worried that all my work for this period would have been for nothing and would need to be redone.
Luckily, all these files also existed in Glacier storage and did not suffer from the 0 byte issue, because those copies were made after the first read of each file. There was still a cost to this: given the steep prices to access items in Glacier, we agreed to spread the copying of these items back into S3 over a week to avoid most of the cost to APTrust. The membership price fortunately gives APTrust reserve funds to use in unlikely incidents like this (or for restores when institutions lose their local copies).
In the end, this error would have been discovered anyway about a month later, during the regular 90-day fixity check on the first items that hit the bug. But that would have meant thousands more files to move and more money spent by APTrust on Glacier access. APTrust was fast-acting and thorough in dealing with the issue. To ensure this doesn’t happen again, APTrust now explicitly checks the result of each S3 upload, even when the Amazon S3 library says it succeeded. If the check fails, the file is automatically re-uploaded until it succeeds.
It certainly could have been worse, say if the Glacier copy had also been 0 bytes and we had deleted our local copies. I think this experience underscores that APTrust’s (and the digital preservation field’s) framework of bit-level digital preservation is sound for the most part. Keep lots of copies in multiple locations and on multiple media while performing regular fixity checks to make sure they don’t change, and things will remain accessible (with big assumptions of continuous funding, no environmental catastrophe, etc.). We’re reminded that our skepticism about any one cloud solution doing it all is well founded, and of the need to diversify our storage options. In this situation another check needed to be put in place, but the other contingency measures, copies in another system and location, worked as designed.
Despite best efforts to follow digital preservation best practices, there can still be loss. No system is infallible, and I think we, as professionals, need to be OK with that. In most cases we’re going to be all right, and given the many costs of storage, it will be hard for most institutions to do better. Our 1 TB has already grown to 1.3 TB at the time of writing, and it will continue to grow. With the safeguards in place, I believe that number is accurate and that the material will remain accessible for the long term.
I started my National Digital Stewardship Residency about two months ago and thought it was about time to put some words on paper (err, well, hypertext on a hosted web server) about part of what I’m working on. One of the main goals of my project at Georgetown University Library is to ingest all the collections selected for long-term preservation into Academic Preservation Trust (APTrust), which is serving as our Trusted Digital Repository, from the multitude of places they currently live. Along the way, I am creating thorough documentation to foster the sustainability of the project, both at Georgetown and for other institutions looking to learn from or implement a similar project.
Our Tools and Workflows
My project will play out in a series of stages based on difficulty. Right now, we are working on the easiest of our collections to ingest into APTrust: the items that have their preservation files and metadata in Georgetown’s DSpace digital repository, Digital Georgetown (or what my mentor has dubbed low-hanging fruit #1). Electronic theses and dissertations (ETDs) comprise most of this set and mainly have PDF files serving as both their access and preservation copies.
One of our developers, Terry Brady, has created a suite of custom tools to ease and automate this process. He built a custom web GUI, an APTrust workflow management system, that we use to keep track of the collections we’re working on and to initiate each step in the process through to completed ingest.
First, based on a collection handle, items are exported along with their METS metadata from DSpace and stored temporarily on a networked drive. Then the GU File Analyzer Tool is run, confirming the checksum of each file within the digital object and creating a valid APTrust bag. Once bagged, the items are sent in batches of 20 to a receiving bucket for ingest onto the APTrust servers. Finally, we run a command line utility that queries the APTrust API and confirms that ingest was successful. The unique APTrust etag and bag name are also written back into the object’s metadata in DSpace.
For more on the ingest process, check out the video Terry made:
Still Needs Improvement
Since we are uploading bags at the item level, due to many of the collections’ expanding nature and their already existing item-level persistent IDs, we are losing the context and ease of determining to which collection each item belongs. One of the questions we are dealing with now is how we would reassemble our collections from preservation storage should we need to in the future (i.e., if there were a catastrophic failure).
Our idea was to store the collection-level handle from DSpace in the Bag-Group-Identifier element of the bag-info.txt tag file, but APTrust does not parse this field or make it searchable through their web interface. Hopefully we can propose a feature request or come up with some other solution. We’re also thinking about uploading separate bags containing a list of all the item handles for a collection in an XML file. Other solutions are welcome!
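For reference, the tag we had in mind would make a bag’s bag-info.txt look something like this (the handle values are placeholders, not real handles):

```
Source-Organization: Georgetown University Library
Internal-Sender-Identifier: [item handle]
Bag-Group-Identifier: [collection handle]
```

If the repository indexed Bag-Group-Identifier, a single search on the collection handle would pull back every item bag belonging to that collection.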
What I’ve Been Doing and Next Steps
After finishing up our digital collections inventory, I’ve been extensively testing these tools over the past few weeks: reporting errors, recommending improvements (like the one mentioned above), and creating documentation. This is all in preparation for training other staff, in hopes of distributing this knowledge to other parts of the library as another way of making the procedures sustainable into the future. I believe a distributed digital preservation model is becoming necessary as digital collections become the norm; all staff, rather than just those with “digital” in their title, need to be empowered to take on this type of work.
While all this has been happening, we’ve made a significant dent in ingesting these easy collections and should be finished soon. And yet, we have only uploaded a modest 10 GB (a small percentage of the total size of our digital collections). We still have a ways to go, so watch this space for updates about coming challenges like integrating our workflows with ArchivesSpace, Embark, and networked and external drives. I’ll also be discussing other aspects of my project, like the improvement of born-digital ingest workflows and the creation of a digital forensics workstation. As always, I appreciate any feedback my colleagues might offer on this work.