Getting the Easy Bits into APTrust

I started my National Digital Stewardship Residency about two months ago and thought it was about time to put some words on paper (er, well, hypertext on a hosted web server) about a part of what I’m working on. One of the main goals of my project at Georgetown University Library is to ingest all the collections selected for long-term preservation into the Academic Preservation Trust (APTrust), which serves as our Trusted Digital Repository, from the multitude of places they currently live. Along the way, I am creating thorough documentation to foster the sustainability of the project, both at Georgetown and for other institutions looking to learn from or implement a similar project.

Our Tools and Workflows

My project will play out in a series of stages based on difficulty. Right now, we are working on the easiest of our collections to ingest into APTrust: the items whose preservation files and metadata already live in Georgetown’s DSpace digital repository, Digital Georgetown (or what my mentor has dubbed low-hanging fruit #1). Electronic theses and dissertations (ETDs) comprise most of this set and mainly have PDF files serving as both their access and preservation copies.

[Screenshot: apt-workflow, the APTrust workflow management system]

One of our developers, Terry Brady, has created a suite of custom tools to ease and automate this process. Among them is a custom web GUI workflow management system for APTrust, which we use to keep track of the collections we’re working on and to initiate each step of the ingest process.

First, based on a collection handle, items are exported along with their METS metadata from DSpace and stored temporarily on a networked drive. Then the GU File Analyzer Tool is run, confirming the checksum of each file within the digital object and creating a valid APTrust bag. Once bagged, the items are sent in batches of 20 to a receiving bucket for ingest onto the APTrust servers. Finally, we run a command line utility that queries the APTrust API and confirms that the ingest was successful. The unique APTrust ETag and bag name are also written back into the metadata of the object in DSpace.
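To give a flavor of what these steps involve, here is a minimal Python sketch of the bag-and-upload pipeline. To be clear, this is not the GU File Analyzer Tool or Terry’s actual code: the bagit and boto3 libraries stand in for the real tooling, and the paths, handles, bucket name, API endpoint, and header names are all hypothetical placeholders.

    import tarfile

    import bagit
    import boto3
    import requests

    # Hypothetical staging path for one exported DSpace item (payload + METS)
    ITEM_DIR = "/mnt/ingest-staging/10822_709999"
    RECEIVING_BUCKET = "aptrust.receiving.example.edu"  # hypothetical bucket name

    # 1. Create a bag in place. bagit computes payload checksums and writes
    #    the manifests and the bag-info.txt tag file.
    bag = bagit.make_bag(
        ITEM_DIR,
        bag_info={
            "Source-Organization": "Georgetown University Library",
            "Internal-Sender-Identifier": "10822/709999",  # made-up handle
        },
        checksums=["md5", "sha256"],
    )
    bag.validate()  # raises bagit.BagValidationError if any checksum fails

    # 2. Serialize the bag as a tarball for transfer.
    tar_path = ITEM_DIR + ".tar"
    with tarfile.open(tar_path, "w") as tar:
        tar.add(ITEM_DIR, arcname="10822_709999")

    # 3. Send the tarball to the receiving bucket (an S3 bucket).
    s3 = boto3.client("s3")
    s3.upload_file(tar_path, RECEIVING_BUCKET, "10822_709999.tar")

    # 4. Later, poll the member API to confirm the ingest succeeded.
    #    The URL and headers below are illustrative only, not the real API.
    resp = requests.get(
        "https://repo.aptrust.example.org/member-api/objects",  # hypothetical
        params={"identifier": "10822_709999"},
        headers={"X-API-User": "...", "X-API-Key": "..."},      # hypothetical
    )
    resp.raise_for_status()
    print(resp.json())

In the real workflow, the batching, error handling, and the write-back of the ETag into DSpace all happen inside the custom tools; the sketch only shows the general shape of each step.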

For more on the ingest process, check out the video Terry made:

Still Needs Improvement

Since we are uploading bags at the item level, owing to many collections’ continually expanding nature and their already existing item-level persistent IDs, we lose the context that makes it easy to determine which collection each item belongs to. One of the questions we are dealing with now is how we would reassemble our collections from preservation storage should we need to in the future (i.e., if there were a catastrophic failure).

[Image: dog in a burning room saying “this is fine”]
Georgetown staff knowing that their content is preserved and findable in APTrust.

Our idea was to store the collection-level handle from DSpace in the Bag-Group-Identifier element of the bag-info.txt tag file, but APTrust does not parse this field or make it searchable through their web interface. Hopefully we can propose a feature request or come up with some other solution. We’re also thinking about uploading separate bags, each containing an XML file that lists all the item handles for a collection. Other solutions are welcome!
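To illustrate both ideas, here is roughly what they might look like; the handle values are made up. First, the bag-info.txt entries for an item bag, with the parent collection’s handle carried in Bag-Group-Identifier:

    Source-Organization: Georgetown University Library
    Bag-Group-Identifier: 10822/551234
    Internal-Sender-Identifier: 10822/709999

And second, a sketch of a collection-level XML manifest that a separate bag could carry, mapping the collection handle to its item handles:

    <collection handle="10822/551234">
      <item handle="10822/709999"/>
      <item handle="10822/710000"/>
    </collection>

Neither format is an APTrust standard; they are just plausible shapes for the metadata we would need in order to reassemble a collection from item-level bags.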

What I’ve Been Doing and Next Steps

After finishing up our digital collections inventory, I’ve been extensively testing these tools over the past few weeks, reporting errors, recommending improvements (like the one mentioned above), and creating documentation. This is all in preparation for training other staff, in hopes of distributing this knowledge into other parts of the library as another way of making the procedures sustainable into the future. I believe that a distributed digital preservation model is becoming necessary as digital collections become the norm, and all staff, rather than just those with “digital” in their title, need to be empowered to take on this type of work.

While all this has been happening, we’ve made a significant dent in ingesting these easy collections and should be finished soon. And yet, we have only uploaded a modest 10 GB, a small percentage of the total size of our digital collections. We still have a ways to go, so watch this space for updates about coming challenges like integrating our workflows with ArchivesSpace, Embark, and networked and external drives. I’ll also be discussing other aspects of my project, like improving our born-digital ingest workflows and building a digital forensics workstation. As always, I appreciate any feedback my colleagues might offer on this work.