Poster Presentation at SAA

Last week I was lucky enough to present on my NDSR project at the professional poster session of the 2017 Society of American Archivists annual conference in Portland, OR. My poster, entitled “Bridging the ‘Digital’ Divide: Implementing a Distributed Digital Preservation Program at Georgetown University Library,” is available below. It describes and visualizes our workflows at Georgetown for moving digital objects into Academic Preservation Trust and for accessioning and processing digital objects held in the archives and special collections. The distributed approach in this model has enabled many library staff who would not traditionally be involved with the “digital” side of library or archives work to learn and practice digital preservation in areas related to their everyday duties.

[Image: SAA poster]


Digital Preservation Prestidigitation

I realize I haven’t posted in a while, but I’ve been busy; what can I say? I’ve been continuing to ingest our preservation files into APTrust, creating workflows for collections with ongoing additions, and integrating our workflows with ArchivesSpace. A couple of weeks ago we even passed 1 TB ingested into APTrust!

With this milestone, I think it’s a good time to reflect on that number and on how sure we are that it’s all really there and accessible. While I can now say with a good degree of certainty that this material is stored in all of its locations, an event a few months ago made me less sure that things were as they seemed, and left me wondering whether there was anything there at all.

At the time, I was doing some due diligence by restoring files from APTrust to see whether the things we had uploaded retained fixity. This was to ensure both that our own processes and those used by APTrust were working correctly. A restore requests a bag containing a copy of one of our digital objects back from APTrust’s Amazon S3 long-term storage. Once the bag is restored and untarred, you get something like this, which you can then validate for fixity.

[Image: APTrust bag contents]
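For the curious, “validate for fixity” just means recomputing each payload file’s checksum and comparing it against the bag’s manifest. Here is a minimal sketch of that check (not our actual tooling), assuming the restored, untarred bag includes a manifest-md5.txt and that the hypothetical directory name restored_bag points at it:

import hashlib
from pathlib import Path

def validate_bag_fixity(bag_dir):
    """Compare each payload file's MD5 against the bag's manifest-md5.txt."""
    bag = Path(bag_dir)
    failures = []
    for line in (bag / "manifest-md5.txt").read_text().splitlines():
        expected_md5, rel_path = line.split(maxsplit=1)
        digest = hashlib.md5()
        with open(bag / rel_path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_md5:
            failures.append(rel_path)
    return failures  # an empty list means fixity held

print(validate_bag_fixity("restored_bag") or "Fixity OK")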

The problem in this situation was that the restore never went through. It remained pending in the APTrust web interface until I received a fixity error.


Could it be? A real fixity error? I thought it must be some kind of mistake. I couldn’t retry the restore on the same digital object while the process was still running, so I tried restoring another item I had recently uploaded and received the same error. I immediately contacted the main APTrust developer, Andrew Diamond, to see if he could shed some light on the issue.

He got right back to me saying that he was shocked as well. Of the 1.2 million files that members had uploaded to APTrust since December 2014, none had ever returned a fixity error. This was literally a one-in-a-million situation. The timing was great, too: it was the week of the spring APTrust members’ meeting in Miami, Florida, so he couldn’t look at the issue until later in the week.

Once Andrew could take a serious look at the problem, he determined that a bug was affecting files after they had been moved from our APTrust receiving bucket to APTrust’s Amazon Elastic File System. The first time a file was accessed there for copying to Amazon S3 (in Northern Virginia) for long-term storage, it was read as a 0-byte file; subsequent reads were successful (the second read copies to S3 in Oregon and then to Glacier storage).

[Diagram: APTrust architecture]

But the 0-byte files uploaded successfully, without S3 sending back an error. APTrust had switched from the unofficial goamz S3 upload library to the official aws-sdk-go library, which only reported whether the upload completed without error, not whether the right file had been uploaded. To the library, a 0-byte file did not constitute an error: it read a 0-byte file, so writing a 0-byte file set off no alarms. The system was saying things were perfect while there was in fact nothing there! Quite a trick by Amazon S3.


This was the worst kind of silent error: there was no indication that anything was wrong. It affected files over 100 MB uploaded to APTrust from late February to April. Smaller files are processed without untarring their bags, but doing that with the new library uses too much system memory for larger files, which is why APTrust untars larger files and writes them to Elastic File System before ingesting them to S3, and that is where the bug occurred. In our case at Georgetown, this amounted to 1,500 files! I was worried that all my work from this period had been for nothing and would need to be redone.

Luckily, all of these files had good copies in Glacier storage that did not suffer from the 0-byte issue, because those copies were made after the first read of each file. There was still a cost: given the steep prices for accessing items in Glacier, we agreed to spread the copying of these items back into S3 over a week to avoid most of the expense to APTrust. Fortunately, membership fees give APTrust reserve funds to use in unlikely incidents like this (or for restores when institutions lose their local copies).

In the end, this error would have been discovered anyway about a month later, during the regular 90-day fixity check on the first items affected. But that would have meant thousands more files to move and more money spent by APTrust on Glacier access. APTrust was fast-acting and thorough in dealing with the issue. To ensure it doesn’t happen again, APTrust now explicitly checks the result of each S3 upload, even when the Amazon S3 library says it succeeded; if the check fails, the file is automatically re-uploaded until it succeeds.
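APTrust’s fix lives in their Go ingest code, which I haven’t seen, but the general “trust but verify” pattern is easy to illustrate. Here is a minimal, hypothetical sketch using Python and boto3 (the bucket, key, and retry count are placeholders): after the SDK reports a successful upload, ask S3 how big the stored object actually is and re-upload if it doesn’t match the local file.

import os
import boto3

s3 = boto3.client("s3")

def upload_and_verify(local_path, bucket, key, max_attempts=3):
    """Upload a file to S3, then confirm the stored object's size matches the local file."""
    local_size = os.path.getsize(local_path)
    for attempt in range(1, max_attempts + 1):
        s3.upload_file(local_path, bucket, key)        # the SDK reports success here...
        head = s3.head_object(Bucket=bucket, Key=key)  # ...but check what actually landed
        if head["ContentLength"] == local_size:
            return
        print(f"Attempt {attempt}: size mismatch for {key}, re-uploading")
    raise RuntimeError(f"{key} never reached S3 at the expected size")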


It certainly could have been worse: the Glacier copy could have been 0 bytes, or we could have deleted our local copies. I think this experience underscores that APTrust’s (and the digital preservation field’s) framework of bit-level digital preservation is sound for the most part. Keep lots of copies in multiple locations and on multiple media, perform regular fixity checks to make sure those copies don’t change, and things will remain accessible (with big assumptions about continuous funding, no environmental catastrophe, and so on). We were reminded that our skepticism about any one cloud solution doing it all is well founded, and of the need to diversify our storage options. In this situation another check needed to be put in place, but the other contingency measures, copies in another system and location, worked as designed.

Despite best efforts to follow digital preservation best practices, there can still be loss. No system is infallible, and I think we, as professionals, need to be okay with that. In most cases we’re going to be all right, and given the many costs of storage it will be hard for most institutions to do better. Our 1 TB has already grown to 1.3 TB at the time of writing, and it will continue to grow. With the safeguards now in place, I believe that number is accurate and that the content will remain accessible for the long term.

Moving Stuff Around: An NDSR Project Update

About a month ago I finished the first and easiest part of my project, uploading digital objects with preservation copies in DSpace into APTrust, which I mentioned in my last post. I also trained staff in these workflows and worked with our developer to improve our tools for upload into APTrust (you can read more about the overall process in my post for The Signal). I’m continuing that work as I move through stage two of my NDSR project: ingesting items into APTrust that have metadata in DSpace but preservation copies on network or external drives. While the previous process was almost completely automated, this new workflow requires more human intervention.

Files, files, everywhere

[Image: pile of hard drives]

Copies of preservation files are dispersed throughout the library in various systems and on various media. They are on external hard drives that reside in three different departments, on three network drives (often with non-intuitive organization), and on the University’s streaming service servers. While I generated inventories for all our drives at the beginning of my residency, at the time I didn’t check each copied collection for completeness.

Files are duplicated for redundancy, but a given location sometimes doesn’t hold the full collection. As you can imagine, this causes some headaches, and just finding where all of these files live can be a large part of the process.
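One habit that helps here is keeping inventories that record sizes and checksums, not just paths, so incomplete or mismatched copies stand out when locations are compared. A rough sketch of that kind of inventory script, not our actual inventory tool, with a placeholder drive letter and output file name:

import csv
import hashlib
from pathlib import Path

def inventory_drive(root, out_csv):
    """Walk a drive and record each file's path, size, and MD5 in a CSV inventory."""
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "size_bytes", "md5"])
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            writer.writerow([str(path), path.stat().st_size, digest.hexdigest()])

inventory_drive("E:/", "external_drive_inventory.csv")  # placeholder drive and file name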

Ingesting these materials into APTrust works much like the upload of stage-one items, but it requires human intervention to transfer the files before the tool bags the item(s) and sends them to the APTrust AWS bucket for ingest. I start our tool for each item, and it exports the DSpace metadata into a folder on a network drive named after the item’s handle (this will become the data folder in the bag). Then it’s my job to load the preservation files into that folder.

Computer, move these files from here to here


When I started, the workflow called for manually copying the preservation files into the bagging folder. This was tedious and prone to human error, so I began thinking about whether there was an easy way to automate the process.

Although there might be more sophisticated ways to script this, I thought the Robocopy command would work for this situation and seemed like the easiest option. It’s a simple Windows command-line utility: all it requires is a source directory path and a destination directory path, with the option to specify which file(s) you want to copy.

robocopy <Source> <Destination> [<File>[ ...]] [<Options>]

The problem was that we’re bagging most of this content at the item level, and for many collections one item = one file (often hundreds of items in a collection), each of which has to be transferred to its own bagging folder. This was compounded by the fact that there was no easy way to identify which file went with which folder without visiting each item’s page in DSpace to determine its handle and file name. At that point, finding the inputs and writing out hundreds of commands by hand didn’t seem like it would be much faster than manual copying.

Fixing the problem

One way I thought this could be remedied was with our DSpace self-service query tool. This is part of a suite of batch tools created by our developer, Terry Brady, and it provides an easy-to-use GUI for querying the API of our DSpace instance and returning metadata about our collections. Results are returned in a table and can be exported to Google Sheets as CSV.

[Screenshot: DSpace self-service query results]

What I needed was the file name for each item in a collection, but that wasn’t an available option in the tool at the time. Luckily, Terry was able to fulfill my enhancement request to return the original uploaded file name from the bitstream information in DSpace. With this addition, and by also selecting each item’s handle, I had a way of matching files to the bagging folders they belong in.
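Our self-service query tool is custom to Georgetown, but for anyone without it, the same pairing of handles and original file names can be pulled from DSpace’s (legacy) REST API. A rough sketch, assuming a DSpace 5/6-style REST interface, a placeholder base URL, and a placeholder internal collection ID:

import requests

BASE = "https://repository.example.edu/rest"  # placeholder DSpace REST base URL
COLLECTION_ID = 123                           # placeholder internal collection ID

def items_with_filenames(base=BASE, collection_id=COLLECTION_ID):
    """Pair each item's handle with the names of its ORIGINAL-bundle bitstreams."""
    items = requests.get(f"{base}/collections/{collection_id}/items",
                         params={"limit": 1000}).json()
    rows = []
    for item in items:
        bitstreams = requests.get(f"{base}/items/{item['id']}/bitstreams").json()
        names = [b["name"] for b in bitstreams if b.get("bundleName") == "ORIGINAL"]
        rows.append((item["handle"], names))
    return rows

for handle, names in items_with_filenames():
    print(handle, names)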

.csv > .bat

Now that I had the key elements of the Robocopy command for each item, with all the information in one place, a few quick changes could turn the data imported into Google Sheets into the basis for a batch file that would automatically transfer the files to their bagging locations.

[Screenshot: results exported to Google Sheets]

In Google Sheets, I can change each handle into the item’s destination file path with find and replace. After that, I find the location of the preservation files and add the source directory path, checking that the preservation file names match what was returned from DSpace. Sometimes the names differ slightly (frustrating if I don’t catch it), but usually this only requires a simple find and replace of the file extension (for instance, .jpg to .tif). I also add columns for other parts of the command, such as the /log+ option, which writes the command output to a log file so I can tell if anything goes wrong, appending each new run’s output to the same file.

[Screenshot: CSV turned into a batch file]

With each command line complete, I download the sheet as a CSV file, open it in a text editor, find and replace the commas with spaces, and save it as a .bat file. Although running the batch and copying the files can take a while, the time spent determining which files go where and initiating the transfer has been greatly reduced. Once the items are copied, I just click validate in our upload tool to confirm that the contents of each bag look correct, and the rest of the ingest process is taken care of.
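The spreadsheet-and-text-editor conversion works fine, but the same step could be scripted. Here is a rough sketch that reads the exported CSV and writes the robocopy lines to a .bat file, assuming hypothetical column names (source_dir, dest_dir, filename) rather than our actual sheet layout:

import csv

def csv_to_batch(csv_path, bat_path, log_file="robocopy_log.txt"):
    """Turn an exported spreadsheet into a batch file of robocopy commands."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(bat_path, "w", encoding="utf-8") as bat:
        for row in csv.DictReader(src):
            # One robocopy command per file, appending output to a shared log via /log+
            bat.write(
                f'robocopy "{row["source_dir"]}" "{row["dest_dir"]}" '
                f'"{row["filename"]}" /log+:{log_file}\n'
            )

csv_to_batch("robocopy_sheet.csv", "transfer_files.bat")  # placeholder file names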

Next Steps

Once I’m done uploading these materials and documenting the workflow, I’ll train some other librarians to continue the process of distributing digital preservation knowledge and practice throughout the library. We’re also working on our next workflow, which will integrate our tool with ArchivesSpace and extract metadata for inclusion with bags for born-digital and digitized archival collections. Concurrently, I’ve been working on born-digital accessioning and forensic workflows to transfer data off of obsolete media, with that data eventually ending up in APTrust as well. I hope to have some posts on these up in the future!

Getting the Easy Bits into APTrust

I started my National Digital Stewardship Residency about two months ago and thought it was about time to put some words on paper (er, well, hypertext on a hosted web server) about part of what I’m working on. One of the main goals of my project at Georgetown University Library is to ingest all the collections selected for long-term preservation, from the multitude of places they currently live, into Academic Preservation Trust (APTrust), which serves as our trusted digital repository. Along the way, I am creating thorough documentation to foster the sustainability of the project, both at Georgetown and for other institutions looking to learn from or implement a similar project.

Our Tools and Workflows

My project will play out in a series of stages based on difficulty. Right now, we are working on the easiest of our collections to ingest into APTrust: items that have both their preservation files and metadata in Georgetown’s DSpace digital repository, Digital Georgetown (or what my mentor has dubbed low-hanging fruit #1). Electronic theses and dissertations (ETDs) make up most of this set, with PDF files mostly serving as both the access and preservation copies.

[Diagram: APTrust ingest workflow]

One of our developers, Terry Brady, has created a suite of custom tools to ease and automate this process. He built a web-based APTrust workflow management system that we use to keep track of the collections we’re working on and to initiate each step of the ingest process.

First, based on a collection handle, items are exported from DSpace along with their METS metadata and stored temporarily on a networked drive. Then the GU File Analyzer Tool is run, confirming the checksum of each file within the digital object and creating a valid APTrust bag. Once bagged, the items are sent in batches of 20 to a receiving bucket for ingest into APTrust. Finally, we run a command-line utility that queries the APTrust API and confirms that ingest was successful; the unique APTrust etag and bag name are also written back into the object’s metadata in DSpace.
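For readers without access to the GU File Analyzer, the bagging-and-checksum step can be approximated with the open source bagit-python library. This is only an illustrative sketch (the folder name and bag-info values are placeholders), and a real APTrust bag additionally needs an aptrust-info.txt tag file that bagit does not create for you:

import bagit

# Turn an exported item folder (named after its handle) into a BagIt bag
# with MD5 and SHA-256 manifests covering every payload file.
bag = bagit.make_bag(
    "10822_XXXXXX",  # placeholder folder name based on the item handle
    {"Source-Organization": "Georgetown University Library"},
    checksums=["md5", "sha256"],
)

# Re-verify the manifests before the bag is tarred and sent to the receiving bucket.
if bag.is_valid():
    print("Bag is valid and ready for upload")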

For more on the ingest process, check out the video Terry made:

Still Needs Improvement

Since we are uploading bags at the item level, both because many of the collections keep growing and because item-level persistent IDs already exist, we lose the context and the ease of determining which collection each item belongs to. One of the questions we are dealing with now is how we would reassemble our collections from preservation storage should we ever need to (i.e., after a catastrophic failure).

[Image: dog in a burning room saying “this is fine”]
Georgetown staff knowing that their content is preserved and findable in APTrust.

Our idea was to store the collection-level handle from DSpace in the Bag-Group-Identifier element of the bag-info.txt tag file, but APTrust does not parse this field or make it searchable through their web interface. Hopefully we can propose a feature request or come up with another solution. We’re also thinking about uploading separate bags containing a list of all the item handles for a collection in an XML file. Other solutions are welcome!
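For illustration, the idea is simply to carry the owning collection’s handle in the bag’s own metadata, so a bag-info.txt might contain lines like these (the values are placeholders):

Source-Organization: Georgetown University Library
Bag-Group-Identifier: 10822/XXXXXX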

What I’ve Been Doing and Next Steps

After finishing our digital collections inventory, I’ve spent the past few weeks extensively testing these tools, reporting errors, recommending improvements (like the one mentioned above), and creating documentation. This is all in preparation for training other staff, in hopes of distributing this knowledge to other parts of the library as another way of making the procedures sustainable. I believe a distributed digital preservation model is becoming necessary as digital collections become the norm: all staff, not just those with “digital” in their title, need to be empowered to take on this type of work.

While all this has been happening, we’ve made a significant dent in ingesting these easy collections and should be finished soon. And yet we have only uploaded a modest 10 GB, a small percentage of the total size of our digital collections. We still have a ways to go, so watch this space for updates on coming challenges like integrating our workflows with ArchivesSpace, Embark, and networked and external drives. I’ll also be discussing other aspects of my project, such as improving born-digital ingest workflows and creating a digital forensics workstation. As always, I appreciate any feedback my colleagues might offer on this work.

Open Digital History or “I Have No Idea What I’m Doing”

While computers have opened up the world around us (in some good and some bad ways), the process of writing and disseminating history has remained largely closed to public view. How can historians leverage the internet and the digital environment to write more publicly accessible history? In what ways can the historian’s process be enhanced and made more transparent through the use of digital tools?

This semester I’ve begun researching and writing my master’s thesis in history (for more on that, go here), which I plan to finish by next spring. While I’m writing it, I hope to experiment with some methods for answering the questions mentioned above. I owe much of my inspiration to reading about the work of others in pieces like “Open History Notebook” by W. Caleb McDaniel, “Writing The Historian’s Macroscope in Public” by Shawn Graham, Ian Milligan, and Scott Weingart, and “Curating in the Open: Martians, Old News, and the Value of Sharing as You Go” by Trevor Owens.

Each of these authors discusses the benefits of sharing your data or your notes: it can bring greater transparency, publicity, engagement, help from others, and novel ideas. Beyond all that, historians should give the consumers of their historical work the ability to come to their own conclusions. As anyone who has practiced history knows, every argument has its biases and there are multiple ways to interpret evidence. McDaniel eloquently explains this impetus behind sharing historical data in “Open History Notebook”:

“providing our data will have less to do with a desire to make our experiments reproducible, and more to do with a belief that historical arguments are on a fundamental level irreproducible. Each one is the product of a particular person or group of people at a particular time and place.”

In seeking to understand the past and to better comprehend who we are as human beings, historians must acknowledge a multiplicity of understandings and help foster them by opening up their research process. Below I outline the methods I am using to work toward this goal.

“I Have No Idea What I’m Doing”

But some of you may be saying, “Digital open history?! I have no idea how to do that stuff.” That’s pretty much how I feel too. The “I have no idea what I’m doing” feeling is still with me, even as I am doing what I set out to do. That feeling is often a conceptual barrier to jumping in and learning along the way. Some of the authors mentioned above used methods too technically advanced or too prohibitive in resources (time or money) for me, and there are certainly many projects that are even more intimidating. I am trying to do what I can and not let impostor syndrome get the best of me.

My goal is to use free, ready-made tools and to create a workflow that burdens me as little as possible. This workflow is still in progress and being tweaked, and some pieces haven’t even been implemented yet, but I hope it can benefit someone, somewhere, someday. Note: this may all be made obsolete by RRCHNM’s exciting new image management tool Tropy, being built as I write.

Flickr

Sharing has become much more efficient with the rise of social media. Trevor Owens in particular demonstrated that sharing primary sources on social media as you go can get the public really excited about what you are doing (be it an exhibit, an article, or a book). Unfortunately, rights restrictions can be an obstacle to sharing materials online (as can paid databases with terms of service that ban sharing, and so on).

However, since much of the material I am looking at was published before 1923 and is thus in the public domain, I am trying to share as much as possible. As you may notice on the side of this page, I am uploading some images to Flickr so they can be easily displayed on this blog. Flickr is a simple tool for uploading images while effectively managing tags and notes and sharing sources. For those interested in my project, you can also follow along on my Flickr page through the McMillan Park album or through its RSS feed in any standard RSS reader. I will also periodically share photos on Twitter to generate more interest.

IFTTT

[Image: IFTTT recipe: if picture from phone, then upload to Flickr]

To make this process quicker and easier, I’ve been using a service called IFTTT (If This Then That). This app lets you easily create or reuse “recipes” (neither the Emeril Lagasse nor the Walter White type) that connect apps and perform functions in the background as actions in one app trigger actions in another. For work in the archives, I often use the camera on my phone to copy material for later viewing. With this recipe, which runs whenever a photo is saved to my phone’s gallery, pictures are automatically uploaded to Flickr at the same time. This also works with most other imaging apps, since most will create a folder in your gallery, which triggers the upload.

In the case of online collections, I can skip this first step by downloading images or screenshots and uploading them directly to Flickr. Then, using another recipe, I can save a copy to Google Drive for further backup without any extra work. This also works in succession with the first recipe: as a photo is automatically uploaded to Flickr, it is then also saved to Google Drive.

Google Sheets + OpenRefine = Omeka

Ultimately, at the end of all this sharing, I hope to have the basis for a coherent Omeka site that will be more understandable to the public than my academic thesis (but who knows). For those who don’t know, Omeka is a platform from the Roy Rosenzweig Center for History and New Media for hosting online collections and building exhibits. My site is just a shell right now, but I think it will be a good way of bringing my results into public view.


In preparation for this goal, I am running another recipe concurrently with the others that creates a row in a Google spreadsheet each time an image is uploaded to Flickr. Each row includes the basic metadata of the image, such as its title, tags, and a link to the image. The link is the most important part, as it is essential for uploading the images to Omeka via the CSV importer plugin.

UPDATE: It turns out that the link sent to the Google spreadsheet points to the Flickr image page, not to the URL for the image file itself, which is what the CSV importer requires. I have yet to find a way to change this in bulk, so I will have to gather the links manually if I cannot find a solution.

Once I am done uploading images and the spreadsheet is complete, I can download it as a .csv file and edit it in OpenRefine to meet the importer’s template guidelines. OpenRefine is a great free data clean-up tool whose basics are fairly easy to learn but which also has more advanced and powerful options. Hopefully this complicated, domino-like process all works and my images will be transferred seamlessly into Omeka.

There you have it: my hacky attempt to do history openly and digitally. I hope this has sparked ideas in others (particularly with regard to automation). I’d love to hear comments and suggestions below about improvements or your own processes!

On a side note, I am also sharing my research notes in a public Google Doc, but it is mostly confusing drivel right now.

Presentation at ETDG

A few weeks ago I gave a presentation to the Emerging Technologies Discussion Group (ETDG) at the University of Maryland Libraries. The group meets once a month and features presenters discussing technology applicable to the work of librarians or archivists. The August meeting featured me and Catherine Bloom, a fellow student at UMD’s iSchool. I presented on stereomap (stereomap.weebly.com), my project mapping stereographic views of New York City, completed as part of my Digital Public History course this past spring. The presentation covered some notable examples that inspired the project, the tools I considered for it, and the process I went through to complete the final product. I emphasized that one does not need to be a technical expert to undertake a project like this (although it does help) and that there is a vast array of tools online for bringing collections to the public’s attention and into different contexts. Check out the slides here: slides.com/jcarrano/dighist-maps, or you can read more about the project on the class blog that I linked to above.