About a month ago I finished the first and easiest part of my project: uploading digital objects with preservation copies in DSpace into APTrust, which I mentioned in my last post. I also trained staff in these workflows and worked with our developer to improve our tools for upload into APTrust (you can read more about this overall process in my post for The Signal). I’m continuing to do this as I work through stage two of the main part of my NDSR project: ingesting into APTrust items that have metadata in DSpace but whose preservation copies live on network or external drives. While the previous process was almost completely automated, this new workflow requires more human intervention.
Files, files, everywhere
Copies of preservation files are dispersed throughout the library in various systems and media. They are on external hard drives that reside in three different departments, within three network drives (often with non-intuitive organization), and within the University’s streaming service servers. While I generated inventories for all our drives at the beginning of my residency, I didn’t check at the time whether each copied collection was complete.
Collections are duplicated for redundancy, but a given location sometimes doesn’t hold the full set of files. As you can imagine, this causes me some headaches, and just finding where all of these files live can be a large part of the process.
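One way to spot incomplete copies is to compare the inventories against each other. The following is only a minimal Python sketch of that idea, not the method I used at the time; the location names and file names are made up, and it assumes each inventory has already been reduced to a set of file names.

```python
def missing_by_location(inventories):
    """Map each location to the file names it lacks, relative to the
    union of all files seen in any location."""
    all_files = set().union(*inventories.values())
    return {
        location: sorted(all_files - set(files))
        for location, files in inventories.items()
    }


# Hypothetical inventories: location name -> file names found there.
inventories = {
    "external_drive": {"photo_001.tif", "photo_002.tif", "photo_003.tif"},
    "network_drive": {"photo_001.tif", "photo_003.tif"},
}

for location, missing in missing_by_location(inventories).items():
    print(location, "is missing", missing or "nothing")
```

Comparing by file name alone would not catch files that share a name but differ in content; checksums would be needed for that.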
Ingesting these materials into APTrust works much like the upload of stage one items, except that someone has to transfer the files before the tool bags the item(s) and sends them to the APTrust AWS bucket for ingest. I start our tool for each item, which then exports the DSpace metadata into a folder on a network drive with a name based on the item’s handle (this folder will become the data folder in the bag). Then it’s my job to load the preservation files into that folder.
Computer, move these files from here to here
When I started, the workflow called for manual copying of the preservation files into the bagging folder. This was tedious and had a strong likelihood of human error, so I began thinking about whether there was an easy way to automate this process.
Although there might be ways to do this programmatically, I thought the Robocopy command would work for this situation and seemed like the easiest option. Robocopy is a simple Windows command line utility: all it requires is a source directory path and a destination directory path, and it gives you the option to specify which file(s) you want to copy.
robocopy <Source> <Destination> [<File>[ ...]] [<Options>]
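To make that syntax concrete, here is a small Python sketch that assembles one such command line for a single file. The drive letters, folder names and file name are hypothetical, and the helper name `robocopy_line` is my own invention for illustration.

```python
def robocopy_line(source_dir, dest_dir, *file_names):
    """Assemble one Robocopy command line: source, destination, then files."""

    def quote(part):
        # The Windows command line splits on spaces, so quote any part
        # that contains one.
        return f'"{part}"' if " " in part else part

    parts = ["robocopy", source_dir, dest_dir, *file_names]
    return " ".join(quote(part) for part in parts)


# Hypothetical source folder, bagging folder and file name.
print(robocopy_line(r"E:\preservation\collection_a",
                    r"N:\bagging\12345_678",
                    "photo_001.tif"))
```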
The problem was that we’re bagging most of this content on an item level, and for many collections, one item = one file. A collection often has hundreds of files, and each one has to be transferred to its own specific bagging folder. This was compounded by the fact that there wasn’t an easy way to identify which file went with each folder without going to each page individually in DSpace to determine the item’s handle and file name. At this point, finding the inputs and writing out hundreds of commands on the command line didn’t seem like it would be much faster than manual copying.
Fixing the problem
One way I thought this could be remedied was through the use of our DSpace self-service query tool. This is part of a suite of batch tools created by our developer Terry Brady, and it provides an easy-to-use GUI to query the API of our DSpace instance and return metadata about our collections. Results are returned in a table and can be exported to Google Sheets as CSV.
What I needed was the file name for each item in a collection, but that wasn’t an available option in the tool at the time. Luckily, Terry was able to fulfill my enhancement request to return the original uploaded file name from the bitstream information in DSpace. With this addition, and by also choosing to return each item’s handle, I had a way of matching each file to the bagging folder it goes into.
.csv > .bat
Now that I had some of the key elements of the Robocopy command for each item, and with all the information in one place, I could make a few quick changes to turn the information imported into Google Sheets into the basis for a batch file that will automatically transfer the files to their bagging location.
In Google Sheets, I can change the handle into the destination file path of the item through find and replace. After that, I find the location of the preservation files and get the source directory path, and I check whether the preservation file names match what was returned from DSpace. Sometimes these names will differ slightly (frustrating if I don’t catch it), but for the most part fixing them only requires a simple find and replace of the file extension (for instance .jpg to .tif). I also add columns for other parts of the command, such as the /log+ option, which writes the command output to a log file and appends each new output to the same file, so I can determine if anything goes wrong.
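The spreadsheet edits described above could also be expressed in code. This is a rough Python sketch rather than what I actually ran: the bagging root, log file path, handle and file names are all hypothetical, and the rule for turning a handle into a folder name (replacing "/" with "_") is an assumption made for the example.

```python
from pathlib import PureWindowsPath

# Hypothetical locations; substitute real paths.
BAGGING_ROOT = r"N:\bagging"
LOG_FILE = r"N:\bagging\robocopy.log"


def bag_folder(handle):
    """Turn a handle such as '12345/678' into a destination folder path.

    Assumption: the folder name is the handle with '/' replaced by '_'.
    """
    return str(PureWindowsPath(BAGGING_ROOT) / handle.replace("/", "_"))


def preservation_name(dspace_name, new_ext=".tif"):
    """Swap the extension DSpace reported for the preservation file's one."""
    return str(PureWindowsPath(dspace_name).with_suffix(new_ext))


def command_row(handle, dspace_name, source_dir):
    """Return the cells of one spreadsheet row, in Robocopy argument order."""
    return [
        "robocopy",
        source_dir,
        bag_folder(handle),
        preservation_name(dspace_name),
        f"/log+:{LOG_FILE}",  # append this run's output to one shared log
    ]


row = command_row("12345/678", "photo_001.jpg", r"E:\preservation\collection_a")
print(" ".join(row))
```

An extension swap only covers the common case; file names that differ in other ways still need a manual check.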
With each command line complete, I download the sheet as a CSV file. I then open it up in a text editor, find and replace the commas with spaces, and save it as a .bat file. Although running the batch and copying the files can often take time, the time spent determining which files go where and initiating the transfer has been greatly reduced. Once the items are copied, I just need to click validate in our upload tool to confirm the contents of each bag look correct, and the rest of the ingest process is taken care of.
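The comma-to-space step works as long as no path contains a space or a comma. As a hedged alternative, the Python sketch below does the same conversion with the standard csv module and puts quotes around any cell that contains a space. The sample rows and paths are hypothetical, and `csv_to_batch` is a name invented for this example.

```python
import csv
import io


def quote(cell):
    """Wrap a cell in double quotes when it contains a space."""
    return f'"{cell}"' if " " in cell else cell


def csv_to_batch(csv_text):
    """Convert CSV rows of command parts into the text of a batch file."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:  # skip blank rows
            lines.append(" ".join(quote(cell) for cell in cells))
    # Batch files conventionally use Windows line endings.
    return "\r\n".join(lines) + "\r\n"


# Two hypothetical rows as they might come out of the spreadsheet.
sample_csv = "\n".join([
    r"robocopy,E:\preservation\collection a,N:\bagging\12345_678,photo_001.tif,/log+:N:\bagging\robocopy.log",
    r"robocopy,E:\preservation\collection a,N:\bagging\12345_679,photo_002.tif,/log+:N:\bagging\robocopy.log",
])

batch_text = csv_to_batch(sample_csv)
print(batch_text)
```

The resulting text can then be saved with a .bat extension and run as before.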
Once I’m done uploading these materials and documenting the workflow, I’ll train some other librarians in it, continuing the process of distributing digital preservation knowledge and practice throughout the library. We’re also working on our next workflow, which will integrate our tool with ArchivesSpace and extract metadata for inclusion with bags for born-digital and digitized archival collections. Concurrently, I’ve been working on born-digital accessioning and forensic workflows to transfer data off of obsolete media, with that data eventually ending up in APTrust. I hope to have some posts on these up in the future!