This article is from our guest blog contributor, Julia A. Lovett, Associate Professor, Digital Initiatives Librarian, University of Rhode Island Libraries
This post has two purposes: firstly, to show how you can use Open Refine to easily edit a spreadsheet and transform data from Excel (tab-delimited) format to XML; secondly, and more specifically, to show how I used this method to add Crossref DOIs to an IR (institutional repository) collection. In this case, the IR platform is Digital Commons and the collection is dissertations and theses.
For some background, after the University of Rhode Island Libraries joined Crossref, I wanted to create new DOIs for our ETD (electronic theses and dissertations) collections on the IR, DigitalCommons@URI. I was familiar with creating DOI’s one by one using the Crossref web deposit form, where you can enter your new DOI along with its associated metadata and URL. But with thousands of ETD’s, this would need to be done as a batch deposit. I would need to somehow get all of the metadata from our ETD collections into Crossref, along with a new DOI for each title.
It turns out that to register a whole bunch of new DOIs at once in Crossref, you need to deposit the metadata in XML format. But as other Digital Commons administrators will know, this particular IR platform can only export metadata in Excel format. So, how to transform Excel data into XML, and specifically the Crossref XML format? To a librarian with some technical knowledge but not a whole lot of coding experience, this seemed a little intimidating at first.
Here’s where OpenRefine comes in handy. I first learned some basics of this tool at a Library Carpentry workshop held at Brown University, and then URI’s data librarian, Harrison Dekker, recommended it to me again for this particular project. This article on converting spreadsheets to MODXML was also very helpful.
I had heard OpenRefine described as “Excel on steroids,” which seems accurate. It’s free, browser-based, and fairly easy to start using. After some trial and error, I was able to extract metadata from Digital Commons in Excel format, edit it in Open Refine, and transform it into Crossref-ready XML. For my fellow librarians and anyone facing a similar task, here is that workflow.
Preparing the Data in Excel
- Export your metadata from Digital Commons using the Batch Revise function.
- If you haven’t already, decide how your DOIs should be constructed. In our case, I used a combination of “diss” or “thesis”, the author’s first and last name, and the publication date. For example, here is a DOI link for a 2020 dissertation by Nayanthara Dharmaratne: https://doi.org/10.23860/diss-dharmaratne-nayanthara-2020
- Create a DOI Prefix column, which you’ll later combine with other information to create the full DOI. The prefix consists of your assigned DOI prefix, plus any other initial characters that will be the same for all DOIs. In my case, I used “10.23860/diss” for dissertations and “10.23860/thesis” for theses. Make sure the DOI prefix appears in each row in the spreadsheet.
- Delete all extraneous columns, keeping the ones you’ll eventually want to include in your Crossref deposit. I kept author first and last names, titles, DOI prefix, year of publication, and URL (called calc_URL).
Editing in Open Refine
- Open up OpenRefine, and upload your spreadsheet as a new project.
- First off, your metadata may contain special characters that need to be removed or replaced (“escaped”) so that they work as part of an XML file. You can do this by clicking on the dropdown menu on any column and then using Edit Cells –> Replace
- Ampersands need to be escaped by using & instead of &
- Angle brackets are escaped as < and > for < and > and quotes are escaped as " for “
- For angle brackets, they’re from HTML tags so you’ll need to remove <em>, </em>, <sup>, </sup>, <sub>, </sub> and any other HTML tags
- Ensure the publication date column is formatted as date by clicking Edit cells –> Common transforms –> To date
- Transform the date to show only the year
- go to Edit cells –> Transform and in the Expression text box, enter: value.datePart(“year”)
- Create the DOI’s by combining the DOI Prefix column with information from other columns.
- From the DOI Prefix column, click on the dropdown menu, then Edit column –> Join columns
- Select the columns you want to combine and drag them into the correct order, and enter your separator character of choice.
- Rename the column from “DOI Prefix” to “DOI.”
- At this point, it’s a good idea to check your data for cleanup issues. For example, I ended up with some spaces in my DOIs, and duplicate DOIs that needed to removed.
- Remove any diacritics and apostrophes from the DOI column using Edit Cells –> Replace for each character as needed (these will be rejected by Crossref so you need to remove them).
Transforming the Data from Excel to XML Format
This chunk of the workflow shows how to export the data to Crossref’s XML schema for dissertations with a single author. For other works, check out Crossref’s extensive documentation on their metadata schema. My thanks also goes to Shayn Smulyan of Crossref Technical Support for answering tons of questions and sending me this example XML for dissertations.
- In the top right corner, click on Export –> Templating to bring up the export window. From here you’ll need to copy-paste in your XML code and replace the contents of each element with some JSON code for your data. For example, here is the code for the <surname> element, where “author1_lname” is the column title for author last names:
Feel free to copy from my file, also available on Github:
- Prepare your XML in a separate file and then copy-paste it into the left panel of the Templating Export window. Make sure to paste the Prefix, Row Template, and Suffix into the correct boxes. Leave the separator box blank.
- Edit <doi_batch_id> to something unique. You will need to change the batch ID each time you submit a file to Crossref. I used the current date and time in the format etd-YYYY-MM-time (where “time” is four digits representing hours and minutes)
- Similarly, edit <timestamp> to something unique. I used the current date and time, in the format YYYYMMDDtime
- Click Export
- Save your file and change the name to something unique (I used the timestamp) and change the extension to .xml
- Deposit the data to Crossref using the directions at https://www.crossref.org/education/member-setup/upload-xml-files-via-our-admin-tool/
Finally, don’t forget to add the new DOI’s to the records in Digital Commons!
- Download a new Batch Revise spreadsheet
- Copy-paste the DOI’s from Open Refine into the DOI column. Be careful that everything lines up correctly.
- Upload the new spreadsheet to Digital Commons.
Guest Blog Contributor: Julia A. Lovett, Associate Professor, Digital Initiatives Librarian, University of Rhode Island Libraries
Do you have thoughts or ideas about core library work that you’d like to share with the community? Submit your article to become featured as a blog contributor for CoreNews.