The two-day “Regarding Library Database Cleanup” e-Forum that was continued from October 2020 in December 2020 had 152 messages posted from 31 participants who were from academic or public libraries or vendors. Our discussions included the types of projects undertaken and some best practices, the methods that revealed the need for these projects, and the software or programs that are used to carry out, track or document the projects. Our final discussions surrounded barriers to performing these projects as well as the underlying philosophy with which these projects are carried out. Finally, the resources shared by the three co-hosts and participants can be found at the end of this summary.
The participants used Voyager, Alma, or Aleph (from Ex Libris), Symphony or EOS (SirsiDynix), Sierra (Innovative) or LS2 (TLC) and reported that they work with all types of records (bibliographic, holdings, item, order, and authority records) for all types of formats. The database cleanup projects the participants perform are discovered through working on other types of projects (location changes, weeding, inventorying, retrospective conversion, or migrating to a new ILS/LSP as was discussed in the November 2020 Core e-Forum). Projects were also frequently discovered incidentally; by encountering an issue in few records, which led to investigating if the issue occurred more broadly. Participants reported using the Springshare LibAnswers product or the osTicket support ticket system for error reports by internal users, and another mentioned having a link for patrons to report errors from within the discovery layer.
Best practices that respondents apply to their database projects included making changes in a test environment before moving to the live, or production, catalog or at the very least, making the changes on a small batch of test items in the live catalog before applying the changes to an entire set of records. A few others agreed with this approach in addition to saving original spreadsheets or MARC records to serve as backup copies if the changes backfire. Several respondents whole-heartedly agreed with the tactic of getting to know your database and its “dragons” before starting major cleanup work, especially if the database is new to you.
An overwhelming number of respondents use Excel and MarcEdit for these database cleanup projects on a regular basis with training on these programs being obtained through colleagues, experimentation, or online tutorials, along with MarcEdit support available through Terry Reese’s YouTube videos and the MarcEdit listserv. OpenRefine, Regular Expressions, and Python were the next most used tools, training for all of which can be found online. Other tools mentioned included LibreOffice Calc, SQL, Alma Normalization Rules, R, and Aleph’s global change function. Many participants noted that their ILS/LSP have built-in reporting software that can assist in identifying metadata that may require editing. Several people responded that they wish they had the time and occasion to learn and apply many of the same tools: OpenRefine, Python, Alma Normalization Rules, and Regular Expressions to name a few.
To review and evaluate projects, participants re-run queries and/or reports to verify that all the changes happened as expected and check for any remaining items needing to be edited. Some respondents export data into spreadsheets or MarcEdit in order to review it. Some participants mentioned comparing circulation data for a collection or set before and after a project. Considerations for future projects include assessing potential future needs of the collection, standardizing updated or added fields/data for easier retrieval. Some participants maintain saved queries/sets or lists of reports to re-run regularly to capture common errors that creep back into the database over time. To help prevent errors, participants look for patterns in reports and work with colleagues to improve training, workflows, or documentation.
The second day of the e-Forum centered around project management and conceptual matters related to database cleanup projects.
A variety of programs were mentioned by the participants as being useful for documenting or tracking database cleanup projects, with Excel and Trello being the most frequently reported. Both programs were used for tracking lists of projects as well as for containing project data. In addition, LibAnswers, BaseCamp, GitHub, Drive, JIRA, Wrike, Confluence, and Asana were all mentioned for tracking projects and retaining relevant tips and documentation (MarcEdit task lists, SQL queries, regex or Python scripts). Drive, Slack and internal staff wikis or shared network folders were used to retain relevant documents. Maintaining extremely detailed documentation was mentioned as being highly recommended, both in order to share project responsibilities or for training, but also to record what steps were taken in order to replicate those projects in the future. Recommended tips were to use a file naming convention or consistent color coding in spreadsheets across projects as well as to document any major collection changes in order to explain any changes in collection-count assessment data from year to year.
When asked about cleanup projects that have “gone awry”, three projects were mentioned: (1) a card catalog shelflist project was described as problematic due to its longevity; (2) a global add of GMDs resulted in some less-than-ideal data; and (3) an Alma normalization process (a batch change) was implemented that corrupted over 300,000 holdings records. This last project led Ex Libris to develop a holdings restoration functionality. If projects do go sideways, quickly shifting to troubleshooting (rather than to assigning blame) was noted as being very productive, as was the realization that clear communication about projects is critical. One participant described delegating a project to someone else who interpreted and carried out the project in a completely different manner than intended.
Continuing on the theme of communication, collaboration on projects as well as organizational and staffing challenges were our next topics. Several participants noted a lack of staff to work on database maintenance projects, either because staff are working on other duties or because staff may not be enthusiastic about database projects; in the latter case, providing choices among different projects or staff finding their own projects of interest has been helpful. Departmental silos were also mentioned as a barrier to projects. Understanding the structure, roles, and each other’s perspectives can help prevent miscommunications. One participant developed a handout to facilitate project planning by all stakeholders from the beginning of the process. Participants emphasized the importance of cross-departmental communication, both intentional and serendipitous, while acknowledging that this is a constant challenge as the impact of our work may affect other units in ways we do not expect. Regular cross-departmental check-ins to share information about current projects or problems can help awareness of how workflows and processes intersect, and may also lead to future projects. Taking the time to build relationships and trust pays off in successful projects and less frustration for everyone involved.
Several people chimed in that they have projects on hold while primarily working from home due to the COVID-19 pandemic. Future projects include continuing to enhance hybrid RDA/AACR2 records to align even more closely with RDA; one participant is considering using WorldShare Collection Manager for this enhancement. Another has a set of records that need an OCLC record equivalent. One public librarian shared that their future projects will focus on highlighting specific collections in the catalog through location changes and local genre tagging, standardizing the treatment of their library’s Spanish language materials, as well as conduct a check on series that are held in multiple collections.
Regarding projects that need some problem-solving, one respondent is figuring out how to maintain their libraries authority records and controlled access points without the use of an authority control vendor. Another expressed that organizational barriers and not technical problems are holding up some of their projects.
The following are information-seeking and/or discussion questions raised by some of the respondents:
- How have people specifically integrated OpenRefine into their database maintenance workflow?
- Do you use data to support your new project requests or only rely on anecdotal feedback?
- Do you use analytics to gather data on how your users are interacting with the catalog?
- Do you think about or take into consideration when ‘good enough’ is good enough? If so, how has this affected your project development?
Finally, we discussed philosophies underlying database maintenance or cleanup work. These included optimizing discovery, following standards, and/or creating better work environments for back-end users. The overwhelming response to this question was phrased succinctly by one respondent: “Do right by the collection. It is of no use to the user if [an item] is unfindable.”
– Julene Jones, e-Forum moderator
Resources mentioned in either the October or December e-Forum:
Websites or Posted PDFs
Slides for setting up Primo’s “Report a problem” mechanism: http://sunyla.org/sunyla_docs/conferences/presentations/midwinter20/Session_4_SUNYLA_Midwinter_2020.pdf
Terry Ballard’s List of Typographical Errors (http://terryballard.org/typos/typoscomplete.html)
Millennium and Sierra Users : Regular Expressions used in Create Lists: https://wiliug.files.wordpress.com/2015/02/matches-handout-2015.pdf
Terry Reese’s video tutorials are also great: https://www.youtube.com/playlist?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg
OpenRefine tutorials via Owen Stephens and mashcat: http://www.mashcat.info/.
Python and PyMARC 2-part ALCTS webinars (2017):
- Terry Reese’s The MarcEdit Field Guide https://marcedit.reeset.net/learning_marcedit/
- websites like https://guides.library.illinois.edu/c.php?g=463460&p=3168244
- YouTube tutorials with links; e.g., those listed at https://marcedit.reeset.net/tutorials
List of OCLC Connexion macros: http://www.hahnlibrary.net/libraries/oml/connex.html
Mike Monaco (2020) Methods for In-Sourcing Authority Control with MarcEdit, SQL and Regular Expressions. Journal of Library Metadata, 20:1, 1-27, https://www.tandfonline.com/doi/full/10.1080/19386389.2019.1703497
Stacey Wolf (2020) Automating Authority Control Processes, code4lib Journal, issue 47, https://journal.code4lib.org/articles/15014
If you attended the 2020 Core Forum, Stacie Traill and Martin Patrick from the University of Minnesota gave a presentation titled “Core Concepts and Practical Techniques for Successful Library Metadata Analysis” that described the “core knowledge areas and problem solving techniques for successful library metadata analysis, along with some suggested tools… to develop a shared understanding of the skillsets required to meet their library’s metadata analysis needs.” We highly recommend this presentation!
Stacey Wolf shared a presentation for Amigos last February about e-resource editing, including a regex handout she created: https://digital.library.unt.edu/ark:/67531/metadc1703867.
Common Database Cleanup Projects/Reports mentioned in this e-forum (may contain duplicates)
(our thanks to Stacey Wolf (University of North Texas Libraries), David Bigwood (Lunar and Planetary Institute) and Kim Tubbs-Nelson (Carlsbad City Library, California) for sharing their lists!)
- Bad Bib Country Code
- Bad Bib Language Code
- Bad Bib Location-Invalid Bib Location none
- Looks for records where location is “none” but record is not withdrawn
- Bad Bib Location-Invalid Bib Location
- Looks for records with a valid location code but the code should not be used in bib records
- Bad Bib Location-Invalid Location
- Looks for records with invalid location
- Bad Bib Material Type
- Bad Bib Suppression
- Bad Bib Type
- Bad Item Gift Stats
- Bad Item IMessage
- Bad Item IType
- Bad Item Location-Invalid Item Location none
- Looks for records where location is “none” but record is not withdrawn
- Bad Item Location-Invalid Item Location
- Looks for records with a valid location code but the code should not be used in item records
- Bad Item Location-Invalid Location
- Looks for records with invalid location
- Bad Item OPAC Message
- Bad Item Status
- Bad Item Suppression
- Bad MARC tags or bad field tags.
- Duplicate 008
- Looks for records with more than one 008 field
- Duplicate Field Content
- Looks for records with duplicate fields (i.e., MARC tags, indicators, and content must be the same)
- The query could be revised to find duplicate content in different fields or with different indicators, but they would have to be individually corrected and there are too many to make it worthwhile.
- Duplicate MARC Tag
- Looks for records with more than one occurrence of a non-repeatable field
- 6XX fields with $2 that not from an approved source or has a $2 from an approved source but second indicator is not 7.
- identify possible date problems
- Bad Bib Material Type and Leader 06
- Checks for consistent coding between leader/06 and Mat Type
- Bad Leader and 008
- Because the 008 depends on the code in leader/06, this checks for coding problems between leader/06 and 008. Given the number of 008 coding problems, I’m having to limit it to a few 008 positions that are important based on the leader/06 and that might indicate a problem with the leader/06 code.
- Correcting URL links in 856 |a
- 505 Contents Notes Enrichment Project (to add missing contents notes & individual titles such as for short story/essay/song collections; used List Bib report with character string searches)
- Call Number Cleanup: just do call number searches & add missing cutters for old fiction materials
- Fictitious Character Names Subject Cleanup Project
- Form Subdivision in YA Records Cleanup Project (changed all form subdivisions in young adult records from juvenile to adult form subdivisions)
- Genre Matching (put all genres in 655 fields & use preferred LCGFT terms)
- Mystery Series Checking Project (add/trace series & series numbers)
- Science Fiction/Fantasy Checking Project (add/trace series & series numbers)
- Spanish Local Subject/Genre Cleanup
- Typos Cleanup: search for common typos using lists
- Upgrade old records: add table of contents, summaries, subjects/genres
- URL checker/check links in bibliographic records & in procedures
Below are 12 Strategic Framework Guiding Questions used by our Business Systems division to answer before a new project can be added to their “Technology Road Map” plan.
1. Is the proposed project consistent with our Mission, Vision and Values?
2. Does the proposed project address a priority need?
3. Is the proposed project already designated as a priority because it is mandated, required to meet a deadline, etc.?
4. Does the project align with the Library & Cultural Arts’ strengths?
5. Does the proposed project address a priority patron group’s need?
6. Is L&CA the best organization to meet the need most efficiently and effectively? If another organization can address the need more efficiently or effectively, list it.
7. Is the project allowed in the context of local, state and national health and safety guidelines?
8. Is infrastructure in place to support the project?
9. Are there available resources? What resources are needed from other divisions/departments? (Note: L&CA Business Systems, IT, Contracts Administration and the City Attorney’s resources are extremely limited. If resources from these departments/divisions are required, the project will likely not get approval unless it is mandated or a part of a city-wide initiative.)
10. Is there available funding, if applicable?
11. How does the timeline align with other priority projects? Can the project be reconsidered at another time?
12. How will success be measured?
Project Decision-Questions #1 through 10 must be answered “Yes” for the project to move forward.
Core e-Forums are two-day, moderated, electronic discussion forums that provide an opportunity for librarians and library professionals to discuss matters of interest on an email discussion list. These discussions are free of charge and available to anyone who wishes to subscribe to the email list.