At the end of last year, we were excited to announce our renewed commitment to community and the launch of three cross-functional programs to guide and accelerate our work. We introduced this new approach to work towards better cross-team alignment, shared responsibility, and improved communication and learning, and to make more progress on the things members need.
This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, do collect this information already and will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
This blog post is from Lettie Conrad and Michelle Urberg, cross-posted from The Scholarly Kitchen. As sponsors of this project, we at Crossref are excited to see this work shared.
The scholarly publishing community talks a LOT about metadata and the need for high-quality, interoperable, and machine-readable descriptors of the content we disseminate. However, as we’ve reflected on previously in the Kitchen, despite well-established information standards (e.g., persistent identifiers), our industry lacks a shared framework to measure the value and impact of the metadata we produce.
In 2021, we embarked on a Crossref-sponsored study designed to measure how metadata impacts end-user experiences and contributes to the successful discovery of academic and research literature via the mainstream web. Specifically, we set out to learn if scholarly books with DOIs (and associated metadata) were more easily found in Google Scholar than those without DOIs.
Initial results indicated that DOIs have an indirect influence on the discoverability of scholarly books in Google Scholar – however, we found no direct linkage between book DOIs and the quality of Google Scholar indexing or users’ ability to access the full text via search-result links. Although Google Scholar claims to not use DOI metadata in its search index, the results of our mixed-methods study of 100+ books (from 20 publishers) demonstrate that books with DOIs are generally more discoverable than those without DOIs.
As we finalize our analysis, we are sharing some early results and inviting input from our community. What relevant lessons can we glean from this exercise? What changes might book publishers consider based on the outcomes of this study?
Background on the study
This study was designed to evaluate metadata impacts and benefits to users. Given Google Scholar’s popularity with a range of stakeholders in our industry, we chose it as the channel in which to measure metadata impacts on discoverability in the mainstream web.
Our test method and analysis rubric were developed based on our own information-user research, in particular how readers search and retrieve scholarly ebooks, as well as published studies about academic information experiences and research practices. We rated the search performance of more than 100 scholarly books using preset test queries (two for each title). The books tested in this study came from publishers of all sorts and sizes, and represent both monographs and edited volumes from a range of fields; some were open access and others were published under traditional licensing models.
We developed and executed known-item test searches that were designed to simulate common researcher practices. Heuristic analysis of the search results was used to rate the search performance on a 5-point scoring rubric, which was designed to measure the degree of friction in locating the book in question. This method allowed us to compare specific book and metadata attributes by their search performance scores, and so to gauge the impact of book metadata on content discoverability in Google Scholar.
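To make the comparison step concrete, here is a minimal sketch of how per-query rubric ratings could be rolled up by a book attribute such as the presence of a title-level DOI. It is an illustration under stated assumptions, not the study's actual tooling: the `QueryResult` fields, the function names, the choice to average the two queries per title, and the assumption that a higher score means less friction are all hypothetical, and the sample ratings are made up purely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class QueryResult:
    """One rated test search (all field names are hypothetical)."""
    book_id: str          # placeholder identifier for a tested title
    has_title_doi: bool   # attribute being compared
    score: int            # rubric rating, 1-5 (assumed: higher = less friction)


def book_scores(results):
    """Average the per-query ratings for each book."""
    per_book = defaultdict(list)
    for r in results:
        if not 1 <= r.score <= 5:
            raise ValueError(f"score out of rubric range: {r.score}")
        per_book[(r.book_id, r.has_title_doi)].append(r.score)
    return {key: mean(scores) for key, scores in per_book.items()}


def mean_score_by_doi(results):
    """Average the book-level scores within each DOI group."""
    groups = defaultdict(list)
    for (_book_id, has_doi), avg in book_scores(results).items():
        groups[has_doi].append(avg)
    return {has_doi: mean(values) for has_doi, values in groups.items()}


# Made-up ratings for illustration only; NOT data from the study.
sample = [
    QueryResult("book-a", True, 5),
    QueryResult("book-a", True, 4),
    QueryResult("book-b", False, 3),
    QueryResult("book-b", False, 2),
]
summary = mean_score_by_doi(sample)
print(summary)
```

Averaging per book first keeps a title with two rated queries from counting twice in its group; the same grouping could be repeated for other attributes, such as open-access status or field of study.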
Results and findings
In this study, we learned that high-value fields include the primary title paired with subtitles, author/editor surnames and/or field of study. Queries using full book titles performed the best across the board. Those using publication dates and/or author/editor surnames and/or publisher names, but without the book title, were the lowest performers.
Surprisingly, our discoverability scores show no significant variation in performance by the type of book, whether edited or authored. Open-access titles performed somewhat better than traditional ones. Books covering humanities and social science fields performed a bit better than STM books, but only by a slim margin that is not statistically significant.
We primarily tested the discoverability of book titles, from equal numbers of books with and without chapter-level DOIs. We ran similar tests for chapter-title discoverability but found that the majority of test queries for chapters led users to the full book itself. While books without title-level DOIs were found to be less discoverable, we did not find a measurable difference between books with and without chapter-level DOIs. (Note: All books in this study with chapter-level DOIs assigned also carried a title-level DOI, which was found to be fairly common.)
Based on these results, we are developing a theory that books with DOIs perform better in Google Scholar because they benefit from the structured, open metadata associated with those DOIs. That metadata is used by hundreds of platforms and services and is therefore “seeded” throughout the mainstream web, which Scholar may draw on for indexing, linking, etc. That said, these results also suggest that publishers are best served by a metadata strategy that is well attuned to the protocols expected of each channel for book search and discovery. In a recent conversation about our findings, Anurag Acharya of Google Scholar noted that these results underscore the need for publishers to invest in the robust construction and broad distribution of book metadata.
In this study, we have observed that the metadata protocols surrounding Google Scholar are not fully integrated into our industry’s established scholarly information standards bodies, like NISO, or infrastructure organizations, like Crossref. While some mainstream data standards prevail in the Scholar index, like the use of schema.org and HTTP, some key metadata attributes seem to be lacking. For example, an indicator of the type of scholarly book (monograph, handbook, etc.) would improve Google Scholar’s search index and could be used to filter search results, thereby improving users’ experiences discovering scholarly books. One clear challenge for book publishers today is the fact that Google Scholar operates outside of our community-governed scholarly information infrastructure.
What comes next
While this study focused on Google Scholar, the results and lessons learned are applicable to other mainstream channels of information seeking/discovery. Our report, due out spring 2023, will contribute to the literature intended to support user-centric information systems design and content architecture by scholarly publishers and service providers.
As we write up our findings, we intend to develop a framework that can help publishers and others measure the impact of their work to enrich and distribute scholarly metadata. We hope this first systematic review of the impacts of metadata on the discoverability of books in Google Scholar will provide valuable insights for this community. In the meantime, please share your thoughts and questions in the comments below – or reach out to us directly (see Lettie’s profile here and Michelle’s profile here).
Acknowledgments: The authors would like to thank Jennifer Kemp at Crossref for the inspiration to take this dive into the metadata literature and reflect on its impact on research information experiences. Special thanks to Anurag Acharya at Google Scholar for his consultation during this study.