Measuring Metadata Impacts: Books Discoverability in Google Scholar
This blog post is from Lettie Conrad and Michelle Urberg, cross-posted from The Scholarly Kitchen.
As sponsors of this project, we at Crossref are excited to see this work shared out.
The scholarly publishing community talks a LOT about metadata and the need for high-quality, interoperable, and machine-readable descriptors of the content we disseminate. However, as we’ve reflected on previously in the Kitchen, despite well-established information standards (e.g., persistent identifiers), our industry lacks a shared framework to measure the value and impact of the metadata we produce.
In 2021, we embarked on a Crossref-sponsored study designed to measure how metadata impacts end-user experiences and contributes to the successful discovery of academic and research literature via the mainstream web. Specifically, we set out to learn if scholarly books with DOIs (and associated metadata) were more easily found in Google Scholar than those without DOIs.
Initial results indicated that DOIs have an indirect influence on the discoverability of scholarly books in Google Scholar – however, we found no direct linkage between book DOIs and the quality of Google Scholar indexing or users’ ability to access the full text via search-result links. Although Google Scholar claims to not use DOI metadata in its search index, the results of our mixed-methods study of 100+ books (from 20 publishers) demonstrate that books with DOIs are generally more discoverable than those without DOIs.
As we finalize our analysis, we are sharing some early results and inviting input from our community. What relevant lessons can we glean from this exercise? What changes might book publishers consider based on the outcomes of this study?
Background on the study
This study was designed to evaluate metadata impacts and benefits to users. Given Google Scholar's popularity with a range of stakeholders in our industry, we chose it as the mainstream web channel in which to measure metadata impacts on discoverability.
Our test method and analysis rubric were developed based on our own information-user research, in particular how readers search and retrieve scholarly ebooks, as well as published studies about academic information experiences and research practices. We rated the search performance of more than 100 scholarly books using preset test queries (two for each title). The books tested in this study came from publishers of all sorts and sizes, and represent both monographs and edited volumes from a range of fields; some were open access and others were published under traditional licensing models.
We developed and executed known-item test searches that were designed to simulate common researcher practices. Heuristic analysis of the search results was used to rate the search performance on a 5-point scoring rubric, which was designed to measure the degree of friction in locating the book in question. This method allowed us to compare search performance scores across specific book and metadata attributes, and so to assess the impact of book metadata on content discoverability in Google Scholar.
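As a rough illustration of the kind of comparison described above, the Python sketch below groups rubric scores by a book attribute and averages them. It is not the study's actual tooling: the `SearchTest` record, its field names, the assumption that a score of 5 means the least friction, and the sample values are all hypothetical placeholders rather than study data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SearchTest:
    """One known-item test query and its rubric rating (hypothetical layout)."""
    book_id: str
    has_title_doi: bool
    open_access: bool
    score: int  # assumed direction: 1 = most friction, 5 = least friction

    def __post_init__(self):
        # Reject ratings that fall outside the 5-point rubric.
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be between 1 and 5, got {self.score}")


def mean_score_by(tests, attribute):
    """Average the rubric scores of the tests sharing each value of `attribute`."""
    groups = defaultdict(list)
    for test in tests:
        groups[getattr(test, attribute)].append(test.score)
    return {value: mean(scores) for value, scores in groups.items()}


# Invented placeholder ratings, two test queries per book. Not study data.
sample_tests = [
    SearchTest("book-a", has_title_doi=True, open_access=False, score=5),
    SearchTest("book-a", has_title_doi=True, open_access=False, score=4),
    SearchTest("book-b", has_title_doi=False, open_access=True, score=3),
    SearchTest("book-b", has_title_doi=False, open_access=True, score=2),
]

by_doi = mean_score_by(sample_tests, "has_title_doi")
by_access = mean_score_by(sample_tests, "open_access")
```

Passing a different attribute name, such as `"open_access"`, regroups the same ratings without changing the scoring step.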
Results and findings
In this study, we learned that high-value fields include the primary title paired with subtitles, author/editor surnames and/or field of study. Queries using full book titles performed the best across the board. Those using publication dates and/or author/editor surnames and/or publisher names, but without the book title, were the lowest performers.
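To make these query types concrete, here is a minimal Python sketch that assembles known-item queries from book metadata fields. The `BookRecord` class, the function names, and the placeholder book are assumptions for illustration only; the study's actual test queries are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BookRecord:
    """Minimal book metadata used to build test queries (hypothetical fields)."""
    title: str
    subtitle: Optional[str] = None
    surname: Optional[str] = None  # author or editor surname
    year: Optional[int] = None
    publisher: Optional[str] = None


def full_title_query(book):
    """Primary title paired with the subtitle, when one exists."""
    return ": ".join(part for part in (book.title, book.subtitle) if part)


def title_surname_query(book):
    """Full title followed by the author/editor surname, when one exists."""
    return " ".join(part for part in (full_title_query(book), book.surname) if part)


def no_title_query(book):
    """Surname, publication year and publisher, with the book title left out."""
    year = str(book.year) if book.year is not None else None
    return " ".join(part for part in (book.surname, year, book.publisher) if part)


# Placeholder record, not a book from the study.
book = BookRecord(
    title="Example Title",
    subtitle="An Example Subtitle",
    surname="Surname",
    year=2020,
    publisher="Example Press",
)
queries = [full_title_query(book), title_surname_query(book), no_title_query(book)]
```

The first two functions correspond to the title-based queries described above, and the last one to the title-free pattern that performed worst.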
Surprisingly, our discoverability scores show no significant variation in performance by the type of book, whether edited or authored. Open-access titles performed somewhat better than traditional ones. Books covering humanities and social science fields performed a bit better than STM books, but only by a slim difference (that is not statistically significant).
We primarily tested the discoverability of book titles, from equal numbers of books with and without chapter-level DOIs. We ran similar tests for chapter-title discoverability but found that the majority of test queries for chapters led users to the full book itself. While books without title-level DOIs were found to be less discoverable, we did not find a measurable difference between books with or without chapter-level DOIs. (Note: All books in this study with chapter-level DOIs assigned also carried a title-level DOI, which was found to be fairly common.)
Based on these results, we are developing a theory that books with DOIs perform better in Google Scholar because they benefit from the structured, open metadata associated with those DOIs. That metadata is used by hundreds of platforms and services and is therefore “seeded” throughout the mainstream web, which Scholar may draw on for indexing, linking, etc. That said, these results also suggest that publishers are best served by a metadata strategy that is well attuned to the protocols expected of each channel for book search and discovery. In a recent conversation about our findings, Anurag Acharya of Google Scholar noted that these results underscore the need for publishers to invest in the robust construction and broad distribution of book metadata.
In this study, we have observed that the metadata protocols surrounding Google Scholar are not fully integrated into our industry’s established scholarly information standards bodies, like NISO, or infrastructure organizations, like Crossref. While some mainstream data standards prevail in the Scholar index, like the use of schema.org and HTTP, some key metadata attributes seem to be lacking. For example, an indicator of the type of scholarly book (monograph, handbook, etc.) would improve Google Scholar’s search index and could be used to filter search results, thereby improving users’ experiences discovering scholarly books. One clear challenge for book publishers today is the fact that Google Scholar operates outside of our community-governed scholarly information infrastructure.
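For readers less familiar with schema.org, the Python sketch below shows what a basic schema.org `Book` description carrying a DOI could look like when serialized as JSON-LD. This is an assumption-laden illustration: the post does not say which schema.org properties Google Scholar reads, the `book_jsonld` helper is hypothetical, and the title, names, and DOI are placeholders.

```python
import json


def book_jsonld(title, author_name, publisher_name, year, doi):
    """Build a schema.org Book description as a JSON-LD string (illustrative only)."""
    record = {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "author": {"@type": "Person", "name": author_name},
        "publisher": {"@type": "Organization", "name": publisher_name},
        "datePublished": str(year),
        # The DOI is carried as a generic identifier via schema.org's PropertyValue.
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": "doi",
            "value": doi,
        },
    }
    return json.dumps(record, indent=2)


# Placeholder values, not a real book or a registered DOI.
example = book_jsonld(
    title="Example Title: An Example Subtitle",
    author_name="A. Surname",
    publisher_name="Example Press",
    year=2020,
    doi="10.5555/example-book",
)
```

A publisher platform would typically embed a string like `example` in a `<script type="application/ld+json">` element on the book's landing page.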
What comes next
While this study focused on Google Scholar, the results and lessons learned are applicable to other mainstream channels of information seeking/discovery. Our report, due out spring 2023, will contribute to the literature intended to support user-centric information systems design and content architecture by scholarly publishers and service providers.
As we write up our findings, we intend to develop a framework that can help publishers and others measure the impact of their work to enrich and distribute scholarly metadata. We hope this first systematic review of the impacts of metadata on the discoverability of books in Google Scholar will provide valuable insights for this community. In the meantime, please share your thoughts and questions in the comments below – or reach out to us directly through Lettie’s and Michelle’s author profiles.
Acknowledgments: The authors would like to thank Jennifer Kemp at Crossref for the inspiration to take this dive into the metadata literature and reflect on its impact on research information experiences. Special thanks to Anurag Acharya at Google Scholar for his consultation during this study.