This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, do collect this information already and will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
About 11% of available references in records in our OAI-PMH and REST APIs don’t have DOIs when they should. We have deployed a fix, but it is running on billions of records, so we don’t expect it to be complete until mid-April.
Note that the Cited-by API that our members use appears to be unaffected by this problem.
The gory details
When a Crossref member registers metadata for a publication, they often include references. Sometimes the member will also include DOIs in the references, but often they don’t. When they don’t include a DOI in the reference, Crossref tries to match the reference to metadata in the Crossref system. If we succeed, we add the DOI of the matched record to the reference metadata. If we fail, we append the reference to an ever-growing list which we re-process on an ongoing basis.
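The deposit-time flow described above can be sketched roughly as follows. This is a minimal illustration, not Crossref's actual code: the function name `process_references`, the `match` callable, the `retry_queue` list, and the dictionary field names are all hypothetical.

```python
from typing import Callable, Optional

Reference = dict  # hypothetical shape: {"key": ..., "unstructured": ..., "doi": ...}


def process_references(
    references: list[Reference],
    match: Callable[[Reference], Optional[str]],
    retry_queue: list[Reference],
) -> list[Reference]:
    """Attach a DOI to each reference where one can be found.

    - A reference that already carries a member-supplied DOI is left alone.
    - Otherwise `match` is asked for a DOI; a hit is written into the reference.
    - A miss is appended to `retry_queue` so it can be re-processed later.
    """
    for ref in references:
        if ref.get("doi"):
            continue  # the member included a DOI in the deposit
        doi = match(ref)
        if doi is not None:
            ref["doi"] = doi  # matched against existing metadata
        else:
            retry_queue.append(ref)  # unmatched for now; try again later
    return references
```

Here `match` stands in for whatever matching technology is in use; the sketch only shows the three outcomes a deposited reference can have.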
While testing our new reference matching approach, we started to see results that were inconsistent with our existing legacy reference matching system. When we implemented new regression tests, we noticed that, even when using our legacy system, we were consistently getting better results than were reflected in the metadata we exposed via our APIs. For example, we would pick a random Crossref DOI record that included 3 matched references, and when we tried matching all the references in the record again using our existing technology, we would get more matched references than were reported in the metadata.
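A check of this kind could look something like the sketch below. It assumes each reference in a record is a dictionary with a `key` and, when matched, a `DOI` field, and that `rematch` is some callable wrapping the matcher. These names are illustrative assumptions, not a description of our test suite.

```python
from typing import Callable, Optional


def find_unreported_matches(
    record_references: list[dict],
    rematch: Callable[[dict], Optional[str]],
) -> list[tuple[str, str]]:
    """Return (key, doi) pairs that re-matching finds but the metadata lacks.

    An empty result means the exposed metadata agrees with the matcher.
    """
    unreported = []
    for ref in record_references:
        if ref.get("DOI"):
            continue  # already reported as matched in the metadata
        doi = rematch(ref)
        if doi:
            # The matcher succeeds, yet the metadata shows no DOI.
            unreported.append((ref.get("key", ""), doi))
    return unreported
```

A non-empty result for a record is exactly the symptom described above: more references match on a second pass than the metadata reports.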
At first, we thought this might have something to do with sequencing issues. For example, article A might cite article B, but article A might somehow get its DOI registered with Crossref before article B. In this theoretical case, we would initially fail to match the reference, but it would eventually get matched as we continued to reprocess our unmatched references. But this wasn’t the issue, and the problem was not with the matching technology we are using. Instead, we discovered a problem with the way we process references on deposit.
When a member deposits references with Crossref, each reference has to include a member-defined key that is unique among the references deposited in that DOI record. When we match a reference, we report to the member that we matched the reference with key X to DOI Y. The problem is that members would sometimes deposit references with an empty key. If a record contained only one such reference, it would technically pass our test that keys are unique within the record. So we would process the reference, match it, and report the match via our Cited-by service. But later in the process, when we went to include the matched DOI in the reference section of our API metadata, we would skip references that had blank keys. The reference itself would still be included in the metadata; it would just appear that we hadn’t matched it to a DOI when we actually had.
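To make the failure mode concrete, here is a small sketch that reproduces the behaviour described above. It is an illustration under stated assumptions, not our production code: the function names, the `skip_blank_keys` flag, the field names, and the example DOIs are all made up for the example.

```python
def keys_are_unique(references: list[dict]) -> bool:
    """Uniqueness test: a single empty key is still 'unique' within the record."""
    keys = [ref["key"] for ref in references]
    return len(keys) == len(set(keys))


def export_references(
    references: list[dict],
    matched: dict[str, str],
    skip_blank_keys: bool,
) -> list[dict]:
    """Build the reference list exposed in API metadata.

    `matched` maps a reference key to the DOI it was matched to.
    With skip_blank_keys=True (the faulty behaviour), a reference with an
    empty key is still exported, but its matched DOI is left out.
    """
    exported = []
    for ref in references:
        entry = dict(ref)  # copy so the deposited data is not mutated
        doi = matched.get(ref["key"])
        is_blank = ref["key"] == ""
        if doi is not None and not (skip_blank_keys and is_blank):
            entry["DOI"] = doi
        exported.append(entry)
    return exported


deposited = [
    {"key": "ref1", "unstructured": "First cited work"},
    {"key": "", "unstructured": "Second cited work"},  # empty key
]
matches = {"ref1": "10.5555/example.1", "": "10.5555/example.2"}

faulty = export_references(deposited, matches, skip_blank_keys=True)
corrected = export_references(deposited, matches, skip_blank_keys=False)
```

In this sketch `keys_are_unique(deposited)` passes, both references are matched, and yet the `faulty` export shows the second reference without a DOI, while the `corrected` export includes it.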
Again, we estimate that this resulted in about 11% of the references in our metadata missing their matched DOIs. We are processing our references again and inserting the correctly matched DOIs in the metadata. We expect the process to complete in mid-April. We will keep everybody up to date on the progress of this fix.
We will also be integrating the new matching system that we’ve developed. As mentioned at the start of this post, this matching system will also increase the recall rate of our reference matching, so the two changes combined should result in users seeing a significant increase in the number of matched references included in Crossref metadata.
And finally, as part of the work that we are doing to improve our reference matching, we are putting in place a comprehensive testing framework that will make it easier for us to detect inconsistencies and regressions in our reference matching.