This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple: we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
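As a rough illustration of what a typed citation might look like in a deposit, here is a minimal Python sketch that builds a small citation list with a ‘type’ attribute on each citation. The type values shown (journal_article, dataset) are illustrative assumptions, as is the reference text; the XML namespace and the rest of the deposit are omitted, so check the 5.4.0 schema for the allowed values and full structure before relying on this.

```python
import xml.etree.ElementTree as ET


def build_citation(key, citation_type, unstructured):
    """Build one <citation> element carrying a 'type' attribute.

    The type values passed in below are assumptions for illustration;
    the schema defines the values that are actually allowed.
    """
    citation = ET.Element("citation", {"key": key, "type": citation_type})
    ET.SubElement(citation, "unstructured_citation").text = unstructured
    return citation


# Hypothetical references, one typed as a journal article and one as a dataset.
citation_list = ET.Element("citation_list")
citation_list.append(
    build_citation("ref1", "journal_article", "Example journal article reference.")
)
citation_list.append(
    build_citation("ref2", "dataset", "Example dataset reference without a DOI.")
)

# Serialize to a string (namespace declarations left out for brevity).
xml_text = ET.tostring(citation_list, encoding="unicode")
print(xml_text)
```

A metadata user reading such a record could then filter citations by type, for example to pick out dataset citations that still need matching to an identifier.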
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Event Data is our service to capture online mentions of Crossref records. We monitor data archives, Wikipedia, social media, blogs, news, and other sources. Our main focus has been on gathering data from external sources; however, we know that there is a great deal of Crossref metadata that can be made available as events. Earlier this year we started adding relationship metadata, and over the last few months we have been working on bringing in citations between records.
Our members deposit references alongside other metadata, and we have a lot of them. In fact, we have over 1.2 billion, with hundreds of thousands of new references added each day. While our metadata APIs make it easy to see which works a given record cites, it is much more difficult to find a list of citations to a specific work. We can make this easier by presenting citations as events in Event Data. Now that the vast majority of our members have responded positively to the Initiative for Open Citations (I4OC) campaign and Crossref’s open-by-default reference policy, the move to make this data available via Event Data is a natural step.
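To see why the reverse lookup is harder, consider that reference lists are stored with the citing work, so finding everything that cites a given work means inverting that mapping. The sketch below shows the idea on a tiny in-memory example; the DOIs are made up, and this is only an illustration of the inversion, not a description of how Event Data is implemented.

```python
from collections import defaultdict


def invert_references(references_by_work):
    """Turn a 'citing work -> cited works' mapping into 'cited work -> citing works'."""
    cited_by = defaultdict(set)
    for citing, cited_list in references_by_work.items():
        for cited in cited_list:
            cited_by[cited].add(citing)
    # Sort for stable, readable output.
    return {doi: sorted(citing) for doi, citing in cited_by.items()}


# Hypothetical records: each key is a citing work, each value its reference list.
references_by_work = {
    "10.5555/example-a": ["10.5555/example-c"],
    "10.5555/example-b": ["10.5555/example-c", "10.5555/example-a"],
}

cited_by = invert_references(references_by_work)
print(cited_by["10.5555/example-c"])
```

Each (citing work, cited work) pair in the inverted mapping corresponds to what would be a single citation event.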
A bumpy ride, but we got there
Adding such a large amount of data means a significant increase in the data coming into Event Data, which has presented some challenges. We’ve known for some time that Event Data is not very stable, but we expected it to cope with the new data coming in. We mitigated the risk by initially looking only at new data, rather than trying to backfill old references immediately. Unfortunately, even with this limitation it hasn’t been a smooth ride: our first effort to put references into Event Data uncovered bugs we didn’t know about, and we had to walk back the changes.
We tried again and found that we were hitting rate limits for our own APIs. This is a sure sign of technical debt: we shouldn’t need to shift large amounts of our own data from one place to another, and certainly not at rates that could put stress on APIs used by others in the community.
There remains work to be done. We would like to backfill references, and there is also further work to include relationships to objects that have identifiers other than Crossref records (genes, proteins, arXiv identifiers, and so on). Our work on investigating sources is proceeding and we will be looking to add more next year. While possible, these steps will be costly and time-consuming if we proceed without significant changes to the infrastructure supporting Event Data.
When we started Event Data the volumes of data were much smaller and our infrastructure coped well, but as we’ve said here before, it’s in need of an overhaul. In fact, our recent experience and some other considerations are making us look at some very fundamental changes in how we record events.
We are therefore working on a new data model that will allow events to be stored alongside the rest of our metadata. This work is still in the early stages, but if we are successful it will mean that we won’t need to move data between databases. It will also make it easier to provide access to all of our reference metadata along with other relationships that we’re not currently able to provide, and give us the capacity to add new data sources.
Open references
[EDIT 6th June 2022 - all references are now open by default with the March 2022 board vote to remove any restrictions on reference distribution].
It is worth noting that only open references will be available via Event Data. This covers 88% of works with references at present. Members have the option to deposit references with limited visibility, meaning only Metadata Plus users can access them; or closed visibility, meaning that only the member who owns the cited work can retrieve the citation, via Cited-by.
We encourage our members to make their references open and deposit them as metadata. Doing so makes them usable downstream by thousands of tools that researchers use. Including open references also improves the quality of metadata, and there are reciprocal benefits for the large number of members who openly share their reference data: they contribute to a large, openly available pool of data with many applications that advance research and drive usage of the content published by our members.
If you are a Crossref member and unsure whether your reference metadata is open or not, check your participation report. This will tell you the percentage of your records with deposited references, and the percentage of those that are open. You can change the reference visibility preference for each DOI prefix that you own by contacting our support team. For guidance on how to deposit references, see our user documentation.