This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
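As a rough illustration of what a typed citation might look like, here is a minimal sketch that builds one citation element in Python. The `citation` element, its `key` attribute, and the `unstructured_citation` child follow the existing deposit schema; the attribute value `dataset` and the reference text are assumptions made up for this example, so check the 5.4.0 schema documentation for the permitted values before depositing.

```python
import xml.etree.ElementTree as ET


def build_citation(key, citation_type, text):
    """Return a <citation> element carrying a 'type' attribute.

    The value passed as citation_type is an assumption for illustration;
    the schema defines the actual list of allowed values.
    """
    citation = ET.Element("citation", {"key": key, "type": citation_type})
    unstructured = ET.SubElement(citation, "unstructured_citation")
    unstructured.text = text
    return citation


# Hypothetical reference to a dataset that has no identifier of its own.
example = build_citation(
    key="ref12",
    citation_type="dataset",
    text="Doe, J. (2023). Example survey data [Data set].",
)
xml_string = ET.tostring(example, encoding="unicode")
print(xml_string)
```

A metadata user reading this citation can tell it points to a dataset even though no DOI is present, which is the matching problem the new attribute is meant to ease.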
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. Launched in 2019 as a joint initiative led by the California Digital Library, DataCite, and Crossref, it fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
The ancient Romans performed a purification rite (“lustration”) after taking a census every five years. The term “lustrum” designated not only the animal sacrifice (“suovetaurilia”) but was also applied to the period of time itself. At Crossref, we’re not exactly in the business of sacrificial rituals. But over the weekend I thought it would be fun to dive into the metadata and look at very high-level changes during this period of time.
The first thing a census typically asks is population size. New records arrive each month, at variable rates, and 95.7 million have been registered to date. But when the data are visualized, a rough yearly pattern emerges. (Data were collected on Mar 25, 2018, so results for that month are partial.)
Each year brings with it a significant spike, an influx of new entrants, perhaps reflecting an increase in submissions at the end of the previous year. After January, volume drops dramatically and gradually rises once more over the course of the year. We see smaller spikes at the March, June, and September marks. (Since this was a brief exercise, I did not dive into any formal research conducted on the nature of publishing cycles.)
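For anyone who wants to reproduce a monthly head count, here is a minimal sketch of one way to ask the REST API how many records were created in a given month. It is not necessarily how the figures in this post were produced. It assumes the `from-created-date` and `until-created-date` filters on the `/works` route, uses `rows=0` so that only the total comes back, and the helper names `month_query_url` and `total_results` are made up for this example.

```python
import calendar
from urllib.parse import urlencode

API_URL = "https://api.crossref.org/works"


def month_query_url(year, month):
    """Build a /works query that counts records created in one month."""
    last_day = calendar.monthrange(year, month)[1]
    date_filter = (
        f"from-created-date:{year}-{month:02d}-01,"
        f"until-created-date:{year}-{month:02d}-{last_day:02d}"
    )
    # rows=0 asks for no items, only the summary with the total count.
    return f"{API_URL}?{urlencode({'filter': date_filter, 'rows': 0})}"


def total_results(response_json):
    """Pull the record count out of a parsed /works response."""
    return response_json["message"]["total-results"]


# One URL per month of 2017; fetch each with any HTTP client,
# parse the JSON body, and pass it to total_results().
urls = [month_query_url(2017, month) for month in range(1, 13)]
```

Plotting the twelve totals per year side by side is enough to see whether the January spike and the smaller quarterly bumps hold up.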
Metadata Coverage
The next question is how the population breaks down into different demographics. For this, I analyzed four key sub-populations: ORCID, funding information, license, and abstract metadata. The following graph shows the percentage of new parties (i.e., works registered at Crossref containing these metadata) across these four segments.
The census graph shows extensive empty space on the top half, indicating there is ample room for continual growth in these communities. The ORCID population is expanding the fastest, followed by license and funding. Abstracts are a minority group and quite visibly need a population boost here in Crossref-land.
This view does not capture the percentages across record types, nor does it take into account the differential rate of growth between record types (e.g., journal article, book, report, conference proceeding, dissertation, dataset, component, posted content, peer review) as the Crossref corpus has grown. While ORCID, funding, and license information are available for all full record types (that is, all except components), this matters for abstracts. Abstracts are part of the metadata schema of all relevant record types, which excludes those where they do not apply: datasets, components, and peer reviews. All things considered though, the relative impact on the total percentage of metadata deposited (or not deposited) is minuscule given the small sums for these works.
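The coverage percentages discussed in this section come down to simple division: works carrying a given kind of metadata over all works in the same period. Here is a minimal sketch of that arithmetic, assuming the counts come from the REST API's `has-orcid`, `has-funder`, `has-license`, and `has-abstract` filters. The function names and the numbers in the usage lines are invented for illustration and are not results from this census.

```python
# REST API filters assumed to correspond to the four segments.
COVERAGE_FILTERS = {
    "ORCID": "has-orcid:true",
    "funding": "has-funder:true",
    "license": "has-license:true",
    "abstract": "has-abstract:true",
}


def segment_filter(date_filter, segment):
    """Combine a date-range filter with one segment's filter."""
    return f"{date_filter},{COVERAGE_FILTERS[segment]}"


def coverage_percentages(total, counts):
    """Return each segment's share of all works, as a percentage."""
    if total <= 0:
        raise ValueError("total must be a positive number of works")
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


# Made-up counts for a single month, purely to show the calculation.
shares = coverage_percentages(
    total=400_000,
    counts={"ORCID": 60_000, "funding": 40_000, "license": 50_000, "abstract": 20_000},
)
print(shares)
```

To account for record types where abstracts do not apply, the same calculation could be run with a total that leaves out datasets, components, and peer reviews.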
Calling the real demographers & cartographers
This mini-pseudo-lustrum was the result of a few hours of play. The graphs have raised more questions than answers. We welcome more serious and earnest efforts to dive into the metadata and conduct a more detailed, reliable investigation on the size, distribution and composition of the population through our REST API. Next month, we will roll out reports on metadata coverage based on individual members.
This “play” census came out of a session with Karthik Ram, one of the founders of rOpenSci, as we talked about the struggle to build better tools for researchers. (rOpenSci is an exciting and influential non-profit that builds open source software for research with a community of users and developers and educates scientists about transparent research practices.) With each round of cocktails, it became clear that a critical subset of the issues boiled down to the problem of limited information about research publications. Why, that is what Crossref does! Indeed. Publishers register their content with Crossref and provide the metadata about the works they publish.
Over the past few years, we have been working with our members to broaden the coverage of their metadata as well as improve its quality. This issue is not exclusive to Crossref: Metadata 2020 rallies stakeholders across the research enterprise to push for change together.
To represent the full breadth and depth of the scholarly communications enterprise, Crossref aims to capture the richness of what our members publish through the content they register. So publishers, powerfully represent your services and make sure your metadata is complete and correct for discovery systems, indexing platforms, research evaluation systems, analytics tools, and the great number of Crossref metadata consumers far and wide.