This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders anywhere they currently use Funder IDs in their metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Recording data citations supports data reuse and aids research integrity and reproducibility. Crossref makes it easy for our members to submit data citations to support the scholarly record.
TL;DR
Citations are essential/core metadata that all members should submit for all articles, conference proceedings, preprints, and books. Submitting data citations to Crossref has long been possible. And it’s easy. You just need to:
Include data citations in the references section as you would for any other citation
Include a DOI or other persistent identifier for the data if it is available - just as you would for any other citation
Submit the references to Crossref through the content registration process as you would for any other record
And your data citations will flow through all the normal processes that Crossref applies to citations. And they will be distributed openly to the community (including DataCite!) via Crossref’s services and APIs. All data citations deposited with Crossref will be exposed in the (soon-to-be-launched) Data Citation Corpus.
And then, you can sit back and congratulate yourself for making your publication more useful to researchers who want to be able to reuse the data underlying your publications.
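To make the three steps above a little more concrete, here is a minimal sketch of what a reference list containing a data citation might look like when it is assembled for deposit. It assumes the `citation_list`, `citation`, `doi`, and `unstructured_citation` elements of the Crossref deposit schema; the helper name, the reference text, and the DOIs (on the 10.5555 test prefix) are hypothetical and only there for illustration. Check the current schema documentation before relying on it.

```python
import xml.etree.ElementTree as ET


def build_citation_list(references):
    """Build a <citation_list> element from a list of reference dicts.

    Each dict needs a "text" entry and may carry a "doi" entry.
    Data citations are handled exactly like any other citation.
    """
    citation_list = ET.Element("citation_list")
    for number, ref in enumerate(references, start=1):
        citation = ET.SubElement(citation_list, "citation", key=f"ref{number}")
        # Include the DOI only when the cited item actually has one.
        if ref.get("doi"):
            ET.SubElement(citation, "doi").text = ref["doi"]
        ET.SubElement(citation, "unstructured_citation").text = ref["text"]
    return citation_list


# Hypothetical references: one journal article and one dataset.
references = [
    {"text": "Author A. (2021). An example article. Example Journal.",
     "doi": "10.5555/example-article"},
    {"text": "Author B. (2022). An example dataset. Example Repository.",
     "doi": "10.5555/example-dataset"},
]

print(ET.tostring(build_citation_list(references), encoding="unicode"))
```

Nothing in the sketch treats the dataset differently from the article, which is the point: the data citation sits in the same list and goes through the same deposit as every other reference.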
Background
You might ask, “So if submitting Data Citations to Crossref has long been possible, why do you have to write this?”
Historically, authors did not cite data in the way they cited publications. Instead, they would often refer to the data in the main text of the article. This has made it hard to determine what data lay behind the research and/or access the data.
But the research community has increasingly recognized that data is a first-class research output and that we should treat it as such. In short, we should formally cite data.
But because citing data is a comparatively new practice, it has been subject to a lot of new analysis. And unsurprisingly, people analyzing data citation have discovered that there is a lot of nuance to citation of any kind.
There are lots of reasons for citing something. There are lots of internalized conventions for citing things. And there are different conventions for citation for different research objects. And SSH citation practice differs from STEM. And legal citation practices are different from scholarly citation practices. And citation practices even vary by subdiscipline and by journal.
Those who have been looking at what it means to “cite data” have naturally stumbled into a thicket of divergent practices - some of which are historical holdovers, some of which are stylistic preferences, and some of which are clearly adaptations to deal with the specific needs of certain research objects/containers or different disciplines.
The temptation has been to try and rationalize this before extending the practice of citation to data.
“Maybe because data is a distinct record type, we should include the fact that it is a data citation in the citation itself?”
“Maybe because people cite data for different reasons, we should include a typology of citation types in all data citations?”
And so you may hear some people say, “hold off on data citation - we don’t have an optimal way to do it yet, and it can be very complicated.”
But guess what?
We currently don’t label citations to monographs as “citation to monograph.”
And we don’t currently include the reason for citation when we are citing a journal article.
But citations are already useful even without these features. And so, to delay citing data indefinitely because we have an opportunity to improve the act of citation is just perverse. Our community has always opted for progress over perfection.
For one thing - the efforts are not mutually exclusive. We can start citing data with the current limitations of citation practices and simultaneously propose mechanisms for making citation more useful in the future, including new guidelines to deal with the unique issues that citing data poses.
But in the meantime, we will be doing researchers a giant favour if we at least include our imperfect, ambiguous, and unconventional references to data in the references section of an article so that they can be accessed and processed along with all the other imperfect, ambiguous and variant citations that we find so useful.
Some of our members are already doing this. They have been for a long time. And they haven’t found it any more complicated than managing non-data references in the past.
Join them and make your metadata more useful.
Cite data now. Don’t put it off.
And Crossref will continue to work with DataCite and the rest of the community to make the distribution even easier and more useful.
So who is already citing data?
Top 10 members depositing data citations from November-May 2022
(broken down by DOI prefix, which is why you see some publishers listed twice):
Prefix | Member name | Data citations deposited
10.1038 | Springer Science and Business Media LLC | 7174
10.1016 | Elsevier BV | 6527
10.1007 | Springer Science and Business Media LLC | 4748
10.5194 | Copernicus GmbH | 3017
10.1080 | Informa UK Limited | 2346
10.1177 | SAGE Publications | 2082
10.1002 | Wiley | 2048
10.1111 | Wiley | 1888
10.1108 | Emerald | 1876
10.3390 | MDPI AG | 1827
Top 10 data citations per deposited work
(again, broken down by prefix)
Member name | Prefix | Data citations deposited | Data citations per work
Consortium Erudit | 10.7202 | 580 | 1.149
SLACK, Inc. | 10.3928 | 462 | 0.646
S. Karger AG | 10.1159 | 1653 | 0.532
Proceedings of the National Academy of Sciences | 10.1073 | 973 | 0.502
American Academy of Pediatrics (AAP) | 10.1542 | 486 | 0.397
F1000 Research Ltd | 10.12688 | 552 | 0.341
American Association for the Advancement of Science (AAAS) | 10.1126 | 952 | 0.317
Springer Science and Business Media LLC | 10.1038 | 7174 | 0.231
JMIR Publications Inc. | 10.2196 | 864 | 0.187
American Geophysical Union (AGU) | 10.1029 | 692 | 0.166
These figures cover only the prefixes with the most data citations deposited (>500 in 6 months), so there might be smaller members doing better than this.
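If you want a feel for the scale behind the per-work figures, a rough back-of-the-envelope calculation is possible. The sketch below assumes that "data citations per work" is simply data citations deposited divided by works deposited under that prefix in the same period; that definition is an assumption, not something stated above. The results are estimates derived from rounded ratios, not reported counts, and the function name is made up for the example.

```python
def implied_works(data_citations, citations_per_work):
    """Estimate works deposited, assuming ratio = data citations / works."""
    return data_citations / citations_per_work


# (member, data citations deposited, data citations per work) from the table
rows = [
    ("Consortium Erudit", 580, 1.149),
    ("S. Karger AG", 1653, 0.532),
    ("Springer Science and Business Media LLC (10.1038)", 7174, 0.231),
]

for member, citations, ratio in rows:
    estimate = implied_works(citations, ratio)
    print(f"{member}: roughly {estimate:,.0f} works")
```

Read this way, a high ratio on a small prefix and a low ratio on a very large one can both represent a lot of data citation activity, which is why the two tables rank members so differently.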
Summaries are great, but I want to see some actual examples!
Here are some examples showing how data is cited by our members:
And here are some example API requests for discovering more metadata citations. You can use these API requests as examples and adapt them to your own needs.
Find all the DOIs that cite Dataset X (identified by DOI)
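The request itself is not reproduced here, so the following is only a sketch of how such a query could be put together. It assumes the Crossref Event Data query endpoint and its `obj-id`, `source`, `rows`, and `mailto` parameters; treat the endpoint, the parameter names, and the choice of the `crossref` source as assumptions to verify against the current API documentation. The dataset DOI and email address are hypothetical. The code only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Assumed endpoint for Crossref Event Data queries.
EVENT_DATA_URL = "https://api.eventdata.crossref.org/v1/events"


def citing_events_url(dataset_doi, mailto=None, rows=100):
    """Build a query URL for events whose object is the given dataset DOI."""
    params = {
        "obj-id": dataset_doi,  # the cited dataset
        "source": "crossref",   # links asserted in Crossref member metadata
        "rows": rows,
    }
    if mailto:
        # Identify yourself to the API, if you have a contact address.
        params["mailto"] = mailto
    return f"{EVENT_DATA_URL}?{urlencode(params)}"


# Hypothetical dataset DOI on the 10.5555 test prefix.
print(citing_events_url("10.5555/example-dataset", mailto="you@example.org"))
```

Under these assumptions, each event in the response would pair a citing DOI (the subject) with the dataset DOI (the object), so collecting the subjects gives the list of works that cite Dataset X.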