This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider making this change to deposit citation types with their records.
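As a rough illustration of what a typed citation could look like when a deposit is assembled, the sketch below builds a single citation element carrying a ‘type’ attribute. The element names, the `key` value, the citation text, and the type value `dataset` are assumptions chosen for illustration, and the helper `build_typed_citation` is hypothetical; check the 5.4.0 schema documentation for the exact structure and the permitted type values.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def build_typed_citation(
    key: str, citation_type: str, text: str, doi: Optional[str] = None
) -> ET.Element:
    """Build one citation element with a 'type' attribute (illustrative only)."""
    # The 'type' attribute is the addition described above; its value is an assumption.
    citation = ET.Element("citation", {"key": key, "type": citation_type})
    if doi:
        # Only include an identifier when the cited item actually has one.
        ET.SubElement(citation, "doi").text = doi
    ET.SubElement(citation, "unstructured_citation").text = text
    return citation


example = build_typed_citation(
    key="ref1",
    citation_type="dataset",  # assumed value, not confirmed against the schema
    text="Example Author (2020). Example dataset. Example Repository.",
)
print(ET.tostring(example, encoding="unicode"))
```

The point of the sketch is that a citation without an identifier can still declare what kind of object it refers to, which is what helps metadata users match it later.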
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
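For workflows that may now receive either kind of identifier, a small check like the one below can tell them apart before the metadata is assembled. This is a minimal sketch under stated assumptions: the function name `classify_funder_identifier` is hypothetical, the patterns check only the general shape of a ROR ID (a leading 0, six base32 characters, two digits) and of a Funder Registry DOI (the 10.13039 prefix), and neither the ROR checksum nor the existence of the record is verified. The sample identifiers are for illustration only.

```python
import re

# Shape of a ROR ID, with or without the https://ror.org/ prefix.
# The checksum digits are matched but NOT validated here.
ROR_PATTERN = re.compile(
    r"^(?:https?://ror\.org/)?0[a-hj-km-np-tv-z0-9]{6}[0-9]{2}$"
)

# Funder Registry IDs are DOIs under the 10.13039 prefix.
FUNDER_ID_PATTERN = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/)?10\.13039/\d+$"
)


def classify_funder_identifier(value: str) -> str:
    """Return 'ror', 'funder_id', or 'unknown' based on the identifier's shape."""
    candidate = value.strip().lower()
    if ROR_PATTERN.match(candidate):
        return "ror"
    if FUNDER_ID_PATTERN.match(candidate):
        return "funder_id"
    return "unknown"


for sample in ("https://ror.org/03yrm5c26", "10.13039/100000001", "not-an-id"):
    print(sample, "->", classify_funder_identifier(sample))
```

A shape check like this only routes the value to the right field; confirming that the identifier resolves to the intended organization still requires a lookup against the relevant registry.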
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
A couple of weeks ago we shared with you that data citation is here, and that you can start doing data citation today. But why would you want to? There are always so many priorities, so why should this be at the top of the list?
I’m sure you’ve heard this before: data sharing and data citation are important for scientific progress. The three key reasons for this are:
1) Transparency and reproducibility
Most scientific results that are shared today are just a summary of what researchers did and found. The underlying data are not available, making it difficult to verify and replicate results. If data were always made available with publications, transparency of research would be greatly improved.
2) Reuse
The availability of raw data allows other researchers to reuse the data, not just for replication purposes but also to answer new research questions.
3) Credit
When researchers cite the data they used, this forms the basis for a data credit system. Right now researchers are not really incentivized to share their data, because nobody is looking at data metrics and measuring their impact. Data citation is a first step towards changing that.
The benefits described above are all quite long-term, so why, as a publisher or data repository, should you put your resources towards implementing data citation workflows now? During our pre-conference workshop at FORCE2018 we asked repositories and publishers this question. Below you’ll find some of the answers.
Data repositories
For data repositories, data citation leads to increased visibility of both the repository and the datasets. The workshop revealed that many repositories do a lot of work to establish links between articles and datasets, thereby significantly contributing to transparency in research. Some of the repositories explained that they hire curators who text-mine articles to find associations and manually curate datasets to ensure information about links is part of the metadata. This is reflected in Event Data, where 99% of links between articles and datasets come from data repository metadata. This downstream enrichment of metadata is useful, but it would be more effective if all stakeholders strove to establish these links at a much earlier stage in the research communication process.
ICPSR, the Inter-university Consortium for Political and Social Research, shared:
ICPSR views data citation as vital. As a large social science data archive, ICPSR curates, preserves, and distributes data for the research community to re-use over time. Data citation makes data visible to the research community. Without it, data cannot be accessed for re-use or reproduced for transparency. Its use cannot be tracked and counted to reveal its impact and potential for new uses by investigators in new fields or in combination with new types of data. Data creators cannot receive adequate credit for their intellectual output. And the original investment by funders and scientists to create those data stops producing dividends. Therefore, data citation plays an essential role in the data sharing lifecycle.
Proper data citation, with a unique identifier, makes it much easier to measure impact. When data use is not cited or cited obliquely, it is rendered virtually invisible. Hence, much data use is still not easily detected. The ICPSR Bibliography of Data-related Literature represents ICPSR’s efforts to identify publications that analyze data distributed at ICPSR and link them directly to the data in the ICPSR catalog. As of 2018, ICPSR has a searchable database that contains nearly 80,000 citations of published and unpublished works resulting from analyses of data held in the archive. ICPSR also makes the case for data citation in its brief new video, “ICPSR 101: Why Should I Cite Data?”
GBIF, the Global Biodiversity Information Facility, explained:
The work required to collect, clean, compile and publish biodiversity datasets is significant and deserves recognition. Researchers publish studies based on data made available through GBIF.org at a rate of about 2 papers every single day. It is crucial for GBIF to link these scientific uses to the underlying data as one measure of demonstrating the value and impact of sharing free and open biodiversity data. At the moment, however, only about 10 percent of authors cite or acknowledge the datasets used in research papers properly. As a result, data publishers’ efforts often risk going unnoticed, and the true impact of sharing data remains invisible. GBIF will continue to work with publishers and researchers to provide guidance and input for how to best cite the use of GBIF-mediated data in scientific journals to ensure proper attribution and reproducible research and to demonstrate the true value of free and open access to biodiversity data.
Publishers
By ensuring data is cited in a consistent way, publishers help provide transparency and context for the content they publish. Depositing that information as part of the Crossref metadata helps that work go further by uncovering how data is being used across multiple publications and publishers. This means patterns can be explored and researchers can gain more comprehensive recognition and credit for the work they have done.
Melissa Harrison, Head of Production Operations at eLife, says:
eLife is committed to ensuring researchers get credit for all their outputs, and data is a major component of this. We’re working with Crossref and JATS4R to enable publishers to tag their JATS data content consistently and thus create an easy crosswalk to their Crossref deposits. The JATS4R guidance on Data Availability Statements, linked to and incorporating data citations, will be updated soon, please watch that space!
It will be really interesting to see how much re-use of previously published data is happening, look for patterns in re-use, and see links and hopefully building up of data by different research groups. Ultimately, this will incentivize researchers and publishers to ensure it is correctly accredited at source and in publications, improving the cycle further.
Anita de Waard, VP of Research Collaborations at Elsevier, says:
One of the key recommendations of the Force11 Manifesto was to “3.3 Add data, software, and workflows into the publication as first-class research objects”, which will allow greater reproducibility and rigor to experimental research, and allow the reuse of all digital artefacts in the scholarly lifecycle. By following the data citation principles, we achieve two things: the author presents a richer representation of their work, and the data producer receives credit for the hard work of curating and publishing citable datasets.
Mendeley Data and Elsevier are active contributors to the Scholix framework, which, as a collaborative and open standard, allows the open mining of relationships between articles and datasets. We are also active participants in the new Enabling FAIR Data Project, and next to supporting the TOP Guidelines in all domains, require all authors in the earth and space sciences to deposit their data before publication.
Next week at Crossref LIVE18, Patricia Cruse from DataCite will talk about Data Citations and why they matter. If you’re in Toronto next week, do not hesitate to ask her or anyone from Crossref anything you want to know about data citation!