This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple: we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article but, more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
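As a rough illustration of what a typed citation could look like, here is a minimal sketch that builds one with Python's standard library. The helper name build_citation, the key "ref1", the placeholder citation text, and the type value "dataset" are all assumptions made for illustration; the schema documentation for version 5.4.0 is the place to confirm the allowed values.

```python
import xml.etree.ElementTree as ET


def build_citation(key: str, text: str, citation_type: str) -> ET.Element:
    """Build a <citation> element that carries a 'type' attribute.

    Hypothetical helper for illustration only; it is not part of any
    Crossref tooling.
    """
    citation = ET.Element("citation", {"key": key, "type": citation_type})
    unstructured = ET.SubElement(citation, "unstructured_citation")
    unstructured.text = text
    return citation


# Placeholder reference text; "dataset" is an assumed type value.
example = build_citation(
    "ref1",
    "Example Author (2024) Example dataset. Example Repository.",
    "dataset",
)
xml_text = ET.tostring(example, encoding="unicode")
print(xml_text)
```

The point of the sketch is that the type travels with the citation even when no identifier is supplied, so a metadata user can still tell a dataset from a journal article.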
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
When we set up the eLife journal in 2012, we knew datasets were an important component of research content and decided to give them prominence in a section entitled ‘Major datasets’ (see images below). Within this section, major previously published and generated datasets are listed. We also strongly encourage data citations in the reference list.
Almost five years on and I feel we have still not cracked it! We have signed up to the Force11 data citation principles, which were published three years ago, and we have been actively involved in working groups of Force11 and others, for example on the Data Citation Roadmap for Scientific Publishers and the JATS XML data citation recommendation of JATS4R. I am also currently working with other publishers on recommended JATS XML tagging for data availability statements, which is easier said than done considering the nuances of dataset use and the different approaches publishers take.
Added to this, there is still significant push-back from authors about putting all dataset citations in the reference list. For example, authors are concerned about self-citing when they cite a dataset created as part of the research article; some “dataset citations” are in effect a link to a search results page on a database; and hundreds of reference entries could be needed if an author has used a large base for the research.
While eLife is very active in this space, and aims to arrange and mark up the datasets and citations produced by our authors in line with recommendations, the recommendations still have some gaps and the complete picture is not yet clear.
In late 2014, we brought the process of depositing Crossref metadata in-house (previously our online host did this for us). This gave us control of our processes. At the time, we sent all the information we could to Crossref, and we have ensured our references are open and available in the Crossref public API. The code for this conversion process is all open source and available for reuse on GitHub. Since then, apart from small improvements and troubleshooting, we have not updated the code. I have been keeping a list of Crossref features and new deposit metadata we can add to our deposits, and now is the time for us to start working on this again.
One of the items we’ll be addressing is data citations.
The Crossref reference schema does not cater well for content other than books and journals, and if an item does not have a DOI, the “reference” is not very useful because the Crossref schema offers so few tags.
However, Crossref have introduced the relationship type to their schema, so data references can be well linked and mineable. Because I see Crossref as a potential future broker between publishers and data repositories, using the relationship-type deposit for all datasets will assist this and also allow these data points to be seen more easily within the article Nexus framework (see the recent blog post, How do you deposit data citations?).
At eLife, we already distinguish between datasets generated as part of the research results (relationship type in the Crossref schema: “isSupplementedBy”) and datasets produced by a different set of researchers or previously published (relationship type: “references”). Therefore, it will not be hard for us to convert all the information about data referencing that is within the dataset section into a relationship-type deposit in the conversion to Crossref XML.
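To make the mapping concrete, here is a minimal sketch of how the two kinds of dataset could be turned into a relationship-type deposit fragment. It is not the eLife conversion code. It assumes the Crossref relations namespace and the program, related_item, description, and inter_work_relation elements as I understand them from the Crossref relationships documentation; the function name build_relations, the dictionary keys, and the DOIs (which use a test prefix) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed Crossref relations namespace; confirm against the schema in use.
REL_NS = "http://www.crossref.org/relations.xsd"
ET.register_namespace("rel", REL_NS)

# The mapping described in the post: dataset kind -> relationship type.
RELATIONSHIP_TYPES = {
    "generated": "isSupplementedBy",
    "previously_published": "references",
}


def build_relations(datasets):
    """Build a relations <program> element from a list of dataset dicts.

    Hypothetical helper: each dict needs 'kind', 'title', 'identifier'
    and 'id_type' keys.
    """
    program = ET.Element(f"{{{REL_NS}}}program")
    for ds in datasets:
        item = ET.SubElement(program, f"{{{REL_NS}}}related_item")
        description = ET.SubElement(item, f"{{{REL_NS}}}description")
        description.text = ds["title"]
        relation = ET.SubElement(
            item,
            f"{{{REL_NS}}}inter_work_relation",
            {
                "relationship-type": RELATIONSHIP_TYPES[ds["kind"]],
                "identifier-type": ds["id_type"],
            },
        )
        relation.text = ds["identifier"]
    return program


# Placeholder datasets; the DOIs are made up for illustration.
datasets = [
    {"kind": "generated", "title": "Example generated dataset",
     "identifier": "10.5555/example-generated", "id_type": "doi"},
    {"kind": "previously_published", "title": "Example reused dataset",
     "identifier": "10.5555/example-reused", "id_type": "doi"},
]
program = build_relations(datasets)
print(ET.tostring(program, encoding="unicode"))
```

Keeping the mapping in a single dictionary means the generated versus previously published distinction is made in one place, which should make it easier to revisit if the recommendations change.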
We have also recently gone through an exercise of defining a set of rules for all our references and, of the 12 allowed types, one is data. The eLife ‘business’ rules have been written in Schematron (a rule-based validation language for making assertions about the presence or absence of patterns in XML trees; see also this useful article about Schematron on the JATS4R learning centre). Subject to final testing, these will be integrated into our workflow (the Schematron is open source and available for reuse on GitHub, and we will also build an API so people can use the Schematron directly). This will allow us to easily identify all data references and convert them into relationship types in the XML delivered to Crossref. This way, they will not be lost in the references section of our deposits, but properly identified.
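The identification step could be sketched roughly as follows. This is a plain Python stand-in, not the eLife Schematron, and it assumes data references are tagged in JATS as element-citation with publication-type="data" and a pub-id of type "doi". The sample XML, the function name find_data_references, and the DOIs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up JATS reference list with one journal and one data reference.
SAMPLE_JATS = """<ref-list>
  <ref id="bib1">
    <element-citation publication-type="journal">
      <article-title>An example article</article-title>
      <pub-id pub-id-type="doi">10.5555/example-article</pub-id>
    </element-citation>
  </ref>
  <ref id="bib2">
    <element-citation publication-type="data">
      <data-title>An example dataset</data-title>
      <pub-id pub-id-type="doi">10.5555/example-dataset</pub-id>
    </element-citation>
  </ref>
</ref-list>"""


def find_data_references(jats_xml: str):
    """Return the references whose citation is typed as data.

    Hypothetical helper; assumes the tagging described above.
    """
    root = ET.fromstring(jats_xml)
    results = []
    for ref in root.iter("ref"):
        citation = ref.find("element-citation[@publication-type='data']")
        if citation is None:
            continue  # not a data reference, leave it in the reference list
        results.append({
            "ref_id": ref.get("id"),
            "title": citation.findtext("data-title"),
            "doi": citation.findtext("pub-id[@pub-id-type='doi']"),
        })
    return results


data_refs = find_data_references(SAMPLE_JATS)
print(data_refs)
```

Each entry returned here carries what a relationship-type deposit would need: an identifier and a description. Whether a dataset was generated or previously published would still have to come from elsewhere in the article XML.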
However, we do appreciate this will become harder for us as authors become more familiar with datasets as references, because we will not be able to identify the difference between generated and analysed datasets so easily.
The code developed and used to complete these conversions will, again, be on GitHub and open source, and we actively encourage its reuse.
While the industry is still working out the best way to deal with data and ensure it is given the prominence it requires, we feel this is the best approach we can take. Nothing is forever and we can still change what we do later. The beauty of open-source code also means that if an alternative approach emerges, now or in the future, someone else can develop the code we wrote at eLife and we can all benefit.
If you have any questions, please do not hesitate to contact us.