This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, do collect this information already and will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. Launched in 2019, it is a joint initiative led by the California Digital Library, DataCite and Crossref that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Anybody who knows me or reads this blog is probably aware that I don’t exactly hold back when discussing problems with the DOI system. But just occasionally I find myself actually defending the thing…
About once a year somebody suggests that we could replace existing persistent citation identifiers (e.g. DOIs) with some new technology that would fix some of the weaknesses of the current systems. Usually said person is unhappy that current systems like DOI, Handle, ARK, perma.cc, etc. depend largely on a social element to update the pointers between the identifier and the current location of the resource being identified. It just seems manifestly old-fashioned and ridiculous that we should still depend on bags of meat to keep our digital linking infrastructure from falling apart.
When one of these ideas is posed, there is a brief flurry of activity as another generation goes through the same thought processes and (so far) comes to the same conclusions.
The proposals I’ve seen generally fall into one of the following groups:
1. Replace persistent identifiers (PIDs) with hashes, checksums, etc.
2. Just use search (often, but not always, coupled with 1 above).
3. Automagically create PIDs out of metadata.
4. Automagically redirect broken citations to archived versions of the content identified.
I thought it might help advance the discussion and avoid a bunch of dead ends if I summarised (rehashed?) some of the issues that should be considered when exploring these options.
Warning: Refers to FRBR terminology. Those of a sensitive disposition might want to turn away now.
DOIs, PMIDs, and other persistent identifiers are primarily used by our community as “citation identifiers”. We generally cite at the “expression” level.
Consider the difference between how a “citation identifier”, a “work identifier” and a “content verification identifier” might function.
How do you deal with “equivalent manifestations” of the same expression? For example, the ePub, PDF and HTML representations of the same article are intellectually equivalent and interchangeable when citing. The same applies to CSV and TSV representations of the same dataset. So, for example, how do hashes work here as a citation identifier?
Content can be changed in ways that typically don’t affect the interpretation or crediting of the work, for example by reformatting or correcting spelling. In these cases the copies should share the same citation identifier, but the hashes will be different.
Content that is virtually identical (and shares the same hash) might be republished in different venues (e.g. a normal issue and a thematic issue). Context in citation is important. How do you point somebody at the copy in the correct context?
Some copies of an article or dataset are stewarded by publishers. That is, if there is an update, erratum, corrigendum, or retraction/withdrawal, they can reflect that on the stewarded copy, but not on copies they don’t host or control. Location is, in fact, important here.
Some copies of content will be nearly identical, but will differ in ways that would affect the interpretation and/or crediting of the work: a corrected number in a table, for example. How would you create a citation from a search that would differentiate the correct version from the incorrect version?
Some content might be restricted, private or under embargo. For example private patient data, sensitive data about archaeological finds or the migratory patterns of endangered animals.
Some content is behind paywalls (cue jeremiads).
Content is increasingly composed of static and dynamic elements. How do you identify the parts that can be hashed?
How do you create an identifier out of metadata and not have them look like this?
This list is a starting point that should allow people to avoid a lot of blind alleys.
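To make the hashing points in the list above a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made purely for illustration: the tiny dataset, the sample sentence, the `content_hash` helper name, and the choice of SHA-256 are all mine, and no existing or proposed identifier system is being described. It simply shows that two equivalent manifestations of the same dataset (CSV and TSV), and two copies of a sentence that differ only by a spelling correction, produce different hashes.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


# A made-up dataset, serialised two ways. The rows are identical;
# only the delimiter differs.
rows = [("species", "count"), ("heron", "12"), ("egret", "7")]
csv_bytes = "\n".join(",".join(row) for row in rows).encode("utf-8")
tsv_bytes = "\n".join("\t".join(row) for row in rows).encode("utf-8")

csv_hash = content_hash(csv_bytes)
tsv_hash = content_hash(tsv_bytes)

# A made-up sentence before and after a spelling correction that does
# not change its interpretation or who should be credited for it.
original = "The samples were recieved in good condition.".encode("utf-8")
corrected = "The samples were received in good condition.".encode("utf-8")

original_hash = content_hash(original)
corrected_hash = content_hash(corrected)

# Both comparisons come out False: the hash tracks the bytes,
# not the expression being cited.
manifestations_match = csv_hash == tsv_hash
corrections_match = original_hash == corrected_hash

print("CSV and TSV share a hash:", manifestations_match)
print("Original and corrected share a hash:", corrections_match)
```

A hash is a perfectly good “content verification identifier” for one particular sequence of bytes. The trouble only starts when it is asked to stand in for a citation identifier at the expression level.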
In the meantime, good luck to those seeking alternatives to the current crop of persistent citation identifier systems. I’m not convinced it is possible to replace them, but if it is, I hope I beat you to it. 🙂 And I hope I can avoid stabbing myself in the eye.