This year, metadata development is one of our key priorities, and we're making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we've added a 'type' attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know that many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
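To make the change concrete, here is a minimal sketch in Python of what a typed citation could look like in a deposit. The element names, the attribute name, and the vocabulary value shown are assumptions based on the description above, and should be checked against the published 5.4.0 input schema before use.

```python
import xml.etree.ElementTree as ET

# Illustrative only: the element names, the "type" attribute, and the
# "dataset" value are assumptions based on the description above, not a
# verified excerpt from the 5.4.0 schema.
citation_list = ET.Element("citation_list")
citation = ET.SubElement(citation_list, "citation", key="ref1")
citation.set("type", "dataset")  # identify the cited work as a dataset

unstructured = ET.SubElement(citation, "unstructured_citation")
unstructured.text = "Example Author (2023). Example dataset title. Example Repository."

print(ET.tostring(citation_list, encoding="unicode"))
```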
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we're delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It's a joint initiative of the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfils the long-standing need for an open organisation identifier.
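For anyone who wants to look up the ROR ID for a funder programmatically, here is a minimal sketch against the public ROR search API. The query string is an invented example, and simply taking the first result is a naive matching strategy; check the ROR API documentation for current endpoints and response fields.

```python
import requests

def find_ror_id(organization_name: str) -> str | None:
    """Return a likely ROR ID for an organization name, or None if no match.

    Naive sketch: queries the public ROR search API and takes the first
    result, which may not be the right organization for ambiguous names.
    """
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"query": organization_name},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0]["id"] if items else None

# Invented example query, for illustration only.
print(find_ror_id("Wellcome Trust"))
```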
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world's least economically advantaged countries. Eligibility for the program is based on a member's country; our list of countries is predominantly based on the International Development Association (IDA) list. Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Thank you to everyone who responded with feedback on the Op Cit proposal. This post clarifies, defends, and amends the original proposal in light of the responses that have been sent. We have endeavoured to respond to every point that was raised, either here or in the document comments themselves.
We strongly prefer for this to be developed in collaboration with CLOCKSS, LOCKSS, and/or Portico, i.e. through established preservation services that already have existing arrangements in place, are properly funded, and understand the problem space. There is a low level of trust in the Internet Archive, also given a number of ongoing court cases and erratic behavior in the past. People are questioning the sustainability and stability of IA, and given that it is not funded by publishers or other major STM stakeholders, there is low confidence in IA setting their priorities in a way that is aligned with those of the publishing industry.
We acknowledge that some of our members have a low level of trust in the Internet Archive, but many of our members (primarily open access members) work very closely with the IA, and our research has shown that, without the IA, the majority of our smaller open access members would have almost no preservation at all. We have already had conversations with CLOCKSS and Portico about involvement in the pilot and about what scaling to production would look like. That said, for a proof-of-concept, the Internet Archive presents a very easy way to get off the ground, with a stable system that has been running for almost 30 years.
This seems to be a service for OA content only, but people wonder for how long. Someone already spotted an internal Crossref comment on the working doc suggesting 'why not just make it the default for everything and everyone', and that raises concern.
The primary audience for this service is small OA publishers that are, at present, poorly preserved. These publishers present a problem for the whole scholarly environment because linking to their works can prove non-persistent if preservation is not well handled. Enhancing preservation for this sector therefore benefits the entire publishing industry by creating a persistent linking environment. We have no plans to make this the 'default for everything and everyone' because the licensing challenges alone are massive, but also because it isn't necessary. Large publishers like Elsevier are doing a good job of digitally preserving their content. We want this service to target the areas that are currently weaker.
Crossref will always respect the content rights of our members. We will never release members' content through Crossref that they have not asked us to release.
The purpose of the Op Cit project is to make it easier for our members to fulfil commitments they already made when they joined Crossref.
Crossref is fundamentally an infrastructure for preserving citations and links in the scholarly record. We cannot do that if the content being cited or linked to disappears.
When signing the Crossref membership agreement, members agree to employ their best efforts to preserve their content with archiving services so that Crossref can continue to link citations to it even in extremis, for example if the member has ceased operations.
Some of our members already do this well. They have already made arrangements with the major archiving providers. They do not need the Op Cit service to help them with archiving. However, the Op Cit service will still help them ensure that the DOIs they cite continue to work, so it will benefit them even if they don't use it directly.
However, our research shows that many of our members are not fulfilling the commitments they made when joining Crossref. Over the next few years, we will be trying to fix this, primarily through outreach: encouraging members to set up archiving arrangements with the archives of their choice and to record those arrangements with Crossref.
But we know some members will find this too technically challenging and/or costly. [And frankly, given what we've learned of the archiving landscape, we can see their point.] The proposed Op Cit service is for these members. The vast majority of these members are Open Access publishers, so the 'rights' questions are far more straightforward, making the implementation of such a service much more tractable.
Someone asked what this means for the publisher-specific DOI prefix for this content: will it be lost?
No.
There is concern about the interstitial page that Crossref would build that gives the user access options. The value of Crossref to publishers is adding services that are invisible and beneficial to users, not adding a visible step that requires user action.
There is nothing in Crossref's terms that says that we have to be invisible. The basic truth is that detecting content drift is really hard, and several previous efforts to do so have failed. Without a reliable way of knowing whether we should display the interstitial page (which may become possible in future), we have to display something for now, or the preservation function will not work.
Crossref has also supported user-facing services, some of them interstitial, for over a decade, including:
Multiple Resolution
Co-access
CrossMark
Crossref Metadata Search
REST API
So we have a long track record of non-B2B service provision.
There is confusion about why Crossref seems to want to build the capacity to 'lock' records without any flexibility. People feel no need for Crossref to get involved here.
This is a misunderstanding of the terminology. The Internet Archive allows the domain owner to request that content be removed. This would mean that, in future, if a new domain owner wanted, they could remove previously preserved material from the archive, thereby breaking the preservation function. When we say we want to 'lock' a record, we mean that a future domain owner cannot remove content from the preservation archive. This also prevents domain hijackers from compromising the digital preservation.
There is concern about the possibility of hacking this system to gain uncontrolled access to all full-text content by attacking publishing systems and making them unavailable. This is an unhappy-path scenario, but something on people's minds.
The system only works on content that is provided with an explicitly stated open license (see response above).
I think this project would be improved by better addressing the people doing the preservation maintenance work that this requires. Digital preservation is primarily a labor problem, as the technical challenges are usually easier than the challenge of consistently paying people to keep everything maintained over time. Through that lens, this is primarily a technical solution to offload labor resources from small repositories to (for now) the Internet Archive, where you can get benefits from the economies of scale. There are definitely cases where that could be useful! But I think making this more explicit will further a shared understanding of advantages and disadvantages and help you all see future roadblocks and opportunities for this approach.
This consultation phase was designed, precisely, to ensure that those working in the space could have their say. While this is a technical project, we recognize that any solution must value and understand labor. That means that any scaling to production must and will also include a funding solution to address the social labor challenge.
Is there any sense in polling either the IA Wayback Machine or the LANL Memento Aggregator first to determine if snapshot(s) already exist?
We could do this, but it would add an additional hop/lookup on deposit. Plus, we want to store the specific version as it exists at the time of deposit, including re-deposits.
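For context, here is a minimal sketch of the kind of lookup the commenter describes, using the Internet Archive's public availability endpoint. The URL being checked is an invented example, and a production check would need error handling and probably a Memento aggregator query as well.

```python
import requests

def closest_wayback_snapshot(url: str) -> dict | None:
    """Return the closest Wayback Machine snapshot for a URL, if one exists.

    Minimal sketch of the pre-deposit check suggested above; it adds one
    extra HTTP lookup per deposit, which is the cost noted in the response.
    """
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest")

# Hypothetical landing page URL, for illustration only.
print(closest_wayback_snapshot("https://example.org/article/123"))
```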
I would encourage looking at a distributed file system like IPFS (https://en.wikipedia.org/wiki/InterPlanetary_File_System). This would allow easy duplication, switching, and peering of preservation providers. Correctly leveraged with IPNS, resolution, version tracking, and version immutability also become benefits. Later, after the beta, the IPNS metadata could be included as DOI metadata.
We had considered IPFS for other projects, but really, for this, we want to go with recognised archives, not end up running our own infrastructure for preservation.
It might be useful to look into the 10320/loc option for the Handle server: https://www.handle.net/overviews/handle_type_10320_loc.html. I can imagine a use case where a machine agent might want to access an archive directly without needing to go to an interstitial page.
It is good to see reference to the Handle system and alternative ways that we might use it. We will consult internally on the technical viability of this.
In general, though, we prefer to use web-native mechanisms when they are available. We already support direct machine access via HTTP redirects and by exposing resource URLs in the metadata that can be retrieved via content negotiation. In this case, we would be looking at supporting 300 (Multiple Choices) semantics.
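As an illustration of the web-native route, the sketch below retrieves machine-readable metadata for a DOI via standard content negotiation against doi.org. The DOI shown is the standard Crossref test DOI used as a placeholder, and the Accept header is one commonly supported metadata format; any real integration should consult the DOI content negotiation documentation.

```python
import requests

def fetch_csl_json(doi: str) -> dict:
    """Fetch CSL JSON metadata for a DOI via HTTP content negotiation.

    Sketch of machine access through ordinary web mechanisms: the client
    asks https://doi.org/ for a machine-readable representation instead of
    the landing page and follows the redirects it is given.
    """
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Placeholder/test DOI, for illustration only.
metadata = fetch_csl_json("10.5555/12345678")
print(metadata.get("title"))
```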
I'm curious to see how this will work for DOI versioning mechanisms like Zenodo's, where you have one DOI to reference all versions as well as version-specific DOIs. If your record contains metadata plus many files and a new version changes just one of the several files, my assumption is that, within the proposed system, an entirely new set (all of the files) is archived. In theory this could also be a logical package, where only the delta is stored, but I guess in a distributed preservation framework like the one proposed here, this would be hard to achieve.
This is a good point, and it could lead to many more frustrating hops before the user reaches the content. We will conduct further research into this scenario, but we also note that Zenodo's DOIs come not from Crossref but from DataCite.
There's a decent body of research at this point on automated content drift detection. This recent paper: https://ceur-ws.org/Vol-3246/10_Paper3.pdf likely has links to other relevant articles.
We have no illusions about the difficulty of detecting semantic drift but this is helpful and interesting. We will read this material and related articles to appraise the current state of content drift detection.
Out of curiosity, will we be using one type of archive (i.e. IA or CLOCKSS or LOCKSS or whatever), or will it possibly be a combination of a few archives? Reading the comments, it looks like some of them charge a fee, so I see why we'd use open source solutions first. Also, could it eventually be something that the member chooses, i.e. which archive they might want to use? Again, the latter question isn't something for the prototype, but I'm curious about this use case. Also, I wonder about the implementation details if it is more than one archive. The question is totally moot, of course, if we're sticking with one archive for now.
The design will allow for deposit in multiple archives, and we will have to design a sustainability model that will cover those archives that need funding. As above, this is an important part of the move to production.
It would be good for future interoperability to make sure at least one of the hashes is a SoftWare Hash IDentifier (see swhid.org). The ID is not really software-specific and will interoperate with the Software Heritage archive and git repositories.
We will certainly ensure best practices for checksums.
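For illustration, here is a minimal sketch of computing the content-level SWHID for a single file. It follows the published SWHID scheme for content objects, which reuses the git blob hash; directory- and release-level identifiers are more involved and are not shown.

```python
import hashlib

def content_swhid(data: bytes) -> str:
    """Compute a content-level SWHID (swh:1:cnt:...) for raw file bytes.

    The content identifier is the SHA-1 of a git-style blob header followed
    by the bytes themselves, so it matches git's own object hash for the file.
    """
    header = f"blob {len(data)}\0".encode()
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"

print(content_swhid(b"hello world\n"))
# swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad (same as `git hash-object`)
```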
Comments on the Interstitial Page
I'd keep the interstitial page without planning its eradication. (See why in the last paragraph.)
I'd even advocate for it to be a beautiful and useful reminder to users that "This content is preserved".
I'd go further and recommend that publishers deposit alternate URLs of other preservation agents, like PMC etc., that would also be displayed. This page could even be merged with the multiple-resolution system.
The why: I'm concerned about hackers and predatory publishers exploiting the spider heuristics by hijacking small journals, keeping just enough metadata in them to fool the resolver, and then adding links to whatever products, scams, and whatnot...
Technical. Scraping landing pages is hard. We've had a lot of projects to do this over the years. You can mitigate the risk with tiering/heuristics. Maybe even a feedback loop to publishers to encourage them to put the right metadata on the landing page.
This is the only part of this proposal that I don't like. People are used to DOIs resolving directly to content, and I don't think that should be changed unless absolutely necessary. I would prefer that the DOI resolves to the publisher's copy if it exists, and the IA copy otherwise.
We will continue the discussion about the interstitial page. The basic technical fact, as above, is that detecting content drift is hard, so we may need to start with the page, at least initially. However, some commentators presented reasons for keeping it.
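To make the difficulty concrete, the sketch below shows one simple heuristic of the kind a commenter mentioned: fetching a landing page and checking whether it still asserts the registered DOI in a citation_doi meta tag. The tag name follows common publisher landing-page practice rather than anything in this proposal, the URL and DOI are placeholders, and real drift detection would need many more signals than this.

```python
import re
import requests

def page_still_asserts_doi(landing_url: str, doi: str) -> bool:
    """Crude drift heuristic: does the landing page still declare this DOI?

    Looks for a citation_doi meta tag (widely used on publisher landing
    pages) and compares it to the registered DOI. Absence or mismatch is a
    hint, not proof, that the content behind the URL has drifted.
    """
    html = requests.get(landing_url, timeout=10).text
    match = re.search(
        r'<meta[^>]+name=["\']citation_doi["\'][^>]+content=["\']([^"\']+)["\']',
        html,
        flags=re.IGNORECASE,
    )
    return bool(match) and match.group(1).strip().lower() == doi.lower()

# Hypothetical values for illustration only.
print(page_still_asserts_doi("https://example.org/article/123", "10.5555/12345678"))
```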
We have also supported interstitial pages for multiple resolution and co-access for over a decade.
It is the member's choice whether they wish to deposit alternative URLs, and we already have a mechanism for this.