This year, metadata development is one of our key priorities, and we're making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we've added a "type" attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
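To make that concrete, here is a minimal sketch (in Python, using only the standard library) of what a typed citation entry in a deposit might look like. The citation_list, citation, doi, and unstructured_citation element names come from the existing deposit schema; the exact name of the new attribute and its controlled list of values are assumptions here, so check the 5.4.0 schema documentation for the authoritative details.

```python
import xml.etree.ElementTree as ET

# Build a <citation_list> fragment for a deposit. The "type" values shown
# ("journal-article", "dataset") are illustrative only -- consult the 5.4.0
# schema for the authoritative attribute name and vocabulary.
citation_list = ET.Element("citation_list")

# A conventional journal-article citation, now explicitly typed.
article = ET.SubElement(citation_list, "citation",
                        {"key": "ref1", "type": "journal-article"})
ET.SubElement(article, "doi").text = "10.5555/12345678"

# A dataset citation with no identifier: the type makes it findable anyway.
dataset = ET.SubElement(citation_list, "citation",
                        {"key": "ref2", "type": "dataset"})
ET.SubElement(dataset, "unstructured_citation").text = (
    "Smith J (2024). Example survey dataset, version 2."
)

print(ET.tostring(citation_list, encoding="unicode"))
```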
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we're delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It's a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
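If you need to find the ROR ID for a funding organisation before adding it to your metadata, the public ROR API can be searched by name. A minimal sketch, assuming the v1 search endpoint and its response shape (verify against the current ROR API documentation):

```python
import requests

# Look up candidate ROR records for a funder by name.
query = "Wellcome Trust"
resp = requests.get("https://api.ror.org/organizations",
                    params={"query": query}, timeout=30)
resp.raise_for_status()

for org in resp.json().get("items", [])[:5]:
    # Each record's "id" is the ROR ID that can now be used to identify
    # the funder in Crossref metadata in place of a Funder ID.
    print(org["id"], "-", org["name"])
```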
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world's least economically advantaged countries. Eligibility for the program is based on a member's country; our list of countries is largely based on the International Development Association (IDA) list. Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Crossref Labs loves to be the last to jump on an internet trend, so what better than to combine the Doge meme with altmetrics?
Note: The API calls below have been superseded by the development of the Event Data project. See the latest API documentation for equivalent functionality.
Want to know how many times a Crossref DOI is cited by the Wikipedia?
Back in 2011 PLOS released its awesome ALM system as open source software (OSS). At Crossref Labs, we thought it might be interesting to see what would happen if we ran our own instance of the system and loaded it up with a few Crossref DOIs. So we did. And the code fell over. Oops. Somehow it didn't like dealing with 10 million DOIs. Funny that.
But the beauty of OSS is that we were able to work with PLOS to scale the code to handle our volume of data. Crossref contracted with Cottage Labs, and we both worked with PLOS to make changes to the system. These eventually got fed back into the main ALM source on GitHub. Now everybody benefits from our work. Yay for OSS.
So if you want to know technical details, skip to Details for Propellerheads. But if you want to know why we did this, and what we plan to do with it, read on.
Why?
There are (cough) some problems in our industry that we can best solve with shared infrastructure. When publishers first put scholarly content online, they used to make bilateral reference linking agreements. These agreements allowed them to link citations using each other's proprietary reference linking APIs. But this system didn't scale. It was too time-consuming to negotiate all the agreements needed to link to other publishers. And linking through many proprietary citation APIs was too complex and too fragile. So the industry founded Crossref to create a common, cross-publisher citation linking API. Crossref has since obviated the need for bilateral linking arrangements.
So-called altmetrics look like they might have similar characteristics. You have ~4000 Crossref member publishers and N sources (e.g. Twitter, Mendeley, Facebook, CiteULike, etc.) where people use (e.g. discuss, bookmark, annotate, etc.) scholarly publications. Publishers could conceivably each choose to run their own system to collect this information. But if they did, they would face the following problems:
The N sources will be volatile. New ones will emerge. Old ones will vanish.
Each publisher will need to deal with each source's different APIs, rate limits, T&Cs, data licenses, etc. This is a logistical headache for both the publishers and for the sources.
If publishers use different systems which in turn look at different sources, it will be difficult to compare results across publishers.
If a journal moves from one publisher to another, then how are the metrics for that journal's articles going to follow the journal?

This isn't a complete list, but it shows that there might be some virtue in publishers sharing an infrastructure for collecting this data. But what about commercial providers? Couldn't they provide these ALM services? Of course - and some of them currently do. But normally they look on the actual collection of this data as a means to an end. The real value they provide is in the analysis, reporting and tools that they build on top of the data. Crossref has no interest in building front-ends to this data. If there is a role for us to play here, it is simply in the collection and distribution of the data.
But yes, it is still likely that some powerful people will come to lazy conclusions based on altmetrics. And following that, other lazy, unscrupulous and opportunistic people will attempt to game said metrics. We may even see an industry emerge to exploit this mess and provide the scholarly equivalent of SEO. Feh. Now I'm depressed and I need a drink.
So again, why is Crossref doing this? Though we have our doubts about how effective altmetrics will be in evaluating the quality of content, we do believe that they are a useful tool for understanding how scholarly content is used and interpreted. The most eloquent arguments against altmetrics for measuring quality inadvertently make the case for altmetrics as a tool for monitoring attention.
Critics of altmetrics point out that much of the attention that research receives outside of formal scholarly communications channels can be ascribed to:
Puffery. Researchers and/or university/publisher "PR wonks" over-promoting research results.
Innocent misinterpretation. A lay audience simply doesn't understand the research results.
Deliberate misinterpretation. Ideologues misrepresent research results to support their agendas.
Salaciousness. The research appears to be about sex, drugs, crime, video games or other popular bogeymen.
In short, scholarly research might be misinterpreted. Shock horror. Ban all metrics. Whew. That won't happen again.
Scholarly research has always been discussed outside of formal scholarly venues. Both by scholars themselves and by interested laity. Sometimes these discussions advance the scientific cause. Sometimes they undermine it. The University of Utah didn't depend on widespread Internet access or social networks to promote yet-to-be peer-reviewed claims about cold fusion. That was just old-fashioned analogue puffery. And the Internet played no role in the Laetrile or DMSO crazes of the 1980s. You see, there were once these things called "newspapers." And another thing called "television." And a sophisticated meatspace-based social network called a "town square."
But there are critical differences between then and now. As citizens get more access to the scholarly literature, it is far more likely that research is going to be discussed outside of formal scholarly venues. Now we can build tools to help researchers track these discussions. Now researchers can, if they need to, engage in the conversations as well. One would think that conscientious researchers would see it as their responsibility to remain engaged, to know how their research is being used. And especially to know when it is being misused.
That isn't to say that we expect researchers will welcome this task. We are no Pollyannas. Researchers are already famously overstretched. They barely have time to keep up with the formally published literature. It seems cruel to expect them to keep up with the firehose of the Internet as well.
Which gets us back to the value of altmetrics tools. Our hope is that, as altmetrics tools evolve, they will provide publishers and researchers with an efficient mechanism for monitoring the use of their content in non-traditional venues. Just in the way that citations were used before they were distorted into proxies for credit and kudos.
We don't think altmetrics are there yet. Partly because some parties are still tantalized by the prospect of substituting one metric for another. But mostly because the entire field is still nascent. People don't yet know how the information can be combined and used effectively. So we still make naive assumptions such as "link=like" and "more=better." Surely it will eventually occur to somebody that, instead, there may be a connection between repeated headline-grabbing research and academic fraud. A neuroscientist might be interested in a tool that alerts them if the MRI scans in their research paper are being misinterpreted on the web to promote neurobollocks. An immunologist may want to know if their research is being misused by the anti-vaccination movement. Perhaps the real value in gathering this data will be seen when somebody builds tools to help researchers DETECT puffery, social-citation cabals, and misinterpretation of research results?
But Crossref won't be building those tools. What we might be able to do is help others overcome another hurdle that blocks the development of more sophisticated tools: getting hold of the needed data in the first place. This is why we are dabbling in altmetrics.
Wikipedia is already the 8th largest referrer of Crossref DOIs. Note that this doesn't just mean that the Wikipedia cites lots of Crossref DOIs, it means that people actually click on and follow those DOIs to the scholarly literature. As scholarly communication transcends traditional outlets and as the audience for scholarly research broadens, we think that it will be more important for publishers and researchers to be aware of how their research is being discussed and used. They may even need to engage more with non-scholarly audiences. In order to do this, they need to be aware of the conversations. Crossref is providing this experimental data source in the hope that we can spur the development of more sophisticated tools for detecting and analyzing these conversations. Thankfully, this is an inexpensive experiment to conduct - largely thanks to the decision on the part of PLOS to open source its ALM code.
What Now?
Crossref's instance of PLOS's ALM code is an experiment. We mentioned that we had encountered scalability problems and that we had resolved some of them. But there are still big scalability issues to address. For example, assuming a response time of 1 second, if we wanted to poll the English-language version of the Wikipedia to see what had cited each of the 65 million DOIs held in Crossref, the process would take years to complete. But this is how the system is designed to work at the moment. It polls various source APIs to see if a particular DOI is "mentioned". Parallelizing the queries might reduce the amount of time it takes to poll the Wikipedia, but it doesn't reduce the work.

Another obvious way in which we could improve the scalability of the system is to add a push mechanism to supplement the pull mechanism. Instead of going out and polling the Wikipedia 65 million times, we could establish a "scholarly linkback" mechanism that would allow third parties to alert us when DOIs and other scholarly identifiers are referenced (e.g. cited, bookmarked, shared). If the Wikipedia used this, then even in an extreme case scenario (i.e. everything in Wikipedia cites at least one Crossref DOI), this would mean that we would only need to process ~4 million trackbacks.
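To put rough numbers on that, treating the one-second response time above as the cost of handling a single request in both scenarios:

```python
SECONDS_PER_REQUEST = 1          # assumed response time from above
DOIS = 65_000_000                # DOIs held in Crossref
TRACKBACKS = 4_000_000           # extreme-case push volume (roughly one per Wikipedia article)

poll_days = DOIS * SECONDS_PER_REQUEST / (60 * 60 * 24)
push_days = TRACKBACKS * SECONDS_PER_REQUEST / (60 * 60 * 24)

print(f"Polling every DOI once: ~{poll_days:.0f} days (~{poll_days / 365:.1f} years)")
print(f"Processing pushed trackbacks instead: ~{push_days:.0f} days")
```

That works out to roughly two years of continuous polling for a single pass over the corpus, versus well under two months to handle even the extreme-case push volume.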
The other significant advantage of adding a push API is that it would take the burden off of Crossref to know what sources we want to poll. At the moment, if a new source comes online, we'd need to know about it and build a custom plugin to poll their data. This needlessly disadvantages new tools and services as it means that their data will not be gathered until they are big enough for us to pay attention to. If the service in question addresses a niche of the scholarly ecosystem, they may never become big enough. But if we allow sources to push data to us using a common infrastructure, then new sources do not need to wait for us to take notice before they can participate in the system.
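No such push API exists yet, but a hypothetical sketch may help make the idea concrete. Every field name and the endpoint below are invented purely for illustration:

```python
import requests

# Hypothetical "scholarly linkback" payload a source might push when a page
# references a DOI. All field names and the endpoint URL are invented for
# illustration -- no such Crossref push API exists (yet).
event = {
    "source": "wikipedia",
    "action": "cited",
    "doi": "10.5555/12345678",
    "referrer": "https://en.wikipedia.org/wiki/Example_article",
    "occurred_at": "2014-02-20T12:00:00Z",
}

# requests.post("https://alm.example.org/linkbacks", json=event, timeout=10)
```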
Supporting (potentially) many new sources will raise another technical issue: tracking and maintaining the provenance of the data that we gather. The current ALM system does a pretty good job of keeping this data, but if we ever want third parties to be able to rely on the system, we probably need to extend the provenance information so that the data is cheaply and easily auditable.
Perhaps the most important thing we want to learn from running this experimental ALM instance is this: what would it take to run the system as a production service? What technical resources would it require? How could they be supported? And from this we hope to gain enough information to decide whether the service is worth running and, if so, by whom. Crossref is just one of several organizations that could run such a service, but it is not clear if it would be the best one. We hope that as we work with PLOS, our members and the rest of the scholarly community, we'll get a better idea of how such a service should be governed and sustained.
Details for Propellerheads
Warnings, Caveats and Weasel Words
The Crossref ALM instance is a Crossref Labs project. It is running on R&D equipment in a non-production environment administered by an orangutan on a diet of Red Bull and vodka.
So what is working?
The system has been initially loaded with 317,500+ Crossref DOIs representing publications from 2014. We will load more DOIs in reverse chronological order until we get bored or until the system falls over again.
We have activated the following sources:
PubMed
DataCite
PubMedCentral Europe Citations and Usage
We have data from the following sources, but they will need some work to achieve stability:
Facebook
Wikipedia
CiteULike
Twitter
Reddit
Some of them are faster than others. Some are more temperamental than others. WordPress, for example, seems to go into a sulk and shut itself off after approximately 1,300 API calls.
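In practice this means each source plugin has to be defensive about rate limits and flaky endpoints. A rough sketch of the kind of retry-with-backoff wrapper involved (the function and its defaults are illustrative, not code from the ALM system):

```python
import time
import requests

def fetch_with_backoff(url, params, max_retries=5):
    """Poll a source API, backing off exponentially when it sulks."""
    delay = 1
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 200:
            return resp.json()
        # Rate-limited or erroring: wait, then retry with a doubled delay.
        time.sleep(delay)
        delay *= 2
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```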
In any case, we will be monitoring and tweaking the sources as we gather data. We will also add new sources as the API keys we have requested come through. We will probably even create one or two new sources ourselves. Watch this blog and we'll update you as we add/tweak sources.
Dammit, shut up already and tell me how to query stuff.
PLOS has provided lovely detailed instructions for using the API, so please play with it and see what you make of it. On our side we will be looking at how we can improve performance and expand coverage. We don't promise much: the logistics here are formidable. As we said above, once you start working with millions of documents, the polling process starts to hit API walls quickly. But that is all part of the experiment. We appreciate your helping us and would like your feedback. We can be contacted at:
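As a quick-start postscript, here is a hedged sketch of the sort of query the ALM v5 API supports. The host is a placeholder, and the parameter names and response shape should be checked against the PLOS instructions mentioned above:

```python
import requests

# Query an ALM instance for the events gathered for one DOI. The host and
# api_key are placeholders; parameter names follow the ALM v5 API as
# documented by PLOS -- double-check them against those instructions.
HOST = "https://alm.example.org"       # placeholder, not a real instance
params = {
    "ids": "doi:10.5555/12345678",     # the DOI you want metrics for
    "api_key": "YOUR_API_KEY",
    "info": "detail",
}
resp = requests.get(f"{HOST}/api/v5/articles", params=params, timeout=30)
resp.raise_for_status()

for article in resp.json().get("data", []):
    for source in article.get("sources", []):
        print(source.get("name"), source.get("metrics", {}).get("total"))
```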