This year, metadata development is one of our key priorities, and we’re making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know that many publishers, particularly journal publishers, already collect this information, and we hope they will consider making this change and depositing citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfils the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of eligible countries is predominantly based on the International Development Association (IDA) list. Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Crossref holds metadata for approximately 150 million scholarly artifacts. These range from peer-reviewed journal articles to scholarly books to scientific blog posts. In fact, amid such heterogeneity, the only single factor that unites these items is that each has been assigned a digital object identifier (DOI): a unique identification string that can be used to resolve to a resource pertaining to said metadata (often, but not always, a copy of the work identified by the metadata).
What, though, do we actually know about the state of persistence of these links? How many DOIs resolve correctly? How many landing pages, at the other end of the DOI resolution, contain the information that is supposed to be there, including the title and the DOI itself? How can we find out?
The first and seemingly most obvious way to obtain some of these data is to work through the most recent sample of DOIs and attempt to fetch metadata from each of them using a standard Python script. This involves using the httpx library to attempt to resolve each of the DOIs to a resource, visiting that resource, and seeing what the landing page yields.
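For illustration, a minimal sketch of this kind of check might look like the following. The function, heuristics, and example DOI are illustrative rather than the exact script we ran; it assumes the httpx library and that we search a whitespace-normalised copy of the landing page for the title and for the DOI in its recommended display format (https://doi.org/…).

```python
import httpx

def check_doi(doi: str, title: str) -> dict:
    """Resolve a DOI via doi.org and look for the title and DOI on the landing page."""
    url = f"https://doi.org/{doi}"
    try:
        response = httpx.get(url, follow_redirects=True, timeout=30.0)
    except httpx.HTTPError:
        return {"doi": doi, "status": None, "title_found": False, "doi_found": False}

    content_type = response.headers.get("content-type", "")
    body = response.text if "html" in content_type else ""
    # Collapse whitespace so line breaks in the markup don't foil a substring search.
    page_text = " ".join(body.split()).lower()
    return {
        "doi": doi,
        "status": response.status_code,
        "is_pdf": "pdf" in content_type,
        "title_found": " ".join(title.split()).lower() in page_text,
        # The recommended display format is the full https://doi.org/... form.
        "doi_found": f"https://doi.org/{doi}".lower() in page_text,
    }

# 10.5555 is the DOI test prefix; the title here is purely illustrative.
print(check_doi("10.5555/12345678", "Toward a Unified Theory of High-Energy Metaphysics"))
```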
Even this is not straightforward. Landing pages can be HTML resources or they can be PDF files, among other things. In the case of PDF files, detecting a run of text is not simple, as a single line break can be enough to foil our search. Nonetheless, using this strategy we find the following statistics:
Total DOI count in sample: 5000
Number of HTTP 200 responses: 3301*
Percentage of HTTP 200 responses: 66.02%
Number of titles found on landing page: 1580
Percentage of titles found on landing page: 31.60%
Number of DOIs in recommended format found on landing page: 1410
Percentage of DOIs in recommended format found on landing page: 28.20%
Number of titles and DOIs found on landing page: 929
Percentage of titles and DOIs found on landing page: 18.58%
Number of PDFs found on landing page: 1469
Percentage of PDFs found on landing page: 29.38%
Percentage of PDFs found on landing pages that loaded: 44.50%
* an HTTP 200 response means that the web page loaded correctly
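For the PDF case mentioned above, one workaround is to extract the text of the whole document and collapse all whitespace before searching, so that a line break falling in the middle of a title no longer defeats the match. A rough sketch, assuming the pypdf library (our actual script may use a different extractor):

```python
from io import BytesIO

import httpx
from pypdf import PdfReader

def pdf_contains(pdf_url: str, needle: str) -> bool:
    """Download a PDF and check for a phrase, ignoring line breaks and spacing."""
    pdf_bytes = httpx.get(pdf_url, follow_redirects=True, timeout=60.0).content
    reader = PdfReader(BytesIO(pdf_bytes))
    # Join the text of every page, then collapse whitespace so that a line
    # break falling mid-phrase in the PDF layout doesn't defeat the search.
    text = " ".join(page.extract_text() or "" for page in reader.pages)
    normalised = " ".join(text.split()).lower()
    return " ".join(needle.split()).lower() in normalised
```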
While these numbers look quite low, the problem here is that a large number of scholarly publishers use Digital Rights Management techniques on their sites that block a crawl of this type. We can use systems like Playwright to remote-control browsers to do the crawling, so that the request looks as much as possible like it comes from a genuine user, in order to evade such detection systems. However, many of these sites detect headless browsers (where the browser is invisible and running on a server) and block them with a 403 Forbidden (permission denied) error. Projects such as Detect Headless enumerate the telltale signals that sites typically check:
User Agent: in a browser running under Puppeteer in headless mode, the user agent string includes “Headless”.
App Version: the same check as User Agent above.
Plugins: headless browsers don’t have any plugins, so a browser that reports plugins is headful; the reverse doesn’t hold, since some browsers, like Firefox, have no default plugins.
Plugins Prototype: checks whether the Plugin and PluginArray prototypes are correct.
Mime Type: similar to the Plugins test; headless browsers don’t report any MIME types.
Mime Type Prototype: checks whether the MimeType and MimeTypeArray prototypes are correct.
Languages: all headful browsers have at least one language, so a browser that reports no languages is headless.
Webdriver: this property (navigator.webdriver) is true when running in a headless browser.
Time Elapse: pops an alert() on the page; if it is dismissed too quickly, the browser is headless.
Chrome Element: specific to the Chrome browser, which exposes a window.chrome object.
Permission: in headless mode, Notification.permission and navigator.permissions.query report contradictory values.
Devtool: Puppeteer works over the DevTools protocol; this test checks whether DevTools is present.
Broken Image: all browsers report a default, nonzero size for a broken image, which may not be the case in a headless browser.
Outer Dimension: the attributes outerHeight and outerWidth have a value of 0 in a headless browser.
Connection Rtt: the attribute navigator.connection.rtt, if present, has a value of 0 in a headless browser.
Mouse Move: the attributes movementX and movementY on every MouseEvent have a value of 0 in a headless browser.
Using the stealth plugin for Playwright allows us to evade most of these checks. That just leaves Mouse Move and Broken Image detection, which I judged would not outweigh all the other factors. We can also jitter the connection with arbitrary delays so that requests appear to arrive at random intervals rather than as a robotic crawl.
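As a rough sketch under stated assumptions, driving the crawl through Playwright with the stealth plugin and jittered delays looks something like the code below. The playwright_stealth import and its stealth_sync helper are assumptions about the plugin’s Python packaging (names vary between versions), and the example DOI is illustrative; setting headless=False gives the headful configuration discussed below.

```python
import random
import time

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # helper name varies across plugin versions

def fetch_landing_page(doi: str, headless: bool = True) -> tuple[int | None, str]:
    """Resolve a DOI in a browser (headless or headful) and return (status, page HTML)."""
    url = f"https://doi.org/{doi}"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        stealth_sync(page)  # patch the signals listed above (webdriver, plugins, ...)
        # Jitter: sleep for a random interval so requests don't arrive at a robotic pace.
        time.sleep(random.uniform(1.0, 5.0))
        response = page.goto(url, wait_until="load", timeout=60_000)
        status = response.status if response else None
        html = page.content()
        browser.close()
    return status, html

# headless=False is the headful configuration that produced the second set of figures.
status, html = fetch_landing_page("10.5555/12345678", headless=False)
```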
Yet the basic fact is that we are still blocked from crawling many sites. This does not happen when we put the browser into headful mode, so current detection techniques have clearly evolved in the half decade since Detect Headless was designed.
If, however, we run the browser in headful mode, the results are strikingly different:
Total DOI count in sample: 5000
Number of HTTP 200 responses: 4852
Percentage of HTTP 200 responses: 97.04%
Number of titles found on landing page: 2547
Percentage of titles found on landing page: 50.94%
Number of DOIs in recommended format found on landing page: 2424
Percentage of DOIs in recommended format found on landing page: 48.48%
Number of titles and DOIs found on landing page: 1574
Percentage of titles and DOIs found on landing page: 31.48%
Number of PDFs found on landing page: 2085
Percentage of PDFs found on landing page: 41.70%
Percentage of PDFs found on landing pages that loaded: 42.97%
Let’s talk about the resolution statistics. Other studies, looking at general links on the web, have found a link-rot rate of about 60%-70% over a ten-year period (Lessig, Zittrain, and Albert 2014; Stox 2022). The DOI resolution rate we observe, with 97% of links resolving (a 3% link-rot rate), is far more robust than web links in general.
Is 3% a good or a bad number? It’s more robust than the web in general, but it still means that for every 100 DOIs, just under 3 will fail to resolve. We also cannot tell whether these DOIs are resolving to the correct target, except by using the metadata detection metrics (whether the title and DOI appear on the landing page, which we could only detect at a far lower rate). It is entirely possible for a website to respond with an HTTP 200 (OK) status, but for the page in question to be something very different from what the user expected, a phenomenon dubbed content drift. A good example is domain hijacking, where a domain name expires and a spam company buys it up. The DOI still resolves to a web page, but instead of an article on RNA, to take a hypothetical example, the user gets adverts for rubber welding hose. That said, the other studies are prone to the same problem, and there is no guarantee that content drift doesn’t affect a large proportion of their supposedly good links, too.
Of course, one of the most frustrating elements of this exercise is having to work around publisher blocks on content when visiting with a server-only robot script. It’s important for us to monitor the uptime rate of the DOI system periodically, and we recognise that publishers want to block malicious traffic. However, we can’t perform our monitoring in an easy, automatic way if headless scripts are blocked from resolving DOIs and visiting their respective landing pages. This is not even a call for open access; it’s just to say that current anti-bot techniques, sometimes implemented for legitimate reasons, stifle our ability to know the landscape. Even if the bot resolved a DOI only to a paywall, that would be easier for us to monitor than the current situation. Similarly, CAPTCHA and bot-challenge systems such as Cloudflare’s, which would seem to offer an easy way to distinguish between humans (good) and robots (bad), can make life very difficult at the monitoring end. We would certainly be grateful for any proposed solution that could help us to work around these mechanisms.
Conclusion
The context in which I wanted to know this information is so that we can take a snapshot of a page and then, at a later stage, determine whether it is down or has changed substantially. To do this, we are developing Shelob, an experimental content-drift spider system; it is what we’ve used so far to conduct this analysis. Over time, we hope Shelob will evolve to give us a way to detect when content has drifted or gone offline. If, however, we can’t detect whether an endpoint is good in the first place, then we likewise cannot detect when things have gone wrong. On the other hand, if, when we first visit, we find the DOI and title on the landing page, but at some future point this degrades, we might be able to say with some confidence that the original has died. I, personally, would encourage publishers not to block automated crawlers, because it is good for everyone when we can determine these kinds of figures.
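As a purely illustrative sketch of the snapshot-and-compare idea (not Shelob’s actual implementation), one could store a normalised copy of the landing page text on the first visit and later measure how far the page has diverged from it:

```python
from difflib import SequenceMatcher

def normalise(page_text: str) -> str:
    """Collapse whitespace and case so trivial layout changes don't register as drift."""
    return " ".join(page_text.split()).lower()

def drift_ratio(snapshot: str, current: str) -> float:
    """Return 0.0 for identical content, rising towards 1.0 as the page diverges."""
    return 1.0 - SequenceMatcher(None, normalise(snapshot), normalise(current)).ratio()

# Hypothetical snapshots captured on the first visit and on a later re-check.
saved_snapshot = "Example article title https://doi.org/10.5555/12345678 abstract text ..."
fetched_page = "This domain is for sale! Great deals on rubber welding hose ..."

if drift_ratio(saved_snapshot, fetched_page) > 0.5:  # the 0.5 threshold is arbitrary
    print("Possible content drift or takedown")
```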