I’m here in Toronto and looking forward to a busy week. Maddy Watson and I are in town for the 4:AM Altmetrics Conference, as well as the altmetrics17 workshop and Hack-day. I’ll be speaking at each, and for those of you who aren’t able to make it, I’ve combined both presentations into a handy blog post, which follows on from my last one.
But first, nothing beats a good demo. Take a look at our live stream. This shows the Events passing through Crossref Event Data, live, as they happen. You may need to wait a few seconds before you see anything.
Crossref and scholarly links
You may know about Crossref. If you don't, we are a non-profit organisation that works with Publishers (getting on for nine thousand) to register scholarly publications, issue Persistent Identifiers (DOIs) and maintain the infrastructure required to keep them working. If you don't know what a DOI is, it's a link that looks like this:
https://doi.org/10.5555/12345678
When you click on that, you’ll be taken to the landing page for that article. If the landing page moves, the DOI can be updated so you’re taken to the right place. This is why Crossref was created in the first place: to register Persistent Identifiers to combat link rot and to allow Publishers to work together and cite each other’s content. A DOI is a single, canonical identifier that can be used to refer to scholarly content.
Not only that: we combine those identifiers with metadata and links. Links to authors via ORCIDs, references and citations via DOIs, funding bodies and grant numbers, clinical trials… the list goes on. All of this data is provided by our members, and most of it is made available via our free API.
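If you want a feel for what that looks like, here is a minimal sketch of a REST API lookup in Python. It assumes the requests library; the DOI is an illustrative one on the test prefix, and the exact fields returned vary from record to record.

```python
import requests

# Fetch the metadata Crossref holds for a single DOI via the public REST API.
# 10.5555/12345678 is an illustrative DOI on the test prefix, not a real article.
response = requests.get("https://api.crossref.org/works/10.5555/12345678", timeout=10)
response.raise_for_status()

work = response.json()["message"]
print(work.get("title"))   # title(s) of the work
print(work.get("author"))  # author list, which may include ORCID iDs
print(work.get("funder"))  # funder names and award numbers, if deposited
```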
Because we are the central place where publishers register their content, and we've got approaching 100 million items of Registered Content, we thought that we could also curate and collect altmetrics-type data for our corpus of publications. After all, a reference from a Tweet to an article is a link, just like a citation between two articles is a link.
An Experiment
So, a few years back we thought we would try and track altmetrics for DOIs. This was done as a Crossref Labs experiment. We grabbed a copy of PLOS ALM (since renamed Lagotto), loaded a sample of DOIs into it and watched as it struggled to keep up.
It was a good experiment, as it showed that we weren’t asking exactly the right questions. There were a few things that didn’t quite fit. Firstly, it required every DOI to be loaded into it up-front, and, in some cases, for the article landing page for every DOI to be known. This doesn’t scale to tens of millions. Secondly, it had to scan over every DOI on a regular schedule and make an API query for each one. That doesn’t scale either. Thirdly, the kind of data it was requesting was usually in the form of a count. It asked the question:
“How many tweets are there for this article as of today?”
This fulfilled the original use case for PLOS ALM at PLOS. But when running it at Crossref, on behalf of every publisher out there, the results raised more questions than they answered. Which was good, because it was a Labs Experiment.
Asking the right question
The whole journey to Crossref Event Data has been a process of working out how to ask the right question. There are a number of ways in which “How many tweets are there for this article as of today?” isn’t the right question. It doesn’t answer:
Tweeted by who? What about bots?
Tweeted how? Original Tweets? Retweets?
What was tweeted? The DOI? The article landing page? Was there extra text?
When did the tweet occur?
We took one step closer toward the right question. Instead of asking “how many tweets for this article are there as of today” we asked:
“What activity is happening on Twitter concerning this article?”
If we record each activity we can include information that answers all of the above questions. So instead of collecting data like this:
| Registered Content | Source | Count | Date |
|---|---|---|---|
| 10.5555/12345678 | twitter | 20 | 2017-01-01 |
| 10.5555/87654321 | twitter | 5 | 2017-01-15 |
| 10.5555/12345678 | twitter | 23 | 2017-02-01 |
We’re collecting data like this:
| Subject | Relation | Object | Source | Date |
|---|---|---|---|---|
| twitter.com/tweet/1234 | references | 10.5555/12345678 | twitter | 2017-01-01 |
| twitter.com/tweet/5678 | references | 10.5555/987654321 | twitter | 2017-01-11 |
| twitter.com/tweet/9123 | references | 10.5555/12345678 | twitter | 2017-02-06 |
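One way to see the difference between the two shapes: the old per-DOI counts can always be derived from Events, but Events can never be reconstructed from counts. A rough Python sketch, using the illustrative rows from the table above:

```python
from collections import Counter

# Each observed link is a (subject, relation, object, source, date) record,
# taken from the illustrative rows in the table above.
events = [
    ("twitter.com/tweet/1234", "references", "10.5555/12345678", "twitter", "2017-01-01"),
    ("twitter.com/tweet/5678", "references", "10.5555/987654321", "twitter", "2017-01-11"),
    ("twitter.com/tweet/9123", "references", "10.5555/12345678", "twitter", "2017-02-06"),
]

# Collapsing Events back into per-DOI counts throws information away:
# who tweeted, what exactly they linked to, and when each link was made.
counts = Counter(obj for _subj, _rel, obj, _source, _date in events)
print(counts)  # Counter({'10.5555/12345678': 2, '10.5555/987654321': 1})
```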
Now we’re collecting individual links between tweets and DOIs, we’re closer to all the other kinds of links that we store. It’s like the “traditional” links that we already curate except:
It’s not provided by publishers, we have to go and collect it ourselves.
It comes from a very diverse range of places, e.g. Twitter, Wikipedia, Blogs, Reddit, random web pages
The places that the Events do come from don’t play by the normal rules. Web pages work differently to articles.
Non-traditional Publishing is Untraditional
This last point caused us to scratch our heads for a bit. We used to collect links within the ‘traditional’ scholarly literature. Generally, journal articles:
get published once
have a publisher looking after them, who can produce structured metadata
are subject to a formal process of retractions or updates
Now we’re collecting links between things that aren’t seen as ’traditional’ scholarship and don’t play by the rules.
The first thing we found is that blog authors don't reference the literature using DOIs. Instead they use article landing pages. This meant that we had to put in the work to collect links to article landing pages and turn them back into DOIs so that they could be referenced in a stable, link-rot-proof way.
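As an aside, one common way of getting from a landing page back to a DOI is to look in the page's embedded metadata. The sketch below illustrates the general idea (assuming the requests and BeautifulSoup libraries) rather than the exact behaviour of our agents; the meta tag names are common conventions, not a guarantee.

```python
import re

import requests
from bs4 import BeautifulSoup

DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")

def doi_from_landing_page(url):
    """Try to recover a DOI from a landing page's embedded <meta> tags.

    Illustrative only: real agents also have to cope with redirects, cookies,
    JavaScript, blocked requests, and pages with no usable metadata at all.
    """
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # citation_doi and dc.identifier are widely used conventions, not a rule.
    for name in ("citation_doi", "dc.identifier", "DC.identifier"):
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            match = DOI_PATTERN.search(tag["content"])
            if match:
                return match.group(0)
    return None
```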
When we looked at Wikipedia we noticed that, as pages are edited, references are added and removed all the time. If our data set reflected this, it would have to evolve over time, with items popping into existence and then vanishing again. This isn’t good.
Our position in the scholarly community is to provide data and infrastructure that others can use to create services, enrich the data, and build things on top. Curating an ever-changing data set, where things can disappear, is not a great idea and is hard to work with.
We realised that a plain old link store (also known as an assertion store, triple store, etc.) wasn't the right approach, as it didn't capture the nuance in the data with sufficient transparency. At least, it didn't tell the whole story.
We settled on a new architecture, and Crossref Event Data as we now know it was born. Instead of a dataset that changes over time, we have a continual stream of Events, where each Event tells a new part of the story. An Event is true at the time it is published, but if we find new information we don’t edit Events, we add new ones.
An Event is the way that we tell you that we observed a link. It includes the link, in "subject - relation type - object" format, but also much more. We realised that one question wouldn't do, so Events now answer the following questions:
What links to what?
How was the link made? Was it with an article's DOI or straight to an article landing page?
Which Agent collected it?
Which data source were they looking at?
When was the link observed?
When do we think the link actually happened?
What algorithms were used to collect it?
How do you know?
I’ll come back to the “how do you know” a bit later.
What is an altmetrics Event?
So, an Event is a package that contains a link plus lots of extra information required to interpret and make sense of it. But how do we choose what comprises an Event?
An Event is created every time we notice an interaction between something we can observe out on the web and a piece of registered content. This simple description gives rise to some interesting quirks.
It means that every time we see a tweet that mentions an article, for example, we create an Event. If a tweet mentions two articles, there are two Events. That means that "the number of Twitter Events" is not the same as "the number of tweets".
It means that every time we see a link to a piece of registered content in a webpage, we create an Event. The Event Data system currently tries to visit each webpage once, but we reserve the right to visit a webpage more than once. This means that the number of Events for a particular webpage doesn’t mean there are that many references.
We might go back and check a webpage in future to see if it still has the same links. If it does, we might generate a new set of Events to indicate that.
Because of the evolving nature of Wikipedia, we attempt to visit every page revision and document the links we find. This means that if a Wikipedia page has a very active edit history, we will see repeated Events pointing to the literature, one for every version of the page that makes references. So the number of Events from Wikipedia doesn't equal the number of references.
An Event is created every time we notice an interaction. Each source (Reddit, Wikipedia, Twitter, blogs, the web at large) has different quirks, and you need to understand the underlying source in order to understand the Events.
We put the choice into your hands.
If you want to create a metric based on counting things, you have a lot of decisions to make. Do you care about bots? Do you care about citation rings? Do you care about retweets? Do you care about whether people use DOIs or article landing pages? Do you care what text people included in their tweet? The answer to each of these questions means that you'll have to look at each data point and decide what weighting or score to put on it.
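As a toy illustration of what that might look like on your side, here is a sketch of a consumer-side scoring function. The field names, rules and weights below are entirely hypothetical; the point is that these judgements live with you, not with us.

```python
# A toy, consumer-side scoring function over simplified event records.
def score(event, known_bots):
    if event.get("author") in known_bots:
        return 0.0               # ignore known bot accounts entirely
    if event.get("is_retweet"):  # hypothetical flag marking retweets
        return 0.5               # down-weight retweets rather than ignore them
    return 1.0                   # count original tweets in full

sample = [
    {"author": "someone", "is_retweet": False},
    {"author": "some-bot", "is_retweet": False},
    {"author": "someone-else", "is_retweet": True},
]
print(sum(score(e, known_bots={"some-bot"}) for e in sample))  # 1.5
```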
If you wanted to measure how blogged about a particular article was, you would have to look at the blogs to work out if they all had unique content. For example, Google’s Blogger platform can publish the same blog post under multiple domain names.
A blog full of link spam is still a blog. You may be doing a study into reputable blogs, so you may want to whitelist a set of domain names to exclude less reputable blogs. Or you may be doing a study into blog spam, in which case lower-quality blogs are precisely what you're interested in.
If you wanted to measure how discussed an article was on Reddit, you might want to go to the conversation and see if people were actually talking about it, or whether it was an empty discussion. You might want to look at the author of the post to see if they were a regular poster, whether they were a bot or an active member of the community.
If you wanted to measure how referenced an article was in Wikipedia, you might want to look at the history of each reference to see if it was deleted immediately, or whether it survived for only 50% of the time, and weight it accordingly.
We don't do any scoring; we just record everything we observe. We know that everyone will have different needs, produce different outcomes and use different methodologies. So it's important that we tell you everything we know.
So that’s an Event. It’s not just a link, it’s the observation of a link, coupled with extra information to help you understand it.
How do you know?
But what if the Event isn’t enough? To come back to the earlier question, “how do you know?”
Events don’t exist in isolation. Data must be collected and processed. Each Agent in Crossref Event Data monitors a particular data source and feeds data into the system, which goes and retrieves webpages so it can make observations. Things can go wrong.
Any one of these things might prevent an Event from being collected:
We might not know about a particular DOI prefix immediately after it’s registered.
We might not know about a particular landing page domain for a new member immediately.
Article landing pages might not have the right metadata, so we can’t match them to DOIs.
Article landing pages might block the Crossref bot, so we can’t match DOIs.
Article landing pages might require cookies, or convoluted JavaScript, so the bot can’t get the content.
Blogs and webpages might require cookies or JavaScript to execute.
Blogs might block the Event Data bot.
A particular API might have been unavailable for a period of time.
We didn’t know about a particular blog newsfeed at the time.
This is a fact of life, and we can only operate on a best-effort basis. If we don’t have an Event, it doesn’t mean it didn’t happen.
This doesn’t mean that we just give up. Our system generates copious logs. It details every API call it made, the response it got, every scan it made, every URL it looked at. This amounts to about a gigabyte of data per day. If you want to find out why there was no Wikipedia data at a given point in time, you can go back to the log data and see what happened. If you want to see why there was no Event for an article by publisher X, you can look at the logs and see, for example, that Publisher X prevented the bot from visiting.
Every Event that does exist has a link to an Evidence Record, which corresponds with the logs. The Evidence Record tells you:
which version of the Agent was running
which Artifacts and versions it was working from
which API requests were made
which inputs looked like possible links
which matched or failed
which Events were generated
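In practice this means you can take any Event and follow its Evidence Record link to see exactly how it came to exist. A minimal sketch, assuming the requests library and the illustrative field names from earlier:

```python
import requests

# Given an Event (as a dict), follow its Evidence Record link to see how the
# Event was generated: agent version, Artifacts used, API calls, matches, Events.
def fetch_evidence(event):
    response = requests.get(event["evidence_record"], timeout=10)
    response.raise_for_status()
    return response.json()
```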
Artifacts are versioned files that contain information that Agents use. For example, there's a list of domain names, a list of DOI prefixes, a list of blog feed URLs, and so on. By indicating which versions of these Artifacts were used, we can explain why we visited a certain domain and not another.
All the code is open source. The Evidence Record says which version of each Agent was running so you can see precisely which algorithms were used to generate the data.
Between the Events, Evidence Records, Evidence Logs, Artifacts and Open Source software, we can pinpoint precisely how the system behaved and why. If you have any questions about how a given Event was (or wasn’t) generated, every byte of explanation is freely available.
This forms our "Transparency first" idea. We start the whole process with an open Artifact Registry. Open source software then produces open Evidence Records, which are in turn processed into Events. All the while, copious logs are being generated. We've designed the system to be transparent, and for each step to be open to inspection.
We’re currently in Beta. We have over thirty million Events in our API, and they’re just waiting for you to use them!