At the end of last year, we were excited to announce our renewed commitment to community and the launch of three cross-functional programs to guide and accelerate our work. We introduced this new approach to work towards better cross-team alignment, shared responsibility, and improved communication and learning, and to make more progress on the things members need.
This year, metadata development is one of our key priorities, and we’re making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider depositing citation types with their records.
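To give a rough sense of what this looks like in practice, here’s a small Python sketch that builds a citation element with a type attribute for a deposit file. The `citation`, `key`, and `unstructured_citation` names follow the existing deposit schema, but the exact attribute name and its allowed values in 5.4.0 are assumptions here, so please check the schema documentation before depositing.

```python
# Illustrative only: tagging a cited work with a type while building a
# citation_list fragment. The "type" attribute name and the value "dataset"
# are assumptions -- confirm both against the 5.4.0 schema documentation.
import xml.etree.ElementTree as ET

citation_list = ET.Element("citation_list")

citation = ET.SubElement(citation_list, "citation", key="ref1")
citation.set("type", "dataset")  # assumed attribute indicating the cited work's type

unstructured = ET.SubElement(citation, "unstructured_citation")
unstructured.text = "Example dataset citation without a DOI, 2023."

print(ET.tostring(citation_list, encoding="unicode"))
```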
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
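If you’re curious what working with the file looks like, here’s a minimal Python sketch that tallies record types across the unpacked archive. It assumes the archive expands into a set of gzipped JSON files, each containing an `items` array of records; the actual layout is described in the notes that accompany each year’s release, so treat the file names and structure below as assumptions.

```python
# A rough sketch of scanning the public data file after downloading and
# unpacking it. The directory name and the per-file structure (an "items"
# array of records, each with a "type" field) are assumptions for illustration.
import gzip
import json
from collections import Counter
from pathlib import Path

DATA_DIR = Path("crossref-public-data-file")  # hypothetical unpacked location
type_counts = Counter()

for part in sorted(DATA_DIR.glob("*.json.gz")):
    with gzip.open(part, "rt", encoding="utf-8") as fh:
        batch = json.load(fh)
    for record in batch.get("items", []):
        type_counts[record.get("type", "unknown")] += 1  # e.g. journal-article, dataset

print(type_counts.most_common(10))
```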
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019 to fulfil the long-standing need for an open organisation identifier.
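If you know a funding organisation’s name but not its ROR ID, you can look it up through ROR’s public search API before adding it to your metadata. The Python sketch below assumes the v1 search endpoint and its response shape (an `items` list whose entries carry an `id`); check the ROR API documentation for the current details.

```python
# Hedged sketch: find the ROR ID for a funder by name via the public ROR
# search API, so it can be used wherever a Funder ID was used before.
# The endpoint, query parameter, and response fields are assumptions based
# on ROR's documented v1 search API.
import requests

def lookup_ror_id(org_name: str) -> str | None:
    """Return the ROR ID of the best-matching organisation, or None if no match."""
    resp = requests.get(
        "https://api.ror.org/organizations",
        params={"query": org_name},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return items[0]["id"] if items else None  # e.g. "https://ror.org/..."

print(lookup_ror_id("Wellcome Trust"))
```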
In my blog post on October 6th, I promised an update on what caused the outage and what we are doing to avoid it happening again. This is that update.
Crossref hosts its services in a hybrid environment. Our original services are all hosted in a data center in Massachusetts, but we host new services with a cloud provider. We also have a few R&D systems hosted with Hetzner.
We know an organization our size has no business running its own data center, and we have been slowly moving services out of the data center and into the cloud.
For example, over the past nine months, we have moved our authentication service and our REST APIs to the cloud.
And we are working on moving the other existing services too. We are in the midst of moving Event Data, and our next target after that is the content registration system.
All new services are deployed to the cloud by default.
While moving services out of the data center, we have also been trying to shore up the data center to ensure it continues to function during the transition. One of the weaknesses we identified in the data center was that the same provider managed both our primary network connection and our backup connection (albeit on entirely different physical networks). We understood that we really needed a separate provider to ensure adequate redundancy, and we had already had a third network drop installed from a different provider. But, unfortunately, it had not yet been activated and connected.
Meanwhile, our original network provider for the first two connections informed us months ago that they would be doing some major work on our backup connection. However, they assured us that it would not affect the primary connection, something we confirmed with them repeatedly since we knew our replacement backup connection was not yet active.
But, the change our provider made did affect both the backup (as intended) and the primary (not intended). They were as surprised as we were, which kind of underscores why we want two separate providers as well as two separate network connections.
So both our primary and secondary networks went down while we had not yet activated our replacement secondary network.
Also, our only local infrastructure team member was in surgery at the time (He is fine. It was routine. Thanks for asking).
This meant we had to send a local developer to the data center, but the data center’s authentication process had changed since the last time said developer had visited (pre-pandemic). So, yeah, it took us a long time to even get into the data center.
By then, our infrastructure team member was out of surgery and on the phone with our network provider, who realized their mistake and reverted everything. This whole process (getting network connectivity restored, not the surgery) took almost two hours.
Unfortunately, the outage didn’t just affect services hosted in the data center. It also affected our cloud-hosted systems. This is because all of our requests were still routed to the data center first, after which those destined for the cloud were split out and redirected. This routing made sense when the bulk of our requests were for services hosted in the data center. But, within the past month, that calculus had shifted. Most of our requests now are for cloud-based services. We were scheduled to switch to routing traffic through our cloud provider first, and had this been in place, many of our services would have continued running during the data center outage.
It is very tempting to stop this explanation here and leave people with the impression that:
The root cause of the outage was the unpredicted interaction between the maintenance on our backup line and the functionality of our primary line;
Our slowness to respond was exclusively down to one of the two members of our infrastructure staff being (cough) indisposed at the time.
But the whole event uncovered several other issues as well.
Namely:
Even if one of our three lines had stayed active, the routers in the data center would not have cut over to the redundant working system because we had misconfigured them and we had not tested them;
We did not keep current documentation on the changing security processes for accessing the data center;
Our alerting system does not support the kind of escalation logic and coverage scheduling that would have allowed us to automatically detect when our primary data center administrator didn’t respond (being in surgery and all) and redirect alerts and warnings to secondary responders (see the sketch after this list); and
We need to accelerate our move out of the data center.
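To make that escalation point concrete, here is a generic sketch of the behaviour we mean: if the first person on the on-call schedule doesn’t acknowledge an alert within a set window, the alert moves to the next responder. This is an illustration only, not a description of our actual alerting setup or of any particular product.

```python
# Generic escalation sketch: walk the on-call schedule until someone
# acknowledges the alert, then stop; otherwise broadcast to everyone.
# Names and timings here are illustrative, not drawn from a real system.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Responder:
    name: str
    notify: Callable[[str], None]     # how to reach this person (page, SMS, email, ...)
    acknowledged: Callable[[], bool]  # has this person acknowledged the alert?

def escalate(alert: str, schedule: list[Responder], ack_window_seconds: int = 300) -> None:
    """Notify responders in order, escalating when an acknowledgement doesn't arrive in time."""
    for responder in schedule:
        responder.notify(alert)
        deadline = time.time() + ack_window_seconds
        while time.time() < deadline:
            if responder.acknowledged():
                return
            time.sleep(10)  # poll for an acknowledgement
    # Nobody acknowledged within their window: alert everyone on the schedule.
    for responder in schedule:
        responder.notify(f"UNACKNOWLEDGED: {alert}")
```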
What are we doing to address these issues?
Completing the installation of the backup connection with a second provider;
Scheduling a test of our router’s cutover processes where we will actually pull the plug on our primary connection to ensure that failover is working as intended. We will give users ample warning before conducting this test;
Revising our emergency contact procedures and updating our documentation for navigating our data center’s security process;
Replacing our alerting system with one that gives us better control over escalation rules; and
Adding a third FTE to the infrastructure team to help us accelerate our move to the cloud and to implement infrastructure management best practices.
October 6th, 2021, was a bad day. But we’ve learned from it. So if we have a bad day in the future, it will at least be different.