Blog

Outage of March 24, 2022

Geoffrey Bilder

Geoffrey Bilder – 2022 March 24

In Data Center, Post Mortem

So here I am, apologizing again. Have I mentioned that I hate computers?

We had a large data center outage. It lasted 17 hours, during which pretty much all Crossref services were unavailable: our main website, our content registration system, our reports, and our APIs. 17 hours was a long time for us, and it was also an inconvenient time for numerous members, service providers, integrators, and users. We apologise for this.

Update on the outage of October 6, 2021

Geoffrey Bilder

Geoffrey Bilder – 2021 October 27

In Data Center, Post Mortem

In my blog post on October 6th, I promised an update on what caused the outage and what we are doing to avoid it happening again. This is that update.

Crossref hosts its services in a hybrid environment. Our original services are all hosted in a data center in Massachusetts, but we host new services with a cloud provider. We also have a few R&D systems hosted with Hetzner.

We know an organization our size has no business running its own data center, and we have been slowly moving services out of the data center and into the cloud.

Outage of October 6, 2021

Geoffrey Bilder

Geoffrey Bilder – 2021 October 06

In Data Center, Post Mortem

On October 6 at ~14:00 UTC, our data centre outside of Boston, MA went down. This affected most of our network services, even ones not hosted in the data centre. The problem was that both our primary and backup network connections went down at the same time. We're not sure why yet; we are consulting with our network provider. It took us two hours to get our systems back online.

Lesson learned, the hard way: Let’s not do that again!

TL;DR

We missed an error that led to the resolution URLs of more than 500,000 records being incorrectly updated. We have reverted the incorrect resolution URLs affected by this problem, and we're putting checks and process changes in place to ensure this does not happen again.

How we got here

Our technical support team was contacted in late June by Wiley about updating the resolution URLs for their content. This is a common request of our technical support team, one meant to make the URL update process more efficient, but this was a particularly large one. Shortly thereafter, Atypon, on behalf of Wiley, provided us with nearly 1,200 separate files to update the resolution URLs of ~9 million records. We manually spot-checked over 50 of these files because, prior to this issue, our technical support team did not have a mechanism to check for errors automatically. That labor-intensive review did not turn up any problems; those 50 samples had none of the header errors that were found later.
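The kind of automated check that was missing here can be sketched in a few lines. This is a hypothetical illustration, not Crossref's actual tooling: it assumes the update files can be read as simple (DOI, URL) pairs, and it flags rows whose DOI or URL is obviously malformed before any update is applied.

```python
from urllib.parse import urlparse

def validate_resolution_urls(rows):
    """Return a list of (row_number, reason) problems found in
    (doi, url) pairs. These checks are illustrative; a real
    pipeline would also verify record ownership, file headers,
    and URL reachability."""
    problems = []
    for i, (doi, url) in enumerate(rows, start=1):
        # Crossref DOIs start with the "10." directory indicator.
        if not doi.startswith("10."):
            problems.append((i, f"malformed DOI: {doi!r}"))
            continue
        parsed = urlparse(url)
        # A usable resolution URL needs an http(s) scheme and a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append((i, f"malformed URL: {url!r}"))
    return problems

rows = [
    ("10.1000/example.1", "https://example.com/article/1"),
    ("10.1000/example.2", "htp://example.com/article/2"),  # typo in scheme
    ("not-a-doi", "https://example.com/article/3"),
]
for row, reason in validate_resolution_urls(rows):
    print(f"row {row}: {reason}")
```

Even a shallow pass like this, run over all ~1,200 files rather than a 50-file sample, would surface systematic errors that manual spot checks can miss.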