At the end of last year, we were excited to announce our renewed commitment to community and the launch of three cross-functional programs to guide and accelerate our work. We introduced this new approach to work towards better cross-team alignment, shared responsibility, and improved communication and learning, and to make more progress on the things members need.
This year, metadata development is one of our key priorities, and we’re making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple: we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider making this change to deposit citation types with their records.
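To make this concrete, here is a sketch of what a typed citation list might look like in a deposit. The element names follow our existing citation markup, but the exact type values (‘journal-article’, ‘dataset’) are illustrative; check the 5.4.0 schema documentation for the controlled list.

```xml
<citation_list>
  <!-- A typical journal article reference, now explicitly typed -->
  <citation key="ref1" type="journal-article">
    <journal_title>Example Journal</journal_title>
    <doi>10.5555/12345678</doi>
  </citation>
  <!-- A dataset reference with no identifier: the type still tells
       downstream metadata users what kind of object is being cited -->
  <citation key="ref2" type="dataset">
    <unstructured_citation>Smith, J. (2024). Survey responses, v2.
      Example Data Archive.</unstructured_citation>
  </citation>
</citation_list>
```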
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
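If you want to work with the file programmatically, here is a minimal sketch. It assumes the archive unpacks into a directory of gzipped JSON chunks, each holding an “items” array of records in the same format as our REST API; the exact layout can vary between releases, so check the README distributed with the file.

```python
import gzip
import json
from collections import Counter
from pathlib import Path

# Path to an unpacked copy of the annual public data file. Assumed layout
# (check the README shipped with the file): a directory of gzipped JSON
# chunks, each holding an "items" array of records in the same format
# returned by the Crossref REST API.
DATA_DIR = Path("crossref-public-data-file")

def iter_records(data_dir):
    """Yield one metadata record at a time, never loading the whole corpus."""
    for chunk in sorted(data_dir.glob("*.json.gz")):
        with gzip.open(chunk, "rt", encoding="utf-8") as fh:
            yield from json.load(fh).get("items", [])

# Example: tally records by type (journal-article, dataset, report, ...).
counts = Counter(record.get("type", "unknown") for record in iter_records(DATA_DIR))
print(counts.most_common(10))
```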
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
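As a sketch of what this looks like in practice, funding metadata that previously carried a Funder ID can now carry a ROR ID. The markup below follows the existing funding (FundRef) section of a deposit; the ROR ID shown is a placeholder and the exact placement should be treated as illustrative, so consult the schema documentation before depositing.

```xml
<fr:program xmlns:fr="http://www.crossref.org/fundref.xsd" name="fundref">
  <fr:assertion name="fundgroup">
    <fr:assertion name="funder_name">
      Example Science Foundation
      <!-- Previously this assertion held a Crossref Funder ID (a
           10.13039/... DOI); a ROR ID (placeholder shown) can now
           be used here instead -->
      <fr:assertion name="funder_identifier">https://ror.org/04fa4r544</fr:assertion>
    </fr:assertion>
    <fr:assertion name="award_number">ABC-12345</fr:assertion>
  </fr:assertion>
</fr:program>
```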
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, which fulfils the long-standing need for an open organisation identifier.
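ROR’s registry data is openly available through a public API, which is part of what makes a single open identifier attractive here. A quick sketch of looking up an organisation by name using the v1 search endpoint (no API key required):

```python
import json
import urllib.request

# Search the public ROR API (v1, no key required) for organizations
# matching a name; each match carries its ROR ID, primary name, and
# organization types.
url = "https://api.ror.org/organizations?query=california+digital+library"
with urllib.request.urlopen(url) as resp:
    results = json.load(resp)

for org in results.get("items", [])[:3]:
    print(org["id"], org["name"], org.get("types"))
```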
On Wednesday, October 2nd, 2019 we discovered that we had accidentally pushed the main Crossref system, as part of a Docker image, into a developer’s account on Docker Hub. The binaries and configuration files that made up the image included embedded passwords and API tokens that could have been used to compromise our systems and infrastructure. When we discovered this, we immediately secured the repo, changed all the passwords and secrets, and redeployed the system code. We have since been scanning all of our logs and systems to see if there has been any unusual activity that could be related to the exposure of the container.
Please note that no external data (e.g. member passwords or personal information) was exposed; our source code contains only internal passwords and ‘secrets’ such as API tokens.
Thankfully, the way in which these secrets were exposed (in compressed, binary files which were, in turn, in a Docker image) means that they were probably overlooked by the automated exploitation tools which focus on scanning source code. And, so far, we have seen nothing that would indicate that these passwords and secrets have been exploited. We will, of course, inform our members directly (and update this blog) if that changes.
More than you probably want to know
If you are continuing to read this, my guess is that you might have questions like:
Why are you doing something as silly as embedding secrets and passwords in your code?
And wait a minute… I thought Crossref code was open source?
And why is the director of strategic initiatives announcing this?
Let me answer these questions in random order.
In March 2019 I took over Crossref’s technical teams when Chuck Koscher announced that he would be retiring at the end of the year. I’m now the director of technology & research.
A few months earlier we had already concluded that a major portion of the Crossref system had accumulated 20 years of technical debt and that we were going to spend a significant portion of 2019 and 2020 paying down that debt.
Specifically, a lot of the code that runs Crossref was inherited from a third party who developed it back in the early 2000s. This means that, even though any new systems that we’ve developed since 2007 have been open-source, the code for the oldest parts of the system has remained closed because it contained potentially proprietary code as well as a lot of deprecated coding practices. Also, the architecture, the tooling, and the development processes behind the Crossref system had not changed much in those twenty years. It was fantastic architecture, tooling, and code for its time. But architectures that scale to millions of records need to change to handle hundreds of millions of records. Processes that work for configuring one service need to change when you are managing dozens of services. And support tools that work for a few hundred members break down when you are dealing with tens of thousands of members.
These parts of the Crossref system were decidedly not 12-factor. We were not using DevOps or SRE working practices to run them. And the bulk of that part of the system is still run in a traditional data center.
But since March we have been slowly fixing that. In incremental steps. Some of which are visible as a side effect of the security incident that precipitated this blog post. For example, one of our first moves was to move our development to GitLab. Even though a big chunk of the base Crossref code is still closed source, we saw moving to GitLab as a priority because it offers a fantastic suite of tools to help automate and manage our deployments. Similarly, we have been Dockerizing the Crossref system so that it is easier to scale and run in different environments. And as part of this effort, we have spent a lot of time on the issue of how best to handle secrets. We knew our secrets management in this part of the codebase was horrible, and we have been developing experiments and infrastructure for handling these secrets securely. But we haven’t finished this work yet. And so the system slipped out into a public repo too early. Ironically, this too illustrates a fundamental change in the way we develop things: our default is to be open and transparent. This case is currently an exception. It is an exception we want to eliminate, but we cannot do so until we have audited and scrubbed the code.
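For readers unfamiliar with the pattern we are moving towards: in a 12-factor setup, secrets live in the runtime environment, injected by the orchestrator or a secrets manager, rather than in the codebase or the image. A minimal sketch of the idea (the variable names are illustrative, not our actual configuration):

```python
import os

# 12-factor style configuration: connection details and credentials are
# read from the environment at runtime, so nothing sensitive is committed
# to the repository or baked into a Docker image. Names are illustrative.
DATABASE_URL = os.environ["DATABASE_URL"]            # fails fast if not injected
DEPOSIT_API_TOKEN = os.environ["DEPOSIT_API_TOKEN"]  # supplied by the orchestrator
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
```

The point of failing fast on a missing variable is that a misconfigured deployment refuses to start, rather than silently running with a hardcoded default.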
Yes, this incident has been embarrassing. But not nearly as embarrassing as the fact that Crossref has succumbed to a technology industry cliché: we spent so much time growing and focusing on new features for our members that we neglected some of the creaking foundations of our infrastructure.
And I should be clear about two things:
First, not all of our code is like this. We have, for a long time, been building open source software and using modern best practices for secrets management in our newer subsystems and services. The problems described above are confined to twenty-year-old code that we didn’t write in the first place and that we had been avoiding refactoring.
And second, the technology team has been marvelous at responding to the challenge we face. They have adopted new processes and tools. They are learning new techniques. We are steadily chipping away at these problems.
It is generally considered bad practice to praise or reward technology teams for fire-fighting instead of fire prevention, but this may be the exception that proves the rule.
I was blown away by how the technology, product, and support teams worked together. When we discovered this problem, I sat at my desk in rural France and watched as staff from the UK and all three US time zones shut down this problem in just a couple of hours. Obviously, I wish we hadn’t had the problem in the first place, but seeing their response did a great deal to encourage me that we are on the right track.
In any case, it looks like we’ve been lucky. And we’ll be working even harder to refactor our code, tools, and processes so that this kind of thing doesn’t happen again.