This year, metadata development is one of our key priorities, and we’re making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider making this change and depositing citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfils the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA) list. Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
One of the cool things about working in Crossref Labs is that interesting experiments come up from time to time. One experiment, entitled “what happens if you plot DOI referral domains on a chart?” turned into the Chronograph project. In case you missed it, Chronograph analyses our DOI resolution logs and shows how many times each DOI link was resolved per month, and also how many times a given domain referred traffic to DOI links per day.
We’ve released a new version of Chronograph. This post explains how it was put together. One for the programmers out there.
Big enough to be annoying
Chronograph sits on the boundary between normal-sized data and large-enough-to-be-annoying-size data. It doesn’t store data for all DOIs (it includes only those that are used on average once a day), but it has information on up to 1 million DOIs per month over about 5 years, and about 500 million data points in total.
Storing 500 million data points is within the capabilities of a well-configured database. In the first iteration of Chronograph a MySQL database was used. But that kind of data starts to get tricky to back up, move around and index.
Every month or two new data comes in for processing, and it needs to be uploaded and merged into the database. Indexes need to be updated. Disk space needs to be monitored. This can be tedious.
Key values
Because the data for a DOI is all retrieved at once, it can be stored together: instead of a table with one row for every DOI, date and count, all of a DOI’s date-count pairs can be serialised into a single value keyed by the DOI.
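As a rough illustration, here are the two layouts sketched in Python (the DOI, dates and counts are made up, and the CSV-style serialisation is an assumption rather than Chronograph’s actual format):

```python
# Row-per-observation layout: one record for every (DOI, month, count),
# i.e. hundreds of millions of small rows, each of which must be indexed.
rows = [
    ("10.5555/12345678", "2014-01", 40),
    ("10.5555/12345678", "2014-02", 91),
    ("10.5555/12345678", "2014-03", 152),
]

# Key-value layout: all of a DOI's observations serialised together and
# keyed by the DOI, so one lookup retrieves everything at once.
key_value = {
    "10.5555/12345678": "2014-01,40\n2014-02,91\n2014-03,152",
}
```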
This is much lighter on the indexes and takes much less space to store. However, it means that adding new data is expensive. Every time there’s new data for a month, the structure must be parsed, merged with the new data, serialised and stored again millions of times over.
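A sketch of what that per-key merge might look like, assuming the same made-up serialisation as above:

```python
def merge_month(stored_csv: str, month: str, count: int) -> str:
    """Parse a DOI's stored block, merge in one new month's figure, and
    re-serialise it. With this layout the same parse-merge-serialise cycle
    has to run for every DOI that gained data: millions of keys per update."""
    counts = dict(line.split(",") for line in stored_csv.splitlines())
    counts[month] = str(count)
    return "\n".join(f"{m},{c}" for m, c in sorted(counts.items()))

# e.g. merge_month("2014-01,40\n2014-02,91", "2014-03", 152)
```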
After trials with MySQL, MongoDB and MapDB, this approach was taken with MySQL in the original Chronograph.
Keep it Simple Storage Service Stupid
In the original version of Chronograph the data was processed using Apache Spark. There are various solutions for storing this kind of data, including Cassandra, time-series databases and so on.
The flip side of being able to do interesting experiments is wanting them to stick around without having to bother a sysadmin. The data is important to us, but we’d rather not have to worry about running another server and database if possible.
Chronograph fits into the category of ‘interesting’ rather than ‘mission-critical’ projects, so we’d rather not have to maintain expensive infrastructure if possible.
I decided to look into using Amazon Web Services Simple Storage Service (AWS S3) to store the data. S3 is essentially a key-value store, so it seemed like a good fit. It’s a great service because, as the name suggests, it’s a simple service for storing a large number of files. It’s cheap, and its capabilities and cost scale well.
However, storing and updating up to 80 million very small keys (one per DOI) isn’t very clever, and certainly isn’t practical. I looked at DynamoDB, but we would still face the overhead of making a large number of small updates.
Is it weird?
In these days of plentiful databases with cheap indexes (and by ‘these days’ I mean the 1970s onward) it seems somehow wrong to use plain old text files. However, the whole Hadoop “Big Data” movement was predicated on a return to batch processing files. Commoditisation of services like S3 and the shift to do more in the browser have precipitated a bit of a rethink. The movement to abandon LAMP stacks and use static site generators is picking up pace. The term ‘serverless architecture’ is hard to avoid if you read certain news sites.
Using Apache Spark (with its brilliant RDD concept) was useful for bootstrapping the data processing for Chronograph, but the new code has an entirely flat-file workflow. The simplicity of not having to unnecessarily maintain a Hadoop HDFS instance seems to be the right choice in this case.
Repurposing the Wheel
The solution was to use S3 as a big hash table to store the final data that’s served to users.
The processing pipeline uses flat files all the way through, from input log files to projections to aggregations. At the penultimate stage of the pipeline, a block of CSV is produced for each DOI, representing its date-value pairs.
There are 65,536 (0x0000 to 0xFFFF) possible files, each containing about a thousand DOIs’ worth of data.
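Here is a sketch of the write side under those assumptions, in Python (taking the first 16 bits of the DOI’s MD5 digest as the bucket name; the file layout within a bucket is my guess, not necessarily Chronograph’s):

```python
import hashlib
from collections import defaultdict

def bucket_for(doi: str) -> str:
    """First 16 bits of the DOI's MD5 digest, as four hex characters,
    giving 65,536 (0x0000 to 0xFFFF) possible bucket files."""
    return hashlib.md5(doi.encode("utf-8")).hexdigest()[:4]

def build_buckets(per_doi_csv: dict[str, str]) -> dict[str, str]:
    """Group each DOI's CSV block of date-value pairs into its bucket file.
    Blocks are separated by blank lines, with the DOI as the first line."""
    buckets = defaultdict(list)
    for doi, csv_block in per_doi_csv.items():
        buckets[bucket_for(doi)].append(f"{doi}\n{csv_block}")
    return {name: "\n\n".join(blocks) for name, blocks in buckets.items()}
```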
When the browser requests data for a DOI, the DOI is hashed and a request is made for the appropriate file in S3. The browser then performs a linear scan of the file to find the DOI it is looking for.
This is the simplest possible form of hash table: separate chaining, with a linear scan within each bucket. The hash function is a 16-bit mask of an MD5 digest, chosen because MD5 is readily available in the browser. It does a great job of evenly distributing the DOIs over all 65,536 possible files.
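In the real system this lookup happens in JavaScript in the browser, but the same logic reads roughly like this in Python (the S3 URL is a placeholder, and the bucket file format is the assumed one from the sketch above):

```python
import hashlib
from urllib.request import urlopen

# Placeholder bucket URL for illustration; not the real Chronograph bucket.
BASE_URL = "https://chronograph-data.example.s3.amazonaws.com"

def lookup(doi: str) -> str | None:
    """Hash the DOI to find its bucket, fetch that one file from S3,
    then linearly scan the file for the block belonging to the DOI."""
    bucket = hashlib.md5(doi.encode("utf-8")).hexdigest()[:4]
    body = urlopen(f"{BASE_URL}/{bucket}.csv").read().decode("utf-8")
    for block in body.split("\n\n"):       # one blank-line-separated block per DOI
        lines = block.splitlines()
        if lines and lines[0] == doi:
            return "\n".join(lines[1:])    # that DOI's date,value rows
    return None
```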
Striking the balance
In any data structure implementation, there are balances to be struck. Traditionally these concern memory layout, the shape of the data, practicalities of disk access and CPU cost.
In this instance, the factors in play included the number of buckets that need to be uploaded and the cost of the browser downloading an over-large bucket. The size of the bucket doesn’t matter much for CPU (as far as the user is concerned it takes about the same time to scan 10 entries as it does 10,000), but it does make a difference whether you ask the user to download a 10 kB bucket or a 10 MB one.
I struck the balance at 65,536 buckets, resulting in files of around 100 kB, which is the size of a medium-sized image.
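A back-of-envelope check on those figures, with an assumed average bytes-per-row, lands in the same ballpark:

```python
data_points = 500_000_000   # total date-value observations (from above)
buckets = 65_536            # the 16-bit hash space
bytes_per_row = 13          # assumed average for a line like "2014-03,152\n"

rows_per_bucket = data_points / buckets   # roughly 7,600 rows per file
print(f"~{rows_per_bucket * bytes_per_row / 1024:.0f} kB per bucket file")  # ~97 kB
```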
It works
The result is a simple system that allows people to look up data for millions of DOIs, without having to look after another server. It’s also portable to any other file storage service.