This year, metadata development is one of our key priorities, and we’re making a start with the release of version 5.4.0 of our input schema, which includes some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information, and we hope they will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of eligible countries is predominantly based on the International Development Association (IDA) list. Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Once a year, Crossref releases a public data file containing all of our public metadata. We typically release this as a tar file and distribute it via Academic Torrents.
Users of the Metadata Plus service can access similar data snapshots that are updated monthly. These are also tar files, but we distribute them via the REST API, with access controlled by a Plus API token.
In either case, these files are large and can be difficult to handle. This document provides you with tips that should make your life easier when handling Crossref public metadata files and Plus snapshots.
Downloading the public data file directly from AWS
Since 2023, the public data file has also been made available via a “Requester Pays” option to provide access for organisations that don’t permit downloads via torrent services. A copy is stored on AWS S3 in a bucket configured with the “Requester Pays” option. This means that rather than the bucket owner (Crossref) paying for bandwidth and transfer costs when downloading objects, the requester pays instead. The cost is expected to vary slightly year to year depending on variables like file size and end-user setups. The 2024 file is approximately 200 GB, and plugging that into this calculator results in an estimated cost of $18 USD (roughly 200 GB at around $0.09 per GB for data transfer out of S3). More information about “Requester Pays” can be found in the AWS documentation.
The bucket is called api-snapshots-reqpays-crossref. You can use either the AWS CLI or the AWS REST API to access it. There are code examples in the AWS documentation.
Using the AWS CLI, for example, after authenticating you could run:
# List the objects in the bucket
aws s3 ls --request-payer requester s3://api-snapshots-reqpays-crossref
# Download the public data file
aws s3api get-object --bucket api-snapshots-reqpays-crossref --request-payer requester --key March-2023-public-data-file-from-crossref.tar ./March-2023-public-data-file-from-crossref.tar
Note that the key part of the command is --request-payer requester, which is mandatory. Without that flag, the command will fail.
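If you prefer to do this from Python instead of the CLI, here is a minimal boto3 sketch. It assumes your AWS credentials are already configured; the object key below is the 2023 example above and changes with each release, so list the bucket first and adjust accordingly:

import boto3

BUCKET = "api-snapshots-reqpays-crossref"
s3 = boto3.client("s3")

# List the objects in the bucket (as the requester, you pay for the request and transfer)
listing = s3.list_objects_v2(Bucket=BUCKET, RequestPayer="requester")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the public data file
key = "March-2023-public-data-file-from-crossref.tar"
s3.download_file(BUCKET, key, key, ExtraArgs={"RequestPayer": "requester"})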
Handling tar files
Q: The tar file contains many files that, in turn, contain the individual DOI records. Some of these files are very large and hard to process. Could you break them out into separate files per DOI instead?
A: Yes, we could. But that creates its own set of problems. Standard filesystems on Linux/macOS/Windows really, really don’t like you to create hundreds of millions of small files on them. Even standard command-line tools like ls choke on directories with more than a few thousand files in them. Unless you are using a specialized filesystem formatted with custom inode settings optimized for hundreds of millions of files, saving each DOI as an individual record will bring you a world of hurt.
Q: Gah! The tar file is large and uncompressing it takes up a ton of room and generates a huge number of files. What can we do to make this easier? Can you split the tar file so we can manage it in batches?
A: Don’t uncompress or extract the tar file. You can read the files straight from the compressed tar file.
Q: But won’t reading files straight from the tar file be slow?
A: We did three tests, all on the same machine using the same tar file, which, at the time of writing, contained 42,210 files that, in turn, held records for 127,574,634 DOIs.
Test 1: Decompressing and untarring the file took about 82 minutes.
On the other hand…
Test 2: A Python script iterating over each filename in the tar file (without extracting and reading the file into memory) completed in just 29 minutes.
Test 3: A Python script iterating over each filename in the tar file, extracting each file, and reading it into memory completed in just 61 minutes.
Both of the above scripts worked in a single process. However, you could almost certainly further optimize by parallelizing reading the files from the tar file.
In short, the tar file is much easier to handle if you don’t decompress or extract it. Instead, read directly from the compressed tar file.
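Here is a minimal Python sketch of that streaming approach. It assumes each member file inside the tar is a JSON document with an items array of records (as in recent public data files) and handles gzip-compressed members; adjust the parsing if your snapshot is laid out differently:

import gzip
import json
import tarfile

SNAPSHOT = "March-2023-public-data-file-from-crossref.tar"  # or a Plus snapshot .tar.gz

# Stream records straight from the (possibly compressed) tar file without extracting it to disk
with tarfile.open(SNAPSHOT, "r:*") as tar:
    for member in tar:
        if not member.isfile():
            continue
        fh = tar.extractfile(member)
        if fh is None:
            continue
        raw = fh.read()
        if member.name.endswith(".gz"):
            raw = gzip.decompress(raw)
        batch = json.loads(raw)
        for record in batch.get("items", []):
            doi = record.get("DOI")
            # process the record here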
Downloading and using Plus snapshots
Q: How should I best use the snapshots? Can we get them more frequently than each month?
A: The monthly snapshots include all public Crossref metadata up to and including data for the month before they were released. We make them available to seed and occasionally refresh a local copy of the Crossref database in any system you are developing that requires Crossref metadata. In most cases, you should just keep this data current by using the Crossref REST API to retrieve new or modified records. Typically, only a small percentage of the snapshot changes from month to month. So if you are downloading it repeatedly, you are just downloading the same unchanged records time and time again. Occasionally, there will be a large number of changes in a month. This typically happens when:
A large Crossref member adds or updates a lot of records at once.
We add a new metadata element to the schema.
We change the way we calculate something (e.g. citation counts) and that affects a lot of records.
In these cases, it makes sense to refresh your metadata from the newly downloaded snapshot instead of using the API.
In short, if you are downloading the snapshot more than a few times a year, you are probably doing something very inefficient.
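For the incremental approach, here is a minimal Python sketch of one way to poll the REST API for records indexed since your last sync. The from-index-date filter, cursor-based paging, and mailto parameter are documented REST API features; the date and contact address below are placeholders:

import requests

BASE = "https://api.crossref.org/works"
params = {
    "filter": "from-index-date:2024-06-01",  # placeholder: the date of your last sync
    "rows": 1000,                            # maximum page size
    "cursor": "*",                           # start cursor-based deep paging
    "mailto": "you@example.org",             # placeholder contact address
}

while True:
    message = requests.get(BASE, params=params, timeout=60).json()["message"]
    items = message.get("items", [])
    if not items:
        break
    for record in items:
        pass  # update the record in your local copy here
    params["cursor"] = message["next-cursor"]  # token for the next page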
Q: The snapshot is large and difficult to download. I keep having it fail and have to start the download again. Can you split the snapshot so that I can download smaller parts instead?
A: If your download gets interrupted, you can resume the download from the point it got interrupted instead of starting over. This is easiest to do using something like wget.
But you can also do it with curl. You can try it yourself:
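For example, here is a sketch that assumes the monthly Plus snapshot endpoint and the Crossref-Plus-API-Token header described in the Plus documentation; substitute your own token (stored here in a CROSSREF_PLUS_API_TOKEN environment variable) and the snapshot you actually want:

# Download the latest monthly snapshot; -C - tells curl to resume from an existing partial file
curl -C - -L \
  -H "Crossref-Plus-API-Token: Bearer ${CROSSREF_PLUS_API_TOKEN}" \
  -o all.json.tar.gz \
  "https://api.crossref.org/snapshots/monthly/latest/all.json.tar.gz"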
If the download is interrupted, re-run the same command. Curl will calculate the byte offset from where it left off and continue the download from there.
Supplementary tools and alternative formats
In late 2023 we started experimenting with supplementary tools and alternative file formats intended to make our public data files easier to use for broader audiences.
The Crossref Data Dump Repacker is a Python application that allows you to repack the Crossref data dump into the JSON Lines format.
doi2sqlite is a tool for loading Crossref metadata into a SQLite database.
And for finding the record of a particular DOI, we’ve published a Python API for interacting with the annual public data files. This tool can create an index of the DOIs in the file, enabling easier record lookups without having to iterate over the entire file, which can take hours. A torrent is available for the 2024 index in SQLite format if you do not wish to generate it yourself.