Sponsors make Crossref membership accessible to organizations that would otherwise face barriers to joining us. They also provide support to facilitate participation, which increases the amount and diversity of metadata in the global Research Nexus. This in turn improves discoverability and transparency of scholarship behind the works.
We are looking to work with an individual or organization to perform an audit of, and propose changes to, the structure and information architecture underlying our website, with the aim of making it easier for everyone in our community to navigate the website and find the information they need.
Proposals will be evaluated on a rolling basis. We encourage submissions by May 15, 2025.
At the end of last year, we were excited to announce our renewed commitment to community and the launch of three cross-functional programs to guide and accelerate our work. We introduced this new approach to work towards better cross-team alignment, shared responsibility, and improved communication and learning, and to make more progress on the things members need.
This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, do collect this information already and will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Our metadata is used by thousands of services, researchers, and other organisations. We make it openly available through our APIs, which can be used to obtain a subset of records. If you want to work with our full corpus, the best way is to get a copy of the public data file and update it via the REST API with any new records created or changed since its release.
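As a rough illustration of the update step described above, the sketch below builds a REST API request for records indexed on or after a given date. It assumes the `from-index-date` filter and cursor-based paging on the `/works` endpoint are a suitable way to catch up; the helper name `build_update_url`, the date, and the email address are hypothetical placeholders, and the date should be replaced with the release date of the file you downloaded.

```python
from urllib.parse import urlencode

# Public works endpoint of the Crossref REST API.
API_BASE = "https://api.crossref.org/works"


def build_update_url(since_date, rows=1000, cursor="*", mailto=None):
    """Build a /works URL for records indexed on or after since_date.

    since_date is a YYYY-MM-DD string. Which date to use (the data
    file's release date) is an assumption you need to supply yourself.
    """
    params = {
        "filter": f"from-index-date:{since_date}",
        "rows": rows,      # page size
        "cursor": cursor,  # "*" starts a new cursor-based result set
    }
    if mailto:
        # Optional contact address identifying who is making the request.
        params["mailto"] = mailto
    return f"{API_BASE}?{urlencode(params)}"


# Hypothetical date and address, for illustration only.
print(build_update_url("2025-03-01", mailto="you@example.org"))
```

Each response should include a `next-cursor` value that can be passed back as `cursor` to fetch the following page, repeating until no more records are returned.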
By providing an annual copy of the full corpus, we also expand the ways in which the metadata can be used and interrogated. It is ideal for groups using large samples of the scholarly record, such as metaresearchers or research integrity experts. You can find examples of the public data file used in research on journal editorial practices and in projects investigating gaps in the scholarly record.
How to access the public data file
The total size of the file is 197 GB and it is available in JSON-lines format. We also provide an experimental tool to convert the file to an SQLite database. Before downloading the full dataset, you may wish to download the sample dataset containing 100 files (with 100 records in each, around 24 MB). This is a randomly sampled subset of metadata records and can be used for prototyping and development.
To get a copy of the annual data file you can access it directly via https://doi.org/10.13003/87bfgcee6g, or get the sample dataset and previous public data files from Academic Torrents. We make a donation to Academic Torrents to support their work, which allows the data to be accessible in this way. Some organisations have reported policies that prevent access to torrents, so we provide a copy that can be downloaded from AWS, which requires an AWS account and a small payment to cover the data transfer costs. You can find the details about access here.
We have some tips for working with the public data file. If you would like to have access to monthly snapshots of the whole corpus, along with higher API rate limits and other benefits, you can subscribe to Metadata Plus.
What’s different this year?
This year’s public data file contains an additional 9 million records, and many updates to previously deposited records. The formats and method of access are the same as last year, except that the file now uses JSON lines: each metadata record is on a single line, and the file suffix is jsonl instead of json. The records have also been sorted by DOI, which should make the file easier to navigate.
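Because each record sits on its own line, the file can be processed as a stream without loading it into memory. The following is a minimal sketch, assuming each line is a single JSON object that carries a `type` field as in REST API responses; the function names are hypothetical, and the gzip handling is an assumption in case the files you have are compressed.

```python
import gzip
import json
from collections import Counter
from pathlib import Path


def iter_records(path):
    """Yield one metadata record (a dict) per non-empty line of a JSON-lines file."""
    path = Path(path)
    # Assumption: files may be plain .jsonl or gzip-compressed .jsonl.gz.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_types(paths):
    """Count records by their 'type' field across several files."""
    counts = Counter()
    for path in paths:
        for record in iter_records(path):
            counts[record.get("type", "unknown")] += 1
    return counts
```

Trying this on the sample dataset first, for example with `count_types(Path("sample").glob("*.jsonl"))` where `sample` is a hypothetical local directory, is a quick way to check these assumptions before working with the full file.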
A change this year is that the file does not contain aliased DOIs, which are DOIs that are redirected to another DOI. Aliasing is necessary on rare occasions, for example when two DOIs are registered for the same content. Previously we haven’t indicated aliasing in the REST API and public data files; this year only the prime DOIs (the ones to which aliased DOIs are redirected) are included. This makes statistical analysis of the metadata more accurate, but beware that it may give different results in cases where many aliased DOIs were previously counted. See this community forum post for more details.
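If you are comparing an analysis against one built on an earlier file, one way to see where counts might shift is to list the DOIs that appear in your earlier DOI list but not in the current one. This is only a sketch with hypothetical function names: a DOI missing from the new file is a candidate alias, not proof of one, since absence alone does not tell you why a record is gone. DOIs are compared in lower case because they are case-insensitive.

```python
def normalize_doi(doi):
    """Lower-case and trim a DOI so that comparisons ignore case and stray spaces."""
    return doi.strip().lower()


def dois_no_longer_present(previous_dois, current_dois):
    """Return a sorted list of DOIs found in previous_dois but absent from current_dois."""
    current = {normalize_doi(d) for d in current_dois}
    previous = {normalize_doi(d) for d in previous_dois}
    return sorted(previous - current)
```

For the full corpus, holding both DOI sets in memory may not be practical; since this year's records are sorted by DOI, a streaming comparison against a sorted list could be an alternative.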
If you have questions, want to let us know how you will use the metadata, or want to discuss anything on the topic of retrieving Crossref metadata, head to our community forum. From there, you can also keep updated about changes to our schema and APIs.