Text and data mining for members


Even if you already have your own API, ours offers additional benefits: it is a common, standards-based API that works across all members and records. Requiring researchers to learn a different API for each member does not scale for TDM projects.

It is up to you to decide which formats to offer your full text in: some members offer PDF, others XML, and some plain text. Some members vary what they deliver depending on the age of the content or other variables. Our API does not provide automatic access to subscription content; that access is managed on your site using your existing access control systems.

As a member, you need to do two things to enable text and data mining for the metadata records that you have registered with us:

Include the link to full-text in the metadata for each DOI so researchers can follow it to access your content
Include a license URL in the metadata for each DOI so researchers can use this to find out if they have permission to carry out TDM with your content item

Register this information with us using a resource-only deposit or by uploading a .csv file containing the URLs and the related DOIs.
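
To illustrate why both pieces of metadata matter, here is a minimal sketch of how a TDM tool might use them once they are registered. It assumes the tool already holds a metadata record as a dictionary with `link` and `license` lists; the sample record, the DOI, the URLs, the field names, and the `ALLOWED_LICENSES` set are all hypothetical and stand in for whatever a real record contains.

```python
from typing import Optional

# Hypothetical record: the DOI, URLs, and field names are assumptions
# made for illustration, not copied from a real deposit.
SAMPLE_RECORD = {
    "DOI": "10.5555/example",
    "link": [
        {"URL": "https://publisher.example/fulltext/example.pdf",
         "content-type": "application/pdf"},
        {"URL": "https://publisher.example/fulltext/example.xml",
         "content-type": "application/xml"},
    ],
    "license": [
        {"URL": "https://creativecommons.org/licenses/by/4.0/"},
    ],
}

# Licenses that this hypothetical researcher has decided permit TDM.
ALLOWED_LICENSES = {"https://creativecommons.org/licenses/by/4.0/"}


def tdm_permitted(record: dict, allowed: set) -> bool:
    """Return True if any license URL on the record is in the allowed set."""
    return any(lic.get("URL") in allowed for lic in record.get("license", []))


def full_text_url(record: dict, preferred_type: str) -> Optional[str]:
    """Return the first full-text link of the preferred content type, if any."""
    for link in record.get("link", []):
        if link.get("content-type") == preferred_type:
            return link.get("URL")
    return None


if __name__ == "__main__":
    # Only follow the full-text link when the license check passes.
    if tdm_permitted(SAMPLE_RECORD, ALLOWED_LICENSES):
        print(full_text_url(SAMPLE_RECORD, "application/xml"))
```

Without the license URL the first check cannot succeed, and without the full-text link there is nothing to follow, which is why both need to be present for each DOI.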

If you are concerned about the impact of automated TDM harvesters on your site performance, you may choose to implement rate-limiting headers.

Rate limiting

TDM may increase the volume of traffic that your servers have to handle, because researchers often download large numbers of files in bulk. You can mitigate performance issues with rate limiting.

We have defined a set of standard HTTP headers that servers can use to convey rate-limiting information to automated text and data mining tools. Well-behaved TDM tools can look for these headers when they query member sites and pace their requests so as not to affect the site’s performance. The headers allow a member to define a rate limit window: a time span, such as a minute, an hour, or a day. The member can then specify:

Header name	Example value	Explanation
CR-TDM-Rate-Limit	1500	Maximum number of full-text downloads that are allowed to be performed in the defined rate limit window
CR-TDM-Rate-Limit-Remaining	76	Number of downloads left for the current rate limit window
CR-TDM-Rate-Limit-Reset	1378072800	Time (in UTC epoch seconds) at which the rate limit resets and a new rate limit window starts
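
On the client side, a well-behaved TDM tool could read these headers roughly as follows. This is a sketch under two assumptions: that `CR-TDM-Rate-Limit-Reset` carries a UTC epoch timestamp (as the example value 1378072800 suggests), and that the headers arrive in a mapping keyed by the exact names shown in the table. The function name `seconds_to_wait` is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str],
                    now: Optional[float] = None) -> float:
    """Return how long a client should pause before its next download.

    Returns 0.0 when downloads remain in the current window, or when the
    headers are absent or malformed (the site has published no limit).
    """
    now = time.time() if now is None else now
    try:
        remaining = int(headers["CR-TDM-Rate-Limit-Remaining"])
        reset_at = int(headers["CR-TDM-Rate-Limit-Reset"])
    except (KeyError, ValueError):
        return 0.0
    if remaining > 0:
        return 0.0
    # No downloads left: wait until the reset time, never a negative duration.
    return max(0.0, reset_at - now)
```

A harvester would call this after each response and sleep for the returned number of seconds before requesting the next file. HTTP header names are case-insensitive, so a real client should use a case-insensitive mapping rather than a plain dictionary.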

We do not provide or enforce this rate limiting - it’s up to you to implement it if required, and to define a rate limit appropriate for your servers.
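
One possible way for a member site to produce these headers is a fixed-window counter per client, sketched below. The class name `FixedWindowLimiter`, the in-memory dictionary, and the choice to align windows to multiples of the window length are all assumptions made for illustration; a production site would more likely keep the counters in shared storage and identify clients by whatever means its access control already uses.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Count downloads per client in fixed windows aligned to the epoch."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # client id -> (window start, downloads used in that window)
        self._usage: Dict[str, Tuple[int, int]] = {}

    def check(self, client_id: str) -> Tuple[bool, Dict[str, str]]:
        """Record one download attempt; return (allowed, headers)."""
        now = int(self.clock())
        window_start = now - (now % self.window)
        start, used = self._usage.get(client_id, (window_start, 0))
        if start != window_start:
            used = 0  # a new window has begun
        allowed = used < self.limit
        if allowed:
            used += 1
        self._usage[client_id] = (window_start, used)
        headers = {
            "CR-TDM-Rate-Limit": str(self.limit),
            "CR-TDM-Rate-Limit-Remaining": str(self.limit - used),
            "CR-TDM-Rate-Limit-Reset": str(window_start + self.window),
        }
        return allowed, headers
```

The site would call `check` for each full-text request, attach the returned headers to the response, and refuse the download when `allowed` is false. How to respond to a refused request is left to the caller.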

Example member site

We have created TinyPub to show an implementation of our API, including rate limiting and IP-based subscription access. You can download this code for reference, but please note that it’s just to illustrate the workings of the system, and is not intended for production.

Page owner: Martyn Rittman | Last updated 2020-April-08
