Helping researchers identify content they can text mine

Geoffrey Bilder – 2020 April 16

TL;DR

Many organisations are doing what they can to aid in the response to the COVID-19 pandemic. Crossref members can make it easier for researchers to identify, locate, and access content for text mining. In order to do this, members must include elements in their metadata that:

Point to the full text of the content.
Indicate that the content is available under an open access license or that it is being made available for free (gratis).

How to do it.

If your content is open access

Make sure the Crossref metadata for all of your open access content includes:

The URL of the open access license the content is under.
A URL that points to the full text of the content on your site (PDF, XML or HTML).

Instructions for including license and full text URLs in your metadata.
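As a rough illustration of the two requirements above, the sketch below checks whether a work record carries both a license URL and a full text URL. It assumes the record is a Python dict shaped like an item returned by Crossref's REST API, with `license` and `link` lists whose entries have a `URL` key. The helper name `has_license_and_full_text` and the sample record are hypothetical, not part of any Crossref tooling.

```python
from typing import Any, Mapping


def has_license_and_full_text(work: Mapping[str, Any]) -> bool:
    """Return True if the record has at least one license URL and one full text URL."""
    # Assumption: license and full text entries live under "license" and "link".
    licenses = work.get("license") or []
    links = work.get("link") or []
    has_license = any(entry.get("URL") for entry in licenses)
    has_link = any(entry.get("URL") for entry in links)
    return has_license and has_link


# Hypothetical record for illustration only (not real metadata).
example_work = {
    "DOI": "10.5555/example",
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
    "link": [
        {
            "URL": "https://publisher.example/fulltext/example.pdf",
            "content-type": "application/pdf",
        }
    ],
}

print(has_license_and_full_text(example_work))
```

A check like this could be run over a sample of your own records after a deposit to spot items that are missing one of the two URLs.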

If you are making subscription content available for text mining (temporarily or otherwise).

Make sure the Crossref metadata for the content you are making freely available for text mining includes:

The URL of the publisher license the content is under.
A URL that points to the full text of the content where it is being made freely available (PDF, XML or HTML). This might not be on your site.

Instructions for including license and full text URLs in your metadata.

In addition, you need to flag the content that you are making freely available by including:

A “free to read” element in the access indicators section of your metadata, indicating that the content is being made available free-of-charge (gratis).
An assertion element indicating that the content is available free-of-charge.

Instructions for flagging your content as “free”

Note that the assertion element (step #4, the last item listed above) is required in order for users to be able to find content marked as “gratis” in Crossref’s REST API.

If you decide to revoke the free access in the future, you will need to update the metadata to reflect that restrictions have been reimposed.
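To make the role of the “free” flag concrete, here is a minimal sketch that tests whether a record carries an assertion named `free`. It assumes the record is a dict shaped like a REST API work item, with an `assertion` list whose entries have a `name` key. The function name and both sample records are hypothetical.

```python
from typing import Any, Mapping


def is_flagged_free(work: Mapping[str, Any]) -> bool:
    """Return True if the record has an assertion whose name is "free"."""
    # Assumption: assertions appear as a list of dicts under "assertion".
    assertions = work.get("assertion") or []
    return any(entry.get("name") == "free" for entry in assertions)


# Hypothetical records for illustration only (not real metadata).
flagged = {
    "DOI": "10.5555/example-free",
    "assertion": [{"name": "free", "label": "Free to read"}],
}
not_flagged = {"DOI": "10.5555/example-closed", "assertion": []}

print(is_flagged_free(flagged), is_flagged_free(not_flagged))
```

If free access is later revoked and the assertion is removed from the metadata, a check like this would stop matching the record, which is why the metadata has to be updated when restrictions return.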

Sounds great. Has anybody else actually done this?

Yes.

Over 43 million metadata records already have a license and a full text link:

https://api.crossref.org/works?filter=has-license:true,has-full-text:true&rows=0
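The count query above can be assembled programmatically. The sketch below builds the same URL from a set of filter names and values using only the Python standard library. The helper name `build_count_url` is hypothetical; the filter names and the `rows=0` parameter are taken from the URL in this post.

```python
from typing import Mapping
from urllib.parse import urlencode

API_BASE = "https://api.crossref.org/works"


def build_count_url(filters: Mapping[str, str]) -> str:
    """Build a /works URL that asks only for a count (rows=0) of matching records."""
    # Filters are joined as name:value pairs separated by commas.
    filter_value = ",".join(f"{name}:{value}" for name, value in filters.items())
    # Keep ":" and "," unescaped so the URL matches the form used in this post.
    query = urlencode({"filter": filter_value, "rows": 0}, safe=":,")
    return f"{API_BASE}?{query}"


count_url = build_count_url({"has-license": "true", "has-full-text": "true"})
print(count_url)
```

Fetching that URL and reading the total from the JSON response is left out here so the sketch runs without network access.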

Millions of the above items have one of the Creative Commons licenses or a dedicated text and data mining license provided by the publisher.

And in the three weeks before this post was written, over 23,000 articles were flagged as “free” so that they are available for text mining:

https://api.crossref.org/v1/works?filter=assertion:free,has-full-text:true
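Once a query like the one above returns results, a text miner needs the full text URLs from each record. The sketch below pulls (DOI, URL) pairs out of a parsed response. It assumes the response is a dict with the records under `message` and then `items`, and that each record lists full text links under `link`. The function name and the sample response are hypothetical and exist only to make the example runnable offline.

```python
from typing import Any, Iterator, Mapping, Tuple


def full_text_links(response: Mapping[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (DOI, full text URL) pairs from a parsed /works response."""
    # Assumption: records are listed under response["message"]["items"].
    items = response.get("message", {}).get("items", [])
    for item in items:
        for link in item.get("link") or []:
            url = link.get("URL")
            if url:
                yield item.get("DOI", ""), url


# Hypothetical response for illustration only (not real API output).
sample_response = {
    "message": {
        "items": [
            {
                "DOI": "10.5555/example-1",
                "link": [{"URL": "https://publisher.example/fulltext/1.xml"}],
            },
            {"DOI": "10.5555/example-2", "link": []},
        ]
    }
}

for doi, url in full_text_links(sample_response):
    print(doi, url)
```

Records without any full text link are simply skipped, so the output lists only items a text mining tool could actually retrieve.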
