As part of our blog post series on the Crossref REST API, we talked to Silvio Peroni and David Shotton of OpenCitations (OC) about the work they’re doing, and how they’re using the Crossref REST API as part of their workflow.
Introducing OpenCitations
OpenCitations employs Semantic Web technologies to create an open repository of the citation data that publishers have made available. This repository, called the OpenCitations Corpus (OCC), contains RDF-based scholarly citation data that are made freely available so that others may use and build upon them. All the resources published by OC – namely the data within the OCC, the ontologies describing the data, and the software developed to build the OCC – are available to the public with open licenses.
What problem is your service trying to solve?
OC was started to address the lack of RDF-based open citation data. To our knowledge, when the project formally started with Jisc funding in 2010, the prototype OCC was the first RDF-based dataset of open citation data.
We collect accurate scholarly citation data derived from bibliographic references harvested from the scholarly literature, so as to make them available under a Creative Commons public domain dedication (CC0) by means of Semantic Web technologies, thus making them findable, accessible, interoperable, and re-usable, as well as structured, separable, and open.
The OCC resources are made available in different ways, so as to facilitate their reuse in different contexts: as monthly dumps, via the SPARQL endpoint, and by accessing them directly by means of the HTTP URIs of the stored resources (via content negotiation).
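As a rough illustration of the content negotiation route, the sketch below prepares an HTTP request that asks for a particular RDF serialization through the Accept header. The helper name, the example resource URI, and the list of media types are assumptions made for this example and are not taken from the OpenCitations documentation; the request is only built, not sent.

```python
from urllib.request import Request

# Media types for common RDF serializations. Which of these a given
# server actually offers is an assumption for this sketch.
RDF_MEDIA_TYPES = {
    "turtle": "text/turtle",
    "json-ld": "application/ld+json",
    "rdf-xml": "application/rdf+xml",
}


def build_rdf_request(resource_uri, fmt="turtle"):
    """Prepare (but do not send) a GET request for an RDF serialization.

    Hypothetical helper: content negotiation works by naming the desired
    media type in the Accept header of an ordinary HTTP request.
    """
    return Request(resource_uri, headers={"Accept": RDF_MEDIA_TYPES[fmt]})


# The URI below is an assumed example of a stored resource's HTTP URI.
req = build_rdf_request("https://w3id.org/oc/corpus/br/1", "json-ld")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the prepared request to urllib.request.urlopen would perform the actual retrieval; that step is left out here so the example runs without network access.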
Can you tell us how you are using the Crossref Metadata API at OpenCitations?
At present, basic citation information is retrieved from PubMed Central, and the Crossref API is then used to retrieve additional metadata describing the citing and cited articles, and to disambiguate bibliographic resources and agents by means of the identifiers retrieved (e.g., DOI, ISSN, ISBN, URL, and Crossref member URL). In future, we will retrieve full citation data directly from Crossref.
What metadata values do you pull from the API?
We pull the titles, subtitles, identifiers (e.g. DOI, ISSN, ISBN, URL, and Crossref member URL), author list, publisher, container resources (issue, volume, journal, book, etc.), publication year and pages.
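To make that list of fields concrete, here is a minimal sketch of how they could be read from a work record returned by the Crossref REST API. It is not the SPACIN code: the function names are hypothetical, and the sample record is invented for illustration (it uses a made-up DOI and made-up values). The JSON keys follow the shape of the Crossref works response.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_work(doi):
    """Fetch one work record from the Crossref REST API.

    Performs a network call, so it is defined here but not invoked.
    """
    url = "https://api.crossref.org/works/" + quote(doi)
    with urlopen(url) as response:
        return json.load(response)["message"]


def first(values):
    """Return the first item of a list, or None if the list is empty."""
    return values[0] if values else None


def extract_fields(message):
    """Pick out the fields named in the interview from a work record."""
    date_parts = message.get("issued", {}).get("date-parts") or [[None]]
    return {
        "title": first(message.get("title", [])),
        "subtitle": first(message.get("subtitle", [])),
        "doi": message.get("DOI"),
        "issn": message.get("ISSN", []),
        "isbn": message.get("ISBN", []),
        "url": message.get("URL"),
        "member": message.get("member"),
        "authors": [
            " ".join(part for part in (a.get("given"), a.get("family")) if part)
            for a in message.get("author", [])
        ],
        "publisher": message.get("publisher"),
        "container": first(message.get("container-title", [])),
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "year": date_parts[0][0] if date_parts[0] else None,
        "pages": message.get("page"),
    }


# Invented sample record, shaped like the "message" part of a response.
sample = {
    "DOI": "10.5555/12345678",
    "title": ["An Example Article"],
    "author": [{"given": "Ada", "family": "Example"}, {"family": "Consortium"}],
    "publisher": "Example Press",
    "container-title": ["Journal of Examples"],
    "ISSN": ["0000-0000"],
    "volume": "5",
    "issue": "2",
    "page": "10-20",
    "issued": {"date-parts": [[2017, 3]]},
    "URL": "http://dx.doi.org/10.5555/12345678",
    "member": "0000",
}

fields = extract_fields(sample)
print(fields["title"], fields["authors"], fields["year"])
```

Every lookup uses a default because many of these keys are optional in real records; a book chapter, for example, may have an ISBN but no ISSN or issue.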
Have you built your own interface to extract this data?
The SPAR Citation Indexer, a.k.a. SPACIN, is a script and a series of Python classes that allow one to process particular JSON files containing the bibliographic reference lists of papers, produced from the PubMed Central API by another script included in the OpenCitations GitHub repository.
SPACIN processes these JSON files and retrieves additional metadata about all the citing and cited articles by querying the Crossref API, among others. Once SPACIN has retrieved all these metadata, RDF resources are created (or reused, if they have already been added in the past) and stored in the file system in JSON-LD format. They are also uploaded to the OCC triplestore (via the SPARQL UPDATE protocol).
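The create-or-reuse step and the two output forms can be sketched as follows. This is an illustrative outline under stated assumptions, not SPACIN itself: the ResourceRegistry class, the example.org base IRI, and the choice of dcterms:title as the only property are all hypothetical, chosen to keep the example short.

```python
import json


class ResourceRegistry:
    """Hypothetical in-memory index that reuses a resource IRI per DOI."""

    def __init__(self, base="https://example.org/br/"):
        self.base = base
        self.by_doi = {}

    def get_or_create(self, doi):
        """Return (iri, created); DOIs are compared case-insensitively."""
        key = doi.lower()
        created = key not in self.by_doi
        if created:
            self.by_doi[key] = f"{self.base}{len(self.by_doi) + 1}"
        return self.by_doi[key], created


def to_jsonld(resource_iri, title):
    """Describe a bibliographic resource as a small JSON-LD document."""
    return {
        "@context": {
            "dcterms": "http://purl.org/dc/terms/",
            "fabio": "http://purl.org/spar/fabio/",
        },
        "@id": resource_iri,
        "@type": "fabio:Expression",
        "dcterms:title": title,
    }


def to_sparql_update(resource_iri, title):
    """Build a SPARQL 1.1 INSERT DATA request carrying the same statements."""
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "PREFIX fabio: <http://purl.org/spar/fabio/>\n"
        "INSERT DATA {\n"
        f"  <{resource_iri}> a fabio:Expression ;\n"
        f'    dcterms:title "{escaped}" .\n'
        "}"
    )


registry = ResourceRegistry()
iri, created = registry.get_or_create("10.5555/ABC")
same_iri, created_again = registry.get_or_create("10.5555/abc")

print(json.dumps(to_jsonld(iri, 'A "quoted" title'), indent=2))
print(to_sparql_update(iri, 'A "quoted" title'))
```

The second lookup returns the IRI created by the first, which is the reuse behaviour described above. Sending the update string to a triplestore endpoint is left out, since the endpoint address and any authentication are deployment-specific.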
How often do you extract/query data?
The entire OpenCitations ingestion workflow is running continuously, processing about half a million citations per month.
What do you do with the metadata once it’s pulled from the API?
All the metadata relevant to bibliographic entities are stored using the OCC metadata model. The ontological terms of this metadata model are collected within an ontology called the OpenCitations Ontology (OCO), which includes several terms from the SPAR Ontologies and other vocabularies. In particular, the following six bibliographic entity types occur in the datasets created by SPACIN:
bibliographic resources (br), class fabio:Expression – resources that either cite or are cited by other bibliographic resources (e.g. journal articles), or that contain such citing/cited resources (e.g. journals);
resource embodiments (re), class fabio:Manifestation – details of the physical or digital forms in which the bibliographic resources are made available by their publishers;
bibliographic entries (be), class biro:BibliographicReference – literal textual bibliographic entries occurring in the reference lists of bibliographic resources;
responsible agents (ra), class foaf:Agent – names of agents having certain roles with respect to the bibliographic resources (i.e. names of authors, editors, publishers, etc.);
agent roles (ar), class pro:RoleInTime – roles held by agents with respect to the bibliographic resources (e.g. author, editor, publisher);
identifiers (id), class datacite:Identifier – external identifiers (e.g. DOI, ORCID, PubMedID) associated with bibliographic resources and agents.
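The six entity types listed above can be captured in a small lookup table, shown here as a sketch. The short names and classes come from the list; the namespace IRIs are the commonly published ones for these vocabularies and are stated here as an assumption, and the expand helper is hypothetical.

```python
# Namespace IRIs for the prefixes used in the list (assumed, not quoted
# from the OCC metadata model documentation).
PREFIXES = {
    "fabio": "http://purl.org/spar/fabio/",
    "biro": "http://purl.org/spar/biro/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pro": "http://purl.org/spar/pro/",
    "datacite": "http://purl.org/spar/datacite/",
}

# Short name of each entity type mapped to its class.
ENTITY_TYPES = {
    "br": "fabio:Expression",            # bibliographic resources
    "re": "fabio:Manifestation",         # resource embodiments
    "be": "biro:BibliographicReference", # bibliographic entries
    "ra": "foaf:Agent",                  # responsible agents
    "ar": "pro:RoleInTime",              # agent roles
    "id": "datacite:Identifier",         # identifiers
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'foaf:Agent' into a full IRI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


for short_name, class_name in ENTITY_TYPES.items():
    print(short_name, expand(class_name))
```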
Do you have plans to enhance your metadata input?
We already handle additional information, such as ORCIDs, which is extracted by applying the ORCID API to the citing and cited articles included in the OCC. In addition, we are developing scripts to use all the new citation data Crossref now makes available as a consequence of the Initiative for Open Citations (I4OC).
What are the future plans for OpenCitations?
With funding received from the Alfred P. Sloan Foundation, we will shortly extend the current infrastructure and the rate of data ingest. Our immediate goal is to increase the ingestion rate of citation data from about half a million citations per month to about half a million citations per day. In addition, we plan to analyse the OCC so as to understand the quality of its current data, and to develop new user interfaces, including graph visualizations of citation networks, that will expand the means whereby users can interact with the OpenCitations data.
What else would you like to see our REST API offer?
We would like bibliographic resources (articles, journals, and so on) to be categorised according to their main discipline (Computer Science, Biology, etc.) and, eventually, by means of subject terms and/or keywords. Additionally, provision of authors’ institutional affiliations and funder information would be extremely valuable.
Thank you, Silvio and David!
If you are keen to share what you’re doing with our Metadata APIs, contact feedback@crossref.org and share your story.