I thought I could take this opportunity to demonstrate one evolution path from traditional record-based search to a more contemporary triple-based search. The aim is to show that these two modes of search do not have to be alternative approaches but can co-exist within a single workflow.
Let me first mention a couple of terms I’m using here: ‘graphs’ and ‘properties’. I’m using ‘property’ loosely to refer to the individual RDF statement (or triple) containing a property, i.e. a triple is a ‘(subject, property, value)’ assertion. And a ‘graph’ is just a collection of ‘properties’ (or, more properly, triples). Oh, and I’ll also use the term ‘records’ when considering ‘graphs’ as pre-fabricated objects returned within a result set.
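To make these terms concrete, here is a minimal Python sketch of a ‘property’ (triple), a ‘graph’, and a ‘record’. The names `Triple`, `prop` and `records` are my own illustrative choices, not part of any RDF library, and for simplicity the sketch assumes each subject has at most one value per property.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One 'property' in the loose sense used here: a (subject, property, value) assertion."""
    subject: str
    prop: str
    value: str


def records(graph):
    """Group the triples of a graph by subject, giving one 'record' (a dict) per subject.

    Assumes a single value per property; a repeated property would be overwritten.
    """
    grouped = defaultdict(dict)
    for triple in graph:
        grouped[triple.subject][triple.prop] = triple.value
    return dict(grouped)


# A 'graph' is just a collection of triples; a set keeps it unordered and free of duplicates.
graph = {
    Triple("doi:10.1038/221999a0", "dc:title", "Physics: The Intermediate Boson"),
    Triple("doi:10.1038/221999a0", "dc:identifier", "doi:10.1038/221999a0"),
    Triple("doi:10.1038/313506b0", "dc:title", "The nuts and bolts of bosons"),
}

for subject, record in records(graph).items():
    print(subject, record)
```

The same three triples can be read either as loose statements or as two pre-fabricated records, which is the distinction the rest of this post relies on.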
So, what do we have here? On the left we have a traditional means of disseminating search results, which is typically record-based. A new set of records may be generated by querying the API provided, whether proprietary or public, such as Lucene or SRU/CQL. We can thus consider this search service as a ‘record store’, even though records tend to be generated anew rather than retrieved. The individual records in the result set are collections or groupings of ‘properties’ about the subjects of the query. Note that this is somewhat similar to the way music is packaged for physical distribution, with many tracks (‘properties’) combined onto a single album (‘record’ or ‘graph’) that has a thematic coherence: either the same artist or a compilation around a given topic.
Digital music distribution, on the other hand, allows albums to be atomized so that individual tracks may be cherry-picked at will. This is not dissimilar to what happens in a ‘triple store’, where the basic properties (‘tracks’) that a regular search engine combines into a ‘record’ (‘album’) to present a search result can now be plucked apart and recombined into newer bespoke ensembles. Note that this querying and recombination can be applied across the full triple store, or even across this triple store and remote triple stores, since the same data model is applied. Certainly, at the data model level, federated searching thus becomes a non-issue.
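As a rough illustration of that recombination, the Python sketch below treats two triple stores as plain sets of tuples. The helper `bespoke`, the `ex:shelfmark` property and its value are all invented for the example; only the DOI and title come from the results discussed in this post.

```python
def bespoke(store, subject, wanted):
    """Pick only the wanted properties for one subject out of a collection of triples."""
    return {p: v for (s, p, v) in store if s == subject and p in wanted}


# Local triples, as might be compiled from the records in a local record store.
local_store = {
    ("doi:10.1038/313506b0", "dc:title", "The nuts and bolts of bosons"),
    ("doi:10.1038/313506b0", "dc:identifier", "doi:10.1038/313506b0"),
}

# Hypothetical remote triples about the same subject (property and value are made up).
remote_store = {
    ("doi:10.1038/313506b0", "ex:shelfmark", "illustrative-value"),
}

# Both sides share one data model, so combining the stores is just a set union.
combined = local_store | remote_store

# A user-chosen 'ensemble': one local property and one remote property, nothing else.
ensemble = bespoke(combined, "doi:10.1038/313506b0", {"dc:title", "ex:shelfmark"})
print(ensemble)
```

A real federated query would of course go over the network, but at the data model level nothing more than this union is needed.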
Suppose now that our search server (or record store) is an OpenSearch-type service, i.e. the result sets are distributed in some list-based format, typically RSS, and that the list-based format either provides an RDF graph or can be transformed into one. We could then use that as a basis for feeding an RDF triple store.
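One way such a transformation might look is sketched below in Python, using only the standard library. It assumes the feed is RSS 1.0 (that is, RDF-based, with `rdf:about` on each item) and uses the usual RSS 1.0, RDF and Dublin Core namespace URIs; the enclosing `rdf:RDF` wrapper and the function name `feed_to_triples` are my assumptions for the example, and a production service would more likely use a proper RDF parser.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"

# A one-item feed; the rdf:RDF wrapper and namespace declarations are assumed.
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://dx.doi.org/10.1038/221999a0">
    <title>Physics: The Intermediate Boson</title>
    <link>http://dx.doi.org/10.1038/221999a0</link>
    <dc:identifier>doi:10.1038/221999a0</dc:identifier>
    <dc:title>Physics: The Intermediate Boson</dc:title>
  </item>
</rdf:RDF>"""


def feed_to_triples(xml_text):
    """Flatten each <item> record into (subject, property, value) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for item in root.iter(f"{{{RSS}}}item"):
        subject = item.get(f"{{{RDF}}}about")
        for child in item:
            # ElementTree tags look like '{namespace}localname'; joining the two
            # parts gives a full property URI.
            prop = child.tag.replace("{", "").replace("}", "")
            triples.append((subject, prop, (child.text or "").strip()))
    return triples


for triple in feed_to_triples(FEED):
    print(triple)
```

Each record in the result set becomes a handful of triples sharing one subject, ready to be loaded into a triple store.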
So now, at right, we have a triple store, which is a large database of triples (or properties) compiled from all the records in the record store. And since this is a triple store, we can query it using SPARQL. For example, a trivial SPARQL query returns the first five articles (referenced by DOI) with a title containing the word ‘boson’:
--------------------------------------------------------------------------------------------------
| doi                   | title                                                                  |
==================================================================================================
| "10.1038/nature05513" | "Comparison of the Hanbury Brown–Twiss effect for bosons and fermions" |
| "10.1038/221999a0"    | "Physics: The Intermediate Boson"                                      |
| "10.1038/313506b0"    | "The nuts and bolts of bosons"                                         |
| "10.1038/301287a0"    | "The search for bosons: A golden year for the weak force"              |
| "10.1038/424003a"     | "Below-par performance hampers Fermilab quest for Higgs boson"         |
--------------------------------------------------------------------------------------------------
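What a query of this kind does can be sketched in plain Python: match every title triple, filter on the word, and stop at five. This is only an analogue of a SPARQL pattern-plus-filter query, not SPARQL itself. The property name `dc:title`, the function `titles_containing` and the one non-matching placeholder entry are assumptions added for illustration; the other rows are the results shown above.

```python
TITLE = "dc:title"  # assumed property name for article titles

# A list is used so that 'first five' has a defined order.
triples = [
    ("10.1038/nature05513", TITLE, "Comparison of the Hanbury Brown–Twiss effect for bosons and fermions"),
    ("10.1038/221999a0", TITLE, "Physics: The Intermediate Boson"),
    ("10.9999/placeholder", TITLE, "A made-up title that should not match"),  # hypothetical entry
    ("10.1038/313506b0", TITLE, "The nuts and bolts of bosons"),
    ("10.1038/301287a0", TITLE, "The search for bosons: A golden year for the weak force"),
    ("10.1038/424003a", TITLE, "Below-par performance hampers Fermilab quest for Higgs boson"),
]


def titles_containing(store, word, limit=5):
    """Return up to `limit` (doi, title) pairs whose title contains `word`, ignoring case."""
    matches = []
    for subject, prop, value in store:
        if prop == TITLE and word.lower() in value.lower():
            matches.append((subject, value))
            if len(matches) == limit:
                break
    return matches


for doi, title in titles_containing(triples, "boson"):
    print(doi, "|", title)
```

Note that the caller decides which properties come back (here just the DOI and title), rather than receiving whatever the server chose to bundle into a record.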
Now let’s contrast this with a conventional record-based search, such as shown at left. Finding the first five articles (referenced by DOI) with a title containing the word ‘boson’ would use an SRU/CQL query and would receive a set of result records (here RSS) like so:
...
<item rdf:about="http://dx.doi.org/10.1038/nature05513">
  <title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</title>
  <link>http://dx.doi.org/10.1038/nature05513</link>
  <dc:identifier>doi:10.1038/nature05513</dc:identifier>
  <dc:title>Comparison of the Hanbury Brown–Twiss effect for bosons and fermions</dc:title>
  ...
</item>
<item rdf:about="http://dx.doi.org/10.1038/221999a0">
  <title>Physics: The Intermediate Boson</title>
  <link>http://dx.doi.org/10.1038/221999a0</link>
  <dc:identifier>doi:10.1038/221999a0</dc:identifier>
  <dc:title>Physics: The Intermediate Boson</dc:title>
  ...
</item>
...
Note also that there is an interesting halfway house as shown in the diagram, whereby a set of result records presenting a single RDF graph can be queried as its own (very) restricted triple store.
In general, because a triple store is so primitive and can be queried alongside other triple stores, the queries that can be put together can be highly complex and customized with arbitrary data. The result of such a query differs from a traditional ‘record’, where a fixed property set is bound together in a presentation. Such a result is user-determined, as opposed to the server-determined nature of traditional result ‘records’.
I hope that this post has been able to show, in some degree, that although there are some obvious differences, there is nevertheless a synergy between these two modes of searching: prêt-à-porter and tailored.