This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, already collect this information and will consider making this change so they can deposit citation types with their records.
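As a rough illustration of what a typed citation could look like, here is a minimal Python sketch that builds a citation list with a 'type' attribute on each citation. The element names (citation_list, citation, unstructured_citation), the 'key' attribute, and the type values used here are assumptions made for illustration; check the 5.4.0 schema itself for the exact names and permitted values. The build_citation helper and the citation text are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_citation(key, citation_type, text):
    """Build a <citation> element carrying a 'type' attribute.

    Element and attribute names are assumptions for illustration,
    not a statement of what the schema requires.
    """
    citation = ET.Element("citation", {"key": key, "type": citation_type})
    unstructured = ET.SubElement(citation, "unstructured_citation")
    unstructured.text = text
    return citation


# Placeholder citations: the type values are illustrative only.
citation_list = ET.Element("citation_list")
citation_list.append(
    build_citation("ref1", "journal_article", "Example journal article citation")
)
citation_list.append(
    build_citation("ref2", "dataset", "Example dataset citation with no identifier")
)

print(ET.tostring(citation_list, encoding="unicode"))
```

The point of the sketch is only that the type travels with the citation itself, so a metadata user can tell a dataset citation from a journal article citation even when no identifier is present.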
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
(Update - 2007.08.28: I inadvertently missed out the term names in the last example of XMP as RDF/N3 with QNames and have now added these in. Also - a biggie - I said that PRISM had no XMP schema defined. This is actually wrong and as I blogged here today, the new PRISM 2.0 spec does indeed have a mapping of PRISM terms to XMP value types. Should actually have read the spec instead of just blogging about it earlier here. :~)
Having previously stooped to an extremely crass hack for pulling out a document information dictionary from PDFs (for which no apologies are sufficient, but it does often work), I feel I should make some kind of amends and mention the wonderful ExifTool by Phil Harvey for reading and writing metadata in media files. It is both a Perl library and a command-line application (so it’s cross-platform; a Windows .exe and a Mac OS .dmg are also provided). Besides handling EXIF tags in image files, this veritable Swiss Army knife of metadata inspectors can also read the information dictionary and the document XMP packet from PDFs. Moreover, and intriguingly, it can dump the raw (document) XMP packet.
I’m still experimenting with it. There are quite a number of features to explore. But some preliminary findings are listed below.
Taking one of our standard (metadata poor) PDFs we get this dump:
% exiftool nature05428.pdf
ExifTool Version Number : 6.95
File Name : nature05428.pdf
Directory : .
File Size : 367 kB
File Modification Date/Time : 2007:07:26 14:01:23
File Type : PDF
MIME Type : application/pdf
Page Count : 3
Producer : Acrobat Distiller 6.0.1 (Windows)
Mod Date : 2006:12:19 15:03:23+08:00
Creation Date : 2006:12:18 16:57:58+08:00
Creator : 3B2 Total Publishing System 7.51n/W
Creator Tool : 3B2 Total Publishing System 7.51n/W
Modify Date : 2006:12:19 15:03:23+08:00
Create Date : 2006:12:18 16:57:58+08:00
Metadata Date : 2006:12:19 15:03:23+08:00
Document ID : uuid:f598740b-ad11-41c5-a49e-7caffea783f0
Format : application/pdf
Title : untitled
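If you want to do anything with a dump like this beyond eyeballing it, the "Tag : Value" lines are easy enough to pull into a dictionary. What follows is a minimal Python sketch, not anything ExifTool itself provides: the parse_exiftool_dump function is a hypothetical helper, and it assumes that tag names never contain a colon, so everything before the first colon is the name and everything after it is the value.

```python
def parse_exiftool_dump(text):
    """Turn 'Tag : Value' lines into a dict (hypothetical helper).

    Assumes tag names contain no colon. Lines without a colon,
    such as the shell prompt line, are skipped. If a tag name
    repeats, the later value overwrites the earlier one.
    """
    tags = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Tag : Value' line
        tags[name.strip()] = value.strip()
    return tags


# A few lines taken from the dump above.
sample = """\
% exiftool nature05428.pdf
ExifTool Version Number : 6.95
File Name : nature05428.pdf
Page Count : 3
Mod Date : 2006:12:19 15:03:23+08:00
Document ID : uuid:f598740b-ad11-41c5-a49e-7caffea783f0
Title : untitled
"""

tags = parse_exiftool_dump(sample)
for name, value in tags.items():
    print(f"{name!r}: {value!r}")
```

Splitting on the first colon only is what keeps the date values, which are themselves full of colons, in one piece.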
By way of comparison, if we take a demo (metadata rich) PDF with added descriptive DC and PRISM metadata terms, we then get this dump:
Note that the DC and PRISM terms are encoded as in my earlier examples and do not take account of a) how DC is defined as an XMP schema (i.e. the XMP value types for the separate terms), or b) how PRISM might (because it isn’t yet) be defined as an XMP schema. Nor are identifier considerations fully taken into account. Nonetheless this gives more than an idea of what things could look like.
Now, with ExifTool it is also possible to list out the terms by group, e.g.
Note that this PDF also included XMP packets for illustrations but the tool extracted the main, or document, XMP packet.
And now that it’s easier to extract the metadata one can look to do something more interesting. For example, if one has cwm installed (Tim BL’s Closed World Machine for semweb dabblings - a Python application, so again cross-platform) one can pipe the XMP packet into cwm as RDF/XML, verify it as valid RDF and read it out in another format, e.g. RDF/N3. For the above example we can do this as follows.
But let me first define a pipeline to extract the XMP, apply a couple of filters to strip out processing instructions (these include the opening and closing <?xpacket> XMP PIs as well as an undocumented - legacy? - <?adobe> Adobe PI), and then feed the result into cwm as RDF/XML and read it out as RDF/N3. (Note that instead of ExifTool another tool could have been used to extract the XMP, e.g. something based on the sample apps shipped with the Adobe XMP SDK, or something bespoke.)
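The filtering step of such a pipeline can be sketched in a few lines of Python. This is an illustration only, not the pipeline itself: the strip_processing_instructions function is a hypothetical helper, the sample packet is made up, and the <?adobe-xap-filters?> PI in it is included just to stand in for the Adobe PI mentioned above. Note that the pattern removes every processing instruction, which would include an XML declaration if one were present. The sketch does not call ExifTool or cwm; it only checks that what remains is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET

# Matches any XML processing instruction, e.g. <?xpacket ...?> or <?adobe...?>.
PI_PATTERN = re.compile(r"<\?.*?\?>", re.DOTALL)


def strip_processing_instructions(xmp_packet):
    """Remove all processing instructions from an XMP packet string."""
    return PI_PATTERN.sub("", xmp_packet).strip()


# A made-up, minimal packet for illustration.
packet = (
    '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    '<?adobe-xap-filters esc="CRLF"?>\n'
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">\n'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
    '<rdf:Description rdf:about=""'
    ' xmlns:dc="http://purl.org/dc/elements/1.1/"'
    ' dc:format="application/pdf"/>\n'
    '</rdf:RDF>\n'
    '</x:xmpmeta>\n'
    '<?xpacket end="w"?>'
)

cleaned = strip_processing_instructions(packet)

# Parsing raises ParseError if the filtered packet is not well-formed.
root = ET.fromstring(cleaned)
rdf = root.find("{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF")

print(cleaned)
```

A regular expression is a blunt instrument for XML, but processing instructions cannot nest, so a non-greedy match from "<?" to the next "?>" is adequate for this particular job.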