Emboldened by my own researches, by the recent handle plugin announcement from CNRI (on which, more in a follow-on post), and by Alexander Griekspoor’s comment to my earlier post, I thought I’d write a more extensive piece about embedding metadata in PDF with a view to the following:
Discover what other publishers are currently doing
Stimulate discussions between content providers and/or consumers
Lay groundwork for Crossref best practice guidelines
Why should Crossref be interested? Well, at minimum to embed the DOI along with the digital asset would seem to be inherently “a good thing”. (And, in fact, this is precisely the approach that CNRI have taken for their plugin demos. I’ll look later at what they actually did and consider whether that is a model that Crossref publishers might usefully follow.)
Why include the DOI as an explicit piece of metadata rather than have it included by virtue of its appearance in a content section? The main reason is that it is then unambiguously accessible. Content sections in PDFs are typically filtered (and sometimes encrypted), whereas metadata is usually plain text and moreover is marked up as to field type.
Another question concerns whether to add in the identifier alone, or to embed a full metadata set. Why not just embed the identifier and visit Crossref for the metadata? This is feasible in some cases although it does involve an extra network trip, requires an application to service the identifier and is obviously not workable in offline contexts. Seems like a “no-brainer” to include a fuller description from the outset. Note that publishers frequently make some of this information available anyway in other metadata delivery channels, e.g. RSS feeds.
There are two (complementary) approaches to embedding document-level metadata in a PDF:
A - Document Information Dictionary
This is an optional object (a dictionary) referenced from the PDF trailer dictionary. Example:
1 0 obj
<<
/Title ( PostScript Language Reference, Third Edition )
/Author ( Adobe Systems Incorporated )
/Creator ( Adobe FrameMaker 5.5.3 for Power Macintosh® )
/Producer ( Acrobat Distiller 3.01 for Power Macintosh )
/CreationDate ( D:19970915110347-08'00' )
/ModDate ( D:19990209153925-08'00' )
>>
endobj
B - (Document) Metadata Stream
This is an optional object (a stream) referenced from the document catalog, itself referenced from the PDF trailer dictionary. Example:
Both approaches usually make the embedded metadata in the PDF available in the clear, whereas content is frequently filtered and sometimes encrypted. (Note that the information dictionary is always in the clear, while the metadata stream can be filtered and so rendered unreadable, although in practice it tends not to be.)
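To make the "in the clear" point concrete, here is a minimal Python sketch that pulls a value out of an information dictionary with a plain byte scan, no PDF parser involved. It assumes the dictionary is stored uncompressed and unencrypted, as described above, and that values are literal strings in parentheses. The function name `read_literal_string` is my own invention, and the sample bytes reuse the Adobe example shown earlier.

```python
import re


def read_literal_string(data: bytes, key: str):
    """Return the literal string stored under /key, or None if it is not found.

    Simplification: only the escapes \\(, \\) and \\\\ are decoded correctly;
    other PDF escapes (\\n, octal codes) are not interpreted.
    """
    match = re.search(rb"/" + re.escape(key.encode("ascii")) + rb"\s*\(", data)
    if match is None:
        return None
    out = bytearray()
    depth = 1  # we are already inside the opening parenthesis
    i = match.end()
    while i < len(data):
        c = data[i:i + 1]
        if c == b"\\" and i + 1 < len(data):
            out += data[i + 1:i + 2]  # keep the escaped character literally
            i += 2
            continue
        if c == b"(":
            depth += 1  # balanced, unescaped parentheses may be nested
        elif c == b")":
            depth -= 1
            if depth == 0:
                return out.decode("latin-1").strip()
        out += c
        i += 1
    return None  # unterminated string


sample = b"""1 0 obj
<<
/Title ( PostScript Language Reference, Third Edition )
/Author ( Adobe Systems Incorporated )
/CreationDate ( D:19970915110347-08'00' )
>>
endobj
"""

print(read_literal_string(sample, "Title"))
print(read_literal_string(sample, "Author"))
```

A scan like this is fragile compared with a real PDF library, but it shows why plain-text metadata is attractive: any tool that can read bytes can find the fields.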
Below I examine both approaches and see how they can be used to encode the kind of metadata that scholarly publishers are accustomed to.
A - Document Information Dictionary
Note that keys in the document information dictionary divide equally between the logical document description (non-asterisked keys) and the physical asset description (asterisked keys):
This is the complete listing of keys in the PDF specification, although foreign keys are allowed (and ignored).
What is missing here is any document identifier and/or any other descriptive metadata. From a Crossref point of view the identifier (the DOI) is a “hook” into the metadata record and so at minimum this could usefully be added. The question then is how? Either the identifier can be squeezed into one of the existing fields (“Title”, “Author”, “Subject”, “Keywords”) or else a new foreign key could be created.
IMO if an existing key is used then I would opt for “Subject” or “Keywords”, and probably the former. If, on the other hand, a new foreign key were to be created I would choose something generic and (in keeping with the other terms) use something like “Identifier” (rather than, say, “DOI”).
By preference, I think I would go for the latter (“Identifier”), but to make this more robust one could also add the DOI to a known term (e.g. “Subject” or “Keywords”). So, to include metadata for the news article “Cosmology: Ripples of early starlight”, published in Nature 445, 37 (2007), doi:10.1038/445037a, we might include the following terms in the document information dictionary:
where the bolded term represents a foreign key/value pair.
Note: This (including the DOI in the “Subject” field) is a fix intended to get the DOI listed by Adobe apps which would not otherwise recognize the foreign key “Identifier”.
Since it is not really feasible to include separate enumerated fields within the information dictionary (although it could be done), one might also consider including a descriptive citation field as a foreign key, e.g., something like:
/Source (Nature 445, 37 \(2007\))
Alternatively, that might better be presented as the “Subject” along with the DOI, which would then limit the number of foreign keys to one (“Identifier”).
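The options discussed in this section can be sketched in a few lines of Python that serialise a set of entries into information dictionary syntax. This is only an illustration under stated assumptions: the helper names `pdf_literal` and `info_dictionary` are hypothetical, the choice of which keys to populate is one of several possibilities weighed above, and the values are assumed to be plain ASCII (other text would need PDFDocEncoding or UTF-16).

```python
def pdf_literal(text: str) -> str:
    """Wrap text as a PDF literal string, escaping backslashes and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def info_dictionary(entries: dict) -> str:
    """Serialise key/value pairs as the body of an information dictionary."""
    lines = ["<<"]
    for key, value in entries.items():
        lines.append(f"/{key} {pdf_literal(value)}")
    lines.append(">>")
    return "\n".join(lines)


# Illustrative values for the Nature news article cited above.
# "Identifier" and "Source" are foreign keys; "Subject" repeats the DOI so
# that applications which ignore foreign keys can still display it.
info = info_dictionary({
    "Title": "Cosmology: Ripples of early starlight",
    "Subject": "doi:10.1038/445037a",
    "Identifier": "doi:10.1038/445037a",
    "Source": "Nature 445, 37 (2007)",
})
print(info)
```

Strictly speaking, balanced parentheses do not have to be escaped inside a PDF literal string, but escaping them is always safe and matches the “Source” example above.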
B - (Document) Metadata Stream
The metadata stream with its use of XMP packets (wrapping RDF/XML instances) is a much more flexible approach to embedding metadata and allows multiple schemas to be used. As noted in my previous post here on XMP, PDFs with XMP packets mostly use media-specific terms and schemas, although there is also a token showing of DC. From a descriptive metadata point of view we would more likely make use of DC and PRISM for our schemas.
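As a rough sketch of what such a packet might look like, the following Python builds an XMP packet carrying a handful of DC and PRISM terms for the Nature news article used as the example above. The function name `build_xmp_packet`, the choice of properties, and the PRISM namespace version are all my assumptions, not a recommended profile; the code uses only the standard library.

```python
import xml.etree.ElementTree as ET

X = "adobe:ns:meta/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
PRISM = "http://prismstandard.org/namespaces/basic/2.0/"  # assumed PRISM version
XML = "http://www.w3.org/XML/1998/namespace"

# Ask ElementTree to emit the conventional prefixes rather than ns0, ns1, ...
for prefix, uri in (("x", X), ("rdf", RDF), ("dc", DC), ("prism", PRISM)):
    ET.register_namespace(prefix, uri)


def build_xmp_packet(title, doi, journal, volume, start_page):
    """Return an XMP packet (as a string) describing one article."""
    xmpmeta = ET.Element(f"{{{X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})

    # XMP models dc:title as a language alternative (rdf:Alt of rdf:li).
    alt = ET.SubElement(ET.SubElement(desc, f"{{{DC}}}title"), f"{{{RDF}}}Alt")
    li = ET.SubElement(alt, f"{{{RDF}}}li", {f"{{{XML}}}lang": "x-default"})
    li.text = title

    # The remaining properties are simple text values.
    for tag, value in (
        (f"{{{DC}}}identifier", doi),
        (f"{{{PRISM}}}publicationName", journal),
        (f"{{{PRISM}}}volume", volume),
        (f"{{{PRISM}}}startingPage", start_page),
    ):
        ET.SubElement(desc, tag).text = value

    body = ET.tostring(xmpmeta, encoding="unicode")
    header = '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    return header + "\n" + body + "\n" + '<?xpacket end="w"?>'


packet = build_xmp_packet(
    title="Cosmology: Ripples of early starlight",
    doi="doi:10.1038/445037a",
    journal="Nature",
    volume="445",
    start_page="37",
)
print(packet)
```

The packet would still need to be written into the PDF as a metadata stream referenced from the document catalog, which this sketch does not attempt.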
Reprising the example from the previous post (and again using the citation example listed above), this would mean we may be inclined to include the following terms for a scholarly work (here in RDF/N3 for readability):
Note b): Ref. [3] is a fairly brief draft which covers both the Information Dictionary and Metadata Dictionary (XMP) approaches. There is an Adobe-hosted update to this document from June 2002 but that only discusses the XMP approach.