This year, metadata development is one of our key priorities and we’re making a start with the release of version 5.4.0 of our input schema with some long-awaited changes. This is the first in what will be a series of metadata schema updates.
What is in this update?
Publication typing for citations
This is fairly simple; we’ve added a ‘type’ attribute to the citations members supply. This means you can identify a journal article citation as a journal article, but more importantly, you can identify a dataset, software, blog post, or other citation that may not have an identifier assigned to it. This makes it easier for the many thousands of metadata users to connect these citations to identifiers. We know many publishers, particularly journal publishers, do collect this information already and will consider making this change to deposit citation types with their records.
Every year we release metadata for the full corpus of records registered with us, which can be downloaded for free in a single compressed file. This is one way in which we fulfil our mission to make metadata freely and widely available. By including the metadata of over 165 million research outputs from over 20,000 members worldwide and making them available in a standard format, we streamline access to metadata about scholarly objects such as journal articles, books, conference papers, preprints, research grants, standards, datasets, reports, blogs, and more.
Today, we’re delighted to let you know that Crossref members can now use ROR IDs to identify funders in any place where you currently use Funder IDs in your metadata. Funder IDs remain available, but this change allows publishers, service providers, and funders to streamline workflows and introduce efficiencies by using a single open identifier for both researcher affiliations and funding organizations.
As you probably know, the Research Organization Registry (ROR) is a global, community-led, carefully curated registry of open persistent identifiers for research organisations, including funding organisations. It’s a joint initiative led by the California Digital Library, DataCite, and Crossref, launched in 2019, that fulfills the long-standing need for an open organisation identifier.
We began our Global Equitable Membership (GEM) Program to provide greater membership equitability and accessibility to organizations in the world’s least economically advantaged countries. Eligibility for the program is based on a member’s country; our list of countries is predominantly based on the International Development Association (IDA). Eligible members pay no membership or content registration fees. The list undergoes periodic reviews, as countries may be added or removed over time as economic situations change.
Now, assuming XMP is a good idea - and I think on balance it is (as blogged earlier), why are we not seeing any metadata published in scholarly media files? The only drawbacks that occur to me are:
1. Hard to write - it’s too damn difficult, no tool support, etc.
2. Hard to model - the rigid, “simple” XMP data model both complicates and constrains the RDF data model
Well, I don’t really believe that 1) is too difficult to overcome. A little focus and ingenuity should do the trick. I do, however, think 2) is just a crazy straitjacket that Adobe is forcing us all to wear but if we have to live with that then so be it. Better in Bedlam than without. (RSS 1.0 wasn’t so much better but allowed us to do some useful things. And that came from the RDF community itself.) We could argue this till the cows come home but I don’t see any chance of any change any time soon.
So, putting the RDF issue aside for the moment (as if RDF didn’t have problems of its own - XML, URI, etc.) let’s just look at the options for writing the stuff. (Btw, I’m not referencing any tools or toolkits. This is just in the round.) There are various means of publishing metadata in XMP:
**Sidecar**
: XMP can be produced as standalone files - see [XMP Specification, (Sept. ’05)][3], p. 36. (These are called “sidecar” files if the file has the same name as the main document and is in the same directory.) The only things needed to produce these files are a text editor and a good grasp of the XMP serialization. A template will do for that. The main problem with a standalone file is that it does not travel with the media file and so risks being left behind.
Worth a note here. Not standalone as such but the [Mars][4] format (the draft XML formalization for PDF) discloses its metadata in an independent XMP file “metadata.xml” under the “META-INF/” directory. For distribution the whole directory structure is packaged up as a zip file and so the XMP is embedded in a “.mars” file, but accessed directly from the zip file or from the unpackaged directory the XMP can be manipulated just like any other XML document.
**Embedded**
: This is the normal means of distributing XMP - embedded within the media file. Some graphics formats are essentially linear (JPEG, PNG, GIF) and it is relatively straightforward to add in an XMP packet. Other formats (PDF, TIFF) have internal cross-referencing and are more difficult to deal with.
**Embedded + Sidecar**
: One possible method for dealing with the difficulty of writing XMP is to note that some media (especially PDFs) already have embedded XMP packets. As noted earlier, much if not all of the metadata in these XMP packets will be workflow-related and thus dispensable for final-form products where authority work-related metadata is desired. These packets may, or may not, be writeable and thus include additional padding whitespace. Even for read-only packets there is much (if not all) that can be discarded and also sometimes unnecessary bulk (e.g. default namespace declarations which are never used). _The bottom line is that any legacy XMP packet may typically be 2-3K in size and, just as in transplanting a cell nucleus, the XMP packet innards can be deftly substituted with a minimal XMP packet content, say 1K in size, which would be guaranteed to fit with suitable padding._ A packet that size would be sufficient to provide at minimum for a DOI and for a reference to additional metadata, e.g. a more complete standalone XMP packet. The two forms can coexist.
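To make the "nucleus transplant" concrete, here is a rough sketch in Python. It is only an illustration under some loud assumptions: the media file holds its XMP packet as plain, uncompressed, unencrypted bytes; the packet is delimited by the usual `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions; and the function name `substitute_packet`, the `MINIMAL` packet body and the DOI in it are placeholders of mine, not anything taken from a real toolkit. The point is that the replacement is padded with whitespace to exactly the old length, so no byte offsets elsewhere in the file move.

```python
import re

# The packet wrapper: a header and a trailer processing instruction.
HEADER = re.compile(rb"<\?xpacket begin=.*?\?>", re.DOTALL)
TRAILER = re.compile(rb"<\?xpacket end=.*?\?>", re.DOTALL)

# Hypothetical minimal packet body carrying just a DOI (placeholder value).
MINIMAL = (
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    b'<rdf:Description rdf:about="" '
    b'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b"<dc:identifier>doi:10.5555/12345678</dc:identifier>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
)


def substitute_packet(data: bytes, new_body: bytes) -> bytes:
    """Replace the body of the first XMP packet, keeping the byte length."""
    head = HEADER.search(data)
    if head is None:
        raise ValueError("no XMP packet header found")
    tail = TRAILER.search(data, head.end())
    if tail is None:
        raise ValueError("no XMP packet trailer found")

    # Space available between the end of the header and the start of the trailer.
    room = tail.start() - head.end()
    # Two bytes are reserved for the newlines around the new body.
    if len(new_body) + 2 > room:
        raise ValueError("new packet body does not fit in the old packet")

    # Pad with spaces so the packet occupies exactly the same number of bytes.
    padding = b" " * (room - len(new_body) - 2)
    filler = b"\n" + new_body + padding + b"\n"
    return data[: head.end()] + filler + data[tail.start():]
```

Something like `patched = substitute_packet(open("paper.pdf", "rb").read(), MINIMAL)` would then give bytes of the same length as the input (the file name is made up). If the packet sits inside a compressed or encrypted stream, a naive byte search like this will not find it, and a real PDF library is needed.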
The third option here allows embedding a minimal XMP packet into “difficult” packaging structures while pointing to a fully-formed XMP packet. The “simple” packaging structures may include a fully-formed XMP packet while also possibly referencing extended metadata sources as per my previous post [here][4].
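The fully-formed standalone packet that the minimal one points to is the easy part, and the "template will do" remark from the sidecar option can be sketched in a few lines of Python. Treat this as an illustration only: the `.xmp` extension, the function names `sidecar_path`, `render_sidecar` and `write_sidecar`, the choice of `dc:identifier` for the DOI and the DOI itself are my assumptions for the example, and a real record would carry far more than one property.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# A bare-bones template: one RDF description with a single identifier.
TEMPLATE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>{identifier}</dc:identifier>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
"""


def sidecar_path(media: Path) -> Path:
    """Same directory, same base name; the .xmp extension is an assumption."""
    return media.with_suffix(".xmp")


def render_sidecar(identifier: str) -> str:
    """Fill the template, escaping XML special characters in the value."""
    return TEMPLATE.format(identifier=escape(identifier))


def write_sidecar(media: Path, identifier: str) -> Path:
    """Write the sidecar next to the media file and return its path."""
    target = sidecar_path(media)
    target.write_text(render_sidecar(identifier), encoding="utf-8")
    return target
```

Calling `write_sidecar(Path("paper.pdf"), "doi:10.5555/12345678")` would leave a `paper.xmp` next to the PDF. Nothing ties the two files together beyond the shared name, which is exactly the "left behind" risk noted for sidecar files.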