Metadata in PDF: 2. Use Cases

Tony Hammond – 2007 August 01

Well, this is likely to be a fairly brief post as I’m not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven’t done much in this direction yet, although we are now beginning to look into it.

I’ll discuss a couple of cases found in the wild, but invite comment on others’ practices. Let me start, though, with the CNRI handle plugin demo for Acrobat, which I blogged here.

Handle Plugin

First off, the handle plugin PDF samples do include an embedded (test) DOI in both the document information dictionary

5 0 obj
<<
/CreationDate (D:20070614140125-04'00')
/Author (Simon)
/Creator (PScript5.dll Version 5.2.2)
/Producer (Acrobat Distiller 8.1.0 \(Windows\))
/ModDate (D:20070614140240-04'00')
/HDL (10.5555/pdftest-crossref)
/Title (Microsoft Word - crossref-rev.doc)
>>
endobj

and in the (document) metadata stream

<rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
<pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL>
</rdf:Description>

Barring any fuller disclosure of metadata terms at large (and one of the demo cases makes use of the DOI to retrieve metadata from Crossref), this is excellent. I would, however, quibble with the use of “HDL” as a foreign key for the information dictionary. I realize this is just a test, but the term “HDL” (or “DOI”, for that’s what it really is) is somewhat specific, and a more general term such as “Identifier” would probably have more mileage, e.g.

5 0 obj
<<
...
/Identifier (doi:10.5555/pdftest-crossref)
...
>>
endobj

In the second example, from the metadata stream, I don’t think the term “HDL” from the PDF extension schema “pdfx” is very helpful. (Is that namespace actually defined anywhere?) From a descriptive metadata viewpoint, a more usual schema such as Dublin Core (DC) would have wider coverage. So again the second example would be better rendered as

<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier>doi:10.5555/pdftest-crossref</dc:identifier>
</rdf:Description>

or, alternately,

<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier>info:hdl/10.5555/pdftest-crossref</dc:identifier>
</rdf:Description>
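
For a third party app, reading such an identifier back out should be simple. Here is a minimal sketch in Python, using only the standard library. It assumes the RDF/XML has already been extracted from the PDF, and that the fragment sits inside an rdf:RDF wrapper (as it would in a full metadata packet) so that the rdf prefix is declared. The function names and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample: the rdf:Description suggested above, wrapped in
# rdf:RDF so that the rdf prefix is declared and the XML parses.
SAMPLE_XMP = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier>doi:10.5555/pdftest-crossref</dc:identifier>
</rdf:Description>
</rdf:RDF>
"""


def bare_handle(identifier):
    """Strip a leading 'doi:' or 'info:hdl/' scheme, if present."""
    value = identifier.strip()
    for prefix in ("doi:", "info:hdl/"):
        if value.lower().startswith(prefix):
            return value[len(prefix):]
    return value


def identifiers_from_xmp(xmp_text):
    """Return every dc:identifier value in the RDF/XML, scheme removed."""
    root = ET.fromstring(xmp_text)
    tag = "{%s}identifier" % DC_NS
    return [bare_handle(el.text) for el in root.iter(tag) if el.text]


print(identifiers_from_xmp(SAMPLE_XMP))
```

Both renderings above (the “doi:” form and the “info:hdl/” form) reduce to the same bare handle, so a consumer need not care which one the publisher chose.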

Elsevier

Well, we have Alexander Griekspoor’s comment earlier that Elsevier are including the DOI in their PDFs. I don’t know how consistently this is being done, but I’ve checked a couple of sample articles and it would seem that they have embedded the DOI (here from Cancer Cell, doi:10.1016/j.ccr.2007.06.004) in the title element, which shows up in the information dictionary as

361 0 obj
<<
/Producer (Adobe LiveCycle PDFG 7.2)
/Creator (Elsevier)
/Author ()
/Keywords ()
/Title (doi:10.1016/j.ccr.2007.06.004)
/ModDate (D:20070630031637+05'30')
/Subject ()
/CreationDate (D:00000101000000Z)
>>
endobj

and in the (document) metadata dictionary as

365 0 obj
<<
/Type /Metadata
/Subtype /XML
/Length 1526
>>
stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1526'?>
 
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:iX='http://ns.adobe.com/iX/1.0/'>
 
<rdf:Description about=''
xmlns='http://ns.adobe.com/pdf/1.3/'
xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
<pdf:Producer>Adobe LiveCycle PDFG 7.2</pdf:Producer>
<pdf:ModDate>2007-06-30T03:16:37+05:30</pdf:ModDate>
<pdf:Title>doi:10.1016/j.ccr.2007.06.004</pdf:Title>
<pdf:Creator>Elsevier</pdf:Creator>
<pdf:Author></pdf:Author>
<pdf:Keywords></pdf:Keywords>
<pdf:Subject></pdf:Subject>
<pdf:CreationDate>0-01-01T00:00:00Z</pdf:CreationDate>
</rdf:Description>
 
<rdf:Description about=''
xmlns='http://ns.adobe.com/xap/1.0/'
xmlns:xap='http://ns.adobe.com/xap/1.0/'>
<xap:CreatorTool>Elsevier</xap:CreatorTool>
<xap:ModifyDate>2007-06-30T03:16:37+05:30</xap:ModifyDate>
<xap:Title>
<rdf:Alt>
<rdf:li xml:lang='x-default'>doi:10.1016/j.ccr.2007.06.004</rdf:li>
</rdf:Alt>
</xap:Title>
<xap:Author></xap:Author>
<xap:Description>
<rdf:Alt>
<rdf:li xml:lang='x-default'/>
</rdf:Alt>
</xap:Description>
<xap:CreateDate>0-01-01T00:00:00Z</xap:CreateDate>
<xap:MetadataDate>2007-06-30T03:16:37+05:30</xap:MetadataDate>
</rdf:Description>
 
<rdf:Description about=''
xmlns='http://purl.org/dc/elements/1.1/'
xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:title>doi:10.1016/j.ccr.2007.06.004</dc:title>
<dc:creator/>
<dc:description/>
</rdf:Description>
 
</rdf:RDF>
<?xpacket end='r'?>
endstream
endobj

Kudos anyway to Elsevier for embedding this piece of information in their PDFs (if indeed it is a general practice). This has the merit of being picked up by Adobe apps and displayed in, e.g., Reader. Third party apps can also pull it and use it to retrieve the metadata record from Crossref.
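
To make that last point concrete, here is a rough Python sketch of what a third party app might do with the Elsevier sample. It is illustrative only. The regular expression for the “Title” entry is deliberately naive (it ignores escaped or nested parentheses in PDF strings), the function names are invented, and the lookup URL assumes the Crossref REST API “works” endpoint. No network request is made.

```python
import re
from urllib.parse import quote

# Part of the information dictionary from the sample article above.
INFO_DICT = """\
<<
/Producer (Adobe LiveCycle PDFG 7.2)
/Creator (Elsevier)
/Author ()
/Title (doi:10.1016/j.ccr.2007.06.004)
>>
"""

# Naive pattern: does not handle escaped or nested parentheses.
TITLE_RE = re.compile(r"/Title\s*\(([^)]*)\)")
# A Title that consists solely of "doi:" followed by a DOI-shaped string.
DOI_RE = re.compile(r"^doi:(10\.\d{4,9}/\S+)$", re.IGNORECASE)


def doi_from_info_dict(raw):
    """Return the DOI held in the Title entry, or None if there is none."""
    title = TITLE_RE.search(raw)
    if not title:
        return None
    doi = DOI_RE.match(title.group(1).strip())
    return doi.group(1) if doi else None


def crossref_url(doi):
    """Build a metadata lookup URL (assumed Crossref REST API endpoint)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


doi = doi_from_info_dict(INFO_DICT)
if doi:
    print(crossref_url(doi))
```

A real title such as “Microsoft Word - crossref-rev.doc” simply fails the DOI pattern and returns nothing, which is the cost of overloading the field: the app has to guess whether it is looking at a title or an identifier.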

The only downside is that technically this seems to be a kludge to satisfy Adobe apps: “Title” is not the correct field for filing this information. I would have thought that some other information dictionary field (e.g. “Subject”) would be a better kludge, leaving the “Title” and “Author” fields reserved for their proper purposes. The RDF/XML title fields would appear to be inherited from the “Title” field in the information dictionary. It’s a bit of a shame really because the DOI is embedded - it’s just in the wrong place(s). (OK, so that’s still way better, maybe, than not providing this information at all.)

Hopefully, with more examples to mull over and experiences to learn from we can arrive at a much better and more systematic way of including the DOI, and other key metadata fields, within a PDF so that this information can be gleaned easily and unambiguously by third party apps.
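
Until such a convention exists, a third party app is reduced to probing several fields in turn. The Python sketch below shows that guesswork under stated assumptions. The information dictionary has already been read into a plain dict, and the list of keys and their order are my own invention, drawn only from the examples in this post.

```python
import re

# Accepts a bare DOI, or one prefixed with "doi:" or "info:hdl/".
DOI_RE = re.compile(r"(?:doi:|info:hdl/)?(10\.\d{4,9}/\S+)", re.IGNORECASE)

# Hypothetical probe order, based only on the fields seen in this post.
CANDIDATE_KEYS = ("Identifier", "HDL", "Title", "Subject")


def find_doi(info):
    """Return (key, doi) for the first field holding only a DOI, else None."""
    for key in CANDIDATE_KEYS:
        match = DOI_RE.fullmatch(info.get(key, "").strip())
        if match:
            return key, match.group(1)
    return None


# The two samples discussed above, reduced to the relevant entries.
handle_demo = {"HDL": "10.5555/pdftest-crossref",
               "Title": "Microsoft Word - crossref-rev.doc"}
elsevier = {"Title": "doi:10.1016/j.ccr.2007.06.004", "Subject": ""}

print(find_doi(handle_demo))
print(find_doi(elsevier))
```

With one agreed field, the loop and the guesswork would disappear.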
