PDF-Extract

Geoffrey Bilder – 2012 April 17

In Citation Formats, Crossref Labs, Metadata, PDF

Crossref Labs is happy to announce the first public release of “pdf-extract”, an open source set of tools and libraries for extracting citation references (and, eventually, other semantic metadata) from PDFs. We first demonstrated this tool to Crossref members at our annual meeting last year. See the pdf-extract labs page for a detailed introduction to this new set of tools.

If you are unable to download and install the tool, you can play with an experimental web interface called “Extracto.” Be warned: Extracto is running on a very feeble server using an erratic and slow internet connection. The only guarantee that we can make about using it is that it will repeatedly fall over and annoy you. The weasel has spoken.

Turning DOIs into formatted citations

Karl Ward – 2011 November 28

In Citation Formats, Crossref Labs, DOIs, Linked Data, Metadata

Today two new record types were added to dx.doi.org resolution for Crossref DOIs. These allow anyone to retrieve DOI bibliographic metadata as formatted bibliographic entries. To perform the formatting we’re using the Citation Style Language processor citeproc-js, which supports a shed load of citation styles and locales. In fact, all the styles and locales found in the CSL repositories are supported, including many common styles such as bibtex, apa, ieee, harvard, vancouver and chicago.
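As a rough sketch of how a client might ask for a formatted entry, the snippet below prepares a request with a citation style and locale in the Accept header. The post does not name the media type, so text/x-bibliography with style and locale parameters is an assumption here, the helper names are hypothetical, and the DOI is simply one that appears elsewhere on this page.

```python
import urllib.request


def bibliography_accept(style="apa", locale="en-US"):
    """Build an Accept header value asking for a formatted citation.

    Assumption: the resolver understands text/x-bibliography with
    style and locale parameters; the post does not name the type.
    """
    return f"text/x-bibliography; style={style}; locale={locale}"


def citation_request(doi, style="apa", locale="en-US"):
    """Prepare (but do not send) a request for a formatted citation."""
    return urllib.request.Request(
        f"http://dx.doi.org/{doi}",
        headers={"Accept": bibliography_accept(style, locale)},
    )


def fetch_citation(doi, style="apa", locale="en-US"):
    """Send the request and return the body as text (needs network access)."""
    with urllib.request.urlopen(citation_request(doi, style, locale)) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only inspect the prepared request; nothing is sent over the network.
    req = citation_request("10.1126/science.1157784", style="ieee")
    print(req.full_url)
    print(req.get_header("Accept"))
```

Calling fetch_citation with a different style name would, under the same assumption, return the same record formatted in that style.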

Content Negotiation for Crossref DOIs

Geoffrey Bilder – 2011 April 19

In DataCite, Identifiers, Linked Data, Metadata, Programming, Standards

So does anybody remember the posting DOIs and Linked Data: Some Concrete Proposals? Well, we went with option “D.” From now on, DOIs, expressed as HTTP URIs, can be used with content-negotiation. Let’s get straight to the point. If you have curl installed, you can start playing with content-negotiation and Crossref DOIs right away:

    curl -D - -L -H "Accept: application/rdf+xml" "http://dx.doi.org/10.1126/science.1157784"
    curl -D - -L -H "Accept: text/turtle" "http://dx.doi.org/10.1126/science.1157784"
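For anyone without curl to hand, the same two requests can be sketched in Python using only the standard library. This is a minimal illustration, not an official client; the function names are hypothetical, and the URL and media types are the ones from the curl commands in this post.

```python
import urllib.request

# DOI URL taken from the curl examples in the post.
DOI_URL = "http://dx.doi.org/10.1126/science.1157784"


def negotiated_request(url, media_type):
    """Prepare a GET request that asks for one specific representation."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch(url, media_type):
    """Send the request and return (content type, body bytes).

    urllib follows redirects by default, much like curl -L.
    Needs network access.
    """
    with urllib.request.urlopen(negotiated_request(url, media_type)) as resp:
        return resp.headers.get("Content-Type"), resp.read()


if __name__ == "__main__":
    # Show what would be sent, without touching the network.
    for media_type in ("application/rdf+xml", "text/turtle"):
        req = negotiated_request(DOI_URL, media_type)
        print(req.get_method(), req.full_url, "Accept:", req.get_header("Accept"))
```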

Add Crossref metadata to PDFs using XMP

Geoffrey Bilder – 2009 December 09

In Metadata, PDF, XMP

In order to encourage publishers and other content producers to embed metadata into their PDFs, we have released an experimental tool called “pdfmark”. This open source tool allows you to add XMP metadata to a PDF. What’s really cool is that if you give the tool a Crossref DOI, it will look up the metadata in Crossref and then apply said metadata to the PDF. More detail can be found on the pdfmark page on the Crossref Labs site.

Recommendations on RSS Feeds for Scholarly Publishers

Geoffrey Bilder – 2009 October 19

In Interoperability, Metadata, News Release, RSS

We’re pleased to announce that a Crossref working group has released a set of best practice recommendations for scholarly publishers producing RSS feeds. Variations in practice amongst publisher feeds can be irritating for end-users, but they can be insurmountable for automated processes. RSS feeds are increasingly being consumed by knowledge discovery and data mining services. In these cases, variations in date formats, the practice of lumping all authors together in one <dc:creator> element, or generating invalid XML can render the RSS feed useless to the service accessing it.
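To make those pitfalls concrete, here is a small sketch of building a feed item that avoids them: one <dc:creator> element per author, a single machine-friendly date format, and serialization through an XML library so the output is well formed. The element layout and the choice of ISO 8601 dates are assumptions for illustration, not a restatement of the working group's recommendations, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Dublin Core elements namespace, used for dc:creator and dc:date.
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)


def build_item(title, link, authors, published):
    """Build a feed <item> with one dc:creator per author."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    for author in authors:
        # One element per author, rather than lumping all names together.
        ET.SubElement(item, f"{{{DC}}}creator").text = author
    # Assumption: ISO 8601 (YYYY-MM-DD) as the single date format.
    ET.SubElement(item, f"{{{DC}}}date").text = published.isoformat()
    return item


if __name__ == "__main__":
    # Placeholder values, purely for illustration.
    item = build_item(
        "An example article",
        "http://example.org/article",
        ["First Author", "Second Author"],
        date(2009, 10, 19),
    )
    print(ET.tostring(item, encoding="unicode"))
```

Letting the library escape text and declare the namespace is what guards against the invalid-XML case; hand-assembled strings are where that usually goes wrong.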

Citation Typing Ontology

Geoffrey Bilder – 2009 March 20

In Citation Formats, Data, Identifiers, Linking, Metadata

I was happy to read David Shotton’s recent Learned Publishing article, Semantic Publishing: The Coming Revolution in scientific journal publishing, and see that he and his team have drafted a Citation Typing Ontology.* Anybody who has seen me speak at conferences knows that I often like to proselytize about the concept of the “typed link”, a notion that hypertext pioneer Randy Trigg discussed extensively in his 1983 Ph.D. thesis. Basically, Trigg points out something that should be fairly obvious: a citation (i.

Poorboy Metadata Hack

Tony Hammond – 2009 January 06

In Metadata

I was playing around recently and ran across this little metadata hack. At first, I thought somebody was doing something new. But no, nothing so forward apparently. (Heh! 🙂) I was attempting to grab the response headers from an HTTP request on an article page and was using the Perl LWP library by default. For some reason I was getting metadata elements being spewed out as response headers - at least from some of the sites I tested.
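The excerpt stops before explaining the cause, so the following is only a sketch of the effect described: metadata elements from a page's markup showing up as header-style name/value pairs. It assumes the metadata lives in HTML <meta name="..." content="..."> elements, and the X-Meta- prefix and the function names are hypothetical choices for illustration.

```python
from html.parser import HTMLParser


class MetaHeaderParser(HTMLParser):
    """Collect <meta name=... content=...> pairs as header-like tuples."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            # Hypothetical naming scheme: prefix the meta name.
            self.headers.append((f"X-Meta-{name}", content))


def meta_as_headers(html):
    """Return the page's meta elements as a list of (header, value)."""
    parser = MetaHeaderParser()
    parser.feed(html)
    return parser.headers


if __name__ == "__main__":
    # Placeholder page, purely for illustration.
    page = (
        "<html><head>"
        '<meta name="dc.title" content="An example article">'
        '<meta name="dc.identifier" content="doi:10.1126/science.1157784">'
        "</head><body></body></html>"
    )
    for header, value in meta_as_headers(page):
        print(f"{header}: {value}")
```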

And the DOI is …

Tony Hammond – 2008 December 22

In Metadata

Once structured metadata is added to a file, retrieving a given metadata element is usually a doddle. For example, for PDFs with embedded XMP one can use Phil Harvey’s excellent Exiftool utility. Exiftool is a Perl library and application, which I’ve blogged about here earlier, and it is available as a ‘.zip’ file for Windows (no Perl required) or a ‘.dmg’ for MacOS. Note that Phil maintains this actively and has done so over the last five years.

Machine Readable: Are We There Yet?

Tony Hammond – 2008 November 19

In Metadata

The guidelines for Crossref publishers (“DOI Name Information and Guidelines” - [PDF, 210K][1]) have this to say in “Sect. 6.3 The response page” regarding the response page for a DOI:

“A minimal response page must contain a full bibliographic citation displayed to the user. A response page without bibliographic information should never be presented to a user.”

which would seem to be all fine and dandy. But if that user is a machine (or an agent acting for a user) they’ll likely be out of luck as the metadata in the bibliographic citation is generally targeted at human users.

So here’s a quick and dirty implementation of what a machine readable page could look like using RDFa. (The demo uses Jeni Tennison’s wonderful [rdfQuery][2] plugin which I [blogged][3] about earlier.)

Clicking the DOI link below will bring up in a sub-window a bibliographic citation which might be found in a typical DOI response page. If you now click the “Read Me” link you should see an alert message which presents the bibliographic metadata as a complete RDF document (in a simple N3 – or Notation3 – format). This document is assembled on the fly by rdfQuery using the RDFa markup embedded in the page.

See the “View Source” link to list the actual XHTML markup and the RDFa properties which have been added. And note also that some of the properties are partially “hidden” to the human reader, e.g. a publication date is given in year form only whereas the machine record has the date in full, and some of the properties are fully “hidden”: print and electronic ISSNs, issue number, ending page, etc.
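To give a feel for the kind of N3 document such a page could yield, here is a minimal sketch that serializes a handful of property/value pairs about one DOI. It does not use rdfQuery or parse RDFa; the function name, the choice of Dublin Core properties and the sample values are all assumptions for illustration.

```python
def to_n3(subject, properties, prefixes):
    """Serialize literal-valued properties of one subject as simple N3."""
    if not properties:
        raise ValueError("at least one property is required")
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in prefixes.items()]
    lines.append("")
    lines.append(f"<{subject}>")
    items = list(properties.items())
    for index, (prop, value) in enumerate(items):
        # ';' separates statements about the same subject, '.' ends them.
        end = " ." if index == len(items) - 1 else " ;"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'    {prop} "{escaped}"{end}')
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder metadata; the full date mirrors the idea that the machine
    # record can carry more detail than the year shown to human readers.
    print(
        to_n3(
            "http://dx.doi.org/10.1126/science.1157784",
            {
                "dc:title": "An example article",
                "dc:date": "2008-11-19",
            },
            {"dc": "http://purl.org/dc/elements/1.1/"},
        )
    )
```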

(Continues below.)

rdfQuery

Tony Hammond – 2008 November 17

In Metadata

Whaddya know? I was just on the point of blogging about the real nice demo given by Jeni Tennison at last week’s SWIG UK meeting at HP Labs in Bristol of rdfQuery (an RDF plugin for jQuery - the zip file is here). And there today on her blog I see that she has a full writeup on rdfQuery, so I’ll defer to the expert. :~) All I can really add to that is that rdfQuery is a pretty darn cool way to add and manipulate RDFa using jQuery.
