Python and Ruby Libraries for accessing the Crossref API
I’m a co-founder with rOpenSci, a non-profit that focuses on making software to facilitate reproducible and open science. Back in 2013 we started to make an R client working with various Crossref web services. I was lucky enough to attend last year’s Crossref annual meeting in Boston, and gave one talk on details of the programmatic clients, and another higher level talk on text mining and use of metadata for research.
Crossref has a newish API encompassing works, journals, members, funders and more (check out the API docs), as well as a few other services. Essential to making the Crossref APIs easily accessible—and facilitating easy tool/app creation and exploration—are programmatic clients for popular languages. I’ve maintained an R client for a while now, and have been working on Python and Ruby clients for the past four months or so.
The R client falls squarely into the analytics/research use cases, while the Python and Ruby clients (and the Javascript library below) are ideal for general data access and use in web applications.
I’ve tried to make each client idiomatic to its language. As a result, data outputs don’t correspond exactly across the clients. I have, however, kept method names similar across Ruby and Python; the R client is quite a bit older, so its method names differ from the other clients, and I’m resistant to changing them so as not to break current users’ projects. In addition, R users are likely to want a data.frame (i.e., a table) of results, so that’s what we give back — whereas with Python and Ruby we give back dictionaries and hashes, respectively.
Crossref clients
- Python: habanero
- Ruby: serrano
- R: rcrossref
- Javascript:
I’ll cover the Python, Ruby, and R libraries below.
Installation
Python
on the command line
pip install habanero
Ruby
on the command line
gem install serrano
R
in an R session
install.packages("rcrossref")
Examples
Output is indicated by the syntax #> in all examples below.
Python
in a Python REPL (e.g., IPython)
Import the Crossref module from within habanero, and initialize a client
from habanero import Crossref
cr = Crossref()
Query for the phrase “ecology”
x = cr.works(query = "ecology", limit = 5)
Index to various parts of the output
x['message']['total-results']
#> 276188
Extract similar data items from each result. The records are in the “items” slot
[ z['DOI'] for z in x['message']['items'] ]
#> [u'10.1002/(issn)1939-9170',
#> u'10.4996/fireecology',
#> u'10.5402/ecology',
#> u'10.1155/8641',
#> u'10.1111/(issn)1439-0485']
In habanero, some methods require you to first instantiate a client. You can set a base URL and an API key; the latter is a forward-looking feature, as the Crossref API does not currently require an API key.
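The pattern looks roughly like this — a stand-in sketch of the idea, not habanero’s actual source; the class name and the base_url/api_key parameter names here are illustrative:

```python
# Stand-in sketch of a configurable client (not habanero's source code).
class Client:
    def __init__(self, base_url="https://api.crossref.org", api_key=None):
        self.base_url = base_url
        self.api_key = api_key  # unused today: the public API needs no key

cr = Client()  # defaults to the public Crossref API
custom = Client(base_url="https://api.example.org", api_key="my-key")
print(cr.base_url)     # https://api.crossref.org
print(custom.api_key)  # my-key
```

Keeping these as constructor parameters means a future authenticated or self-hosted endpoint needs no change to calling code.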
Note: I’ve tried to make sure habanero is Python 2 and 3 compatible. Hopefully you’ll find that’s true.
Ruby
in a Ruby REPL (e.g., pry), load serrano
require 'serrano'
Query for “peerj” on the journals route
x = Serrano.journals(query: "peerj")
Collect just the ISSNs from each result
x['message']['items'].collect { |z| z['ISSN'] }
#> => [["2376-5992"], ["2167-8359"]]
Shell
The serrano command-line tool is quite powerful if you are used to working in the shell.
Here, search for one article; summary data is shown.
serrano works 10.1371/journal.pone.0033693
#> DOI: 10.1371/journal.pone.0033693
#> type: journal-article
#> title: Methylphenidate Exposure Induces Dopamine Neuron Loss and Activation of Microglia in the Basal Ganglia of Mice
There’s also a --json flag to give back JSON data, which can be parsed with the command-line tool jq.
serrano works --filter=has_full_text:true --json --limit=5 | jq '.message.items[].link[].URL'
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch9"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.index"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch11"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch15"
#> "http://api.wiley.com/onlinelibrary/tdm/v1/articles/10.1002%2F9781119208082.ch4"
R
In an R session, load rcrossref
library("rcrossref")
Search the works route for the phrase “science”
res <- cr_works(query = "science", limit = 5)
#> $meta
#> total_results search_terms start_index items_per_page
#> 1 4333827 science 0 5
#>
#> $data
#> Source: local data frame [5 x 23]
#>
#> alternative.id container.title created deposited DOI funder indexed
#> (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#> 1 2013-11-21 2013-11-21 10.1126/science <NULL> 2015-12-27
#> 2 Science Askew 2004-11-26 2013-12-16 10.1887/0750307145/b426c18 <NULL> 2015-12-24
#> 3 2006-04-10 2010-07-30 10.1002/(issn)1557-6833 <NULL> 2015-12-25
#> 4 2013-08-27 2013-08-27 10.1002/(issn)1469-896x <NULL> 2015-12-27
#> 5 2013-12-19 2013-12-19 10.5152/bs. <NULL> 2015-12-28
#> Variables not shown: ISBN (chr), ISSN (chr), issued (chr), link (chr), member (chr), prefix (chr), publisher
#> (chr), reference.count (chr), score (chr), source (chr), subject (chr), title (chr), type (chr), URL
#> (chr), assertion (chr), author (chr)
#>
#> $facets
#> NULL
Index through to get the DOIs
res$data$DOI
#> [1] "10.1126/science" "10.1887/0750307145/b426c18" "10.1002/(issn)1557-6833"
#> [4] "10.1002/(issn)1469-896x" "10.5152/bs."
rcrossref also has faster versions of most functions, denoted by a trailing underscore (e.g., cr_works_()), which only do the HTTP request and give back JSON.
Comparison of Crossref Client Methods
After installing and loading the libraries above, the following methods are available:
API Route | Python | Ruby | R
---|---|---|---
works | `cr.works()` | `Serrano.works()` | `cr_works()`
members | `cr.members()` | `Serrano.members()` | `cr_members()`
funders | `cr.funders()` | `Serrano.funders()` | `cr_funders()`
types | `cr.types()` | `Serrano.types()` | `cr_types()`
licenses | `cr.licenses()` | `Serrano.licenses()` | `cr_licenses()`
journals | `cr.journals()` | `Serrano.journals()` | `cr_journals()`
registration agency | `cr.registration_agency()` | `Serrano.registration_agency()` | `cr_agency()`
random DOIs | `cr.random_dois()` | `Serrano.random_dois()` | `cr_r()`
Other Crossref Services
Service | Python | Ruby | R
---|---|---|---
content negotiation | `cn.content_negotiation()` [1] | `Serrano.content_negotiation()` | `cr_cn()`
CSL styles | `cn.csl_styles()` [1] | `Serrano.csl_styles()` | `get_styles()`
citation count | `counts.citation_count()` [2] | `Serrano.citation_count()` | `cr_citation_count()`
Features
These are supported in all 3 libraries:
- Filters (see below)
- Deep paging (see below)
- Pagination
- Verbose curl output
Filters (see the API docs for details) are a powerful way to get closer to exactly what you want in your queries. In the Crossref API, filters are passed as query parameters, comma-separated, like filter=has-orcid:true,is-update:true. In the client libraries, filters are passed in idiomatic fashion according to the language.
Python
from habanero import Crossref
cr = Crossref()
cr.works(filter = {'award_number': 'CBET-0756451', 'award_funder': '10.13039/100000001'})
Ruby
require 'serrano'
Serrano.works(filter: {award_number: 'CBET-0756451', award_funder: '10.13039/100000001'})
R
library("rcrossref")
cr_works(filter=c(award_number='CBET-0756451', award_funder='10.13039/100000001'))
Note how syntax is quite similar among languages, though keys don’t have to be quoted in Ruby and R, and in R you pass in a vector or list instead of a hash as in the other two.
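Under the hood, each client has to serialize that structure into the comma-separated filter=... string the API expects, including swapping underscores for hyphens in filter names. A minimal sketch of that step in Python (my own illustration, not habanero’s actual code):

```python
def build_filter(filters):
    # Swap underscores for hyphens (award_number -> award-number), then
    # join the key:value pairs with commas, as the Crossref API expects.
    return ",".join(f"{k.replace('_', '-')}:{v}" for k, v in filters.items())

print(build_filter({'award_number': 'CBET-0756451',
                    'award_funder': '10.13039/100000001'}))
# award-number:CBET-0756451,award-funder:10.13039/100000001
```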
All 3 clients have helper functions to show you what filters are available and what the options are for each filter.
Action | Python | Ruby | R
---|---|---|---
Filter names | `filters.filter_names` [3] | `Serrano::Filters.names` | `filter_names()`
Filter details | `filters.filter_details` [3] | `Serrano::Filters.filters` | `filter_details()`
Deep paging
Sometimes you want a lot of data. The Crossref API has parameters for paging (see rows and offset), but large values of either can lead to long response times and potentially timeouts (i.e., request failure). The API has a deep paging feature for when large volumes of data are desired, made possible via Solr’s cursor feature (e.g., see the blog post on it). Here’s a rundown of how to use it:
- cursor: each method in each client library that allows deep paging has a cursor parameter; set it to * to tell the Crossref API you want deep paging.
- cursor_max: each request comes back with a cursor value that we can use to make the next request, so, for boring reasons, we need feedback from the user on when to stop; cursor_max indicates the total number of results you want back.
- limit: when not using deep paging, this parameter determines the number of results to get back; when deep paging, it sets the chunk size per request (note that the max value for this parameter is 1000).
For example, cursor="*" states that you want deep paging, cursor_max sets the maximum number of results you want back, and limit determines how many results to fetch per request.
Python
from habanero import Crossref
cr = Crossref()
cr.works(query = "widget", cursor = "*", cursor_max = 500)
Ruby
require 'serrano'
Serrano.works(query: "widget", cursor: "*", cursor_max: 500)
R
library("rcrossref")
cr_works(query = "widget", cursor = "*", cursor_max = 500)
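Behind those calls, each client runs a loop: make a request, read the next cursor from the response, and repeat until cursor_max results have been collected or the results are exhausted. A simplified sketch of that loop, using a fake fetch function in place of real HTTP calls (deep_page and fake_fetch are my own illustrative names, not the clients’ internals):

```python
def deep_page(fetch_page, cursor_max=500, limit=100):
    cursor = "*"  # '*' asks the Crossref API to start a new cursor session
    results = []
    while len(results) < cursor_max:
        items, cursor = fetch_page(cursor, limit)
        if not items:  # an empty page means the results are exhausted
            break
        results.extend(items)
    return results[:cursor_max]

# Fake backend holding 250 records, served in chunks of `limit`;
# a real client would parse message['next-cursor'] from the JSON response.
DATA = list(range(250))

def fake_fetch(cursor, limit):
    start = 0 if cursor == "*" else int(cursor)
    return DATA[start:start + limit], str(start + limit)

print(len(deep_page(fake_fetch, cursor_max=500, limit=100)))  # 250
```

Note how cursor_max caps the total collected while limit only controls the per-request chunk size — asking for 500 when only 250 exist simply stops early.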
Text mining clients
Just a quick note that I’ve begun a few text-mining clients for Python and Ruby, focused on using the low level clients discussed above.
Do try them out!