DOIs and matching regular expressions
We regularly see developers using regular expressions to validate or scrape for DOIs. For modern Crossref DOIs, the regular expression is short:
/^10\.\d{4,9}\/[-._;()\/:A-Z0-9]+$/i
Of the 74.9M DOIs we have seen, this matches 74.4M. If you need to use only one pattern, use this one.
The other 500K are mostly from Crossref’s early days, when the battle between “human-readable” identifiers and “opaque” identifiers was still being fought, the web was still new, and it was expected that “doi” would become as well supported a URI scheme name as “gopher”, “wais”, …. OK, that didn’t go so well.
An early Crossref member was John Wiley & Sons. They faced the need to design DOIs without much prior work to lean on. Many of those early DOIs are not regular-expression friendly. Nevertheless, they are still valid and valuable permanent links to the work’s version of record. You can catch 300K more DOIs with:
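As a minimal sketch, the recommended pattern could be used for validation in Python roughly as follows. The function name `is_modern_doi` and the sample strings are assumptions made for illustration, and the dot and slash are escaped here on the assumption that they are meant as literal characters.

```python
import re

# Python transcription of the recommended pattern. Escaping the dot and
# the slash is an assumption about the intended literal characters.
MODERN_DOI = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def is_modern_doi(candidate: str) -> bool:
    """Return True if the entire string matches the recommended pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return MODERN_DOI.fullmatch(candidate) is not None


# Illustrative inputs only.
print(is_modern_doi("10.5555/12345678"))      # True
print(is_modern_doi("doi:10.5555/12345678"))  # False: the prefix is not part of the DOI
print(is_modern_doi("10.123/abc"))            # False: registrant code is too short
```

Callers would need to strip any `doi:` or resolver URL prefix before validating, since the pattern is anchored at both ends.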
/^10\.1002\/[^\s]+$/i
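One way to combine the two patterns is to try the precise one first and fall back to the permissive Wiley one only when it fails. The sketch below assumes that ordering; the labels, the function name `classify_doi`, and the sample strings (including the made-up string written in the style of an early Wiley DOI) are hypothetical.

```python
import re
from typing import Optional

# Patterns are tried in order, strictest first. The escaping of the dot and
# slash is an assumption about the intended literal characters.
PATTERNS = [
    ("modern", re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)),
    ("wiley-legacy", re.compile(r"^10\.1002/[^\s]+$", re.IGNORECASE)),
]


def classify_doi(candidate: str) -> Optional[str]:
    """Return the label of the first pattern that matches, or None."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return None


# Made-up string in the style of an early Wiley DOI, for illustration only.
legacy = "10.1002/(SICI)1234-5678(199901)1:1<1::AID-ABC1>3.0.CO;2-X"
print(classify_doi(legacy))             # wiley-legacy
print(classify_doi("10.1002/abc.123"))  # modern
print(classify_doi("not a doi"))        # None
```

Keeping the label lets downstream code treat fallback matches with more suspicion than matches from the recommended pattern.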
While the string caught is likely to be the DOI within the text, it may also contain trailing characters that, for lack of a separating space, are caught up with the DOI. Even the recommended expression catches DOIs ending with periods, colons, semicolons, hyphens, and underscores. Most DOIs found in the wild are presented within some visual design program. While pleasant to look at, the visual design can misdirect machines. Is the period at the end of the line part of the DOI or part of the design? Is that en dash actually a hyphen? These issues lead to DOI bycatch.
Adding the following 3 expressions to the previous 2 leaves only 72K DOIs uncaught. Catching these 72K would require a dozen or more additional patterns. Unfortunately, each additional pattern weakens the overall precision of the catch. More bycatch.
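The bycatch problem can be illustrated with a small scraping sketch. It assumes an unanchored variant of the recommended pattern with a leading word boundary, and it trims only the five trailing characters named above. The function name `scrape_dois` and the sample text are hypothetical, and the trimming is a heuristic: a DOI that legitimately ends in one of those characters would be damaged, so both the raw and the trimmed forms are returned.

```python
import re
from typing import List, Tuple

# Unanchored variant for scanning running text. The leading \b (an assumption)
# avoids starting a match in the middle of a longer number such as "110.5555".
DOI_IN_TEXT = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)

# Trailing characters that are often typography rather than part of the DOI.
SUSPECT_TRAILERS = ".:;-_"


def scrape_dois(text: str) -> List[Tuple[str, str]]:
    """Return (raw_match, trimmed_match) pairs for each candidate DOI."""
    pairs = []
    for match in DOI_IN_TEXT.finditer(text):
        raw = match.group(0)
        pairs.append((raw, raw.rstrip(SUSPECT_TRAILERS)))
    return pairs


sample = "See 10.5555/12345678. Compare 10.5555/abc-def; both are examples."
for raw, trimmed in scrape_dois(sample):
    print(repr(raw), "->", repr(trimmed))
# '10.5555/12345678.' -> '10.5555/12345678'
# '10.5555/abc-def;' -> '10.5555/abc-def'
```

A safer pipeline would check both forms against a resolver rather than trust the trimmed one. This sketch does not handle look-alike characters such as an en dash standing in for a hyphen.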
/^10\.\d{4}\/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10\.1021\/\w\w\d++$/i
/^10\.1207\/[\w\d]+\&\d+_\d+$/i
Crossref is not the only DOI Registration Agency, and while our members account for 65-75% of all registered DOIs, this means there are tens of millions of DOIs that we have not seen. Luckily, the newer RAs and their publishers can copy our successes and avoid our mistakes.