Blog

What’s your (citations’) style?

Bibliographic references in scientific papers are the end result of a process typically composed of: finding the right document to cite, obtaining its metadata, and formatting the metadata using a specific citation style. This end result, however, does not preserve the information about the citation style used to generate it. Can the citation style be somehow guessed from the reference string only?

TL;DR

  • I built an automatic citation style classifier. It classifies a given bibliographic reference string into one of 17 citation styles or “unknown”.
  • The classifier is based on supervised machine learning. It uses TF-IDF feature representation and a simple Logistic Regression model.
  • For training and testing, I used datasets generated automatically from Crossref metadata.
  • The accuracy of the classifier estimated on the test set is 94.7%.
  • The classifier is open source and can be used as a Python library or REST API.

Introduction

Threadgill-Sowder, J. (1983). Question Placement in Mathematical Word Problems. School Science and Mathematics, 83(2), 107-111

This reference is the end result of a process that typically includes: finding the right document, obtaining its metadata, and formatting the metadata using a specific citation style. Sadly, the intermediate reference forms or the details of this process are not preserved in the end result. In general, just by looking at the reference string we cannot be sure which document it originates from, what its metadata is, or which citation style was used.