Citing Data Sets
This D-Lib paper by Altman and King looks interesting: “A Proposed Standard for the Scholarly Citation of Quantitative Data”. (Thanks to Herbert Van de Sompel for drawing attention to the paper.) The gist of it (Sect. 3) is:
_“We propose that citations to numerical data include, at a minimum, six required components. The first three components are traditional, directly paralleling print documents. … Thus, we add three components using modern technology, each of which is designed to persist even when the technology changes: a unique global identifier, a universal numeric fingerprint, and a bridge service. They are also designed to take advantage of the digital form of quantitative data.
An example of a complete citation, using this minimal version of the proposed standards, is as follows:
**Micah Altman; Karin MacDonald; Michael P. McDonald, 2005, “Computer Use in Redistricting”,
hdl:1902.1/AMXGCNKCLU UNF:3:J0PkMygLPfIyT1E/8xO/EA==
http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU**
”_
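The last line of that example citation appears to be the global identifier, percent-encoded and appended to the bridge service's base address. Here is a minimal Python sketch of that construction. The pattern is inferred from this single example rather than from any documented API, and `BRIDGE_BASE` and `bridge_url` are names made up for illustration.

```python
from urllib.parse import quote

# Assumption: base address copied from the example citation; the URL pattern
# is inferred from that one example, not from a published specification.
BRIDGE_BASE = "http://id.thedata.org/"


def bridge_url(identifier: str) -> str:
    """Build a bridge-service URL for a global identifier (hypothetical helper)."""
    # safe="" makes quote() escape ':' and '/' as well, as in the cited URL.
    return BRIDGE_BASE + quote(identifier, safe="")


print(bridge_url("hdl:1902.1/AMXGCNKCLU"))
# http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
```

Escaping the whole identifier keeps it a single path segment, so the handle's own slash cannot be mistaken for a path separator.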
So the abbreviated citation (author, date, title, unique ID) is supplemented by a UNF, which fingerprints the data, and by a bridge service URL. A UNF looks like a sort of super MD5: it provides a signature of the data content that is independent of how the data is serialized to a file store.
_“Thus, we add as the fifth component a Universal Numeric Fingerprint or UNF. The UNF is a short, fixed-length string of numbers and characters that summarize all the content in the data set, such that a change in any part of the data would produce a completely different UNF. A UNF works by first translating the data into a canonical form with fixed degrees of numerical precision and then applies a cryptographic hash function to produce the short string. The advantage of canonicalization is that UNFs (but not raw hash functions) are format-independent: they keep the same value even if the data set is moved between software programs, file storage systems, compression schemes, operating systems, or hardware platforms.
…
Finally, since most web browsers do not currently recognize global unique identifiers directly (i.e., without typing them into a web form), we add as the sixth and final component of the citation standard a bridge service, which is designed to make this task easier in the medium term.”_
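To make the canonicalize-then-hash idea concrete, here is a toy Python sketch. It is not the actual UNF algorithm, whose details the excerpt above does not give. The function name `toy_fingerprint`, the choice of 7 significant digits, SHA-256, truncation to 128 bits, and base64 output are all assumptions made for illustration.

```python
import base64
import hashlib


def toy_fingerprint(values, digits=7):
    """Toy stand-in for a UNF: canonicalize the numbers, then hash.

    NOT the real UNF algorithm; every parameter here is an assumption.
    """
    # Canonical form: each value as a float, rounded to `digits` significant
    # digits in exponent notation, one value per line.
    canonical = "\n".join(f"{float(v):.{digits - 1}e}" for v in values)
    # Cryptographic hash of the canonical text (SHA-256 chosen arbitrarily).
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Keep the first 128 bits and encode them as a short printable string.
    return base64.b64encode(digest[:16]).decode("ascii")


# The same data read from different serializations gives the same fingerprint...
from_text_file = ["1.0", "2.50", "3"]
from_binary_file = [1, 2.5, 3.0]
print(toy_fingerprint(from_text_file) == toy_fingerprint(from_binary_file))  # True

# ...while changing one value changes the fingerprint.
print(toy_fingerprint([1, 2.5, 4]) == toy_fingerprint(from_binary_file))  # False
```

A plain MD5 of the two source files would differ even though the numbers are identical, which is the format independence the paper attributes to canonicalization.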
Certainly looks promising. I’m not sure whether there are any other contenders in this arena.