Blog

Dominika Tkaczyk

Dominika joined Crossref’s R&D group in the Tech team in August 2018. Her research interests focus on machine learning and natural language processing, in particular their applications to the automated analysis of scientific literature and research outputs. Previously, she worked on a number of projects, including extracting machine-readable metadata from scholarly documents, predicting people’s demographic features based on their internet browsing history, and developing new metrics for assessing the effectiveness of worldwide air traffic. Dominika’s career started in Poland, where she was a researcher and a data scientist at the University of Warsaw. She received a PhD in Computer Science from the Polish Academy of Sciences in 2016. In 2017 Dominika was awarded a Marie Sklodowska-Curie EDGE Fellowship and moved to Ireland to work as a postdoctoral researcher at Trinity College Dublin. When not busy training yet another random forest or neural network, she can be found at the nearest Doctor Who convention or rock/metal concert.


Matchmaker, matchmaker, make me a match

Matching (or resolving) bibliographic references to target records in a collection is a crucial task in the Crossref ecosystem. Automatic reference matching lets us discover citation relations in large document collections and calculate citation counts, h-indexes, impact factors, and similar metrics. At Crossref, we currently use a matching approach based on reference string parsing. Some time ago we realized there is a much simpler approach. And now it is finally battle time: which of the two approaches is better?

What does the sample say?

At Crossref Labs, we often come across interesting research questions and try to answer them by analyzing our data. Depending on the nature of the experiment, processing over 100M records might be time-consuming or even impossible. In those dark moments we turn to sampling and statistical tools. But what can we infer from only a sample of the data?
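To make the question concrete, here is a minimal sketch of one common answer: estimating the share of records with some property from a random sample, together with a confidence interval. This is an illustration only, not the method used in the post. The population, the 30% share, the sample size of 1,000, and the function name `estimate_proportion` are all made up for the example, and the interval is the simple normal-approximation (Wald) interval.

```python
import math
import random


def estimate_proportion(sample, z=1.96):
    """Return (estimate, low, high) for the share of True values in `sample`.

    Uses the normal-approximation (Wald) interval; z=1.96 corresponds to
    roughly 95% confidence. Hypothetical helper, for illustration only.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(sample) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical population: 100,000 records, 30% of which have the property.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [i < 30_000 for i in range(100_000)]

# Inspect only 1,000 records instead of the whole collection.
sample = rng.sample(population, 1000)
p, low, high = estimate_proportion(sample)
print(f"estimated share: {p:.3f} (95% CI: {low:.3f} to {high:.3f})")
```

With a sample of 1,000, the interval is at most about 6 percentage points wide, regardless of how large the full collection is. That is the appeal of sampling: the precision depends on the sample size, not on the size of the population.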