Looking back over 2024, we wanted to reflect on where we are in meeting our goals, and report on the progress and plans that affect you - our community of 21,000 organisational members as well as the vast number of research initiatives and scientific bodies that rely on Crossref metadata.
In this post, we will give an update on our roadmap, including what is completed, underway, and up next, and a bit about what’s paused and why. We’ll describe how we have been making resourcing and prioritisation decisions, including a revised management structure, and introduce new cross-functional program groups to collectively take the work forward more effectively.
It’s important to acknowledge that Crossref has evolved significantly from just five years ago - our member count has more than doubled from 10,000 to 21,000 organisations since 2019 and they include all kinds of organisations such as funders, universities, government bodies, NGOs, and of course scholar- and library-led publishers. The smaller organisations now collectively contribute the majority of Crossref funding. We’ve gone from 100 million records to 160 million in five years, and our metadata is retrieved more than 2 billion times monthly, quadrupling what it was five years ago.
It’s within this context that we’ve spent quite a lot of time thinking about scalability, how we collect and process feedback and contributions from many organisations, how to automate our operations, and refining the plans for the next few years.
Our strategic agenda remains the same
A few times a year we update the strategy page where there is a quadrant of projects showing what’s completed, in progress, up next, and in planning/ideas - for each strategic theme. We also link from there to our live public roadmap which shows more specifics about individual projects, including projected timelines, and is updated more frequently.
If you’ve been watching the strategy page, checking in on the public roadmap or this blog, or joining webinars and annual meetings, you’ll know that we’ve had some longstanding plans to—among other things—reduce technical debt, rebuild our metadata management system, move to the cloud, modernise our schema, support multiple languages, and partner with multiple data sources to build the Research Nexus.
You’ve heard us talk about these initiatives a lot, but you’ve not seen particularly swift action.
Moving the work forward more effectively
Earlier this year, it became clear that our almost three-year project to build a new relationships API had not worked out. The project, dubbed ‘manifold’, was to initially deliver data citations, and eventually replace our central metadata system, but what was prototyped didn’t scale, even with a subset of our metadata. We weren’t confident enough about the project’s timeline or costs to justifiably continue investing further time and resources.
Meanwhile, we’d barely scratched the surface of our aim to pay down technical and operational debt, and we’d also been neglecting to keep the live system up to date with the numerous metadata changes that have been queued up, waiting to be implemented.
We knew the manifold project was ambitious – our system has grown in complexity over the years. We were trying to rebuild the car while driving it (our system needed to continue to operate and be maintained by our team) while trying to design a new approach to manage the many relationships between 160+ million database records. In the years we worked on this project, we learned a lot that will inform future plans for a large system redesign.
In March this year, we decided to pause the manifold project. We apologised to our community partners for not delivering the promised data<->literature matches they hoped to use. They were frustrated but thankfully understanding.
We then resolved to focus on backend infrastructural changes, conduct cross-training so that all of our staff would become familiar with current in-use systems instead of greenfield tech (for now), and start to make a dent in the backlog of bugs and long-promised schema updates in our mainstream services.
We’re happy to report some movement on these things and some milestones that have been achieved in these areas in recent months.
Fostering a happy and dedicated team
Any kind of work can only happen when our staff are in a good place, feeling supported and comfortable to question things, and well-equipped with information, purpose, and clear priorities. In June, when the whole staff met up in person, we had some really good conversations about culture, communication, and about sharing responsibilities. Some people ran birds-of-a-feather sessions to explore the issues that had been keeping them up at night, such as authentication/security, and rebuilding the Crossref System (CS), and the team also co-created a set of prioritisation drivers that are now in use within our roadmap and planning processes.
Taking on feedback from the all-staff meeting and then the July board meeting, we thought strategically about the organisational structure Crossref would need over the next few years to reflect the growth in scope and size, and fulfil its longer term goals. We have long had an ambitious agenda but realised we didn’t yet have the capacity to do it all. So we came to the conclusion that we needed an updated team and management structure to take us through the next phase of our development.
The structural changes were concluded at the end of November. They included:
Moving Technology under Operations, since Technology, though a vital enabler, still works in service to our mission and in support of our community, just as other operational functions such as board governance and finance do.
Reframing product development as Programs and Services, and reducing our workstreams from five product portfolios to three programs. We formed cross-team steering groups around clearly articulated program areas (more on those below).
Broadening the leadership to include an Executive team and an extended Director team, and forming a Senior Management Team (SMT). These changes ensure that the collective responsibility for Crossref now rests on a wider group of experts who can back each other up and share the risk and the knowledge, rather than on just a few individuals.
Starting recruitment for two new leadership positions. We’ll welcome a new Director of Programs and Services and a new Director of Technology in the new year.
Evolving the strategic initiatives team into a data science team, integrating research & development functions throughout all teams, with the SMT taking collective responsibility for strategic initiatives.
Unfortunately, with the shift in approach for product development and by sharing responsibility for strategic initiatives and research among the wider team, we made the difficult decision that four positions would no longer work within the new structure.
A new approach: joined-up initiatives and cross-functional programs
Research has always been an important function for Crossref, but because it had been kept separate from our regular work, it became hard to coordinate strategic initiatives across the wider organisation. In recent years we inadvertently created more technical debt for ourselves, for example by building multiple prototype tools without plans for adoption or for moving them into production. Strategic initiatives, by their nature, need thorough research and high-level alignment, so we made such initiatives the responsibility of the whole senior management team. These include Resourcing Crossref for Future Sustainability (RCFS) and improving the Integrity of the Scholarly Record (ISR).
Some useful research had been conducted, but we were never in a position to act on any of it. Particularly promising work has been in the field of metadata matching, and with the growth in the community reliance on our metadata, and attention on data quality rightly increasing, we decided to create a new data science team to be dedicated to this work, led by Dominika Tkaczyk.
We had also struggled with a traditional product management approach since all our tools and activities are interconnected, and we found we were trying to do too many things at once without doing all of them effectively. We also acknowledged that product management comes from the commercial world (retail, for example) and is therefore designed to help companies sell and upsell, which is not our goal. So we looked to other approaches more suitable to mission-based nonprofits.
Introducing three programs
We have introduced cross-functional program management in order to work towards the following:
better cross-team alignment
shared responsibility
improved communication and learning
more progress on the things members need.
Supporting the strategic theme of co-creation, a new program, facilitated by Program Lead Lena Stoll, now manages and oversees all activities around co-creation and community trends. A cross-team steering group just began meeting regularly and will be responsible for interfaces such as reports/dashboards, record registration interfaces, connections and collaborations such as Open Funder Registry, ROR, ORCID auto-update, as well as OJS and other partner integrations. This program also includes the Crossref website and any front-end things to support other programs. And it includes ISR (the integrity of the scholarly record) and our tools in this area such as Crossmark and retraction/correction tooling, and Similarity Check for text comparisons.
Supporting the strategic theme of complete and global metadata and relationships, a new program, facilitated by Program Lead Martyn Rittman, now manages and oversees all activities relating to contributing to the Research Nexus. Working particularly closely with the metadata team, led by Patricia Feeney, this program addresses how metadata is modelled, used, enriched, and extended. Work includes our APIs, incorporating external data sources like Retraction Watch and Event Data, building out metadata matching services with the new data science team, supporting the community of metadata users with API sprints and more modern options for retrieving metadata based on usage and need.
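For readers newer to the metadata retrieval side of this program, here is a minimal sketch of what a single-record lookup against the public REST API can look like. It only builds the request URL and reads a few fields from a response; the helper names (`build_work_url`, `summarise_work`) and the sample record are our own illustrations rather than part of any Crossref library, and the sample is hand-written in the shape of a response rather than real data.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode

API_ROOT = "https://api.crossref.org"


def build_work_url(doi: str, mailto: Optional[str] = None) -> str:
    """Return the REST API URL for one work, identified by its DOI.

    Passing a contact address via `mailto` identifies the caller to the API.
    """
    url = f"{API_ROOT}/works/{quote(doi, safe='/')}"
    if mailto:
        url += "?" + urlencode({"mailto": mailto})
    return url


def summarise_work(response_text: str) -> dict:
    """Pull a few commonly used fields out of a /works/{doi} JSON response."""
    message = json.loads(response_text)["message"]
    titles = message.get("title") or [""]  # titles arrive as a list
    return {
        "doi": message.get("DOI"),
        "title": titles[0],
        "type": message.get("type"),
    }


# Hypothetical sample, hand-written in the shape of a response (not real data).
SAMPLE = json.dumps({
    "status": "ok",
    "message-type": "work",
    "message": {
        "DOI": "10.5555/12345678",
        "title": ["An example title"],
        "type": "journal-article",
    },
})

if __name__ == "__main__":
    print(build_work_url("10.5555/12345678", mailto="name@example.org"))
    print(summarise_work(SAMPLE))
```

Keeping URL construction and response parsing separate from the network call makes the sketch easy to try offline; a real client would fetch the URL with whichever HTTP library it already uses.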
Supporting the strategic theme of open and sustainable operations and keeping to the POSI framework, a new program, facilitated by Program Lead Sara Bowman, now manages and oversees all activities relating to making our operations more open, transparent, and sustainable. This program focuses on supporting and strengthening the core functions our members rely on and enabling future growth. It includes metadata deposit and processing, most apps (for managing titles, for example), authentication, and architectural and infrastructural projects like moving from the data centre to the AWS cloud service. This program also includes modernising our operations in general, which covers not just technology but also finance and human resources, with projects like membership process automation, fee modelling and financial analyses, and business system integrations.
The Programs will start to be reflected across our website and in our communications from next year.
What are Crossref’s new prioritisation drivers?
These are the drivers that our ~40 staff co-created in June that are guiding decisions about the priorities on our roadmap. New ideas will be evaluated in the following areas:
Encourage participation from new or under-represented communities
Respond to and lead trends in scholarly communications
Benefit the greatest number of members and users
Reflect on how the community works with each other and allow members to self-serve
Expand to support and connect relevant resource types and metadata fields
Make it easier to create and update metadata
Enhance metadata for completeness and accuracy
Make it easier to retrieve and use metadata
Automate repetitive/manual tasks
Address technical and operational debt
Maintain critical systems and operations and ensure their security
Control or reduce costs - to Crossref, our community, or the environment
We’re happy to report that the changes made this year have resulted in a productive last few months of the year. As reported in our annual meeting, here is the progress update.
What’s paused
A relationships API endpoint and, therefore, a specific data citation feed
Manifold, the three-year effort to modernise our tech stack
Most of the strategic initiatives prototypes that can’t yet be scaled, such as Labs API and Labs reports
What’s recently completed
We succeeded in moving the entire Crossref corpus to an open-source database, PostgreSQL
Fixed numerous REST API data quality issues and lots of troublesome bugs
Schema development - support for ROR as a Funder identifier is live and currently in testing
We automated some very manual membership and billing processes, saving hundreds of staff hours a year
Released a new form for journal article record registration, building on the grant registration form
Since the rest of the community stops for no Crossref product roadmap issue, we also progressed a number of community and governance initiatives:
The Grant Linking System (GLS) reached 5 years with over 40 funders joining Crossref and registering over 130,000 grants and awards, including use of facilities and projects
Our research for Resourcing Crossref for Future Sustainability (RCFS) with the Membership & Fees Committee is going well, and we’ll have new fee proposals for review in 2025
The Integrity of the Scholarly Record (ISR) conversations have deepened, and we’ve formed strong relationships with editorial experts and research integrity sleuths, who are getting up to speed on our metadata. We’re also working with some sleuthing consultants to change our processes to handle deceptive member behaviour such as paper mills, cloned journals, and citation manipulation. The new data science team plays a role here, along with membership and governance.
What’s currently in focus
In our efforts to do less but do it more effectively, we have two current priorities:
Get out of the physical data centre and into the cloud.
Resume metadata schema updates and start working through the backlog.
These two projects are underway, involving lots of communication and learning. Since we haven’t released any schema updates in many years, all our staff are learning for the first time how a metadata schema model is interpreted in a systemic way, learning about the structure of research objects, and honing the process as they go. We’ve high hopes we’ll be in a position to release continuous metadata schema versions and catch up on the backlog over the coming years.
What’s next
Continuous metadata development, with contributor roles up next
Retraction Watch data integrated into the REST API so users have a single source of retraction/correction data
Upgraded preprint matching and notifications
Modelling more equitable fees through the RCFS projects
Piloting a non-voting membership category
Once we’re fully in the cloud and in the groove of metadata updates, and with the support of newly-hired technology and program directors joining in the new year, we’ll turn our attention to rebuilding the central metadata system that we call the Crossref System, or “CS”, and we’ll report more on this next year.
So that was our summary of 2024 and an indication of what’s coming in 2025 and beyond; sorry it’s so long, and thanks for reading this far! Next year we’ll get back to more regular updates as the strategic agenda and the programs progress.