The Invitation to Tender that cOAlition S put out on 10 Jul 2023 for a study to assess the impact of Plan S on the global scholarly communication ecosystem stated:
This report [addressing the questions and issues posed in Section 2 of this ITT] must be made available under the Creative Commons attribution licence (CC-BY). Any data used in this study must also be made available under a CC BY licence.
This requirement for making the dataset underpinning the Counterfactual Impact Evaluation (CIE) openly available under a CC BY licence strongly influences the source for the data samples to be used for the study. Previous similar analyses on the impact of Read & Publish agreements under the DEAL framework in Germany conducted by specific members of the scidecode team relied on Scopus as a data source. However, this work was carried out under an institutional affiliation, meaning that the institutional subscription to the international literature database could be used for free.
Things are different when the work is done on an external consultancy basis, and it soon became clear that using Scopus datasets on a commercial basis would come at a cost. Moreover, early explorations showed that this cost would amount to two thirds of the total project budget, in which no expenditure had been planned for this sort of cost.
The alternative choice was clear: to select an open literature database like OpenAlex or The Lens. By doing this we would not only meet the requirement in the ITT, but would also join a wider trend of exploring the feasibility of using open data sources to conduct bibliometric analyses. Research Consulting made a similar choice for their Sep 2023 report “Monitoring and evaluating the effectiveness of UKRI’s open access policy: Principles, opportunities and challenges”. The report is available on Zenodo at https://doi.org/10.5281/zenodo.7773581. The data specification supporting their study was published as an annex to the report and made openly available at https://doi.org/10.5281/zenodo.7773583. Its introductory considerations read:
The prioritisation of open data sources pursued as part of this work should not be considered as a formal recommendation to UKRI, as no testing has taken place to date to compare results using different sources (…)
There is in fact a growing, if very recent, body of tests comparing the comprehensiveness and quality of open bibliographic data sources against commercial databases. Some examples are listed below:
Jiao, C., Li, K. & Fang, Z. (2023). How are exclusively data journals indexed in major scholarly databases? An examination of four databases. Sci Data 10, 737. https://doi.org/10.1038/s41597-023-02625-x
Scheidsteger, T., Haunschild, R. & Bornmann, L. (2023). How similar are field-normalized scores from different free or commercial databases calculated for large German universities? [preprint]. 27th International Conference on Science, Technology and Innovation Indicators (STI 2023). https://doi.org/10.55835/6441118c643beb0d90fc543f
Akbaritabar, A., Theile, T., Zagheni, E. (2023). Global flows and rates of international migration of scholars. MPIDR Working Paper WP-2023-018, 13 pages. https://doi.org/10.4054/MPIDR-WP-2023-018
Partnering with OA.Works
Given the relevance of the quantitative analysis for the current project, the decision on which data sources to use was not one we wished to take lightly: some prior benchmark for the quality of specific data samples was needed. Unlike longer-term initiatives looking into the quality of open bibliographic datasets, such as the ETHZ project “Towards Open Bibliometric Indicators” (TOBI), we also needed to conduct this quality assessment quickly so that we could move on to the actual quantitative analysis of the impact of Plan S.
After the consultations on the use of commercial literature databases for a fee seemed to lead to a dead end, we started talking to OA.Works as a potential alternative for data provision using their OA.Report tool. OA.Works are the non-profit team behind the Open Access Button, developed in 2013. They rebranded as OA.Works in 2021 and collect OA.Report’s data from various open bibliographic sources, such as OpenAlex, Crossref, Unpaywall and many others, as well as from their own data sources. When, during an early discussion on the licensing requirement for the dataset underpinning the cOAlition S study, they said that the licence should be CC0 instead of CC BY, we knew these were the sort of partners we would like to work with. A Memorandum of Understanding has subsequently been signed between scidecode and OA.Works to provide the basis for this partnership.
Analysis of a specific data sample taken from a commercial and an open bibliographic source
The data sample selected for benchmarking was the subset of publications funded by the Brazilian Fundação Oswaldo Cruz (Fiocruz) in the period 2021-2022. The main reasons for the choice of this subset of publications as a statistically meaningful sample are that Fiocruz is not in the Global North and that English is not its “native language”. As a result, the sample is bound to include a good number of non-English publications and – possibly – a good number of publications without a DOI (these were some of the hypotheses to be validated in the course of the benchmarking exercise).
Also, while the quality and comprehensiveness of OpenAlex-based metadata is rapidly increasing, it’s potentially risky to rely on it for a rigorous quantitative study. The funding information is a particularly difficult metadata element to capture, especially when compared to commercial databases owned by publishers that may greatly benefit from the funding information directly collected from researchers as part of the manuscript submission process for their publications. While this rich metadata may eventually be shared via Crossref, aspects like standardisation and the application of PIDs may still be better addressed by commercial sources. The ultimate objective of this comparison was to validate the hypothesis that the quality and comprehensiveness of the bibliographic references provided by open metadata sources (with the added value provided by the enrichment process conducted by OA.Report) are on par with – if not better than – those provided by commercial sources. If this is the case, this will have wider implications beyond the specific study on the impact of Plan S currently underway.
Two complementary approaches were applied in order to analyse the quality of both datasets to be compared (OA.Report vs commercial database-sourced). Both datasets were available in Excel files with a reasonably similar field structure.
- A quantitative approach compared the number of publications in each of the samples and the reasons for the divergence. Automatically comparing both Excel files on the basis of the DOI makes it possible to assess the degree of overlap between them and where the differences stem from;
- A qualitative analysis subsequently looked into how accurately the funding information was captured in both samples. This was done by examining records available in both samples and by double-checking that the “unique” records provided by whichever sample is the richer are actually funded by Fiocruz.
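The DOI-based comparison described in the first bullet can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the benchmarking: the function names `normalise_doi` and `compare_by_doi` are hypothetical, and it assumes the DOI columns of both spreadsheets have already been read into plain Python lists.

```python
def normalise_doi(raw):
    """Lowercase a DOI and strip common URL/prefix forms; return None if empty."""
    if raw is None:
        return None
    doi = str(raw).strip().lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    doi = doi.strip()
    return doi or None


def compare_by_doi(open_dois, commercial_dois):
    """Split two DOI lists into shared records and records unique to each source."""
    open_set = {d for d in map(normalise_doi, open_dois) if d}
    commercial_set = {d for d in map(normalise_doi, commercial_dois) if d}
    return {
        "in_both": open_set & commercial_set,
        "only_open": open_set - commercial_set,
        "only_commercial": commercial_set - open_set,
    }
```

Normalising before comparing matters because DOIs are case-insensitive and may be stored either bare or as URLs. Records without a DOI are dropped by this sketch and would need to be matched by other means, for example by title.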
The Oct 29, 2023 Excel file provided by OA.Report with a snapshot for Fiocruz-funded publications in 2021-22 contained 1301 publications. The field structure for the records is as follows:
DOI openalex no title publisher journal issn published_date published_year PMCID volume issue authorships.author.display_name authorships.author.orcid authorships.institutions.display_name authorships.institutions.ror funder.name funder.award is_oa oa_status journal_oa_type publisher_license has_repository_copy repository_license repository_version repository_url has_oa_locations_embargoed can_archive version concepts.display_name subject pmc_has_data_availability_statement cited_by_count is_funded__fiocruz
Note: The date of publication is always a slippery field, as there is often a lack of clarity on whether this is the date of first online release or the date of the “printed issue”. Commercial databases typically provide just the latter, while OpenAlex tends to focus on online release dates. This can be an issue when trying to assign a year to a publication (both dates should ideally be available) and may subsequently impact the sample comparison.
The second Excel file gathered the subset of Fiocruz-funded publications in 2021-22 as downloaded from a commercial database on 29 Oct 2023. The spreadsheet contained 903 documents. The features on the result display included classifications by language of the publication and by country where it was produced, both of which provide valuable information. The field structure of the Excel file provided by the commercial database is configurable, but the structure that was selected for the comparison is as follows:
DOI, Authors, Author full names, Author(s) ID, Title, Year, Source title, Volume, Issue, Art. No., Page start, Page end, Page count, Cited by, Link, Funding Details, Funding Texts, Publisher, Language of Original Document, Document Type, Publication Stage, Open Access
Some results from the comparison of both data samples
As mentioned above, the comparison between the two datasets needed to be quick so we could move on to the quantitative analysis that constitutes the main goal of the project, namely the Counterfactual Impact Evaluation of Plan S. So rather than a fully comprehensive analysis of both samples, the emphasis was on ascertaining whether the data quality for the sample sourced from the open bibliographic database(s) was reliable enough as a basis for the subsequent CIE analysis.
Results of the comparison between the Oct 29 samples provided by the commercial database and OA.Report showed that 282 publications with a DOI in the commercial database dataset (903 documents) were not listed in the sample of 1301 publications provided by OA.Report. This is approximately one third (about 31%) of the commercial database subset, and significantly higher than initially expected.
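As a quick check on these figures, the arithmetic can be worked through as below. The overlap and OA.Report-only numbers are bounds rather than reported results: they assume matching is done by DOI only, that 282 is the full count of commercial records absent from the OA.Report sample, and that DOIs are unique within each sample.

```python
COMMERCIAL_TOTAL = 903   # documents in the commercial database sample
OPEN_TOTAL = 1301        # publications in the OA.Report sample
MISSING_FROM_OPEN = 282  # commercial records with a DOI not found in OA.Report

# Share of the commercial sample that OA.Report did not list.
missing_share = MISSING_FROM_OPEN / COMMERCIAL_TOTAL

# At most this many commercial records can also appear in OA.Report (assumption: DOI matching only).
max_overlap = COMMERCIAL_TOTAL - MISSING_FROM_OPEN

# Hence at least this many OA.Report records have no counterpart in the commercial sample.
min_open_only = OPEN_TOTAL - max_overlap

print(f"Share of commercial sample missing from OA.Report: {missing_share:.1%}")
print(f"Upper bound on DOI overlap: {max_overlap}")
print(f"Lower bound on OA.Report-only records: {min_open_only}")
```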
An analysis of the missing entries was performed to see whether there may have been mistakes in the coding by DOI, and also which publishers are most represented on the list of missing references. These are often large publishers like Elsevier, Springer Nature/BMC, Frontiers and Cambridge.
Given the small sample and the notoriously difficult task of adequately capturing funding acknowledgements in the metadata, it is difficult to tell with absolute certainty, but the analysis shows that most missing papers are in fact not directly funded by Fiocruz. This is because Fiocruz is often mentioned in funding acknowledgements as a provider of research facilities or equipment and not of research funding. However, these mentions are enough to bring the papers onto the list of the commercial data source even if they are not necessarily Fiocruz-funded. The inclusion of the funder name in English in the funding acknowledgements may also have played a role in some of the gaps, see example below. This is another issue that organisations from non-English-speaking countries typically face: the adoption of PIDs for organisations will eventually fix this, but in the meantime it’s necessary to carry out multilingual queries for consistency.
All this said, there are still cases like “Metformin improves depressive-like behavior in experimental Parkinson’s disease by inducing autophagy in the substantia nigra and hippocampus” (Inflammopharmacology, Springer), where the publication wasn’t included (at the time) in the OA.Report sample despite the Fiocruz funding acknowledgement being very clearly stated and even providing a grant number:
The authors would like to express their gratitude to the Knowledge Generation Program—Fundação Oswaldo Cruz (FIOCRUZ; # VPPCB-007-FIO-18 -2-17)
It is then safe to assume that neither of the two sources compared is perfect. What we wanted to test was not perfection though, but whether the data samples provided by openly available bibliographic sources were actually good enough. The qualitative analysis below showed that this was indeed the case.
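The multilingual querying mentioned above could look something like the following sketch. The list of name variants and the helper names `fold` and `mentions_funder` are assumptions made for illustration, not the queries run by OA.Report or the commercial database. As the discussion above makes clear, a match only shows that the funder is mentioned, not that it actually funded the work.

```python
import unicodedata

# Hypothetical list of name variants; a real query would need a curated list.
FUNDER_VARIANTS = (
    "Fiocruz",
    "Fundação Oswaldo Cruz",
    "Oswaldo Cruz Foundation",
)


def fold(text):
    """Strip accents, casefold and collapse whitespace for tolerant matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def mentions_funder(acknowledgement, variants=FUNDER_VARIANTS):
    """Return True if any name variant appears in the acknowledgement text."""
    folded = fold(acknowledgement)
    return any(fold(variant) in folded for variant in variants)
```

Folding accents means that “Fundacao Oswaldo Cruz” typed without diacritics is still matched. Telling funding apart from the use of facilities still requires a manual check of the acknowledgement text.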
The qualitative analysis focused on the unique publications included in the OA.Report data sample – which, as stated above, was significantly larger than the one provided by the commercial database. Some of the findings are summarised below:
- There was a significant number of SciELO entries on the list of unique publications provided by OA.Report. These were marked as FapUNIFESP (SciELO) in the publisher column and amounted to dozens of publications typically not included in the commercial database sample, as many of these SciELO titles are probably not indexed there.
- The proportion of Portuguese-language publications was small in both samples, but both the share and the number were higher in the OA.Report dataset. This is in line with the findings reported in other open vs commercial data sample quality comparisons.
- The more of an “outlier” a given publication is (meaning no DOI or a title in Portuguese), the more likely it is that the reference will only be listed in the OA.Report dataset, and the more likely that metadata gaps will appear in the OA.Report records. This is not unexpected, and the issue proves to be more acute the older the publications are.
- There may be papers that made it to the list because authors are affiliated with Fiocruz rather than funded by Fiocruz, but this can easily be checked and it’s not a big issue anyway. Upon provision of the OA.Report sample a warning was issued that funding acknowledgements to Fiocruz projects are not always straightforward (meaning that they may lack grant numbers and that, as stated above, the mention of Fiocruz doesn’t always imply a research funding acknowledgement).
Checking the OA status provided by Unpaywall in the OA.Report data sample, it becomes evident that any entry that has a PMC number is probably going to be OA regardless of what Unpaywall says, because PMC is a free full-text archive. OA status-related issues were also brought up in the Nov 6, 2023 blog post “Analysing and reclassifying open access information in OpenAlex” by Najko Jahn, Nick Haupka and Anne Hobert at the State and University Library Göttingen, and have already been addressed by the team behind OpenAlex (one of the clearest advantages of having open infrastructures is that feedback from users can be transparently acted upon).
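A consistency check along these lines could be sketched as follows, using the `PMCID` and `is_oa` field names from the OA.Report field structure listed earlier. The function name is hypothetical, and the sketch assumes each record is a dictionary whose `is_oa` value is either a boolean or a string such as "true"; the actual export format may differ.

```python
def flag_possible_oa_mismatches(records):
    """Return records that have a PMCID but are not marked as open access."""
    flagged = []
    for record in records:
        pmcid = str(record.get("PMCID") or "").strip()
        # Assumption: is_oa may arrive as a bool or as text such as "true"/"TRUE".
        is_oa = str(record.get("is_oa") or "").strip().lower() in {"true", "1", "yes"}
        if pmcid and not is_oa:
            flagged.append(record)
    return flagged
```

Flagged records are candidates for manual review rather than confirmed errors, since a PMC deposit may still be under embargo.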
While the snapshot for publications funded by Fiocruz in 2021-22 provided by a range of open bibliographic databases is currently not perfect, it is sufficiently comprehensive for the analysis of the impact of Plan S to rely on it. The two datasets used for the comparison between an open bibliographic source and a commercial database are not being openly shared here given there are commercial use limitations involved for one of them. However, the analysis is sound and clearly shows that the datasets provided by OA.Report will provide a solid basis for the comparison between the pre- and post-Plan S OA landscape for various funders while at the same time meeting the licensing requirements stated in the cOAlition S call.
Examples of acknowledgements that mention Fiocruz as a provider of facilities rather than of funding include “We would like to thank the Confocal Platform from Oswaldo Cruz Foundation” or “performing the biochemical assays in the Multiplex platform of the Instituto Oswaldo Cruz, FIOCRUZ, Rio de Janeiro”.
- Choosing a data sample provider for our study on the impact of Plan S - 22 January 2024