Choosing a data sample provider for our study on the impact of Plan S

[Estimated reading time: 10 mins]

The Invitation to Tender that cOAlition S put out on 10 Jul 2023 for a study to assess the impact of Plan S on the global scholarly communication ecosystem stated:

This report [addressing the questions and issues posed in Section 2 of this ITT] must be made available under the Creative Commons attribution licence (CC-BY).  Any data used in this study must also be made available under a CC BY licence.

This requirement to make the dataset underpinning the Counterfactual Impact Evaluation (CIE) openly available under a CC BY licence strongly influences the choice of source for the data samples used in the study. Previous similar analyses of the impact of Read & Publish agreements under the DEAL framework in Germany, conducted by members of the scidecode team, relied on Scopus as a data source. However, that work was carried out under an institutional affiliation, meaning the institutional subscription to the commercial literature database could be used at no cost.

Things are different when the work is done on an external consultancy basis, and it soon became clear that using Scopus datasets commercially would come at a cost. Moreover, early explorations showed that this cost would amount to two thirds of the total budget for the project – a budget in which no expenditure had been planned for these sorts of costs.

The alternative choice was clear: to select an open literature database such as OpenAlex or The Lens. By doing this we would not only meet the requirement in the ITT, but would also join a wider trend of exploring the feasibility of using open data sources to conduct bibliometric analyses. Research Consulting made a similar choice for their Sep 2023 report “Monitoring and evaluating the effectiveness of UKRI’s open access policy: Principles, opportunities and challenges”. The report is available on Zenodo, and the data specification supporting their study was published as an annex to the report and made openly available. Its introductory considerations read:

The prioritisation of open data sources pursued as part of this work should not be considered as a formal recommendation to UKRI, as no testing has taken place to date to compare results using different sources (…)

There is in fact a growing – if very recent – body of tests comparing the comprehensiveness and quality of open bibliographic data sources against commercial databases; see some examples below:

Jiao, C., Li, K. & Fang, Z. (2023). How are exclusively data journals indexed in major scholarly databases? An examination of four databases. Sci Data 10, 737.

Scheidsteger, T., Haunschild, R. & Bornmann, L. (2023). How similar are field-normalized scores from different free or commercial databases calculated for large German universities? [preprint]. 27th International Conference on Science, Technology and Innovation Indicators (STI 2023).

Akbaritabar, A., Theile, T., Zagheni, E. (2023). Global flows and rates of international migration of scholars. MPIDR Working Paper WP-2023-018, 13 pages.

Partnering with OA.Works

Given the relevance of the quantitative analysis for the current project, the decision on which data sources to use is not one we wished to take lightly – some previous benchmark for the quality of specific data samples was needed. Unlike longer-term initiatives looking into the quality of open bibliographic datasets – such as the ETHZ project “Towards Open Bibliometric Indicators” (TOBI) – we also needed to conduct this quality assessment quickly so that we could move on to the actual quantitative analysis of the impact of Plan S. After the consultations on the paid use of commercial literature databases seemed to lead to a dead end, we started talking to OA.Works as a potential alternative for data provision using their OA.Report tool. OA.Works are the non-profit team behind the Open Access Button, developed in 2013. They rebranded as OA.Works in 2021 and collect OA.Report’s data from various open bibliographic sources – such as OpenAlex, Crossref and Unpaywall, among others – complemented by their own data sources. When, during an early discussion of the licensing requirement for the dataset underpinning the cOAlition S study, they suggested that the licence should be CC0 rather than CC BY, we knew these were the sort of partners we would like to work with. A Memorandum of Understanding was subsequently signed between scidecode and OA.Works to provide the basis for this partnership.

Analysis of a specific data sample taken from a commercial and an open bibliographic source

The data sample selected for benchmarking was the subset of publications funded by the Brazilian Fundação Oswaldo Cruz (Fiocruz) in the period 2021-2022. The main reasons for choosing this subset of publications as a statistically meaningful sample are that Fiocruz is not based in the Global North and that English is not its “native language”. As a result, the sample was bound to include a good number of non-English publications and – possibly – a good number of publications without a DOI (these were among the hypotheses to be validated in the course of the benchmarking exercise).

Also, while the quality and comprehensiveness of OpenAlex-based metadata are rapidly increasing, it is potentially risky to rely on them for a rigorous quantitative study. Funding information is a particularly difficult metadata element to capture, especially when compared to commercial databases owned by publishers, which may benefit greatly from the funding information collected directly from researchers as part of the manuscript submission process. While this rich metadata may eventually be shared via Crossref, aspects like standardisation and the application of PIDs may still be better addressed by commercial sources. The ultimate objective of this comparison was to validate the hypothesis that the quality and comprehensiveness of the bibliographic references provided by open metadata sources (with the added value of the enrichment process conducted by OA.Report) are on a par with – if not better than – those provided by commercial sources. If this is the case, it will have wider implications beyond the specific study on the impact of Plan S currently underway.


Two complementary approaches were applied to analyse the quality of the two datasets to be compared (OA.Report vs commercial database-sourced). Both datasets were available as Excel files with a reasonably similar field structure.

  1. A quantitative approach compared the number of publications in each of the samples and the reasons for the divergence. Automatically comparing both Excel files on the basis of the DOI allows the degree of overlap to be assessed and shows where the differences between the two stem from;
  2. A qualitative analysis subsequently looked into how accurately the funding information was captured in both samples. This was done by examining records available in both samples and by double-checking that the “unique” records provided by whichever sample is richer are actually funded by Fiocruz.
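The DOI-based overlap check in step 1 can be sketched as follows. This is a minimal illustration with hypothetical DOI lists; in practice the DOI columns would be read from the two Excel files (e.g. with pandas), and DOIs need normalising first, since sources mix URL-prefixed and bare forms:

```python
def normalise_doi(doi):
    """Lowercase and strip common URL prefixes so DOIs from
    different sources compare equal."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

def compare_samples(open_dois, commercial_dois):
    """Return the overlap and the unique records on each side."""
    open_set = {normalise_doi(d) for d in open_dois}
    comm_set = {normalise_doi(d) for d in commercial_dois}
    return {
        "overlap": open_set & comm_set,
        "open_only": open_set - comm_set,
        "commercial_only": comm_set - open_set,
    }

# Hypothetical example records (not from the actual samples)
open_sample = ["10.1000/abc", "https://doi.org/10.1000/def", "10.1000/ghi"]
commercial_sample = ["DOI:10.1000/ABC", "10.1000/xyz"]

result = compare_samples(open_sample, commercial_sample)
```

The records in the `commercial_only` bucket are then the candidates for the manual funding-acknowledgement check described in step 2.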

The Oct 29, 2023 Excel file provided by OA.Report with a snapshot of Fiocruz-funded publications in 2021-22 contained 1,301 publications. The field structure for the records is as follows:

DOI, openalex no, title, publisher, journal, issn, published_date, published_year, PMCID, volume, issue, authorships.institutions.display_name, authorships.institutions.ror, funder.award, is_oa, oa_status, journal_oa_type, publisher_license, has_repository_copy, repository_license, repository_version, repository_url, has_oa_locations_embargoed, can_archive, version, concepts.display_name, subject, pmc_has_data_availability_statement, cited_by_count, is_funded__fiocruz

The date of publication is always a slippery field, as there is often a lack of clarity on whether it is the date of first online release or the date of the “printed issue”. Commercial databases typically provide just the latter, while OpenAlex tends to focus on online release dates. This can be an issue when trying to assign a year to a publication (both dates should ideally be available) and may subsequently affect the sample comparison.
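One pragmatic way to handle this ambiguity, sketched below under the assumption that the dates are available as date objects (the parameter names are illustrative, not taken from either spreadsheet), is to assign the year from the earliest available date:

```python
from datetime import date

def publication_year(online_date=None, print_date=None):
    """Assign a publication year from the earliest available date;
    sources often provide only one of the two dates."""
    dates = [d for d in (online_date, print_date) if d is not None]
    if not dates:
        return None
    return min(dates).year

# A paper released online in Dec 2021 but in a "printed issue" of 2022
# would be counted under 2021 with this rule.
year = publication_year(online_date=date(2021, 12, 15),
                        print_date=date(2022, 1, 10))
```

Whatever rule is chosen, applying it consistently to both samples matters more than the rule itself, since a year-boundary mismatch would otherwise inflate the apparent divergence between the datasets.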

The second Excel file gathered the subset of Fiocruz-funded publications in 2021-22 as downloaded from a commercial database on 29 Oct 2023. The spreadsheet contained 903 documents. The result display included classifications by language of the publication and by country where it was produced, both of which provide valuable information. The field structure of the Excel file provided by the commercial database is configurable; the structure selected for the comparison is as follows:

DOI, Authors, Author full names, Author(s) ID, Title, Year, Source title, Volume, Issue, Art. No., Page start, Page end, Page count, Cited by, Link, Funding Details, Funding Texts, Publisher, Language of Original Document, Document Type, Publication Stage, Open Access

Some results from the comparison of both data samples

As mentioned above, the comparison between the two datasets needed to be quick so we could move on to the quantitative analysis that constitutes the main goal of the project, namely the Counterfactual Impact Evaluation of the impact of Plan S. So rather than a fully comprehensive analysis of both samples, the emphasis was on ascertaining whether the data quality of the sample sourced from the open bibliographic database(s) was reliable enough to serve as a basis for the subsequent CIE analysis.

Quantitative analysis

Results of the comparison between the Oct 29 samples provided by the commercial database and OA.Report showed that 282 publications with a DOI in the commercial database dataset (903 documents) were not listed in the sample of 1,301 publications provided by OA.Report. This is approximately one third of the commercial database subset – significantly more than initially expected.

An analysis of the missing entries was performed to see whether there may have been mistakes in the coding by DOI – and also which publishers are most represented on the list of missing references. These are often large publishers like Elsevier, Springer Nature/BMC, Frontiers and Cambridge.

Given the small sample and a funding acknowledgement format that is notoriously difficult to capture adequately in metadata, it is difficult to tell with absolute certainty, but the analysis shows that most missing papers are in fact not directly funded by Fiocruz. This is because Fiocruz is often mentioned in funding acknowledgements as a provider of research facilities or equipment rather than of research funding[1]. However, these mentions are enough to bring the papers onto the list of the commercial data source even if they may not necessarily be Fiocruz-funded. The inclusion of the funder name in English in the funding acknowledgements may also have played a role in some of the gaps – see the example below. This is another issue that non-English organisations typically face: the onset of PIDs for organisations will eventually fix this, but in the meantime it is necessary to carry out multilingual queries for consistency.
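A minimal sketch of what such a multilingual query implies when matching funder names in acknowledgement texts. The name variants below are illustrative, not an exhaustive curated list, and a production workflow would ideally rely on persistent identifiers such as ROR IDs rather than string matching:

```python
# Illustrative name variants for Fiocruz; a real query would need a
# curated variant list and ideally PID-based matching (e.g. ROR IDs).
FUNDER_VARIANTS = [
    "fundação oswaldo cruz",    # Portuguese form
    "fundacao oswaldo cruz",    # without diacritics
    "oswaldo cruz foundation",  # English form
    "fiocruz",                  # common abbreviation
]

def mentions_funder(acknowledgement):
    """Case-insensitive check for any known name variant in a
    funding acknowledgement string."""
    text = acknowledgement.lower()
    return any(variant in text for variant in FUNDER_VARIANTS)
```

Note that, as discussed above, a positive match only shows that Fiocruz is *mentioned*; distinguishing a funding acknowledgement from a mention of facilities or equipment still requires manual checking.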


All this said, there are still cases like “Metformin improves depressive-like behavior in experimental Parkinson’s disease by inducing autophagy in the substantia nigra and hippocampus” (Inflammopharmacology, Springer) where the publication was not included (at the time) in the OA.Report sample despite the Fiocruz funding acknowledgement being very clearly stated and even providing a grant number:

The authors would like to express their gratitude to the Knowledge Generation Program—Fundação Oswaldo Cruz (FIOCRUZ; # VPPCB-007-FIO-18 -2-17)

It is then safe to assume that neither of the two sources compared is perfect. What we wanted to test was not perfection, though, but whether the data samples provided by openly available bibliographic sources were good enough. The qualitative analysis below showed that this was indeed the case.

Qualitative analysis

The qualitative analysis focused on the unique publications included in the OA.Report data sample – which, as stated above, was significantly larger than the one provided by the commercial database. Some of the findings are summarised below:

  • There was a significant number of SciELO entries on the list of unique publications provided by OA.Report. These were marked as FapUNIFESP (SciELO) in the publisher column and amounted to dozens of publications typically not included in the commercial database sample, as many of these SciELO titles are probably not indexed there.
  • The proportion of Portuguese-language publications was small in both samples, but both the share and the number were higher on the OA.Report dataset. This is in line with the findings reported in other open vs commercial data sample quality comparisons.
  • The more of an ‘outlier’ a given publication is (meaning no DOI, or a title in Portuguese), the more likely the reference is to be listed only in the OA.Report dataset – and the more likely that metadata gaps will appear in the OA.Report records. This is not unexpected, and the issue proves to be more acute the older the publications are.
  • There may be papers that made it onto the list because their authors are affiliated with Fiocruz rather than funded by Fiocruz, but this can easily be checked and is not a big issue anyway. Upon provision of the OA.Report sample, a warning was issued that funding acknowledgements of Fiocruz projects are not always straightforward (meaning that they may lack grant numbers and that, as stated above, the mention of Fiocruz does not always imply a research funding acknowledgement).

Checking the OA status provided by Unpaywall on the OA.Report data sample, it becomes evident that any entry with a PMC number is probably going to be OA regardless of what Unpaywall says – because PMC only hosts OA papers. OA status-related issues were also brought up in the Nov 6, 2023 “Analysing and reclassifying open access information in OpenAlex” blog post by Najko Jahn, Nick Haupka and Anne Hobert at the State and University Library Göttingen, and have already been addressed by the team behind OpenAlex (one of the clearest advantages of open infrastructures is that feedback from users can be transparently acted upon).


While the snapshot of publications funded by Fiocruz in 2021-22 provided by a range of open bibliographic databases is currently not perfect, it is sufficiently comprehensive for the analysis of the impact of Plan S to rely on it. The two datasets used for the comparison between an open bibliographic source and a commercial database are not being openly shared here, given that commercial use limitations apply to one of them. However, the analysis clearly shows that the datasets provided by OA.Report will offer a solid basis for the comparison between the pre- and post-Plan S OA landscape for various funders, while at the same time meeting the licensing requirements stated in the cOAlition S call.

[1] For instance “We would like to thank the Confocal Platform from Oswaldo Cruz Foundation” or “performing the biochemical assays in the Multiplex platform of the Instituto Oswaldo Cruz, FIOCRUZ, Rio de Janeiro”

Pablo de Castro

7 thoughts on “Choosing a data sample provider for our study on the impact of Plan S”

  1. That’s a really interesting article, thanks for sharing! I have been wondering how solid the data behind OA.Report would be for these purposes, and it’s very good to see your positive conclusions on the matter.

    I just wanted to add a small clarification on the Research Consulting report for UKRI – we didn’t mean to say that there was no research looking at commercial vs. open sources, only that we didn’t assess how fit for purpose either of these would be to meet UKRI’s specific M&E requirements. The examples you shared here look very useful, I’ll add them to my reading list 🙂

    1. Thanks for your comment, Andrea, happy to hear this was useful. If you check the dates of the three benchmarking references we’ve provided, they were all published in 2023, mostly after your UKRI report – itself a source of inspiration for our work on this project – was first released in Sep. Things keep evolving so quickly in this domain!

  2. Very interesting article, and very impressive to see that it is able to capture 400 more publications than the commercial database! Apparently OA.Works’ funder information is much richer and more comprehensive. I would be very interested to learn how they manage to do that given that, as you suggest, like many other open bibliographic databases they will often not have access to funding information from full-text papers (nor from submission systems, of course).

    1. Thanks Hans, there’s indeed quite a difference in the number of publications. I am not privy to the specific querying strategies OA.Works have followed to put together their data samples – I just examined them as a user interested in building on top of the sample for our own analysis – but I think this may have involved some level of text mining to unearth ‘obscure’ references (meaning ones published in sources typically not indexed in commercial databases), including plenty of them coming from the Fiocruz institutional repository arca, which contains over 50,000 publications. If OA.Works wish to provide any further info, it’ll be added as a comment in this thread.

      Perhaps worth noting that while this arca IR is listed as a (potential) data source on OpenAIRE Explore as a result of its presence in the OpenDOAR repository directory, it is not registered as a data provider, meaning all these ‘obscure’ references don’t necessarily make it into the aggregation (even if a number of them will be captured through the indexing of the small Diamond OA journals where they were published).

      arca repository in OpenAIRE

  3. Hi Pablo, thanks for sharing this exploratory analysis (and for mentioning the Research Consulting report and its data specification 🙂

    It would be great to also have some more info about the added value OA.Works provides to the funder information currently available in Crossref (and consequently OpenAlex) itself, both by discussing the additional sources OA.Works uses, and by sharing (only) the OA.Works dataset created for the analysis reported here, so it would be possible to compare that to Crossref (and OpenAlex) data.

    For reference, just using the Crossref Funder Registry for Fiocruz today gives 359 records in Crossref for 2021-2022 (using the year from the ‘issued’ field, which is the earliest of the dates provided for print and online publication):
    ,until-issued-date:2022,funder:501100006507&select=DOI,funder

    OpenAlex currently only takes funder information from Crossref, and only based on Funder Registry IDs, and indeed the count in OpenAlex is similar (360, the discrepancy explained by how dates are treated):
    ,publication_year:2021-2022

    Crossref does have a bit more funder information not taken into account by OpenAlex, namely funder names – both the names included in the Funder Registry entry and other names provided by the publisher as funder names when registering metadata with Crossref.

    In this case, including funder names does not add significantly to the number of records identified for 2021-2022, but that may differ for other funders.

    In any case, it would be interesting to be able to compare these data to the data from OA.Works and see, for instance, how many of their additional records have Crossref DOIs and how many don’t. Currently, this remains a bit of a black box, which I hope doesn’t need to be the case?

    The dataset I extracted from Crossref is available here:
    together with the SQL script used to generate it from the Crossref Metadata Plus snapshot, using the COKI (Curtin Open Knowledge Initiative) instance of Google BigQuery, which has these ingested.

    1. Thanks Bianca for all this info – please see the reply to the previous comment by Hans de Jonge. Bringing the Fiocruz repository into the mix of data sources also provides a (partial) answer to your comment above. As for the follow-up comment, I will leave the fair points you raise unaddressed for the moment in case it’s possible to share a URL to the data sample OA.Works provided for Fiocruz-funded publications (I would then post it in a reply to the follow-up comment). What I’m trying to hint at in the sentences you highlight is that the vertical product integration covering the whole scholarly publishing cycle – from manuscript processing systems to the commercial literature databases through the Current Research Information Systems (CRIS) used by institutions – that certain providers enjoy may in principle offer them clear advantages in terms of identifying suitable publications (including on the basis of funding information) and conducting an analysis on top of that. The “Understanding Amsterdam’s Competitive Advantage” report provided interesting – if indirect – evidence of this.

  4. And a couple of additional comments, sorry this has become a bit long…

    – where you write “commercial databases owned by publishers that may greatly benefit from the funding information directly collected from researchers as part of the manuscript submission process for their publications” – I think this mixes two things: information publishers have that they collect during the submission and production process, and information (commercial) databases have through agreements with publishers, including access to funding acknowledgements for papers, including those that are not open access. Thus, these databases, whether owned by a publisher (e.g. Scopus) or not (e.g. Web of Science), can create enriched funding information for articles from a range of publishers.

    – I don’t fully understand the claim “While this rich metadata may eventually be shared via Crossref, aspects like standardisation and the application of PIDs may still be better addressed by commercial sources” – as metadata are provided to Crossref exclusively by publishers, and the extent to which publishers match author-supplied funder information to PIDs (currently Funder IDs and in future RORs) and then include those in the metadata they supply to Crossref is not necessarily greater for commercial vs. non-commercial publishers.

    If this is more about the ability of providers of bibliographic databases to perform internal standardisation and assign internal PIDs, here too I do not see why commercial sources would necessarily be better at this – it is something OpenAlex and OpenAIRE are putting extensive work into, and the same could actually be said of OA.Works 🙂
