ProcureAnomalyDB: A Public Procurement Database for Fraud, Collusion, and Market Concentration Analysis

Rui Tavares Almeida
Department of Information Systems, University of Trás-os-Montes and Alto Douro, Vila Real 5000-801, Portugal
Beatriz Carvalho Pinto*
Department of Public Administration, University of Évora, Évora 7004-516, Portugal
bcpinto@uevora.pt
João Magalhães Ribeiro
School of Technology and Management, Polytechnic Institute of Leiria, Leiria 2411-901, Portugal
Inês Marques Silva
Department of Economics, University of the Azores, Ponta Delgada 9500-321, Portugal

DOI: https://doi.org/10.63646/datamind.2026.040105

Abstract

Public procurement accounts for between twelve and twenty percent of gross domestic product in OECD member states, and the published microdata describing tenders, bidders, bids, awards, and complaints has become one of the largest and most consistently updated administrative datasets available to competition authorities, audit offices, and supreme audit institutions. Despite this volume, integrity oversight is still typically conducted through manual sampling and ad hoc spreadsheets, because no widely adopted research database treats the six entity classes of public procurement as a coherent system. This article presents ProcureAnomalyDB, a public procurement database whose schema, field dictionary, indexes, quality-control pipeline, ethics regime, and reusable application programming interface are organized around three integrity questions: which tenders show signs of collusive bidding among co-bidding cohorts, which bids are anomalous in value relative to the expected price, and which buyer-supplier pairings exhibit unusual market concentration. Six core entities (NOTICE, BIDDER, BID, AWARD, COMPLAINT, ANOMALY_LABEL) are organized so that every flag traces back to a single auditable evidence chain. A polyglot store supports the heterogeneous query patterns these three questions demand: a Parquet-plus-Delta lakehouse, a PostgreSQL relational core, a Neo4j property graph for co-bidding cohort analysis, and a pgvector index for case-based reasoning.
We benchmark the database on a working subset of 8.42 million tender notices, 31.6 million bid records, and 5.21 million distinct supplier entities drawn from 2018 to 2023, and we report a runnable experiment that lifts collusive-bidding detection AUC from 0.738 (gradient-boosted baseline) to 0.953, raises the regulator hit-rate on a top-200 flag list from 64.7 to 82.6 percent, identifies ten agencies whose Herfindahl-Hirschman index exceeds the 2,500 concentration threshold by a factor of 1.5 to 1.9, and reduces audit case-review time from 58.4 to 14.7 minutes. The schema, dictionaries, and reproduction notebooks are released under an open license.
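
To make the concentration figure concrete, the following minimal sketch computes a Herfindahl-Hirschman index per agency as the sum of squared percentage shares of award value (0 to 10,000 scale) and expresses it as a multiple of the 2,500 threshold. It is illustrative only: the function names `hhi` and `hhi_by_agency`, the tuple layout, and the sample awards are hypothetical, and treating each agency's awards as one market measured by award value is an assumption, not necessarily the computation used in the article.

```python
from collections import defaultdict

HHI_THRESHOLD = 2500  # concentration threshold cited in the abstract


def hhi(award_value_by_supplier):
    """Return the HHI (0-10,000 scale): sum of squared percentage shares."""
    total = sum(award_value_by_supplier.values())
    if total <= 0:
        raise ValueError("total award value must be positive")
    return sum((100.0 * v / total) ** 2 for v in award_value_by_supplier.values())


def hhi_by_agency(awards):
    """Group (agency_id, supplier_id, value) tuples and score each agency."""
    totals = defaultdict(lambda: defaultdict(float))
    for agency, supplier, value in awards:
        totals[agency][supplier] += value
    return {agency: hhi(by_supplier) for agency, by_supplier in totals.items()}


# Hypothetical sample data, not drawn from ProcureAnomalyDB.
awards = [
    ("agency_a", "s1", 600.0),
    ("agency_a", "s2", 300.0),
    ("agency_a", "s3", 100.0),
    ("agency_b", "s1", 250.0),
    ("agency_b", "s2", 250.0),
    ("agency_b", "s3", 250.0),
    ("agency_b", "s4", 250.0),
]

scores = hhi_by_agency(awards)
# Ratio of each agency's HHI to the threshold; values above 1 exceed it.
ratios = {agency: score / HHI_THRESHOLD for agency, score in scores.items()}
for agency in sorted(scores):
    print(f"{agency}: HHI={scores[agency]:.0f}, ratio={ratios[agency]:.2f}")
```

In this toy data, the first agency's shares of 60, 30, and 10 percent give an HHI of 4,600, or 1.84 times the threshold, while four equal suppliers sit exactly at 2,500. An excess factor of 1.5 to 1.9 therefore corresponds to index values of 3,750 to 4,750.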

Article details

How to Cite

Almeida, R. T., Pinto, B. C., Ribeiro, J. M., & Silva, I. M. (2026). ProcureAnomalyDB: A Public Procurement Database for Fraud, Collusion, and Market Concentration Analysis. DATAMIND, 4(1), 57-60. https://doi.org/10.63646/datamind.2026.040105