Main article

Xiaolong Pan
School of Transportation Engineering, Chang’an University, Xi’an 710064, China
Ruoxi Jiang*
College of Civil Engineering, Fuzhou University, Fuzhou 350108, China
jiang.ruoxi@fzu.edu.cn
Tao Cheng
School of Geographic Sciences, East China Normal University, Shanghai 200241, China
Yanmei Xu
School of Computer Science, Northwest A&F University, Yangling 712100, China

DOI: https://doi.org/10.63646/datamind.2023.010204

Abstract

Urban mobility intelligence increasingly depends on the joint analysis of transit smart card transactions, taxi GPS probes, shared-bike trips, road-side sensor counts, and meteorological observations. Yet these five data sources are typically curated in isolation, stored in incompatible formats, indexed by mismatched spatial and temporal keys, and exposed under inconsistent privacy regimes, which makes integrated analytical workflows fragile. This article presents UrbanFlowDB, a multimodal urban mobility database that treats the database itself as the principal research artifact. We document the schema, the field dictionary, the spatiotemporal index family, the ingestion and quality control pipeline, the pseudonymization and ethics processing flow, and the reusable application programming interfaces that expose the integrated data to downstream models. The database spans four co-resident stores: a Parquet-plus-Delta lakehouse, a PostGIS-extended relational store, a Neo4j property graph for congestion-propagation analysis, and a pgvector index for trajectory similarity search. We chose this polyglot layout deliberately, because each mobility analysis pattern aligns most naturally with a different storage paradigm. We benchmark the database in a runnable experiment using one year of data from a Chinese second-tier city (1.42 billion transit taps, 396 million taxi GPS pings, 21.6 million dockless bike trips, 8.4 million sensor records, and 215,860 weather observations). UrbanFlowDB lowers origin-destination demand prediction RMSE from 23.6 trips per 15-minute window for the strongest baseline to 18.9, raises congestion early-warning F1 from 0.793 to 0.851, and reduces trajectory imputation error by 35.4 percent at a 30 percent missing rate. End-to-end ingestion latency is below 19 seconds at the 95th percentile for all five sources, and the system sustains 14,200 trajectory queries per second on the production-scale dataset. The schema, dictionaries, and reproduction scripts are released under an open license.
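
To put the headline figures on a common scale, the reported before-and-after values can be converted to relative changes. The short Python sketch below does only that arithmetic on the numbers quoted in the abstract; the helper name `relative_change` is an illustrative assumption and is not part of the UrbanFlowDB release.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract (hypothetical variable names).
rmse_change = relative_change(23.6, 18.9)    # negative: error went down
f1_change = relative_change(0.793, 0.851)    # positive: F1 went up

print(f"OD demand RMSE change: {rmse_change:.1f}%")
print(f"Early-warning F1 change: {f1_change:.1f}%")
```

Under these figures the RMSE reduction is roughly 20 percent and the F1 gain roughly 7 percent. The 35.4 percent imputation improvement is already stated as a relative figure in the abstract.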

Article details

How to Cite

Pan, X., Jiang, R., Cheng, T., & Xu, Y. (2023). UrbanFlowDB: A Multimodal Urban Mobility Database for Traffic, Transit, and Micromobility Intelligence. DATAMIND, 1(2), 33-46. https://doi.org/10.63646/datamind.2023.010204