2026 ASEE Annual Conference & Exposition

RelSim: A Process-Based Synthetic Relational Database Generator for Engineering Education

Presented at DSAI-Session 12: Data Science Projects, Datasets, and Real-World Applications

Database and data engineering instruction is often constrained by the limited availability of realistic, open-source relational datasets. Widely used examples such as Sakila and Northwind are small, static, and quickly exhausted in the classroom, limiting instructors’ ability to create varied assignments across cohorts and restricting opportunities to teach realistic temporal workflows and operational provenance. This limitation is increasingly problematic in modern data science and artificial intelligence (DSAI) curricula, where students must integrate multi-table data, reason over event logs, engineer features from temporal histories, and construct model-ready datasets without information leakage.

To address this gap, we present RelSim, an open-source platform for generating synthetic relational databases and event traces using a simulation-driven, process-based approach. RelSim operates in three phases: Configure, where instructors define schemas and workflows using a concise declarative specification; Generate, where entity populations are created using templates, formulas, and distributions; and Simulate, where discrete-event process logic produces time-stamped operational records that naturally reflect waiting, capacity constraints, and causal ordering. Unlike purely statistical synthetic data generators, RelSim produces relational coherence and temporal consistency by construction.
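As a rough illustration of the Configure, Generate, and Simulate phases described above, the following minimal Python sketch produces a small patient table and a time-stamped visit table. It is not RelSim's implementation or specification format. The `CONFIG` keys, the table and column names, and the functions `generate_patients`, `simulate_visits`, and `run` are all hypothetical. A simple first-come, first-served queueing recursion stands in for a full discrete-event engine.

```python
import random

# Configure: a hypothetical declarative specification (not RelSim's actual format).
CONFIG = {
    "n_patients": 5,
    "n_rooms": 2,               # capacity constraint
    "mean_interarrival": 10.0,  # minutes
    "mean_service": 15.0,       # minutes
    "seed": 42,
}

def generate_patients(cfg, rng):
    """Generate: create the entity population from simple distributions."""
    return [
        {"patient_id": i + 1, "age": rng.randint(18, 90)}
        for i in range(cfg["n_patients"])
    ]

def simulate_visits(cfg, patients, rng):
    """Simulate: patients arrive over time and wait for the earliest free room."""
    room_free_at = [0.0] * cfg["n_rooms"]
    visits = []
    clock = 0.0
    for visit_id, patient in enumerate(patients, start=1):
        # Next arrival time, with exponentially distributed gaps.
        clock += rng.expovariate(1.0 / cfg["mean_interarrival"])
        # Pick the room that frees up first; service cannot start before arrival.
        room = min(range(cfg["n_rooms"]), key=lambda r: room_free_at[r])
        started_at = max(clock, room_free_at[room])
        ended_at = started_at + rng.expovariate(1.0 / cfg["mean_service"])
        room_free_at[room] = ended_at
        visits.append({
            "visit_id": visit_id,
            "patient_id": patient["patient_id"],  # foreign key into patients
            "room": room,
            "arrived_at": clock,
            "started_at": started_at,
            "ended_at": ended_at,
        })
    return visits

def run(cfg):
    """Run Generate then Simulate with a seeded generator for reproducibility."""
    rng = random.Random(cfg["seed"])
    patients = generate_patients(cfg, rng)
    visits = simulate_visits(cfg, patients, rng)
    return patients, visits
```

Calling `run(CONFIG)` returns two lists of row dictionaries. Every visit row refers to a patient created in the Generate step, and each row satisfies `arrived_at <= started_at <= ended_at`, so waiting time and causal ordering follow from the simulation logic rather than from post hoc repair.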

This paper describes the design motivation and architecture of RelSim and demonstrates its capabilities through a healthcare scheduling case study. We also present validation results showing referential integrity and temporal consistency across generated tables, including many-to-many relationships. Finally, we outline a planned Spring 2026 classroom deployment in a master’s-level analytics engineering course and define an evaluation framework to assess learning outcomes related to multi-table data integration, temporal reasoning, and feature engineering in DSAI workflows.
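The kinds of checks mentioned above can be sketched with Python's built-in `sqlite3` module. This is only an assumed form of such validation, not the procedure used in the paper. The schema (`patient`, `provider`, `visit`, and the junction table `visit_provider`) and the helpers `count_orphans`, `count_temporal_violations`, and `build_demo` are hypothetical.

```python
import sqlite3

def count_orphans(conn, child, fk, parent, pk):
    """Count child rows whose foreign key has no matching parent row.

    Table and column names are interpolated into the SQL, so pass trusted
    identifiers only.
    """
    sql = (
        f"SELECT COUNT(*) FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(sql).fetchone()[0]

def count_temporal_violations(conn, table, earlier, later):
    """Count rows where the 'earlier' timestamp falls after the 'later' one."""
    sql = f"SELECT COUNT(*) FROM {table} WHERE {earlier} > {later}"
    return conn.execute(sql).fetchone()[0]

def build_demo():
    """Build a tiny in-memory database with a many-to-many junction table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (patient_id INTEGER PRIMARY KEY);
        CREATE TABLE provider (provider_id INTEGER PRIMARY KEY);
        CREATE TABLE visit (
            visit_id INTEGER PRIMARY KEY,
            patient_id INTEGER,
            started_at TEXT,  -- ISO-style text, so string order matches time order
            ended_at TEXT
        );
        CREATE TABLE visit_provider (visit_id INTEGER, provider_id INTEGER);
        INSERT INTO patient VALUES (1), (2);
        INSERT INTO provider VALUES (10), (11);
        INSERT INTO visit VALUES
            (100, 1, '2026-01-05 09:00', '2026-01-05 09:30'),
            (101, 2, '2026-01-05 09:15', '2026-01-05 10:00');
        INSERT INTO visit_provider VALUES (100, 10), (100, 11), (101, 11);
        """
    )
    return conn
```

On the demo data, `count_orphans(conn, "visit", "patient_id", "patient", "patient_id")` and the two corresponding checks on `visit_provider` should all return zero, as should `count_temporal_violations(conn, "visit", "started_at", "ended_at")`. A nonzero count identifies a table that breaks referential integrity or temporal ordering.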

Authors
  1. Yan Fang, University of Southern California
  2. Dr. Bruce Wilcox, University of Southern California

Note

The full paper will be available to logged-in, registered conference attendees once the conference starts on June 21, 2026, and to all visitors after the conference ends on June 24, 2026.