Host: Anurag Khandelwal
Title: Data Lakes: Discovery, Data Debiasing, and Query Answering
Data for AI increasingly relies on the integration of multiple sources, sometimes obtained from open data repositories or data marketplaces. Despite decades of research in data integration and cleaning, we are still not sure how to construct AI-ready structured datasets, that is, data with descriptive features and a representative distribution. In this talk, I will first describe how to discover relevant datasets in large-scale data repositories, based on join and union operations, by designing efficient index structures. Next, I will show how to tailor a dataset from multiple sources to meet desired distribution requirements, in order to construct unbiased datasets. We will also see how to obtain an IID sample over normalized data, both to improve the efficiency of model training and to perform approximate query answering. Finally, I will conclude by discussing distribution-aware and human-centric aspects of data lake management.
Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in data management for AI-based data analytics and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE, and won the Best Demo Award at VLDB 2017.