CS Colloquium
Title: The Statistics of Dirty Data
Hosts: Avi Silberschatz and Zhong Shao
Abstract:
A statistical model is only as good as its training data. Systematic errors can arise when data are integrated from untrustworthy sources, are collected in mixed formats, or contain inconsistent references to the same real-world entities. This talk describes the classical relational database topic of “data cleaning”, i.e., the process of transforming data to remove such issues, from a modern statistical perspective. The talk emphasizes two central themes: (1) analyzing data cleaning algorithms using statistical theory of sample complexity and generalization, and (2) building data cleaning systems for emerging statistical machine learning and AI applications. My results include new error bounds for query processing after data cleaning, learning-theoretic models for understanding the accuracy of data transformation rules on unseen data, and experimental results on the design of scalable data cleaning systems deployed in applications ranging from real-time robot learning to investigative journalism. I conclude by describing our ongoing effort on a system called AlphaClean, which leverages reinforcement learning to synthesize data cleaning programs for highly unstructured data cleaning problems.
Bio:
Sanjay Krishnan is a Computer Science PhD candidate in the RISELab and in the AUTOLAB (Berkeley Laboratory for Automation Science and Engineering) at UC Berkeley. His research studies problems at the intersection of database theory, machine learning, and robotics. Sanjay’s work has received a number of awards, including the 2016 SIGMOD Best Demonstration award, the 2015 IEEE GHTC Best Paper award, and the Sage Scholar award.