Host: Abhishek Bhattacharjee
Title: Automating Failure Diagnosis for Distributed Systems
Distributed software systems have become the backbone of Internet services. Consequently, failures in production distributed systems have severe consequences. A 63-minute outage of Amazon in 2018 caused a $100-million loss in revenue. Moreover, the frequency of failures rises with the increasing complexity of software systems. The year 2019 saw noticeably more Internet outages and is sometimes called the “year of outages”.
Diagnosing such failures in distributed systems at data center scale is a particularly critical, yet notoriously difficult task because these systems are complex: there are numerous threads, processes, and nodes communicating concurrently. Existing diagnosis techniques are either intrusive and incur non-negligible performance overhead in a production environment, or face scalability challenges when applied to complex software systems.
A promising approach is to replicate how developers diagnose these failures. Guided by this notion, this talk will describe two tools, namely Pensieve and Kairux, which automate two major tasks of failure diagnosis: failure reproduction and root cause localization. Given the logs and code of a distributed system that has failed (in production), Pensieve is capable of formulating a minimal set of operations necessary to reproduce the failure, and Kairux can further pinpoint the single instruction that is the root cause.
Yongle Zhang is a PhD candidate in the Distributed Systems Research Group at the University of Toronto, working with Prof. Ding Yuan. His research interest is in systems software with a focus on improving the reliability and availability of complex, real-world systems.