Linguistics and Computer Science PhD Dissertation Defense
Title: Theory and Applications of Attribution for Interpretable Language Technology
Co-Advisors: Dana Angluin (Computer Science), Bob Frank (Linguistics), Vladimir Rokhlin (advisor of record)
Other committee members:
John Lafferty, Jason Shaw, Yoav Goldberg (Bar-Ilan University)
Attribution methods (Lipovetsky and Conklin, 2001; Štrumbelj et al., 2009; Simonyan et al., 2014; Zeiler and Fergus, 2014; Bach et al., 2015; Ribeiro et al., 2016; Shrikumar et al., 2017a; Murdoch et al., 2018; Sundararajan et al., 2017; Sundararajan and Najmi, 2020, inter alia) are a family of local interpretability techniques that measure the “contribution” of input features towards an individual model output. In natural language processing (NLP), attribution methods are often used to identify input tokens (e.g., Li et al., 2016, 2017; Arras et al., 2017b; Jumelet et al., 2019) or neural network units (e.g., Lakretz et al., 2019; Serrano and Smith, 2019) that strongly impact the overall behavior of a model. This dissertation takes steps towards developing a conceptual framework designed to guide the development and evaluation of attribution methods, with particular focus on NLP applications. We begin with an intrinsic evaluation of five attribution methods, which shows that the notion of “contribution” formalized by attribution methods does not match our intuitive understanding thereof. We argue that these results are due to an incongruence between the theories of causation that underlie the design of attribution methods and the vaguely defined goals of explanation against which attribution methods are evaluated. We then explore two applications of attribution methods: one that seeks to explain the behavior of an LSTM language model and one that uses measurements of causation in a downstream task. We conclude with a reflection on the conceptual structures and evaluation criteria imposed on attribution methods by these two applications, and propose a program for application-oriented research in attribution.