Alexander R. Fabbri
Title: Text Summarization Across High and Low-resource Settings
Advisor: Dragomir Radev
Other committee members:
Smaranda Muresan (Columbia University)
Natural language processing aims to build automated systems that can both understand and generate natural language textual data. As the amount of textual data available online has increased exponentially, so has the need for intelligent systems to comprehend and present this data to the world. As a result, automatic text summarization, the process by which a text’s salient content is automatically distilled into a concise form, has become a necessary tool.
Automatic text summarization approaches and applications vary based on the input summarized, which may consist of a single document or multiple documents of different genres. Furthermore, the desired output style may consist of sentences or sub-sentential units chosen directly from the input, as in extractive summarization, or a fusion and paraphrasing of the input documents, as in abstractive summarization. Despite differences in the above use cases, specific themes are common across these settings: the role of large-scale data for training these models, the application of summarization models in real-world scenarios, and the need to adequately evaluate and compare summaries.
This dissertation presents novel data and modeling techniques for deep neural network-based summarization models trained across large-scale and low-resource data settings, along with a comprehensive evaluation of model and metric progress in the field. We examine both Recurrent Neural Network (RNN)-based and Transformer-based models that extract and generate summaries from the input. We introduce datasets to facilitate the training of large-scale networks in two applications of multi-document summarization. We also propose unsupervised learning techniques for both extractive summarization in question answering and abstractive summarization, as well as few-shot and distantly supervised learning to make better use of unlabeled data.
In particular, this dissertation addresses the following research objectives:
1) High-resource Summarization. We introduce two datasets for multi-document summarization, focusing on pedagogical applications for NLP and news summarization. In both cases, we analyze how the models trained on these large-scale datasets fare when applied to real-world scenarios and introduce a novel model to reduce redundancy in multi-document summaries.
2) Low-resource Summarization. We propose a pipeline for creating synthetic training data for training extractive question-answering models, a form of query-based extractive summarization with short-phrase summaries. In other work, we propose an automatic pipeline for training a multi-document summarizer in answer summarization on community question-answering forums without labeled data. Finally, we push the boundaries of abstractive summarization model performance when little or no training data is available.
3) Automatic Summarization Evaluation. We study the metrics currently used to compare summarization output quality, evaluating 12 metrics across the outputs of 23 deep neural network models, and propose better-motivated summarization evaluation guidelines.