The rapid growth of documents across the web has necessitated finding means of discarding redundant documents and retaining novel ones. Capturing redundancy is challenging as it may involve investigating at a deep semantic level. Techniques for detecting such semantic redundancy at the document level are scarce. In this work we propose a deep Convolutional Neural Networks (CNN) based model to classify a document as novel or redundant with respect to a set of relevant documents already seen by the system. The system is simple and do not require any manual feature engineering. Our novel scheme encodes relevant and relative information from both source and target texts to generate an intermediate representation which we coin as the Relative Document Vector (RDV). The proposed method outperforms the existing state-of-the-art on a document-level novelty detection dataset by a margin of ∼5% in terms of accuracy. We further demonstrate the effectiveness of our approach on a standard paraphrase detection dataset where paraphrased passages closely resemble to semantically redundant documents.
|Association for Computational Linguistics
|Published - May 1 2018