Unsupervised Dense Retrieval for Scientific Articles

Dan Li, Vikrant Yadav, Zubair Afzal, Georgios Tsatsaronis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this work, we build a semantic search engine based on dense retrieval over scientific articles from Elsevier. The major challenge is that there is no labeled data for training and testing. We apply a state-of-the-art unsupervised dense retrieval method, Generative Pseudo Labeling (GPL), which generates high-quality pseudo training labels. Furthermore, since the articles are unevenly distributed across domains, we select passages from multiple domains to form balanced training data. For evaluation, we create two test sets: one manually annotated and one automatically created from the meta-information of our data. We compare the semantic search engine with the currently deployed lexical search engine on both test sets. The experimental results show that the semantic search engine trained with pseudo training labels can significantly improve search performance.
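The domain-balancing step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how passages were actually selected, so everything here is an assumption made for illustration: the function name `balanced_sample`, the `(domain, text)` input format, the fixed per-domain cap, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(passages, per_domain, seed=0):
    """Draw at most `per_domain` passages from each domain.

    `passages` is an iterable of (domain, text) pairs. This is an
    illustrative sketch, not the selection procedure used in the paper.
    """
    rng = random.Random(seed)

    # Group passage texts by their domain label.
    by_domain = defaultdict(list)
    for domain, text in passages:
        by_domain[domain].append(text)

    # Sample without replacement, capped at the domain's actual size.
    sample = []
    for domain in sorted(by_domain):
        texts = by_domain[domain]
        k = min(per_domain, len(texts))
        sample.extend((domain, t) for t in rng.sample(texts, k))
    return sample


# Hypothetical, heavily skewed toy corpus.
corpus = (
    [("physics", f"physics passage {i}") for i in range(100)]
    + [("chemistry", f"chemistry passage {i}") for i in range(20)]
    + [("biology", f"biology passage {i}") for i in range(5)]
)

selected = balanced_sample(corpus, per_domain=10)
counts = Counter(domain for domain, _ in selected)
```

With a cap of 10, the over-represented domains contribute 10 passages each, while a domain with fewer than 10 passages contributes all of them. The selected passages would then be the input to pseudo-label generation.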

Original language: English
Title of host publication: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Subtitle of host publication: Industry Track
Publisher: Association for Computational Linguistics (ACL)
Pages: 323-331
Number of pages: 9
ISBN (Electronic): 9781952148255
State: Published - 2022
Externally published: Yes
Event: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022 - Abu Dhabi, United Arab Emirates
Duration: Dec 7 2022 → Dec 11 2022

Publication series

Name: EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Conference

Conference: 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 12/7/22 → 12/11/22
