TY - GEN
T1 - Unsupervised Dense Retrieval for Scientific Articles
AU - Li, Dan
AU - Yadav, Vikrant
AU - Afzal, Zubair
AU - Tsatsaronis, Georgios
N1 - Publisher Copyright: © 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
AB - In this work, we build a dense retrieval based semantic search engine on scientific articles from Elsevier. The major challenge is that there is no labeled data for training and testing. We apply a state-of-the-art unsupervised dense retrieval model called Generative Pseudo Labeling that generates high-quality pseudo training labels. Furthermore, since the articles are unbalanced across different domains, we select passages from multiple domains to form balanced training data. For the evaluation, we create two test sets: one manually annotated and one automatically created from the meta-information of our data. We compare the semantic search engine with the currently deployed lexical search engine on the two test sets. The results of the experiment show that the semantic search engine trained with pseudo training labels can significantly improve search performance.
UR - http://www.scopus.com/inward/record.url?scp=85152969038&partnerID=8YFLogxK
U2 - 10.18653/v1/2022.emnlp-industry.32
DO - 10.18653/v1/2022.emnlp-industry.32
M3 - Conference contribution
AN - SCOPUS:85152969038
T3 - EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
SP - 323
EP - 331
BT - EMNLP 2022 - Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
PB - Association for Computational Linguistics (ACL)
T2 - 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, EMNLP 2022
Y2 - 7 December 2022 through 11 December 2022
ER -