Abstract
The CLEF SimpleText Lab centers on finding relevant passages in a large collection of scientific documents in response to a lay query, detecting and explaining difficult terminology within those passages, and finally simplifying the passages. The first task resembles ad-hoc retrieval: given a topic/query, the goal is to retrieve relevant passages, but in addition to relevance, ranking models should also assess documents by their readability/complexity. This paper describes our approach to building a ranking model for the first task. We first evaluate the performance of several models on a proprietary test collection built from scientific documents across multiple science domains. We then fine-tune the best-performing model on a large collection of unlabelled documents using the Generative Pseudo Labeling (GPL) approach. The key finding of our approach is that a bi-encoder model trained on the MS MARCO dataset and further fine-tuned on a large collection of unlabelled scientific passages achieves the highest performance on the proprietary dataset, which is specifically designed for scientific passage retrieval. Finally, fine-tuning a model in the same fashion, but using only the Computer Science queries from the test collection, proved successful for SimpleText Task 1.
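The abstract gives no implementation details, so the sketch below only illustrates, under assumption, what the described GPL-style domain adaptation of an MS MARCO bi-encoder could look like with the sentence-transformers library. The checkpoint names, the toy passage list, and all hyper-parameters are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch of GPL-style domain adaptation of an MS MARCO bi-encoder.
# Model names, the toy corpus, and hyper-parameters are illustrative assumptions.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses

# 1. Bi-encoder pre-trained on MS MARCO (the starting point described in the abstract).
bi_encoder = SentenceTransformer("msmarco-distilbert-base-tas-b")

# 2. doc2query-style model that generates pseudo queries for unlabelled passages.
qgen_tok = AutoTokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
qgen = AutoModelForSeq2SeqLM.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")

# Placeholder for the unlabelled scientific passages used for fine-tuning.
passages = [
    "Transformer models encode text into dense vectors for retrieval.",
    "Generative pseudo labeling adapts retrievers to new domains without labels.",
]

def generate_query(passage: str) -> str:
    """Sample one pseudo query for a passage."""
    inputs = qgen_tok(passage, truncation=True, max_length=384, return_tensors="pt")
    out = qgen.generate(**inputs, max_length=64, do_sample=True, top_p=0.95)
    return qgen_tok.decode(out[0], skip_special_tokens=True)

# 3. Cross-encoder "teacher" that scores (query, passage) pairs for pseudo labels.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

train_examples = []
for i, pos in enumerate(passages):
    query = generate_query(pos)
    # GPL mines hard negatives with dense retrieval over the corpus;
    # a random other passage stands in for that step in this sketch.
    neg = passages[(i + 1) % len(passages)]
    pos_score, neg_score = cross_encoder.predict([(query, pos), (query, neg)])
    margin = float(pos_score - neg_score)  # teacher margin = MarginMSE training label
    train_examples.append(InputExample(texts=[query, pos, neg], label=margin))

# 4. Distil the teacher margins into the bi-encoder with MarginMSE loss.
train_loader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MarginMSELoss(model=bi_encoder)
bi_encoder.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
bi_encoder.save("bi-encoder-gpl-scientific")
```

GPL combines pseudo query generation, hard-negative mining, and cross-encoder score distillation; the hard-negative mining step is deliberately simplified here to keep the sketch short.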
| Original language | English |
| --- | --- |
| Pages (from-to) | 2923-2934 |
| Number of pages | 12 |
| Journal | CEUR Workshop Proceedings |
| Volume | 3497 |
| State | Published - 2023 |
| Externally published | Yes |
| Event | 24th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF-WN 2023, Thessaloniki, Greece; Duration: Sep 18, 2023 → Sep 21, 2023 |
Keywords
- Domain Adaptation
- Information Retrieval
- Scholarly Document Processing
- Scientific Documents