Abstract
The CLEF SimpleText Lab focuses on identifying pertinent sections from a vast collection of scientific papers in response to general queries, recognizing and explaining complex terminology in those sections, and, ultimately, making the sections easier to understand. The first task is akin to ad-hoc retrieval, where the objective is to find relevant sections for a query/topic, but it also requires ranking models to evaluate documents according to their readability and complexity alongside relevance. The third task centers on simplifying sentences from scientific abstracts. In this paper, we outline our strategy for creating a ranking model to address the first task and our methods for employing GPT-3.5 in zero-shot and few-shot settings for the third task. To create the ranking model, we first assess the performance of several models on a proprietary test collection built from scientific papers spanning various scientific fields. We then fine-tune the top-performing model on a large set of unlabelled documents using the Generative Pseudo Labeling (GPL) approach. We further experiment with generating new search queries from the provided queries, topics, and abstracts. Our primary contribution and finding is that a bi-encoder model trained on the MS MARCO dataset and further fine-tuned on a vast collection of unlabelled scientific sections yields the best results on the proprietary dataset, which was designed specifically for the scientific passage retrieval task. For the third task, we test the limits of GPT-3.5, a Large Language Model (LLM) used without task-specific fine-tuning, by experimenting with various zero-shot and few-shot prompts at both the sentence and abstract level. We find that few-shot prompting yields higher BLEU and SARI scores but also a higher FKGL, since the reference simplifications in the provided test set themselves have a higher FKGL. Conversely, zero-shot prompting can achieve a lower FKGL, but at the cost of lower BLEU and SARI scores.
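The GPL domain-adaptation step mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the `sentence-transformers` library, an MS MARCO-trained bi-encoder checkpoint, and a cross-encoder pseudo-labeler; the specific model names and the toy triple are illustrative assumptions, not taken from the paper.

```python
# Minimal GPL-style fine-tuning sketch (model choices are assumptions).
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader

# Bi-encoder to adapt (an MS MARCO-trained checkpoint, as in the abstract).
bi_encoder = SentenceTransformer("msmarco-distilbert-base-tas-b")

# Cross-encoder that pseudo-labels (query, positive, negative) triples.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def pseudo_label(query: str, pos: str, neg: str) -> float:
    """GPL margin label: cross-encoder score difference between passages."""
    s_pos, s_neg = cross_encoder.predict([(query, pos), (query, neg)])
    return float(s_pos - s_neg)

# In full GPL, `triples` come from synthetic query generation over the
# unlabelled corpus plus hard-negative mining; one toy triple shown here.
triples = [("what is a bi-encoder?",
            "A bi-encoder embeds queries and passages separately.",
            "The weather in Grenoble is mild in September.")]
train_examples = [InputExample(texts=[q, p, n], label=pseudo_label(q, p, n))
                  for q, p, n in triples]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MarginMSELoss(model=bi_encoder)  # the GPL training objective
bi_encoder.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```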
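For the simplification task, zero-shot versus few-shot prompting of GPT-3.5 differs only in whether in-context demonstrations are prepended. The sketch below assumes the OpenAI v1 Python client; the prompt wording is a placeholder, not the paper's actual prompt.

```python
# Illustrative zero-/few-shot simplification calls (prompts are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = "Rewrite scientific sentences in plain language."

def simplify_zero_shot(sentence: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": sentence}],
        temperature=0,
    )
    return resp.choices[0].message.content

def simplify_few_shot(sentence: str, examples: list[tuple[str, str]]) -> str:
    messages = [{"role": "system", "content": SYSTEM}]
    for complex_s, simple_s in examples:  # in-context demonstrations
        messages.append({"role": "user", "content": complex_s})
        messages.append({"role": "assistant", "content": simple_s})
    messages.append({"role": "user", "content": sentence})
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0)
    return resp.choices[0].message.content
```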
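The trade-off reported between BLEU/SARI and FKGL can be reproduced with standard tooling. A minimal sketch, assuming the Hugging Face `evaluate` library for BLEU and SARI and `textstat` for FKGL (the lab may use its own scoring scripts), with made-up example sentences:

```python
# Sketch of the evaluation metrics named in the abstract (data is made up).
import evaluate
import textstat

sources = ["The mitochondrion is the organelle responsible for ATP synthesis."]
predictions = ["Mitochondria are the parts of cells that make energy."]
references = [["Mitochondria make the cell's energy."]]

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")

sari_score = sari.compute(sources=sources, predictions=predictions,
                          references=references)["sari"]
bleu_score = bleu.compute(predictions=predictions,
                          references=references)["score"]
# Lower FKGL means easier to read; computed on the system output only,
# which is why it can move opposite to the reference-based BLEU and SARI.
fkgl = textstat.flesch_kincaid_grade(predictions[0])

print(f"SARI={sari_score:.1f} BLEU={bleu_score:.1f} FKGL={fkgl:.1f}")
```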
| Original language | English |
|---|---|
| Pages (from-to) | 3206-3229 |
| Number of pages | 24 |
| Journal | CEUR Workshop Proceedings |
| Volume | 3740 |
| State | Published - 2024 |
| Event | 25th Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, Sep 9–12, 2024 |
Keywords
- Domain Adaptation
- Information Retrieval
- Scholarly Document Processing
- Scientific Documents