Activating qualified thesaurus terms for automatic indexing with taxonomy-based WSD

Inga Kohlhof, Boris Kozlov, Marius Doornenbal

Research output: Contribution to journal (Conference article)

Abstract

Many thesauri contain a number of descriptors consisting of the term proper plus a suffix in brackets meant to explain the term's intended interpretation. For instance, the MeSH thesaurus contains the term Polymorphism (Genetics). Across different thesauri, these terms account for 1%-5% of all descriptors. For automatic indexing based on recognizing term occurrences in free text, these terms are practically useless: free text never or very rarely contains term references of this form. A naive text annotation method that matches these terms with their bracketed qualifiers stripped off (the 'bare' terms) frequently results in wrong interpretations. We investigated to what extent short forms of qualified terms (viz. Polymorphism) can be disambiguated by looking for concepts in their textual environment that are ontologically related to the represented concepts (in casu, Genetic Polymorphism), or to the concepts used to qualify (Genetics). Using the NLP framework of the Elsevier Fingerprint Engine®, we created a set-up to test disambiguation for a set of 30 qualified terms from the NAL Thesaurus that we annotated in approximately 1500 scientific abstracts from the agricultural domain found in Scopus®. By their ambiguity with respect to the NAL Thesaurus, we distinguished three groups of test terms: terms with unqualified homonyms, terms with qualified homonyms, and terms without homonyms inside the thesaurus. For all three groups, the best results (65-75% recall, 83-93% precision) are found when both the concept hosting the qualified terms and the qualifier concept are used to identify supporting concepts in the terms' contexts. Like similar Word Sense Disambiguation (WSD) techniques, our approach is attractive because the system is informed by existing knowledge and therefore does not require huge knowledge-intensive investments. At the same time, the system delivers reasonable precision. For these reasons, we will seek to refine it to bring up recall scores.
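
The disambiguation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the Elsevier Fingerprint Engine: the toy taxonomy, the function names (`ancestors`, `related`, `supports`), and the relatedness criterion (two concepts are related if they are identical or share a broader concept) are all assumptions made for illustration. The `seeds` argument stands in for the choice between using the host concept, the qualifier concept, or both to find supporting concepts in the context.

```python
import re

# Hypothetical toy taxonomy (concept -> directly broader concepts).
# Illustrative only; these entries are not taken from NAL or MeSH.
BROADER = {
    "genetic polymorphism": {"genetics"},
    "alleles": {"genetics"},
    "genotype": {"genetics"},
    "genetics": set(),
    "crystallography": {"chemistry"},
    "chemistry": set(),
}


def ancestors(concept):
    """Return every broader concept reachable from `concept`."""
    seen, stack = set(), [concept]
    while stack:
        for parent in BROADER.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def related(a, b):
    """Assumed relatedness test: same concept, broader/narrower, or shared ancestor."""
    return bool((ancestors(a) | {a}) & (ancestors(b) | {b}))


def occurs(phrase, text):
    """Whole-word, case-insensitive match of `phrase` in `text`."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None


def supports(bare_term, seeds, text):
    """Accept the qualified reading of `bare_term` if the text also mentions
    a taxonomy concept related to at least one seed concept."""
    if not occurs(bare_term, text):
        return False
    return any(
        occurs(concept, text) and any(related(concept, seed) for seed in seeds)
        for concept in BROADER
        if concept not in seeds  # a seed must not count as its own evidence
    )


# Seeds for "Polymorphism (Genetics)": host concept plus qualifier concept.
SEEDS = ("genetic polymorphism", "genetics")

print(supports("polymorphism", SEEDS,
               "We studied polymorphism across alleles in maize populations."))
print(supports("polymorphism", SEEDS,
               "Polymorphism in crystallography describes multiple crystal forms."))
```

Passing only the host concept or only the qualifier concept as `seeds` gives the two narrower configurations; the abstract reports the best results when both are used.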

Original language: English
Pages (from-to): 17-28
Number of pages: 12
Journal: Computational Linguistics in the Netherlands Journal
Volume: 4
State: Published - Dec 1 2014
Event: 24th Meeting of Computational Linguistics in the Netherlands, CLIN 2014 - Leiden, Netherlands
Duration: Jan 17 2014 to Jan 17 2014
