TY - JOUR
T1 - Activating qualified thesaurus terms for automatic indexing with taxonomy-based WSD
AU - Kohlhof, Inga
AU - Kozlov, Boris
AU - Doornenbal, Marius
N1 - Publisher Copyright:
© 2014 Kohlhof, Kozlov and Doornenbal.
PY - 2014/12/1
Y1 - 2014/12/1
AB - Many thesauri contain a number of descriptors consisting of the term proper plus a suffix in brackets meant to explain the term's intended interpretation. For instance, the MeSH thesaurus contains a term Polymorphism (Genetics). For different thesauri, these terms account for 1%-5% of all descriptors. For automatic indexing based on recognizing term occurrences in free text, these terms are practically useless: free text never or very rarely contains term references of this form. A naive text annotation method, matching these terms with their bracketed qualifiers stripped off (the 'bare' terms), frequently results in wrong interpretations. We investigated to what extent short forms of qualified terms (viz. Polymorphism) can be disambiguated by looking for concepts in their textual environment that are ontologically related to the represented concepts (in casu, Genetic Polymorphism), or to the concepts used to qualify (Genetics). Using the NLP framework of the Elsevier Fingerprint Engine®, we created a set-up to test disambiguation for a set of 30 qualified terms from the NAL Thesaurus, which we annotated in approximately 1500 scientific abstracts from the agricultural domain found in Scopus®. By their ambiguity with respect to the NAL Thesaurus we distinguished three groups of test terms: terms with unqualified homonyms, terms with qualified homonyms, and terms without homonyms inside the thesaurus. For all three groups, the best results (65-75% recall, 83-93% precision) are found when both the concept hosting the qualified terms and the qualifier concept are used to identify supporting concepts in the terms' contexts. Like similar Word Sense Disambiguation (WSD) techniques, our approach is attractive because the system is informed by existing knowledge and therefore does not require huge knowledge-intensive investments. At the same time, the system delivers reasonable precision. For these reasons we will seek to refine it to bring up recall scores.
UR - http://www.scopus.com/inward/record.url?scp=84921907488&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84921907488
SN - 2211-4009
VL - 4
SP - 17
EP - 28
JO - Computational Linguistics in the Netherlands Journal
JF - Computational Linguistics in the Netherlands Journal
T2 - 24th Meeting of Computational Linguistics in the Netherlands, CLIN 2014
Y2 - 17 January 2014 through 17 January 2014
ER -