TY - GEN
T1 - Let's agree to disagree: On the evaluation of vocabulary alignment
T2 - 6th International Conference on Knowledge Capture, KCAP 2011
AU - Tordai, Anna
AU - Van Ossenbruggen, Jacco
AU - Schreiber, Guus
AU - Wielinga, Bob
PY - 2011
Y1 - 2011
AB - Gold standard mappings created by experts are at the core of alignment evaluation. At the same time, the process of manual evaluation is rarely discussed. While the practice of having multiple raters evaluate results is accepted, their level of agreement is often not measured. In this paper we describe three experiments in manual evaluation and study the way different raters evaluate mappings. We used alignments generated using different techniques and between vocabularies of different types. In each experiment, five raters evaluated alignments and talked through their decisions using the think-aloud method. In all three experiments we found that inter-rater agreement was low, and we analyzed our data to find the reasons for it. Our analysis shows which variables can be controlled to affect the level of agreement, including the mapping relations, the evaluation guidelines, and the background of the raters. On the other hand, differences in the perception of raters and the complexity of the relations between often ill-defined natural language concepts remain inherent sources of disagreement. Our results indicate that the manual evaluation of ontology alignments is by no means an easy task and that the ontology alignment community should be careful in the construction and use of reference alignments.
KW - empirical study
KW - inter-rater agreement
KW - manual evaluation
KW - vocabulary alignment
UR - http://www.scopus.com/inward/record.url?scp=79960270087&partnerID=8YFLogxK
U2 - 10.1145/1999676.1999689
DO - 10.1145/1999676.1999689
M3 - Conference contribution
AN - SCOPUS:79960270087
SN - 9781450303965
T3 - KCAP 2011 - Proceedings of the 2011 Knowledge Capture Conference
SP - 65
EP - 72
BT - KCAP 2011 - Proceedings of the 2011 Knowledge Capture Conference
Y2 - 26 June 2011 through 29 June 2011
ER -