TY - GEN
T1 - Learning domain labels using conceptual fingerprints
T2 - 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016
AU - Afzal, Zubair
AU - Tsatsaronis, George
AU - Doornenbal, Marius
AU - Coupet, Pascal
AU - Gregory, Michelle
N1 - Publisher Copyright: © Springer International Publishing AG 2016.
PY - 2016/1/1
Y1 - 2016/1/1
N2 - Modelling a science domain for the purposes of thematically categorizing the research work and enabling better browsing and search can be a daunting task, especially if a specialized taxonomy or ontology does not exist for this domain. Elsevier, the largest academic publisher, faces this challenge often, for the needs of supporting the journals submission system, but also for supplying ScienceDirect and Scopus, two flagship platforms of the company, with sufficient metadata, such as conceptual labels that characterize the research works, which can improve the user experience in browsing and searching the literature. In this paper we describe an Elsevier in-use case study of learning appropriate domain labels from a collection of 6,357 full text articles in the neurology domain, exploring different document representations and clustering mechanisms. Besides the baseline approaches for document representation (e.g., bag-of-words) and their variations (e.g., n-grams), we employ a novel in-house methodology which produces conceptual fingerprints of the research articles, starting from a general domain taxonomy, such as the Medical Subject Headings (MeSH). A thorough empirical evaluation is presented, using a variety of clustering mechanisms and several validity indices to evaluate the resulting clusters. Our results summarize the best practices in modelling this specific domain and we report on the advantages and disadvantages of using the different clustering mechanisms and document representations that were examined, with the aim to learn appropriate conceptual labels for this domain.
KW - Best practices
KW - Clustering evaluation
KW - Conceptual fingerprints
KW - Document clustering
KW - Document labeling
KW - Domain taxonomy
KW - Neurology domain
UR - http://www.scopus.com/inward/record.url?scp=84997498673&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-49004-5_47
DO - 10.1007/978-3-319-49004-5_47
M3 - Conference contribution
AN - SCOPUS:84997498673
SN - 9783319490038
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 731
EP - 745
BT - Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings
A2 - Ciancarini, Paolo
A2 - Poggi, Francesco
A2 - Vitali, Fabio
A2 - Blomqvist, Eva
PB - Springer Verlag
Y2 - 19 November 2016 through 23 November 2016
ER -