TY - JOUR
T1 - Chemical entity recognition in patents by combining dictionary-based and statistical approaches
AU - Akhondi, Saber A.
AU - Pons, Ewoud
AU - Afzal, Zubair
AU - van Haagen, Herman
AU - Becker, Benedikt F.H.
AU - Hettne, Kristina M.
AU - van Mulligen, Erik M.
AU - Kors, Jan A.
N1 - Publisher Copyright: © The Author(s) 2016. Published by Oxford University Press.
PY - 2016
Y1 - 2016
N2 - We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents.
AB - We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents.
UR - http://www.scopus.com/inward/record.url?scp=84985929575&partnerID=8YFLogxK
U2 - 10.1093/database/baw061
DO - 10.1093/database/baw061
M3 - Article
C2 - 27141091
AN - SCOPUS:84985929575
SN - 1758-0463
VL - 2016
JO - Database : the journal of biological databases and curation
JF - Database : the journal of biological databases and curation
ER -