TY - JOUR
T1 - ChEMU 2020
T2 - Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents
AU - He, Jiayuan
AU - Nguyen, Dat Quoc
AU - Akhondi, Saber A.
AU - Druckenbrodt, Christian
AU - Thorne, Camilo
AU - Hoessel, Ralph
AU - Afzal, Zubair
AU - Zhai, Zenan
AU - Fang, Biaoyan
AU - Yoshikawa, Hiyori
AU - Albahem, Ameer
AU - Cavedon, Lawrence
AU - Cohn, Trevor
AU - Baldwin, Timothy
AU - Verspoor, Karin
N1 - Publisher Copyright: Copyright © 2021 He, Nguyen, Akhondi, Druckenbrodt, Thorne, Hoessel, Afzal, Zhai, Fang, Yoshikawa, Albahem, Cavedon, Cohn, Baldwin and Verspoor.
PY - 2021
Y1 - 2021
AB - Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
KW - chemical reactions
KW - cheminformatics
KW - event extraction
KW - information extraction
KW - named entity recognition
KW - patent text mining
UR - http://www.scopus.com/inward/record.url?scp=85113505930&partnerID=8YFLogxK
U2 - 10.3389/frma.2021.654438
DO - 10.3389/frma.2021.654438
M3 - Article
AN - SCOPUS:85113505930
SN - 2504-0537
VL - 6
JO - Frontiers in Research Metrics and Analytics
JF - Frontiers in Research Metrics and Analytics
M1 - 654438
ER -