Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

To define salient rhetorical elements in scholarly text, we previously defined a set of Discourse Segment Types: semantically defined spans of discourse at the level of a clause with a single rhetorical purpose, such as Hypothesis, Method or Result. In this paper, we use machine learning methods to predict these Discourse Segment Types in a corpus of biomedical research papers. The initial experiment used features related to verb type and form, obtaining F-scores ranging from 0.41 to 0.65. To improve our results, we explored a variety of methods for balancing classes before applying classification algorithms. We also performed an ablation study and a stepwise approach for feature selection. Through these feature selection processes, we were able to reduce our 37 features to the 9 most informative ones, while maintaining F1 scores in the range of 0.63–0.65. Next, we performed an experiment with a reduced set of target classes. Using only verb tense features with logistic regression, a decision tree classifier and a random forest classifier, we predicted whether a segment was a Result/Method or a Fact/Implication, with F1 scores above 0.8. Interestingly, findings from this machine learning approach are in line with a reader experiment, which found a correlation between verb tense and a biomedical reader’s interpretation of discourse segment type. This suggests that experimental and concept-centric discourse in biology texts can be distinguished by humans or machines, using verb tense as a key feature.
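
As a rough illustration of the workflow the abstract describes (balance the classes, classify from a verb tense feature, report per-class F1), the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy rows are invented rather than taken from the paper's corpus, random oversampling stands in for whichever balancing methods were actually explored, and the one-feature majority-vote rule is a hypothetical stand-in for logistic regression, decision trees and random forests. All function and variable names are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=0):
    """Balance classes by randomly duplicating rows of the smaller classes.

    rows is a list of (verb_tense, segment_class) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def fit_majority_by_feature(rows):
    """For each feature value, remember the label seen most often with it."""
    counts = defaultdict(Counter)
    for feature, label in rows:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def f1_per_class(gold, predicted):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in set(gold) | set(predicted):
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented toy data (assumption): NOT drawn from the paper's corpus.
TOY_ROWS = (
    [("past", "Result/Method")] * 6
    + [("present", "Result/Method")] * 1
    + [("present", "Fact/Implication")] * 3
    + [("past", "Fact/Implication")] * 1
)

balanced = oversample(TOY_ROWS)
model = fit_majority_by_feature(balanced)

# Scored on the same toy rows purely to show the mechanics; a real
# evaluation would use held-out data or cross-validation.
gold = [label for _, label in TOY_ROWS]
predicted = [model[tense] for tense, _ in TOY_ROWS]
scores = f1_per_class(gold, predicted)
print(model)
print(scores)
```

In practice, a library classifier and a cross-validated evaluation would replace the majority-vote rule and the in-sample scoring used here.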

Original language: English
Title of host publication: Semantics, Analytics, Visualization - 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers
Editors: Francesco Osborne, Silvio Peroni, Sahar Vahdati, Alejandra González-Beltrán
Publisher: Springer Verlag
Pages: 95-109
Number of pages: 15
ISBN (Print): 9783030013783
DOIs: https://doi.org/10.1007/978-3-030-01379-0_7
State: Published - Jan 1 2018
Event: 3rd International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2017 and 4th International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2018 - Lyon, France
Duration: Apr 24 2018 → Apr 24 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10959 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2017 and 4th International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2018
Country: France
City: Lyon
Period: 04/24/18 → 04/24/18

Keywords

  • Discourse segments
  • Linguistics
  • Machine learning
  • Sentence structure

Cite this

Cox, J., Harper, C. A., & de Waard, A. (2018). Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles. In F. Osborne, S. Peroni, S. Vahdati, & A. González-Beltrán (Eds.), Semantics, Analytics, Visualization - 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers (pp. 95-109). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10959 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-030-01379-0_7