Abstract
To define salient rhetorical elements in scholarly text, we have earlier defined a set of Discourse Segment Types: semantically defined spans of discourse at the level of a clause with a single rhetorical purpose, such as Hypothesis, Method or Result. In this paper, we use machine learning methods to predict these Discourse Segment Types in a corpus of biomedical research papers. The initial experiment used features related to verb type and form, obtaining F-scores ranging from 0.41 to 0.65. To improve our results, we explored a variety of methods for balancing classes before applying classification algorithms. We also performed an ablation study and a stepwise approach to feature selection. Through these feature selection processes, we were able to reduce our 37 features to the 9 most informative ones, while maintaining F1 scores in the range of 0.63–0.65. Next, we performed an experiment with a reduced set of target classes. Using only verb tense features with logistic regression, a decision tree classifier and a random forest classifier, we predicted whether a segment was a Result/Method or a Fact/Implication, with F1 scores above 0.8. Interestingly, findings from this machine learning approach are in line with a reader experiment, which found a correlation between verb tense and a biomedical reader’s interpretation of discourse segment type. This suggests that experimental and concept-centric discourse in biology texts can be distinguished by humans or machines, using verb tense as a key feature.
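The reduced-class experiment in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: instead of logistic regression, a decision tree or a random forest, it uses a simple majority-vote lookup from verb tense to class, and the toy data, the tense labels and the function names (`fit_majority_by_tense`, `f1_score`) are all assumptions invented for illustration. It only shows how a single verb tense feature could separate Result/Method from Fact/Implication segments, and how a per-class F1 score is computed.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (verb tense, coarse segment class).
# Invented for illustration only; not drawn from the paper's corpus.
TRAIN = [
    ("past", "Result/Method"),
    ("past", "Result/Method"),
    ("past", "Result/Method"),
    ("past", "Fact/Implication"),
    ("present", "Fact/Implication"),
    ("present", "Fact/Implication"),
    ("present", "Result/Method"),
    ("present", "Fact/Implication"),
]
TEST = [
    ("past", "Result/Method"),
    ("past", "Result/Method"),
    ("present", "Fact/Implication"),
    ("present", "Result/Method"),
    ("past", "Fact/Implication"),
    ("present", "Fact/Implication"),
]


def fit_majority_by_tense(pairs):
    """Map each tense to the class it most often co-occurs with."""
    counts = defaultdict(Counter)
    for tense, label in pairs:
        counts[tense][label] += 1
    return {tense: c.most_common(1)[0][0] for tense, c in counts.items()}


def f1_score(gold, predicted, positive):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


model = fit_majority_by_tense(TRAIN)
gold = [label for _, label in TEST]
predicted = [model[tense] for tense, _ in TEST]

for cls in ("Result/Method", "Fact/Implication"):
    print(cls, round(f1_score(gold, predicted, cls), 2))
```

A lookup table like this is only a baseline; the classifiers named in the abstract can combine several tense-related features, which a single-feature majority vote cannot.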
Original language | English |
---|---|
Title of host publication | Semantics, Analytics, Visualization - 3rd International Workshop, SAVE-SD 2017, and 4th International Workshop, SAVE-SD 2018, Revised Selected Papers |
Editors | Francesco Osborne, Silvio Peroni, Sahar Vahdati, Alejandra González-Beltrán |
Publisher | Springer Verlag |
Pages | 95-109 |
Number of pages | 15 |
ISBN (Print) | 9783030013783 |
State | Published - Jan 1 2018 |
Event | 3rd International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2017 and 4th International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2018 - Lyon, France. Duration: Apr 24 2018 → Apr 24 2018 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 10959 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 3rd International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2017 and 4th International Workshop on Semantics, Analytics, Visualization, SAVE-SD 2018 |
---|---|
Country/Territory | France |
City | Lyon |
Period | 04/24/18 → 04/24/18 |
Keywords
- Discourse segments
- Linguistics
- Machine learning
- Sentence structure
Fingerprint
Dive into the research topics of 'Optimized Machine Learning Methods Predict Discourse Segment Type in Biological Research Articles'. Together they form a unique fingerprint.
Projects
- 1 Finished
DARPA project: Extracting Cancer Abstracts with Carnegie Mellon University
Hovy, E. H. (CoI), Burns, G. (CoI) & de Waard, A. (CoI)
01/01/17 → 12/31/19
Project: Research