Annotating patient clinical records with syntactic chunks and named entities: the Harvey Corpus

Aleksandar Savkov, John Carroll, Rob Koeling, Jackie Cassell

Research output: Contribution to journal > Article > peer-review

31 Scopus citations

Abstract

The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process because they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.

Original language: English
Pages (from-to): 523-548
Number of pages: 26
Journal: Language Resources and Evaluation
Volume: 50
Issue number: 3
DOIs
State: Published - Sep 1 2016
Externally published: Yes

Keywords

  • Annotation guidelines
  • Chunking
  • Clinical text
  • Corpus annotation
  • Named entities
