A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts

Sergei Egorov, Anton Yuryev, Nikolai Daraselia

Research output: Contribution to journal › Article › peer-review

20 Scopus citations

Abstract

Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.

Design: The system, called ProtScan, combines a carefully constructed dictionary of mammalian proteins with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by an intensive expert curation step.

Measurements: The recall and precision of the system were determined using 1,000 randomly selected and hand-tagged Medline abstracts.

Results: The system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall. It was also found to process approximately 300 abstracts per second. Without the NOS annotation, precision and recall were 98.5% and 84%, respectively.

Conclusion: The system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.
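The general idea of dictionary-based tagging and its evaluation can be illustrated with a minimal Python sketch. This is not ProtScan's implementation: the abstract does not describe the tokenization algorithm or the dictionary format, so the tokenizer (splitting letter runs from digit runs), the greedy longest-match lookup, the toy dictionary entries, and the names `tokenize`, `tag_proteins`, and `precision_recall` are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary (not from ProtScan): each key is a tuple of
# normalized tokens, each value is an illustrative canonical identifier.
TOY_DICTIONARY = {
    ("il", "2"): "IL2",
    ("interleukin", "2"): "IL2",
    ("p", "53"): "TP53",
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
}

# Assumed tokenization rule: letter runs and digit runs are separate tokens,
# so "IL-2", "IL2" and "IL 2" all normalize to ("il", "2").
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


def tag_proteins(text, dictionary=TOY_DICTIONARY):
    """Greedy longest-match lookup of token n-grams in the dictionary."""
    tokens = tokenize(text)
    max_len = max(len(key) for key in dictionary)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in dictionary:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append((text[start:end], dictionary[key]))
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return hits


def precision_recall(predicted, gold):
    """Precision and recall of predicted items against hand-tagged items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(tag_proteins("IL-2 and interleukin 2 induce p53, not TNF-alpha."))
```

In this toy run "TNF-alpha" is not tagged because only the long form is in the dictionary. That kind of miss lowers recall without affecting precision, which is why synonym coverage matters for a dictionary-based approach.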

Original language: English
Pages (from-to): 174-178
Number of pages: 5
Journal: Journal of the American Medical Informatics Association: JAMIA
Volume: 11
Issue number: 3
State: Published - 2004
Externally published: Yes
