Abstract
Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora. Design: The developed system, called ProtScan, uses a carefully constructed dictionary of mammalian proteins together with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by an intensive expert curation step. Measurements: The recall and precision of the system were determined using 1,000 randomly selected and hand-tagged Medline abstracts. Results: The developed system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall. It was also found to process approximately 300 abstracts per second. Without the NOS annotation, precision and recall were 98.5% and 84%, respectively. Conclusion: The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.
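As an illustration only, the general idea of dictionary-based tagging and its evaluation against hand-tagged text can be sketched as below. This is not ProtScan's implementation: the toy dictionary, the simple regex tokenizer, the greedy longest-match strategy, and the function names `tokenize`, `tag_proteins`, and `precision_recall` are all assumptions made for the example.

```python
import re

# Hypothetical toy dictionary; the curated ProtScan dictionaries are not
# reproduced here. Each entry is a tuple of lowercase tokens.
PROTEIN_DICT = {
    ("insulin",),
    ("tumor", "necrosis", "factor", "alpha"),
    ("p53",),
}
MAX_LEN = max(len(entry) for entry in PROTEIN_DICT)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    A stand-in for a specialized tokenizer; hyphens and punctuation are
    treated as separators.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def tag_proteins(text):
    """Return (start, end, name) token spans found by greedy longest match."""
    tokens = tokenize(text)
    spans = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate first so multi-word names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in PROTEIN_DICT:
                match_len = n
                break
        if match_len:
            spans.append((i, i + match_len, " ".join(tokens[i:i + match_len])))
            i += match_len
        else:
            i += 1
    return spans


def precision_recall(predicted, gold):
    """Compute precision and recall over sets of exact-match spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = "Tumor necrosis factor-alpha induces p53 and insulin signaling."
found = tag_proteins(text)
# Pretend a human annotator also tagged one name missing from the dictionary.
gold = found + [(8, 9, "signaling")]
print(found)
print(precision_recall(found, gold))
```

In this toy run every predicted span is correct but one hand-tagged span is missed, which mirrors the high-precision, lower-recall profile typical of dictionary-based taggers.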
Original language | English |
---|---|
Pages (from-to) | 174-178 |
Number of pages | 5 |
Journal | Journal of the American Medical Informatics Association : JAMIA |
Volume | 11 |
Issue number | 3 |
State | Published - 2004 |
Externally published | Yes |