Solr Dictionary Annotator: SODA

Sujit Pal (Developer)

Research output: Non-textual form › Software

Abstract

The Solr Dictionary Annotator (SoDA) is a Dictionary-based Annotator (or Gazetteer) that supports exact as well as fuzzy lookups across multiple lexicons.

SoDA is backed by a Solr index that holds each entity's names (primary and alternate) together with an identifier for that entity. Each name is stored in the index in several forms, one per stemming algorithm, with the algorithms ranging from light to aggressive. During annotation, the input text is stemmed in the same way and its spans are matched against the correspondingly stemmed entity names in the index. Fast span lookup, based on a finite-state transducer (FST), is provided by the SolrTextTagger project.
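The index-then-match flow described above can be illustrated with a small in-memory sketch. This is not SoDA's code and does not use Solr or SolrTextTagger: the function names (normalize, build_index, annotate), the entity identifiers, and the crude plural-stripping rule that stands in for a real stemmer are all assumptions made for illustration. A plain Python dict replaces the FST, and tokenization is a whitespace split that ignores punctuation.

```python
from collections import defaultdict


def normalize(token):
    # Hypothetical stand-in for a real stemmer: lowercase the token and
    # strip a trailing plural "s" from longer words.
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def build_index(entities):
    # entities maps an entity id to its names (primary and alternate).
    # The index maps each normalized token sequence to the matching ids.
    index = defaultdict(set)
    for entity_id, names in entities.items():
        for name in names:
            key = tuple(normalize(t) for t in name.split())
            index[key].add(entity_id)
    return index


def annotate(text, index, max_len=5):
    # Scan left to right, preferring the longest span that matches.
    tokens = text.split()
    norm = [normalize(t) for t in tokens]
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(norm[i:i + n])
            if key in index:
                span_text = " ".join(tokens[i:i + n])
                results.append((i, i + n, span_text, sorted(index[key])))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return results


# Made-up entity ids and names, for illustration only.
entities = {
    "D001": ["heart attack", "myocardial infarction"],
    "D002": ["aspirin"],
}
index = build_index(entities)
annotations = annotate("Patient had two Heart Attacks despite aspirin", index)
```

Here "Heart Attacks" matches the entry "heart attack" because both sides pass through the same normalization before comparison, which is the core idea of stemmed lookup. Each result is a tuple of start token offset, end token offset, matched text, and entity ids.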

SoDA supports multiple dictionaries (lexicons) within the same Solr index. The matching modes currently supported are exact, lower (case-insensitive), stop (English stop words removed), and three levels of stemming (stem1, stem2, stem3), implemented using Solr's Minimal English Stemmer, KStem Stemmer, and Porter Stemmer, respectively.
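As a rough illustration of how the first three matching modes differ, the sketch below applies each one to an entity name and to a query string. It is an assumption-laden toy, not SoDA's implementation: the stop word list is a small made-up set rather than Solr's, the stop mode is assumed to lowercase before removing stop words, and the three stemming modes are left out because they depend on the actual Solr stemmers.

```python
# Hypothetical, abbreviated stop word list (not Solr's actual list).
STOPWORDS = {"a", "an", "and", "of", "the", "in", "on", "for", "to"}


def exact(name):
    # exact: the string is compared unchanged.
    return name


def lower(name):
    # lower: case-insensitive comparison.
    return name.lower()


def stop(name):
    # stop: assumed here to lowercase first, then drop stop words.
    return " ".join(t for t in name.lower().split() if t not in STOPWORDS)


MODES = {"exact": exact, "lower": lower, "stop": stop}


def index_forms(name):
    # One stored form per matching mode, as the index would hold them.
    return {mode: fn(name) for mode, fn in MODES.items()}


def matches(query, name, mode):
    # A query matches when both sides agree after the same normalization.
    fn = MODES[mode]
    return fn(query) == fn(name)


forms = index_forms("Cancer of the Lung")
```

In this sketch, the query "cancer of lung" fails to match "Cancer of the Lung" under exact and lower but succeeds under stop, since both reduce to "cancer lung". Looser modes trade precision for recall, which is why a caller would choose the mode per lookup.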
Original language: American English
Media of output: Online
Publication status: Published - Jan 10 2018
