An Open Access Corpus of Scientific, Technical, and Medical Content

Dataset

Description

Natural Language Processing (NLP) tools perform best when they are applied to the same kind of content on which they were trained and tested. Unfortunately for those in the STM domains, our content differs substantially from the newswire text commonly used to develop NLP tools. There are some corpora of STM content, but the ones we know of are specific to one domain, such as biomedicine, and typically consist of abstracts rather than full articles. This is less than optimal because math articles are very different from biomedical articles, and full articles are very different from abstracts.

To improve this situation, Elsevier is providing a selection of articles from 10 different STM domains as a freely redistributable corpus. The articles were selected from our Open Access content and carry a Creative Commons CC-BY license, so they are free to redistribute and use. The domains are agriculture, astronomy, biology, chemistry, computer science, earth science, engineering, materials science, math, and medicine. Currently we provide 11 articles in each of the 10 domains. (We also provide instructions on how to find all of our Open Access CC-BY content.)

For each article in the corpus we provide:

the XML source,
a simple text version for easier text mining,
several versions with different annotations. These include part-of-speech tags, sentence breaks, NP and VP chunks, lemmas, syntactic constituent parses, Wikipedia concept identification, and discourse analysis.

Most of the annotations are created automatically. However, we have identified 10 documents as a default test set. As new annotation types are added, those articles should be the first choice for manually reviewed and corrected test data.
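
For readers working with a local copy, pairing each article's XML source with its simple text version could be sketched as follows. This is only an illustration: the directory layout, the `.xml` and `.txt` extensions, and the idea that both versions share a file name stem are assumptions, not the corpus's documented structure, and the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def pair_article_files(corpus_dir):
    """Group .xml and .txt files by article identifier (the file name stem).

    Assumption: an article's XML source and text version share a stem,
    e.g. `article1.xml` and `article1.txt` (hypothetical names).
    """
    by_article = defaultdict(dict)
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file() and path.suffix in {".xml", ".txt"}:
            # Key each file by its extension without the leading dot.
            by_article[path.stem][path.suffix.lstrip(".")] = path
    return dict(by_article)


def complete_articles(pairs):
    """Return the sorted identifiers that have both an XML and a text version."""
    return sorted(
        stem for stem, files in pairs.items()
        if {"xml", "txt"}.issubset(files)
    )
```

Calling `complete_articles(pair_article_files("path/to/corpus"))` on a local checkout would list the articles for which both versions were found under these assumptions.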
Date made available: 2015
Publisher: GitHub
Date of data production: 2015

Cite this

Daniel, R. (Creator), Groth, P. (Creator), Scerri, A. (Creator), Harper, C. A. (Creator), Vandenbussche, P. (Creator), Cox, J. (Creator) (2015). An Open Access Corpus of Scientific, Technical, and Medical Content. GitHub.