Selecting documents relevant for chemistry as a classification problem

Zhemin Zhu, Saber A. Akhondi, Umesh Nandal, Marius Doornenbal, Michelle Gregory

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


    We present a first version of a system for selecting chemical publications for inclusion in a chemistry information database. This database, Reaxys, is a portal for retrieving structured chemistry information from published journals and patents. The task poses three challenges: (i) training and input data are highly imbalanced; (ii) high recall (≥95%) is desired; and (iii) the data offered for selection is massive in volume yet incomplete. Our system handles the imbalance with undersampling and achieves relatively high recall by using chemical named entities as features. Experiments on a real-world data set of 15,822 documents show that chemical named entity features boost recall by 8% over the n-gram features widely used in general document classification. To foster research on this challenging topic, part of the data set compiled for this paper is available on request.
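The abstract does not say which undersampling variant the system uses. As a rough illustration only, the sketch below assumes plain random undersampling to a 1:1 class ratio and binary labels (1 = relevant for chemistry, 0 = not relevant). The function names `undersample` and `recall` are hypothetical and do not come from the paper.

```python
import random


def undersample(docs, labels, seed=0):
    """Randomly drop majority-class documents until both classes are equal in size.

    Assumption: binary labels, 1 = relevant, 0 = not relevant.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is kept whole; the other is sampled down.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [docs[i] for i in kept], [labels[i] for i in kept]


def recall(y_true, y_pred):
    """Fraction of truly relevant documents that the classifier selected."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data: 10 relevant documents among 100.
docs = [f"doc-{i}" for i in range(100)]
labels = [1] * 10 + [0] * 90
balanced_docs, balanced_labels = undersample(docs, labels)
```

Only the training set would be balanced this way. Recall is then measured on held-out data that keeps its natural class distribution, since that is the setting in which the ≥95% target matters.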

    Original language: English
    Title of host publication: Knowledge Engineering and Knowledge Management - EKAW 2016 Satellite Events, EKM and Drift-an-LOD, Revised Selected Papers
    Editors: Mari Carmen Suarez-Figueroa, Jun Zhao, Matthew Horridge, Valentina Presutti, Tudor Groza, Mathieu d’Aquin, Paolo Ciancarini, Francesco Poggi
    Publisher: Springer Verlag
    Number of pages: 4
    ISBN (Print): 9783319586939
    State: Published - Jan 1 2017
    Event: 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016 - Bologna, Italy
    Duration: Nov 19 2016 → Nov 23 2016

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 10180 LNAI
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349


    Conference: 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016


    • Document classification
    • Natural language processing
    • Machine learning
    • Cheminformatics

