TY - GEN
T1 - Estimating the F1 Score for Learning from Positive and Unlabeled Examples
AU - Tabatabaei, Seyed Amin
AU - Klein, Jan
AU - Hoogendoorn, Mark
N1 - Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Semi-supervised learning can be applied to datasets that contain both labeled and unlabeled instances, and can result in more accurate predictions than fully supervised or unsupervised learning when only limited labeled data is available. A subclass of problems, called Positive-Unlabeled (PU) learning, focuses on cases in which the labeled instances contain only positive examples. Given the lack of negatively labeled data, estimating overall performance is difficult. In this paper, we propose a new approach to approximate the F1 score for PU learning. It requires an estimate of the fraction of the total number of positive instances that is available in the labeled set. We derive theoretical properties of the approach and apply it to several datasets to study its empirical behavior and to compare it to the most well-known score in the field, the LL score. Results show that even when the estimate deviates considerably from the real fraction of positive labels, the approximation of the F1 score is significantly better than the LL score.
AB - Semi-supervised learning can be applied to datasets that contain both labeled and unlabeled instances, and can result in more accurate predictions than fully supervised or unsupervised learning when only limited labeled data is available. A subclass of problems, called Positive-Unlabeled (PU) learning, focuses on cases in which the labeled instances contain only positive examples. Given the lack of negatively labeled data, estimating overall performance is difficult. In this paper, we propose a new approach to approximate the F1 score for PU learning. It requires an estimate of the fraction of the total number of positive instances that is available in the labeled set. We derive theoretical properties of the approach and apply it to several datasets to study its empirical behavior and to compare it to the most well-known score in the field, the LL score. Results show that even when the estimate deviates considerably from the real fraction of positive labels, the approximation of the F1 score is significantly better than the LL score.
UR - http://www.scopus.com/inward/record.url?scp=85101267594&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-64583-0_15
DO - 10.1007/978-3-030-64583-0_15
M3 - Conference contribution
AN - SCOPUS:85101267594
SN - 9783030645823
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 150
EP - 161
BT - Machine Learning, Optimization, and Data Science - 6th International Conference, LOD 2020, Revised Selected Papers
A2 - Nicosia, Giuseppe
A2 - Ojha, Varun
A2 - La Malfa, Emanuele
A2 - Jansen, Giorgio
A2 - Sciacca, Vincenzo
A2 - Pardalos, Panos
A2 - Giuffrida, Giovanni
A2 - Umeton, Renato
PB - Springer Science and Business Media Deutschland GmbH
T2 - 6th International Conference on Machine Learning, Optimization, and Data Science, LOD 2020
Y2 - 19 July 2020 through 23 July 2020
ER -