TY - JOUR
T1 - Unsupervised feature selection and class labeling for credit card fraud
AU - Kennedy, Robert K.L.
AU - Villanustre, Flavio
AU - Khoshgoftaar, Taghi M.
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
N2 - Large datasets frequently lack class labels, and obtaining labeled data often involves substantial financial and time costs, along with risks of label noise and inaccuracies due to manual annotation. In fraud detection, such as credit card fraud, these challenges are compounded by privacy concerns and high class imbalance, which severely degrades the classification performance of machine learning models. In this paper, we present a fully unsupervised approach that combines SHapley Additive exPlanations (SHAP) for feature selection with an autoencoder-based method for generating class labels for a widely used credit card fraud detection dataset. Using this publicly available dataset, we construct differently sized datasets using feature selection, generate class labels, and measure the quality and efficacy of the labels. We evaluate the labels by training different types of supervised classifiers on the newly generated labels and measuring their Area Under the Precision-Recall Curve (AUPRC). Empirical results show that SHAP feature selection consistently and significantly improves the quality and usability of the generated class labels, as measured by the AUPRC of classifiers trained on them. Results also show that the generated labels, both with and without a feature selection preprocessing step, outperform Isolation Forest (IF), an unsupervised anomaly detection method used as a baseline. This demonstrates that SHAP-based feature ranking and selection significantly improves generated class label quality for credit card fraud detection and is a promising strategy for handling large, imbalanced, and unlabeled fraud detection datasets.
KW - Credit card fraud detection
KW - Feature selection
KW - Label generation
KW - SHAP
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=105004359928&partnerID=8YFLogxK
U2 - 10.1186/s40537-025-01154-1
DO - 10.1186/s40537-025-01154-1
M3 - Article
AN - SCOPUS:105004359928
SN - 2196-1115
VL - 12
JO - Journal of Big Data
JF - Journal of Big Data
IS - 1
M1 - 111
ER -