Unsupervised feature selection and class labeling for credit card fraud

Robert K.L. Kennedy, Flavio Villanustre, Taghi M. Khoshgoftaar

Research output: Contribution to journal › Article › peer-review

Abstract

Large datasets frequently lack class labels, and obtaining labeled data often involves substantial financial and time costs, along with risks of label noise and inaccuracies due to manual annotation. In the context of fraud detection, such as credit card fraud, these challenges are compounded by privacy concerns and high class imbalances, which severely degrade the classification performance of machine learning models. In this paper, we present a fully unsupervised approach that combines SHapley Additive exPlanations (SHAP) for feature selection with an autoencoder-based method for generating class labels for a widely used credit card fraud detection dataset. Using this publicly available and well-known dataset, we construct datasets of different sizes using feature selection, generate class labels, and measure the quality and efficacy of the labels. We evaluate the labels by training different types of supervised classifiers on the newly generated labels and measuring their Area Under the Precision-Recall Curve (AUPRC). Empirical results show that using SHAP feature selection consistently and significantly improves the quality and usability of the generated class labels, as measured by the AUPRC performance of classifiers trained on them. Results also show that the generated labels, both with and without a feature selection preprocessing step, outperform Isolation Forest (IF), an unsupervised anomaly detection method used as a baseline. This demonstrates that SHAP-based feature ranking and selection significantly improves generated class label quality for credit card fraud detection and is a promising strategy for handling large, imbalanced, and unlabeled fraud detection datasets.
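The evaluation metric named in the abstract, AUPRC, can be illustrated with a small sketch. The abstract does not say which estimator the authors use; the code below assumes average precision, one common way to summarize the precision-recall curve, and the function name, labels, and scores are hypothetical toy values rather than data from the paper.

```python
def average_precision(y_true, scores):
    """Estimate AUPRC as average precision.

    Assumption: this is one common estimator, not necessarily the one used
    in the paper. Tied scores are not grouped, so results can depend on
    input order when ties occur.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive (fraud) example")

    # Rank examples from the highest anomaly score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive is recovered.
            precision_sum += hits / rank
    return precision_sum / total_pos


# Hypothetical toy data: 2 fraud cases among 8 transactions.
toy_labels = [0, 0, 1, 0, 1, 0, 0, 0]
toy_scores = [0.10, 0.40, 0.90, 0.20, 0.35, 0.30, 0.05, 0.60]
toy_ap = average_precision(toy_labels, toy_scores)
```

On imbalanced data the score of a random ranking is roughly the positive-class prevalence (0.25 in this toy example), which is why precision-recall metrics are often preferred over accuracy for fraud detection.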

Original language: English
Article number: 111
Journal: Journal of Big Data
Volume: 12
Issue number: 1
State: Published - Dec 2025
Externally published: Yes

Keywords

  • Credit card fraud detection
  • Feature selection
  • Label generation
  • SHAP
  • Unsupervised learning
