
To deepen my understanding of financial fraud detection, I built a machine learning pipeline on a synthetic FPS transaction dataset as a personal project. My goal was to simulate the workflow of a real fraud detection system, from raw data to interpretable results, and to document key insights along the way.
My analysis revealed the following:
- Fraud cases are extremely rare (under 0.2% of transactions) and occur mostly in TRANSFER and CASH_OUT operations.
- Fraudulent transactions often involve empty origin accounts or suspicious balance shifts.
- Basic rule-based detection fails; behavioral features and ML models significantly improve detection.
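
To make these findings concrete, here is a minimal sketch in plain Python of how the fraud share per transaction type and the limits of a single hand-written rule could be checked. It is not the project's actual code: the toy records, the field names (`old_balance_orig`, `new_balance_orig`, `is_fraud`) and the rule itself are assumptions made up for illustration, and a real pipeline would more likely use pandas on the full dataset.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only
# and are not taken from the project's dataset.
transactions = [
    {"type": "PAYMENT",  "amount": 120.0,  "old_balance_orig": 900.0,  "new_balance_orig": 780.0, "is_fraud": 0},
    {"type": "TRANSFER", "amount": 5000.0, "old_balance_orig": 5000.0, "new_balance_orig": 0.0,   "is_fraud": 1},
    {"type": "CASH_OUT", "amount": 300.0,  "old_balance_orig": 0.0,    "new_balance_orig": 0.0,   "is_fraud": 1},
    {"type": "CASH_OUT", "amount": 700.0,  "old_balance_orig": 700.0,  "new_balance_orig": 0.0,   "is_fraud": 0},
    {"type": "CASH_IN",  "amount": 250.0,  "old_balance_orig": 100.0,  "new_balance_orig": 350.0, "is_fraud": 0},
    {"type": "TRANSFER", "amount": 80.0,   "old_balance_orig": 400.0,  "new_balance_orig": 320.0, "is_fraud": 0},
]


def fraud_rate_by_type(rows):
    """Return the share of fraudulent transactions for each transaction type."""
    totals = Counter(r["type"] for r in rows)
    frauds = Counter(r["type"] for r in rows if r["is_fraud"])
    return {t: frauds[t] / totals[t] for t in totals}


def naive_rule(row):
    """Assumed example rule: flag any transaction that drains a funded origin account."""
    return row["old_balance_orig"] > 0 and row["new_balance_orig"] == 0


def precision_recall(rows, rule):
    """Score a boolean rule against the fraud labels."""
    tp = sum(1 for r in rows if rule(r) and r["is_fraud"])
    fp = sum(1 for r in rows if rule(r) and not r["is_fraud"])
    fn = sum(1 for r in rows if not rule(r) and r["is_fraud"])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(fraud_rate_by_type(transactions))
print(precision_recall(transactions, naive_rule))
```

On this toy data the rule catches one of the two fraud cases and raises one false alarm on a legitimate cash-out, which is the kind of gap that motivates adding behavioral features and a trained model.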
Here are the steps I followed to complete this project:
- Performed EDA to understand transaction types and patterns of fraud.
- Engineered features based on balance changes, ratios, and transaction behavior.
- Handled class imbalance using SMOTETomek and class weighting techniques.
- Trained and evaluated three models: Logistic Regression, Random Forest, and XGBoost.
- Chose Random Forest for its balance of performance and interpretability.
- Structured the project for reproducibility and future modular expansion.
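
Two of the steps above, feature engineering and class weighting, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: the field names and the specific features are hypothetical, and the weighting function reproduces the common "balanced" heuristic (`n_samples / (n_classes * class_count)`) rather than whatever settings the project used. SMOTETomek resampling, which comes from the imbalanced-learn library, is not shown here.

```python
from collections import Counter


def engineer_features(row):
    """Derive balance-based features from one transaction record.

    The input field names are hypothetical placeholders.
    """
    old = row["old_balance_orig"]
    new = row["new_balance_orig"]
    amount = row["amount"]
    delta = old - new  # how much actually left the origin account
    return {
        "balance_delta": delta,
        # Share of the starting balance being moved; 0.0 when the account starts empty.
        "amount_to_balance_ratio": amount / old if old > 0 else 0.0,
        "origin_was_empty": int(old == 0),
        "origin_drained": int(old > 0 and new == 0),
        # 1 when the balance change does not match the stated amount.
        "balance_mismatch": int(abs(delta - amount) > 1e-6),
    }


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Usage: one drained-account transfer and a label set with 0.2% fraud.
example = {"amount": 5000.0, "old_balance_orig": 5000.0, "new_balance_orig": 0.0}
print(engineer_features(example))

labels = [0] * 998 + [1] * 2
print(balanced_class_weights(labels))
```

With 2 fraud cases in 1,000 rows, the fraud class receives a weight of 250 while the majority class receives roughly 0.5, so each misclassified fraud case costs the model far more than a misclassified legitimate transaction.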
This project is part of an ongoing series where I share short posts explaining each stage of the process, the challenges I encounter, and how data can help us uncover hidden risk.
🔗 Follow the project
📂 GitHub Repo