To deepen my understanding of financial fraud detection, I built a machine learning pipeline on a synthetic FPS transaction dataset as a personal project. My goal was to simulate the workflow of a real fraud detection system, from raw data to interpretable results, while documenting key insights along the way.

My analysis revealed the following:

  1. Fraud cases are extremely rare (<0.2%) and mostly occur in TRANSFER and CASH_OUT operations.
  2. Fraudulent transactions often involve empty origin accounts or suspicious balance shifts.
  3. Basic rule-based detection fails on its own: behavioral features and ML models significantly improve detection.
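
To make the first two findings concrete, here is a minimal sketch of the kind of check involved: the fraud rate per transaction type, plus a naive flag for empty or drained origin accounts. The field names (`type`, `amount`, `oldbalanceOrg`, `newbalanceOrig`, `isFraud`) are assumptions about the dataset schema, and the rows are made-up toy values, not results from the project.

```python
from collections import defaultdict

# Toy rows for illustration only; field names are assumed, values are invented.
transactions = [
    {"type": "PAYMENT",  "amount": 120.0, "oldbalanceOrg": 500.0, "newbalanceOrig": 380.0, "isFraud": 0},
    {"type": "TRANSFER", "amount": 900.0, "oldbalanceOrg": 900.0, "newbalanceOrig": 0.0,   "isFraud": 1},
    {"type": "CASH_OUT", "amount": 300.0, "oldbalanceOrg": 0.0,   "newbalanceOrig": 0.0,   "isFraud": 1},
    {"type": "CASH_OUT", "amount": 50.0,  "oldbalanceOrg": 200.0, "newbalanceOrig": 150.0, "isFraud": 0},
    {"type": "CASH_IN",  "amount": 75.0,  "oldbalanceOrg": 10.0,  "newbalanceOrig": 85.0,  "isFraud": 0},
]


def fraud_rate_by_type(rows):
    """Return the share of fraudulent transactions for each transaction type."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for row in rows:
        totals[row["type"]] += 1
        frauds[row["type"]] += row["isFraud"]
    return {t: frauds[t] / totals[t] for t in totals}


def is_suspicious(row):
    """Naive rule: the origin account was empty, or it was drained to zero."""
    empty_origin = row["oldbalanceOrg"] == 0 and row["amount"] > 0
    drained = row["oldbalanceOrg"] > 0 and row["newbalanceOrig"] == 0
    return empty_origin or drained


rates = fraud_rate_by_type(transactions)
flagged = [is_suspicious(r) for r in transactions]
```

A rule like `is_suspicious` is easy to write but coarse, which is why the later steps lean on engineered features and trained models rather than hand-written thresholds.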

Here are the steps I followed to complete this project:

  1. Performed EDA to understand transaction types and patterns of fraud.
  2. Engineered features based on balance changes, ratios, and transaction behavior.
  3. Handled class imbalance using SMOTETomek and class weighting techniques.
  4. Trained and evaluated three models: Logistic Regression, Random Forest, and XGBoost.
  5. Chose Random Forest for its balance of performance and interpretability.
  6. Structured the project for reproducibility and future modular expansion.
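
Steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the feature names and the `engineer_features` and `balanced_class_weights` helpers are hypothetical, the input fields are the same assumed schema (`amount`, `oldbalanceOrg`, `newbalanceOrig`), and the weighting shown is the common "balanced" heuristic, n_samples / (n_classes * class_count).

```python
from collections import Counter


def engineer_features(row):
    """Derive balance-change and ratio features from one transaction (hypothetical names)."""
    amount = row["amount"]
    old_org = row["oldbalanceOrg"]
    new_org = row["newbalanceOrig"]
    origin_delta = old_org - new_org
    return {
        # How much the origin balance actually dropped.
        "origin_delta": origin_delta,
        # Mismatch between the balance drop and the stated amount.
        "origin_error": origin_delta - amount,
        # Share of the origin balance being moved (0.0 when the account was empty).
        "amount_to_balance": amount / old_org if old_org > 0 else 0.0,
        # Indicator for transactions leaving an empty origin account.
        "origin_was_empty": int(old_org == 0),
    }


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy inputs for illustration only.
row = {"amount": 900.0, "oldbalanceOrg": 900.0, "newbalanceOrig": 0.0}
features = engineer_features(row)

labels = [0] * 8 + [1] * 2  # 8 legitimate, 2 fraudulent
weights = balanced_class_weights(labels)
```

A dictionary like `weights` can be passed as the `class_weight` argument of scikit-learn classifiers such as `RandomForestClassifier`. Resampling with SMOTETomek is a separate technique that changes the training set itself instead of reweighting it.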

This project is part of an ongoing series where I share short posts explaining each stage of the process, the challenges I encounter, and how data can help us uncover hidden risk.

🔗 Follow the project

📂 GitHub Repo