Multi-Tier System for Detecting Fraud with AI and Machine Learning

In this tutorial, we will demonstrate how to build a fraud detection framework that can detect known and unknown methods of transaction fraud. By stacking multiple approaches one on top of the other, you can erect multiple lines of defense to surface and prevent a wide variety of fraud.

Fraud is a challenging problem for the following reasons:

  1. Examples of known fraud are rare. This creates a common machine-learning problem called class imbalance, which can adversely affect the performance of your models.
  2. Fraud is nonstationary. The underlying dataset can shift, making existing detection techniques obsolete. (Fraud evolves as fraudsters engage in a cat-and-mouse game with institutions and regulators.)

We attempt to solve these problems by leveraging a system of filters.


Gate 1: Simple Rules Engine - We will show example rules and explain why rules engines are still important.

Gate 2: Supervised Learning of Known Attacks - A run-through of training, deploying, and querying a multilayer perceptron using the Skymind Intelligence Layer (SKIL).

Gate 3: Unsupervised Learning of Unknown Attacks - Demonstrating how an autoencoder can be used to detect fraud by measuring differences between normal and fraudulent transactions.

Let's start by outlining the requirements for this tutorial.


  1. Install the Skymind Intelligence Layer (SKIL). We will use this tool to build our fraud detection framework.
  2. Download this dataset (182 MB). PaySim is a synthetically created dataset of over 6,000,000 mobile money transactions. While PaySim offers only a simplified representation of the real world, the dataset provides a clean and properly labeled training set to build our model.
  3. Download the code for this tutorial.

The Gates of Minas Tirith

Gate 1: Simple Rules Engine (Expert Systems)

This is good, old-fashioned "expert systems" AI: a rules engine created manually and derived from experience and a priori knowledge. Rules are a natural place to start and a convenient way to encode institutional knowledge, because they are easy to implement and can capture well-understood, obvious signs of fraud.

Don't dismiss a rules engine as too basic for your business. Rules are necessary because they provide an objective baseline of performance. Without a baseline, you cannot objectively measure and improve the performance of machine learning algorithms.

Here are some made-up example rules:

  1. If a transaction occurs more than 500 miles from a user's home address, then flag for human review.
  2. If the transaction amount is greater than two standard deviations from a user's mean transaction size, then flag for human review.
  3. If the number of transactions from a single account exceeds 3 within 60 seconds, then flag for human review.

You should only create rules for events that are known to be highly suspicious. Otherwise, your system will be inundated with false alarms. You can also whitelist certain transactions if you know them to be safe despite their rarity. This, too, can be a time saver.
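A minimal sketch of such a rules engine, covering the three example rules plus a whitelist. All field names here (distance_from_home_miles, user_mean, and so on) are hypothetical illustrations, not the PaySim schema:

```python
# Hypothetical rules engine. Each rule maps directly to one of the
# example rules above; whitelisted transactions skip all rules.
def flag_for_review(tx, whitelist=frozenset()):
    """Return the names of any rules triggered by one transaction dict."""
    if tx["id"] in whitelist:
        return []  # known-safe transaction, skip all rules
    triggered = []
    if tx["distance_from_home_miles"] > 500:
        triggered.append("far_from_home")
    if abs(tx["amount"] - tx["user_mean"]) > 2 * tx["user_std"]:
        triggered.append("unusual_amount")
    if tx["recent_tx_count_60s"] > 3:
        triggered.append("high_velocity")
    return triggered

tx = {"id": "t1", "distance_from_home_miles": 12, "amount": 950.0,
      "user_mean": 100.0, "user_std": 40.0, "recent_tx_count_60s": 1}
print(flag_for_review(tx))  # ['unusual_amount']
```

In practice each triggered rule would route the transaction to a human review queue rather than just returning a name.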


Pros:

  • Leverages existing knowledge.
  • Easy to implement and understand.


Cons:

  • Lots of false negatives.
  • Rules may become obsolete.
  • Provides only yes/no answers without any measure of uncertainty.
  • Hard to maintain and understand as the number of rules grows over time.

Gate 2: Supervised Learning of Known Fraud

If you have labeled cases of fraud, this is a good starting point for supervised deep learning. At this second gate, we use a neural network called a Multilayer Perceptron (MLP) to classify a transaction as fraud or not fraud.


Training the Multilayer Perceptron

Please locate the file named Fraud MLP.json inside the GitHub repository. This file contains an importable Zeppelin notebook.

Inside SKIL, follow the instructions here to create an "Experiment" and "upload an existing notebook" using Fraud MLP.json. This will import the code used for training the MLP as a Zeppelin notebook.


Now open the Zeppelin notebook and click "run." It should take 5 minutes or less for all processes to complete.

You should get the following result after the network is done training.

('Test loss:', 0.366028719997406)
('Test accuracy:', 0.9547)

The accuracy is pretty good. It means that using our training dataset (which, again, we acknowledge to be overly simplified and not reflective of real life), we are able to correctly identify a transaction as fraud or not fraud 95% of the time.


Looking at the confusion matrix, we can see 5,408 false positives and 112 false negatives. These numbers can be adjusted up or down by changing the threshold at which the system flags a transaction. I have set my threshold to flag items with a "loss" greater than 0.35 (a higher "loss" means the transaction is more likely to be fraudulent).
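To make that trade-off concrete, here is a sketch of how moving the threshold shifts false positives against false negatives. The scores and labels below are made-up illustrations, not values from the PaySim run:

```python
# Count false positives (flagged but legitimate) and false negatives
# (missed fraud) at a given flagging threshold.
def confusion_counts(scores, labels, threshold):
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    return fp, fn

scores = [0.10, 0.20, 0.34, 0.36, 0.40, 0.70]  # hypothetical model scores
labels = [0,    0,    1,    0,    1,    1]     # 1 = fraud, 0 = legitimate

print(confusion_counts(scores, labels, 0.35))  # (1, 1)
print(confusion_counts(scores, labels, 0.50))  # (0, 2): stricter threshold
```

Raising the threshold trades false positives for false negatives; the right balance depends on the relative cost of reviewing a legitimate transaction versus missing a fraudulent one.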

A copy of this model is automatically saved in your SKIL Workspace.

Now let's deploy this model. Navigate to the "Deployments" tab and create a new deployment.

Go back to Workspaces, open the model in your Experiments, and click the "deploy" button to finish.

Your model has been deployed!

Query the MLP Model to Get Predictions

Using the Skymind Intelligence Layer (SKIL), we will expose our newly trained model via REST API. This enables you to integrate real-time predictions into your existing applications using a standard technology.


  1. Obtain the JAR file (instructions here).
  2. Start and query the REST Endpoint to get predictions.

Navigate to your deployment and click the "start" button. Starting the endpoint may take a couple of minutes.

Using the command line, paste the following script to query the model. Don't forget to replace YourEndpointURLHere with the endpoint URL from your SKIL dashboard.

# Query the model
java -jar skil-example-anomaly-detection-1.0.0.jar --feature fraud.csv --endpoint [YourEndpointURLHere]

That will return a JSON object containing a prediction.

# "Probabilities" is an arbitrary score. The higher the score, the more likely the transaction is fraudulent. In this case, it's probably not fraud because the score is below the 0.35 threshold we set earlier.
classification return: {"results":[0],"probabilities":[0.2721463]}

These results can be consumed by your application to score transactions by how fraudulent they appear. You have now created a second layer of defense on top of your existing rules engine to weed out transactions that may have fallen through the cracks.
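As a sketch of that integration step, the JSON response above can be parsed and thresholded in a few lines. The response shape comes from the example output; the 0.35 cutoff mirrors the training threshold:

```python
import json

# Example response, copied from the endpoint output shown above.
response = '{"results":[0],"probabilities":[0.2721463]}'

def score_transaction(raw, threshold=0.35):
    """Parse the endpoint's JSON and decide whether to flag the transaction."""
    payload = json.loads(raw)
    score = payload["probabilities"][0]
    return {"score": score, "flag": score > threshold}

print(score_transaction(response))  # {'score': 0.2721463, 'flag': False}
```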


Pros:

  • Provides a measure of certainty on top of a rules engine.
  • Can be improved as more labeled cases of fraud become available. (Accuracy scales with labeled data.)


Cons:

  • Requires a large dataset of labeled cases of fraud.

Let's move on to the final layer of defense: detecting "unknown unknowns", as Donald Rumsfeld would say.

Gate 3: Unsupervised Learning of Unknown Fraud

Once you have mastered multilayer perceptrons, you are ready to apply more sophisticated detection methods. At this third gate, we will apply an unsupervised model called an autoencoder, with the goal of detecting unknown unknowns.

To put this more concretely, our goal is to discover novel cases of fraud that neither your organization nor your fraud-detection system has seen before. These needles in a haystack are hard to find, because Gates 1 and 2 require that the fraudulent method be known upfront. This is key. Unknown unknowns are increasingly common, because technological innovation, coupled with fraudsters' creativity in evading detection, makes it easier than ever to develop novel ways to commit fraud. You are in an arms race with your adversary.

Let's start by framing this approach with a question: Do fraudulent transactions look different than normal transactions?

Introducing Autoencoders

We can answer this question by using something called an autoencoder.


An autoencoder attempts to learn a representation of input data that can be used to reconstruct that same input. So the model ingests input data and spits out a reconstruction of the same input, and the error is the difference between the actual input and the autoencoder's reconstruction of the same input.

As you can see in the example above, an autoencoder tries to reconstruct the Mario mushroom, and in the process, captures the underlying structure of what a Mario mushroom should look like (i.e., the representation or encoding is a way of compressing data).

This means we can teach an autoencoder to learn what ingredients make up a normal transaction (e.g., the Mario mushroom) and compare that to a fraudulent transaction to see if there are differences without knowing anything about the transaction itself. This is the key to finding unknown unknowns.

Training the Autoencoder

Please find the file named Fraud Autoencoder.json in the GitHub repository. Using the same steps as before, import this notebook into SKIL and run the code.

Evaluating an Autoencoder

The output of an autoencoder is different than the output of other neural networks.

Instead of outputting a classification (e.g., fraud or not_fraud), it outputs a score called "reconstruction error." A reconstruction error of 0 could mean that a particular transaction is perfectly normal. A reconstruction error of, say, 0.5 could mean that a particular transaction is very anomalous and requires further scrutiny.
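A minimal sketch of how that score is computed: the reconstruction error of a transaction is the mean squared difference between the input vector and the autoencoder's reconstruction of it. The vectors below are made-up; in practice the reconstruction comes from the trained model:

```python
import numpy as np

# Reconstruction error = mean squared difference between an input row
# and the autoencoder's reconstruction of that row.
def reconstruction_error(x, x_hat):
    return float(np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2))

normal_tx  = [1.0, 0.5, 0.2]
good_recon = [0.9, 0.5, 0.25]  # close reconstruction -> low error
bad_recon  = [0.1, 1.4, 0.9]   # poor reconstruction -> high error

print(reconstruction_error(normal_tx, good_recon))  # ~0.004
print(reconstruction_error(normal_tx, bad_recon))   # ~0.703
```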

Let's inspect the results to find the answer to our question. Do fraudulent transactions look different than normal transactions?

Not Fraud

Reconstruction Error (Not Fraud)
mean 0.012224
std 0.013587
min 0.000005
25% 0.003067
50% 0.008335
75% 0.017505
max 0.087747

For a normal transaction, the mean reconstruction error is 0.012224. Keep that in mind.


Fraud

Reconstruction Error (Fraud)
mean 0.030405
std 0.029059
min 0.000031
25% 0.003441
50% 0.023584
75% 0.049816
max 0.098196

For a fraudulent transaction, the mean reconstruction error is 0.030405 -- much higher than the mean reconstruction error of 0.012224 for a normal transaction. This is a key insight. We can now confidently say that items with higher reconstruction errors are more likely to be fraudulent.

With these insights, we can rank transactions by how anomalous they are (i.e., high reconstruction error) -- allowing a human analyst to focus their attention on manually inspecting the "high-risk" transactions. Let's set an arbitrary fraud threshold of 0.03 error and see what type of transactions it flags.
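A minimal sketch of that ranking-and-flagging step. The transaction IDs and errors below are hypothetical stand-ins for the autoencoder's per-transaction scores:

```python
# Rank transactions by reconstruction error (most anomalous first) and
# flag everything above the arbitrary 0.03 threshold for analyst review.
txs = [("t1", 0.0981), ("t2", 0.0083), ("t3", 0.0412), ("t4", 0.0121)]

ranked = sorted(txs, key=lambda t: t[1], reverse=True)
high_risk = [tid for tid, err in ranked if err > 0.03]

print(ranked[0])   # ('t1', 0.0981) -- most anomalous transaction first
print(high_risk)   # ['t1', 't3']
```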

Ground Truth Reconstruction Error
1 0.098195916
1 0.08953883
1 0.089078016
1 0.088795841
1 0.088753483
1 0.087913285
0 0.087747353
1 0.08768236
0 0.087635667
0 0.087525821

Interestingly, 7 of the top 10 transactions with the highest reconstruction error are fraudulent.


A full count reveals that with a fraud threshold of 0.03 (around the mean reconstruction error for fraudulent transactions), 50 of the 5,657 flagged transactions (0.88%) are fraudulent.

In contrast, of the remaining 95,387 transactions, 66 (0.07%) are fraudulent, an order of magnitude lower frequency. The goal here is to give human analysts the purest possible subset of transactions likely to be fraud, to maximize the value of their time.


Pros:

  • Works with unlabeled datasets.
  • Able to detect unknown unknown cases of fraud.


Cons:

  • Assumes training data contains very few cases of fraud (if there is a lot of fraud in the training data, the model may learn to treat fraud as "normal").
  • Assumes training data is representative of normal behavior, i.e., that after deployment we won't see brand-new non-fraudulent patterns. If we do, the model will probably flag them as anomalies.

Putting it Together

Gate 3 is not very accurate on its own, so how can we leverage this result? We can use it as a final filter to weed out fraud that may have bypassed Gates 1 and 2. From there, an analyst can confirm whether a transaction is fraudulent, label it, and feed the learnings back into Gates 1 and 2. This lets you incorporate new learnings into the fraud detection system to keep up with shifting purchase behavior.


Now, how can we maintain and improve this system?

Evaluation and Monitoring

A/B Testing - Compare our system against random sampling to see whether its fraud detection rate beats random. If it doesn't, our system is essentially useless.
Auditing - Run the model on random samples of data to confirm that it's catching fraud as expected.
Tracking - Track model performance over time and keep an eye out for big changes in performance. This could indicate that the data is shifting, requiring you to retrain your model.

Advanced Topics

Combining Models - Use the rules to help train the MLP by, e.g., labeling fraud examples with the rules in addition to the hand-labeled examples.
Active Learning - Use principles from active learning to identify cases that should be labeled. For example, choose examples with borderline scores (~0.5) from the MLP classifier and ask a human analyst to label them manually.
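A minimal sketch of that selection step: pick the MLP scores closest to the 0.5 decision boundary, since those are the examples the classifier is least sure about. IDs and scores here are hypothetical:

```python
# Uncertainty sampling: return the k scored transactions whose MLP score
# is closest to the 0.5 decision boundary, for manual labeling.
def most_uncertain(scored, k=2):
    """scored: list of (transaction_id, score) pairs."""
    return sorted(scored, key=lambda t: abs(t[1] - 0.5))[:k]

scored = [("t1", 0.02), ("t2", 0.48), ("t3", 0.97), ("t4", 0.55)]
print(most_uncertain(scored))  # [('t2', 0.48), ('t4', 0.55)]
```

Labeling these borderline cases first gives the model the most informative new training data per analyst-hour.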


In conclusion, I'd like to highlight two important takeaways from our multi-tier fraud detection framework.

The first is the importance of ensembles. Despite the hype, AI is not the be-all and end-all. A traditional rules engine, in which institutional knowledge can be hard coded, may detect well-established cases of fraud better than an AI-only system, whose performance is tied to the quality of the data. It would be wise to leverage the best of every approach.

The second is the challenge of behavioral shift. Due to rapid technological change and the growth of big data, the occurrence of unknown unknowns has become an especially tricky issue. Building a framework that can continually monitor, audit, scale, and adapt is crucial to keeping the system effective.