A practical guide to how entity resolution improves machine learning for fraud detection
Introduction
Online fraud is an ever-growing issue for finance, e-commerce and other related industries. In response to this threat, organizations use fraud detection mechanisms based on machine learning and behavioral analytics. These technologies enable the detection of unusual patterns, abnormal behaviors, and fraudulent activities in real time.
Unfortunately, often only the current transaction, e.g. an order, is taken into consideration, or the process is based solely on historic data from the customer’s profile, which is identified by a customer ID. However, professional fraudsters may create customer profiles using low-value transactions to build up a positive image of their profile. Additionally, they might create multiple similar profiles at the same time. Only after the fraud has taken place does the attacked company realize that these customer profiles were related to each other.
Using entity resolution, it is possible to easily combine different customer profiles into a single 360° customer view, allowing one to see the full picture of all historic transactions. While using this data in machine learning, e.g. with a neural network or even a simple linear regression, would already add value to the resulting model, the real value arises from also looking at how the individual transactions are connected to each other. This is where graph neural networks (GNNs) come into play. Besides features extracted from the transactional records, they also make it possible to use features generated from the graph edges (how transactions are linked with each other) or even just the general layout of the entity graph.
Example Data
Before we dive deeper into the details, I have one disclaimer to put here: I am a developer and entity resolution expert and not a data scientist or ML expert. While I think the general approach is correct, I might not be following best practices, nor can I explain certain aspects such as the number of hidden nodes. Use this article as an inspiration and draw upon your own experience when it comes to the GNN layout or configuration.
For the purposes of this article, I want to focus on the insights gained from the entity graph’s layout. To that end, I created a small Golang script that generates entities. Each entity is labeled as either fraudulent or non-fraudulent and consists of records (orders) and edges (how those orders are linked). See the following example of a single entity:
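For illustration, a single generated entity in the JSON-lines file might look like the following. The exact field names here are my assumption, chosen to match the record and edge features described below; the actual files in the repository may use different names:

```json
{
  "fraud": 0,
  "records": [
    {"total_value": 85, "item_count": 2},
    {"total_value": 31, "item_count": 1},
    {"total_value": 112, "item_count": 4}
  ],
  "edges": [
    {"a": 0, "b": 1, "r1": 1, "r2": 0},
    {"a": 0, "b": 2, "r1": 0, "r2": 1}
  ]
}
```

Here all edges fan out from the first record, i.e. a star-like layout, matching the non-fraudulent label (fraud=0).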
Each record has two (potential) features: the total value and the number of items purchased. However, the generation script completely randomizes these values, so they should not provide value when it comes to guessing the fraud label. Each edge also comes with two features, R1 and R2. These could e.g. represent whether the two records A and B are linked via a similar name and address (R1) or via a similar email address (R2). Furthermore, I intentionally left out all the attributes that are not relevant for this example (name, address, email, phone number, etc.) but are usually relevant for the entity resolution process beforehand. As R1 and R2 are also randomized, they don’t provide value for the GNN either. However, based on the fraud label, the edges are laid out in two possible ways: a star-like layout (fraud=0) or a random layout (fraud=1).
The idea is that a non-fraudulent customer is more likely to provide accurate matching-relevant data, usually the same address and same name, with only a few spelling errors here and there. Hence new transactions may be recognized as duplicates.
A fraudulent customer might want to hide the fact that they are still the same person behind the computer, using various names and addresses. Entity resolution tools may still recognize the similarity (e.g. geographical and temporal similarity, recurring patterns in the email address, device IDs, etc.), but the resulting entity graph will look more complex.
To make it a little less trivial, the generation script also has a 5% error rate: some entities are labeled as fraudulent despite having a star-like layout, and some as non-fraudulent despite having a random layout. There are also cases where the data is insufficient to determine the actual layout (e.g. only one or two records).
In reality you most likely would gain valuable insights from all three kinds of features (record attributes, edge attributes and edge layout). The following code examples will consider this, but the generated data does not.
Creating the Dataset
The example uses Python (except for the data generation) and DGL with a PyTorch backend. You can find the full Jupyter notebook, the data and the generation script on GitHub.
Let’s start with importing the dataset:
This processes the entities file, a JSON-lines file where each row represents a single entity. While iterating over each entity, it generates the edge features (a long tensor with shape [e, 2], where e is the number of edges) and the node features (a long tensor with shape [n, 2], where n is the number of nodes). It then proceeds to build the graph based on a and b (long tensors, each with shape [e, 1]) and assigns the edge and node features to that graph. All resulting graphs are then added to the dataset.
Model Architecture
Now that we have the data ready, we need to think about the architecture of our GNN. This is what I came up with, but it can probably be adjusted much further to the actual needs:
The constructor takes the number of node features, the number of edge features, the number of hidden nodes and the number of labels (classes). It then creates two layers: an NNConv layer, which calculates the hidden nodes based on the edge and node features, and a GraphSAGE layer, which calculates the resulting label based on the hidden nodes.
Training and Testing
Almost there. Next we prepare the data for training and testing.
We split with an 80/20 ratio using random sampling and create a data loader for each of the two splits.
The last step is to initialize the model with our data, run the training and afterwards test the result.
We initialize the model by providing the feature sizes for nodes and edges (both 2 in our case), the number of hidden nodes (64) and the number of labels (2, because it’s either fraud or not). The optimizer is then initialized with a learning rate of 0.01. Afterwards we run a total of 50 training iterations. Once the training is done, we test the results using the test data loader and print the resulting accuracy.
Across various runs, I saw a typical accuracy in the range of 70 to 85%, with a few exceptions going down to around 55%.
Conclusion
Given that the only usable information in our example dataset is how the nodes are connected, the initial results look very promising and suggest that higher accuracy rates would be possible with real-world data and more training.
Obviously, when working with real data, the layout is not as consistent and does not provide such an obvious correlation between the layout and fraudulent behavior. Hence, you should also take the edge and node features into consideration. The key takeaway from this article is that entity resolution provides the ideal data for fraud detection using graph neural networks and should be part of a fraud detection engineer’s arsenal of tools.