How to Build Your Own Identity Resolution System

Identity Resolution (IDR) is an important process that helps organizations create a unified view of customers by normalizing, deduplicating, and linking customer data from diverse sources. IDR software can be used for fraud prevention and KYC/AML as well as customer data matching for marketing use cases (e.g. upstream of a Customer Data Platform (CDP)).

A simple one-off deduplication task might be best handled by a data scientist writing their own Python scripts on their own laptop. However, dealing with large volumes of continuously changing data, meeting real-time requirements, or staying compliant with data privacy regulations requires an enterprise-grade solution.

Creating an in-house IDR system can be a gratifying endeavor for the curious software engineer; it is, however, also a complex technical challenge with many potentially expensive pitfalls.

In this article, we will provide a comprehensive guide to the essential features you must consider and highlight the potential pitfalls you must avoid if you decide to develop your own identity resolution system.

Define your requirements

When creating a complex system, it is crucial to define the requirements upfront, before jumping into building the solution. Small changes in the requirements can have a big impact on the design of the solution and the cost of its development.

Data Sources: Identify the types and volume of data sources that need integration and resolution. Consider whether the data should only be read from the data source or if the IDR system should also update the data source.

Accuracy: Determine the level of precision required in identity matching for your use cases. How consistent or clean is the data? Is data normalization and fuzzy matching needed to match the data?

Scalability: Evaluate the potential growth of your data. Two numbers matter: the initial amount of data, and the data changes per day (updates, ingestions, and deletions).

Performance: Define your performance criteria. Consider search response times for different complexities of identity graphs.

Data Freshness: Decide how long the ingestion and processing of new data should take. How much data might need to be ingested in parallel?

Availability: Define your requirements for this central system. Is it a one time batch process or are other processes interacting with the IDR system synchronously?

Data Compliance: Identify which data compliance requirements your IDR system has to fulfill. Examples are the GDPR and CCPA.

Customization: Assess your need for customization and integration with existing systems.

Usability: Do you need a user interface where data scientists or analysts can search and examine individual identity graphs?

Data Science: Do you want to use the identity graphs held in the IDR system to train machine learning models, for example for fraud detection?

Transformation and Normalization

Start by identifying the datasets that need resolution (deduplication and matching). Whether it is names, addresses, phone numbers, or email addresses, ensure these data points are accessible to your dedicated data team and that compliance allows the team to use them.

The first step in identity resolution should be transformation and normalization. This standardizes the collected data, ensuring consistent formatting and structure across all data fields. This preliminary step facilitates a streamlined matching process by transforming inconsistent, disparate data into a uniform schema and increases the quality of the matching.
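A minimal normalization step might look like the following Python sketch. The field names and rules are illustrative assumptions, not a fixed standard; a production system would use dedicated libraries (e.g. `phonenumbers` for country-aware phone parsing):

```python
import re

def normalize_record(record):
    """Normalize a raw customer record into a uniform schema.

    Field names and rules here are illustrative, not a standard.
    """
    out = {}
    # Collapse repeated whitespace and lowercase the name.
    out["name"] = " ".join(record.get("name", "").lower().split())
    # Emails are case-insensitive in practice; trim and lowercase.
    out["email"] = record.get("email", "").strip().lower()
    # Keep digits only; a real system would use a library such as
    # `phonenumbers` for country-aware parsing.
    out["phone"] = re.sub(r"\D", "", record.get("phone", ""))
    return out

raw = {"name": "  Jane   DOE ", "email": " Jane.Doe@Example.COM",
       "phone": "+1 (555) 010-2030"}
normalized = normalize_record(raw)
print(normalized)
# {'name': 'jane doe', 'email': 'jane.doe@example.com', 'phone': '15550102030'}
```

Even this small amount of standardization removes a whole class of trivial mismatches before any fuzzy matching is attempted.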

If you are working with address data, consider normalizing the addresses and enriching the data records with geographic coordinates so that later matching is based on geographic distance, rather than fuzzy matching based on address strings. To do this you will need to build or access an address normalization service as well as a service that knows the geographic coordinates of every address in the relevant territory. 
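Once addresses are enriched with coordinates, matching reduces to a distance check. This sketch uses the standard haversine formula; the 50-metre threshold is an assumption you would tune for your data:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def same_address(coord_a, coord_b, threshold_km=0.05):
    """Treat two geocoded addresses as matching if within ~50 metres
    of each other (an illustrative threshold)."""
    return haversine_km(*coord_a, *coord_b) <= threshold_km

# Two differently written address strings that geocode to nearly the
# same point are treated as the same address.
print(same_address((52.5200, 13.4050), (52.5201, 13.4051)))  # True
```

The advantage over string-based fuzzy matching is that "Main St." and "Main Street" geocode to the same point, so the spelling variation never reaches the matcher.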

Create the matching logic

Establish a matching model or set of rules that will govern the data resolution process. As data often comes with variations and inconsistencies, consider incorporating techniques such as fuzzy matching and geographical matching to achieve accurate linkage. 

There is a large number of fuzzy matching algorithms that can be used in combination, depending on the data. Refining the matching is an iterative process: expect several rounds of tuning the rules, adding more advanced libraries for your use case, and increasing the sample size until you meet your accuracy requirements.
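A toy version of such a rule set, combining an exact match with a fuzzy one, might look like this. The thresholds and rules are invented for illustration, and the standard-library `SequenceMatcher` stands in for purpose-built algorithms such as Levenshtein or Jaro-Winkler:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]; a stdlib stand-in for
    algorithms such as Levenshtein or Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(rec_a, rec_b, name_threshold=0.85):
    """Toy rule set: an exact email match links records outright;
    otherwise require similar names AND identical phone numbers."""
    if rec_a["email"] and rec_a["email"] == rec_b["email"]:
        return True
    return (similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and rec_a["phone"] == rec_b["phone"])

a = {"name": "Jon Smith", "email": "", "phone": "15550102030"}
b = {"name": "John Smith", "email": "", "phone": "15550102030"}
print(is_match(a, b))  # True: near-identical names, same phone
```

Note that comparing every record against every other record is quadratic; at scale you would add a blocking or indexing step so that only plausible candidate pairs are scored.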

Matching itself can become very complicated, especially at scale. Be sure to ask your engineering team to read our guide to the complexities of entity resolution.

Danger!

Engineering manager beware! This is the stage that can easily flatter to deceive. It is quite possible that your lone, brave data scientist can, on their own within a few weeks, develop a perfectly good Python script that will deduplicate a small, static dataset on their own machine. 

Indeed, the matching rate of a home-grown solution can quite plausibly match the matching rate of an enterprise-level identity resolution system. However, while it might sound contradictory, the matching part of identity resolution is arguably the “easy” part. The difficulty in enterprise level identity resolution lies beyond the matching.

Only the bravest should continue reading…

Define the architecture for the production system

Requirements to keep in mind

Your identity resolution system serves as the authoritative source of customer data truth. Any modifications within this system will have a direct bearing on important decision-making processes.

Real-time vs batch

Do you need your source of customer truth to be continuously up to date? Identity resolution is significantly easier if that is not the case, as you can then update the data in batches; the trade-off is that your system will always be somewhat out of date.

If you need to use the resolved customer data in any process such as KYC, fraud detection, or real-time marketing personalisation, then real-time identity resolution is essential. 

Beware of “fake” real-time, where vendors describe frequent batch updates as real-time. Real-time, in a data streaming sense, means that data is ingested into your identity resolution system in the order that it is received, as it is received, without delay. The actual time to process that data may vary depending on the technology from low milliseconds to a second or two, but the point is that each data record is processed individually in turn.

Don’t forget that a real-time system can process batch data, but the opposite is not true. 
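The distinction can be sketched in a few lines of Python: a streaming ingester consumes records one at a time in arrival order, and a "batch" is simply many records fed through that same path. The queue and sentinel here are illustrative stand-ins for a real message broker such as Kafka:

```python
import queue

def run_ingestion(source_queue, process):
    """Consume records one at a time, in arrival order, until a
    sentinel (None) is seen -- the streaming model described above."""
    processed = []
    while True:
        record = source_queue.get()
        if record is None:
            break
        processed.append(process(record))
    return processed

# A "batch" is just many records enqueued at once and fed through
# the same record-at-a-time path; the reverse is not true.
q = queue.Queue()
for rec in [{"id": 1}, {"id": 2}, {"id": 3}]:
    q.put(rec)
q.put(None)
ids = run_ingestion(q, lambda r: r["id"])
print(ids)  # [1, 2, 3]
```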

Security

Given its critical role, robust security measures are imperative. This means considering:

  • managed access control to data at the API level
  • data encryption
  • continuous system monitoring

Incoming data

The identity resolution system may need to handle varying loads of data, depending on the use case. Ensure that your system is able to handle the peak loads of incoming data records for ingestion as well as, simultaneously, peak search query loads. 

Ideally, your system should be designed to handle peak data loads without paying for unused server or system capacity during lulls in demand.

Availability

Remember, your identity resolution system is probably at the heart of a number of business critical processes. Tailor your system to meet your availability targets. If necessary, consider deploying across multiple data centers, each equipped with its own cluster of machines, to preempt downtime in the event of server failures.

Back-up

Back-ups of identity resolution systems are not trivial: if the system is based on real-time data ingestion, the identity graphs it contains are constantly changing. Taking a snapshot in time that can be used to recover your IDR system in the worst case scenario is a significant engineering challenge on its own.

Define the Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) that align with your business needs. Develop a robust backup and restoration process from the beginning as it will have a big impact on the architecture of your solution.

User interface

It is likely that the end-users of your identity resolution system are other software systems, such as fraud detection algorithms or a customer data platform (CDP). Nevertheless, depending on your use case, you may want analysts to be able to directly examine identity graphs themselves.

In this case you will need a graphical user interface (GUI) that is, firstly, able to represent the identity graph data in its native data format: the same data that would be provided via API to another system. Ideally, for ease of analysis and understanding, it should also be able to show a visual identity graph, which visually represents the data records and the connections between them (which we refer to as edges).
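To make the "records plus edges" idea concrete, here is a hypothetical native format for a single identity graph, with a small helper producing the kind of per-edge explanation a GUI would render visually. The field names and rule labels are invented for illustration:

```python
# Hypothetical native format: records plus explicit edges, as an API
# might return it alongside a visual rendering.
identity_graph = {
    "entity_id": "e-42",
    "records": [
        {"id": "r1", "source": "crm", "email": "jane@example.com"},
        {"id": "r2", "source": "webshop", "email": "jane@example.com"},
        {"id": "r3", "source": "webshop", "phone": "15550102030"},
    ],
    "edges": [
        {"from": "r1", "to": "r2", "rule": "EXACT_EMAIL"},
        {"from": "r2", "to": "r3", "rule": "SAME_PHONE"},
    ],
}

def explain_edges(graph):
    """List why each pair of records is connected -- the raw data a
    GUI would render as a visual graph."""
    return [f'{e["from"]} -> {e["to"]} ({e["rule"]})' for e in graph["edges"]]

edge_lines = explain_edges(identity_graph)
print(edge_lines)
# ['r1 -> r2 (EXACT_EMAIL)', 'r2 -> r3 (SAME_PHONE)']
```

Keeping the edges explicit, rather than only storing the merged entity, is what later makes explainability and compliant record-level deletion possible.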

A “nocode” GUI also makes configuration of the identity resolution system easier, and can allow a non-software engineer to make improvements to the matching rules without having to write actual software code. This significantly reduces the dependency on the engineering department, which is usually in high demand.

Documentation

Don’t forget to write comprehensive documentation covering all the above! If the person that developed your company's identity resolution software leaves the company, you need someone else to be able to take over maintenance and development of the software, potentially at short notice. 

Data in and out

Depending on your requirements, create the endpoints to ingest and query the data. For identity resolution systems, GraphQL APIs are quite useful, as responses need only contain the data that is actually requested rather than every available field. This decreases response times, load, and costs.
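As a sketch of that idea, the helper below assembles a GraphQL query that requests only the fields the caller names. The endpoint shape, field names, and schema are all assumptions for illustration, not a real API:

```python
def build_search_query(fields):
    """Build a GraphQL search query requesting only the fields the
    caller needs, keeping responses small. Schema is hypothetical."""
    field_list = "\n      ".join(fields)
    return f"""
query {{
  search(email: "jane@example.com") {{
    entities {{
      {field_list}
    }}
  }}
}}""".strip()

# Request only the entity id and name; no other fields travel over
# the wire, which is the cost and latency benefit described above.
query_text = build_search_query(["entityId", "name"])
print(query_text)
```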

Consider also whether bulk access to data from your identity resolution software is required, i.e. do you want to be able to query your source of customer truth for “all customers aged 20-30 who live in New York State”? That sort of query is not suitable for a record-level API, meaning you will need some kind of SQL-like interface for analytics queries.
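Using SQLite in memory as a minimal stand-in for such an analytics interface, that example query looks like this. The table layout is invented for illustration; a real deployment would sit on a warehouse or lake engine:

```python
import sqlite3

# Minimal stand-in for a SQL-like analytics interface over resolved
# entities; the table layout is invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entities (entity_id TEXT, age INTEGER, state TEXT)")
con.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
    ("e1", 25, "NY"), ("e2", 41, "NY"), ("e3", 28, "CA"), ("e4", 22, "NY"),
])

# The bulk question from the text: customers aged 20-30 in New York.
rows = con.execute(
    "SELECT entity_id FROM entities "
    "WHERE age BETWEEN 20 AND 30 AND state = 'NY'"
).fetchall()
print(rows)  # [('e1',), ('e4',)]
```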

Events

An IDR system is a living process in which the data is constantly changing. It is useful not only to query data but also to make changes available to other systems via events. This means that as soon as an identity or data record is changed or deleted, that information can be pushed to downstream systems; however, it also adds complexity to the architecture of the system.
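The pattern reduces to a simple fan-out, sketched here with callbacks. In production the subscriber list would be a message bus such as Kafka, and the event shapes are assumptions:

```python
def notify(subscribers, event):
    """Fan an identity-change event out to every subscriber; in
    production this would be a message bus such as Kafka."""
    for callback in subscribers:
        callback(event)

received = []
subscribers = [received.append]  # a downstream system listening for changes

# Emit events as soon as an entity changes or a record is deleted.
notify(subscribers, {"type": "ENTITY_UPDATED", "entity_id": "e-42"})
notify(subscribers, {"type": "RECORD_DELETED", "entity_id": "e-42",
                     "record_id": "r3"})
print(len(received))  # 2
```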

Scenarios

When using an identity resolution system to train machine learning models, it is desirable to be able to simulate how your identity graphs would look under different scenarios, e.g. if data source C were not available, or if Rule X were different.

To do this conventionally you may need to run multiple parallel IDR systems, each simulating these different scenarios. Our preferred method, which avoids this system duplication, is something we call the “what if machine”.

Compliance

A system that acts as a source of truth for customer data is sure to be the subject of attention from your organization’s Data Protection Officer (DPO), so involve them in the requirements development from the beginning, or you may waste your time developing a system that cannot be used without incurring significant data privacy regulation fines and reputational damage.

The worst case scenario is that you develop an identity resolution system that is fundamentally non-compliant with data privacy regulations - e.g. one in which it is impossible to delete data. 

The most important compliance related features are:

  • ability to delete individual records in identity graphs
  • automated data deletion based on data source and time
  • data encryption
  • granular access control based on data sources
  • ability to automatically generate Data Subject Access Requests (DSARs)
  • explainability of identity graph assembly (i.e. why Record A and B are connected)
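As an illustration of the second feature, automated deletion based on data source and time, here is a minimal Python sketch. The per-source retention periods are invented for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-source retention policy, in days.
RETENTION_DAYS = {"webshop": 365, "ad_partner": 30}

def expired(record, now):
    """True if the record's source policy says it must be deleted."""
    limit = RETENTION_DAYS.get(record["source"])
    if limit is None:
        return False  # no policy configured for this source: keep
    return now - record["ingested_at"] > timedelta(days=limit)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": "r1", "source": "webshop",
     "ingested_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "r2", "source": "ad_partner",
     "ingested_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]
kept = [r["id"] for r in records if not expired(r, now)]
print(kept)  # ['r1']: the 30-day ad_partner record is purged
```

In a real system the deletion must also propagate into the identity graphs themselves, which is exactly why record-level deletion has to be designed in from the start.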

Deployment

As already mentioned, the identity resolution system will be an integral part of your overall data infrastructure (the so-called “modern data stack”). As data schemas change and matching will need to be tuned over time, the identity resolution system should never be deployed manually. Instead you should use an Infrastructure as Code tool, such as Terraform, to manage the deployments for you.

Many identity resolution systems will constantly have data coming in and data being searched. This needs to be considered before the deployment. Is it acceptable to have downtime, or does the system need to be able to constantly accept incoming data and also respond to queries during the deployment?

Before deploying to production, the identity resolution software system should be tested, so you need a second system with the same configuration as your production system to be able to test changes before going live. This system can also be used to do performance testing and to check if your response time requirements are met under different circumstances.

Run the Identity Resolution system

Now that the IDR system is deployed to production, you have to ensure that it is maintained and stays within the defined SLAs.

Based on your requirements, this means conducting failover tests within local clusters, and possibly also across geographic clusters. It also means you have to set up (or use an existing) operations team to deploy updates, solve any issues, and monitor the system. In many cases this also means creating an on-call rotation for out-of-business hours.

As no system is perfect, data can be damaged or lost. This is why you defined your RTO and RPO earlier; now it is time to prove that you can restore the system within these objectives. As this information is relevant to audits, you should document the restores you perform and store the logs for future reference.

Identity resolution systems often handle personal information. This means that you have to comply with strict regulations. Most time is usually spent on data subject access and data deletion requests. As these requests must be answered within a defined time frame, you should either automate this process or assign a team to handle it.

Cost Considerations

Factor in the costs of identity resolution system development, encompassing logic development by a data science team, system creation by a development team, hosting for servers/containers, and maintenance resources. Allocate resources for backup systems and redundancy in data centers.

Depending on the design of the solution, costs will increase over time as data volumes grow, and system performance rarely scales linearly with cost.

To give you the idea of the development time required to develop an enterprise level identity resolution system, Tilores was originally developed within a mid-sized European consumer credit bureau. They were already very experienced in identity resolution technologies, and had built several previous systems before building Tilores, which ultimately took the equivalent of one development team (four engineers, one product manager, one QA engineer) approximately three years to develop. 

Conclusion

Building an identity resolution system is an ongoing journey - it will never be “finished”. You must continuously assess its performance, gather user feedback, and adapt to evolving business needs to drive enhancements and maintain its value and functionality.

The path to constructing your own entity/identity resolution system is both intricate and rewarding. By following this structured approach and remaining mindful of potential challenges, you'll be better equipped to craft a resilient identity resolution system tailored to your organization's unique requirements. 

Remember, an identity resolution system transcends a mere technical tool—it serves as a cornerstone for data-driven decisions across your entire enterprise.
