Engineering 2024 · 12 min read

How to Build Your Own Identity Resolution System

Hendrik Nehnes
Tilores

Building Tilores took one development team (four engineers, one product manager, one QA engineer) approximately three years. This article shares what we learned about what it actually takes to build a production-grade identity resolution system.

This isn’t a tutorial. It’s a reality check.

What Identity Resolution Actually Requires

At its core, identity resolution answers a simple question: “Do these two records refer to the same real-world person?” The answer is never binary — it’s probabilistic, contextual, and dependent on data quality that varies wildly between source systems.

A production identity resolution system needs to handle at least these components:

  1. Data ingestion — accepting records from multiple sources in different formats
  2. Data transformation — normalizing, cleaning, and enriching records before matching
  3. Matching — comparing records to find duplicates using fuzzy algorithms
  4. Entity assembly — grouping matched records into unified entities
  5. Storage — maintaining the entity graph and source record linkage
  6. Query API — retrieving resolved entities on demand
  7. Real-time updates — handling new records without full re-processing

Each of these is its own engineering challenge. Together, they form a system that is significantly harder to build than it appears.

The Data Transformation Problem

Before you can match records, you need to normalize them. This sounds simple until you encounter real-world data:

  • “Jon” should match “Jonathan” but not “Jonas”
  • “Müller” should match “Mueller” and “Muller”
  • “Hauptstraße 14” should match “Hauptstr. 14” and “Hauptstrasse 14”
  • “+49 30 12345678” should match “030-12345678” and “004930 12345678”
  • “sarah@gmail.com” might be the same person as “sarah@googlemail.com”, but not necessarily

Each attribute type (name, address, phone, email, date of birth) needs its own transformation logic. And the transformations are locale-specific — German name normalization is different from Japanese or Arabic name normalization.
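To make the examples above concrete, here is a minimal sketch of two German-specific normalizers using only the Python standard library. The function names, the transliteration table, and the default country code are assumptions made for illustration, not the Tilores implementation, and a real transformation layer would need far more rules per locale.

```python
import re
import unicodedata

# Hypothetical transliteration table for German; other locales need their own.
GERMAN_TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def normalize_german_name(name: str) -> set:
    """Return the comparable forms of a German surname."""
    lowered = name.strip().lower()
    # Form 1: expand umlauts ("müller" -> "mueller").
    expanded = "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in lowered)
    # Form 2: strip diacritics ("müller" -> "muller").
    decomposed = unicodedata.normalize("NFKD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stripped = stripped.replace("ß", "ss")
    return {expanded, stripped}


def normalize_german_phone(raw: str, default_country: str = "49") -> str:
    """Reduce a phone number to a '+<country><number>' string."""
    stripped = raw.strip()
    digits = re.sub(r"\D", "", stripped)  # keep digits only
    if stripped.startswith("+"):
        return "+" + digits
    if digits.startswith("00"):  # international prefix written as 00
        return "+" + digits[2:]
    if digits.startswith("0"):  # national trunk prefix
        return "+" + default_country + digits[1:]
    return "+" + default_country + digits
```

Two names are treated as candidates when their sets of forms overlap, so “Müller” overlaps with both “Mueller” and “Muller”. The phone normalizer assumes every number without an international prefix is German, which is exactly the kind of locale assumption that makes this layer expensive to get right.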

We spent approximately six months just on the transformation layer.

The Matching Challenge

Matching algorithms need to balance precision (avoiding false positives) with recall (catching true matches). The naive approach — comparing every record against every other record — has O(n²) complexity, which is unusable at scale.

At 150 million records, O(n²) means 22.5 quadrillion comparisons. Even at one million comparisons per second, that would take 713 years.
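The arithmetic behind that figure is easy to reproduce. The snippet below is a back-of-the-envelope check, assuming a year of 365.25 days; it also shows that comparing each unordered pair only once halves the work, which still leaves centuries of compute.

```python
records = 150_000_000
comparisons_per_second = 1_000_000
seconds_per_year = 365.25 * 24 * 3600

# Every record against every record: n^2 comparisons.
all_comparisons = records ** 2
years_all = all_comparisons / comparisons_per_second / seconds_per_year

# Each unordered pair once: n * (n - 1) / 2 comparisons.
unique_pairs = records * (records - 1) // 2
years_unique = unique_pairs / comparisons_per_second / seconds_per_year

print(f"{all_comparisons:,} comparisons, about {years_all:.0f} years")
print(f"{unique_pairs:,} unique pairs, about {years_unique:.0f} years")
```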

Production matching engines use blocking strategies to reduce the comparison space: they group records by shared attributes (e.g., same first three letters of last name, same postal code) and only compare within blocks. Getting the blocking strategy right is critical: too narrow and you miss matches, too broad and performance degrades.
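A minimal sketch of blocking, assuming records are dictionaries with hypothetical `id`, `last_name`, and `postal_code` fields and a single combined blocking key. Only records that share a key are ever paired, so the number of comparisons depends on block sizes rather than on the square of the total record count.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    # Assumed key: first three letters of the last name plus the postal code.
    return (record["last_name"][:3].lower(), record["postal_code"])


def candidate_pairs(records: list) -> set:
    """Return the pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])

    pairs = set()
    for ids in blocks.values():
        # Compare only within a block, each unordered pair once.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "last_name": "Mueller", "postal_code": "10115"},
    {"id": 2, "last_name": "Muell", "postal_code": "10115"},
    {"id": 3, "last_name": "Mueller", "postal_code": "80331"},
    {"id": 4, "last_name": "Schmidt", "postal_code": "10115"},
]
pairs = candidate_pairs(records)
```

Here only records 1 and 2 end up as a candidate pair. Record 3 may well be the same person after a move, which illustrates how a key that is too narrow silently drops true matches.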

Our matching engine uses a multi-pass approach with progressively looser blocking criteria, combined with attribute-specific similarity functions (Jaro-Winkler for names, Levenshtein for addresses, phonetic matching for transliterations).
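As an illustration of one attribute-specific similarity function, here is a plain dynamic-programming Levenshtein distance with a length-normalized score. This is a textbook sketch, not the Tilores matching engine; the `similarity` helper and its normalization are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1], where 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


score = similarity("hauptstr. 14", "hauptstrasse 14")
```

The two address strings are four edits apart, so the raw score is only moderate. That is one reason normalization (expanding “str.” to “strasse”) usually runs before any fuzzy comparison.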

The Real-Time Problem

Batch entity resolution — processing all records at once — is well-understood. Real-time entity resolution — resolving a single new record against all existing entities in milliseconds — is a fundamentally different engineering challenge.

When a new record arrives, the system needs to:

  1. Transform and normalize it (< 1ms)
  2. Identify candidate matching entities using blocking keys (< 2ms)
  3. Score the similarity against all candidate records (< 5ms)
  4. Decide whether to create a new entity or merge into an existing one (< 1ms)
  5. Update the entity graph (< 1ms)

All of this needs to happen in under 10 milliseconds, every time, at any scale. This is where our serverless architecture pays off — compute scales automatically with load, and there’s no connection pooling or thread management to worry about.
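The five steps can be sketched as a single in-memory function. Everything here is an assumption for illustration: the class name, the record fields, the blocking key, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as a stand-in scorer. It says nothing about latency and does not handle merging two existing entities or concurrent writes.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class InMemoryResolver:
    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.index = defaultdict(list)     # blocking key -> [(entity_id, name)]
        self.entities = defaultdict(list)  # entity_id -> source records
        self.next_entity_id = 1

    def resolve(self, record: dict) -> int:
        # 1. Transform and normalize.
        name = record["name"].strip().lower()
        # 2. Find candidates through the blocking key.
        key = (name[:3], record["postal_code"])
        # 3. Score the new record against every candidate.
        best_id, best_score = None, 0.0
        for entity_id, known_name in self.index[key]:
            score = SequenceMatcher(None, name, known_name).ratio()
            if score > best_score:
                best_id, best_score = entity_id, score
        # 4. Decide: join the best entity or create a new one.
        if best_id is None or best_score < self.threshold:
            best_id = self.next_entity_id
            self.next_entity_id += 1
        # 5. Update the index and the entity store.
        self.index[key].append((best_id, name))
        self.entities[best_id].append(record)
        return best_id


resolver = InMemoryResolver()
first = resolver.resolve({"name": "Jonathan Meyer", "postal_code": "10115"})
second = resolver.resolve({"name": "Jonathan Meier", "postal_code": "10115"})
third = resolver.resolve({"name": "Jonas Berg", "postal_code": "10115"})
```

The first two records land in the same entity, while the third shares their blocking key but scores too low and gets its own.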

The Scale Problem

Identity resolution at scale introduces challenges that don’t exist at smaller volumes:

  • Entity chaining — Record A matches B, B matches C, but A doesn’t match C directly. How do you handle transitive matches without creating entity sprawl?
  • Split and merge — New data can reveal that what looked like one entity is actually two people, or vice versa. The system needs to handle both directions.
  • Consistency under concurrent writes — Multiple records for the same entity can arrive simultaneously from different sources.
  • Storage growth — The entity graph grows non-linearly as record volume increases.
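Entity chaining is easiest to see with the simplest possible assembly strategy: treat every match as an edge and take connected components with a union-find structure. The sketch below is that naive baseline, with hypothetical names; it is shown because it exhibits the problem, not because it solves it.

```python
from collections import defaultdict


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps later lookups fast.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def assemble_entities(matched_pairs):
    """Group record ids into entities by transitive closure of matches."""
    uf = UnionFind()
    for a, b in matched_pairs:
        uf.union(a, b)
    groups = defaultdict(set)
    for record_id in list(uf.parent):
        groups[uf.find(record_id)].add(record_id)
    return list(groups.values())


entities = assemble_entities([("A", "B"), ("B", "C"), ("D", "E")])
```

A, B, and C end up in one entity even though A and C were never matched directly. With enough weak links, one bad match can fuse two unrelated people, which is why a production system also needs rules for when a chain should be cut and a way to split entities again.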

What We’d Do Differently

If we started over today, we’d make one key change: start with the query API, not the matching engine. The matching engine is the hardest part technically, but the API is what determines product-market fit. Build a minimal matching engine, ship the API, get real users, and iterate on matching quality based on actual data patterns.

Should You Build It Yourself?

Probably not. Here’s a quick decision framework:

  • Build it yourself if: Entity resolution is your core product (it’s literally what you sell), you have 3+ years and a dedicated team, and your matching requirements are genuinely unique
  • Use Tilores if: Entity resolution is a capability you need but not your core business, you need to go to production in days/weeks instead of years, and you’d rather spend engineering time on your actual product

We spent three years building Tilores so you don’t have to.


Ready to skip the three-year build? Try Tilores free — resolve your first entity in minutes.
