Why Graph Databases Fail at Entity Resolution (And What to Use Instead)

A conversation between Steven Renwick (CEO, Tilores) and Max Latey (Founder, Pinboard Consulting)

Graph databases can be powerful tools for entity resolution (ER), but they frequently run into performance and complexity roadblocks when scaled up to enterprise data volumes. In a recent webinar, Steven Renwick, CEO of Tilores, sat down with Max Latey, founder of graph technology consultancy Pinboard Consulting, to untangle why graph databases so often fail at ER despite seeming well suited to it, and to share hard-won lessons from real-world implementations.

The result is a remarkably practical conversation for anyone working with messy, fragmented, or siloed data.

What Is a Graph Database, Really?

The graph technology world suffers, as Max Latey puts it, from a "terminology problem." Most people associate the word "graph" with bar charts and line charts — the kind you'd produce in Excel. But in computer science, network science, and mathematics, a graph is something fundamentally different: a structure made up of nodes (objects) and edges (relationships between those objects).

Where a traditional relational database stores data in rows and columns, a graph database stores it as a network. A node might represent a person, a company, a train station, or even a chemical molecule. An edge represents the relationship between two nodes — a financial transaction, a social connection, a train track, or a covalent bond.

As Latey explains: "Graph databases store things as nodes and edges — nodes being objects, either physical or conceptual, and edges being some kind of relationship between those. You could model two people having a conversation as two nodes with the relationship between them being 'having conversation', which could have a duration as an attribute."
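
As a minimal illustration of that node-and-edge model (not tied to any particular graph database product), the conversation example might look like this in Python:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    label: str                       # e.g. "Person"
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    source: str                      # node_id of one endpoint
    target: str                      # node_id of the other endpoint
    relationship: str                # e.g. "HAVING_CONVERSATION"
    properties: dict = field(default_factory=dict)

# Two people as nodes; the conversation is an edge carrying a duration attribute
person_a = Node("p1", "Person", {"name": "Max Latey"})
person_b = Node("p2", "Person", {"name": "Steven Renwick"})
conversation = Edge("p1", "p2", "HAVING_CONVERSATION", {"duration_minutes": 45})
```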

This makes graph databases particularly well-suited to use cases like social network mapping, fraud detection, financial transaction analysis, and knowledge graphs — anywhere where the relationships between things are as important as the things themselves.

Entity Resolution: A Different Problem Entirely

Entity resolution — the process of identifying that different records in one or more datasets refer to the same real-world entity — sounds like exactly the kind of problem graph databases should excel at, since ER deals in relationships and connections within data. Unfortunately, that intuition is precisely what causes so much confusion for teams trying to implement ER solutions.

As Renwick explains: "Entity resolution is the deduplication and linking of record data from one or many sources to identify unique things or entities. Most of the time that's a person or a company, but it could be an object, a place, or a product. It's different from graph databases in that we are trying to establish these distinct entities rather than suggesting relationships between things."

The key distinction is that text-based entity resolution — the most common form — is fundamentally about determining whether two blocks of text refer to the same thing. "Max Lately" and "Max Latey" probably refer to the same person. "Steven Renwick" and "Stephen Renick" might or might not. Resolving this requires fuzzy matching, rules engines, and probabilistic scoring — not graph traversal.
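
As a rough illustration of what that text comparison involves, here is a minimal sketch using Python's standard-library difflib; this is one of many possible similarity measures, and real ER systems combine many such signals through rules and probabilistic scoring rather than relying on a single score:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough 0..1 similarity between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# High score: probably the same person
print(name_similarity("Max Lately", "Max Latey"))
# Lower score: might or might not be the same person; needs more evidence
print(name_similarity("Steven Renwick", "Stephen Renick"))
```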

Where Graph Databases Fall Short for Entity Resolution

Both Renwick and Latey have first-hand experience of organisations attempting to use graph databases for text-based entity resolution — and discovering why it doesn't work.

As Latey has seen: "Trying to do text-based entity resolution on a graph is just terrible. You get a huge combinatoric explosion. You start to get proliferation of nodes, a proliferation of edges. The logic becomes hard because the way you're trying to resolve the entities has nothing to do with the relationships."

Consider a simple example: a customer record with variants like "123 Kings Street", "123 Kingstr", a postcode with a space and one without. In a graph database, each variation spawns new nodes and edges, creating an ever-expanding tangle with no clean way to collapse them into a single identity.
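
For contrast, a rules-based text pipeline would typically normalise those variants into a single comparable key before any matching happens. The sketch below uses illustrative values and deliberately simple rules; note that even this would not catch a variant like "123 Kingstr", which is exactly why production rule sets grow so large:

```python
import re

def address_key(street: str, postcode: str) -> str:
    """Collapse superficial formatting variants into one comparable key."""
    s = street.lower()
    s = re.sub(r"\bstreet\b", "st", s)            # canonicalise a common abbreviation
    s = re.sub(r"[^a-z0-9]", "", s)               # drop spaces and punctuation
    p = re.sub(r"\s+", "", postcode).upper()      # postcode with or without a space
    return f"{s}|{p}"

print(address_key("123 Kings Street", "SW1A 1AA"))   # '123kingsst|SW1A1AA'
print(address_key("123 kings st", "sw1a1aa"))        # same key as above
```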

Renwick adds that this is a pattern Tilores encounters constantly: "People come to us having already tried to do entity resolution with a graph database — usually Neo4j, sometimes Neptune. And they've also tried Elasticsearch, because conceptually it makes sense: I've got a new account, let me search for related accounts. But then you want to find anything related to that one, so you have to jump again, and you end up with a horrible transitive hop problem."
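
The "transitive hop" Renwick describes is essentially a connected-components problem: if record A matches B and B matches C, all three belong to the same entity. A generic way to express that collapse (a sketch, not Tilores' implementation) is a union-find structure:

```python
def resolve_entities(matches: list[tuple[str, str]]) -> dict[str, str]:
    """Collapse pairwise matches into entity groups transitively (union-find)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)              # merge the two groups

    return {record: find(record) for record in list(parent)}

# A matches B and B matches C, so all three collapse into one entity
print(resolve_entities([("A", "B"), ("B", "C"), ("D", "E")]))
```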

Where Graph Databases Excel: Graph Entity Resolution

There is, however, a distinct and legitimate use case where graph databases are genuinely the right tool: graph entity resolution, where the goal is to determine whether two graph nodes represent the same real-world entity based on their relationships, not their text attributes.

Latey gives a compelling example from financial compliance: two individuals — call them John Smith and Kurt Mueller — both have shareholdings in the same three companies, both live in the same country, both bank with the same institution. Their names are completely different, so text matching would never link them. But their relationship patterns are nearly identical, raising a legitimate question: are they actually the same person operating under different identities?

"For graph entity resolution, where you're trying to work out on the basis of relationships which nodes may actually be the same entity, graph databases are great," says Latey. "But for text-based entities — blocks of text that you're trying to determine are similar or not — graph is actually terrible."

The Right Architecture: Text Resolution First, Then Graph

The practical implication is that text-based entity resolution and graph entity resolution are not competing approaches — they are complementary steps in a well-designed data pipeline.

The recommended architecture: resolve your text entities first using a purpose-built tool like Tilores, then load the clean, deduplicated data into your graph database for relationship-based analysis. As Latey summarises: "Entity-resolved graphs will absolutely perform better if you use something like a text rules engine prior to loading the nodes into the graph database, then do graph entity resolution."

Without this step, you load four records of "Max Latey" into your graph as four separate nodes — and all your downstream relationship analysis will be built on a flawed foundation.
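
Expressed as pseudocode, the ordering looks something like the sketch below. The `text_resolver` and `graph_db` interfaces are placeholders standing in for whatever text-ER tool and graph database you use, not real API calls:

```python
def build_entity_resolved_graph(raw_records, text_resolver, graph_db):
    """Sketch of the recommended ordering: text ER first, then load the graph."""
    # Step 1: text-based entity resolution collapses raw records into
    # deduplicated golden entities (e.g. via a purpose-built tool).
    golden_entities = text_resolver.resolve(raw_records)

    # Step 2: load only resolved entities as nodes, so downstream analysis
    # isn't built on four separate "Max Latey" nodes.
    for entity in golden_entities:
        graph_db.add_node(entity.entity_id, properties=entity.attributes)

    # Step 3: add relationships between resolved entities; graph entity
    # resolution can then run on a clean foundation.
    for entity in golden_entities:
        for related_id, relationship in entity.relationships:
            graph_db.add_edge(entity.entity_id, related_id, relationship)
```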

Case Study: 100 Million Records in Last-Mile Logistics

Theory is one thing. The conversation takes a more concrete turn when the two discuss a recent logistics project — a customer Latey brought to Tilores after meeting them at a logistics conference.

The challenge: a last-mile logistics provider dealing with data from dozens of suppliers — eBay, Amazon, Temu, Shein, and others — each sending data in different formats, at different quality levels, and with different conventions. Some mask email addresses. Some truncate names to initials. Some split addresses in unexpected places. Previous attempts to build a single customer view had failed.

"I took one look at it and went: Tilores can smash this," says Latey. "We did a very successful POC. It literally only took a couple of weeks to get their data schema in, do the resolution, set up the models, ingest over 100 million records, and resolve those down to around 30 million unique golden records. The false negatives and false positives were tiny, and they were completely blown away by it."

Renwick notes that the scale of the company's previous failures made the result almost hard to believe: "I think the scale of the challenges they'd had before made them almost not quite believe that it could actually be simple when you use the right tools."

The Problem You Didn't Know You Had

One of the most striking themes in the conversation is how often entity resolution is the underlying cause of data problems that organisations are trying to solve with other tools.

"Entity resolution is probably the most common data science challenge that you don't know you have," says Renwick. "Almost every single company has this problem with siloed, messy data — and everyone's running around trying to build AI on top of it without realising the data is actually pretty crappy. It's like trying to build a fancy modern house on the beach without any foundations."

Latey describes a health insurance company that had a claims-and-remittances matching problem for years, never realising it was an entity resolution use case: "It wasn't until I spoke to them about this scenario that they kind of saw the light and went: oh my God, we've got an entity resolution problem."

The second failure mode is organisations that do recognise the problem, but assume they can solve it themselves — typically using Python libraries with Levenshtein distance, Jaccard similarity, or cosine matching.

Why "We Can Build It Ourselves" Usually Fails

A data scientist with a good Python toolkit can build entity matching that works well on a sample of a few thousand records. The problem is scale.

"It's a quadratic problem," explains Renwick. "You've got to compare every single record to every single other record. You do that over 10,000 records, you've got a significant number of comparisons. You've got 10 million people — the numbers start going a bit crazy. There are techniques like blocking, but you're still making too many compromises. Getting something that works on small data samples into production is just about impossible."

The pair are diplomatic but clear about what happens when teams decide to try SQL joins, custom Python scripts, or graph databases before eventually calling in a purpose-built solution: "We say to them: fine, this is just going to delay us getting started by three months. Feel free to go away and try that out."

What About AI and Large Language Models?

Given the current enthusiasm for LLMs, the conversation inevitably turns to whether AI can solve entity resolution. The answer is nuanced.

"LLMs can do entity resolution," acknowledges Renwick. "If you've got a small data sample and you want to do a one-off deduplication, chuck it into Claude and you'll get quite decent results. Try and put that into production, and there's a whole raft of things that won't work."

Latey adds: "It's not a viable part of a large-scale data workflow. The token cost, the accuracy — and forget explainability. Tilores is literally 100% white box: you will see what matched what and why it matched it, stored against the record."

The more interesting future application is LLMs as a replacement for human review on edge cases: grey-area matches that a rules engine flags as uncertain could be adjudicated by an LLM, handling the low volume of ambiguous cases while deterministic rules handle the bulk of the matching.
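
A sketch of how that routing might work is shown below; the confidence thresholds and the `llm_review` callable are placeholders rather than a real integration:

```python
def decide_match(record_a: dict, record_b: dict, score: float, llm_review) -> bool:
    """Route a candidate match according to the rules engine's confidence score."""
    if score >= 0.95:        # clear match: accept deterministically
        return True
    if score <= 0.60:        # clear non-match: reject deterministically
        return False
    # Grey area: instead of queuing for human review, ask an LLM to adjudicate.
    # llm_review is a placeholder for whatever model call gets wired in.
    return llm_review(record_a, record_b)
```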

Key Takeaways

  • Graph-based and text-based entity resolution are related but fundamentally different disciplines — confusing them leads to expensive failed implementations.
  • Text-based entity resolution ("are these two records the same person?") is poorly suited to graph databases, which suffer a combinatoric explosion on fuzzy text matching.
  • Graph-based entity resolution ("are these two nodes actually the same entity, based on their relationships?") is where graph databases genuinely excel — e.g. beneficial ownership analysis.
  • The optimal architecture is: text entity resolution first, then load clean golden records into your graph database.
  • Entity resolution is one of the most common data problems organisations don't know they have — and one of the most common ones they underestimate.
  • LLMs can assist with entity resolution at small scale or for edge-case review, but are not viable as a core production matching engine.
