Elasticsearch for Entity Resolution
If you were to ask ten different data experts, “what is Elasticsearch?” chances are that you would be met with ten different answers — “an index”, “a search engine”, “it’s Google… but for data”. Though not particularly helpful in answering your question or providing you with context, all these answers are technically correct.
Elasticsearch is a distributed full-text search and analytics engine built on Apache Lucene. It was originally released as open source under the Apache 2.0 licence, although more recent versions are distributed under different licence terms. It is accessible via a RESTful API or through language clients such as the Elasticsearch Java API client, and it can be used to store, search, and analyse huge volumes of data and return answers in near real-time. Elasticsearch sits at the core of the Elastic Stack, a set of free tools for data ingestion, enrichment, storage, analytics, and visualisation.
How Does Elasticsearch Work?
Elasticsearch is fed by flows of raw data from a variety of sources such as logs and web applications. Users are free to import data from any source, and the Elastic Stack includes a range of out-of-the-box integrations to streamline ingestion. Once data comes in, it is parsed, normalised, and then enriched before finally being indexed in Elasticsearch.
An Elasticsearch index is a collection of documents that are related to one another. Data is stored as JSON documents, and each document maps a set of keys to their corresponding values. Elasticsearch uses a data structure known as an inverted index, which lists every unique word that appears in any document together with all the documents that word appears in. This is what allows it to answer full-text searches in near real-time.
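To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is not how Lucene implements it internally (real analysers handle tokenisation, stemming, and scoring); the sample documents and the `build_inverted_index` helper are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each unique lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenisation: lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document ID.
documents = {
    1: "Daniel Jones ordered a laptop",
    2: "Dan Jones is a director of 123 Ltd",
    3: "A laptop was returned",
}

index = build_inverted_index(documents)
print(sorted(index["jones"]))   # [1, 2]
print(sorted(index["laptop"]))  # [1, 3]
```

Looking up a word is then a single dictionary access rather than a scan over every document, which is the property that makes full-text search fast.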
Once data has been indexed, users can run queries against it and use aggregations to retrieve summaries. Using the wider Elastic Stack (also known as the ELK Stack), it then becomes possible to build powerful visualisations, perform deeper analytics, and manage deployments all from one place.
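As a rough illustration of a query combined with an aggregation, the sketch below builds a search request body using the Elasticsearch query DSL. The field names (`name`, `country`) and the `build_search_body` helper are hypothetical, and the example assumes `country` is mapped as a keyword field so that a terms aggregation can run on it.

```python
import json


def build_search_body(field, text, agg_field, size=10):
    """Build a request body with a full-text match query and a terms aggregation."""
    return {
        "size": size,
        # Full-text match on one field.
        "query": {"match": {field: text}},
        # Count the matching documents per distinct value of agg_field.
        "aggs": {"by_" + agg_field: {"terms": {"field": agg_field}}},
    }


body = build_search_body("name", "dan jones", "country")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of an index, either over the REST API or through one of the language clients.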
The speed and scalability of Elasticsearch mean that it can be applied to a variety of use cases including application, web, and enterprise search, application performance monitoring, and business analytics.
Elasticsearch Entity Resolution
It’s all well and good to be able to rapidly run search queries, but what use is this if your data is scattered and badly structured? This is where entity resolution comes in. Entity resolution is the process of deduplicating and matching the data that belongs to a particular entity so that a single source of truth for that entity exists in the system. In other words, you are taking related but non-identical data from disparate sources and combining it into a single entity.
Let’s say for instance that you’re an eCommerce business and you have a database that contains the contact information of your customers and clients. One of your customers is Daniel Jones, and he has a personal entry in your database. Further to this personal entry, there’s also a corporate data record for 123 Ltd that has a director named ‘Dan Jones’. Are they the same person? If you can’t answer this question, then how can you make informed decisions about the risks and opportunities associated with this person?
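A first step towards answering that question might look like the sketch below, which normalises the two names and scores their similarity using Python's standard `difflib` module. The nickname table, the record fields, and both helper functions are assumptions made for illustration. A matching name on its own proves nothing, and a real system would compare further attributes such as email, address, or date of birth.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny nickname lookup.
NICKNAMES = {"dan": "daniel", "danny": "daniel"}


def normalise_name(name):
    """Lowercase the name and expand known nicknames to their full form."""
    parts = name.lower().split()
    return " ".join(NICKNAMES.get(part, part) for part in parts)


def name_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two normalised names."""
    return SequenceMatcher(None, normalise_name(a), normalise_name(b)).ratio()


personal = {"name": "Daniel Jones", "source": "customers"}
corporate = {"name": "Dan Jones", "source": "directors", "company": "123 Ltd"}

score = name_similarity(personal["name"], corporate["name"])
print(score)  # 1.0, because both names normalise to "daniel jones"
```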
Where Entity Resolution Comes In
Entity resolution makes it possible to quickly link scattered data points into one entity, remove duplicates, combine all available data across all datasets, and create a single source of truth for the entity in question.
This might sound trivial. After all, the data is there regardless. But entity resolution is a critical process because it matches non-identical records without the need to constantly formulate new rules. This makes it possible to analyse information more efficiently, spot patterns across unified data, see the bigger picture, and benefit from a single view of the entity in question, which in this case is a customer. That single view is priceless.
The Problem with Using Elasticsearch for Entity Resolution
Getting started with Elasticsearch is simply a case of setting up your interface and adding your data. The trouble begins when you use it for entity resolution. Most of the time, the search for matching data is carried out recursively, which is conceptually the wrong approach: each time a matching dataset is found, the search is restarted to look for further matches. The obvious trade-off here is a loss in speed and flexibility, especially when large volumes of data are involved.
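The cost of restarting the search can be sketched with a small in-memory simulation. This is one reading of the recursive approach, not a reproduction of any particular Elasticsearch setup. The `search` function stands in for a single search request, and the records, the match fields (`email`, `phone`), and the function names are all hypothetical.

```python
def search(records, field, value):
    """Stand-in for one search request: return every record with this field value."""
    return [r for r in records if r.get(field) == value]


def resolve_by_repeated_search(records, start, match_fields=("email", "phone")):
    """Collect the records linked to `start`, searching again for every new match."""
    entity = {start["id"]: start}
    pending = [start]
    searches = 0
    while pending:
        current = pending.pop()
        for field in match_fields:
            value = current.get(field)
            if value is None:
                continue
            searches += 1  # one more round trip to the search engine
            for hit in search(records, field, value):
                if hit["id"] not in entity:
                    entity[hit["id"]] = hit
                    pending.append(hit)  # a new match restarts the search
    return sorted(entity), searches


records = [
    {"id": 1, "name": "Daniel Jones", "email": "dj@example.com", "phone": None},
    {"id": 2, "name": "Dan Jones", "email": "dj@example.com", "phone": "555-0100"},
    {"id": 3, "name": "D. Jones", "email": None, "phone": "555-0100"},
    {"id": 4, "name": "Maria Smith", "email": "ms@example.com", "phone": "555-0199"},
]

ids, searches = resolve_by_repeated_search(records, records[0])
print(ids, searches)  # [1, 2, 3] 4
```

Even in this toy case, resolving a three-record entity takes four searches, and the number grows with every additional record and match field, every time the entity is looked up.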
Let’s imagine for a second that you are in a maze with three doorways at every stage. Each time you try to progress to the next stage, you must check the same three options again, until you have built the perfect pathway. This is slow and inefficient; you should only have to check each option once. This is the issue with ‘vanilla’ Elasticsearch: pathways must be redrawn every time, which is a drain on time and resources.

Many Companies Try Elasticsearch for Entity Resolution
At Tilo, we frequently encounter companies that are using Elasticsearch for their initial entity resolution. It makes sense — at least at first — as it’s a technology that many engineers are familiar with, so it’s quite quick to get it up and running. It’s only once the entity resolution needs start to grow (typically once you have around 1 million entities) that the scaling problems become apparent, with slow search times and a small team required to constantly work on the technology to keep it running on multiple costly clusters.
This is where TiloRes comes in. Our scalable, serverless solution eliminates this problem by building up entities as data is ingested. When a search is subsequently performed, it is compared against all existing datasets, and the pre-built entity that each matching dataset belongs to is delivered as a search result. A dataset can only belong to one entity at any given time, but a single search query can match datasets from several entities and therefore return multiple entities as a result. In addition, TiloRes uses a GraphQL API, which means that you only get the data that you ask for from your searches. Nothing else. Paired with TiloRes search rules, this delivers the most accurate results possible.
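The general idea of building entities at ingest time can be sketched with a union-find structure, as below. This is an illustrative assumption about how such a design could work, not TiloRes's actual implementation. The `EntityStore` class, its methods, and the match fields are all hypothetical.

```python
class EntityStore:
    """Groups records into entities as they are ingested (union-find sketch)."""

    def __init__(self, match_fields=("email", "phone")):
        self.match_fields = match_fields
        self.parent = {}   # record id -> parent record id
        self.members = {}  # root record id -> set of record ids in that entity
        self.seen = {}     # (field, value) -> first record id that carried it

    def _find(self, rid):
        """Return the root record id of the entity containing `rid`."""
        while self.parent[rid] != rid:
            self.parent[rid] = self.parent[self.parent[rid]]  # path halving
            rid = self.parent[rid]
        return rid

    def _union(self, a, b):
        """Merge the entities containing records `a` and `b`."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b
            self.members[root_b] |= self.members.pop(root_a)

    def ingest(self, record):
        """Add a record (assumed to have a unique id) and link it to any match."""
        rid = record["id"]
        self.parent[rid] = rid
        self.members[rid] = {rid}
        for field in self.match_fields:
            value = record.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in self.seen:
                self._union(rid, self.seen[key])
            else:
                self.seen[key] = rid

    def lookup(self, field, value):
        """Return the ids of the pre-built entity that carries this field value."""
        rid = self.seen.get((field, value))
        return sorted(self.members[self._find(rid)]) if rid is not None else []


store = EntityStore()
for record in [
    {"id": 1, "email": "dj@example.com", "phone": None},
    {"id": 2, "email": "dj@example.com", "phone": "555-0100"},
    {"id": 3, "email": None, "phone": "555-0100"},
    {"id": 4, "email": "ms@example.com", "phone": "555-0199"},
]:
    store.ingest(record)

print(store.lookup("phone", "555-0100"))  # [1, 2, 3]
```

Because the linking work happens once per ingested record, a lookup returns the whole entity from a single match instead of triggering a fresh chain of searches.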
Conclusion: Using Elasticsearch for Entity Resolution
While the Elasticsearch engine might be a powerful tool for searching and analysing huge volumes of data in near real-time, it’s not without its limits. Bloating, split-brain situations, slow time-to-delivery for large searches, and limited scalability are all potential problems that data teams might run into.
At Tilo, we help firms to overcome these challenges with our proprietary data solution, TiloRes: a serverless entity resolution technology. TiloRes offers super-fast searching, unlimited scaling, and real-time deduplication to speed up, simplify, and future-proof the entity resolution process.