Entity Resolution: Reflections on the most common data science challenge

We recently noticed a post on Hacker News titled "Entity Resolution: The most common data science challenge", which generated an interesting discussion about how difficult entity resolution is at scale.

We were too late to join the discussion thread, but thought it worthwhile to write a post in response to the top comment, from user "smeeth", which addressed the entity resolution tutorial linked from the original post:

Danger awaits all ye who enter this tutorial and have large datasets.
The tutorial is fun marketing material and all but its FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.

ER wants to be an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:

1) Use a very complicated speed optimized non-ml algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.

2) Use fancy ML to determine matches within these blocks.

If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you are doing blocking/filtering.

Quote from user smeeth on Hacker News

From our experience, most solutions that follow this pattern fall short in one or more of these ways:

1) they do not support real-time data modifications,
2) they miss the transitive links between matches, or
3) the links between the entries are not persisted.

(1): Most of the time, blocking and matching are done in batch processes: assign each entry of your data set to at least one block, then compare each entry in a block with every other entry in the same block. This, by the way, does not require "fancy ML"; it can also be done with "simple" rules, which are often more explainable.
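As a rough illustration, here is a minimal Python sketch of that batch pattern. The field names (last_name, zip, email, dob, id) and the blocking and matching rules are invented for the example; a real pipeline would use several blocking keys and far more careful comparison logic.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Hypothetical blocking rule: first three letters of the last name
    # plus the postal code. Real systems usually combine several keys.
    return (record["last_name"][:3].lower(), record["zip"])

def is_match(a, b):
    # "Simple" rule-based matching instead of ML: exact email match,
    # or same last name plus same date of birth.
    return a["email"] == b["email"] or (
        a["last_name"].lower() == b["last_name"].lower()
        and a["dob"] == b["dob"]
    )

def find_matches(records):
    # 1) Blocking: group records by key so we only compare within blocks.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # 2) Matching: pairwise comparison inside each block only -
    #    still quadratic per block, but not O(n^2) over the whole data set.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if is_match(a, b):
                matches.append((a["id"], b["id"]))
    return matches
```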

Within each block you still end up with O(n^2) comparisons, and the size of each block determines how much of a problem that is for you. Ideally, each block would contain only the entries that belong together - but that never happens. The real issue starts once new data is added: often people simply rerun the whole process, wasting valuable computing resources and time.

One optimization is not to rerun blocking for everything: only the new data needs to be added to the existing blocks, and matching only needs to be rerun for the blocks actually affected by that new data. The reality, however, is that people are lazy (sometimes), have other topics to solve (often), or that even an optimized process still takes too long for real-time updates (almost always). We have often seen solutions that, due to their real-time requirements, end up with some kind of crude self-built recursive search that gets slower and slower as more data is added.
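Sticking with the hypothetical blocking_key and is_match helpers from the sketch above, an incremental update might look roughly like this: only new records are placed into blocks, and only the touched blocks are re-matched.

```python
from itertools import combinations

def add_records(new_records, blocks, matches):
    # Incremental update: assign only the new records to blocks and
    # re-match only within the blocks they actually touch.
    new_ids = {rec["id"] for rec in new_records}
    affected_keys = set()
    for rec in new_records:
        key = blocking_key(rec)          # hypothetical helper from the sketch above
        blocks[key].append(rec)
        affected_keys.add(key)

    for key in affected_keys:
        for a, b in combinations(blocks[key], 2):
            # Pairs of old records were already compared in earlier runs.
            if a["id"] not in new_ids and b["id"] not in new_ids:
                continue
            if is_match(a, b):           # hypothetical helper from the sketch above
                matches.append((a["id"], b["id"]))
    return matches
```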

(2): As mentioned before, each entry of your data set can be assigned to at least one block. Unfortunately, it often ends up being not "at least one" but "exactly one". That produces results where A matches B and C matches D, while the fact that B might also match C is ignored - and since C does not match A directly and D does not match B directly, the four records are never pulled together. We would call that deduplication, not proper entity resolution. But even if you do put entries into multiple blocks, you will struggle with the next issue: identifying the actual entities from all the pairwise links you now have. That can be done with, for example, a connected-components algorithm, but it is also tricky to scale.
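To make the transitive step concrete, here is a minimal union-find sketch that collapses pairwise links into entities. It is an in-memory toy on a handful of record IDs; as noted above, doing this over billions of links, and keeping it up to date, is where it gets hard.

```python
def resolve_entities(ids, links):
    # Union-find (disjoint set) over record IDs: every linked pair merges
    # two clusters, and each final cluster is one entity.
    parent = {i: i for i in ids}

    def find(x):
        # Path compression keeps lookups close to constant time.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b in links:
        union(a, b)

    entities = {}
    for i in ids:
        entities.setdefault(find(i), []).append(i)
    return list(entities.values())

# A-B and C-D are direct matches; B-C is the transitive link that
# pulls all four records into a single entity.
print(resolve_entities(["A", "B", "C", "D"],
                       [("A", "B"), ("C", "D"), ("B", "C")]))
# -> [['A', 'B', 'C', 'D']]
```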

(3): This one is fairly implementation-specific, but one thing we often noticed is that tools that help you with entity resolution only provide an entity ID for each entry you have. Every entry with the same entity ID belongs to the same entity - simple, isn't it?

But you just lost a bunch of valuable information: how the entries within one entity are actually connected and how certain each link is. The usual reason for dropping that information is duplicates. Suppose you have three records A1, A2 and A3, each containing exactly the same information and therefore clearly matching. If you create all the links between them, you end up with the pairs A1-A2, A1-A3 and A2-A3, so three links. With A1..A5 it is already 10 links, and for A1..An it is n*(n-1)/2 links - for n=1000 that is almost half a million. Now you could argue that all of these duplicates don't provide any value, but in reality we are often not working with true duplicates but with non-identical duplicates: records that are identical in the attributes relevant for matching, but differ in other attributes and therefore cannot be thrown away. A good solution needs to handle these differently than a mediocre one does.
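To sanity-check that growth, here is a tiny helper; it assumes nothing beyond the n*(n-1)/2 formula above.

```python
def link_count(n):
    # Pairwise links among n mutually matching records: n choose 2.
    return n * (n - 1) // 2

for n in (3, 5, 1000):
    print(n, link_count(n))
# 3 3
# 5 10
# 1000 499500
```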

See the discussion about this post on Hacker News.
