You must know the person you are serving and the product that you are selling.
If you know exactly who you serve as a business, or exactly which product you are selling, you have successfully performed entity resolution. This means you have taken an entity, be it a person or a product, collected all the available information about it, and connected that information to the one unique entity so that no duplicate of it exists.
In this article, we aim to explain the following basics of entity resolution:
What is entity resolution?
What are the challenges in entity resolution?
How can you perform entity resolution?
Is there a method to balance the advantages and disadvantages of batch and real-time processing?
Does entity resolution completely replace your existing databases?
Are there any data-related regulations that we should be aware of?
How does GDPR apply to entity resolution?
How does entity resolution apply to the real world?
How does entity resolution help with products?
1. What is entity resolution?
Let’s say the same person goes by two names, Robert and Bob, and this “entity” (person) has acquired two different credit cards, one under each name. A bank with a proper entity resolution system in place can quickly flag the fraud; a bank without one is simply at a loss.
The level of entity resolution required varies by industry and use case: an eCommerce business might need only basic information to resolve entities, while a finance company may need many more data points.
Entity resolution has existed in the real world for a long time, and many industries have been doing it for ages, but they often have no specific name for it and cannot pinpoint what it is or how it applies to their business.
Knowing what entity resolution is and understanding how it impacts your business plays a huge role in shaping your business because data is everything and more importantly, correct data is everything!
With the enormous stream of incoming data, businesses are discovering new ways to decipher it and are applying artificial intelligence and machine learning to extract valuable insights. But what use is all this data if it is badly structured?
This is where entity resolution comes in. Entity resolution is the process of deduplicating and matching the data that belongs to a particular entity in a way that there is only one single source of truth for that entity.
For example, a CRM database has the contact information of a person who is a real-world entity, but due to misspellings, wrongly entered data, broken imports, or multiple entries, the same person has many entries in the same system.
With entity resolution, you can quickly link all these contacts and datasets to one entity, remove duplicates, enrich the information by combining the data available across all datasets, and create a single source of truth.
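To make this concrete, here is a minimal Python sketch of linking and merging duplicate CRM contacts. The sample rows, the `match_key` rule (same email after trimming and lower-casing), and the "first non-empty value wins" merge policy are all assumptions made for illustration; a production system would use far richer matching.

```python
from collections import defaultdict

# Hypothetical CRM rows: the same person was entered twice.
contacts = [
    {"id": 1, "name": "Robert Smith", "email": "rob.smith@example.com", "phone": None},
    {"id": 2, "name": "Bob Smith", "email": "Rob.Smith@Example.com ", "phone": "555-0100"},
    {"id": 3, "name": "Alice Jones", "email": "alice@example.com", "phone": None},
]

def match_key(contact):
    """Assumed matching rule: same email after trimming and lower-casing."""
    return contact["email"].strip().lower()

def resolve(contacts):
    """Group rows by match key and merge each group into one entity record."""
    groups = defaultdict(list)
    for contact in contacts:
        groups[match_key(contact)].append(contact)

    entities = []
    for key, rows in groups.items():
        # Keep references to the original rows so nothing is lost.
        merged = {"email": key, "source_ids": [row["id"] for row in rows]}
        for field in ("name", "phone"):
            # Assumed merge policy: keep the first non-empty value.
            merged[field] = next((row[field] for row in rows if row[field]), None)
        entities.append(merged)
    return entities

entities = resolve(contacts)
for entity in entities:
    print(entity)
```

Rows 1 and 2 collapse into one entity that carries the phone number from the second row, while row 3 stays separate.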
That is entity resolution for you!
2. What are the challenges in entity resolution?
Is it as easy as it looks? Entity data means vast amounts of data, and performing entity resolution at scale can be a challenge. It is a challenge because you are comparing each record with a sea of other records in the pool, and these records are housed in disparate data silos.
This means that for n records, there are n*(n-1)/2 possible unique pairs (also known as “edges”). So, for 10,000 records, there are already 49,995,000 potential pairs to compare. For a million records, this equates to 499,999,500,000 possible pairs.
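The arithmetic is easy to check. The short sketch below simply evaluates n*(n-1)/2; the function name `candidate_pairs` is made up for this example.

```python
def candidate_pairs(n: int) -> int:
    """Number of unique unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (10_000, 1_000_000):
    print(f"{n:>9,} items -> {candidate_pairs(n):,} pairs")
```

Growing the input a hundredfold grows the number of comparisons roughly ten-thousandfold, which is why naive all-against-all comparison stops being practical so quickly.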
Several established techniques exist to tackle this problem, such as blocking, clustering, and matching. Still, only a few software solutions can function at scale without falling into the trap of quadratically growing edges.
Now that is a challenge right there.
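Blocking, one of the techniques mentioned above, can be sketched in a few lines: records are only compared with other records that share a cheap "blocking key". The sample records and the choice of postal code as the key are assumptions for illustration; a poorly chosen key will hide true matches that land in different blocks.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records to deduplicate.
records = [
    {"id": 1, "name": "Robert Smith", "zip": "10115"},
    {"id": 2, "name": "Bob Smith", "zip": "10115"},
    {"id": 3, "name": "Roberta Smyth", "zip": "80331"},
    {"id": 4, "name": "Alice Jones", "zip": "80331"},
    {"id": 5, "name": "Alice Jonas", "zip": "80331"},
]

def blocking_key(record):
    # Assumed key: postal code. Only records in the same block get compared.
    return record["zip"]

def blocked_pairs(records):
    """Return candidate pairs of record IDs, restricted to shared blocks."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

all_pairs = list(combinations([r["id"] for r in records], 2))
pairs = blocked_pairs(records)
print(len(all_pairs), "pairs without blocking")
print(len(pairs), "pairs with blocking:", pairs)
```

Even on five records the candidate set shrinks from ten pairs to four; on real data the reduction is what keeps the comparison step tractable.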
So how do you approach solving entity resolution use cases?
3. How can you perform entity resolution?
You can resolve entities using rule-based methods or AI/ML-based approaches.
Let’s decode each of these approaches here.
Rule-based entity resolution involves defining specific string-based rules that identify similarities between two entities. When you submit a request or a record, the rule-based process compares it with every other record in the database and decides how similar they are based on the rules predefined for that use case.
But rules cannot be defined for every possible use case; the approach scales poorly and will sometimes produce false positives and false negatives.
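A rule of this kind might look like the following sketch. It uses `difflib.SequenceMatcher` from the Python standard library for string similarity; the nickname table, the 0.75 threshold, and the record fields (`first`, `last`, `dob`) are assumptions chosen for the example, not a recommended rule set.

```python
from difflib import SequenceMatcher

# Assumed nickname table; a real rule set would be much larger.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}

def canonical_first_name(name):
    first = name.strip().lower()
    return NICKNAMES.get(first, first)

def similarity(a, b):
    """String similarity between 0.0 and 1.0 (difflib's ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(a, b, threshold=0.75):
    """Rule: same canonical first name, similar last name, same birth date."""
    return (
        canonical_first_name(a["first"]) == canonical_first_name(b["first"])
        and similarity(a["last"], b["last"]) >= threshold
        and a["dob"] == b["dob"]
    )

robert = {"first": "Robert", "last": "Smith", "dob": "1980-04-12"}
bob = {"first": "Bob", "last": "Smyth", "dob": "1980-04-12"}
bobby = {"first": "Bobby", "last": "Smith", "dob": "1980-04-12"}

print(is_match(robert, bob))    # matched through the nickname rule
print(is_match(robert, bobby))  # missed: "bobby" is not in the table
```

The second comparison shows the weakness described above: any variation the rules did not anticipate becomes a false negative, and every fix means writing another rule.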
With AI/ML-based approaches, by contrast, a set of training data is fed into the system to show it the kinds of similarities or dissimilarities you want it to look for.
Through systematic learning from the training datasets, and with the help of specific algorithms, the machine learns to identify the similarities that indicate an entity while matching records.
Though AI/ML-based entity resolution is easier to set up because you do not have to configure rules, its matches are not easily explainable, which becomes a problem when two entities are mixed up and nobody can tell why.
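The idea of learning from labelled examples can be shown with a deliberately tiny stand-in for a real model: instead of hand-picking a similarity threshold, the sketch below learns it from training pairs. This is a toy with a single learned parameter, not a real ML pipeline, and the training pairs are invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical labelled examples: (record_a, record_b, same_entity)
training_pairs = [
    ("Jon Smith", "John Smith", True),
    ("Acme Corp", "ACME Corp.", True),
    ("Maria Garcia", "Maria Garcia", True),
    ("Jon Smith", "Jane Doe", False),
    ("Acme Corp", "Apex Corp", False),
    ("Maria Garcia", "Mario Rossi", False),
]

def learn_threshold(pairs):
    """Pick the similarity cut-off that labels the most training pairs correctly."""
    scored = [(similarity(a, b), label) for a, b, label in pairs]

    def correct(cutoff):
        return sum((score >= cutoff) == label for score, label in scored)

    # Try each observed score as a candidate cut-off.
    return max(sorted({score for score, _ in scored}), key=correct)

threshold = learn_threshold(training_pairs)

def is_match(a, b):
    return similarity(a, b) >= threshold

print(round(threshold, 3))
print(is_match("Acme Corp", "Apex Corp"))
```

With one parameter the decision is still easy to explain. Real models learn many interacting weights, which is where the explainability problem described above comes from.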
Whether you follow a rule-based or an AI/ML-based approach, there are two common ways to process the data: batch processing and real-time processing.
Batch processing means the incoming data is collected and stored until a particular amount of data is collected and then processed together in a batch.
Batch processing has two steps: live query and data processing.
Live query provides immediate results during searches and a read-only copy of the data.
Data processing involves matching the incoming data, handling data deletions, adding new data to the existing records, etc.
The advantage of batch processing is that the live query is immediate, with no lag, since the data in a batch is already processed. The downside is that the query does not include new incoming data, and every time you run a batch, the live query has to be switched to the latest batch.
Data processing is typically performed once a day or once a week, but these intervals are not always fixed because the amount of incoming data varies continuously. Either way, the dataset behind a live query is always somewhat outdated.
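The two steps and the switch-over can be sketched as follows. The class `BatchResolver` and its method names are hypothetical, and matching is reduced to a normalized email key to keep the example short.

```python
class BatchResolver:
    """Toy batch pipeline: writes are queued, reads hit a frozen snapshot."""

    def __init__(self):
        self.pending = []    # incoming records waiting for the next batch run
        self.snapshot = {}   # read-only copy that live queries are served from

    def ingest(self, record):
        self.pending.append(record)

    def live_query(self, email):
        return self.snapshot.get(email.strip().lower())

    def run_batch(self):
        # Build the next snapshot off to the side ...
        next_snapshot = {key: dict(entity) for key, entity in self.snapshot.items()}
        for record in self.pending:
            key = record["email"].strip().lower()
            next_snapshot.setdefault(key, {}).update(record)
        self.pending = []
        # ... then switch live queries over to it in one step.
        self.snapshot = next_snapshot


resolver = BatchResolver()
resolver.ingest({"email": "bob@example.com", "name": "Bob Smith"})
before = resolver.live_query("bob@example.com")  # batch has not run yet
resolver.run_batch()
after = resolver.live_query("bob@example.com")
print(before, after)
```

Until `run_batch` is called, the freshly ingested record is invisible to queries, which is exactly the staleness described above.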
Real-time processing, by contrast, processes incoming data as soon as it enters the system. The main benefit is that, unlike a read-only copy, the data is current and instantly accessible.
The disadvantage of real-time processing stems from the competing interests of reads and writes: writes can slow down reads and vice versa. Large batch uploads can also cause reads to time out.
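A minimal sketch of the real-time variant makes the contention visible. Here a single `threading.Lock` guards the shared state, which is an assumption made to keep the example small; real systems use finer-grained concurrency control, but the trade-off is the same. The class and method names are hypothetical.

```python
import threading

class RealTimeResolver:
    """Toy real-time pipeline: every write is merged immediately."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entities = {}

    def ingest(self, record):
        key = record["email"].strip().lower()
        with self._lock:  # a writer holds the lock while merging
            self._entities.setdefault(key, {}).update(record)

    def query(self, email):
        with self._lock:  # a reader has to wait for any write in progress
            entity = self._entities.get(email.strip().lower())
            return dict(entity) if entity else None


live = RealTimeResolver()
live.ingest({"email": "bob@example.com", "name": "Bob Smith"})
result = live.query("Bob@Example.com")
print(result)
```

The record is queryable right after `ingest` returns, but a long-running write (for example a large upload processed under the same lock) would block every query behind it.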
4. Is there a method to balance the advantages and disadvantages of batch and real-time processing?
Can you achieve batch processing speed combined with the up-to-the-minute timely query replies of real-time processing?
The answer is serverless technology for entity resolution.
Serverless technology abstracts away the underlying technology stack that a user would otherwise have to maintain, such as backend infrastructure, integration platforms, and applications.
Instead, it provides a function-based environment that responds to user requests as they arrive, providing the scalability and speed to process any user request.
Serverless technology for entity resolution eliminates read and write conflicts, provides up-to-date data, makes query times more predictable, and removes the hard data switches that occur with batch processing.
5. Does entity resolution completely replace your existing databases?
Entity resolution effectively adds a new database to the existing ones, but there is also the option of creating a stand-alone data layer, such as a database or a data lake that stores entity data in a user-centric manner.
Entity resolution ensures that all the data is stored in a single layer with references to source databases such as CRM systems, internal databases, collaboration software like email systems, and marketing databases.
As a result, any live query to the entity resolution software provides you with the desired data and retrieves all the references to source databases.
Thus you can perform client data requests or deletion requests within seconds without any human involvement.
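The role of those references can be sketched as follows. The two in-memory "source systems", their local IDs, and the function names `export_entity` and `erase_entity` are all hypothetical; the point is only that one entity record knows where every piece of source data lives.

```python
# Hypothetical source systems, each keyed by its own local ID.
crm = {"c-17": {"name": "Robert Smith"}}
newsletter = {"n-204": {"email": "bob@example.com"}}
SOURCES = {"crm": crm, "newsletter": newsletter}

# The entity layer keeps one record per entity plus references to the sources.
entity_index = {
    "entity-1": {"refs": [("crm", "c-17"), ("newsletter", "n-204")]},
}

def export_entity(entity_id):
    """Data request: collect everything the sources hold about one entity."""
    return {
        f"{system}:{local_id}": SOURCES[system][local_id]
        for system, local_id in entity_index[entity_id]["refs"]
    }

def erase_entity(entity_id):
    """Deletion request: remove the entity from every referenced source."""
    for system, local_id in entity_index.pop(entity_id)["refs"]:
        SOURCES[system].pop(local_id, None)

report = export_entity("entity-1")
print(report)
erase_entity("entity-1")
print(crm, newsletter, entity_index)
```

Because the references are followed mechanically, neither request needs someone to search each system by hand.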
However, the disadvantage of centralizing all company data is that it places high demands on the system in terms of availability and performance. The entity resolution software must not cause any data loss or slowdown in the source systems.
So how can you tackle this issue?
Event streams address the issue of source systems slowing down by isolating the source systems from the entity resolution software.
Communication happens asynchronously between source systems and entity resolution software with event streams. Any data from source systems is pushed to the event stream, which subsequently reaches the entity resolution software asynchronously.
The result is:
Requests for data, changes, or deletions from entity resolution software are sent to the source database in the form of events.
You have a data layer in addition to your source database.
Any slowdowns in entity resolution software do not slow down the source database.
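The decoupling can be illustrated with a small sketch in which a `queue.Queue` stands in for a real event stream. It only shows the direction from source system to entity resolution software; the event shape, the function names, and the `None` sentinel used to stop the demo are assumptions.

```python
import queue
import threading

event_stream = queue.Queue()  # stands in for a real event stream
resolved = {}                 # state owned by the entity resolution side

def source_system_write(record):
    """Called by a source system: publish the change and return at once."""
    event_stream.put({"type": "upsert", "record": record})

def resolution_worker():
    """Consume events at the resolver's own pace."""
    while True:
        event = event_stream.get()
        try:
            if event is None:  # sentinel used to stop this demo
                return
            record = event["record"]
            key = record["email"].strip().lower()
            resolved.setdefault(key, {}).update(record)
        finally:
            event_stream.task_done()

worker = threading.Thread(target=resolution_worker)
worker.start()
source_system_write({"email": "bob@example.com", "name": "Bob Smith"})
source_system_write({"email": "bob@example.com", "phone": "555-0100"})
event_stream.put(None)
worker.join()
print(resolved)
```

`source_system_write` never waits for the resolver, so a slow or busy resolver only makes the queue longer; it does not slow the source system down.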
Now that we know more about entity resolution, how data is handled, and where data is stored, let’s turn to the rules that govern that data.
6. Are there any data-related regulations that we should be aware of?
While processing data for entity resolution, there is one very important thing to remember: entity data about individuals may be private, and there are laws in place to protect an individual’s data rights.
When processing sensitive personal data, it is critical that any mechanism you use for entity resolution conforms with regulations such as GDPR or CCPA; otherwise, you may not be allowed to process personal data at all.
This means that:
The purpose for combining two datasets must be explainable (which, in the case of ML, is sometimes not easily possible).
You must automatically erase data after a set length of time or when the legitimate reason for retaining it has passed.
A person’s entire dataset must be made available upon request.
You must remove data upon request.
You must encrypt data before it is stored.
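Of the requirements above, automatic erasure after a set length of time is the easiest to sketch. The 365-day period, the `stored_at` field, and the function name `purge_expired` are assumptions for illustration only; the appropriate retention period depends on the purpose of processing and is a legal question, not a technical one.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # assumed period, not legal guidance

def purge_expired(records, now=None):
    """Return only the records that are still within the retention period."""
    now = now or datetime.now(timezone.utc)
    return [r for r in records if now - r["stored_at"] <= RETENTION]

# Fixed "current" time so the example is reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "stored_at": datetime(2022, 1, 15, tzinfo=timezone.utc)},
    {"id": 2, "stored_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
kept = purge_expired(records, now=now)
print([r["id"] for r in kept])
```

Running a job like this on a schedule means expired records disappear without anyone having to remember to delete them.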
7. How does GDPR apply to entity resolution?
The European Union’s General Data Protection Regulation (GDPR) came into effect in 2018 in response to the exploitation of personal data by companies and other organizations for their own gain.
In light of such exploitation, this set of rules was introduced to protect consumer data.
Any organization that processes personal data must disclose the information used for data processing upon the user’s request and must also delete data from their systems on request.
Why is it difficult to implement GDPR compliance at an enterprise scale?
The problem is that with massive data stored in disparate systems and an entity not having a unique identification number, it is rather tricky to query the data immediately upon a consumer’s request.
As a result, the data science team and the legal team have to extract this data from the disparate systems manually. Even this process is complicated, since two consecutive data disclosure requests could lead to different results.
The most challenging part is dealing with data deletion requests, whether they come from consumers or arise from GDPR compliance itself, because the information resides in diverse data sources.
One solution to handle GDPR compliance is entity resolution.
Entity resolution has its roots in personal data management, and it is very beneficial here because the entity’s data is stored in one single space. Hence, it is easier to respond to consumer requests for data access or data deletion.
Using entity resolution, the data in all the source systems can be easily deleted, ensuring GDPR compliance and saving time and effort for the data science and legal teams.
8. How does entity resolution apply to the real world?
Let’s look at a few industries that use entity resolution to tackle business issues head-on.
eCommerce is an industry where decisions have to be made in real time, since customers can take advantage of the slightest loopholes.
Suppose an eCommerce store offers vouchers or discounts to new customers. An existing customer can take advantage of these first-time offers simply by registering with a different email address, leading to voucher fraud.
The buy now, pay later option provided by eCommerce stores can be abused when payment and security systems respond to queries slowly or when the data about the customer is not available in one single place.
Fraudulent orders are placed using the address of an unsuspecting customer who receives invoices for the orders they never placed, leading to delivery fraud.
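The voucher case shows why matching on email alone is not enough. The sketch below checks a new signup against existing customers using last name plus a normalized postal address instead. The abbreviation table, the field names, and the matching rule are assumptions; note that this particular rule would also flag genuine new customers who share a household with an existing one, so a real system would treat it as a signal to review rather than proof of fraud.

```python
import re

# Assumed abbreviation table for address normalization.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}

def normalize_address(address):
    """Lower-case, drop punctuation, and expand common abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

def is_returning_customer(signup, customers):
    """Assumed rule: same last name and same normalized address."""
    key = (signup["last_name"].lower(), normalize_address(signup["address"]))
    return any(
        (c["last_name"].lower(), normalize_address(c["address"])) == key
        for c in customers
    )

customers = [
    {"last_name": "Smith", "address": "12 Main St., Springfield",
     "email": "bob@example.com"},
]
signup = {"last_name": "smith", "address": "12 main street springfield",
          "email": "totally.new@example.com"}

print(is_returning_customer(signup, customers))
```

The signup uses a brand-new email address, but the normalized name and address still resolve to the existing customer.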
In healthcare, timely intervention is possible when all the information about a patient is in one single database, down to minute details such as the medicines they are allergic to and their past surgeries.
Thus a better relationship with the patient develops, and a complete view of their health is available, helping them achieve the best possible treatment.
The finance industry is highly vulnerable to fraudulent activities. Any slight mismatch in the data related to a person might lead to wrongly evaluating their credit score, failing to detect debt fraud, or even issuing loans to an ineligible individual.
The common thread among all these industries is that even a slight mismatch in the entity data can lead to enormous losses, harm the customer experience, and let fraud go undetected.
To ensure good entity resolution for any industry use case, two things must happen: one, the data must be correctly related to the same person or product, and two, the data regarding the person or the product must be accessible in a split second.
We looked at how entity resolution helps decode information connected to people, but what about product data?
9. How does entity resolution help with products?
When processing product data, it’s not uncommon for a product to be duplicated because its title or description differs from one listing to the next.
Assume you plan to purchase a camera and search for “canon Ixus 285 black.”
These are the search results that crop up.
“Canon Ixus 285 Hs Black Compact Camera”
“Canon Ixus 285 HS black”
“Canon IXUS 285 HS 20,2 Megapixel Full HD Compact Camera, 25–300 mm”
“Canon Ixus 285 HS BLACK 20,2”
“Canon IXUS 285 HS Essential Kit black”
“Canon Ixus PowerShot 285 HS (Black)”
But all you wanted to see was the Canon Ixus 285 black model, not six different listings.
With entity resolution, only the details of the model you searched for come up in the catalog, not six other products, making your search faster and more accurate.
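One simple way to collapse such listings is to derive a coarse product key from each title. The sketch below is a rough illustration under stated assumptions: the brand is taken to be the first word, the `KNOWN_SERIES` vocabulary is invented, and the model number is assumed to be the first standalone number with at least three digits. A key this coarse also merges the "Essential Kit" bundle with the bare camera, which a real catalog might want to keep apart.

```python
import re

KNOWN_SERIES = {"ixus", "eos"}  # assumed catalog vocabulary

def product_key(title):
    """Reduce a listing title to (brand, series, model number)."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    brand = tokens[0]  # assumption: the brand comes first
    series = next((t for t in tokens if t in KNOWN_SERIES), None)
    model = next((t for t in tokens if t.isdigit() and len(t) >= 3), None)
    return (brand, series, model)

titles = [
    "Canon Ixus 285 Hs Black Compact Camera",
    "Canon Ixus 285 HS black",
    "Canon IXUS 285 HS 20,2 Megapixel Full HD Compact Camera, 25–300 mm",
    "Canon Ixus 285 HS BLACK 20,2",
    "Canon IXUS 285 HS Essential Kit black",
    "Canon Ixus PowerShot 285 HS (Black)",
]

keys = {product_key(title) for title in titles}
print(keys)
```

All six titles reduce to the same key, so the catalog can show them as one product with six offers instead of six products.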
Entity resolution is a necessity for the success of a business. Entity resolution backed with serverless technology is even better.
With serverless technology, you can scale entity resolution to handle large-scale requests and, at the same time, maintain speed.
When done correctly, entity resolution allows you to have all of the information you need at your fingertips. With this single source of truth, you can make split-second decisions, cater to customer needs, assess business needs, and derive insights from every segment of your business, effectively increasing your enterprise value.