Comparison of AWS and Tilores Entity Resolution Software


In our data-driven world, information accuracy is crucial for company success. However, ensuring this accuracy becomes a daunting task when data contains redundant or inconsistent entries for the same real-world entities. An "entity" could represent a person, a company, a machine, or any other tangible or intangible object.

Fortunately, entity resolution software comes to the rescue by efficiently linking and deduplicating these scattered pieces of information. These systems create clean and accessible datasets that are free from redundancies while preserving all essential information. By harmonizing the data and eliminating duplicates, entity resolution systems empower organizations to make more reliable and informed decisions based on a unified view of their data.

The entity resolution (ER) systems in this comparison have different features and functionalities; however, the result should be the same: clean, properly matched data.

As the topic of entity resolution can be quite complex, ER systems should provide a simple way to start matching data quickly, plus the functionality to fine-tune the matching process afterwards.

General Setup

We use 10 records of persons with similar names and the same address. These records are stored in a single CSV file, as both systems can work with CSV. The file is not normalized and contains some stray spaces in the data.


The resulting matching table should not mix up two different persons, but also should not miss any records that belong to the same person.

Looking at this data, we should retrieve 4 different clusters:

Cluster ID Record IDs
1 1, 2, 3, 4, 8, 9
2 5, 6
3 7
4 10
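
The two requirements above (do not mix up different persons, do not miss records of the same person) correspond to pairwise precision and recall. The following is a minimal sketch of how a result table could be scored against the expected clusters; the function names and the all-singletons example are illustrative assumptions and not part of either product.

```python
from itertools import combinations


def pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    linked = set()
    for members in clusters:
        linked.update(combinations(sorted(members), 2))
    return linked


def pairwise_scores(expected, predicted):
    """Return (precision, recall) of the predicted pairs against the expected pairs."""
    exp, pred = pairs(expected), pairs(predicted)
    correct = len(exp & pred)
    precision = correct / len(pred) if pred else 1.0  # no links means no wrong links
    recall = correct / len(exp) if exp else 1.0
    return precision, recall


# Expected clusters from the table above.
EXPECTED = [{1, 2, 3, 4, 8, 9}, {5, 6}, {7}, {10}]

# Hypothetical output that leaves every record in its own cluster.
singletons = [{i} for i in range(1, 11)]
print(pairwise_scores(EXPECTED, singletons))  # (1.0, 0.0)
```

Low precision means different persons were merged; low recall means records of the same person were left unlinked.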

AWS Entity Resolution

AWS recently launched its own Entity Resolution service. The service can be configured to use either rules or machine learning for matching. Currently it runs as a batch service and has no real-time capabilities. Configurability is very limited at the moment, and matching is only possible for personal data such as name and address. As the service was launched only a few weeks ago, it is likely that its capabilities will be extended soon.

Product Page: https://aws.amazon.com/de/entity-resolution/

Documentation: https://docs.aws.amazon.com/entityresolution/latest/userguide/setting-up.html

Requirements:

  • AWS Account (best with Administrator permissions as you need to use different services)
  • AWS S3 for the input and result files
  • AWS Glue to create the tables
  • AWS IAM to create the needed roles to access all needed services
  • AWS Entity Resolution for the final matching of the data

Setup:

First, the required infrastructure and services have to be set up and the data loaded into an AWS Glue table. Next, you create a schema mapping and define which columns are used for matching and which are passed through. Then you group the input fields so that they can be matched.
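
As a rough illustration of the schema mapping step, the sketch below builds a request body in the shape we understand the `CreateSchemaMapping` operation to expect. The column names, schema name, match key names and group names are hypothetical (the header of the test CSV is not reproduced here), and the field and type names should be checked against the current AWS documentation before use.

```python
def field(name, attr_type, match_key=None, group=None):
    """Build one input-field entry; without a match key the column is only passed through."""
    entry = {"fieldName": name, "type": attr_type}
    if match_key:
        entry["matchKey"] = match_key
    if group:
        entry["groupName"] = group
    return entry


schema_mapping = {
    "schemaName": "persons-demo",  # hypothetical name
    "mappedInputFields": [
        field("id", "UNIQUE_ID"),
        # Name parts are grouped so that they can be compared as one unit.
        field("first_name", "NAME_FIRST", match_key="name", group="full_name"),
        field("last_name", "NAME_LAST", match_key="name", group="full_name"),
        # Address parts are grouped in the same way.
        field("street", "ADDRESS_STREET1", match_key="address", group="full_address"),
        field("city", "ADDRESS_CITY", match_key="address", group="full_address"),
        # Hypothetical column that is not used for matching.
        field("comment", "STRING"),
    ],
}

for entry in schema_mapping["mappedInputFields"]:
    print(entry)
```

With valid credentials, a dictionary like this could presumably be passed to the `entityresolution` client in boto3 via `create_schema_mapping(**schema_mapping)`; that call is not executed here.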

Based on the matching technique there are now two different flows:

  • ML based: This technique does not need any configuration and tries to match the records using a machine learning model. The model cannot be changed or trained by the user. This is a fast way to do entity resolution, but for our test case the results were not good, as the model did not match any records.
  • Rule based: With rule-based matching, the user defines rules that match the data. The rules have priorities: the first rule that matches two records is used, and processing for that record then stops. For rule-based matching, the cadence of the matching runs can be defined (this is not available for the ML process). For the test we chose manual, but for changing data it makes sense to use the automatic cadence, which starts the job whenever the data changes.
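
The priority behaviour of rule-based matching can be sketched in a few lines. This is a simplified illustration under our own assumptions and not the AWS implementation: the rule names, field names and sample records are hypothetical, and matched pairs are merged into clusters with a small union-find structure.

```python
from itertools import combinations

# Hypothetical rules in priority order: (rule name, fields that must all be equal).
RULES = [
    ("rule_1_name_and_address", ("first_name", "last_name", "address")),
    ("rule_2_full_name", ("first_name", "last_name")),
]


def first_matching_rule(a, b, rules=RULES):
    """Return the name of the first rule that matches, or None; later rules are skipped."""
    for name, fields in rules:
        if all(a[f] and a[f] == b[f] for f in fields):
            return name
    return None


def cluster(records, rules=RULES):
    """Group record ids so that every matched pair ends up in the same cluster."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if first_matching_rule(a, b, rules):
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(groups.values())


# Hypothetical sample records, not the test data of this comparison.
records = [
    {"id": 1, "first_name": "anna", "last_name": "meier", "address": "main st 1"},
    {"id": 2, "first_name": "anna", "last_name": "meier", "address": "main street 1"},
    {"id": 3, "first_name": "ben", "last_name": "schmidt", "address": "main st 1"},
]
print(cluster(records))  # [[1, 2], [3]]
```

Records 1 and 2 fail the first rule because the address is spelled differently, so they are linked by the second rule.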

Different combinations of the match keys can be used, as seen in the screenshot below. As with every entity resolution system, the more independent attributes are available, the better the matching can be.


As the matching keys contain several attributes like first name and last name, the user can choose how the matching should be done by selecting the comparison type.


Results:

ML based

Cluster ID Record IDs
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10

The ML based approach delivers zero matches and does not allow any additional configuration.

Rule based

Cluster ID Record IDs
972d03fa0efc489c9707e4756e7867e9 7
5291e16cc8c14368a5db34760a6c9c3f 9
23b30c5fd03a3407b61aaa932cc74eb2 3, 2, 1
303f56fbf9ce472cb34ad1ebbdf1d8d9 10
7bda65da2d66499daad1438773658105 4
6de6133e05b43af4ba9a22245cb02358 5, 6
f4f85d56361e405588337896dde55af2 8

The rule-based approach allows simple matching rules to be configured. The rest of the process is similar to the ML-based approach. With the rule-based approach, the records were matched to 7 resulting entities.

Tilores

Tilores is entity resolution software that links and deduplicates record data in real time. Data can be searched while other data is being inserted into Tilores. The software can be deployed into a customer's AWS account or used as a SaaS offering. This test uses the public SaaS service’s UI.


Tilores can be configured for all kinds of data, including person and company matching and deduplication. 

Product Page: https://tilores.io/

Documentation: https://docs.tilotech.io/

AWS Marketplace: https://aws.amazon.com/marketplace/pp/prodview-2yn5oirwdwq74?sr=0-1&ref_=beagle&applicationId=AWSMPContessa

Requirements:

  • Registration using email/password or Google sign-up at app.tilores.io (the service can be tested for free)

Setup:

After successful signup, the user can choose whether Tilores should provide a sample file, whether the user's own file should be uploaded (the option we used for the test), or whether Tilores should be configured manually.


The first step is to upload the sample file.


Tilores' built-in AI recognizes the data and proposes preconfigured rules for this use case:


Using this configuration Tilores now sets up an instance for us and ingests the data.


When the deployment is done, one of the entities is shown using the Tilores search UI.


The whole setup needed 4 clicks from the user; the rest was done automatically.

Usually, data is uploaded to Tilores and also searched using the custom GraphQL API that is part of every instance.

For the test, the setup process configured two rules that match on name and address. In Tilores, the rules are OR-connected, which means that all rules are always executed, resulting in a confidence score for each link.
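
To illustrate what OR-connected rules mean in contrast to a first-match approach, here is a minimal sketch in which every rule is evaluated for a pair of records. The rule definitions and field names are hypothetical, and the score shown (the fraction of satisfied rules) is purely an illustrative stand-in: how Tilores actually computes its confidence score is not described in this article.

```python
# Hypothetical rules; each one is evaluated for every candidate pair.
RULES = {
    "name": lambda a, b: a["name"] == b["name"],
    "address": lambda a, b: a["address"] == b["address"],
}


def evaluate_link(a, b, rules=RULES):
    """Evaluate all rules; a single satisfied rule is enough to link the records."""
    satisfied = [name for name, rule in rules.items() if rule(a, b)]
    return {
        "linked": bool(satisfied),
        "rules": satisfied,
        # Illustrative score only, not the Tilores formula.
        "score": len(satisfied) / len(rules),
    }


a = {"name": "anna meier", "address": "main st 1"}
b = {"name": "anna meier", "address": "main street 1"}
print(evaluate_link(a, b))  # {'linked': True, 'rules': ['name'], 'score': 0.5}
```

Because no rule short-circuits the others, the list of satisfied rules is available for every link and can be used to judge how strong that link is.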


As the data was not normalized and contained whitespace, it had to be transformed by the system before matching. This is done using the field transformation and extraction that was automatically configured during instance setup. Depending on the use case, this can be fine-tuned further.
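
The kind of transformation meant here can be sketched as follows. This is a generic example and not the Tilores transformation configuration; the column names and the sample row are hypothetical.

```python
import csv
import io


def normalize(value):
    """Trim the value, collapse inner whitespace and lowercase it."""
    return " ".join(value.split()).lower()


def load_normalized(csv_text):
    """Read CSV text and return one dict per row with normalized values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key.strip(): normalize(value) for key, value in row.items()} for row in reader]


# Hypothetical row with stray spaces and mixed case.
sample = "id,first_name,last_name,address\n1,  Anna ,MEIER , Main  St 1\n"
print(load_normalized(sample))
# [{'id': '1', 'first_name': 'anna', 'last_name': 'meier', 'address': 'main st 1'}]
```

Without such a step, values like " Anna " and "anna" would be treated as different by any exact comparison.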


For analytics, exports and to provide clean data for machine learning, the Tilores analytics engine can be used. For the test it was used to export the data.


Results:

Cluster ID Record IDs
976a6fdc-f6da-4719-920d-8c60f0fe6c91 1, 2, 3, 4, 8, 9
dda2d449-03d2-461d-9012-326b7b01d2ee 5, 6
5b92221d-df4f-4acc-aaf8-bcfe0f0aac75 7
b41199ac-2fb3-44a5-ad87-0ba7ca05e31b 10

The result shows 4 entities. This was achieved without any additional fine-tuning.

Conclusion

The main criterion for entity resolution (ER) use cases is matching quality; usability comes second and performance third.

AWS created a new ER service to meet the increasing demand for quality entity data and integrated it tightly into its ecosystem. As with every AWS service, it is expected that this is just the start and that more features will be added based on customer demand.

Tilores is a patent-pending technology used by diverse customers, with a wealth of features that can be configured for all kinds of ER use cases.

To keep this comparison fair, only the simplest possible configuration was used for both systems.

For criterion 1, AWS Entity Resolution matched the records to 7 entities in the best case (rules). Tilores matched the records to 4 entities, which corresponds to the 4 expected clusters.

Criterion 2 shows how tightly the AWS service is integrated into its ecosystem. Many other AWS services are required, but this also makes connectivity to other services possible. For non-AWS users the configuration will take at least one hour; for AWS experts it takes only a few minutes, as detailed documentation is available. Tilores, on the other hand, needs only 4 clicks to match the records.

Criterion 3: AWS ER needed between 9 and 11 minutes to run the test. We expect that this time is needed to set up the whole environment and then run the resolution job. This will be tested again with a bigger dataset. The Tilores setup took 3 minutes. As with AWS ER, the instance first has to be deployed. The matching itself takes only a few milliseconds.

©2023 Tilores, All rights reserved.