← Back to Blog
Engineering August 10, 2023 · 3 min read

Automating Data Deduplication in Snowflake with Tilores

SR
Steven Renwick
Tilores
Automating Data Deduplication in Snowflake with Tilores

Preventing Free Trial Abuse using Identity Resolution with your Snowflake Customer Data

The problem

Ending up with duplicate customer data is almost unavoidable, whether it is due to acquiring data from different sources, customers registering multiple times or mistakes made during manual data entry. Another possible cause of duplicate customer data is when customers sign up online multiple times, each time slightly changing their details, to take advantage of free trials or first-use discounts. 

This leads to multiple rows in a table that represent the same customer, company or any other kind of entity. A classic identity resolution problem. So how can one solve this in Snowflake?

What to expect?

In this guide we will start with a Snowflake table that contains duplicated customer data and we will end up with another table introducing a new field (entity_id) which is a unique identifier for each customer.

image

An **entity_id **will be automatically calculated for any new data added to the table.

Create test table in Snowflake

Create a new worksheet and run the following:

Download the result as CSV. We will use this in the next step to automatically generate the matching rules.

Connecting Snowflake to Tilores

Sign up for a few Tilores account at app.tilores.io. After signing in choose “Upload Data File”. And then upload the CSV file from the previous step, Then click “Yes, use rules” and complete the wizard. It will take around 3 minutes to finish setting up.

Now we will create an API integration which will allow Snowflake to make calls to Tilores.

Snowflake will always be the one initiating the connection, sending and requesting data from Tilores, not the other way around.

In the Tilores UI go to Integration — Snowflake and under the section “Where to get these values?” copy the code sections and run them in a new worksheet in your Snowflake workspace.

The result of the last query should have API_AWS_IAM_USER_ARN, copy the value and use it in User ARN field in Tilores UI. Also copy the value of **API_AWS_EXTERNAL_ID **and use it in External ID in Tilores UI then click **UPDATE **to activate the connection.

The following banner should show:

image

Now Snowflake and Tilores are connected.

Setup automatic entity identification in Snowflake

Next we will create a task which will process any new data added to the **customers **table we created at the beginning and create a row with an **entity_id **in a new table customers_entities. Run the following in Snowflake:

Testing the automation

After two minutes, a row is created in customers_entities for each customer. So if we perform a join we should be able to tell which customer rows belong to which actual customer.

Should result in:

image

Based on the entity ids, it shows that these rows represent only three actual customers.

And if we now add another customer with a similar name and a few spelling mistakes, and a different email domain. It should end up with the same **entity_id **as the first row.

Wait two minutes, then run the following again:

The result should now be:

image

In this case entity_id is the actual customer identifier.

Conclusion - Identity Resolution in Snowflake

By introducing unique entity IDs for customers, this identity resolution approach ensures an accurate customer 360 view in Snowflake and allows for easy customer segmentation for marketing. Further use cases include detecting duplicate account creation associated with free trial abuse.

Ready to try entity resolution?

Start Building Free →