Automating Data Deduplication in Snowflake with Tilores

By

Preventing Free Trial Abuse using Identity Resolution with your Snowflake Customer Data 

The problem

Ending up with duplicate customer data is almost unavoidable, whether it is due to acquiring data from different sources, customers registering multiple times or mistakes made during manual data entry. Another possible cause of duplicate customer data is when customers sign up online multiple times, each time slightly changing their details, to take advantage of free trials or first-use discounts. 

This leads to multiple rows in a table that represent the same customer, company or any other kind of entity. A classic identity resolution problem. So how can one solve this in Snowflake?

What to expect?

In this guide we will start with a Snowflake table that contains duplicated customer data and we will end up with another table introducing a new field (entity_id) which is a unique identifier for each customer.

undefined

An entity_id will be automatically calculated for any new data added to the table.

Create test table in Snowflake

Create a new worksheet and run the following:

Download the result as CSV. We will use this in the next step to automatically generate the matching rules.

Connecting Snowflake to Tilores

Sign up for a few Tilores account at app.tilores.io. After signing in choose “Upload Data File”. And then upload the CSV file from the previous step, Then click “Yes, use rules” and complete the wizard. It will take around 3 minutes to finish setting up.

Now we will create an API integration which will allow Snowflake to make calls to Tilores.

Snowflake will always be the one initiating the connection, sending and requesting data from Tilores, not the other way around.

In the Tilores UI go to Integration — Snowflake and under the section "Where to get these values?" copy the code sections and run them in a new worksheet in your Snowflake workspace.

The result of the last query should have API_AWS_IAM_USER_ARN, copy the value and use it in User ARN field in Tilores UI. Also copy the value of API_AWS_EXTERNAL_ID and use it in External ID in Tilores UI then click UPDATE to activate the connection.

The following banner should show:

undefined

Now Snowflake and Tilores are connected.

Setup automatic entity identification in Snowflake

Next we will create a task which will process any new data added to the customers table we created at the beginning and create a row with an entity_id in a new table customers_entities. Run the following in Snowflake:

Testing the automation

After two minutes, a row is created in customers_entities for each customer. So if we perform a join we should be able to tell which customer rows belong to which actual customer.

Should result in:

undefined

Based on the entity ids, it shows that these rows represent only three actual customers.

And if we now add another customer with a similar name and a few spelling mistakes, and a different email domain. It should end up with the same entity_id as the first row.

Wait two minutes, then run the following again:

The result should now be:

undefined

In this case entity_id is the actual customer identifier.

Conclusion - Identity Resolution in Snowflake

By introducing unique entity IDs for customers, this identity resolution approach ensures an accurate customer 360 view in Snowflake and allows for easy customer segmentation for marketing. Further use cases include detecting duplicate account creation associated with free trial abuse.

Related

Posts

Explore Similar Articles

The API to unify scattered customer data in real-time.

Get the latest updates

©2023 Tilores, All right reserved.