A brief introduction to data deduplication
Duplicate data is virtually impossible to prevent, and its impact can be a serious drain on organizational resources. Any seasoned data professional will be familiar with the trouble it causes. In fact, you’ve probably had to deal with duplicate data yourself. Whether you have uploaded data to a system through an import process or mistakenly created a duplicate entry by hand, it’s easily done.
What’s even more irritating is that, because duplicate data is so ubiquitous, some of it is pretty much guaranteed to slip past your data validation on its way into your system, no matter how good your processes and checks are. And while duplicated data may sound relatively innocuous (save for how annoying it is, at least), it’s a challenge that many businesses ignore at their peril.
Duplicate data is harmful
Once duplicate data enters your ecosystem, it can lead to all sorts of problems and headaches. Losses in productivity, wasted marketing budgets, disjointed customer service, and additional storage costs are just a few of them, and they lead to an estimated annual loss of more than $6 billion to businesses in the United States alone.
If you are working with a lot of data, then, it is crucial for you and your data teams to understand how to reduce instances of duplicated data if your application is to be a success. This is where data deduplication comes in.
What is data deduplication?
In simple terms, data deduplication, or ‘Dedup’ for short, is a process that eliminates excess copies and other redundant data within a dataset. Deduplication works by deleting the redundant data so that only a single copy remains in storage. Any duplicate or redundant data that is purged is replaced with a reference that points to the stored chunk, thus significantly reducing storage capacity requirements.
Data deduplication can be run as an inline process that operates in real-time as data is being ported into a system. It can also be run as a background process to catch and remove duplicates after data has been written to disk.
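To make the idea of "a single stored copy plus references" concrete, here is a minimal sketch of inline, chunk-level deduplication. It is an illustration only: the `ChunkStore` class, the `write_file` and `read_file` helpers, and the tiny fixed chunk size are all assumptions made for this example, not part of any particular product.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (the single stored copy)

    def put(self, chunk: bytes) -> str:
        """Store the chunk if it is new and return its reference (SHA-256 digest)."""
        digest = hashlib.sha256(chunk).hexdigest()
        self.chunks.setdefault(digest, chunk)  # duplicates are not stored again
        return digest

    def get(self, ref: str) -> bytes:
        """Resolve a reference back to the stored chunk."""
        return self.chunks[ref]


def write_file(store: ChunkStore, data: bytes, chunk_size: int = 4) -> list:
    """Split data into fixed-size chunks and return one reference per chunk."""
    return [
        store.put(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def read_file(store: ChunkStore, refs: list) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store.get(ref) for ref in refs)


store = ChunkStore()
original = b"ABCDABCDABCDXYZ!"
refs = write_file(store, original)

print(len(refs), "references,", len(store.chunks), "chunks stored")
```

Because duplicates are detected as each chunk is written, this sketch corresponds to the inline mode. A background process would apply the same hashing to data that is already on disk.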
Deduplication in action
An identical duplicate is straightforward: if two records have exactly the same values for all of their fields, they are identical duplicates. For non-identical duplicates, the situation is somewhat more complex.
Let’s look at an example.

Above we have two simple data records for a person, Jim John Smith. Although all the values that are used for matching are the same, the records differ in their someData values, which means that this is not an identical duplicate. Visualized, the records could look something like this:

You may be thinking, “meh, that’s fine”, but this is what four of these records look like:

Now multiply this by 40, 400, or even 4,000. The biggest problem is not the visualization, though. It’s the number of records that need to be indexed, which can cause bloating and sluggishness over time while simultaneously inflating costs.
Data deduplication steps in to solve this problem. And although it is a common concept, not all deduplication techniques work the same way. In TiloRes, deduplication is managed through rules*. These can be defined in several ways, for example to distinguish identical and non-identical duplicates and treat them differently when it comes to indexing. In the excerpt below, the rule R1EXACT has been applied to ensure that the non-identical duplicates above are stored more efficiently.
*We recommend learning more about rules in TiloRes so that you can benefit from a more contextual understanding — helpful information can be found in our documentation and in our demo.

Thanks to deduplication, the four records from before are now stored like this:

This saves storage space and also improves speed, because a search returns a hit on one record rather than on every available record.
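The general idea of collapsing non-identical duplicates under an exact-match rule can be sketched as follows. This is not TiloRes's implementation or API: the field names in `MATCH_FIELDS`, the record layout, and the `match_key` and `deduplicate` functions are hypothetical and only illustrate indexing one entry per set of matching values.

```python
from collections import defaultdict

# Hypothetical matching fields; the real rule definition is not shown here.
MATCH_FIELDS = ("firstName", "middleName", "lastName")


def match_key(record: dict) -> tuple:
    """Build the key that an exact-match rule would compare on."""
    return tuple(record[field] for field in MATCH_FIELDS)


def deduplicate(records: list) -> dict:
    """Group records that agree on every matching field under one index entry."""
    index = defaultdict(list)
    for record in records:
        # Fields outside MATCH_FIELDS (such as someData) do not affect the key.
        index[match_key(record)].append(record["id"])
    return dict(index)


# Four illustrative records that differ only in someData.
records = [
    {"id": "r1", "firstName": "Jim", "middleName": "John", "lastName": "Smith", "someData": "a"},
    {"id": "r2", "firstName": "Jim", "middleName": "John", "lastName": "Smith", "someData": "b"},
    {"id": "r3", "firstName": "Jim", "middleName": "John", "lastName": "Smith", "someData": "c"},
    {"id": "r4", "firstName": "Jim", "middleName": "John", "lastName": "Smith", "someData": "d"},
]

index = deduplicate(records)
print(len(records), "records,", len(index), "index entry")
```

In this sketch the four records share a single index entry that lists their IDs, so a lookup on the matching fields touches one entry instead of four.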
Why is deduplication needed?
Although data deduplication is often touted as a way for organizations to reduce their storage costs, there are several other, arguably more important, reasons why any organization that handles large volumes of data needs it.
Deduplication can, for example, enable organizations to quickly back up and store large datasets in the cloud and simultaneously make them available to the C-suite for business insights, or help them address emerging compliance, regulatory, and data governance challenges. Deduplication can also help to reduce network load, leaving more bandwidth available to dev teams for critical production tasks.
In essence, advanced data deduplication is helping organizations to better manage potentially overwhelming increases in data volume. At the end of the day, this benefits all stakeholders.
If you want to find out more about deduplication, let’s talk!