Deduplication Showcase

Companies House Directors Deduplicated

Companies House is the UK's registry of companies and directors.
Unfortunately, the database has many duplicate director records.

So we used Tilores to deduplicate the Companies House directors database.

10,108,807 directors before deduplication

9,494,686 directors after Tilores

 

Disclaimer: this is an experimental showcase and the results displayed should not be considered definitive. Due to the quality of the available data, false positive associations may occur (see technical details below). This tool should be used for research purposes only. Tilores provides no guarantee as to the accuracy of the data.

Profile Metric Map

  • Entities : Number of recognized persons with an address in the locality.
  • Profiles : Number of Companies House profiles with an address in the locality.
  • Maximum Profiles Per Entity : Number of profiles belonging to the person with the highest profile count whose most frequent address is in the locality.
  • Profile Variation : The deviation of person profile count from the ideal number, the ideal being one profile per person.
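
As a rough illustration of how these four map metrics could be derived once profiles have been linked to entities, here is a minimal Python sketch. The input layout, the field names and especially the Profile Variation formula (profiles per entity minus one) are assumptions made for illustration; the showcase does not publish the exact calculation.

```python
from collections import Counter, defaultdict

# Hypothetical input: one row per Companies House profile, already linked
# to an entity (person) by deduplication. Field names are illustrative.
profiles = [
    {"entity": "e1", "locality": "Leeds"},
    {"entity": "e1", "locality": "Leeds"},
    {"entity": "e1", "locality": "York"},
    {"entity": "e2", "locality": "Leeds"},
    {"entity": "e3", "locality": "York"},
]


def locality_metrics(profiles, locality):
    # Collect the locality of every profile, grouped by entity.
    by_entity = defaultdict(list)
    for p in profiles:
        by_entity[p["entity"]].append(p["locality"])

    # Entities: persons with at least one address in the locality.
    entities = [e for e, locs in by_entity.items() if locality in locs]

    # Profiles: profiles whose address is in the locality.
    profile_count = sum(1 for p in profiles if p["locality"] == locality)

    # Maximum profiles per entity: total profile count of the largest
    # entity whose most frequent address is in the locality.
    home_sizes = [
        len(locs)
        for locs in by_entity.values()
        if Counter(locs).most_common(1)[0][0] == locality
    ]
    max_profiles = max(home_sizes, default=0)

    # Profile variation (ASSUMED formula): average number of profiles per
    # entity, minus the ideal of one profile per person.
    variation = profile_count / len(entities) - 1 if entities else 0.0

    return {
        "entities": len(entities),
        "profiles": profile_count,
        "max_profiles_per_entity": max_profiles,
        "profile_variation": variation,
    }


leeds = locality_metrics(profiles, "Leeds")
print(leeds)
```

With this toy data, Leeds has two entities and three profiles, so the assumed variation is 0.5.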

Why did we deduplicate Companies House directors?

The UK is arguably the easiest country in the world in which to register a company. For as little as GBP12, a UK limited liability company can be registered within minutes.

However, speed comes at the cost of accuracy and, as Companies House, the UK’s registry of companies and directors, says itself, it “does not verify the accuracy of the information filed”. This is problematic.

The result, as the Guardian newspaper puts it, is a registry full of “Fakes, fast sign-ups and frauds.”

Whilst the Guardian article focuses on criminally fraudulent company registrations, we want to focus our attention here on a related topic that is close to our hearts at Tilores: duplicate data.

Companies House is riddled with duplicated company director data, including many notable business people (as we will see below). It has, to use the data science term, an “identity resolution” problem. This makes it difficult to conduct due diligence on company directors and erodes trust in Companies House data.

614,121 Duplicate Directors

Since we hate messy, duplicate data at Tilores, we wanted to do something about it. So we downloaded the entire Companies House Directors database and loaded it into Tilores to deduplicate the data.

Before deduplication in Tilores, the Companies House directors database had 10,108,807 records. After deduplication it has 9,494,686 records¹, meaning that 614,121 of the records were duplicates of other director records.

If you want to check it out yourself, you can search our deduplicated version of the UK’s Companies House database in a Tilores instance above. Below we will go over some of the technical details of this data deduplication and identity resolution challenge, but before that let's look at some interesting examples to illustrate the Companies House duplicate data problem.

N.B. Tilores does not mean to suggest that any of these duplicate records are a result of fraudulent activity.

¹ This represents a static snapshot of the Companies House database, updated until the end of July 2023. We could take a daily update of Companies House data to keep it always up to date. Let us know if you think we should do this!

Duplicate Director Data Examples

David Beckham

The husband of former Spice Girl, Victoria (aka "Posh Spice"), has got his fingers in a few British pies. Indeed, he has 6 separate Companies House profiles, associated with 15 companies, including Victoria Beckham Limited. Companies House shows seven profiles for David Beckham, as one of the profiles is associated only with a dissolved company.

David Beckham Deduplicated

Ed Sheeran

Sometimes accused (unfairly, we would say, and the courts recently agreed) of copying others’ music, it seems there are also a few duplicates of young Edward Christopher Sheeran himself, with 3 versions of the "singer/songwriter" (which, strangely, is listed as his occupation for only one company) in the registry.

Ed has been associated with a total of 11 companies, including FAT PUNT LTD and the catchily named HAYAGOTATOURBOI TOURING LLP, all of which you can see on his deduplicated Tilores company director page.

Duncan Bannatyne

Duncan is our favourite former Dragons’ Den dragon. Not just because he is Scottish, but also because he was once in the Royal Navy and was court-martialed for threatening to throw his commanding officer overboard. Legend.

Monaco resident Duncan has 8 profiles in Companies House and is linked to 33 companies. 

Peter Jones

Another Dragons' Den favourite. Never known to wear the same pair of socks twice, the same cannot be said about Peter’s presence in Companies House, where we found 8 duplicate accounts, linked to 43 companies.

What is interesting in Peter's case is that at some point someone made a mistake when entering data into Companies House and put Peter's former name as James Holdgate. James actually seems to be a company secretary.

Why is Companies House Director Data so Bad?

Every time you register a new company and list yourself as a director, Companies House will treat you as a completely new record if you use a different registered postal address.

Whereas a company is only added once to Companies House, and has a unique ID (the company registration number), individual directors are added multiple times, by the directors themselves, with no unique identifier connected to them. A classic example of an identity resolution problem! 

The Consequences of the Companies House Identity Resolution Problem

Quite simply, it makes it more difficult to conduct due diligence on companies and their directors. If you look at a given company, check their directors, then click on the name of the director, you would expect to see all companies with which they are associated.

This is not the case. You need to search separately for that director, and then try to work out from the results which records are actually about the director you are interested in.

Whilst the duplication in Companies House data is mostly an inconvenience, in banking, customers with multiple accounts can become a real compliance liability.

Take the recent example of Block, where Hindenburg Research uncovered evidence of multiple related accounts that were potentially being used for nefarious activities. Indeed, in some of the recent crypto-related bank crashes, it has come to light that customers had multiple accounts which could be used for laundering money using cryptocurrencies.

If a bank is performing perpetual KYC properly, it should know the instant a new customer joins if they are related to another account. This is technically challenging if you do not have a real-time identity resolution technology. 

How Did we Deduplicate the Companies House Directors Data?

Tilores is a high-performance entity/identity resolution technology. That means it can deduplicate and link large volumes of record data, in real-time, resulting in clean “identity graphs” that become the source of truth about your data. You might consider them “golden customer records”.
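
To make the idea of an identity graph more concrete, here is a minimal Python sketch of one common way to turn pairwise links between records into entities: connected components via union-find. This is a generic technique shown for illustration, not a description of how Tilores is implemented internally; the record IDs and the `resolve_entities` helper are hypothetical.

```python
def resolve_entities(record_ids, links):
    """Group records into entities, given pairwise links between them."""
    # Every record starts out as its own entity.
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each link merges the two entities its records belong to.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the records under each representative.
    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


# Hypothetical director profiles: p1-p2 and p2-p3 were linked by a rule,
# so p1, p2 and p3 end up in one entity even though p1 and p3 never
# matched each other directly.
records = ["p1", "p2", "p3", "p4", "p5"]
links = [("p1", "p2"), ("p2", "p3")]
entities = resolve_entities(records, links)
print(entities)  # [['p1', 'p2', 'p3'], ['p4'], ['p5']]
```

Five profiles collapse into three entities here, which is the same kind of reduction as the before and after record counts quoted above.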

First, we downloaded the data for all ~5 million companies from Companies House as one zip file. Once we had the companies data, we used each company's registered company number to query the Companies House API for that company's directors.

The API is rate limited to 600 queries per 5 minutes, so this took a while: over a month, in fact, to download all 10 million directors.
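
The timescale follows from the rate limit. The back-of-the-envelope Python sketch below estimates the minimum download time, assuming exactly one API request per company and the rounded figure of 5 million companies; both are simplifications, and extra requests for paginated responses or retries would push the real figure higher.

```python
COMPANIES = 5_000_000        # "~5 million" from the text, rounded
REQUESTS_PER_WINDOW = 600    # rate limit quoted above
WINDOW_SECONDS = 5 * 60      # ... per 5 minutes
SECONDS_PER_DAY = 86_400


def min_download_days(n_requests,
                      per_window=REQUESTS_PER_WINDOW,
                      window_seconds=WINDOW_SECONDS):
    """Lower bound on elapsed days when every rate-limit window is used fully."""
    windows = -(-n_requests // per_window)  # ceiling division
    return windows * window_seconds / SECONDS_PER_DAY


days = min_download_days(COMPANIES)
print(f"at least {days:.1f} days")  # at least 28.9 days
```

So even a perfectly efficient client running around the clock needs roughly four weeks, which is consistent with the download taking over a month in practice.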

Next we used samples of the data to come up with data matching rules. Tilores is designed to use fuzzy matching on record attributes, such as names, to link records when, for example, the same name is spelt slightly differently in two different records.

Within Tilores any of the following rule combinations would trigger a link between two records:

Tilores Companies House Rules

The fuzzy matching used for “similar names” combined Metaphone with Levenshtein. If two records had a similar name and the same date of birth, or a similar name and exactly the same postcode, then they were considered a match.
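
The rule described above can be sketched in a few lines of Python. This is an illustration only: it replaces the Metaphone and Levenshtein combination with a plain Levenshtein threshold, and the threshold of 2, the field names and the sample records are all assumptions rather than the actual Tilores configuration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similar_name(a, b, max_distance=2):
    # ASSUMPTION: names count as similar within 2 edits after normalising.
    return levenshtein(a.lower().strip(), b.lower().strip()) <= max_distance


def is_match(r1, r2):
    """Similar name AND (same date of birth OR exactly the same postcode)."""
    if not similar_name(r1["name"], r2["name"]):
        return False
    same_dob = r1["dob"] is not None and r1["dob"] == r2["dob"]
    same_postcode = (r1["postcode"] is not None
                     and r1["postcode"] == r2["postcode"])
    return same_dob or same_postcode


# Fictional records; "dob" holds year and month only.
a = {"name": "Jonathan Smith", "dob": "1975-05", "postcode": "AB1 2CD"}
b = {"name": "Jonathon Smith", "dob": "1975-05", "postcode": "ZZ9 9ZZ"}
c = {"name": "Jonathon Smith", "dob": "1980-01", "postcode": "AB1 2CD"}
d = {"name": "Mary Jones", "dob": "1975-05", "postcode": "AB1 2CD"}

print(is_match(a, b))  # True: similar name, same date of birth
print(is_match(a, c))  # True: similar name, same postcode
print(is_match(b, c))  # False: neither date of birth nor postcode agree
print(is_match(a, d))  # False: names are not similar
```

Note that b and c do not match each other directly, yet both match a, so all three would still end up linked into the same entity.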

Important: Companies House data only publicly provides the month and year of birth, not the day


In a general population that would not be reliable for matching identities because there is a high probability that two people with the same name are born in the same month and year. However, given company directors represent a significant sub-section of the population, we felt that this matching would be acceptable for a public showcase.

Nevertheless, this means that "false positive" matches are unavoidable.

If Companies House were running this instance of Tilores themselves, they could use the exact date of birth, including day, for the matching (and still only display the month publicly), thus increasing accuracy.

This is an example of how the data for one deduplicated and linked director looks in identity graph form in the Tilores UI:

Importing Companies House Data

Importing 10 million records could take a while. However, as Tilores is built on AWS serverless technology, it is technically not a challenge, and the speed of import only really depends on how many resources we deploy.

Note that we have taken a one-off snapshot of the Companies House data as of the end of July 2023. We could extend this service by connecting the Companies House API to Tilores so that new directors are added and deduplicated as they are created. This would make Tilores a useful source of truth about UK company directors. Let us know if you think we should do this.

How does search work?

Tilores is not a search engine per se; nevertheless, you need to be able to search the data that Tilores holds.

You will notice that when searching the Tilores Companies House instance, you need to supply either a year of birth OR the city location of a director. This is to improve search performance when common names are searched, as Tilores is designed to preferably return a single result when it is searched.
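
The effect of that requirement can be illustrated with a small Python sketch. It is a simplified, in-memory stand-in for the showcase's search form, not the Tilores search API; the `search` function, the field names and the sample entities are hypothetical.

```python
def search(entities, name, birth_year=None, city=None):
    """Name search that insists on a year of birth or a city as a filter."""
    if birth_year is None and city is None:
        raise ValueError("supply a year of birth or a city to narrow the search")

    wanted = name.casefold()
    results = []
    for e in entities:
        if wanted not in e["name"].casefold():
            continue
        if birth_year is not None and e["birth_year"] != birth_year:
            continue
        if city is not None and e["city"].casefold() != city.casefold():
            continue
        results.append(e)
    return results


# Fictional deduplicated entities sharing a common name.
entities = [
    {"id": "e1", "name": "John Smith", "birth_year": 1970, "city": "London"},
    {"id": "e2", "name": "John Smith", "birth_year": 1985, "city": "Leeds"},
]

print(search(entities, "john smith", birth_year=1970))  # only e1
print(search(entities, "John Smith", city="leeds"))     # only e2
```

A name on its own would return both entities, which is why the extra filter helps a common name resolve to a single result.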

Why have Companies House not done something about this?

It is technically very challenging to deduplicate and match data on this scale. That is the reason we built Tilores in the first place.

Companies House is actually hosted on AWS, so if they wanted, we would be able to sort their own data pretty quickly and painlessly using Tilores. Same for any other organisation that relies on Companies House data.

The Legal Basis

Companies House information is publicly available and free to search. The data held within forms the basis of all business credit reports supplied by well-known credit agencies such as Experian and Creditsafe, and the data is also used by many other organisations for marketing and lead generation.

Tilores is relying on the “legitimate interest” ground for processing this personal data, whereby we aim to provide a free service to help people conduct better due diligence on companies and their directors for potential business transactions, to reduce risk and avoid fraud.

Please note again the following Disclaimer: this is an experimental showcase and the results displayed should not be considered definitive. Due to the quality of the available data, false positive associations may occur (see technical details above). This tool should be used for research purposes only and Tilores provides no guarantee as to the accuracy of the data.

Tilores Fuzzy Matching Algorithms

We provide the following fuzzy matching algorithms for the deduplication and linking of data in Tilores (docs):

  • Cosine : Cosine similarity
  • DamerauLevenshteinAT : Damerau-Levenshtein distance with adjacent transpositions
  • DamerauLevenshteinOSA : Damerau-Levenshtein with optimal string alignment distance
  • Jaccard
  • Jaro : Jaro similarity
  • JaroWinkler : Jaro-Winkler similarity
  • LCS : Longest common subsequence
  • Levenshtein : Levenshtein distance
  • SorensenDice : Sørensen–Dice coefficient
  • QGram : Q-gram
  • Hamming : Hamming distance - not in our online tools
  • Fuzzy Wuzzy
  • Cologne Phonetic
  • Soundex Phonetic
  • Metaphone Phonetic
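
To give a feel for how a few of these measures behave, here is a minimal Python sketch of three of them: Jaccard and Sørensen–Dice over character bigrams, and Hamming distance. These are textbook definitions written for illustration; the Tilores implementations may tokenise or normalise differently, and the function names are our own.

```python
def bigrams(s):
    """Set of adjacent character pairs, e.g. 'abc' -> {'ab', 'bc'}."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Shared bigrams divided by all distinct bigrams (1.0 = identical sets)."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 1.0


def sorensen_dice(a, b):
    """Twice the shared bigrams divided by the sum of both set sizes."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y)) if x or y else 1.0


def hamming(a, b):
    """Number of positions that differ; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


# "night" and "nacht" share only the bigram "ht" out of 7 distinct bigrams.
print(jaccard("night", "nacht"))        # 0.14285714285714285
print(sorensen_dice("night", "nacht"))  # 0.25
print(hamming("night", "nacht"))        # 2
```

Similarity scores such as Jaccard and Sørensen–Dice grow towards 1 as strings become more alike, whereas distances such as Hamming and Levenshtein shrink towards 0, so a matching rule needs a threshold in the right direction for the measure it uses.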

Are we missing a fuzzy matching algorithm you would like to test?

About Tilores

When you need to do fuzzy matching on high-volume data in real-time, you need a built-for-purpose technology: enter Tilores.

  • Consistently fast search response times
  • Built for unlimited serverless scaling
  • Real-time data ingestion and simultaneous search
  • Configure matching rules easily in the UI
  • Data privacy compliant by design
  • Identity resolution for fraud prevention, KYC and marketing

©2023 Tilores, All rights reserved.