The Identity Resolution of Kevin Bacon

By Steven Renwick

Six Degrees of Kevin Bacon

You may have heard of “Six Degrees of Kevin Bacon”. It is a game in which you find the shortest chain of connections between the American actor Kevin Bacon and another actor, where each link in the chain is a film in which two actors appeared together.

For example, British actor Simon Pegg appeared in the Mission: Impossible films together with Tom Cruise, and Tom Cruise appeared in A Few Good Men with Kevin Bacon. Simon Pegg therefore has a “Bacon number” of 2.

Six Degrees of Kevin Bacon to Simon Pegg — Simon Pegg is linked to Kevin Bacon via Tom Cruise

You can test any actor for their Bacon number on the website The Oracle of Bacon.

The game is based on the concept of “six degrees of separation”, which posits that any two people on the planet can be connected through a chain of no more than six acquaintances. This means that, in theory, all of us have a maximum Bacon number of 6.

The Six Degrees of Kevin Bacon is actually a pretty good example of network or graph theory (or a social graph), whereby each actor is a “node” and the connections between them are “edges”. It is the perfect sort of data for a graph database, such as Neo4j, where every actor is somehow connected to another via a pathway of nodes and edges.
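As a rough illustration of the node-and-edge idea, here is a minimal Python sketch that computes a Bacon number with a breadth-first search. The tiny `FILMS` dataset and the function names `build_graph` and `bacon_number` are made up for this example; a real graph database such as Neo4j would store and query this very differently.

```python
from collections import deque

# Toy cast lists (illustrative only; a real dataset would be far larger).
FILMS = {
    "Mission: Impossible III": ["Tom Cruise", "Simon Pegg"],
    "A Few Good Men": ["Tom Cruise", "Kevin Bacon"],
    "Apollo 13": ["Kevin Bacon", "Tom Hanks"],
}

def build_graph(films):
    """Each actor is a node; two actors share an edge if they share a film."""
    graph = {}
    for cast in films.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph

def bacon_number(graph, actor, target="Kevin Bacon"):
    """Breadth-first search: the first time we reach the target is the shortest path."""
    if actor not in graph or target not in graph:
        return None
    seen = {actor}
    queue = deque([(actor, 0)])
    while queue:
        current, distance = queue.popleft()
        if current == target:
            return distance
        for neighbour in graph[current] - seen:
            seen.add(neighbour)
            queue.append((neighbour, distance + 1))
    return None  # no chain of films connects the two actors

graph = build_graph(FILMS)
print(bacon_number(graph, "Simon Pegg"))  # 2
```

Breadth-first search fits here because every edge has the same "cost" (one shared film), so the first path found is also the shortest.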

Kevin Bacon can also be used to illustrate the difference between graphs and entity resolution. Or rather, why graph databases don’t work well for entity resolution.

Entity Resolution of Kevin Bacon

Entity resolution is the process of connecting data records that may differ from one another but describe the same real-world entity (e.g. a person, or identity), hence it is also sometimes known as “record linkage”. These data records are still nodes, and the connections between them are still edges. In the Six Degrees of Kevin Bacon, however, we are connecting different entities (the various actors), so it is not entity resolution.

Nevertheless, the esteemed Mr Bacon can still be used as an example of entity/identity resolution.

What links Ren McCormack, Valentine McKee, Jack Swigert and Sebastian Caine?

They are all roles that Kevin Bacon has played in movies throughout his career (Footloose (1984), Tremors (1990), Apollo 13 (1995), and Hollow Man (2000)).

Each record is unique, as the character name is different. However, by matching on the actor’s real name, and perhaps adding an attribute such as his date of birth to be sure (1), we now know that all these characters relate to the same real-world entity, in this case an actor (2).
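The exact-match case can be sketched in a few lines of Python. The record layout, the field names and the `resolve_exact` helper are assumptions made for this illustration, not the format of any particular system; the one non-Bacon record is included only to show that it lands in a separate entity.

```python
from collections import defaultdict

# Hypothetical character records; field names are illustrative.
records = [
    {"character": "Ren McCormack", "film": "Footloose", "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Valentine McKee", "film": "Tremors", "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Jack Swigert", "film": "Apollo 13", "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Sebastian Caine", "film": "Hollow Man", "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Daniel Kaffee", "film": "A Few Good Men", "actor": "Tom Cruise", "dob": "1962-07-03"},
]

def resolve_exact(records):
    """Group records whose normalised actor name and date of birth are identical."""
    entities = defaultdict(list)
    for record in records:
        key = (record["actor"].strip().lower(), record["dob"])
        entities[key].append(record["character"])
    return dict(entities)

entities = resolve_exact(records)
for key, characters in entities.items():
    print(key, characters)
```

This only works while names and dates are recorded identically in every record, which is exactly the assumption that breaks down with messier data.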

Kevin Bacon based Identity Resolution — Some of Kevin Bacon's stand-out roles

In reality, this is quite an easy example, if we assume that in each data record Kevin’s correctly spelt name is used and his date of birth is correctly recorded.

Fuzzy Matching of Kevin Bacon

However, if we imagine a dataset in which the data is not so clean, then we have to use so-called “fuzzy matching” techniques to recognise that Kevin Bacon and Kev Bakon (perhaps a data entry error) are the same person even though the records differ, and that neither is the same person as Michael Bacon (Kevin’s brother).

Kevin Bacon is not Michael Bacon — Differentiating Kevin Bacon from his brother Michael Bacon

In this case, using the Jaro-Winkler fuzzy matching algorithm, we can see that Kevin and Kev have a 90.7% similarity, whilst Bacon and Bakon have an 89.3% similarity. Michael, by contrast, clearly matches neither Kevin nor Kev.
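To show where those percentages come from, here is a small pure-Python sketch of Jaro-Winkler similarity. It assumes the commonly used prefix weight of 0.1 and a maximum prefix length of 4, and it always applies the prefix bonus (some implementations only do so above a threshold). It is a teaching sketch, not the implementation used by any specific product.

```python
def jaro(s1, s2):
    """Jaro similarity: shared characters within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    transpositions = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)

print(round(jaro_winkler("Kevin", "Kev"), 4))    # 0.9067
print(round(jaro_winkler("Bacon", "Bakon"), 4))  # 0.8933
print(round(jaro_winkler("Michael", "Kevin"), 4))
```

In practice a match threshold is chosen per field, and comparisons are usually made on normalised values (lower-cased, trimmed) rather than raw input.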

Deduplication of Kevin Bacon

Let’s dive deeper into Kevin Bacon…

Our first example of the Kevin Bacon entity is an over-simplified representation. If we were matching based on name and date of birth, then every record would match every other record, so all four records would connect to each other. Four records give 4 × 3 / 2 = 6 pairs, so we would have 6 “edges” rather than 3.

Over-edgy Kevin Bacon — Kevin Bacon has too many edges here

Since more edges mean more complexity, in Tilores we would very likely “deduplicate” these four records, so that one Kevin Bacon becomes the master record, and the duplicates are only connected to the master Kevin Bacon record. Any further non-identical Kevin Bacons would only be linked to the master Kevin Bacon record, thus reducing the number of edges. All the duplicate Kevin Bacon records, and their metadata, are still available via the master Kevin Bacon, but the edges are now reduced so the entity is simpler and faster to retrieve.
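The difference in edge counts can be checked with a short Python sketch. It assumes the first record in the list plays the role of master; the variable names are illustrative only.

```python
from itertools import combinations

records = ["Ren McCormack", "Valentine McKee", "Jack Swigert", "Sebastian Caine"]

# Every record linked to every other record: n * (n - 1) / 2 edges.
full_mesh = list(combinations(records, 2))

# Deduplicated: each duplicate linked only to the master record: n - 1 edges.
master, *duplicates = records
star = [(master, duplicate) for duplicate in duplicates]

print(len(full_mesh), len(star))  # 6 3
```

The gap widens quickly: the fully connected layout grows quadratically with the number of records, while the master-based layout grows linearly.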

Deduplication of Kevin Bacon improves identity resolution performance

I’ll leave you to decide for yourself which of Kevin Bacon’s roles is his “master” role, but in the case of Tilores the master record is usually the first record that is ingested. If the master record is deleted, then the entity will immediately reorganise such that the second ingested record is now the master record, thus maintaining the entity’s data integrity.
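The "first ingested record is the master" rule can be modelled with an ordered list. The `Entity` class below is a hypothetical toy written for this article, not Tilores code or its API; it only illustrates how the next-oldest record can take over when the master is deleted.

```python
class Entity:
    """Hypothetical entity that keeps its records in ingestion order."""

    def __init__(self):
        self.records = []  # index 0 is treated as the master record

    def ingest(self, record_id):
        self.records.append(record_id)

    def delete(self, record_id):
        # Removing the master automatically promotes the next-oldest record.
        self.records.remove(record_id)

    @property
    def master(self):
        return self.records[0] if self.records else None

    def edges(self):
        """Duplicates link only to the master, never to each other."""
        return [(self.master, record_id) for record_id in self.records[1:]]

kevin = Entity()
for role in ["Ren McCormack", "Valentine McKee", "Jack Swigert", "Sebastian Caine"]:
    kevin.ingest(role)

print(kevin.master)   # Ren McCormack
kevin.delete("Ren McCormack")
print(kevin.master)   # Valentine McKee
print(kevin.edges())
```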

So why not graph databases for entity resolution?

If we tried to use a graph database for our Kevin Bacon entity resolution exercise, then all our Kevin Bacon character records would be mixed in with the character records of every other actor, and everybody would be connected (via edges) to everybody else. It would be possible to find all the Kevin Bacon characters, but the sheer number of edges to traverse would make it incredibly slow to retrieve the Kevin Bacon entity.

In an entity resolution system, all the character records are still in there, but only the related data records are linked together. The system still uses graph theory (at least in Tilores), but only at the entity level. This means that the entire system is significantly faster, and much more suitable for real-time use cases, such as fraud detection, KYC and customer loyalty in ecommerce.
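One way to picture "graph theory at the entity level" is as a set of small, separate connected components rather than one giant graph. The sketch below groups records into entities from a list of pairwise match links using a simple union-find structure. The record IDs, the `links` list and the helper names are invented for illustration; this is a general technique, not a description of how Tilores works internally.

```python
def find(parent, x):
    """Return the representative record of x's entity (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the entities containing records a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a

record_ids = ["r1", "r2", "r3", "r4", "r5"]
# Hypothetical match links produced by some matching step.
links = [("r1", "r2"), ("r2", "r4"), ("r3", "r5")]

parent = {record_id: record_id for record_id in record_ids}
for a, b in links:
    union(parent, a, b)

entities = {}
for record_id in record_ids:
    entities.setdefault(find(parent, record_id), []).append(record_id)

print(entities)  # {'r1': ['r1', 'r2', 'r4'], 'r3': ['r3', 'r5']}
```

Retrieving one entity then only touches that entity's handful of records, regardless of how many other entities exist.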

If you need to do entity resolution on thousands or millions of Kevin Bacon records then you are welcome to talk to us to see if we can use Tilores to help you. We won’t judge you. On the contrary - we applaud your dedication to cleaning Kevin Bacon-related data.

Footnotes

(1) Universally accepted actor law helps us here as no two actors can have the same name. Fun fact: Irish actor Killian Scott’s real name is Cillian Murphy, so he had to change his name because of unnaturally attractive actor Cillian Murphy (also Irish), who claimed his name in the acting world first.

(2) I’m not saying actors are not people. They are. And they have feelings too. Especially Kevin Bacon.

Cillian Murphy and Killian Scott — Look at Cillian Murphy's eyes (the one on the left). Just look at them. My goodness...
