Maybe you have heard of “Six Degrees of Kevin Bacon”? It’s a game in which you have to find the shortest chain of shared film appearances linking the American actor Kevin Bacon to any other actor.
For example, the British actor Simon Pegg appeared in Mission Impossible together with Tom Cruise, and Tom Cruise appeared in A Few Good Men with Kevin Bacon. Simon Pegg therefore has a “Bacon number” of 2.
You can test any actor for their Bacon number on the website The Oracle of Bacon.
The game is based on the concept of “six degrees of separation”, which posits that any two people on the planet can be connected through a chain of no more than six acquaintances. This means that, in theory, none of us has a Bacon number higher than 6.
The Six Degrees of Kevin Bacon is actually a pretty good example of network or graph theory in action (specifically, a social graph), whereby each actor is a “node” and the connections between them are “edges”. It is the perfect sort of data for a graph database, such as Neo4j, where every actor is somehow connected to another via a pathway of nodes and edges.
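To make the nodes-and-edges idea concrete, here is a minimal Python sketch that computes a Bacon number with a breadth-first search. The two films and their (heavily abbreviated) casts come from the Simon Pegg example above; the function names are just illustrative, and a real graph database would of course do this over millions of nodes rather than three.

```python
from collections import deque

# Films and casts from the example above (casts deliberately abbreviated).
FILMS = {
    "Mission Impossible": ["Simon Pegg", "Tom Cruise"],
    "A Few Good Men": ["Tom Cruise", "Kevin Bacon"],
}


def build_graph(films):
    """Actors are nodes; two actors share an edge if they appeared in the same film."""
    graph = {}
    for cast in films.values():
        for actor in cast:
            graph.setdefault(actor, set()).update(a for a in cast if a != actor)
    return graph


def bacon_number(graph, actor, target="Kevin Bacon"):
    """Breadth-first search: fewest edges from actor to target, or None if unconnected."""
    if actor not in graph:
        return None
    queue = deque([(actor, 0)])
    seen = {actor}
    while queue:
        current, distance = queue.popleft()
        if current == target:
            return distance
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


graph = build_graph(FILMS)
print(bacon_number(graph, "Simon Pegg"))  # 2
```

Breadth-first search visits actors in order of distance, so the first time it reaches Kevin Bacon it has found the shortest chain.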
Kevin Bacon can also be used to illustrate the difference between graphs and entity resolution. Or rather, why graph databases don’t work well for entity resolution.
Entity resolution is the process of connecting data records that may differ from one another but describe the same real-world entity (e.g. a person, or identity), hence it is also sometimes known as “record linkage”. These data records are still nodes, and the connections between them are still edges, but in the Six Degrees of Kevin Bacon we are connecting distinct entities - the various actors - so it is not entity resolution.
Nevertheless, the esteemed Mr Bacon can still be used as an example of entity/identity resolution.
What links Ren McCormack, Valentine McKee, Jack Swigert and Sebastian Caine?
They are all roles that Kevin Bacon has played in movies throughout his career (Footloose (1984), Tremors (1990), Apollo 13 (1995), and Hollow Man (2000)).
Each record is unique, as the character name is different. However, by matching on the actor’s real name, and perhaps adding an attribute such as his date of birth to be sure(1), we now know that all these characters relate to the same real-world entity, in this case an actor(2).
In reality, this is quite an easy example, if we assume that in each data record Kevin’s correctly spelt name is used and his date of birth is correctly recorded.
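As a rough sketch of this clean-data case, the Python below groups the four character records by an exact key of actor name plus date of birth. The roles and years are the ones listed above; the field names and the `match_key` and `resolve` helpers are assumptions made for illustration, and the date of birth is Kevin Bacon’s publicly listed one.

```python
from collections import defaultdict

# Illustrative records. Field names are assumptions, not a real schema.
records = [
    {"character": "Ren McCormack", "film": "Footloose", "year": 1984,
     "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Valentine McKee", "film": "Tremors", "year": 1990,
     "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Jack Swigert", "film": "Apollo 13", "year": 1995,
     "actor": "Kevin Bacon", "dob": "1958-07-08"},
    {"character": "Sebastian Caine", "film": "Hollow Man", "year": 2000,
     "actor": "Kevin Bacon", "dob": "1958-07-08"},
]


def match_key(record):
    """Exact-match key: normalised actor name plus date of birth."""
    return (record["actor"].strip().lower(), record["dob"])


def resolve(records):
    """Group character names under the real-world entity they belong to."""
    entities = defaultdict(list)
    for record in records:
        entities[match_key(record)].append(record["character"])
    return dict(entities)


entities = resolve(records)
print(entities[("kevin bacon", "1958-07-08")])
```

Exact keys like this only work while every record spells the name and date of birth identically, which is exactly the assumption that breaks down with messier data.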
However, if we imagine a dataset in which the data is not so clean, then we would have to use so-called “fuzzy matching” techniques to recognise that Kevin Bacon and Kev Bakon (perhaps a data entry error) are the same person even though the records differ, while Michael Bacon (Kevin’s brother) is a different person.
In this case, using the Jaro-Winkler fuzzy matching algorithm, we can see that Kevin and Kev have a 90.7% similarity, whilst Bacon and Bakon have an 89.3% similarity. Michael and Kevin/Kev clearly do not match.
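For the curious, here is a small self-contained Python sketch of Jaro-Winkler that reproduces those figures. It uses the commonly quoted defaults (a prefix weight of 0.1 and a prefix capped at four characters) and omits the optional boost threshold that some libraries apply, so treat it as an illustration rather than the exact implementation any particular product uses.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(f"{jaro_winkler('Kevin', 'Kev'):.1%}")    # 90.7%
print(f"{jaro_winkler('Bacon', 'Bakon'):.1%}")  # 89.3%
print(jaro_winkler("Michael", "Kevin") < 0.5)   # True
```

In practice you would pick a threshold (say, somewhere around 85%, though the right value depends on your data) above which two values are treated as a match.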
Let’s dive deeper into Kevin Bacon…
Our first example of the Kevin Bacon entity is in fact an over-simplified representation. If we were matching on name and date of birth, every one of the four records would connect to every other one, so we would have 6 “edges” rather than 3.
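The arithmetic is easy to check with a few lines of Python. Four records that all match each other form n * (n - 1) / 2 = 6 pairs, whereas the minimum needed to keep four records connected is n - 1 = 3. The three-edge layout below is assumed to be a simple chain purely for illustration.

```python
from itertools import combinations

roles = ["Ren McCormack", "Valentine McKee", "Jack Swigert", "Sebastian Caine"]

# Every pair of records matches on name and date of birth, so every pair gets an edge.
full_edges = list(combinations(roles, 2))
print(len(full_edges))  # 6, i.e. n * (n - 1) / 2 for n = 4

# One minimal way to connect the same records: each links only to the next.
chain_edges = list(zip(roles, roles[1:]))
print(len(chain_edges))  # 3, i.e. n - 1
```

The gap widens quickly: 100 fully connected records would need 4,950 edges, against 99 for a minimal layout.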
Since more edges mean more complexity, in Tilores we would very likely “deduplicate” these four records, so that one Kevin Bacon becomes the master record, and the duplicates are only connected to the master Kevin Bacon record. Any further non-identical Kevin Bacons would only be linked to the master Kevin Bacon record, thus reducing the number of edges. All the duplicate Kevin Bacon records, and their metadata, are still available via the master Kevin Bacon, but the edges are now reduced so the entity is simpler and faster to retrieve.
I’ll leave you to decide for yourself which of Kevin Bacon’s roles is his “master” role, but in the case of Tilores the master record is usually the first record that is ingested. If the master record is deleted, then the entity will immediately reorganise such that the second ingested record is now the master record, thus maintaining the entity’s data integrity.
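The master-record idea can be sketched in a few lines of Python. To be clear, this `Entity` class is a hypothetical toy written for this article, not Tilores’ actual API or internals: it simply treats the first ingested record as the master, links every other record only to it, and promotes the next record if the master is deleted.

```python
class Entity:
    """Hypothetical sketch of a master-record entity; not Tilores' actual API."""

    def __init__(self):
        self.records = []  # kept in ingestion order

    def ingest(self, record_id):
        self.records.append(record_id)

    @property
    def master(self):
        """The earliest ingested record still present, or None if empty."""
        return self.records[0] if self.records else None

    @property
    def edges(self):
        """Star layout: every non-master record links only to the master."""
        return [(self.master, record) for record in self.records[1:]]

    def delete(self, record_id):
        # Removing the master implicitly promotes the next ingested record.
        self.records.remove(record_id)


kevin = Entity()
for role in ["Ren McCormack", "Valentine McKee", "Jack Swigert", "Sebastian Caine"]:
    kevin.ingest(role)

print(kevin.master, len(kevin.edges))  # Ren McCormack 3

kevin.delete("Ren McCormack")
print(kevin.master, len(kevin.edges))  # Valentine McKee 2
```

With this layout the edge count grows linearly with the number of records (n - 1) instead of quadratically, which is the point of deduplicating in the first place.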
If we tried to use a graph database for our Kevin Bacon entity resolution exercise, then all our Kevin Bacon character records would be mixed in with the character records of every other actor, and everybody would be connected (by edges) to everybody else. It would be possible to find all the Kevin Bacon characters, but the sheer number of edges to traverse would make retrieving the Kevin Bacon entity very slow.
In an entity resolution system, all the character records are still in there, but only the related data records are linked together. The system still uses graph theory (at least in Tilores), but only at the entity level. This means that the entire system is significantly faster, and much more suitable for real-time use cases, such as fraud detection, KYC and customer loyalty in ecommerce.
If you need to do entity resolution on thousands or millions of Kevin Bacon records then you are welcome to talk to us to see if we can use Tilores to help you. We won’t judge you. On the contrary - we applaud your dedication to cleaning Kevin Bacon-related data.
(1) Universally accepted actor law helps us here as no two actors can have the same name. Fun fact: Irish actor Killian Scott’s real name is Cillian Murphy, so he had to change his name because of unnaturally attractive actor Cillian Murphy (also Irish), who claimed his name in the acting world first.
(2) I’m not saying actors are not people. They are. And they have feelings too. Especially Kevin Bacon.