Online Tool
Submit two text strings in this calculator to see how they match with the Cosine similarity algorithm. No registration. No logging.
What Is The
Similarity metrics, such as cosine similarity, are used to compare data vectors in an n-dimensional space. In data science, they are used to improve search algorithms, capture semantic relations, and compare pieces of information.
The cosine similarity metric compares vectors using the angle formed between them as they are projected from the origin.
Different similarity metrics apply different mathematical operations to understand the relationship between data elements. Popular metrics include Euclidean distance and cosine similarity.
These are widely used in various web applications. Similarity metrics are used to link matching entities together and are popularly used in recommendation systems. The recommender system holds all information in the form of vectors and uses metrics like Euclidean distance or cosine similarity to find similar items and recommend them to users.
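As a rough illustration of the recommender idea, the sketch below ranks items by cosine similarity to a chosen item. The item names, the three-dimensional vectors, and the `recommend` helper are all hypothetical, made up for this example; a real system would use learned vectors with many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical item vectors (made-up feature scores, for illustration only).
ITEMS = {
    "action_movie_1": [0.9, 0.1, 0.0],
    "action_movie_2": [0.8, 0.2, 0.1],
    "romance_movie": [0.1, 0.9, 0.2],
    "documentary": [0.0, 0.2, 0.9],
}


def recommend(item_id, items, top_n=2):
    """Rank every other item by its cosine similarity to the given item."""
    target = items[item_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in items.items()
        if other_id != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


recommendations = recommend("action_movie_1", ITEMS)
for other_id, score in recommendations:
    print(f"{other_id}: {score:.4f}")
```

The item whose vector points in nearly the same direction as the target is ranked first, regardless of how long either vector is.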
The same concept is used on social media platforms to link users and recommend friends.
Similarity metrics, especially cosine similarity, are also used in organizations for fuzzy matching applications. They compare sets of documents and link different information elements together. Fuzzy matching allows organizations to identify potential duplicate items in their data and clean and optimize their database.
Cosine similarity is a popular similarity metric that calculates the similarity between two vectors in an n-dimensional space. Intuitively, every vector in a latent space begins at the origin (with all 0s as coordinates) and ends at the position defined by the vector itself. Since the vectors are joined at the tail, every two vectors form an angle between them.
Cosine similarity is defined as the cosine of the angle created between the two vectors. It ranges between -1 and 1, where 1 means a perfect alignment, and -1 means the vectors are completely opposite. But before diving further into the mathematics, let’s understand how the cosine function works.
The cosine function is a sinusoidal wave with a phase shift of 90 degrees relative to the sine function (see diagram below). Its value is bounded between -1 and 1 and oscillates as the angle increases.
Image credit: Mathsisfun.com
From the cosine wave, we can see that its value is 1 when the angle is 0. As the angle increases, the cosine value decreases and becomes zero when the angle is 90 degrees. Beyond 90, the cosine value becomes negative and reaches its lower bound of -1 when the angle is 180 degrees.
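The values read off the cosine wave can be checked with a few lines of Python. This is only a small sketch using the standard `math` module; the helper name `cosine_of_degrees` is an assumption introduced here.

```python
import math


def cosine_of_degrees(degrees):
    """Return the cosine of an angle given in degrees."""
    return math.cos(math.radians(degrees))


# Print the cosine at a few angles between 0 and 180 degrees.
for angle in (0, 45, 90, 135, 180):
    print(f"cos({angle} deg) = {cosine_of_degrees(angle):+.4f}")
```

Because of floating-point rounding, the value at 90 degrees is a tiny number very close to zero rather than exactly zero.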
Understanding and visualizing vectors is easy if they are in two or three dimensions. However, the concept of angles between vectors is much more challenging in higher-dimensional spaces, so practical applications implement the dot product formula for two vectors to capture the cosine similarity.
A dot product between two vectors quantifies their relationship in terms of their magnitude and direction. It is formulated as:
$$A \cdot B = |A|\,|B| \cos\theta$$
Here, |A| and |B| are the magnitudes of the vectors, and θ is the angle between them. The formula can be rearranged as:
$$\cos\theta = \frac{A \cdot B}{|A|\,|B|}$$
The above formula provides us with the cosine of the angle between the vectors using the vector dot product and their magnitudes. We can plug any two vectors into the formula and calculate the cosine similarity between them. Here’s how it will work. Let’s suppose we have the following vectors A and B.
Now using the formula above, we first need the dot product of the two vectors.
Now we will work on the denominator and calculate the vector magnitudes.
Finally, we will plug in the values to the cosine similarity formula,
So, our cosine similarity comes out as 0.1172, which means the vectors are quite dissimilar.
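The same three steps (dot product, magnitudes, division) can be sketched in Python. The vectors below are hypothetical, chosen only for illustration, so the result differs from the 0.1172 above.

```python
import math

# Hypothetical example vectors (not the ones used in the worked example above).
A = [1, 2, 3]
B = [4, 5, 6]

# Step 1: dot product of the two vectors.
dot_product = sum(a * b for a, b in zip(A, B))

# Step 2: magnitude (Euclidean length) of each vector.
magnitude_a = math.sqrt(sum(a * a for a in A))
magnitude_b = math.sqrt(sum(b * b for b in B))

# Step 3: plug the values into cos(theta) = (A . B) / (|A| |B|).
similarity = dot_product / (magnitude_a * magnitude_b)

print(f"dot product = {dot_product}")
print(f"|A| = {magnitude_a:.4f}, |B| = {magnitude_b:.4f}")
print(f"cosine similarity = {similarity:.4f}")
```

These two vectors point in nearly the same direction, so the score is close to 1.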
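The cosine similarity metric inherits several characteristics from the cosine function: it is bounded between -1 and 1, equals 1 when the vectors point in the same direction, 0 when they are perpendicular, and -1 when they point in opposite directions.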
Image credit: LearnDataSci.com
Modern AI algorithms use similarity metrics to build semantic relationships between different data items. The semantic understanding allows data scientists to build robust and concrete models to tackle real-world scenarios.
Cosine similarity is popularly used in natural language processing (NLP) pipelines to link different text samples. The text documents are tokenized and encoded as embeddings in an n-dimensional space. These embeddings are information-rich, and cosine similarity helps us understand how they relate to each other.
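To make the tokenize, encode, compare flow concrete, here is a minimal sketch that uses plain word counts as a stand-in for embeddings. This is an assumption made for simplicity: production NLP pipelines typically use learned, dense embeddings rather than raw counts, and the function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Encode a text as a sparse vector of word counts."""
    return Counter(tokenize(text))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[token] * counts_b[token] for token in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty texts.
    return dot / (norm_a * norm_b)


doc_a = "The cat sat on the mat"
doc_b = "The cat lay on the mat"
doc_c = "Stock prices fell sharply today"

sim_ab = cosine_similarity(bag_of_words(doc_a), bag_of_words(doc_b))
sim_ac = cosine_similarity(bag_of_words(doc_a), bag_of_words(doc_c))
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Word counts only capture shared vocabulary, so two texts with the same meaning but different words would score low here; learned embeddings are what allow cosine similarity to reflect semantic closeness.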
Cosine similarity is also used in retrieval augmented generation (RAG) applications as an information retrieval metric. Many RAG applications pair a large language model (LLM) with a vector database that stores information as embeddings.
When the model is prompted for a response, the prompt is also used as a query to retrieve additional information alongside the model's own knowledge. The retrieval algorithm uses the cosine similarity metric to find the stored embeddings most similar to the query, which the LLM then uses to refine its answer and construct an information-rich response.
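The retrieval step can be sketched as a top-k search over stored embeddings. Everything below is hypothetical: the in-memory `VECTOR_STORE` stands in for a vector database, and the three-dimensional vectors are made up rather than produced by an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical store of (text, embedding) pairs with made-up embeddings.
VECTOR_STORE = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping times: 3-5 days", [0.1, 0.9, 0.1]),
    ("Returns require a receipt", [0.8, 0.2, 0.1]),
    ("Office opening hours", [0.0, 0.1, 0.9]),
]


def retrieve(query_embedding, store, top_k=2):
    """Return the texts whose embeddings are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda entry: cosine_similarity(query_embedding, entry[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


# Made-up embedding of a user question about refunds.
query_embedding = [1.0, 0.1, 0.0]
context = retrieve(query_embedding, VECTOR_STORE)
print(context)
```

In a RAG application, the retrieved texts would then be added to the prompt so the LLM can draw on them when composing its answer.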
RAG has proven to be ground-breaking for the practical adoption of LLMs. It allows the language model to stay updated with the latest information without additional training, check its answers against retrieved sources, and reduce hallucinations in the response.
Tilores’s cosine similarity calculator tool provides a simple and easy-to-use interface to calculate the similarity between two strings. To calculate the similarity, simply type or paste the text strings in the input boxes and click on ‘Compare’. The tool requires no registration or logging in and our algorithm takes care of the embedding calculation. The output provides the similarity score as a percentage and a TRUE boolean flag if the similarity is above 80% and FALSE otherwise.
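The output format described above (a percentage plus a TRUE/FALSE flag at 80%) can be approximated with the sketch below. How the tool actually embeds strings is not described on this page, so the character-bigram encoding here is purely an assumption, as are the `compare` function and its return keys.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Encode a string as counts of its lowercase character bigrams (an assumed encoding)."""
    lowered = text.lower()
    return Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Strings too short to contain a bigram.
    return dot / (norm_a * norm_b)


def compare(text_a, text_b, threshold=0.80):
    """Return the similarity as a percentage and a flag that is True above the threshold."""
    score = cosine_similarity(bigram_counts(text_a), bigram_counts(text_b))
    return {
        "similarity_percent": round(score * 100, 2),
        "match": score > threshold,
    }


print(compare("Jon Smith", "John Smith"))
print(compare("Jon Smith", "Acme Corp"))
```

Character bigrams tolerate small spelling differences, which is why the two name variants land above the threshold while the unrelated strings do not.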
The cosine similarity algorithm is part of our fuzzy-matching arsenal for creating data linkage and deduplication. Our API allows customers to unify their data by resolving scattered data elements and identifying duplicate records. We provide a scalable and automated solution that adapts to your system needs and manages data workflows in real time.
More reading about the Cosine similarity algorithm (Wikipedia)
©2023 Tilores, All rights reserved.