Online Tool

Cosine Similarity Calculator

Submit two text strings in this calculator to see how they match with the Cosine similarity algorithm. No registration. No logging.

 

Supercharge Your Cosine Similarity Data Matching

  • Automated pre-processing

  • Customer record unification

  • Sync data from source systems

  • Fuzzy search customer data

  • Build real-time unified customer data applications

What Is The Cosine Similarity Algorithm?

Similarity metrics, such as cosine similarity, are used to compare data vectors in an n-dimensional space. In data science, these metrics help improve search algorithms, capture semantic relations, and compare pieces of information.

The cosine similarity metric compares vectors using the angle formed between them as they are projected from the origin. Here’s what you will learn on this page:

  • An understanding of the cosine function.
  • The mathematics behind cosine similarity.
  • Cosine similarity applications.
  • Cosine similarity in Modern AI.

Similarity Metrics in Everyday Applications

Different similarity metrics apply different mathematical operations to understand the relationship between data elements. Popular examples include Euclidean distance and cosine similarity.

Similarity metrics are widely used in web applications, where they link matching entities together, most visibly in recommendation systems. A recommender system holds its information in the form of vectors and uses metrics like Euclidean distance or cosine similarity to find similar items and recommend them to users.

The same concept is used on social media platforms to link users and recommend friends.

Similarity metrics, especially cosine similarity, are also used in organizations for fuzzy matching applications. They compare sets of documents and link different information elements together. Fuzzy matching allows organizations to identify potential duplicate items in their data and clean and optimize their database.

Understanding Cosine Similarity

Cosine Similarity is a popular similarity metric that calculates the similarity between 2 vectors in an N-dimensional space. Intuitively, every vector in a latent space begins at the origin (with all 0s as coordinates) and ends at the position defined by the vector itself. Since the vectors are joined at the tail, every two vectors form an angle between them.

Cosine similarity is defined as the cosine of the angle created between the two vectors. It ranges between -1 and 1, where 1 means a perfect alignment, and -1 means the vectors are completely opposite. But before diving further into the mathematics, let’s understand how the cosine function works.

The Cosine Intuition

The cosine function is a sine wave shifted by 90 degrees (see the cosine graph below). Its value is bounded between -1 and 1 and oscillates smoothly as the angle increases.

Cosine Graph

Image credit: Mathsisfun.com 

From the cosine wave, we can see that its value is 1 when the angle is 0. As the angle increases, the cosine value decreases and becomes zero when the angle is 90 degrees. Beyond 90, the cosine value becomes negative and reaches its lower bound of -1 when the angle is 180 degrees.
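The values described above can be checked numerically. The following is a minimal sketch using Python's standard `math` module; the helper name `cosine_of_degrees` is an illustrative assumption, not something taken from this page.

```python
import math


def cosine_of_degrees(angle_degrees):
    """Return cos(angle), with the angle given in degrees."""
    return math.cos(math.radians(angle_degrees))


# Sample the angles mentioned in the text. Rounding hides tiny
# floating-point residue (cos of 90 degrees is about 6e-17, not exactly 0).
for angle in (0, 45, 90, 135, 180):
    print(angle, round(cosine_of_degrees(angle), 4))
```

The printed values fall from 1 at 0 degrees, through 0 at 90 degrees, to -1 at 180 degrees.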

The Mathematics Behind Cosine Similarity

Understanding and visualizing vectors is easy if they are in two or three dimensions. Angles between vectors are much harder to visualize in higher-dimensional spaces, so practical applications use the dot product formula for two vectors to compute the cosine similarity.

A dot product between two vectors quantifies their relationship in terms of their magnitude and direction. It is formulated as:

$$A \cdot B = |A|\,|B| \cos\theta$$

Here, |A| and |B| are the magnitudes of the vectors, and θ is the angle between them. The formula can be rearranged as:

$$\cos\theta = \frac{A \cdot B}{|A|\,|B|}$$
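For reference, the standard definitions of the dot product and the magnitude can be written in component form. The component symbols a_i and b_i and the dimension n are notation assumed here for illustration; they do not appear in the original text.

```latex
\[
A \cdot B = \sum_{i=1}^{n} a_i b_i,
\qquad
|A| = \sqrt{\sum_{i=1}^{n} a_i^{2}},
\qquad
|B| = \sqrt{\sum_{i=1}^{n} b_i^{2}}
\]
\[
\cos\theta
  = \frac{A \cdot B}{|A|\,|B|}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```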

The above formula provides us with the cosine of the angle between the vectors using the vector dot product and their magnitudes. We can plug any two vectors into the formula and calculate the cosine similarity between them. Here’s how it will work. Let’s suppose we have the following vectors A and B.

Now using the formula above, we first need the dot product of the two vectors.

Now we will work on the denominator and calculate the vector magnitudes.

Finally, we will plug in the values to the cosine similarity formula,

So, our cosine similarity comes out as 0.1172. This is close to 0, which means the vectors are nearly orthogonal and quite dissimilar.
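The same three steps (dot product, magnitudes, division) can be sketched in Python. The vectors below are hypothetical values chosen for easy arithmetic; they are not the vectors from the walkthrough above, so the result differs from 0.1172.

```python
import math

# Hypothetical example vectors (NOT the ones used in the walkthrough above).
A = [3, 4]
B = [4, 3]

# Step 1: dot product, the sum of element-wise products.
dot_product = sum(a * b for a, b in zip(A, B))  # 3*4 + 4*3 = 24

# Step 2: magnitudes, the square root of the sum of squared components.
magnitude_a = math.sqrt(sum(a * a for a in A))  # sqrt(9 + 16) = 5.0
magnitude_b = math.sqrt(sum(b * b for b in B))  # sqrt(16 + 9) = 5.0

# Step 3: divide the dot product by the product of the magnitudes.
cosine_similarity = dot_product / (magnitude_a * magnitude_b)
print(cosine_similarity)  # 0.96
```

A score of 0.96 is close to 1, so these two illustrative vectors point in nearly the same direction (an angle of about 16 degrees).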

Key Points for Cosine Similarity

The cosine similarity metric adopts several characteristics from the cosine function. These are:

  • The similarity score is between 1 and -1.
  • A score of 1 means the angle is zero and the vectors match perfectly, while -1 means the vectors are perfectly opposite in direction (180-degree angle).
  • A zero score means the vectors are orthogonal and are considered completely dissimilar.
  • Cosine similarity does not measure the vectors' magnitudes, only their direction in the n-dimensional space. Scaling either vector by a positive factor leaves the score unchanged; the similarity score is only affected by the angle between the vectors.
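The key points above can be verified with a short sketch. The function name `cosine_similarity` and the sample vectors are illustrative assumptions; the zero-vector check is added because the formula is undefined when a magnitude is 0.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the score stays at (approximately) 1.
same_direction = cosine_similarity([1, 2, 3], [10, 20, 30])

# Orthogonal vectors: the score is 0.
orthogonal = cosine_similarity([1, 0], [0, 1])

# Opposite direction: the score is (approximately) -1.
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, orthogonal, opposite)
```

Because of floating-point rounding, the first and last values may differ from 1 and -1 in the final decimal places, so comparisons are best done with a tolerance such as `math.isclose`.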

Vectors

Image credit: LearnDataSci.com 

Significance of Cosine Similarity in Modern AI

Modern AI algorithms use similarity metrics to build semantic relationships between different data items. The semantic understanding allows data scientists to build robust and concrete models to tackle real-world scenarios.

Cosine similarity is popularly used in natural language processing (NLP) pipelines to link different text samples. The text documents are tokenized and encoded as embeddings in an n-dimensional space. These embeddings are information-rich, and cosine similarity helps us understand how they relate to each other.
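To make the text-to-vector idea concrete, here is a minimal sketch that uses word counts as a simple stand-in for learned embeddings. This is an assumption made for illustration: real NLP pipelines typically use an embedding model, and the helper names `word_counts` and `text_cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Tokenize into lowercase words and count them (a sparse vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def text_cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat empty text as having no similarity.
    return dot / (norm_a * norm_b)


print(text_cosine_similarity("the cat sat", "the cat ran"))
```

In this example the two sentences share two of their three words, so the score is 2/3. Word counts only capture shared vocabulary; capturing semantic relations requires learned embeddings.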

Cosine similarity is also used in retrieval augmented generation (RAG) applications as an information retrieval metric. Many applications built on large language models (LLMs) pair the model with a vector database that stores information as embeddings.

When the application is prompted for a response, the prompt is typically encoded as an embedding and used as a query to retrieve additional information. The retrieval algorithm uses the cosine similarity metric to find the stored embeddings most similar to the query, which the LLM then uses, alongside its own knowledge, to refine its answer and construct an information-rich response.
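The retrieval step can be sketched as a ranking of stored embeddings by cosine similarity to the query. Everything below is a toy assumption: the in-memory dictionary stands in for a vector database, the three-dimensional vectors and document names are made up, and `retrieve` is a hypothetical helper rather than any particular product's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, store, top_k=2):
    """Return the ids of the top_k stored embeddings most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), doc_id)
        for doc_id, embedding in store.items()
    ]
    scored.sort(reverse=True)  # Highest similarity first.
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy "vector database": hypothetical document ids mapped to made-up embeddings.
store = {
    "doc_pricing": [0.9, 0.1, 0.0],
    "doc_api": [0.1, 0.9, 0.2],
    "doc_privacy": [0.0, 0.2, 0.9],
}

results = retrieve([0.8, 0.3, 0.0], store, top_k=2)
print(results)
```

In a real system, the text of the retrieved documents would be passed to the LLM as extra context. Production vector databases usually use approximate nearest-neighbor indexes instead of scoring every stored embedding as this sketch does.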

RAG has proven valuable for the practical adoption of LLMs. It allows the language model to draw on up-to-date information without additional training, ground its answers in retrieved sources, and reduce hallucinations in the response.

Cosine Similarity with Tilores

Tilores’s cosine similarity calculator tool provides a simple and easy-to-use interface to calculate the similarity between two strings. To calculate the similarity, simply type or paste the text strings in the input boxes and click on ‘Compare’. The tool requires no registration or logging in and our algorithm takes care of the embedding calculation. The output provides the similarity score as a percentage and a TRUE boolean flag if the similarity is above 80% and FALSE otherwise.
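The shape of that output (a percentage plus a boolean flag at an 80% threshold) can be illustrated with a small sketch. This is not Tilores's implementation: the page does not describe how the strings are embedded, so the sketch assumes character bigram counts as the vector representation, and `compare_strings` is a hypothetical name.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Count overlapping two-character sequences in the normalized text."""
    normalized = text.strip().lower()
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def compare_strings(text_a, text_b, threshold=80.0):
    """Return (similarity as a percentage, True if above the threshold)."""
    counts_a, counts_b = bigram_counts(text_a), bigram_counts(text_b)
    dot = sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0, False  # Strings shorter than two characters have no bigrams.
    percent = round(100 * dot / (norm_a * norm_b), 2)
    return percent, percent > threshold


print(compare_strings("Jonathan Smith", "Jonathon Smith"))
print(compare_strings("Jonathan Smith", "Maria Garcia"))
```

With this bigram representation, the first pair scores roughly 88% and is flagged as a match, while the second pair scores well below the threshold. A different vector representation would give different percentages for the same strings.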

The cosine similarity algorithm is part of our fuzzy-matching arsenal for creating data linkage and deduplication. Our API allows customers to unify their data by resolving scattered data elements and identifying duplicate records. We provide a scalable and automated solution that adapts to your system needs and manages data workflows in real time.

More reading about the Cosine similarity algorithm (Wikipedia)


About

Tilores

When you need to do fuzzy matching on high-volume data in real-time, you need a built-for-purpose technology: enter Tilores.

Consistently fast search response times

Built for unlimited serverless scaling

Real-time data ingestion and simultaneous search.

Configure matching rules easily in the UI

Data privacy compliant by design

The API to unify scattered customer data in real-time.


©2023 Tilores, All rights reserved.