Company Name Normalization Isn't Enough for Fuzzy Matching

Sami Yaseen

Every entity-resolution tool ships legal-form normalization. None of it fixes the false positives that come from matching on common words like Group, Holdings, or Solutions. These are meaningful business words that can't be normalized away, and token weighting is what closes the gap.

Standard fuzzy-matching algorithms fail on company name records for a structural reason: they treat every token as equal evidence. Weighted token matching uses Inverse Document Frequency (IDF) rarity weights to fix that. No per-token normalization lists, no per-domain tuning.

Assuming you've handled the Ltd/Limited problem and normalized your legal forms, your fuzzy matcher still rates "Acme Group Ltd" and "Beta Group Ltd" (pair A) as more similar than "Acme Group Ltd" and "Acme Holdings Ltd" (pair B). On every standard similarity algorithm (Jaccard, Cosine, Jaro, Jaro Winkler, Sorensen-Dice), pair A scores higher than pair B. The matcher is doing exactly what it was designed to do, and that's the problem.

The setup

Try this using your preferred fuzzy matching algorithm. Run two pairs of company names:

Pair A: Acme Group Ltd vs Beta Group Ltd

Pair B: Acme Group Ltd vs Acme Holdings Ltd

Pair A is two unrelated companies. Pair B is plausibly the same group, possibly the same company. Pair B should score higher. Here's what the standard fuzzy matchers actually say. A pair counts as a match at ≥ 80% similarity, at Levenshtein distance ≤ 3, or when the phonetic codes are equal. Cells marked 🟢 meet the match threshold; cells marked 🔴 don't:

Algorithm	Pair A: should NOT match	Pair B: should match	What goes wrong
Cosine (char n-gram)	🔴 69.2%	🔴 48.5%	🔴 Recall problem: misses the real match (pair B)
Jaccard (char n-gram)	🔴 52.9%	🔴 31.8%	↑
Sorensen-Dice	🔴 69.2%	🔴 48.3%	↑
Levenshtein distance	🔴 4	🔴 8	↑
Cologne phonetic	🔴 different encoding	🔴 different encoding	↑
Jaro	🟢 85.7% ⚠	🔴 70.1%	🔴 Both problems: accepts pair A (false positive) AND misses pair B
Jaro Winkler	🟢 85.7% ⚠	🟢 82.1%	🔴 Precision problem: accepts pair A even though it shouldn't
Weighted token matcher	🔴 ~13%	🟢 ~84%	🟢 Gets both right

⚠ = matcher said match when it shouldn't have (false positive on the unrelated pair).

Verify these numbers yourself in Tilores's free fuzzy-matching comparison tool: pair A and pair B. The weighted token matcher row uses sensible illustrative token-frequency weights.

There's no threshold that fixes both. Lower the threshold to catch pair B and you also accept pair A as a false positive. Raise it to reject pair A and you lose pair B. With these algorithms, you pick between bad precision (wrong matches) and bad recall (missed matches) on exactly the kind of records entity matching is supposed to handle.

Only the weighted token matcher gets both right. Acme is doing the work in pair B; Group and Ltd contribute almost nothing in pair A. The 80% threshold lands cleanly between the two scores: pair A rejected at ~13%, pair B accepted at ~84%. Both decisions correct, with room to spare in either direction.

The standard matchers aren't broken. Jaccard, Dice, Cosine, and Jaro-Winkler are all doing exactly what they're defined to do: count or measure shared characters and tokens, whilst ignoring how much each character or token actually tells you. And for company names, or any short natural-language identifier, that's exactly the wrong thing to count.
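To see the equal-evidence behaviour in miniature, here is a minimal sketch. It assumes whitespace tokenization and lowercasing for the token measure and a textbook, case-sensitive Levenshtein; the function names are mine, and the comparison tool's character n-gram settings are not reproduced here.

```python
def token_jaccard(a: str, b: str) -> float:
    """Shared tokens / unique tokens, with every token counted equally."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


pair_a = ("Acme Group Ltd", "Beta Group Ltd")
pair_b = ("Acme Group Ltd", "Acme Holdings Ltd")

jaccard_a = token_jaccard(*pair_a)
jaccard_b = token_jaccard(*pair_b)
lev_a = levenshtein(*pair_a)
lev_b = levenshtein(*pair_b)
```

With uniform token weights the two pairs are indistinguishable: each shares two of four unique tokens. Edit distance actively prefers the wrong pair, because swapping Acme for Beta costs fewer edits than swapping Group for Holdings.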

This issue is most acute with company names, where reliable addresses often aren't there to begin with. The same problem shows up with person names: common surnames like Smith and Garcia drown out the rare names that actually identify someone. With addresses, a house number plus a rare street name carries the signal, whilst Street, Avenue, and Road say almost nothing about which address it is.

So how does the weighted token matcher get both pairs right? It starts with knowing which words actually matter.

Why common business words break fuzzy matching

Pull a million company names from any registry and tokenize. The top of the frequency distribution falls into two layers, and they're not equally hard to deal with.

The easier half (for Western companies): legal forms. Ltd, Inc, Corp, Limited, Company, GmbH, SA, AG, Pty. You can handle these with a reference list of known legal-form variants and a normalization step. Most entity-resolution tools have this baked in; OpenSanctions publishes open reference data you can reuse; commercial tools ship preset lists. For Western legal forms, mostly a solved problem.

For Chinese, Russian, transliterated, or non-Latin-script company names, it gets complicated quickly. The reference-list approach hits its limits when transliteration variants, mixed scripts, and locale-specific suffixes pile up. That's a different article. For the rest of this piece, assume the easy cases are handled and focus on what's left.

The harder half: business-language tokens that aren't legal forms. Group, Holdings, International, Global, Services, Solutions, Systems, Technologies, Industries, Consulting, Management, Capital, Digital, Cloud, etc. Each of these appears in 5–25% of records, so close to 90% of records in a typical dataset contain at least one of them. They're real, meaningful words. They sometimes distinguish a company (a "Solutions" subsidiary is different from an "Industries" subsidiary) and sometimes don't ("Acme Group" rebranded to "Acme Holdings"). You can't normalize them away. They carry domain meaning. But two records sharing any one of them is barely better evidence of a match than chance.

This is where naive token-set similarity falls apart for company name matching. A matcher that doesn't distinguish Solutions (in 18% of records) from Petrochemical (in 0.01%) will treat "Acme Solutions Ltd" and "Beta Solutions Ltd" as a strong match candidate (both share Solutions and Ltd) even though the Acme/Beta difference is the only thing that actually identifies these as different companies.

The same shows up in person names. Smith, Garcia, Wang, and Lee together cover something like 15% of records in their respective populations. Two records in a US dataset sharing Smith says almost nothing. Two records sharing Petrov in the same dataset says a lot. With addresses: Main Street and High Street appear in thousands of cities; a number plus Magnolia Lane is uniquely yours.

The fix isn't a better string-similarity algorithm. Levenshtein edit distance, Jaro, Jaro-Winkler: all of them have the same problem when applied to short identifiers. And it isn't a longer normalization list either. Solutions, Holdings, Industries aren't candidates for normalization. They're meaningful tokens that happen to be common. The fix is to give the matcher information it doesn't have: a sense of which tokens carry information.

Inverse Document Frequency: an old idea in a new place

Search engines have known about this for fifty years. They use TF-IDF (Term Frequency × Inverse Document Frequency): the first half counts how often a word appears in a given document; the second half discounts it by how common the word is across the whole corpus. Together they measure how distinctive a word is for that document. The core intuition: the more often a token shows up across the corpus, the less it's worth as evidence.

The IDF half:

weight(token) = log(N / count(token))

N is the total number of records, count(token) is how many of them contain that token. Common tokens get weights near zero. Rare tokens get high weights.
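As a minimal sketch of that formula, assuming lowercased whitespace tokens and a four-record toy corpus that I made up for illustration (the name `idf_weights` is hypothetical too):

```python
import math
from collections import Counter


def idf_weights(records: list[str]) -> dict[str, float]:
    """weight(token) = log(N / count(token)).

    count(token) is the number of records containing the token,
    so a token repeated inside one record is still counted once.
    """
    n = len(records)
    doc_freq = Counter()
    for record in records:
        doc_freq.update(set(record.lower().split()))
    return {token: math.log(n / count) for token, count in doc_freq.items()}


# Toy corpus, invented for illustration only.
corpus = [
    "Acme Group Ltd",
    "Beta Group Ltd",
    "Gamma Holdings Ltd",
    "Delta Solutions Ltd",
]
weights = idf_weights(corpus)
```

Here Ltd appears in every record and gets a weight of exactly zero, Group appears in half of them and gets log(2), and Acme appears once and gets log(4). Tokens that never appear in the corpus have no weight at all in this sketch; a real matcher needs a policy for them.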

Nothing exotic. The same intuition every search engine uses.

What's not commonly applied is using IDF for entity matching. Search uses it to rank documents against a query: which document best matches these words? Entity matching uses it for a related but different question: do these two short records refer to the same thing? Same idea, different application.

You take the tokens of both records, weight each by its IDF score, and now your matcher has the information it didn't have before. Acme should count for far more than Solutions or Holdings or Group when measuring whether two records overlap.

And critically, you don't need to maintain a per-token normalization list to make it happen. The weights come from the data itself. The same technique works across domains (company names, person names, product titles, addresses) because the IDF derivation pulls the rare-vs-common distinction out of whatever corpus you point it at. Domain knowledge comes from the data, automatically. No per-domain tuning, no manual rule lists, no expert-curated exception files.

If you've used the ratio matcher (matching tokens / unique tokens, or any equivalent uniform-weight set-similarity), this is the natural next step. Same idea, just with rarity as the weighting principle instead of treating every token equally.

How weighted token matching works

The scoring formula:

matching_score = matching_token_weights / all_token_weights

Where:

matching_token_weights = the combined IDF weights of tokens that match between the two records. Fuzzy matches (typos like Aleksandr vs Aleksander) contribute their weight scaled by the string-similarity score, so half-matches count for half.

all_token_weights = the combined IDF weights of every token across both records.

What you're computing: the fraction of the informative content the two records share, instead of the fraction of tokens. Hook a threshold to that score (say, 80%) and you have a matcher that doesn't fall for the Group/Holdings/Solutions trap.
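A minimal sketch of the exact-match version of this score follows. Two assumptions are mine: "every token across both records" is read as the union of unique tokens, and the weights are hypothetical numbers picked by hand rather than derived from a corpus, so the scores will not reproduce the ~13% and ~84% in the table.

```python
def weighted_score(a: str, b: str, weights: dict[str, float],
                   default: float = 1.0) -> float:
    """matching_token_weights / all_token_weights over unique tokens.

    `default` is an assumed fallback weight for tokens missing from `weights`.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    matching = sum(weights.get(t, default) for t in tokens_a & tokens_b)
    total = sum(weights.get(t, default) for t in tokens_a | tokens_b)
    return matching / total if total else 0.0


# Hypothetical weights: rare names high, common business words low.
weights = {"acme": 10.0, "beta": 10.0, "group": 1.0, "holdings": 1.0, "ltd": 0.5}

score_a = weighted_score("Acme Group Ltd", "Beta Group Ltd", weights)     # pair A
score_b = weighted_score("Acme Group Ltd", "Acme Holdings Ltd", weights)  # pair B
threshold = 0.80
```

With these weights pair A shares only Group and Ltd (1.5 of 21.5 total weight, about 7%), while pair B shares Acme and Ltd (10.5 of 12.5, or 84%), so the 80% threshold separates them.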

Even mature probabilistic linkage tools, such as Splink, explicitly say they're not designed for the single-column "bag of words" case, a table with only a company name column, nothing else. That's exactly the case this technique handles, and it's a chunk of real-world entity-matching work that often goes underserved.

A few details worth knowing about:

Fuzzy matches scale proportionally. Two records with Aleksandr and Aleksander (typo on a rare token) shouldn't score the same as an exact match. They shouldn't score zero either. Weighting the intersection contribution by the string-similarity score handles this naturally.

Position can carry weak evidence. If the same word appears at vastly different positions in the two records, that's a small signal that the records mean different things. A modest position-distance penalty handles it; how aggressive depends on the domain.

I'm describing the general structure. Several specific scoring functions fit it. What matters is the structural change: from "all tokens equal" to "tokens weighted by rarity."
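One scoring function that fits this structure, sketched under stated assumptions: tokens are paired greedily, string similarity comes from Python's `difflib.SequenceMatcher`, a fuzzy pair contributes the mean of its two weights scaled by the similarity, and pairs below an assumed `min_sim` cutoff count as non-matches. None of these choices comes from the article; they are placeholders for whatever similarity measure and pairing strategy you prefer. The position penalty is left out.

```python
from difflib import SequenceMatcher


def fuzzy_weighted_score(a: str, b: str, weights: dict[str, float],
                         default: float = 1.0, min_sim: float = 0.8) -> float:
    """Weighted overlap where near-matches count in proportion to similarity."""
    tokens_a = list(dict.fromkeys(a.lower().split()))  # unique, order kept
    tokens_b = list(dict.fromkeys(b.lower().split()))
    matched = 0.0
    total = 0.0
    for token in tokens_a:
        # Find the most similar token still unpaired on the other side.
        best, best_sim = None, 0.0
        for candidate in tokens_b:
            sim = SequenceMatcher(None, token, candidate).ratio()
            if sim > best_sim:
                best, best_sim = candidate, sim
        if best is not None and best_sim >= min_sim:
            tokens_b.remove(best)
            pair_weight = (weights.get(token, default) + weights.get(best, default)) / 2
            matched += best_sim * pair_weight  # partial credit for a typo
            total += pair_weight
        else:
            total += weights.get(token, default)
    # Tokens left unpaired in b only add to the denominator.
    total += sum(weights.get(t, default) for t in tokens_b)
    return matched / total if total else 0.0


# Hypothetical weights for illustration.
weights = {"aleksandr": 8.0, "aleksander": 8.0, "petrov": 6.0}

exact = fuzzy_weighted_score("Aleksandr Petrov", "Aleksandr Petrov", weights)
typo = fuzzy_weighted_score("Aleksandr Petrov", "Aleksander Petrov", weights)
other = fuzzy_weighted_score("Aleksandr Petrov", "Maria Petrov", weights)
```

The typo pair scores just below the exact match rather than collapsing to the Petrov-only overlap, which is the behaviour the fuzzy-scaling detail describes.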

How to derive token-frequency weights from your data

We glossed over an important question: where does count(token) come from?

You have two choices. Use a generic weight list (e.g. an off-the-shelf English IDF table), or derive your own from your data.

Generic is easier. It's also wrong, in subtle ways. Industries is rare in general English and would get a high weight. In a B2B vendor dataset, Industries is everywhere. Petrochemical is rare in general English but might be common in your dataset. Generic weights mis-weight exactly the tokens that are characteristic of your domain. Exactly the tokens that matter most.

Deriving weights from your dataset is the right move. Run your corpus through the same transformation chain that runtime matching uses, count token frequencies, compute IDF.

The "same transformation chain" part is non-trivial. If your weight-generation tokenizer drops diacritics and your runtime tokenizer doesn't, the matcher looks up a weight for token X that was computed under a different normalization, and either misses the token (because that form isn't in the weight list) or scores it wrong. The mismatch is silent; nothing crashes; matching just performs worse than it should.

The fix is operational. Lock the transformation chain. Use the same chain for both purposes. Regenerate weights when the dataset shifts meaningfully.

What changes when this works

False positives drop and recall holds at the same threshold. Operators stop maintaining the exception lists they've built up against the obviously-wrong matches token-set methods generate.

Where this matters most: matching shipping or customs records (often OCR'd from handwritten forms) against clean company registries, where the only shared signal is a company name riddled with extraneous tokens, transliterated variants, or fragments of address mixed into the name field. Address-based matching has nothing to work with there. Token-set methods drown in the noise. Weighted matching is what finally bridges the two sources.

The approach generalizes. Person names: matching Aleksandr Petrov through a customer base full of Smiths and Garcias becomes tractable instead of a tuning exercise. Addresses: "12 Old Mill Lane, Petersfield" finds "12 Old Mill Ln, Petersfield" without drowning in every record that contains Lane or Street or any postal abbreviation. Fraud applicant linkage, customer deduplication across systems, cross-language identifier matching. Anywhere short tokens carry the signal and most are noise, the same problem applies, and is solved the same way. The data carries the domain knowledge.

There's a price. You maintain the weights, tune the threshold, and keep the transformation chain aligned across both stages. Most teams running entity resolution at scale already have this discipline. Weighted matching just gives you a tool that uses it.

If you're hand-tuning thresholds against company name datasets, normalizing legal forms, and writing exception rules for Solutions, Holdings, and Group matches, there's a better way.

Summary

Standard fuzzy matching (Jaccard, Cosine, Jaro, Jaro Winkler, Levenshtein) fails on company name records because all tokens count as equal evidence.

Common business words like Group, Holdings, Solutions aren't candidates for normalization. They carry domain meaning, even when they're noise for matching.

IDF-style token weighting separates rare-and-meaningful tokens from common-and-noise ones. Fixes both precision (false positives) and recall (missed matches) at the same threshold.

At scale, indexing must bucket tokens by weight too: rare tokens indexed individually, mid-weight tokens in pairs, common tokens dropped from the index.

Weights derive automatically from your dataset, run through the same transformation chain as runtime matching. The same technique works on company names, person names, addresses, product titles, anything. Domain knowledge comes from the data; no manual rule lists or per-domain tuning required.

This is the approach the matcher in Tilores takes for its entity-resolution pipeline (technical docs). The transformation-chain alignment described above is implemented as a separate weight-generation step that runs your dataset through the runtime transformation chain to compute weights consistently. The weights, category boundaries, and transformation chain are explicit configuration, not pre-tuned settings you can't see or adjust.
