Precision and Recall in Entity (Identity) Resolution

TL;DR: To score confidence in entity resolution matches, evaluate your matcher against labelled data with precision (True Positives / (True Positives + False Positives)) and recall (True Positives / (True Positives + False Negatives)), then combine them into an F-score. Set the auto-merge threshold by the cost of your errors: when wrongly merging records is dangerous (e.g. medical records) favour precision, optimise F0.5, and require a high confidence score before auto-merging — review the rest. When missing matches is worse (e.g. customer deduplication, fraud review) favour recall and optimise F2. As of 2026, vendors including AWS advise setting a measurable accuracy threshold tied to your use case rather than a single universal cutoff, and a real-time API such as Tilores entity resolution software returns a confidence score per link that you can threshold directly.

How do I score confidence in entity resolution matches and set thresholds for auto-merge?

The two error types pull in opposite directions, so a confidence threshold is a deliberate trade-off. The table summarises that trade-off; each formula and use case in it is explained in full in the sections that follow.

Criteria	Precision	Recall
What it measures	How many of the matches your system identified are actually correct	How many of the actual matches in your dataset your system successfully found
Formula	True Positives / (True Positives + False Positives)	True Positives / (True Positives + False Negatives)
Error it controls	False positives — wrongly merging two different entities	False negatives — missing records that belong to the same entity
Favour it when	Wrongly merging records is costly — e.g. a medical records system, where incorrectly merging two different patients' records could be dangerous	Missing matches is costly — e.g. customer database deduplication, or identifying potential fraud where missing a case is worse
F-score to optimise	F0.5 (weights precision twice as much as recall)	F2 (weights recall twice as much as precision)
Auto-merge threshold implication	Set a high confidence threshold so only the strongest matches auto-merge; send the rest to human review	Set a lower confidence threshold so more candidate matches are merged; accept more review of borderline pairs

Use F1 when both types of errors are equally costly; the F-beta formula and worked numbers are in the full explanation below. See also the F-score reference and the canonical precision and recall reference.

Measuring the matching performance of an entity/identity resolution system can be challenging. However, if labelled data is available (a set of records where you already know which ones should be matched or linked to each other), a classical "precision and recall" approach can be taken, and the overall performance of the entity resolution system can be summarised as an F-score.

Precision in Entity Resolution

Precision measures how many of the matches your entity resolution system identified are actually correct. For example, if your system says "John Smith from Company A" and "J. Smith from Company A" are the same person, and it's right, that's a true positive. But if it incorrectly matches two different John Smiths, that's a false positive. Precision is calculated as: True Positives / (True Positives + False Positives).

Recall in Entity Resolution

Recall measures how many of the actual matches in your dataset your system successfully found. If there are 100 pairs of records that should be matched because they refer to the same entity, and your system only finds 80 of them, your recall would be 80%. Recall is calculated as: True Positives / (True Positives + False Negatives).

F-Score in Entity Resolution

The F-score (or F1 score) combines precision and recall into a single metric, giving equal weight to both. It's particularly useful in ER because you often need to balance between being too aggressive in matching (which hurts precision) and too conservative (which hurts recall). The F1 score is calculated as: 2 * (Precision * Recall) / (Precision + Recall).
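As a minimal sketch of the three formulas above, the following Python computes pairwise precision, recall and F1 from a collection of predicted pairs and a collection of labelled pairs. The function name `pairwise_scores` and the record IDs are hypothetical, and treating pairs as unordered is an assumption about how your labelled data is stored.

```python
def pairwise_scores(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for two collections of record-ID pairs."""
    # Assumption: (a, b) and (b, a) describe the same pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}

    tp = len(predicted & actual)  # matches found that are correct
    fp = len(predicted - actual)  # matches found that are wrong
    fn = len(actual - predicted)  # real matches that were missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical record IDs: two correct matches, one wrong match, one missed match.
predicted = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
labelled = [("r2", "r1"), ("r3", "r4"), ("r7", "r8")]
precision, recall, f1 = pairwise_scores(predicted, labelled)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; some teams prefer to treat those cases as undefined.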

Example

Let's say you have a database of 1000 customer records, and there are actually 100 duplicate pairs (200 records that should be matched as pairs). Your entity/identity resolution system:

Identifies 90 pairs as matches
Of these 90 pairs, 80 are correct matches (true positives)
This means 10 are incorrect matches (false positives)
And it missed 20 actual matches (false negatives)

In this case:

Precision = 80/90 = 89% (89% of the matches it found were correct)
Recall = 80/100 = 80% (it found 80% of all actual matches)
F1 Score = 2 * (0.89 * 0.80) / (0.89 + 0.80) = 84%

When developing an entity resolution system, you might tune your matching thresholds or rules based on whether precision or recall is more important for your use case. For instance:

In a medical records system, you might prioritize precision because incorrectly merging two different patients' records could be dangerous
In a customer database deduplication task, you might lean towards higher recall to ensure you don't miss opportunities to consolidate customer information

F-Score Variations in Entity Resolution

An F1 score gives equal weight to precision and recall. However, as discussed above, in certain circumstances we may want to prioritise precision or recall.

The general formula for F-beta scores is: F_β = (1 + β²) * (precision * recall) / (β² * precision + recall)

Where β is a parameter that determines the weight of recall relative to precision:

When β = 1, you get the standard F1 score (equal weight)
When β = 2, you get the F2 score (weights recall higher)
When β = 0.5, you get the F0.5 score (weights precision higher)

F2 Score:

The F2 score weighs recall twice as much as precision
This is useful when false negatives (missed matches) are more costly than false positives
Example use case: Identifying potential fraud cases where missing a fraudulent transaction is worse than flagging a legitimate one for review

F0.5 Score:

The F0.5 score weighs precision twice as much as recall
This is useful when false positives are more costly than false negatives
Example use case: Automated merging of patient medical records where incorrect matches could cause serious problems

F Score Examples

Using our previous example: With precision = 89% and recall = 80%:

F1 score = 84% (as we calculated before)
F2 score = 82% (favouring recall, so slightly lower than F1 because our recall was lower than our precision)
F0.5 score = 87% (favouring precision, so slightly higher than F1 because our precision was higher than our recall)
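The F-beta formula can be checked against the counts from the worked example (80 true positives, 10 false positives, 20 false negatives) with a few lines of Python. This is an illustrative sketch; the helper name `f_beta` is hypothetical and not taken from any particular library.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from raw counts, using the general formula given earlier."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # convention for this sketch when there are no true positives
    return (1 + beta**2) * precision * recall / denominator


# Counts from the worked example in this article.
for beta in (1, 2, 0.5):
    print(f"F{beta} = {f_beta(80, 10, 20, beta):.3f}")
# F1 = 0.842
# F2 = 0.816
# F0.5 = 0.870
```

Working from the unrounded counts avoids small drift from rounding precision to 0.89 before combining it with recall.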

Summary

In entity resolution, you might choose different F-scores based on your specific needs:

Use F0.5 when the cost of wrongly merging records is high
Use F2 when missing actual matches is more problematic than making a few wrong matches
Use F1 when both types of errors are equally costly
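One way to turn this choice into a concrete threshold is to sweep candidate thresholds over labelled, scored pairs and keep the one that maximises the chosen F-beta. The sketch below assumes each labelled pair carries a confidence score between 0 and 1; the function name `best_threshold` and the sample data are invented for illustration.

```python
def best_threshold(scored_pairs, beta=1.0):
    """Return (threshold, f_beta) maximising F-beta.

    scored_pairs: iterable of (confidence, is_true_match) tuples.
    A pair counts as a predicted match when confidence >= threshold.
    """
    scored = list(scored_pairs)
    total_matches = sum(1 for _, is_match in scored if is_match)
    best = (None, -1.0)
    for threshold in sorted({confidence for confidence, _ in scored}):
        tp = sum(1 for c, m in scored if c >= threshold and m)
        fp = sum(1 for c, m in scored if c >= threshold and not m)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / total_matches if total_matches else 0.0
        denominator = beta**2 * precision + recall
        score = (1 + beta**2) * precision * recall / denominator if denominator else 0.0
        if score > best[1]:
            best = (threshold, score)
    return best


# Invented labelled sample: (confidence, is the pair really a match?)
labelled = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, False), (0.70, True),
    (0.60, False), (0.50, True), (0.40, False), (0.30, False),
]
precision_first = best_threshold(labelled, beta=0.5)  # picks the 0.85 threshold
recall_first = best_threshold(labelled, beta=2)       # picks the 0.50 threshold
```

On this sample, optimising F0.5 selects a stricter threshold than optimising F2, which mirrors the trade-off described above. In practice the sweep would be run on a held-out labelled set rather than the data used to tune the matching rules.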

What's changed in 2026: from F-score to a confidence threshold?

The precision/recall/F-score framework above is the stable, vendor-neutral way to score matching quality. What has sharpened as of 2026 is how teams turn those scores into an operational auto-merge decision:

Tiered confidence thresholds. A common production pattern scores each candidate pair and routes it by confidence: pairs above an accept threshold auto-merge, pairs below a reject threshold are auto-dismissed, and the ambiguous middle band goes to a human review queue — which preserves precision on hard cases while keeping throughput high.
Threshold is a business decision, not a constant. In its 29 September 2025 engineering post, AWS measures matching accuracy with precision, recall and an F1 score against manually annotated ground truth and advises companies to set a measurable accuracy threshold tailored to their industry rather than one universal cutoff — the same precision-versus-recall trade-off this article describes.
Measure clusters, not just pairs. Peer-reviewed work on how to evaluate entity resolution systems (Binette et al., 2024) argues for an entity-centric framework that estimates both cluster-level and pairwise precision and recall, because pairwise metrics alone can misrepresent how well records are grouped into correct entities.
Confidence per link. Because Tilores entity resolution software links and deduplicates records in real time with tunable matching rules and confidence scoring, the precision/recall trade-off above maps directly onto a threshold for auto-merge versus review.
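The tiered routing pattern in the list above can be sketched as a small function. The threshold values, the function name `route_pair` and the outcome labels are all hypothetical, and this is not the API of Tilores or any other product; how to treat a score that sits exactly on a boundary is also an assumption made here.

```python
def route_pair(confidence, accept_threshold=0.90, reject_threshold=0.50):
    """Route a candidate pair by confidence score (illustrative thresholds)."""
    if reject_threshold > accept_threshold:
        raise ValueError("reject_threshold must not exceed accept_threshold")
    if confidence >= accept_threshold:
        return "auto_merge"    # strong match: merge without review
    if confidence < reject_threshold:
        return "auto_dismiss"  # weak match: discard without review
    return "human_review"      # ambiguous middle band


decisions = [route_pair(score) for score in (0.97, 0.72, 0.31)]
print(decisions)  # ['auto_merge', 'human_review', 'auto_dismiss']
```

Raising the accept threshold protects precision at the cost of a larger review queue, while lowering the reject threshold protects recall in the same way.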

Score and threshold your own matches: book a demo to see how each match is scored on your own data, or get the evaluation build to try it locally, then explore Tilores entity resolution software to set an auto-merge threshold that fits your precision-versus-recall trade-off.

Frequently asked questions

How do I score confidence in entity resolution matches and set thresholds for auto-merge?: Score each candidate match against labelled data using precision and recall, then choose an F-score that matches the cost of your errors, and set the auto-merge threshold where that score is acceptable. Precision is True Positives / (True Positives + False Positives); recall is True Positives / (True Positives + False Negatives). If wrongly merging two records is costly (for example medical records), favour precision, optimise for F0.5, and set a high confidence threshold so only the strongest matches auto-merge, sending the rest to human review. If missing matches is worse (for example customer deduplication or fraud review), favour recall, optimise for F2, and use a lower threshold. AWS likewise advises setting a measurable accuracy threshold tailored to your industry rather than one universal cutoff.
What is the difference between precision and recall in entity resolution?: Precision measures how many of the matches your entity resolution system identified are actually correct (True Positives / (True Positives + False Positives)). Recall measures how many of the actual matches in your dataset your system successfully found (True Positives / (True Positives + False Negatives)). High precision means few false merges; high recall means few missed matches.
Which F-score should I use: F1, F2, or F0.5?: Use F1 when both types of errors are equally costly; it gives equal weight to precision and recall. Use F0.5 when the cost of wrongly merging records is high, because it weighs precision twice as much as recall. Use F2 when missing actual matches is more problematic than making a few wrong matches, because it weighs recall twice as much as precision. The general formula is F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall).
Can you give a worked precision, recall and F1 example?: With a database of 1000 customer records and 100 actual duplicate pairs, a system that identifies 90 pairs of which 80 are correct (true positives), 10 are incorrect (false positives), and which misses 20 actual matches (false negatives), has Precision = 80/90 = 89%, Recall = 80/100 = 80%, and F1 Score = 2 * (0.89 * 0.80) / (0.89 + 0.80) = 84%. For the same numbers, the F2 score is 82% (favouring recall) and the F0.5 score is 87% (favouring precision).
Should I measure precision and recall at the pair level or the cluster level?: Measure both. The classic pairwise approach scores each candidate pair, but research on evaluating entity resolution systems argues for an entity-centric framework that estimates both cluster-level and pairwise precision and recall, because pair-based metrics alone can misrepresent how well records are grouped into correct entities.
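To illustrate how the pair-level and cluster-level views can diverge, the sketch below scores the same predicted clustering both ways. It uses one simple cluster-level definition, counting a predicted cluster as correct only when it exactly equals a true cluster; that definition is an assumption for illustration and is not necessarily the estimator used in the research mentioned above. The record IDs and helper names are hypothetical.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Expand clusters of record IDs into the set of unordered pairs they imply."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall(predicted_items, true_items):
    """Precision and recall over two sets of comparable items."""
    tp = len(predicted_items & true_items)
    precision = tp / len(predicted_items) if predicted_items else 0.0
    recall = tp / len(true_items) if true_items else 0.0
    return precision, recall


# The system wrongly pulled record "c" into the {"a", "b"} entity.
predicted = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
truth = [{"a", "b"}, {"c"}, {"d", "e"}, {"f"}]

pair_p, pair_r = precision_recall(
    pairs_from_clusters(predicted), pairs_from_clusters(truth)
)
cluster_p, cluster_r = precision_recall(
    {frozenset(c) for c in predicted}, {frozenset(c) for c in truth}
)
print(f"pairwise: precision={pair_p:.2f} recall={pair_r:.2f}")
print(f"cluster:  precision={cluster_p:.2f} recall={cluster_r:.2f}")
```

Here pairwise recall is perfect because every true pair was found, yet only half of the true entities were reconstructed exactly, which is the kind of gap that pair-only reporting can hide.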