Does the US have a duplicate voter problem?

We analyzed ~50 million US voter profiles from 7 US states to look for duplicate registered voters and found nearly 400,000 duplicates, representing 0.8% of the voting population.
We examined this at state and county level and found significant regional variation in the rate of voter duplicates, as well as differences based on political affiliation.
In some Michigan counties, we found voter registration levels above 90% of the population, suggesting that deceased voters are not being efficiently removed from the voter data. Meanwhile, we found registration rates as low as 26% in Arkansas, suggesting that voter engagement has room for improvement.
Lastly, we looked for cases of actual voter fraud in Ohio and Pennsylvania and found at least 61 cases where an individual had voted twice, either in the same county, different counties within the same state, or even both in Ohio and Pennsylvania.

UPDATE: actually there is less of a problem than we thought. Or actually there is a different problem.

When voters move

When a registered US voter changes their address, they are obliged to update the state board of elections by submitting a new voter registration form. However, the system does not work perfectly, resulting in voters that are registered to vote more than once (“duplicates”), whether in the same county, in different counties within the same state, or in two different states.

In states where the voting results are extremely close, such as Georgia, high levels of duplicate voter registrations may erode trust in electoral outcomes.

To investigate how widespread this duplicate voter problem is, we analyzed US voter data in seven US states where the data is publicly available¹: Georgia, Florida, Michigan, North Carolina, Pennsylvania, Ohio and Arkansas. Together these represent over 49.5 million registered voters.

Using Tilores “identity resolution” technology, we deduplicated the voter lists across these seven states. This means that every individual registered voter was compared to all others in the dataset, using the name, address, date of birth, and other fields to look for duplicates.

Identity resolution technology works by using similarity algorithms (aka “fuzzy data matching”) to look for voter records that have small differences in the name or address field, most likely caused by spelling mistakes or data entry errors in either the original or new voter registration data.

Such minor mistakes in a name would make a voting record look unique, but by allowing for a small degree of variance we can be confident that these records are in fact duplicates of each other. Duplicates may occur, for example, when an individual changes address and registers to vote at their new address, but they remain registered at the previous address.
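As a rough illustration of what “allowing for a small degree of variance” can mean, here is a minimal sketch using Python's standard-library `difflib`. It is not the Tilores implementation; the function names and the 0.85 threshold are assumptions chosen only for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same_name(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as a probable match when they are similar enough.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return similarity(a, b) >= threshold
```

With this sketch, “Phillip Jonson” and “Philip Johnson” score high enough to be treated as a probable match, whereas two unrelated names do not. In practice a name match alone would not be enough; it would be combined with address and date of birth.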

Close calls in Pennsylvania and Georgia

Overall we found 394,396 voters with duplicate registrations, representing 0.8% of the total voting population in our sample.

Florida had the highest number of duplicate voters in both absolute and proportional terms, with 148,516 duplicate voter records found, representing approximately 1.1% of its voting population.

Pennsylvania was the only other state in our sample with a duplicate rate above 1% (1.01%), with 80,142 duplicates. In the 2020 presidential election, the Democrats won there by 80,555 votes.

In Georgia, the state that had the second closest result in the 2020 presidential election (after Arizona), we found 51,876 (0.73%) duplicate voters. The Democrats won Georgia in that election by a margin of 11,779 votes.

Figure 1a) Absolute number of active duplicates per state.

Figure 1b) Active duplicates per state as a % of the total voting population.

County level variation

Significant variation was also seen in the level of duplicates at the county level. For example, although Arkansas had an overall duplicate rate of only 0.92%, one county, Searcy, with a voting population of 5,805, had a duplicate rate of 2.06% (121 individuals).

In Florida, Hardee County, where 13,654 voters are registered, had 314 duplicates (2.3%), whilst Osceola County, with 214,588 registered voters, had 5,376 duplicates (1.94%).


Figure 2: Duplicate voters at the county level. Dark blue represents a higher duplicate rate, light green represents a lower duplicate rate.

Republicans vs Democrats

In all the sampled states, except Michigan, we were able to look at the split of duplicate voter registrations by registered political party affiliation.

Looking at the voters that identified as either Democratic or Republican voters, we found that Democratic voters had a duplicate rate of 0.96%, whilst Republican voters had a duplicate rate of 0.89%, representing 121,985 and 120,384 profiles respectively.

Figure 3a: Party affiliation of duplicate voter records.

Figure 3b: Party affiliation of duplicate voter registrations in each state as a percentage of overall voters identifying with either political party.

The dead may be voting

We also noticed significant differences between the number of registered voters in certain counties and the general population of those counties. This was most apparent in Michigan, where several counties had voter registration levels of over 90%, which is not feasible when 22% of the US population is under 18². This suggests that Michigan is not checking its voter lists against death notifications as thoroughly as other states - another problem that could be solved with the use of identity resolution technology.

At the other end of the spectrum, Arkansas had an unusually low voter registration rate, with some counties, such as Lincoln, having a voter registration rate as low as 26.02%.
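The feasibility argument above is simple arithmetic: if 22% of the population is under 18, at most about 78% can be registered to vote. A minimal sketch of that check follows. The function names are our own, and any county figures passed in are hypothetical inputs, not data from our analysis.

```python
UNDER_18_SHARE = 0.22  # share of the US population under 18, per the Census source cited above


def registration_rate(registered_voters: int, population: int) -> float:
    """Registered voters as a fraction of the total population."""
    return registered_voters / population


def exceeds_adult_share(registered_voters: int, population: int,
                        under_18_share: float = UNDER_18_SHARE) -> bool:
    """True when more people are registered than could plausibly be adults."""
    return registration_rate(registered_voters, population) > 1 - under_18_share
```

A county with 9,000 registrations for 10,000 residents would be flagged, while one with a 26% registration rate would not. Using the national under-18 share for every county is a simplification, since the age profile varies locally.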

What about actual voting fraud?

Duplicate data is one thing, but it does not necessarily mean anything more nefarious than poor processes and bad voter data.

The bigger question is whether we detected any examples of individuals who actually voted twice - i.e. actual voter fraud.

The answer is yes - we did. Fortunately not a lot, but arguably even one case is one too many.

We only have actual voting data for Pennsylvania and Ohio, so we examined the duplicates in that subset to look for duplicate voters who actually voted twice, and found approximately 1,000 potential cases.

To be extra stringent with this analysis, we applied additional deduplication rules, narrowing the results down to 61 cases where we were certain voter fraud had occurred, and then reviewed them manually to be sure.

Of the 61:

31 are affiliated with Democrats
21 are affiliated with Republicans
6 are affiliated with both parties
3 are without affiliation

Two of the cases were Pennsylvania/Ohio cross-state cases, both involving absentee votes.

To give an example of a certain duplicate voter, an individual in Pennsylvania registered to vote in West Chester in October 2020, shortly before the presidential election. Just nine days later, the same individual registered to vote in Philadelphia. There was a minor difference in their name, but otherwise the same date of birth and phone number. The individual voted in both locations.

Another individual living in Normalville, Pennsylvania registered twice at the same address 16 years apart almost to the day. The only difference in their registered voter data was a minor difference in their date of birth. The individual voted under both profiles. This was an example of the most common case - where a small change (a few days or months) was made in the date of birth, while the name and address remained the same.

If we were to analyse the voting data for the other states in our sample, we would of course expect to find more cross-state duplicate voting examples.
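The check described in this section, looking within each group of duplicate registrations for more than one ballot cast in the same election, can be sketched as follows. This is an illustration only: the record layout (`identity_id`, `record_id`, `elections_voted`) is a hypothetical one, not the format of the state voting files or of Tilores output.

```python
from collections import defaultdict


def find_double_votes(records):
    """Return {identity_id: {election: [record_ids]}} for identities that
    have ballots recorded under more than one registration in one election.

    Each record is a dict with hypothetical keys:
      identity_id     - the deduplicated identity the registration belongs to
      record_id       - the individual voter registration
      elections_voted - set of election identifiers with a recorded ballot
    """
    by_identity = defaultdict(list)
    for record in records:
        by_identity[record["identity_id"]].append(record)

    flagged = {}
    for identity_id, registrations in by_identity.items():
        ballots = defaultdict(list)
        for registration in registrations:
            for election in registration["elections_voted"]:
                ballots[election].append(registration["record_id"])
        # Keep only elections where two or more registrations show a ballot.
        doubled = {e: ids for e, ids in ballots.items() if len(ids) > 1}
        if doubled:
            flagged[identity_id] = doubled
    return flagged
```

Anything flagged this way is only a candidate. As described above, the potential cases still need stricter matching rules and manual review before they can be called certain.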

How can the US improve its voter data?

The US voter data could be significantly improved through the use of identity resolution technology, such as Tilores. If every state uploaded their new and updated voter registration data on a regular basis, any duplicate voter registrations would be identified in milliseconds. In the case of cross-county and cross-state duplicates, clerks in both electoral offices, as well as the voter, could be notified to fix the registration data. This could be done in a privacy-preserving manner, such that electoral officials from each state could only see their own state's electoral data, but still receive an alert if a duplicate was generated from out of state.

Similarly, to remove the “dead” voters from the register, the death registries could be uploaded to an identity resolution system to identify matches between the death notices and voters on the electoral list that should be removed.

Can identity resolution technology preserve democracy in the US?

Identity resolution technology can be used not only to increase confidence in the quality of the voter data, but also to make sure more potential voters are able to vote.

In 2018, the State of Georgia came under fire for attempting to implement an “exact-match” regulation, meaning that an individual's voter registration had to exactly match their driver’s license or social security data³. Any small variance would have meant that their voting rights were placed in a “pending” status, until corrected.

There were 51,000 pending voter profiles in Georgia in 2018, and the majority of these were African Americans, Latinos, and Asian Americans, suggesting that minorities were being kicked off the voting list simply for having uncommon names that are often misspelled.

Several other states, including Wisconsin and Virginia, had already canceled or scaled back their “exact-match” requirements after realizing that they discriminated against minorities.

If a state electoral office wants to cross-validate electoral registers against other databases, such as social security or driving licenses, then an identity resolution system, such as Tilores, should be used that can accommodate variance in names caused by misspellings and data entry errors.

Why did we do this?

Tilores is a data infrastructure company, whose “identity resolution” technology deduplicates and links record data about individuals at scale, in real-time. The typical use cases for our technology are in fraud prevention, compliance and KYC (know your customer).

Our mission is to create the definitive “source of truth” about any given dataset, in real-time, so that organizations can make trusted decisions with their data. Arguably there is no more important data in society than voting data, so it is imperative that this data is clean, to maintain trust in democracy.

We seek out interesting datasets that we can deduplicate using our technology, especially when we think that the deduplication use case will prove useful to society. With the upcoming US presidential elections in 2024, we thought it important to show the problems with current voter registration data and demonstrate how easily these can be solved.

In a previous showcase, we deduplicated the United Kingdom’s official registry of company directors to show that of 10.1 million registered company directors in the UK, over 600,000 of them were duplicate profiles.

How does Tilores deduplication work?

Our identity resolution technology deduplicates record data (e.g. voter profiles) using a rules-based system based on text similarity, geographic proximity and temporal ranges.

Text similarity algorithms detect words that have minor character differences - such as Philip and Phillip - or those that are different but phonetically the same - such as Steven and Stephen. Geographic proximity matching uses the geographic coordinates of an address to say that if two addresses are within a certain proximity of each other, they might be considered a match. Temporal range matching looks for events, such as date of birth, that happen within a defined time period to consider them a match.
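To make the three techniques more concrete, here is a minimal standard-library sketch of one possible building block for each: edit distance for text similarity, great-circle (haversine) distance for geographic proximity, and a day-count window for temporal ranges. These are generic textbook techniques with function names of our choosing, not the actual Tilores matchers.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute if different
        prev = curr
    return prev[-1]


def distance_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two coordinates (haversine)."""
    earth_radius_m = 6_371_000.0  # mean Earth radius
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


def within_days(d1: date, d2: date, max_days: int) -> bool:
    """True when two dates are no more than max_days apart."""
    return abs((d1 - d2).days) <= max_days
```

Here `edit_distance("philip", "phillip")` is 1, which is the kind of minor character difference described above. Phonetic matching (Steven vs Stephen) is not covered by this sketch; it would need a phonetic encoding rather than a character-level comparison.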

It is the combination of these matching techniques across available data fields such as name, address and date of birth that creates deduplication rules in Tilores. Multiple different rules are generated to cover different potential duplication scenarios, and if one of these rules is triggered when comparing two records, then the records are considered duplicates of each other.
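The idea of several rules, any one of which is enough to mark two records as duplicates, can be sketched like this. The two rules, their names, and the thresholds are hypothetical examples made up for illustration; they are not the rules used in our analysis.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Voter:
    name: str
    address: str
    dob: date


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy text comparison; the threshold is an illustrative assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_address(a: Voter, b: Voter) -> bool:
    return a.address.lower() == b.address.lower()


def fuzzy_name_same_address_same_dob(a: Voter, b: Voter) -> bool:
    return similar(a.name, b.name) and same_address(a, b) and a.dob == b.dob


def same_name_same_address_close_dob(a: Voter, b: Voter) -> bool:
    return (a.name.lower() == b.name.lower() and same_address(a, b)
            and abs((a.dob - b.dob).days) <= 365)


# Hypothetical rule set: each rule covers one duplication scenario.
RULES = [fuzzy_name_same_address_same_dob, same_name_same_address_close_dob]


def first_matching_rule(a: Voter, b: Voter):
    """Return the name of the first rule that fires, or None if none do."""
    for rule in RULES:
        if rule(a, b):
            return rule.__name__
    return None
```

Returning the rule name rather than a plain yes/no keeps an explanation of why two records were linked, which makes manual review easier.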

Once two duplicate records are detected in Tilores, they are connected together as one identity, together with an explanation of which rule was triggered, and are thus deduplicated. These so-called “identity graphs” contain all the possible voter records about an individual person.
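One common way to turn pairwise matches into identities is to treat records as nodes, matches as edges, and take the connected components, for example with a union-find structure. The sketch below assumes that approach and a hypothetical `(record_a, record_b, rule_name)` match format; it is not a description of how Tilores stores its graphs.

```python
def build_identities(record_ids, matches):
    """Group records into identities from pairwise matches.

    record_ids: iterable of record identifiers
    matches:    iterable of (record_a, record_b, rule_name) tuples
    Returns a list of groups, each a list of record ids for one identity.
    """
    record_ids = list(record_ids)
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _rule in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())
```

Note that matches are transitive here: if record 1 matches record 2 and record 2 matches record 3, all three end up in the same identity even if records 1 and 3 were never matched directly.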

Footnotes

¹ Voter data from Michigan cost $23, from Pennsylvania cost $20 and from Georgia cost $250. A full list of US voter data accessibility can be seen here: https://www.ncsl.org/elections-and-campaigns/access-to-and-use-of-voter-registration-lists. Voter data was sourced between September and November 2023.

² Source: https://www.census.gov/library/stories/2021/08/united-states-adult-population-grew-faster-than-nations-total-population-from-2010-to-2020.html

³ https://www.bloomberg.com/news/articles/2018-10-15/how-georgia-s-exact-match-program-was-made-possible
