Specifying a Damerau-Levenshtein Distance

When you specify a Damerau-Levenshtein distance, whether for “near” or “similar” tests in the Duplicates command or for a filter built with the NEAR() or SIMILAR() functions, it is important to understand the implications of smaller versus larger distances.

The smaller the distance specified, the more likely it is that each result is a genuine match. Specifying a smaller distance results in fewer false positives but also increases the chance of missing some likely matches.

The larger the distance specified, the less likely it is that each result is a genuine match. Specifying a larger distance results in more false positives but also reduces the chance of missing some likely matches.

For example, using a distance of 6 would result in both of the following NEAR() comparisons being deemed true:

NEAR("ANDREW J. SMITH", "A.J. SMITH",6) = T

NEAR("K.J. JONES","A.J. SMITH",6) = T

The first example is a likely match, while the second is clearly a false positive.

The challenge, then, is to determine an optimal distance based on the data being compared. Choosing a larger distance lets you use subsequent filtering to drill down incrementally and exclude the most unlikely matches. Choosing a smaller distance prevents most false positives but may, in some circumstances, cause likely matches to be missed, as in the example below:

NEAR("ANDREW J. SMITH", "A.J. SMITH",5) = F
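The NEAR() results above can be checked by hand with a small Python sketch. It computes the optimal string alignment distance, a common restricted variant of Damerau-Levenshtein, and is not the product's own implementation. The names osa_distance and is_near are hypothetical, and the inclusive comparison (distance less than or equal to the limit) is an assumption that happens to agree with the T and F results shown.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def is_near(a: str, b: str, max_distance: int) -> bool:
    """Assumed rule: a match when the distance does not exceed the limit."""
    return osa_distance(a, b) <= max_distance


print(osa_distance("ANDREW J. SMITH", "A.J. SMITH"))  # 6
print(osa_distance("K.J. JONES", "A.J. SMITH"))       # 6
print(is_near("ANDREW J. SMITH", "A.J. SMITH", 5))    # False
```

Both pairs sit at exactly the same distance, so no single limit can accept the likely match while rejecting the false positive.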

With SIMILAR(), a smaller distance can be used on the same examples to weed out some false positives:

SIMILAR("ANDREW J. SMITH", "A.J. SMITH",5) = T

SIMILAR("K.J. JONES","A.J. SMITH",5) = F

In this case, because SIMILAR() performs some harmonization of the data before assessing the difference, a lower distance can be used, which helps reduce false positives.

As a best practice, consider using the NORMALIZE() function to harmonize name fields and the SORTNORMALIZE() function to harmonize address fields before performing “near” and “similar” comparisons. Harmonizing the data reduces the distance required to find likely fuzzy matches.
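The effect of harmonization on distance can be illustrated with a rough Python sketch. The harmonize function here is a hypothetical stand-in (uppercase, punctuation replaced by spaces, whitespace collapsed); the actual rules applied by NORMALIZE() and SORTNORMALIZE() are not described in this topic and may differ. The distance is plain Levenshtein without a transposition step, which is sufficient for these strings.

```python
import re


def harmonize(value: str) -> str:
    """Hypothetical harmonization: uppercase, drop punctuation, tidy spaces."""
    value = re.sub(r"[^\w\s]", " ", value.upper())
    return " ".join(value.split())


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


pairs = [
    ("Andrew J. Smith", "ANDREW J SMITH"),
    ("ANDREW J. SMITH", "A.J. SMITH"),
]
for left, right in pairs:
    raw = levenshtein(left, right)
    clean = levenshtein(harmonize(left), harmonize(right))
    print(f"{left!r} vs {right!r}: raw={raw}, harmonized={clean}")
```

Under these assumed rules, the first pair differs only in case and punctuation, so its distance drops from 10 to 0, while the second pair drops from 6 to 5.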