Fuzzy Duplicates Within The Same Table

You are here: Fuzzy Matching > Fuzzy Duplicates Within Same Table

Fuzzy Duplicates Within The Same Table

The most common use for fuzzy matching is to test for duplicates between records within the same table.

Use the Duplicates command to test for duplicate payments within a transaction table or to test for duplicate records within a master table.

There are three key steps to checking for fuzzy duplicates within the same table:

Create a computed field to isolate or harmonize the necessary data in the duplicate key field that is to be tested for similarity or nearness, rather than exactness.

•

Most commonly, use NORMALIZE() for harmonizing names, SORTNORMALIZE() for harmonizing addresses

•

Use INCLUDE() or EXCLUDE() to retain or eliminate desired characters

•

Use FORMAT() when looking for items containing the same formatting (e.g., phone numbers, zip codes)

Optional - for larger files, Extract the necessary fields. This will cause the harmonized computed field to be processed and resolved in order to create a raw data field which will improve subsequent processing of the Duplicates command.

Choose the Duplicates command and select the option that best achieves your needs:

•

Same-Same-Different - where all but the last duplicate key selected are the same and last key is simply different

•

Same-Same-Near - where all but the last duplicate key are the same and the last key is “near” but not necessarily the same

•

“Near” for character fields means the difference is within a user specified Damerau-Levenshtein distance. Use NORMALIZE() to standardize name fields and SORTNORMALIZE() to standardize address fields prior to testing for nearness

•

“Near” for numeric fields means the difference is within a user-specified numeric value (for example, invoice amounts)

•

“Near” for date fields means the difference is within a user specified number of days (for example, invoice dates)

•

Same-Same-Similar - where all but the last duplicate key are the same and the last key is “similar” but not necessarily the same.

Similar is different from Near, as the key is always treated as a character string (even if it is a numeric or date value), with the string upper cased, with any foreign characters converted to English equivalents, and with standard character to number transpositions made (for example, 3 for an E).

The Similar test is exceptionally well suited for keys that are primarily numeric, like invoice numbers and p.o. numbers, and for which user-entered transpositions of letters for numbers is likely.

For example, using the similar option, the value 723.45 when compared to 123.45 yields a Damerau-Levenshtein difference of 1, whereas using the near option, the value 723.45 when compared to 123.45 yields a numeric difference of 600.

•

Similar always means the difference is within a user-specified Damerau-Levenshtein distance

Note: By default, the Duplicates command performs an exact match unless one of the above options is selected.