Fuzzy Duplicates Between Different Tables

You are here: Fuzzy Matching > Fuzzy Duplicates Between Different Tables

Fuzzy Duplicates Between Different Tables

Testing for duplicates between tables is most commonly done when testing master tables between systems, or when testing transactional tables to a master list or table.

In either case, there are three standard steps to efficiently and effectively achieve a meaningful fuzzy duplicates result:

Harmonize the keys. For name fields, use NORMALIZE(); for address fields, use SORTNORMALIZE(). Additionally, use the optional substitution file parameter to more closely harmonize names or addresses by standardizing variations in abbreviations and their long forms into one standard abbreviation (like STREET, ST., ST to ST or DOCTOR, DOC, DR. TO DR).

Note: By closely harmonizing the keys, a higher degree of success can be achieved in finding fuzzy duplicates.

Extract the necessary fields from both tables. This will cause any computed fields using NORMALIZE() or SORTNORMALIZE() to be processed and resolved in order to create raw data fields which will improve subsequent processing of the Join command.

Note: This is important, as these functions are computationally heavy, and, if processed as part of the Join, may dramatically slow the resulting Join.

Perform the appropriate type of Join command to bring the relevant data from the two extracted tables into one table for final comparison:

•

Where common keys exist in both tables, use one of the standard Join options (most typically Join Matched or Join All Primary) to bring the tables together.

•

Where no common keys exist in both tables, use one of the Many-to-Many Joins to bring the tables together:

•

Matched Many-To-Many - when there is no direct common key but there is a reasonable indirect key (like zip code or state), use this option to create a more manageable sized result table based on the indirect key as the Join key for each table

•

Many-to-Many - when there is no direct common key and no reasonable indirect key, use this option to create a result table. Be aware that this Join type creates the largest possible result (for example, if both tables contain 1,000 records, then the result table will contain 1,000 times 1,000 records or 1,000,000 records. For larger files, the resulting table size will be exponentially larger)

•

When using either of the Many-to-Many Join options, strongly consider applying a filter to the Join using either the NEAR() or SIMILAR() functions on the harmonized keys from both tables. This will enable creation of a much smaller result set that only contains the most likely matched records, rather than all matched records.

Note: Analyzer is unique among common data analysis packages as it allows a filter to be built that references fields from both the primary and secondary tables for either of the Many-to-Many Join options. This is significant as it will dramatically increase performance while potentially dramatically reducing the resulting table size.