You are here: Fuzzy Matching > Fuzzy Matching Tools

Fuzzy Matching Tools

Analyzer provides a number of useful commands and functions that can be used in isolation or in combination to support a variety of fuzzy matching scenarios. Whether you are looking for fuzzy matches within the same table or between tables, Analyzer enables users to choose and combine these tools in order to customize the fuzzy matching to meet their needs.

The following commands can be used to facilitate fuzzy matching between tables or within the same table:

Command

Type of Comparison

Objective

Join

Matched Many-to-Many

To combine records from two tables lacking a common key but where there is a reasonable indirect key like zip code or state

Join

Many-to-Many

To combine records from two tables lacking a common key or reasonable indirect key

Duplicates

Same-Same-Different

To test for duplicates within the same table where the value of the last key field is different

Duplicates

Same-Same-Near

To test for duplicates within the same table where the value of the last key field (numeric, date or character) is near, but not necessarily exactly the same, based on a specified level of difference appropriate for the field type

Duplicates

Same-Same-Similar

To test for duplicates within the same table where the value of the last key field when harmonized (numeric, date or character) is similar, but not necessarily exactly the same, based on a specified Damerau-Levenshtein distance

The following functions can be used to facilitate fuzzy matching:

Function

Behavior

ARRANGE()

Re-arranges characters, or optionally whole words, within a string in descending order based on the host machine’s character set

DIFFERENCE()

Compares two strings and returns a positive whole number. A zero indicates that two strings are identical, a small number indicates that two strings are very similar. A value equal to the length of the longer string indicates that two strings are different

NEAR()

Tests whether the difference between two fields falls within a specified difference (used on pairs of numeric, date or character fields)

NORMALIZE()

Replaces all contiguous non-alphanumeric characters (except blanks) with a blank, replaces all international characters with English equivalents, trims leading blanks, compacts all contiguous blank spaces into a single blank space, and upper cases the entire string; also allows option to retain specified characters and to make text substitutions

SIMILAR()

Tests whether the difference between two fields falls within the specified Damerau-Levenshtein distance (used on pairs of numeric, date or character fields)

Prior to testing the difference, this function takes a pair of character strings (or converts a specified pair of numeric or date values to character strings), upper cases the strings, removes any punctuation, and matches similar looking characters (e.g., 1 for I, 0 for O)

SORTNORMALIZE()

Replaces all contiguous non-alphanumeric characters (except blanks) with a blank, replaces all international characters with English equivalents, trims leading blanks, compacts all contiguous blank spaces into a single blank space, and upper cases the entire string; also allows option to retain specified characters and to make text substitutions. Finally, re-arranges remaining result in descending order of whole words based on using any remaining blanks as word separator

SOUNDEX()

Returns a four-character value for a string, which can be compared with the value of another string to see if both strings are phonetically similar using the Soundex algorithm

SOUNDSLIKE()

Indicates whether two strings sound phonetically alike using the Soundex algorithm

FIND()

Searches for a string inside one or more fields or the entire record. Also supports Google-type searches for a literal

LISTFIND()

Searches for a list of strings inside one or more fields or the entire record

INCLUDE() or EXCLUDE()

Includes or excludes specified characters (only digits, only alphabetic characters, etc.)

LEFT() or RIGHT()

Isolates specific number of leading or trailing characters

SUBSTRING()

Isolates specific number of characters with specified starting point in string

MAP() and FORMAT()

Measures a string’s format against a desired format or determines the format of a character string (e.g., zip code or phone number)