NEAR() versus SIMILAR() Functions

You are here: Fuzzy Matching > NEAR() versus SIMILAR() Functions

NEAR() versus SIMILAR() Functions

When performing fuzzy matching, it is important to understand the differences between the NEAR() and SIMILAR() functions in order to determine which function is best suited for the specific fuzzy comparison being made.

On character data, both NEAR() and SIMILAR() measure the difference between two character values using the Damerau-Levenshtein distance.

Where NEAR() and SIMILAR() differ are with their comparison of numeric and date fields. SIMILAR() converts all numeric and date fields to character values, whereas NEAR() function works on the original numeric and date field data as is. As a result, when NEAR() is used on numbers and dates, it compares the difference of the actual field "values" (the numeric difference for numeric values and the number of days for date values), while SIMILAR() compares their "representations" using a Damerau-Levenshtein distance.

Additionally SIMILAR() prepares the field data before testing. The preparation falls into three categories:

•

Conversion - involves turning numeric and data values into character values

•

Normalization - involves removing all blanks, leading zeros and punctuation, upper casing and de-internationalizing data

•

Standardization - involves standardizing similar "looking" characters, like 0 and O or 1 and I

This preparation of the data is a key aspect of SIMILAR() because it means that, in all cases, SIMILAR() compares the "representation" of the values, rather than the "values" themselves. For example, SIMILAR() will compare two numbers and test whether they "look similar, even though they are not close in value", like 123,456,789.01 and 123,546,789.01.

Because SIMILAR() compares "cleansed" representations, performing a SIMILAR() test with a difference of 0 is meaningful, whereas performing a NEAR() test with a difference of 0 is exactly the same as performing an "equal" test. This is because SIMILAR() is testing two "cleansed" values which may have originally been different, but only in an inconsequential manner (punctuation, case, foreign characters or character typos/transpositions).

As a result, there are several distinct situations where one would choose SIMILAR() over NEAR():

Accounting References

When comparing key values like invoice, G/L, customer, employee or part numbers then SIMILAR() is almost always a better choice than NEAR() for two reasons:

•

First, key fields would normally not be subjected to normalization, so inconsequential data entry variations (like a dash, comma or period) would result in measurable differences, even though there no difference in the intended representation. For example, a manual data entry clerk may or may not enter a dash from a source document, which can easily result in a comparison error. Since SIMILAR() normalizes this aspect of the data prior to comparison, inconsequential data entry is eliminated thereby improving the quality and accuracy of the comparison test.

•

Second, depending how the data to be compared is put into electronic form, it can easily be subject to a variety of misreading errors. Both people and automated OCR have a tendency to mistake 0 and O or 1 and I for example. Since SIMILAR() standardizes this aspect of the data prior to comparison, inconsequential data entry is eliminated thereby improving the quality and accuracy of the comparison test.

Eliminating inconsequential differences such as those described above should not measurably increase false positives, but should improve the quality and accuracy of the SIMILAR(　) comparison.

Equality

In this situation, we are talking about applying SIMILAR() with a difference of 0. This means that the "cleansed" values of each field (regardless of field type) are equal. For example, in an address field, the values "123 Main St" and "123 Main St." are deemed to have a difference of 1 using NEAR(). However, since SIMILAR() turns both of these values into "123 MAIN ST" prior to comparison, SIMILAR() deems these values equal (in other words having a difference of 0). As a result, using a SIMILAR() test with a 0 difference will broaden the comparison from a literal comparison into a fuzzy comparison, without appreciably increasing the chance of false positives. Clearly, a NEAR() comparison would be far less suitable for this kind of test, due to the false positives it may generate depending on the difference specified

Note: This is also true when performing a Duplicates command using the SAME-SAME-NEAR vs. SAME-SAME-SIMILAR options.

Alternative to NEAR() function

Assuming character data has already been "normalized", then SIMILAR() test of 1 and NEAR() test of 1 will generally produce about the same result, since they both measure a Damerau-Levenshtein distance. However, if the character data being compared is subject to the types of errors noted under "Accounting References" section, then SIMILAR() may produce a demonstrably better result because "inconsequential" differences would directly impact the quality of a NEAR() test (for example one significant difference combined with one "inconsequential" difference would result in an item being excluded from a NEAR() test of 1). Additionally, even if the data is normalized and most of these types of "inconsequential" errors have been removed, there is still the issue of the mis-reading and typo errors (like the 0 for O or 1 for I transposition). Once again, SIMILAR() test may produce a better result as these types of errors are also dealt with prior to comparison.

Non-Character Data

For non-character based data, SIMILAR() and NEAR() performing entirely different comparisons and test for different things. With NEAR() you are testing the "value" of items to see if the value is close, whereas with SIMILAR() you are testing the "representation" of items to see if the items "look" reasonably the same, essentially testing for data entry or scanning errors. Both tests are valid, depending on what you are testing for. In some instances, both tests are complimentary.

Tip: SIMILAR() inherently comes with a computational cost, so it should not be applied frivolously, only when there is any reasonable possibility of the kinds of differences/errors described above. In these cases, it may be that SIMILAR(　) test of 0 is preferable to an equal test, and SIMILAR(　) test of 1 may be preferable to a NEAR(　) test of 1.