You are here: Commands > Duplicates

Duplicates

Menu: Found in the ANALYZE menu

The Duplicates command is used to detect whether key fields in the current file contain duplicates.

The duplicates command offers four option for detecting duplicates:

•

Exact - the default, all key fields must have the same value

•

Same-Same-Different - all key fields must have the same value except the last key field chosen which must be different

•

Same-Same-Similar - all key fields must have the same value except the last key field chosen which must be within a specified difference calculated using the Damerau-Levenshtein distance. The default difference is 0. For more detail, see DIFFERENCE().

This option automatically converts a numeric or date key to a character key and then uses the SIMILAR() function to harmonize the value of the last key to ensure a meaningful Damerau-Levenshtein distance can be calculated. For more detail, see SIMILAR().

•

Same-Same-Near - all key fields must have the same value except the last key field chosen which must be within a specified difference. The default difference is 1. The difference calculated is:

•

The Damerau-Levenshtein distance for character fields (using the DIFFERENCE() function). For more detail, see DIFFERENCE().

•

A zero or positive numeric value for numeric fields

•

A zero or positive number of days for date fields

Tip: For the Same-Same-Near or Similar duplicates test, exact matches can be suppressed.

To better understand using NEAR and SIMILAR options see NEAR() versus SIMILAR() Functions.

By default, the result of the Duplicates command is to screen in tabular format. Each key with duplicate values is listed along with a count specifying the number of keys with the same value. Clicking on the count value for any key will drill down and display a View of all records with this key value.

Understanding Duplicate Results using Near and Similar

When performing a Duplicates command using the Same-Same-Near (or the Same-Same-Similar), the Duplicates command performs an exhaustive many-to-many style comparison within each "same" group.

As a result, care should be taken when your data has many records that are "near" or "similar" within any grouping. In this case you may end up with a larger number of outputted records than you expected, because each record in a group is compared to each of the others. There are potentially more duplicate record pair combinations than there are actual records.

For example, if testing 4 records with duplicate same-same key values, the records are compared as follows:

•

Record 1 to Record 2

•

Record 1 to Record 3

•

Record 1 to Record 4

•

Record 2 to Record 3

•

Record 2 to Record 4

•

Record 3 to Record 4

So for 4 records tested, there are 6 possible comparisons. For 5 records there would be 10 possible comparisons. For 10 records there would be 45 possible comparisons. And so on. As the number of records in the input table increases, the number of potential comparisons grows substantially larger and the number of potential duplicate record pair sets written to the output table also grows.

Note: The actual formula is n(n-1)/2 so for 1000 records, the maximum number of output records would be 499,500 (essentially 500 times the output records to input records based on all records being compared to each other all records being within the specified Near or Similar difference).

If the value of the last duplicate key in any of the compared records is within the specified Near or Similar difference, then data from the compared records is outputted to a single record. This would necessarily include any compared records where the last duplicate key was an exact match.

This only presents an issue if there are a large number of duplicate records where the value of the last duplicate key occurs within the specified Near or Similar difference, or has exactly the same value, because the resulting output table of duplicates would then contain substantially more records than in the original table.

It is important to understand the relationship of the records within the source table, and to assess the type of duplicates being tested for prior to embarking on this type of Duplicates test. Performing Duplicates tests using the Same-Same-Near and Same-Same-Similar options is best done when there is expected to be a small number of actual duplicates that are Near or Similar to each other.

Parameters

The Duplicates command has the following common command parameters: Append, First, If, Open, Next, Noformat, Presort, Screen, To and While. For a description, see Command Parameters.

On

Specifies the key field(s) to assess for duplicates. In the command dialog, select fields from the “Field(s) to test for Duplicates” list box or click [Choose] to display the Selected Fields dialog. For more information on the Selected Fields dialog, see Selection/Edit Dialog Boxes.

Other

Allows you to specify a list of fields or expressions you want to include in the output for a Duplicates test (same or different). In the command dialog, click the [More] button and select fields from the “List Fields” list box or click [Choose] to display the Selected Fields dialog.

Note: Listing other fields is only available when outputting to a data file. This option is not available for screen output, however you can drill-down on the duplicates count in the tabular output to see all duplicate records in the View with all fields.

Different

Used to perform a “same-same-different” duplicates test. By default, specified field(s) are only outputted to the new data file for duplicate records where the last duplicate field (numeric, character or date) is different (i.e. not the same).

In the command dialog, click the [More] button and select the “Different” radio button in the “Last Duplicates Field Is” group box.

Note: This option is only available when outputting results to a data file.

Note: The last duplicate field cannot be a logical field.

Similar

Used to perform a “same-same-similar” duplicates test. This parameter allows you to identify only duplicate records on key value(s) where the value of the last duplicate field is similar but not necessarily exactly the same. In the command dialog, click the [More] button and select the “Similar” radio button in the “Last Duplicates Field Is” group box.

Note: This option is only available when outputting results to a data file.

Once the “Similar” radio button is selected, a specific Damerau-Levenshtein distance must be specified. The difference is specified as either zero or a positive whole number representing the Damerau-Levenshtein distance. For more information on the Damerau-Levenshtein distance, see DIFFERENCE().

Any records where the last duplicate field (character, numeric or date) is within this specified difference are deemed to be “similar”. Numeric and date fields are internally converted to strings for difference comparison.

Note: The keys selected for the same-same-similar fuzzy duplicates test are assumed to be unique. If they are not unique and the number of fuzzy duplicates identified exceeds 10% of the number of records in the file, Analyzer will stop the Duplicates command and warn the user that "duplicate keys are not sufficiently unique".

Similarity is based on use of the SIMILAR() function. For details on how this function works, see SIMILAR().

By default, specified field(s) are only outputted to the new data file for duplicate records where the last duplicate field (numeric, character or date) is within the specified near value.

For the duplicates SIMILAR test, duplicate records are outputted in pair sets, one pair set per output record.

Note: This option is only available when outputting results to a data file.

Note: The last duplicate field cannot be a logical field.

Near

Used to perform a “same-same-near” duplicates test. This parameter allows you to identify only duplicate records on key value(s) where the value of the last duplicate field is near but not necessarily exactly the same. In the command dialog, click the [More] button and select the “Near” radio button in the “Last Duplicates Field Is” group box.

Note: This option is only available when outputting results to a data file.

Once the “Near” radio button is selected, a specific difference must be specified. The determination of the difference specified will be dependant on the data type of the last field being compared (the field cannot be a logical field).

•

for numeric fields, the difference is specified as either zero or a positive number

•

for datetime fields, the difference is specified as either zero or a positive number of days. For measuring differences in hours minutes or seconds, difference must be a expressed as a fraction of a day, most typically using an elapsed time literal.

•

for character fields, the difference is specified as either zero or a positive whole number representing the Damerau-Levenshtein distance. For more information on the Damerau-Levenshtein distance, see DIFFERENCE().

Any records where the last duplicate field is within this specified difference are deemed to be “near”.

Note: The keys selected for the same-same-near fuzzy duplicates test are assumed to be unique. If they are not unique and the number of fuzzy duplicates identified exceeds 10% of the number of records in the file, Analyzer will stop the Duplicates command and warn the user that "duplicate keys are not sufficiently unique".

By default, specified field(s) are only outputted to the new data file for duplicate records where the last duplicate field (numeric, character or date) is within the specified near value.

For the duplicates NEAR test, duplicate records are outputted in pair sets, one pair set per output record.

Note: This option is only available when outputting results to a data file.

Note: The last duplicate field cannot be a logical field.

Skip

Excludes duplicate records from the output and only displays the number of duplicates identified. In the More tab of the command dialog, check the “Only show total duplicate count” checkbox.

Suppress

Excludes exact duplicates when performing the Same-Same-Near or Similar duplicates test. In the More tab of the command dialog, check the “Suppress exact duplicates” checkbox.

Command Mode Syntax

DUPLICATES <SKIP> <ON> key-field(s)...

<DIFFERENT|NEAR|SIMILAR>

<SUPPRESS> <NOFORMAT> <ERRORLIMIT n> <APPEND> <OPEN>¿

Note: You cannot use the presort keyword within a Group command in a procedure.

Note: The errorlimit keyword is only available on the command line and is only used in very rare circumstances to specify how many fuzzy duplicates (Near or Similar options) are allowed as a percentage of the number of records in the file before the user is warned that "duplicate keys are not sufficiently unique" and the Duplicates command is cancelled. By default, the errorlimit is set to 10 (i.e. 10 percent of the record count of the file).

Tip: When the Duplicates result is to screen, in the Command Log you can click the count value for any duplicate key to View all duplicates for that key.