Testing for duplicates

Concept Information

DUPLICATES command

Duplicate values in one or more fields, or duplicate records, can be the result of data entry errors or fraudulent activity such as splitting credit card transactions to avoid scrutiny.

Requirement for unique values

Fields that should never contain duplicates are ones where the values uniquely identify records. For example, an employee table should never have duplicate employee numbers because each number should identify a unique employee.

Valid duplicates

Duplicate values may also be valid. For example, a transaction table could have duplicate customer numbers because of multiple transactions by the same customers.

Different types of duplicates testing

You can use Analytics to test for duplicates in the following ways:

Test scope Use this test when:
One field

All values in a particular field should be unique, such as employee numbers, or check numbers.

Two or more fields in combination

Uniqueness is not a requirement of any field in isolation, but it is a requirement of certain fields in combination.

Example

In a payroll file covering a year, the employee number field and the pay date field are going to contain numerous duplicates. Employees get paid every two weeks, and multiple employees are paid on the same date.

However, an individual employee should appear only once for a particular date. If a duplicate exists across the employee number and pay date fields in combination, an employee may have been paid twice for the same pay period.

All fields in a record

Checking for entire duplicate records, in which every field in a record is duplicated. Entire duplicate records could be the result of data entry error, or other transactional irregularities.

Sorting and duplicates

Generally, you should only test for duplicates using a sorted key field or fields. Duplicate values in a key field are only found if they are immediately adjacent.

If you test for duplicates using an unsorted key field, non-adjacent duplicate values are not reported as duplicates. If two or more clusters of the same duplicate value exist, they are reported as duplicates, but in separate groups.

Depending on your analysis goal, it may make sense to test for duplicates using an unsorted key field. For example, you may want to find only those duplicate values that are immediately adjacent in the source table, and ignore duplicate values that are non-adjacent.

Including the Group Number field in the output table

You have the option of including the Group Number field in the duplicates output table. The field assigns a sequentially incremented number to each unique group of duplicates. The ability to reference groups of duplicates by number can be useful when you analyze data in the output table.

Filter duplicates output table by group number

You use several key fields in combination to test an accounts payable table for duplicate records:

  • vendor number
  • invoice number
  • invoice date
  • invoice amount

You want to filter the resulting duplicates output table so that only some of the groups of duplicates are subject to additional processing.

To create a filter using the combination of key fields would be laborious. For example:

SET FILTER TO ((Vendor_No = "11475") AND (Invoice_No = "8752512") AND (Invoice_Date = `20191021`) AND (Invoice_Amount = 7125.80)) OR ((Vendor_No = "12130") AND (Invoice_No = "589134") AND (Invoice_Date = `20191117`) AND (Invoice_Amount = 10531.71)) OR ((Vendor_No = "13440") AND (Invoice_No = "5518912") AND (Invoice_Date = `20191015`) AND (Invoice_Amount = 11068.20))

Instead, you achieve the same result by creating a filter based on group number:

SET FILTER TO MATCH(GROUP_NUM, 3 , 8, 11)

Steps

You can test one or more fields in the active table to detect whether duplicate values or entire duplicate records exist.

Remove duplicates

You can use the summarize operation to remove duplicate values or records from a data set and save the remaining unique values or records to a new Analytics table.