Testing for duplicates

Concept Information

Duplicate values in one or more fields, or duplicate records, can be the result of data entry errors or fraudulent activity such as splitting credit card transactions to avoid scrutiny.

Requirement for unique values

Fields that should never contain duplicates are ones where the values uniquely identify records. For example, an employee table should never have duplicate employee numbers because each number should identify a unique employee.

Valid duplicates

Duplicate values may also be valid. For example, a transaction table could have duplicate customer numbers because of multiple transactions by the same customers.

Different types of duplicates testing

You can use Analytics to test for duplicates in the following ways:

Test scope	Use this test when:
One field	All values in a particular field should be unique, such as employee numbers, or check numbers.
Two or more fields in combination	Uniqueness is not a requirement of any field in isolation, but it is a requirement of certain fields in combination. Example In a payroll file covering a year, the employee number field and the pay date field are going to contain numerous duplicates. Employees get paid every two weeks, and multiple employees are paid on the same date. However, an individual employee should appear only once for a particular date. If a duplicate exists across the employee number and pay date fields in combination, an employee may have been paid twice for the same pay period.
All fields in a record	Checking for entire duplicate records, in which every field in a record is duplicated. Entire duplicate records could be the result of data entry error, or other transactional irregularities.

Sorting and duplicates

Generally, you should only test for duplicates using a sorted key field or fields. Duplicate values in a key field are only found if they are immediately adjacent.

If you test for duplicates using an unsorted key field, non-adjacent duplicate values are not reported as duplicates. If two or more clusters of the same duplicate value exist, they are reported as duplicates, but in separate groups.

Depending on your analysis goal, it may make sense to test for duplicates using an unsorted key field. For example, you may want to find only those duplicate values that are immediately adjacent in the source table, and ignore duplicate values that are non-adjacent.

Including the Group Number field in the output table

You have the option of including the Group Number field in the duplicates output table. The field assigns a sequentially incremented number to each unique group of duplicates. The ability to reference groups of duplicates by number can be useful when you analyze data in the output table.

Filter duplicates output table by group number

You use several key fields in combination to test an accounts payable table for duplicate records:

vendor number
invoice number
invoice date
invoice amount

You want to filter the resulting duplicates output table so that only some of the groups of duplicates are subject to additional processing.

To create a filter using the combination of key fields would be laborious. For example:

SET FILTER TO ((Vendor_No = "11475") AND (Invoice_No = "8752512") AND (Invoice_Date = `20191021`) AND (Invoice_Amount = 7125.80)) OR ((Vendor_No = "12130") AND (Invoice_No = "589134") AND (Invoice_Date = `20191117`) AND (Invoice_Amount = 10531.71)) OR ((Vendor_No = "13440") AND (Invoice_No = "5518912") AND (Invoice_Date = `20191015`) AND (Invoice_Amount = 11068.20))

Instead, you achieve the same result by creating a filter based on group number:

SET FILTER TO MATCH(GROUP_NUM, 3 , 8, 11)

Steps

You can test one or more fields in the active table to detect whether duplicate values or entire duplicate records exist.

Remove duplicates

You can use the summarize operation to remove duplicate values or records from a data set and save the remaining unique values or records to a new Analytics table.

All	This option is selected by default. Leave it selected to specify that all records in the view are processed.
First	Select this option and enter a number in the text box to start processing at the first record in the view and include only the specified number of records.
Next	Select this option and enter a number in the text box to start processing at the currently selected record in the view and include only the specified number of records. The actual record number in the leftmost column must be selected, not data in the row.
While	Select this option to use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria. You can enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder. A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered. You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note The number of records specified in the First or Next options references either the physical or the indexed order of records in a table, and disregards any filtering or quick sorting applied to the view. However, results of analytical operations respect any filtering. If a view is quick sorted, Next behaves like First.

All	This option is selected by default. Leave it selected to specify that all records in the view are processed.
First	Select this option and enter a number in the text box to start processing at the first record in the view and include only the specified number of records.
Next	Select this option and enter a number in the text box to start processing at the currently selected record in the view and include only the specified number of records. The actual record number in the leftmost column must be selected, not data in the row.
While	Select this option to use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria. You can enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder. A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered. You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note The number of records specified in the First or Next options references either the physical or the indexed order of records in a table, and disregards any filtering or quick sorting applied to the view. However, results of analytical operations respect any filtering. If a view is quick sorted, Next behaves like First.

Testing for duplicates

Requirement for unique values

Valid duplicates

Different types of duplicates testing

Example

Sorting and duplicates

Including the Group Number field in the output table

Filter duplicates output table by group number

Steps

Remove duplicates

Page options

Is this page helpful?

Is this page helpful?

Testing for duplicates

Requirement for unique values

Valid duplicates

Different types of duplicates testing

Example

Sorting and duplicates

Including the Group Number field in the output table

Filter duplicates output table by group number

Steps

Select the fields

Exclude records from processing (optional)

Deselect Presort (optional)

Configure the output

Specify the scope of the operation

Finalize the settings

Remove duplicates

Select the fields

Deselect Presort (optional)

Exclude records from processing (optional)

Configure the output

Specify the scope of the operation

Finalize the settings

Page options

Is this page helpful?

Is this page helpful?