Testing for duplicates
Concept Information
Duplicate values in one or more fields, or duplicate records, can be the result of data entry errors or fraudulent activity such as splitting credit card transactions to avoid scrutiny.
Requirement for unique values
Fields that should never contain duplicates are ones where the values uniquely identify records. For example, an employee table should never have duplicate employee numbers because each number should identify a unique employee.
Valid duplicates
Duplicate values may also be valid. For example, a transaction table could have duplicate customer numbers because of multiple transactions by the same customers.
Different types of duplicates testing
You can use Analytics to test for duplicates in the following ways:
Test scope | Use this test when: |
---|---|
One field |
All values in a particular field should be unique, such as employee numbers, or check numbers. |
Two or more fields in combination |
Uniqueness is not a requirement of any field in isolation, but it is a requirement of certain fields in combination. ExampleIn a payroll file covering a year, the employee number field and the pay date field are going to contain numerous duplicates. Employees get paid every two weeks, and multiple employees are paid on the same date. However, an individual employee should appear only once for a particular date. If a duplicate exists across the employee number and pay date fields in combination, an employee may have been paid twice for the same pay period. |
All fields in a record |
Checking for entire duplicate records, in which every field in a record is duplicated. Entire duplicate records could be the result of data entry error, or other transactional irregularities. |
Sorting and duplicates
Generally, you should only test for duplicates using a sorted key field or fields. Duplicate values in a key field are only found if they are immediately adjacent.
If you test for duplicates using an unsorted key field, non-adjacent duplicate values are not reported as duplicates. If two or more clusters of the same duplicate value exist, they are reported as duplicates, but in separate groups.
Depending on your analysis goal, it may make sense to test for duplicates using an unsorted key field. For example, you may want to find only those duplicate values that are immediately adjacent in the source table, and ignore duplicate values that are non-adjacent.
Including the Group Number field in the output table
You have the option of including the Group Number field in the duplicates output table. The field assigns a sequentially incremented number to each unique group of duplicates. The ability to reference groups of duplicates by number can be useful when you analyze data in the output table.
Filter duplicates output table by group number
You use several key fields in combination to test an accounts payable table for duplicate records:
- vendor number
- invoice number
- invoice date
- invoice amount
You want to filter the resulting duplicates output table so that only some of the groups of duplicates are subject to additional processing.
To create a filter using the combination of key fields would be laborious. For example:
SET FILTER TO ((Vendor_No = "11475") AND (Invoice_No = "8752512") AND (Invoice_Date = `20191021`) AND (Invoice_Amount = 7125.80)) OR ((Vendor_No = "12130") AND (Invoice_No = "589134") AND (Invoice_Date = `20191117`) AND (Invoice_Amount = 10531.71)) OR ((Vendor_No = "13440") AND (Invoice_No = "5518912") AND (Invoice_Date = `20191015`) AND (Invoice_Amount = 11068.20))
Instead, you achieve the same result by creating a filter based on group number:
SET FILTER TO MATCH(GROUP_NUM, 3 , 8, 11)
Steps
You can test one or more fields in the active table to detect whether duplicate values or entire duplicate records exist.
You can test character, numeric, and datetime fields for duplicates. If letters and numbers appear together in a character field, all alphanumeric characters are tested.
Note
For results to be valid, the field(s) being tested must be in sequential order prior to testing. You can sort the field(s) in advance, or use the Presort option during the duplicates test.
Select the fields
- Open the table that you want to test for duplicates.
- Select .
- To detect duplicates in one or more fields:
- In the Main tab, select the field(s) to test from the Duplicates
On list, or click Duplicates On to select
the field(s), or to create an expression.
The order in which you select the fields is the order in which the columns appear in the results. If you are appending results to an existing Analytics table, the column selection and order must be identical to the column selection and order in the existing table.
- If you clicked Duplicates On, you can optionally specify a descending sort order in the output results for one or more selected fields by clicking the sort arrow (the default is ascending).
- Select one or more List Fields to
include any additional field(s) in the output results, or click List
Fields to select the field(s), to Add All fields,
or to create an expression.
Additional fields can provide useful context for the results. Fields selected for duplicates testing are displayed automatically at the beginning of any result records and do not need to be specifically selected under List Fields.
- Optional. Select Add Groups if you want to include the Group Number field in the output table.
The Group Number field assigns a sequentially incremented number to each unique group of duplicates.
- In the Main tab, select the field(s) to test from the Duplicates
On list, or click Duplicates On to select
the field(s), or to create an expression.
- To detect entire duplicate records:
- In the Main tab, click Duplicates On.
- Click Add All to add all fields to Selected Fields.
- Optionally specify a descending sort order in the output results for one or more fields by clicking the sort arrow (the default is ascending).
- Click OK.
You do not need to select any fields from the List Fields list because all fields in the table are displayed automatically in the result records.
- Optional. Select Add Groups if you want to include the Group Number field in the output table.
The Group Number field assigns a sequentially incremented number to each unique group of duplicates.
Exclude records from processing (optional)
If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.
Note
The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).
The IF statement considers all records in the view and filters out those that do not meet the specified condition.
Deselect Presort (optional)
If the field or fields you are testing are already sorted based on a prior operation, you can optionally deselect Presort to save time when testing large tables for duplicates.
Note
If you deselect Presort, the field or fields you select for duplicates testing must match the field or fields that were previously sorted for results to be valid.
The message Warning: File out of sequence accompanies results if there is a mismatch between selected and sorted fields. If you output results to an Analytics table, the warning message appears in the command log.
If data in field(s) is not sorted you must leave Presort selected to ensure that all duplicates are found.
Configure the output
- Click the Output tab.
-
Select the appropriate output option in the To panel:
- Screen – Select this option to display the results in the Analytics display area.
Tip
You can click any linked result value in the display area to drill down to the associated record or records in the source table.
If the output table contains a large number of records, it is faster and more useful to save the results to a file than to display the results on the screen.
- Print – Select this option to send the results to the default printer.
- Graph – Select this option to create a graph of the results and display it in the Analytics display area.
- File – Select this option to save or append the results to a text file. The file is saved outside Analytics.
Note
Output options that do not apply to a particular analytical operation are disabled.
- Screen – Select this option to display the results in the Analytics display area.
-
If you selected File as the output type, specify the following information in the As panel:
- File Type – Select Analytics Table to save the results to a new Analytics table, or append the results to an existing Analytics table. Select ASCII Text File or Unicode Text file (depending on which edition of Analytics you are using) to save or append the results to a text file.
- Name – Enter a file name in the Name text box. Or click Name and enter the file name, or select an existing file in the Save or Save File As dialog box to overwrite or append to the file. If Analytics prefills a file name, you can accept the prefilled name, or change it.
You can also specify an absolute or relative file path, or navigate to a different folder, to save or append the file in a location other than the project location. For example: C:\Results\Output.fil or Results\Output.fil.
- Local – Only enabled when connected to a server table and saving or appending the results to an Analytics table. Select Local to save the file to the same location as the project, or to specify a path or navigate to a different local folder. Leave Local deselected to save the file to the Prefix folder on a server.
Note
For output results produced from analysis or processing of AX Server tables, select Local. You cannot deselect the Local setting to import results tables to AX Server.
-
Depending on the output type, you can optionally specify a Header and/or a Footer in the text box(es).
Headers and footers are centered by default. Type a left angle bracket (<) before the header or footer text to left align the text. Click Header or Footer to enter a header or footer of more than one line. Alternatively, you can enter a semi-colon (;) as a line-break character in the header or footer text box. Left aligning multiple lines requires a left angle bracket at the beginning of each line.
Specify the scope of the operation
- Click the More tab.
-
Select the appropriate option in the Scope panel:
- All
- First
- Next
- While
Show me moreAll This option is selected by default. Leave it selected to specify that all records in the view are processed. First Select this option and enter a number in the text box to start processing at the first record in the view and include only the specified number of records. Next Select this option and enter a number in the text box to start processing at the currently selected record in the view and include only the specified number of records. The actual record number in the leftmost column must be selected, not data in the row. While Select this option to use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria. You can enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder.
A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered. You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note
The number of records specified in the First or Next options references either the physical or the indexed order of records in a table, and disregards any filtering or quick sorting applied to the view. However, results of analytical operations respect any filtering.
If a view is quick sorted, Next behaves like First.
Finalize the settings
- If you selected File as the output
type, and want to append the output results to the end of an existing
file, do one of the following:
Select Append To Existing File if you are appending to a text file, or to an Analytics table that you are certain is identical in structure to the output results.
Leave Append To Existing File deselected if you are appending to an Analytics table and you want Analytics to compare the record lengths of the output results and the existing table. If the record lengths are not identical, the data structure is not identical, and the append will not work correctly.
Note
Leaving Append To Existing File deselected is recommended if you are uncertain whether the output results and the existing table have an identical data structure. For more information about appending and data structure, see Appending output results to an existing table.
-
If you selected File (Analytics Table) as the output type, select Use Output Table if you want the output table to open automatically upon completion of the operation.
- Click OK.
Note
Only the duplicate values or records are displayed, and not the initial occurrence of a value or record, if you do both of the following:
- output the results to screen or a text file
- include only tested fields in the output results and do not select any additional fields
If you output to screen, you can click any value to see the initial occurrence of a value or record along with the duplicates.
- If the overwrite prompt appears, select the appropriate
option.
If you are expecting the Append option to appear and it does not, click No to cancel the operation and see Appending output results to an existing table.
Remove duplicates
You can use the summarize operation to remove duplicate values or records from a data set and save the remaining unique values or records to a new Analytics table.
Select the fields
- Open the table that you want to remove duplicates from.
- Select .
- On the Main tab, do one of the
following:
- Select the field or fields that may contain duplicate values from the Summarize On list.
- Click Summarize On to select the field or fields, or to create an expression.
The order in which you select the fields is the order in which the columns appear in the results.
Note
Select the appropriate fields to achieve your required degree of uniqueness.
For example, if you want to remove duplicate employee records and you select only the last name field, you risk removing all records for employees who have the same last name, but a different first name. Select both the last name and the first name fields to increase the degree of uniqueness.
To remove only perfectly duplicate records, click Summarize On and Add All.
- Do not select any Subtotal Fields.
- Optional. Do one of the following:
- From the Other Fields list, select the other field(s) to include in the output results.
- Click Other Fields to select the field(s), or to create an expression.
Note
Select only fields that contain the same value for all records in each summarized group. For more information, see The Other Fields option.
Deselect Presort (optional)
If the field that may contain duplicate values is already sorted, you can optionally deselect Presort. If the data in the field is not sorted, you must leave Presort selected to ensure valid results.
Exclude records from processing (optional)
If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.
Note
The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).
The IF statement considers all records in the view and filters out those that do not meet the specified condition.
Configure the output
- Click the Output tab.
- In the To panel, select File.
- Specify the following information in the As panel:
File Type – Analytics Table is the only option. Saves the results to a new Analytics table, or appends the results to an existing Analytics table.
Name – Enter a table name in the Name text box. Or click Name and enter the table name, or select an existing table in the Save or Save File As dialog box to overwrite or append to the table. If Analytics prefills a table name, you can accept the prefilled name, or change it.
You can also specify an absolute or relative file path, or navigate to a different folder, to save or append the table in a location other than the project location. For example: C:\Results\No_duplicates.fil or Results\No_duplicates.fil.
Local – Only enabled when connected to a server table. Select Local to save the output table to the same location as the project, or to specify a path or navigate to a different local folder. Leave Local deselected to save the output table to the Prefix folder on the Analytics server.
Note
For output results produced from analysis or processing of Analytics Exchange server tables, select Local. You cannot use the Local setting to import results tables to AX Server.
Specify the scope of the operation
- Click the More tab.
-
Select the appropriate option in the Scope panel:
- All
- First
- Next
- While
Show me moreAll This option is selected by default. Leave it selected to specify that all records in the view are processed. First Select this option and enter a number in the text box to start processing at the first record in the view and include only the specified number of records. Next Select this option and enter a number in the text box to start processing at the currently selected record in the view and include only the specified number of records. The actual record number in the leftmost column must be selected, not data in the row. While Select this option to use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria. You can enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder.
A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered. You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note
The number of records specified in the First or Next options references either the physical or the indexed order of records in a table, and disregards any filtering or quick sorting applied to the view. However, results of analytical operations respect any filtering.
If a view is quick sorted, Next behaves like First.
Finalize the settings
-
Select Use Output Table if you want the output table to open automatically upon completion of the operation.
- If you want to append the output results to the end of
an existing Analytics table, do one of the following:
Select Append To Existing File if you are certain the output results and the existing table are identical in structure.
Leave Append To Existing File deselected if you want Analytics to compare the record lengths of the output results and the existing table. If the record lengths are not identical, the data structure is not identical, and the append will not work correctly.
Note
Leaving Append To Existing File deselected is recommended if you are uncertain whether the output results and the existing table have an identical data structure. For more information about appending and data structure, see Appending output results to an existing table.
- Click OK.
- If the overwrite prompt appears, select the appropriate
option.
If you are expecting the Append option to appear and it does not, click No to cancel the operation and see Appending output results to an existing table.