Testing for duplicates
Concept Information
Duplicate values in one or more fields, or duplicate records, can be the result of data entry errors or fraudulent activity such as splitting credit card transactions to avoid scrutiny.
Requirement for unique values
Fields that should never contain duplicates are ones where the values uniquely identify records. For example, an employee table should never have duplicate employee numbers because each number should identify a unique employee.
Valid duplicates
Duplicate values may also be valid. For example, a transaction table could have duplicate customer numbers because of multiple transactions by the same customers.
Different types of duplicates testing
You can use Analytics to test for duplicates in the following ways:
Test scope | Use this test when: |
---|---|
One field |
All values in a particular field should be unique, such as employee numbers, or check numbers. |
Two or more fields in combination |
Uniqueness is not a requirement of any field in isolation, but it is a requirement of certain fields in combination. ExampleIn a payroll file covering a year, the employee number field and the pay date field are going to contain numerous duplicates. Employees get paid every two weeks, and multiple employees are paid on the same date. However, an individual employee should appear only once for a particular date. If a duplicate exists across the employee number and pay date fields in combination, an employee may have been paid twice for the same pay period. |
All fields in a record |
Checking for entire duplicate records, in which every field in a record is duplicated. Entire duplicate records could be the result of data entry error, or other transactional irregularities. |
Sorting and duplicates
Generally, you should only test for duplicates using a sorted key field or fields. Duplicate values in a key field are only found if they are immediately adjacent.
If you test for duplicates using an unsorted key field, non-adjacent duplicate values are not reported as duplicates. If two or more clusters of the same duplicate value exist, they are reported as duplicates, but in separate groups.
Depending on your analysis goal, it may make sense to test for duplicates using an unsorted key field. For example, you may want to find only those duplicate values that are immediately adjacent in the source table, and ignore duplicate values that are non-adjacent.
Including the Group Number field in the output table
You have the option of including the Group Number field in the duplicates output table. The field assigns a sequentially incremented number to each unique group of duplicates. The ability to reference groups of duplicates by number can be useful when you analyze data in the output table.
Filter duplicates output table by group number
You use several key fields in combination to test an accounts payable table for duplicate records:
- vendor number
- invoice number
- invoice date
- invoice amount
You want to filter the resulting duplicates output table so that only some of the groups of duplicates are subject to additional processing.
To create a filter using the combination of key fields would be laborious. For example:
SET FILTER TO ((Vendor_No = "11475") AND (Invoice_No = "8752512") AND (Invoice_Date = `20191021`) AND (Invoice_Amount = 7125.80)) OR ((Vendor_No = "12130") AND (Invoice_No = "589134") AND (Invoice_Date = `20191117`) AND (Invoice_Amount = 10531.71)) OR ((Vendor_No = "13440") AND (Invoice_No = "5518912") AND (Invoice_Date = `20191015`) AND (Invoice_Amount = 11068.20))
Instead, you achieve the same result by creating a filter based on group number:
SET FILTER TO MATCH(GROUP_NUM, 3 , 8, 11)
Steps
You can test one or more fields in the active table to detect whether duplicate values or entire duplicate records exist.
You can test character, numeric, and datetime fields for duplicates. If letters and numbers appear together in a character field, all alphanumeric characters are tested.
Note
For results to be valid, the field or fields being tested must be in sequential order prior to testing. You can sort the field or fields in advance, or use the Presort option during the duplicates test.
Select the fields
- In the Navigator, open the table that you want to test for duplicates.
- Select Analyze > Duplicates.
-
To detect duplicates in one or more fields:
-
On the Main tab, do one of the following:
-
From the Duplicates On list, select a field or fields to test.
-
Click Duplicates On to select a field or fields to test, or to create an expression.
The order in which you select multiple fields is the order in which the fields appear in the output results. If you want to append results to an existing Analytics table, the field selection and order must be identical to the field selection and order in the existing table.
-
-
Optional. If you clicked Duplicates On, click the sort arrow beside one or more selected fields to specify a descending sort order in the output results (the default is ascending).
-
Do one of the following:
-
From the List Fields list, select one or more additional fields to include in the output results.
-
Click List Fields to select one or more additional fields to include, or to create an expression.
Additional fields can provide useful context for the results. The fields that you select for duplicates testing are displayed automatically at the beginning of any result records. You do not need to specifically selected them under List Fields.
-
-
Optional. Select Add Groups if you want to include the Group Number field in the output table.
The Group Number field assigns a sequentially incremented number to each unique group of duplicates.
-
-
To detect entire duplicate records:
-
In the Main tab, click Duplicates On.
-
Click Add All to add all fields to Selected Fields.
-
Optional. Click the sort arrow beside one or more selected fields to specify a descending sort order in the output results (the default is ascending).
-
Click OK.
You do not need to select any fields from the List Fields list because all fields in the table are displayed automatically in the result records.
-
Optional. Select Add Groups if you want to include the Group Number field in the output table.
The Group Number field assigns a sequentially incremented number to each unique group of duplicates.
-
Exclude records from processing (optional)
-
If there are records in the current view that you want to exclude from processing, do one of the following:
-
Enter a condition in the If text box
-
Click If to create an IF statement using the Expression Builder
The If condition considers all records in the view and filters out those that do not meet the specified condition.
Note
The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).
-
Deselect Presort (optional)
-
If the test field or fields are already sorted based on a previous operation, deselect Presort to save time when testing large tables for duplicates.
If data in the test field or fields is not sorted, you must leave Presort selected to ensure that all duplicates are found.
Note
If you deselect Presort, the test field or fields must match the field or fields that were previously sorted for results to be valid.
The message Warning: File out of sequence accompanies results if there is a mismatch between selected and sorted fields. If you output results to an Analytics table, the warning message appears in the command log.
Configure the output
- Click the Output tab.
-
In the To panel, select the appropriate output option.
Note
Output options are disabled is they do not apply to a particular analytical operation.
Option Details Screen Displays the results in the Analytics display area.
Tip
You can click any linked result value in the display area to drill down to the associated record or records in the source table.
If the output table contains a large number of records, it is faster and more useful to save the results to a file than to display the results on the screen.
Print Sends the results to the default printer. Graph Creates a graph of the results and display it in the Analytics display area. File Saves or appends the results to an Analytics table or a text file, or overwrites an existing table or file.
- Analytics table the table is added to the open project if it is not already in the project.
- Text file the file is saved outside Analytics.
-
If you selected File as the output type, specify the appropriate information in the As panel.
Option Details File Type -
Analytics Table – saves or appends the results to an Analytics table, or overwrites an existing table.
-
ASCII Text File or Unicode Text file – saves or appends the results to a text file, or overwrites an existing file.
Name -
To save to a new table or file
Enter a file name in the Name text box. If Analytics prefills a file name, you can accept the prefilled name, or change it.
-
To append to an existing table or file, or to overwrite an existing table or file
Click Name and select an existing table or file in the Save or Save File As dialog box.
You can specify an absolute or relative file path, or navigate to a different folder, to save, append, or overwrite in a location other than the project location. For example:
C:\Results\Output.fil or Results\Output.fil.
Note
Analytics table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.
Local Enabled only when connected to a server table and saving, appending, or overwriting to an Analytics table.
-
Select Local saves, appends, or overwrites to the same location as the Analytics project.
-
Leave Local deselected saves, appends, or overwrites to the Prefix folder on a server.
Note
For output results produced from analysis or processing of AX Server tables, select Local. You cannot deselect the Local setting to import results tables to AX Server.
-
-
Optional. Specify a Header or a Footer in the appropriate text box.
Click Header or Footer to enter a header or footer of more than one line. You can also enter a semi-colon (;) as a line-break character in the header or footer text box.
Headers and footers are centered by default. Type a left angle bracket (<) before the header or footer text to left align the text. Left aligning multiple lines requires a left angle bracket at the beginning of each line.
Note
Specifying a header or a footer is not available for some output types.
Specify the scope of the operation
- Click the More tab.
-
In the Scope panel, select the appropriate option:
- All
- First
- Next
- While
Show me moreScope option Details All (Default) Specifies that all records in the view are processed. First Enter a number in the text box. Starts processing at the first record in the view and includes only the specified number of records. Next Enter a number in the text box. Starts processing at the currently selected record in the view and includes only the specified number of records. The actual record number must be selected in the leftmost column in the view, not data in the row. While Use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria.
Enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder.
A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered.
You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note
The First or Next options reference either the physical order or the indexed order of records in a table. First or Next disregard any filtering or quick sorting applied to a table view. However, output results of analytical operations respect any filtering.
If a view is quick sorted, Next behaves like First.
Finalize the settings and output the results
-
Optional. If you selected File as the output type, and you want to append the output results to the end of an existing table or file, do one of the following:
-
Select Append To Existing File if you are certain that the output results and the table or file are identical in structure.
Selecting Append To Existing File forces the append to happen even if the record lengths of the output results and an existing table or file are different. If the record lengths are different, the data structure is not identical, and the append will not work correctly.
-
Leave Append To Existing File deselected if you want Analytics to compare the record lengths of the output results and an existing table. This automated check is available only for Analytics tables, not text files.
Note
Leave Append To Existing File deselected if you are uncertain whether the output results and the existing table have an identical data structure. For more information about appending and data structure, see Appending output results to an existing table.
-
-
If you selected File (Analytics Table) as the output type, select Use Output Table if you want the output table to open automatically upon completion of the operation.
- Click OK.
In the output results, only the duplicate values or records are displayed, and not the initial occurrence of a value or a record, if you do both of the following:
- Output the results to screen or a text file
- Include only tested fields in the output results and do not select any additional fields
If you output to screen, you can click any value to see the initial occurrence of a value or a record along with the duplicates.
- If the overwrite prompt appears, select the appropriate
option.
If you are expecting the Append option to appear and it does not, click No to cancel the operation and see Appending output results to an existing table.
Remove duplicates
You can use the summarize operation to remove duplicate values or records from a data set and save the remaining unique values or records to a new Analytics table.
Select the fields
- In the Navigator, open the table that you want to remove duplicates from.
- Select Analyze > Summarize.
- On the Main tab, do one of the
following:
- Select the field or fields that may contain duplicate values from the Summarize On list.
- Click Summarize On to select the field or fields, or to create an expression.
The order in which you select the fields is the order in which the columns appear in the results.
Note
Select the appropriate fields to achieve your required degree of uniqueness.
For example, if you want to remove duplicate employee records and you select only the last name field, you risk removing all records for employees who have the same last name, but a different first name. Select both the last name and the first name fields to increase the degree of uniqueness.
To remove only perfectly duplicate records, click Summarize On and Add All.
- Do not select any Subtotal Fields.
- Optional. Do one of the following:
- From the Other Fields list, select the other field(s) to include in the output results.
- Click Other Fields to select the field(s), or to create an expression.
Note
Select only fields that contain the same value for all records in each summarized group. For more information, see The Other Fields option.
Deselect Presort (optional)
-
If the field that may contain duplicate values is already sorted, deselect Presort.
If the data in the field is not sorted, you must leave Presort selected to ensure valid results.
Exclude records from processing (optional)
-
If there are records in the current view that you want to exclude from processing, do one of the following:
-
Enter a condition in the If text box
-
Click If to create an IF statement using the Expression Builder
The If condition considers all records in the view and filters out those that do not meet the specified condition.
Note
The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).
-
Configure the output
- Click the Output tab.
- In the To panel, select File.
- Specify the following information in the As panel:
File Type – Analytics Table is the only option. Saves the results to a new Analytics table, or appends the results to an existing Analytics table.
Name – Enter a table name in the Name text box. Or click Name and enter the table name, or select an existing table in the Save or Save File As dialog box to overwrite or append to the table. If Analytics prefills a table name, you can accept the prefilled name, or change it.
You can also specify an absolute or relative file path, or navigate to a different folder, to save or append the table in a location other than the project location. For example:
C:\Results\No_duplicates.fil or Results\No_duplicates.fil.
Local – Only enabled when connected to a server table. Select Local to save the output table to the same location as the project, or to specify a path or navigate to a different local folder. Leave Local deselected to save the output table to the Prefix folder on the Analytics server.
Note
For output results produced from analysis or processing of Analytics Exchange server tables, select Local. You cannot use the Local setting to import results tables to AX Server.
Specify the scope of the operation
- Click the More tab.
-
In the Scope panel, select the appropriate option:
- All
- First
- Next
- While
Show me moreScope option Details All (Default) Specifies that all records in the view are processed. First Enter a number in the text box. Starts processing at the first record in the view and includes only the specified number of records. Next Enter a number in the text box. Starts processing at the currently selected record in the view and includes only the specified number of records. The actual record number must be selected in the leftmost column in the view, not data in the row. While Use a WHILE statement to limit the processing of records in the view based on a particular criterion or set of criteria.
Enter a condition in the While text box, or click While to create a WHILE statement using the Expression Builder.
A WHILE statement allows records in the view to be processed only while the specified condition evaluates to true. As soon as the condition evaluates to false, the processing terminates, and no further records are considered.
You can use the While option in conjunction with the All, First, or Next options. Record processing stops as soon as one limit is reached.
Note
The First or Next options reference either the physical order or the indexed order of records in a table. First or Next disregard any filtering or quick sorting applied to a table view. However, output results of analytical operations respect any filtering.
If a view is quick sorted, Next behaves like First.
Finalize the settings and output the results
-
Select Use Output Table if you want the output table to open automatically upon completion of the operation.
- If you want to append the output results to the end of
an existing Analytics table, do one of the following:
Select Append To Existing File if you are certain the output results and the existing table are identical in structure.
Leave Append To Existing File deselected if you want Analytics to compare the record lengths of the output results and the existing table. If the record lengths are not identical, the data structure is not identical, and the append will not work correctly.
Note
Leaving Append To Existing File deselected is recommended if you are uncertain whether the output results and the existing table have an identical data structure. For more information about appending and data structure, see Appending output results to an existing table.
- Click OK.
- If the overwrite prompt appears, select the appropriate
option.
If you are expecting the Append option to appear and it does not, click No to cancel the operation and see Appending output results to an existing table.