duplicates() method
Detects whether duplicate values or entire duplicate rows exist in a dataframe.
Syntax
dataframe_name.duplicates(on = ["key_column", "...n"], add_groups = True|False)
Parameters
Name | Description |
---|---|
on = ["key_column", "...n"] |
The key column or columns to test for duplicates. If you test by more than one column, rows identified as duplicates require identical values in every specified column. If you test by all the columns in a dataframe, rows identified as duplicates must be entirely identical. Key columns are positioned leftmost in the output dataframe. |
add_groups = True | False optional |
The Group column assigns a sequentially incremented number to each unique group of duplicates. Tip The ability to reference groups of duplicates by number can be useful when you analyze data in the output dataframe. If you omit the parameter, the Group column is not included. |
Returns
HCL dataframe.
Examples
Test for duplicate values in one column
In the invoices dataframe, the following example:
- tests for duplicate values in the Invoice_Number column
- outputs any rows that contain duplicate invoice numbers to the inv_num_duplicates dataframe
inv_num_duplicates = invoices.duplicates(on = ["Invoice_Number"])
The second example does the same thing, and it also assigns a sequentially incremented number to each unique group of duplicates.
inv_num_duplicates_group_id = invoices.duplicates(on = ["Invoice_Number"], add_groups = True)
Test for duplicate values in two or more columns in combination
In the invoices dataframe, the following example:
- tests for duplicate combinations of values in the Invoice_Number and Vendor_Number columns
- outputs any rows that contain the same invoice number and the same vendor number to the invoice_vendor_duplicates dataframe
The difference between this test and the previous test is that an identical invoice number from two different vendors is not reported as a false positive.
invoice_vendor_duplicates = invoices.duplicates(on = ["Invoice_Number", "Vendor_Number"])
Test for duplicate rows
In the inventory dataframe, the following example:
- tests for duplicate values in every column
- outputs any entirely identical rows to the inventory_duplicates dataframe
inventory_duplicates = inventory.duplicates(on = ["ProdNum", "ProdClass", "Location", "ProdDesc", "ProdStatus", "UnitCost", "CostDate", "SalePrice", "PriceDate"])