duplicates() method

Detects whether duplicate values or entire duplicate rows exist in a dataframe.

Syntax

dataframe_name.duplicates(on = ["key_column", "...n"], add_groups = True|False)

Parameters

Name Description
on = ["key_column", "...n"]

The key column or columns to test for duplicates.

If you test by more than one column, rows identified as duplicates require identical values in every specified column.

If you test by all the columns in a dataframe, rows identified as duplicates must be entirely identical.

Key columns are positioned leftmost in the output dataframe.

add_groups = True | False

optional

  • True include the Group column in the output dataframe
  • False do not include the Group column in the output dataframe

The Group column assigns a sequentially incremented number to each unique group of duplicates.

Tip

The ability to reference groups of duplicates by number can be useful when you analyze data in the output dataframe.

If you omit the parameter, the Group column is not included.

Returns

HCL dataframe.

Examples

Test for duplicate values in one column

In the invoices dataframe, the following example:

  • tests for duplicate values in the Invoice_Number column
  • outputs any rows that contain duplicate invoice numbers to the inv_num_duplicates dataframe
inv_num_duplicates = invoices.duplicates(on = ["Invoice_Number"])

The second example does the same thing, and it also assigns a sequentially incremented number to each unique group of duplicates.

inv_num_duplicates_group_id = invoices.duplicates(on = ["Invoice_Number"], add_groups = True)

Test for duplicate values in two or more columns in combination

In the invoices dataframe, the following example:

  • tests for duplicate combinations of values in the Invoice_Number and Vendor_Number columns
  • outputs any rows that contain the same invoice number and the same vendor number to the invoice_vendor_duplicates dataframe

The difference between this test and the previous test is that an identical invoice number from two different vendors is not reported as a false positive.

invoice_vendor_duplicates = invoices.duplicates(on = ["Invoice_Number", "Vendor_Number"])

Test for duplicate rows

In the inventory dataframe, the following example:

  • tests for duplicate values in every column
  • outputs any entirely identical rows to the inventory_duplicates dataframe
inventory_duplicates = inventory.duplicates(on = ["ProdNum", "ProdClass", "Location", "ProdDesc", "ProdStatus", "UnitCost", "CostDate", "SalePrice", "PriceDate"])