summarize() method

Groups rows in a dataframe based on identical values in one or more columns. Counts the number of rows in each group, and also subtotals specified numeric columns for each group.

Syntax

dataframe_name.summarize(on = ["key_column", "...n"], calculate = ["numeric_column", "...n"]|None, statistics = True|False)

Parameters

Name Description
on = ["key_column", "...n"]

The key column or columns to summarize.

If you summarize by more than one column, you created nested groups. The order of nesting follows the order in which you specify the columns.

Key columns are positioned leftmost in the summarized dataframe.

Note

The summarize() method automatically sorts the dataframe on the key columns before summarizing the data into groups. The output dataframe contains a single, unique group for each set of identical values, or identical combination of values, in the key column or columns.

calculate = ["numeric_column", "...n"] | None

optional

  • numeric_column one or more numeric columns to subtotal for each group

    Also calculates each group's subtotal as a percentage of the column total

  • None do not subtotal any numeric columns

If you omit the parameter, no columns are subtotaled.

statistics = True | False

optional

  • True calculate statistics for all numeric subtotal columns

    Note

    Cannot be used unless at least one numeric subtotal column is specified with calculate.

  • False do not calculate statistics

If you specify True, the following statistical values are calculated for each subtotal column, broken down by group:

  • minimum
  • maximum
  • mean (average)
  • median
  • standard deviation
  • mode
  • first quartile
  • third quartile

If you omit statistics the default of False is used.

Returns

HCL dataframe.

Examples

Total transaction amount per customer

You summarize an accounts receivable dataframe on the Customer_Number column, and subtotal the Trans_Amount column. The output is grouped by customer and includes the total transaction amount for each customer:

customer_total = accounts_receivable.summarize(on = ["Customer_Number"], calculate = ["Trans_Amount"], statistics = False)

Total transaction amount per customer per transaction date

You summarize an accounts receivable dataframe on the Customer_Number and Trans_Date columns. You subtotal the Trans_Amount column.

The output is grouped by customer, and within customer by date, and includes the total transaction amount for each customer for each date the customer had transactions.

customer_total_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = False)

Total transaction amount, with statistical values, per customer per transaction date

This example is identical to the one above, but you specify True for the statistics parameter.

In addition to the subtotaled transaction amount for each customer for each date the customer had transactions, you also calculate a variety for statistical values for each customer for each date:

customer_stats_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = True)

Remarks

How it works

The summarize() method groups rows that have the same value in a column, or the same combination of values across multiple columns. The output dataframe contains a single row for each group, with a count of the number of rows in the source dataframe that belong to the group. The output also calculates the percentage of dataframe rows belonging to each group.

Subtotal and statistics: column names and calculations in the output results

You can use the calculate and statistics parameters to perform statistical calculations on any subtotal column that you specify. The statistical calculations are broken down by group in the output results.

calculate parameter

Column name in output dataframe Calculation performed on subtotaled column
subtotaled column name + _sum Subtotaled values for each group
subtotaled column name + _sum%

Each group's subtotal expressed as a percentage of the column total

statistics parameter

Column title in output dataframe Calculation performed on subtotaled column

subtotaled column name + _min

The minimum value for each group

subtotaled column name + _max

The maximum value for each group

subtotaled column name + _mean

The average value for each group

subtotaled column name + _median

The median value for each group

  • Odd-numbered sets of values: the middle value
  • Even-numbered sets of values: the average of the two values at the middle

subtotaled column name + _std

The standard deviation for each group

subtotaled column name + _mode

The most frequently occurring value for each group

In the event of a tie, displays the lowest value. If no value occurs more than once, displays the minimum value in the column.

subtotaled column name + _q25

The first quartile value for each group (lower quartile value)

subtotaled column name + _q75

The third quartile value for each group (upper quartile value)