summarize() method

Groups rows in a dataframe based on identical values in one or more columns. Counts the number of rows in each group, and also subtotals specified numeric columns for each group.

Syntax

dataframe_name.summarize(on = ["key_column", "...n"], calculate = ["numeric_column", "...n"]|None, statistics = True|False)

Parameters

Name	Description
on = ["key_column", "...n"]	The key column or columns to summarize. If you summarize by more than one column, you created nested groups. The order of nesting follows the order in which you specify the columns. Key columns are positioned leftmost in the summarized dataframe. Note The summarize() method automatically sorts the dataframe on the key columns before summarizing the data into groups. The output dataframe contains a single, unique group for each set of identical values, or identical combination of values, in the key column or columns.
calculate = ["numeric_column", "...n"] \| None optional	numeric_column one or more numeric columns to subtotal for each group Also calculates each group's subtotal as a percentage of the column total None do not subtotal any numeric columns If you omit the parameter, no columns are subtotaled.
statistics = True \| False optional	True calculate statistics for all numeric subtotal columns Note Cannot be used unless at least one numeric subtotal column is specified with calculate. False do not calculate statistics If you specify True, the following statistical values are calculated for each subtotal column, broken down by group: minimum maximum mean (average) median standard deviation mode first quartile third quartile If you omit statistics the default of False is used.

Name

Description

on = ["key_column", "...n"]

The key column or columns to summarize.

If you summarize by more than one column, you created nested groups. The order of nesting follows the order in which you specify the columns.

Key columns are positioned leftmost in the summarized dataframe.

Note

The summarize() method automatically sorts the dataframe on the key columns before summarizing the data into groups. The output dataframe contains a single, unique group for each set of identical values, or identical combination of values, in the key column or columns.

calculate = ["numeric_column", "...n"] | None

optional

numeric_column one or more numeric columns to subtotal for each group
Also calculates each group's subtotal as a percentage of the column total
None do not subtotal any numeric columns

If you omit the parameter, no columns are subtotaled.

statistics = True | False

optional

True calculate statistics for all numeric subtotal columns
Note
Cannot be used unless at least one numeric subtotal column is specified with calculate.
False do not calculate statistics

If you specify True, the following statistical values are calculated for each subtotal column, broken down by group:

minimum
maximum
mean (average)
median
standard deviation
mode
first quartile
third quartile

If you omit statistics the default of False is used.

Returns

HCL dataframe.

Examples

Total transaction amount per customer

You summarize an accounts receivable dataframe on the Customer_Number column, and subtotal the Trans_Amount column. The output is grouped by customer and includes the total transaction amount for each customer:

customer_total = accounts_receivable.summarize(on = ["Customer_Number"], calculate = ["Trans_Amount"], statistics = False)

Total transaction amount per customer per transaction date

You summarize an accounts receivable dataframe on the Customer_Number and Trans_Date columns. You subtotal the Trans_Amount column.

The output is grouped by customer, and within customer by date, and includes the total transaction amount for each customer for each date the customer had transactions.

customer_total_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = False)

Total transaction amount, with statistical values, per customer per transaction date

This example is identical to the one above, but you specify True for the statistics parameter.

In addition to the subtotaled transaction amount for each customer for each date the customer had transactions, you also calculate a variety for statistical values for each customer for each date:

customer_stats_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = True)

Remarks

How it works

The summarize() method groups rows that have the same value in a column, or the same combination of values across multiple columns. The output dataframe contains a single row for each group, with a count of the number of rows in the source dataframe that belong to the group. The output also calculates the percentage of dataframe rows belonging to each group.

Subtotal and statistics: column names and calculations in the output results

You can use the calculate and statistics parameters to perform statistical calculations on any subtotal column that you specify. The statistical calculations are broken down by group in the output results.

calculate parameter

Column name in output dataframe	Calculation performed on subtotaled column
subtotaled column name + _sum	Subtotaled values for each group
subtotaled column name + _sum%	Each group's subtotal expressed as a percentage of the column total

statistics parameter

Column title in output dataframe	Calculation performed on subtotaled column
subtotaled column name + _min	The minimum value for each group
subtotaled column name + _max	The maximum value for each group
subtotaled column name + _mean	The average value for each group
subtotaled column name + _median	The median value for each group Odd-numbered sets of values: the middle value Even-numbered sets of values: the average of the two values at the middle
subtotaled column name + _std	The standard deviation for each group
subtotaled column name + _mode	The most frequently occurring value for each group In the event of a tie, displays the lowest value. If no value occurs more than once, displays the minimum value in the column.
subtotaled column name + _q25	The first quartile value for each group (lower quartile value)
subtotaled column name + _q75	The third quartile value for each group (upper quartile value)