summarize() method
Groups rows in a dataframe based on identical values in one or more columns. Counts the number of rows in each group, and also subtotals specified numeric columns for each group.
Syntax
dataframe_name.summarize(on = ["key_column", "...n"], calculate = ["numeric_column", "...n"]|None, statistics = True|False)
Parameters
Name | Description |
---|---|
on = ["key_column", "...n"] |
The key column or columns to summarize. If you summarize by more than one column, you created nested groups. The order of nesting follows the order in which you specify the columns. Key columns are positioned leftmost in the summarized dataframe. Note The summarize() method automatically sorts the dataframe on the key columns before summarizing the data into groups. The output dataframe contains a single, unique group for each set of identical values, or identical combination of values, in the key column or columns. |
calculate = ["numeric_column", "...n"] | None optional |
If you omit the parameter, no columns are subtotaled. |
statistics = True | False optional |
If you specify True, the following statistical values are calculated for each subtotal column, broken down by group:
If you omit statistics the default of False is used. |
Returns
HCL dataframe.
Examples
Total transaction amount per customer
You summarize an accounts receivable dataframe on the Customer_Number column, and subtotal the Trans_Amount column. The output is grouped by customer and includes the total transaction amount for each customer:
customer_total = accounts_receivable.summarize(on = ["Customer_Number"], calculate = ["Trans_Amount"], statistics = False)
Total transaction amount per customer per transaction date
You summarize an accounts receivable dataframe on the Customer_Number and Trans_Date columns. You subtotal the Trans_Amount column.
The output is grouped by customer, and within customer by date, and includes the total transaction amount for each customer for each date the customer had transactions.
customer_total_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = False)
Total transaction amount, with statistical values, per customer per transaction date
This example is identical to the one above, but you specify True for the statistics parameter.
In addition to the subtotaled transaction amount for each customer for each date the customer had transactions, you also calculate a variety for statistical values for each customer for each date:
customer_stats_by_date = accounts_receivable.summarize(on = ["Customer_Number", "Trans_Date"], calculate = ["Trans_Amount"], statistics = True)
Remarks
How it works
The summarize() method groups rows that have the same value in a column, or the same combination of values across multiple columns. The output dataframe contains a single row for each group, with a count of the number of rows in the source dataframe that belong to the group. The output also calculates the percentage of dataframe rows belonging to each group.
Subtotal and statistics: column names and calculations in the output results
You can use the calculate and statistics parameters to perform statistical calculations on any subtotal column that you specify. The statistical calculations are broken down by group in the output results.
calculate parameter
Column name in output dataframe | Calculation performed on subtotaled column |
---|---|
subtotaled column name + _sum | Subtotaled values for each group |
subtotaled column name + _sum% |
Each group's subtotal expressed as a percentage of the column total |
statistics parameter
Column title in output dataframe | Calculation performed on subtotaled column |
---|---|
subtotaled column name + _min |
The minimum value for each group |
subtotaled column name + _max |
The maximum value for each group |
subtotaled column name + _mean |
The average value for each group |
subtotaled column name + _median |
The median value for each group
|
subtotaled column name + _std |
The standard deviation for each group |
subtotaled column name + _mode |
The most frequently occurring value for each group In the event of a tie, displays the lowest value. If no value occurs more than once, displays the minimum value in the column. |
subtotaled column name + _q25 |
The first quartile value for each group (lower quartile value) |
subtotaled column name + _q75 |
The third quartile value for each group (upper quartile value) |