outliers() method

Identifies statistical outliers in a numeric column. Outliers can be identified for the column as a whole, or for separate groups based on identical values in one or more key columns.

Syntax

dataframe_name.outliers(keys = ["key_column", "...n"]|None, on = "numeric_column", distance = number_of_std_devs, method = mean|median)

Parameters

Name Description
keys = ["key_column", "...n"] | None
  • key_column the key column or columns to use for grouping the data in the dataframe

    For each group, a standard deviation is calculated for the group's numeric values in numeric_column. The group standard deviation is used as the basis for identifying group outliers.

    If you group by more than one column, you created nested groups. The order of nesting follows the order in which you specify the columns.

    Key columns are positioned leftmost in the output dataframe.

    Note

    The outliers() method automatically sorts the dataframe on the key columns before identifying outliers.

  • None the data is not grouped, and outliers are identified for numeric_column as a whole

    A standard deviation is calculated for numeric_column as a whole. The column standard deviation is used as the basis for identifying column outliers.

on = "numeric_column"

The numeric column to examine for outliers. You can examine only one column at a time.

Outliers are values that fall outside the upper and lower boundaries established by the group or column standard deviation, or by a specified multiple of the standard deviation.

distance = number_of_std_devs

In numeric_column, the number of standard deviations from the mean or the median to the upper and lower outlier boundaries. You can specify any positive integer or decimal numeral (0.5, 1, 1.5, 2 . . . )

The formula for creating outlier boundaries is:

mean/median ± (number_of_std_devs * standard deviation)

Note

Standard deviation is a measure of the dispersion of a data set – that is, how spread out the values are. The outliers calculation uses population standard deviation.

Example of outlier boundaries

distance = 2

establishes, for numeric_column as a whole, or for each key column group:

  • an upper outlier boundary 2 standard deviations greater than the mean or the median

    mean/median + (2 * SD)

  • a lower outlier boundary 2 standard deviations less than the mean or the median

    mean/median – (2 * SD)

Any value greater than the upper boundary, or less than the lower boundary, is included as an outlier in the output results.

Note

For the same set of data, as you increase the value in number_of_std_devs you potentially decrease the number of outliers returned.

method = mean | median

optional

The method for calculating the center point of the values in numeric_column (the outlier column).

  • mean calculate the mean (average) of the values
  • median calculate the median of the values

The center point is calculated for either:

  • the numeric values for each key column group
  • the numeric column as a whole

The center point is subsequently used in calculating the standard deviation of each group, or of the numeric column as a whole.

If you omit method, the default value of mean is used.

Tip

If the data you are examining for outliers is significantly skewed, using the median might produce results that are more representative of the bulk of the data.

Returns

HCL dataframe.

Examples

Identify transaction amounts that are out of the ordinary

You want to identify transaction amounts that are out of the ordinary across an entire accounts receivable dataframe.

You decide to set the outlier boundaries at 3 times the standard deviation of the Amount column. The test returns 16 outliers in the dataframe of 772 rows.

outliers_ar = accounts_receivable.outliers(keys = None, on = "Amount", distance = 3, method = "mean")

You repeat the test, but increase the standard deviation multiple to 3.5. The test now returns only 6 outliers because the outlier boundaries are farther away from the center point of the values in the Amount column.

outliers_ar = accounts_receivable.outliers(keys = None, on = "Amount", distance = 3.5, method = "mean")

Identify transaction amounts that are out of the ordinary for each customer

For each customer in an accounts receivable dataframe you want to identify any transaction amounts that are out of the ordinary.

You decide to set the outlier boundaries at 3 times the standard deviation of each customer's group of transactions.

outliers_customer_ar = accounts_receivable.outliers(keys = ["CustNum"], on = "Amount", distance = 3, method = "mean")

The test returns 7 outliers. The mean and the standard deviation are reported for each customer's group of transactions:

group no CustNum Amount mean stdev distance
0 65003 4,954.64 833.83 1015.58 3
1 262001 3,567.34 438.81 772.44 3
1 262001 (2,044.82) 438.81 772.44 3
2 376005 (931.55) 484.57 411.18 3
3 501657 5,549.19 441.14 1332.80 3
4 811002 3,409.82 672.10 634.20 3
5 925007 3,393.87 906.16 736.48 3

How outliers are identified for customer 262001

Customer 262001 has 101 transactions in the accounts receivable dataframe, and two of them are reported as outliers because they exceed the outlier boundaries for that customer:

Outlier Lower boundary Upper boundary Outlier
(2,044.82) (1,878.51) 2,756.13 3,567.34

How outlier boundaries are calculated for customer 262001

The outlier boundaries are the mean of all customer 262001 transactions, plus or minus the specified multiple of the standard deviation of the transactions:

mean of all customer 262001 transactions 438.81
Specified multiple of the standard deviation 3
Standard deviation of the transactions 772.44

438.81 ± (3 * 772.44)

= 438.81 ± 2,317.32

= (1,878.51) (lower boundary)

= 2,756.13 (upper boundary)

Use median to identify transaction amounts that are out of the ordinary for each customer

You use median, instead of mean, to perform the same outlier test that you performed in the example above.

outliers_customer_ar_median = accounts_receivable.outliers(keys = ["CustNum"], on = "Amount", distance = 3, method = "median")

The test returns 10 outliers instead of the 7 that are returned in the previous test. Depending on the nature of the data, median and mean can return somewhat different results:

group no CustNum Amount median stdev distance
0 65003 4,954.64 663.68 1015.58 3
1 262001 3,567.34 450.67 772.44 3
1 262001 (2,044.82) 450.67 772.44 3
2 376005 (931.55) 517.16 411.18 3
3 501657 4,426.14 146.80 1332.80 3
3 501657 5,549.19 146.80 1332.80 3
4 811002 3,409.82 624.53 634.20 3
5 925007 2,972.78 717.88 736.48 3
5 925007 3,030.71 717.88 736.48 3
5 925007 3,393.87 717.88 736.48 3

How outlier boundaries are calculated for each customer

The outlier boundaries are the median value of each customer's transactions, plus or minus the specified multiple of the standard deviation of the transactions.

For example, for customer 262001: 450.67 ± (3 * 772.44)

Remarks

All columns in the input dataframe are automatically included in the output dataframe. Columns not directly involved in the outlier calculation are positioned rightmost in the output dataframe.