outliers() method
Identifies statistical outliers in a numeric column. Outliers can be identified for the column as a whole, or for separate groups based on identical values in one or more key columns.
Syntax
dataframe_name.outliers(keys = ["key_column", "...n"]|None, on = "numeric_column", distance = number_of_std_devs, method = mean|median)
Parameters
Name | Description |
---|---|
keys = ["key_column", "...n"] | None |
|
on = "numeric_column" |
The numeric column to examine for outliers. You can examine only one column at a time. Outliers are values that fall outside the upper and lower boundaries established by the group or column standard deviation, or by a specified multiple of the standard deviation. |
distance = number_of_std_devs |
In numeric_column, the number of standard deviations from the mean or the median to the upper and lower outlier boundaries. You can specify any positive integer or decimal numeral (0.5, 1, 1.5, 2 . . . ) The formula for creating outlier boundaries is: mean/median ± (number_of_std_devs * standard deviation) Note Standard deviation is a measure of the dispersion of a data set – that is, how spread out the values are. The outliers calculation uses population standard deviation. Example of outlier boundariesdistance = 2 establishes, for numeric_column as a whole, or for each key column group:
Any value greater than the upper boundary, or less than the lower boundary, is included as an outlier in the output results. Note For the same set of data, as you increase the value in number_of_std_devs you potentially decrease the number of outliers returned. |
method = mean | median optional |
The method for calculating the center point of the values in numeric_column (the outlier column).
The center point is calculated for either:
The center point is subsequently used in calculating the standard deviation of each group, or of the numeric column as a whole. If you omit method, the default value of mean is used. Tip If the data you are examining for outliers is significantly skewed, using the median might produce results that are more representative of the bulk of the data. |
Returns
HCL dataframe.
Examples
Identify transaction amounts that are out of the ordinary
You want to identify transaction amounts that are out of the ordinary across an entire accounts receivable dataframe.
You decide to set the outlier boundaries at 3 times the standard deviation of the Amount column. The test returns 16 outliers in the dataframe of 772 rows.
outliers_ar = accounts_receivable.outliers(keys = None, on = "Amount", distance = 3, method = "mean")
You repeat the test, but increase the standard deviation multiple to 3.5. The test now returns only 6 outliers because the outlier boundaries are farther away from the center point of the values in the Amount column.
outliers_ar = accounts_receivable.outliers(keys = None, on = "Amount", distance = 3.5, method = "mean")
Identify transaction amounts that are out of the ordinary for each customer
For each customer in an accounts receivable dataframe you want to identify any transaction amounts that are out of the ordinary.
You decide to set the outlier boundaries at 3 times the standard deviation of each customer's group of transactions.
outliers_customer_ar = accounts_receivable.outliers(keys = ["CustNum"], on = "Amount", distance = 3, method = "mean")
The test returns 7 outliers. The mean and the standard deviation are reported for each customer's group of transactions:
group no | CustNum | Amount | mean | stdev | distance |
---|---|---|---|---|---|
0 | 65003 | 4,954.64 | 833.83 | 1015.58 | 3 |
1 | 262001 | 3,567.34 | 438.81 | 772.44 | 3 |
1 | 262001 | (2,044.82) | 438.81 | 772.44 | 3 |
2 | 376005 | (931.55) | 484.57 | 411.18 | 3 |
3 | 501657 | 5,549.19 | 441.14 | 1332.80 | 3 |
4 | 811002 | 3,409.82 | 672.10 | 634.20 | 3 |
5 | 925007 | 3,393.87 | 906.16 | 736.48 | 3 |
How outliers are identified for customer 262001
Customer 262001 has 101 transactions in the accounts receivable dataframe, and two of them are reported as outliers because they exceed the outlier boundaries for that customer:
Outlier | Lower boundary | Upper boundary | Outlier |
---|---|---|---|
(2,044.82) | (1,878.51) | 2,756.13 | 3,567.34 |
How outlier boundaries are calculated for customer 262001
The outlier boundaries are the mean of all customer 262001 transactions, plus or minus the specified multiple of the standard deviation of the transactions:
mean of all customer 262001 transactions | 438.81 |
Specified multiple of the standard deviation | 3 |
Standard deviation of the transactions | 772.44 |
438.81 ± (3 * 772.44) = 438.81 ± 2,317.32 = (1,878.51) (lower boundary) = 2,756.13 (upper boundary) |
Use median to identify transaction amounts that are out of the ordinary for each customer
You use median, instead of mean, to perform the same outlier test that you performed in the example above.
outliers_customer_ar_median = accounts_receivable.outliers(keys = ["CustNum"], on = "Amount", distance = 3, method = "median")
The test returns 10 outliers instead of the 7 that are returned in the previous test. Depending on the nature of the data, median and mean can return somewhat different results:
group no | CustNum | Amount | median | stdev | distance |
---|---|---|---|---|---|
0 | 65003 | 4,954.64 | 663.68 | 1015.58 | 3 |
1 | 262001 | 3,567.34 | 450.67 | 772.44 | 3 |
1 | 262001 | (2,044.82) | 450.67 | 772.44 | 3 |
2 | 376005 | (931.55) | 517.16 | 411.18 | 3 |
3 | 501657 | 4,426.14 | 146.80 | 1332.80 | 3 |
3 | 501657 | 5,549.19 | 146.80 | 1332.80 | 3 |
4 | 811002 | 3,409.82 | 624.53 | 634.20 | 3 |
5 | 925007 | 2,972.78 | 717.88 | 736.48 | 3 |
5 | 925007 | 3,030.71 | 717.88 | 736.48 | 3 |
5 | 925007 | 3,393.87 | 717.88 | 736.48 | 3 |
How outlier boundaries are calculated for each customer
The outlier boundaries are the median value of each customer's transactions, plus or minus the specified multiple of the standard deviation of the transactions.
For example, for customer 262001: 450.67 ± (3 * 772.44)
Remarks
All columns in the input dataframe are automatically included in the output dataframe. Columns not directly involved in the outlier calculation are positioned rightmost in the output dataframe.