OUTLIERS command
Concept Information
Identifies statistical outliers in a numeric field. Outliers can be identified for the field as a whole, or for separate groups based on identical values in one or more character, numeric, or datetime key fields.
Syntax
OUTLIERS {AVERAGE|MEDIAN} {PKEY key_field <...n>|PKEY ALL <EXCLUDE field_name <...n>>|NOKEY} ON numeric_field <OTHER field <...n>|OTHER ALL <EXCLUDE field_name <...n>>> NUMSTDEV number_of_std_devs <IF test> <TO {SCREEN|table_name}> <PRESORT> <WHILE test> <FIRST range|NEXT range> <OPEN>
Note
You cannot run the OUTLIERS command locally against a server table.
You must specify the OUTLIERS command name in full. You cannot abbreviate it.
Parameters
Name | Description | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
AVERAGE | MEDIAN |
The method for calculating the center point of the values in numeric_field (the outlier field).
The center point is calculated for either:
The center point is subsequently used in calculating the standard deviation of the numeric field, or of each group. Note If you specify MEDIAN, numeric_field must be sorted. Use PRESORT if numeric_field is not already sorted. Tip If the data you are examining for outliers is significantly skewed, MEDIAN might produce results that are more representative of the bulk of the data. |
||||||||||
PKEY key_field <...n> | PKEY ALL | NOKEY |
One or more character, numeric, or datetime fields to use for grouping the data in the table. If you specify NOKEY, the data is not grouped, and outliers are identified at the field level. Note Key fields must be sorted. Use PRESORT if one or more fields are not already sorted.
|
||||||||||
EXCLUDE field_name optional |
Only valid when grouping table data using PKEY ALL. The field or fields to exclude from the command. EXCLUDE allows you to fine-tune PKEY ALL, by excluding the specified fields. EXCLUDE must immediately follow PKEY ALL. For example: PKEY ALL EXCLUDE field_1 field_2 |
||||||||||
ON numeric_field |
The numeric field to examine for outliers. You can examine only one field at a time. Outliers are values that fall outside the upper and lower boundaries established by the field or group standard deviation, or by a specified multiple of the standard deviation. |
||||||||||
OTHER field <...n> | OTHER ALL optional |
One or more additional fields to include in the output.
Note Key fields and the outlier field are automatically included in the output table, and do not need to be specified using OTHER. |
||||||||||
EXCLUDE field_name optional |
Only valid when using OTHER ALL. The field or fields to exclude from the command. EXCLUDE allows you to fine-tune OTHER ALL, by excluding the specified fields. EXCLUDE must immediately follow OTHER ALL. For example: OTHER ALL EXCLUDE field_1 field_2 |
||||||||||
NUMSTDEV number_of_std_devs |
In numeric_field, the number of standard deviations from the mean or the median to the upper and lower outlier boundaries. You can specify any positive integer or decimal numeral (0.5, 1, 1.5, 2 . . . ) The formula for creating outlier boundaries is: mean/median ± (number_of_std_devs * standard deviation) Note Standard deviation is a measure of the dispersion of a data set – that is, how spread out the values are. The outliers calculation uses population standard deviation. Example of outlier boundariesNUMSTDEV 2 establishes, for numeric_field as a whole, or for each key field group:
Any value greater than the upper boundary, or less than the lower boundary, is included as an outlier in the output results. Note For the same set of data, as you increase the value in number_of_std_devs you potentially decrease the number of outliers returned. |
||||||||||
IF test optional |
A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition. Note The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT). |
||||||||||
TO SCREEN | table_name optional |
The location to send the results of the command to:
|
||||||||||
PRESORT optional |
Performs a sorting operation before executing the command. Tip If the appropriate field or fields in the input table are already sorted, you can save processing time by not specifying PRESORT.
|
||||||||||
WHILE test optional |
A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached. Note If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached. |
||||||||||
FIRST range | NEXT range optional |
The number of records to process:
Use range to specify the number of records to process. If you omit FIRST and NEXT, all records are processed by default. |
||||||||||
OPEN optional |
Opens the table created by the command after the command executes. Only valid if the command creates an output table. |
Examples
Identifying transaction amounts that are out of the ordinary
You want to identify transaction amounts that are out of the ordinary across the entire Ar table in Sample Project.acl.
You decide to set the outlier boundaries at 3 times the standard deviation of the Amount field. The test returns 16 outliers in the table of 772 records.
OPEN Ar
OUTLIERS AVERAGE NOKEY ON Amount NUMSTDEV 3 PRESORT TO "Outliers_AR.fil" OPEN
You repeat the test, but increase the standard deviation multiple to 3.5. The test now returns only 6 outliers because the outlier boundaries are farther away from the center point of the values in the Amount field.
OPEN Ar
OUTLIERS AVERAGE NOKEY ON Amount NUMSTDEV 3.5 PRESORT TO "Outliers_AR.fil" OPEN
Identifying transaction amounts that are out of the ordinary for each customer
For each customer in the Ar table in Sample Project.acl, you want to identify any transaction amounts that are out of the ordinary.
You decide to set the outlier boundaries at 3 times the standard deviation of each customer's group of transactions.
OPEN Ar
OUTLIERS AVERAGE PKEY No ON Amount NUMSTDEV 3 PRESORT TO "Outliers_Customer_AR.fil" OPEN
The test returns 7 outliers. The standard deviation and the average are reported for each customer's group of transactions:
Cust Number (No) | Trans Amount | STDEV | AVERAGE | Group Number | |
---|---|---|---|---|---|
1 | 065003 | 4,954.64 | 1015.58 | 833.83 | 1 |
2 | 262001 | 3,567.34 | 772.44 | 438.81 | 2 |
3 | 262001 | (2,044.82) | 772.44 | 438.81 | 2 |
4 | 376005 | (931.55) | 411.18 | 484.57 | 3 |
5 | 501657 | 5,549.19 | 1332.80 | 441.14 | 4 |
6 | 811002 | 3,409.82 | 634.20 | 672.10 | 5 |
7 | 925007 | 3,393.87 | 736.48 | 906.16 | 6 |
How outliers are identified for customer 262001
Customer 262001 has 101 transactions in the Ar table, and two of them are reported as outliers because they exceed the outlier boundaries for that customer:
Outlier | Lower boundary | Upper boundary | Outlier |
---|---|---|---|
(2,044.82) | (1,878.51) | 2,756.13 | 3,567.34 |
How outlier boundaries are calculated for customer 262001
The outlier boundaries are the average of all customer 262001 transactions, plus or minus the specified multiple of the standard deviation of the transactions:
Average of all customer 262001 transactions | 438.81 |
Specified multiple of the standard deviation | 3 |
Standard deviation of the transactions | 772.44 |
438.81 ± (3 * 772.44) = 438.81 ± 2,317.32 = (1,878.51) (lower boundary) = 2,756.13 (upper boundary) |
Using MEDIAN to identify transaction amounts that are out of the ordinary for each customer
You use MEDIAN, instead of AVERAGE, to perform the same outlier test that you performed in the example above.
OPEN Ar
OUTLIERS MEDIAN PKEY No ON Amount NUMSTDEV 3 PRESORT TO "Outliers_Customer_AR_Median.fil" OPEN
The test returns 10 outliers instead of the 7 that are returned in the previous test. Depending on the nature of the data, MEDIAN and AVERAGE can return somewhat different results:
Cust Number (No) | Trans Amount | STDEV | MEDIAN | Group Number | |
---|---|---|---|---|---|
1 | 065003 | 4,954.64 | 1015.58 | 663.68 | 1 |
2 | 262001 | (2,044.82) | 772.44 | 450.67 | 2 |
3 | 262001 | 3,567.34 | 772.44 | 450.67 | 2 |
4 | 376005 | (931.55) | 411.18 | 517.16 | 3 |
5 | 501657 | 4,426.14 | 1332.80 | 146.80 | 4 |
6 | 501657 | 5,549.19 | 1332.80 | 146.80 | 4 |
7 | 811002 | 3,409.82 | 634.20 | 624.53 | 5 |
8 | 925007 | 2,972.78 | 736.48 | 717.88 | 6 |
9 | 925007 | 3,030.71 | 736.48 | 717.88 | 6 |
10 | 925007 | 3,393.87 | 736.48 | 717.88 | 6 |
How outlier boundaries are calculated for each customer
The outlier boundaries are the median value of each customer's transactions, plus or minus the specified multiple of the standard deviation of the transactions.
For example, for customer 262001: 450.67 ± (3 * 772.44)
Remarks
For more information about how this command works, see Identifying outliers.
Add outlier boundary fields to the results table
Analytics automatically adds the STDEV and AVERAGE or MEDIAN calculated fields to the outliers results table. You may find it useful to also add two computed fields that show the outliers boundaries used to identify the outliers in the results table.
- Open the outliers results table.
- Paste this expression into the Analytics command line, edit it as required, and press Enter:
DEFINE FIELD Lower_Boundary COMPUTED AVERAGE - (number_of_std_devs * STDEV)
- For number_of_std_devs, substitute the actual standard deviation multiple you used.
- If you used median as a center point rather than average, substitute MEDIAN for AVERAGE.
- Paste this expression into the Analytics command line, edit it as required, and press Enter:
DEFINE FIELD Upper_Boundary COMPUTED AVERAGE + (number_of_std_devs * STDEV)
- For number_of_std_devs, substitute the actual standard deviation multiple you used.
- If you used median as a center point rather than average, substitute MEDIAN for AVERAGE.
- Right-click the view and select Add Columns.
- From the Available Fields list, double-click Lower_Boundary and Upper_Boundary to add them to the Selected Fields list.
- Click OK.
- Optional. Reposition the added fields by dragging the column headers.