OUTLIERS command

Concept Information

Identifying outliers

Identifies statistical outliers in a numeric field. Outliers can be identified for the field as a whole, or for separate groups based on identical values in one or more character, numeric, or datetime key fields.

Syntax

OUTLIERS {AVERAGE|MEDIAN} {PKEY key_field <...n>|PKEY ALL <EXCLUDE field_name <...n>>|NOKEY} ON numeric_field <OTHER field <...n>|OTHER ALL <EXCLUDE field_name <...n>>> NUMSTDEV number_of_std_devs <IF test> <TO {SCREEN|table_name}> <PRESORT> <WHILE test> <FIRST range|NEXT range> <OPEN>

Note

You cannot run the OUTLIERS command locally against a server table.

You must specify the OUTLIERS command name in full. You cannot abbreviate it.

Parameters

Name Description
AVERAGE | MEDIAN

The method for calculating the center point of the values in numeric_field (the outlier field).

  • AVERAGE calculate the average (mean) of the values
  • MEDIAN calculate the median of the values

The center point is calculated for either:

  • the numeric field as a whole
  • the numeric values for each key field group

The center point is subsequently used in calculating the standard deviation of the numeric field, or of each group.

Note

If you specify MEDIAN, numeric_field must be sorted. Use PRESORT if numeric_field is not already sorted.

Tip

If the data you are examining for outliers is significantly skewed, MEDIAN might produce results that are more representative of the bulk of the data.

PKEY key_field <...n> | PKEY ALL | NOKEY

One or more character, numeric, or datetime fields to use for grouping the data in the table.

If you specify NOKEY, the data is not grouped, and outliers are identified at the field level.

Note

Key fields must be sorted. Use PRESORT if one or more fields are not already sorted.

  • PKEY key_field use the specified field or fields to group the data in the table

    Multiple fields must be separated by spaces, and can be different data types.

    If you specify more than one field, you created nested groups in the output table. The nesting follows the order in which you specify the fields.

    For each group, a standard deviation is calculated for the group's numeric values in numeric_field. The group standard deviation is used as the basis for identifying group outliers.

  • PKEY ALL use all fields in the table to group the data in the table

    If you specify all fields, you create nested groups in the output table. The nesting follows the order in which the fields appear in the table layout.

    For each group, a standard deviation is calculated for the group's numeric values in numeric_field. The group standard deviation is used as the basis for identifying group outliers.

    Note

    Grouping by all fields includes numeric_field, which may not make sense. You can use EXCLUDE to exclude numeric_field from grouping.

  • NOKEY do not group the data in the table

    A standard deviation is calculated for numeric_field as a whole. The field standard deviation is used as the basis for identifying field outliers.

EXCLUDE field_name

optional

Only valid when grouping table data using PKEY ALL.

The field or fields to exclude from the command. EXCLUDE allows you to fine-tune PKEY ALL, by excluding the specified fields.

EXCLUDE must immediately follow PKEY ALL. For example:

PKEY ALL EXCLUDE field_1 field_2
ON numeric_field

The numeric field to examine for outliers. You can examine only one field at a time.

Outliers are values that fall outside the upper and lower boundaries established by the field or group standard deviation, or by a specified multiple of the standard deviation.

OTHER field <...n> | OTHER ALL

optional

One or more additional fields to include in the output.

  • OTHER field <...n> include the specified field or fields

    Fields are included in the order that you list them.

  • OTHER ALL include all fields in the table that are not specified as key fields or the outlier field

    Fields are included in the order that they appear in the table layout.

Note

Key fields and the outlier field are automatically included in the output table, and do not need to be specified using OTHER.

EXCLUDE field_name

optional

Only valid when using OTHER ALL.

The field or fields to exclude from the command. EXCLUDE allows you to fine-tune OTHER ALL, by excluding the specified fields.

EXCLUDE must immediately follow OTHER ALL. For example:

OTHER ALL EXCLUDE field_1 field_2
NUMSTDEV number_of_std_devs

In numeric_field, the number of standard deviations from the mean or the median to the upper and lower outlier boundaries. You can specify any positive integer or decimal numeral (0.5, 1, 1.5, 2 . . . )

The formula for creating outlier boundaries is:

mean/median ± (number_of_std_devs * standard deviation)

Note

Standard deviation is a measure of the dispersion of a data set – that is, how spread out the values are. The outliers calculation uses population standard deviation.

Example of outlier boundaries

NUMSTDEV 2

establishes, for numeric_field as a whole, or for each key field group:

  • an upper outlier boundary 2 standard deviations greater than the mean or the median

    mean/median + (2 * SD)

  • a lower outlier boundary 2 standard deviations less than the mean or the median

    mean/median – (2 * SD)

Any value greater than the upper boundary, or less than the lower boundary, is included as an outlier in the output results.

Note

For the same set of data, as you increase the value in number_of_std_devs you potentially decrease the number of outliers returned.

IF test

optional

A conditional expression that must be true in order to process each record. The command is executed on only those records that satisfy the condition.

Note

The IF parameter is evaluated against only the records remaining in a table after any scope parameters have been applied (WHILE, FIRST, NEXT).

TO SCREEN | table_name

optional

The location to send the results of the command to:

  • SCREEN displays the results in the Analytics display area

    Tip

    You can click any linked result value in the display area to drill down to the associated record or records in the source table.

  • table_name saves the results to an Analytics table

    Specify table_name as a quoted string with a .FIL file extension. For example: TO "Output.FIL"

    By default, the table data file (.FIL) is saved to the folder containing the Analytics project.

    Use either an absolute or relative file path to save the data file to a different, existing folder:

    • TO "C:\Output.FIL"
    • TO "Results\Output.FIL"

    Note

    Table names are limited to 64 alphanumeric characters, not including the .FIL extension. The name can include the underscore character ( _ ), but no other special characters, or any spaces. The name cannot start with a number.

PRESORT

optional

Performs a sorting operation before executing the command.

Tip

If the appropriate field or fields in the input table are already sorted, you can save processing time by not specifying PRESORT.

If you specify PRESORT and: Sorts by:
PKEY, AVERAGE
  • key field or fields
  • key field or fields, then by numeric_field (if numeric_field is computed)

    Note

    Sorting a computed numeric_field is an internal, technical requirement of Analytics.

PKEY, MEDIAN

key field or fields, then by numeric_field

NOKEY, AVERAGE

no sorting

NOKEY, MEDIAN numeric_field
WHILE test

optional

A conditional expression that must be true in order to process each record. The command is executed until the condition evaluates as false, or the end of the table is reached.

Note

If you use WHILE in conjunction with FIRST or NEXT, record processing stops as soon as one limit is reached.

FIRST range | NEXT range

optional

The number of records to process:

  • FIRST start processing from the first record until the specified number of records is reached
  • NEXT start processing from the currently selected record until the specified number of records is reached

Use range to specify the number of records to process.

If you omit FIRST and NEXT, all records are processed by default.

OPEN

optional

Opens the table created by the command after the command executes. Only valid if the command creates an output table.

Examples

Identifying transaction amounts that are out of the ordinary

You want to identify transaction amounts that are out of the ordinary across the entire Ar table in Sample Project.acl.

You decide to set the outlier boundaries at 3 times the standard deviation of the Amount field. The test returns 16 outliers in the table of 772 records.

OPEN Ar
OUTLIERS AVERAGE NOKEY ON Amount NUMSTDEV 3 PRESORT TO "Outliers_AR.fil" OPEN

You repeat the test, but increase the standard deviation multiple to 3.5. The test now returns only 6 outliers because the outlier boundaries are farther away from the center point of the values in the Amount field.

OPEN Ar
OUTLIERS AVERAGE NOKEY ON Amount NUMSTDEV 3.5 PRESORT TO "Outliers_AR.fil" OPEN

Identifying transaction amounts that are out of the ordinary for each customer

For each customer in the Ar table in Sample Project.acl, you want to identify any transaction amounts that are out of the ordinary.

You decide to set the outlier boundaries at 3 times the standard deviation of each customer's group of transactions.

OPEN Ar
OUTLIERS AVERAGE PKEY No ON Amount NUMSTDEV 3 PRESORT TO "Outliers_Customer_AR.fil" OPEN

The test returns 7 outliers. The standard deviation and the average are reported for each customer's group of transactions:

  Cust Number (No) Trans Amount STDEV AVERAGE Group Number
1 065003 4,954.64 1015.58 833.83 1
2 262001 3,567.34 772.44 438.81 2
3 262001 (2,044.82) 772.44 438.81 2
4 376005 (931.55) 411.18 484.57 3
5 501657 5,549.19 1332.80 441.14 4
6 811002 3,409.82 634.20 672.10 5
7 925007 3,393.87 736.48 906.16 6

How outliers are identified for customer 262001

Customer 262001 has 101 transactions in the Ar table, and two of them are reported as outliers because they exceed the outlier boundaries for that customer:

Outlier Lower boundary Upper boundary Outlier
(2,044.82) (1,878.51) 2,756.13 3,567.34

How outlier boundaries are calculated for customer 262001

The outlier boundaries are the average of all customer 262001 transactions, plus or minus the specified multiple of the standard deviation of the transactions:

Average of all customer 262001 transactions 438.81
Specified multiple of the standard deviation 3
Standard deviation of the transactions 772.44

438.81 ± (3 * 772.44)

= 438.81 ± 2,317.32

= (1,878.51) (lower boundary)

= 2,756.13 (upper boundary)

Using MEDIAN to identify transaction amounts that are out of the ordinary for each customer

You use MEDIAN, instead of AVERAGE, to perform the same outlier test that you performed in the example above.

OPEN Ar
OUTLIERS MEDIAN PKEY No ON Amount NUMSTDEV 3 PRESORT TO "Outliers_Customer_AR_Median.fil" OPEN

The test returns 10 outliers instead of the 7 that are returned in the previous test. Depending on the nature of the data, MEDIAN and AVERAGE can return somewhat different results:

  Cust Number (No) Trans Amount STDEV MEDIAN Group Number
1 065003 4,954.64 1015.58 663.68 1
2 262001 (2,044.82) 772.44 450.67 2
3 262001 3,567.34 772.44 450.67 2
4 376005 (931.55) 411.18 517.16 3
5 501657 4,426.14 1332.80 146.80 4
6 501657 5,549.19 1332.80 146.80 4
7 811002 3,409.82 634.20 624.53 5
8 925007 2,972.78 736.48 717.88 6
9 925007 3,030.71 736.48 717.88 6
10 925007 3,393.87 736.48 717.88 6

How outlier boundaries are calculated for each customer

The outlier boundaries are the median value of each customer's transactions, plus or minus the specified multiple of the standard deviation of the transactions.

For example, for customer 262001: 450.67 ± (3 * 772.44)

Remarks

For more information about how this command works, see Identifying outliers.

Add outlier boundary fields to the results table

Analytics automatically adds the STDEV and AVERAGE or MEDIAN calculated fields to the outliers results table. You may find it useful to also add two computed fields that show the outliers boundaries used to identify the outliers in the results table.

  1. Open the outliers results table.
  2. Paste this expression into the Analytics command line, edit it as required, and press Enter:
    DEFINE FIELD Lower_Boundary COMPUTED AVERAGE - (number_of_std_devs * STDEV)
    • For number_of_std_devs, substitute the actual standard deviation multiple you used.
    • If you used median as a center point rather than average, substitute MEDIAN for AVERAGE.
  3. Paste this expression into the Analytics command line, edit it as required, and press Enter:
    DEFINE FIELD Upper_Boundary COMPUTED AVERAGE + (number_of_std_devs * STDEV)
    • For number_of_std_devs, substitute the actual standard deviation multiple you used.
    • If you used median as a center point rather than average, substitute MEDIAN for AVERAGE.
  4. Right-click the view and select Add Columns.
  5. From the Available Fields list, double-click Lower_Boundary and Upper_Boundary to add them to the Selected Fields list.
  6. Click OK.
  7. Optional. Reposition the added fields by dragging the column headers.