Clustering data

Concept Information

CLUSTER command

Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that are close to one another in the context of the entire data set. These similar values represent clusters that, once identified, reveal patterns in the data.

How clustering differs from other Analytics grouping commands

Clustering differs from other Analytics grouping commands in the following ways:

  • Clustering does not require grouping on exact values, or predefined strata with hard numeric boundaries. Instead, clustering groups data based on similar numeric values – that is, values that are close to one another.
  • Clustering does not require pre-existing data categories.
  • Clustering based on more than one field outputs results that are not nested (non-hierarchical).

Choosing the fields to cluster on

Clustering allows you to discover organic groupings in data that you may not know exist. You are free to create clusters based on multiple numeric fields. In this sense, clustering is exploratory, and an example of unsupervised machine learning.

However, in order to make sense of the output clusters, you need to understand the relation between the fields you select for clustering.

Cluster on a single field

Clustering on a single numeric field is relatively straightforward. You have a single set of values, and clustering groups the values based on closeness between values, or proximity. For example, you can cluster an amount field to find out where the amounts are concentrated over the range of values.

The benefit of clustering over a traditional approach like stratifying is that you do not have to make any assumptions, in advance, about where the concentrations may exist, or create arbitrary numeric boundaries. Clustering discovers where the boundaries lie for any given number of clusters.

Example of clustering on a single numeric field

You cluster the Ap_Trans table on the Invoice Amount field to find out where amounts are concentrated over the range of values. Your expectation is that most of the amounts will be clustered at the lower end of the range. You decide to group the Invoice Amount field into five clusters, and then summarize the clusters to discover how many records are in each cluster.

The output results

In the output results shown below, the first five records are system-generated and equate to the desired number of clusters that you specified. In the Invoice Amount field, the five records show the centroid, or center point, that the clustering algorithm calculates for each of the five clusters of invoice amounts. For example, the centroid for cluster 3 is 2,969.04. For more information, see How the clustering algorithm works.

Beneath the system-generated records are the source data records grouped into clusters, starting with cluster 0. The value in the Distance field is the distance from the actual invoice amount to the calculated centroid value for the cluster.
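
The Distance value can be reproduced by hand for a single key field. The following is a minimal Python sketch, assuming the None preprocessing option (raw values) so that the distance is in the same units as the invoice amounts. The centroid is the cluster 3 value from the example. The invoice amount is hypothetical and is not taken from Ap_Trans.

```python
# Centroid reported for cluster 3 in the example above.
centroid = 2969.04

# Hypothetical invoice amount assigned to cluster 3 (assumption, not source data).
invoice_amount = 3100.00

# With one key field, Euclidean distance reduces to an absolute difference.
distance = abs(invoice_amount - centroid)
print(round(distance, 2))
```

A record with a small distance sits near the center of its cluster. A record with a large distance sits near the edge.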

Summarizing the clusters

If you summarize the Cluster field and sort the summarized output by count, you get the following results, which confirm that the distribution of values is what you expected. Overall, invoice amounts are heavily skewed toward lower values. (Centroid values have been added to the table for ease of comparison.)

The single large value in a cluster by itself appears to be an outlier and should probably be investigated.

Cluster   Count   Centroid value
0         73      553.36
3         16      2,969.04
4         8       8,061.46
2         4       18,010.28
1         1       56,767.20
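
The skew described above can be quantified from the summarized counts. This short Python sketch uses only the counts and centroids from the table. The variable names are illustrative, and flagging single-record clusters as candidates for review is one possible heuristic, not an Analytics feature.

```python
# Cluster number -> (record count, centroid), copied from the summarized table.
summary = {
    0: (73, 553.36),
    3: (16, 2969.04),
    4: (8, 8061.46),
    2: (4, 18010.28),
    1: (1, 56767.20),
}

total = sum(count for count, _ in summary.values())

# Share of all records that falls in each cluster.
for cluster, (count, centroid) in summary.items():
    share = 100 * count / total
    print(f"cluster {cluster}: {count} records ({share:.1f}%), centroid {centroid:,.2f}")

# Clusters that contain a single record are candidates for follow-up.
singletons = [cluster for cluster, (count, _) in summary.items() if count == 1]
print("single-record clusters:", singletons)
```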

Cluster on multiple fields

When you cluster on two or more fields, you need to ask yourself how the fields might relate. You could use clustering to test a hypothesis. For example, a company might be concerned about the rate of employee turnover, which management thinks is concentrated among younger, lower-paid employees.

You could use clustering to discover if there is a strong relation between:

  • length of employee retention and employee age (two-dimensional clustering)
  • length of employee retention, employee age, and salary (three-dimensional clustering)

Note

For this analysis, you need to avoid including any fields that do not clearly relate to the hypothesis, such as number of sick days taken.

Assessing the output clusters

The clustering algorithm will always output a table with the specified number of clusters. Every record in the output table will be in a cluster.

At this point, you need to assess whether any of the clusters have analytical significance or meaning. Just because the algorithm groups records in a cluster does not necessarily mean the grouping is significant.

Two characteristics you can assess are cluster coherence and cluster size.

Tip

Graphing the cluster output table as a scatter plot in a reporting tool, with each cluster assigned a different color, is the easiest way to quickly assess the nature of the output clusters.
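
If a scatter plot is not available, size and coherence can also be approximated numerically. The sketch below is written in Python and assumes you have exported the Cluster and Distance values from the output table. The rows are hypothetical, and using the mean distance to the centroid as a measure of coherence is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (Cluster, Distance) pairs taken from an output table.
rows = [(0, 0.2), (0, 0.3), (0, 0.1), (1, 1.9), (1, 2.3), (2, 0.4)]

distances = defaultdict(list)
for cluster, distance in rows:
    distances[cluster].append(distance)

# Size = number of records. Mean distance = rough proxy for coherence
# (a lower value means the records sit closer to the centroid).
report = {c: (len(d), round(mean(d), 2)) for c, d in distances.items()}

for cluster, (size, mean_distance) in sorted(report.items()):
    print(f"cluster {cluster}: size {size}, mean distance {mean_distance}")
```

In this made-up data, cluster 1 is noticeably looser than cluster 0, and cluster 2 contains a single record.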

How the clustering algorithm works

Clustering in Analytics uses the K-means clustering algorithm, which is a popular machine learning algorithm. You can find detailed descriptions of K-means clustering on the Internet.

A summary of the algorithm appears below.
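
For readers who prefer code, the following Python sketch shows the general shape of K-means (often called Lloyd's algorithm). It is a generic illustration and not the Analytics implementation: the function name kmeans, its parameters, and the choice to start from randomly selected records are all assumptions.

```python
import math
import random

def kmeans(points, k, iterations=30, seed=None):
    """Generic K-means sketch on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen records
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, labels

# Hypothetical one-field data with two obvious concentrations.
amounts = [(1.0,), (1.2,), (0.8,), (10.0,), (10.5,), (9.5,)]
centroids, labels = kmeans(amounts, k=2, seed=0)
print(sorted(centroids), labels)
```

The two alternating steps, assigning records to the nearest centroid and then recalculating each centroid as the mean of its records, repeat until the centroids stop moving or the iteration limit is reached.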

Choosing the number of clusters (K value)

Determining the optimal number of clusters to use when clustering data can require some testing and experimentation. For any given data set, there is not an exact answer.
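
One common way to structure that experimentation, outside of Analytics, is to run the algorithm for several values of K and compare the total squared distance of records to their centroids (often called inertia). The Python sketch below is a simplified one-field illustration. The function inertia_for_k and the data are hypothetical, and the "elbow" heuristic it supports is a general technique, not something the source prescribes.

```python
import random

def inertia_for_k(values, k, seed=0, iterations=50):
    """Run a basic 1-D K-means and return the sum of squared distances to centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Recalculate each centroid; an empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Hypothetical values with three visible concentrations.
values = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5, 20.0, 21.0, 19.0]

for k in range(1, 6):
    print(k, round(inertia_for_k(values, k), 2))
```

Inertia generally falls as K increases, so look for the K value after which the improvement becomes small. A single random start can settle on a poor local solution, so treat the numbers as a guide and not as an exact answer.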

Can I cluster on character or datetime fields?

Generally, you cannot cluster on character or datetime fields. The clustering algorithm accepts only numbers, and it performs calculations with the numbers (Euclidean distance, mean).
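
The two calculations named above are simple to illustrate. In this Python sketch the records and field meanings (age, salary) are hypothetical. It shows a Euclidean distance between two records, a mean used as a center point, and the error that results when the same distance calculation is attempted on character values.

```python
import math
from statistics import mean

# Two hypothetical records with two numeric key fields: (age, salary).
a = (30.0, 52000.0)
b = (45.0, 61000.0)

# Euclidean distance: sqrt((30 - 45)^2 + (52000 - 61000)^2)
distance = math.dist(a, b)

# Mean of each field, which is how a centroid is calculated.
centroid = tuple(mean(axis) for axis in zip(a, b))

print(round(distance, 4), centroid)

# Character values have no numeric distance, so the calculation fails.
try:
    math.dist(("Smith",), ("Jones",))
except TypeError:
    print("cannot calculate a distance between character values")
```

Note that the salary field dominates the distance because its values are much larger than the age values, which is the situation that data preprocessing is intended to address.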

Steps

Note

If the machine learning menu options are disabled, the Python engine is probably not installed. For more information, see Install ACL for Windows.

Specify settings for the clustering algorithm

  1. Open the table with the data that you want to cluster.
  2. From the Analytics main menu, select Machine Learning > Cluster.
  3. In Number of clusters (K Value), specify the number of clusters to use for grouping the data.
  4. In Maximum number of iterations, specify an upper limit for the number of iterations performed by the clustering algorithm.
  5. In Number of initializations, specify the number of times to generate an initial set of random centroids.
  6. Optional. Select Seed, and enter a number.
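
The following Python sketch illustrates how these four settings typically interact in a K-means run. It is not the Analytics implementation. The function names run_once and cluster are hypothetical, and keeping the run with the lowest total squared distance is an assumption based on the common K-means convention.

```python
import random

def run_once(values, k, max_iterations, rng):
    """One K-means run on 1-D values from one random set of starting centroids."""
    centroids = rng.sample(values, k)
    for _ in range(max_iterations):  # Maximum number of iterations
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda c: abs(v - centroids[c]))].append(v)
        updated = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:  # stop early once the centroids settle
            break
        centroids = updated
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return inertia, sorted(centroids)

def cluster(values, k, max_iterations=30, initializations=10, seed=None):
    rng = random.Random(seed)  # a fixed seed makes the random starts repeatable
    runs = [run_once(values, k, max_iterations, rng) for _ in range(initializations)]
    return min(runs)  # assumption: keep the run with the lowest inertia

# Hypothetical data.
values = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5, 20.0, 21.0, 19.0]
print(cluster(values, k=3, seed=42))
```

More initializations reduce the chance that an unlucky set of starting centroids determines the result, and specifying a seed allows you to reproduce the same output on a later run.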

Specify a data preprocessing method

In the Preprocessing dropdown list, select the method for preprocessing the data before clustering it:

  • Standardize: Center key field values around zero (0), and scale the values to unit variance when calculating the clusters.
  • Scale to unit variance: Scale key field values to unit variance when calculating the clusters, but do not center the values around zero (0).
  • None: Use the raw key field values, unscaled, when calculating the clusters.
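
The difference between the first two methods can be seen in a few lines of Python. This is an illustrative sketch only: the function names and salary values are hypothetical, and the use of the population standard deviation is an assumption, since the source does not state which variant Analytics uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Center the values around zero and scale them to unit variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def scale_to_unit_variance(values):
    """Scale the values to unit variance without centering them."""
    s = pstdev(values)
    return [v / s for v in values]

# Hypothetical key field values.
salaries = [40000.0, 50000.0, 60000.0]

print([round(v, 3) for v in standardize(salaries)])
print([round(v, 3) for v in scale_to_unit_variance(salaries)])
```

Both methods put fields with very different magnitudes on a comparable scale, so that a field with large values does not dominate the distance calculation. Only Standardize also moves the mean of each field to zero.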

Select fields

  1. From the Cluster On list, select one or more key fields to use for clustering the records in the table.

    Key fields must be numeric.

  2. Optional. From the Other Fields list, select one or more additional fields to include in the output table.

Tip

You can Ctrl+click to select multiple non-adjacent fields, and Shift+click to select multiple adjacent fields.

Finalize command inputs

  1. If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.

    The IF statement considers all records in the view and filters out those that do not meet the specified condition.

    Note

    The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).

  2. In the To text box, specify the name of the output table.
  3. Optional. On the More tab:
    1. To specify that only a subset of records is processed, select one of the options in the Scope panel.
    2. Select Use Output Table if you want the output table to open automatically.
  4. Click OK.
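
The order of evaluation described in the note for the If condition can be pictured with a short Python sketch. The records, the First 4 scope, and the condition are all hypothetical. The point is only that the scope is applied first and the If condition is applied to what remains.

```python
# Hypothetical table of six records.
amounts = [500, 20, 900, 15, 700, 30]
records = [{"id": i, "amount": a} for i, a in enumerate(amounts, start=1)]

# Scope option applied first: First 4 records.
in_scope = records[:4]

# If condition applied only to the records that remain after the scope.
selected = [r for r in in_scope if r["amount"] > 100]

print([r["id"] for r in selected])
```

Record 5 meets the condition but is never evaluated, because the scope option has already excluded it.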