Clustering data

Clustering groups the records in a table based on similar values in one or more numeric key fields. Similar values are values that are close to one another in the context of the entire data set. These similar values represent clusters that, once identified, reveal patterns in the data.

How clustering differs from other Analytics grouping commands

Clustering differs from other Analytics grouping commands:

Clustering does not require grouping on exact values, or predefined strata with hard numeric boundaries. Instead, clustering groups data based on similar numeric values – that is, values that are close to one another.
Clustering does not require pre-existing data categories.
Clustering based on more than one field outputs results that are not nested (non-hierarchical).

How the clustering algorithm works

Clustering in Analytics uses the K-means clustering algorithm, which is a popular machine learning algorithm. You can find detailed descriptions of K-means clustering on the Internet.

A summary of the algorithm appears below.


The K-means clustering algorithm uses an iterative process to optimize clusters:

1	Specify the number of clusters	Decide how many clusters, or groups, to use for grouping a data set. "K" represents the number of clusters you specify. The data points in the data set can be values in a single numeric field, or composite values that the algorithm computes based on multiple numeric fields.
2	Initialize cluster centroids	Generate a set of random data points to use as the initial centroids, or center points, in the cluster calculation. The number of centroids generated is equivalent to the number of clusters you specified.
3	Assign each data point to the nearest centroid	Find the shortest distance from each data point to a centroid. Distance comparisons use squared Euclidean distance. Assign each data point to the nearest centroid. All the data points assigned to a particular centroid become a cluster.
4	Recalculate the centroids	Calculate the average, or mean, of all the data points in a cluster. The mean becomes the new centroid for that cluster.
5	Iterate	Repeat steps 3 and 4: Recalculate the shortest distance from each data point to a centroid. Assign each data point to the nearest centroid, which results in some data points being reassigned to different clusters. Recalculate the centroids. Continue iterating until no data points are reassigned, or until a specified maximum number of iterations is reached. With each iteration, the makeup of the clusters becomes more coherent. That is, the data points in a cluster are closer together.
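
The steps above can be sketched in a few lines of Python. This is a minimal illustration of the general K-means procedure, not the Analytics implementation. The function names (k_means, nearest, mean_point) are hypothetical, and the sketch assumes that the initial centroids are picked at random from the existing data points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, max_iterations=30, seed=None):
    """Return (labels, centroids) for one K-means run (step 1: k is given)."""
    rng = random.Random(seed)
    # Step 2: assumption - use k randomly chosen data points as initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Step 3: assign each data point to the nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        # Step 5: stop when no data point is reassigned.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recalculate each centroid as the mean of its cluster.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean_point(members)
    return labels, centroids


# Illustrative data: two visibly separate groups of composite (two-field) values.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2, seed=1)
```

The first three points end up in one cluster and the last three in the other. Which cluster is labeled 0 and which is labeled 1 depends on the random starting centroids.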

Choosing the number of clusters (K value)

Determining the optimal number of clusters to use when clustering data can require some testing and experimentation. For any given data set, there is no single correct answer.

Choosing the fields to cluster on

Clustering allows you to discover organic groupings in data that you may not know exist. You are free to create clusters based on multiple numeric fields. In this sense, clustering is exploratory, and an example of unsupervised machine learning.

However, in order to make sense of the output clusters, you need to understand the relation between the fields you select for clustering.

Can I cluster on character or datetime fields?

Generally, you cannot cluster on character or datetime fields. The clustering algorithm accepts only numbers, and it performs calculations with the numbers (Euclidean distance, mean).


Categorical character data

You might have categorical character data, such as location IDs, in the form of numbers. Or you could use a computed field to map character categories to a set of numeric codes that you create. You could convert this data to the numeric data type and use it for clustering. However, the resulting clusters would not be valid because you would be performing mathematical calculations on numbers that are representative of something non-numeric.

For example, calculating a centroid position based on the average of a list of location IDs results in a meaningless number. The calculation is based on the invalid assumption that the mathematical distance between location numbers equates to some real-world, measurable distance.

If we consider physical distance, it makes no sense to say that the distance between location 1 and location 9 is twice the distance between location 1 and location 5. Locations 1 and 9 might be beside each other, and location 5 could be miles away.

For a cluster analysis involving location and physical distance, the valid data to use would be geographic coordinates.

Categorical data that represents a scale

You could cluster on categorical data that represents a scale – for example, a rating scale from Poor to Excellent, with corresponding numeric codes from 1 to 5. In this case, an average of the numeric codes has meaning.

Datetime data

You can use Analytics functions to convert datetime data to numeric data. However, the resulting numeric data is not continuous. This presents problems for cluster analysis, which assumes continuous sets of numbers.

For example, the following three numbers, as dates, are all one day apart. However, as numbers, there is a considerable gap, or distance, between the first and second numbers.

20181130
20181201
20181202

You could use serial date values in cluster analysis. Serial dates are a continuous set of integers representing the number of days that have elapsed since 01 January 1900.
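
The difference between the two numeric representations can be checked with a short Python sketch. It uses the three dates from the example above. The sketch assumes that a serial date is simply the number of days elapsed since 01 January 1900. The exact day numbering that Analytics uses may differ, but the gaps between consecutive dates are unaffected.

```python
from datetime import date

dates = [date(2018, 11, 30), date(2018, 12, 1), date(2018, 12, 2)]

# Dates written as YYYYMMDD numbers, as in the example above.
as_yyyymmdd = [d.year * 10000 + d.month * 100 + d.day for d in dates]

# Assumption: serial date = days elapsed since 01 January 1900.
EPOCH = date(1900, 1, 1)
as_serial = [(d - EPOCH).days for d in dates]


def gaps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


yyyymmdd_gaps = gaps(as_yyyymmdd)  # [71, 1]: the month boundary creates a jump
serial_gaps = gaps(as_serial)      # [1, 1]: every day is one unit apart
```

Because the clustering algorithm measures distance between numbers, the jump of 71 in the YYYYMMDD form would be treated as a real gap between the first and second dates, while the serial form preserves the one-day spacing.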

Assessing the output clusters

The clustering algorithm will always output a table with the specified number of clusters. Every record in the output table will be in a cluster.

At this point, you need to assess whether any of the clusters have analytical significance or meaning. Just because the algorithm groups records in a cluster does not necessarily mean the grouping is significant.

Two characteristics you can assess are cluster coherence and cluster size.
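
As a rough illustration, both characteristics can be computed from the output records. The Python sketch below is not an Analytics feature. The function name summarize_clusters is hypothetical, and the sketch assumes that coherence can be approximated by the mean squared distance from each record to the centroid of its cluster, where a lower value means a tighter cluster.

```python
from collections import defaultdict


def summarize_clusters(points, labels):
    """Return size, centroid, and mean squared distance to centroid per cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    summary = {}
    for label, members in groups.items():
        n = len(members)
        centroid = tuple(sum(coords) / n for coords in zip(*members))
        # Assumed coherence measure: average squared Euclidean distance
        # from each member to the centroid of its cluster.
        mean_sq_dist = sum(
            sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members
        ) / n
        summary[label] = {
            "size": n,
            "centroid": centroid,
            "mean_sq_dist": mean_sq_dist,
        }
    return summary


# Illustrative single-field data with cluster numbers already assigned.
points = [(1.0,), (2.0,), (3.0,), (10.0,), (50.0,), (90.0,)]
labels = [0, 0, 0, 1, 1, 1]
summary = summarize_clusters(points, labels)
```

In this made-up data, both clusters contain three records, but cluster 0 is tightly packed around 2 while cluster 1 is spread widely around 50, so cluster 1 is the less coherent of the two.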

Tip

Graphing the cluster output table as a scatter plot in a reporting tool, with each cluster assigned a different color, is the easiest way to quickly assess the nature of the output clusters.

Steps

Specify settings for the clustering algorithm

Open the table with the data that you want to cluster.
From the Analytics main menu, select Machine Learning > Cluster.
In Number of clusters (K Value), specify the number of clusters to use for grouping the data.
In Maximum number of iterations, specify an upper limit for the number of iterations performed by the clustering algorithm.
In Number of initializations, specify the number of times to generate an initial set of random centroids.
Optional. Select Seed, and enter a number.
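
The following Python sketch suggests how these settings might interact in a typical K-means implementation. It is an illustration only and does not describe Analytics internals. It assumes that each initialization is a complete run from a fresh set of random centroids, that the run with the lowest total squared distance is kept, and that a seed makes the random choices repeatable. The names run_once and best_of are hypothetical, and the data is a single numeric field.

```python
import random


def run_once(values, k, max_iterations, rng):
    """One K-means run on single-field data: (total squared distance, centroids)."""
    centroids = rng.sample(values, k)  # random initial centroids
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[i].append(v)
        updated = [
            sum(c) / len(c) if c else centroids[i]  # empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    total = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return total, sorted(centroids)


def best_of(values, k, max_iterations=30, initializations=10, seed=None):
    """Assumption: keep the run with the lowest total squared distance."""
    rng = random.Random(seed)  # the same seed gives the same sequence of runs
    return min(run_once(values, k, max_iterations, rng)
               for _ in range(initializations))


values = [1.0, 2.0, 3.0, 20.0, 21.0, 22.0, 50.0, 51.0, 52.0]
single = best_of(values, k=3, initializations=1, seed=7)
several = best_of(values, k=3, initializations=10, seed=7)
repeat = best_of(values, k=3, initializations=10, seed=7)
```

Under these assumptions, using more initializations can only match or improve on a single run with the same seed, and repeating the command with the same seed reproduces the same result.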

Specify a data preprocessing method

In the Preprocessing dropdown list, select the method for preprocessing the data before clustering it:

Standardize	Center key field values around zero (0), and scale the values to unit variance when calculating the clusters
Scale to unit variance	Scale key field values to unit variance when calculating the clusters, but do not center the values around zero (0)
None	Use the raw key field values, unscaled, when calculating the clusters
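
The three options can be sketched in Python as follows. This is an illustration of the arithmetic only. The function name preprocess and the method labels are hypothetical, and the sketch assumes the population standard deviation is used for scaling, which may not match what Analytics does.

```python
from statistics import fmean, pstdev


def preprocess(values, method="standardize"):
    """Return a preprocessed copy of one key field's values."""
    if method == "none":
        return list(values)  # raw values, unscaled

    spread = pstdev(values)  # assumption: population standard deviation
    if spread == 0:
        spread = 1.0  # a constant field cannot be scaled; leave it unchanged

    if method == "scale":
        # Scale to unit variance without centering around zero.
        return [v / spread for v in values]
    if method == "standardize":
        # Center around zero, then scale to unit variance.
        center = fmean(values)
        return [(v - center) / spread for v in values]
    raise ValueError(f"unknown method: {method}")


amounts = [100.0, 200.0, 300.0]
standardized = preprocess(amounts, "standardize")
scaled = preprocess(amounts, "scale")
raw = preprocess(amounts, "none")
```

Preprocessing matters most when you cluster on several fields with very different ranges. Without scaling, the field with the largest numbers dominates the distance calculation.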

Select fields

From the Cluster On list, select one or more key fields to use for clustering the records in the table.
Key fields must be numeric.
Optional. From the Other Fields list, select one or more additional fields to include in the output table.

Tip

You can Ctrl+click to select multiple non-adjacent fields, and Shift+click to select multiple adjacent fields.

Finalize command inputs

If there are records in the current view that you want to exclude from processing, enter a condition in the If text box, or click If to create an IF statement using the Expression Builder.
The IF statement considers all records in the view and filters out those that do not meet the specified condition.

Note

The If condition is evaluated against only the records remaining in a table after any scope options have been applied (First, Next, While).

In the To text box, specify the name of the output table.
Optional. On the More tab:
1. To specify that only a subset of records is processed, select one of the options in the Scope panel.
2. Select Use Output Table if you want the output table to open automatically.
Click OK.


Analytics 14.1 Help