SORTWORDS( ) function

Returns a string with individual words sorted in sequential order.

Syntax

SORTWORDS(string)

Parameters

Name	Type	Description
string	character	The string, or the expression, containing words to be sorted. A "word" is any character or group of characters separated by spaces: an actual word or name a number that uses the character data type groups of letters, numbers, or special characters such as abbreviations or identifying prefixes Multiple spaces between words are automatically converted to a single space. Leading or trailing spaces are automatically trimmed.

Name

Type

Description

string

character

The string, or the expression, containing words to be sorted.

A "word" is any character or group of characters separated by spaces:

an actual word or name
a number that uses the character data type
groups of letters, numbers, or special characters such as abbreviations or identifying prefixes

Multiple spaces between words are automatically converted to a single space. Leading or trailing spaces are automatically trimmed.

Output

Character.

Examples

Basic examples

Literal character input

Returns "1 2 A Z a z" (non-Unicode Analytics):

SORTWORDS("Z a 2 z A 1")

Returns "1 2 a A z Z" (Unicode Analytics):

SORTWORDS("Z a 2 z A 1")

Returns "1 2 A A Z Z":

SORTWORDS(UPPER("Z a 2 z A 1"))

Returns "CA, FL NY, TX,":

SORTWORDS("CA, TX, NY, FL")

Returns "CA FL NY TX":

SORTWORDS(OMIT("CA, TX, NY, FL", ","))

Field input

Returns all the values in the Vendor_Address field with address elements sorted into sequential order:

SORTWORDS(Vendor_Address)

Advanced examples

Sort address elements to improve fuzzy duplicate matching

You can use SORTWORDS( ) as a helper function when performing various kinds of fuzzy matching in Analytics.

Effect of SORTWORDS( ) on Levenshtein distance

First, let's look at the Levenshtein distance between two occurrences of the same address, differently formatted.

Without the SORTWORDS( ) function, the Levenshtein distance returned is 22. A Levenshtein distance that large suggests the two strings are not the same address:

LEVDIST("125 SW 39TH ST, Suite 100", "Suite 100, 125 SW 39TH ST")

Now, let's add the SORTWORDS( ) function. The Levenshtein distance returned is 2 – dramatically lower – which suggests the two strings are the same address.

LEVDIST(SORTWORDS("125 SW 39TH ST, Suite 100"), SORTWORDS("Suite 100, 125 SW 39TH ST"))

Isolating fuzzy duplicates for "125 SW 39TH ST, Suite 100"

You create a filter that isolates all values in the Vendor_Address field that are within a specified Levenshtein distance of "125 SW 39TH ST, Suite 100":

SET FILTER TO LEVDIST(SORTWORDS(Vendor_Address), SORTWORDS("125 SW 39TH ST, Suite 100"), F) < 3

SET FILTER TO ISFUZZYDUP(SORTWORDS(Vendor_Address), SORTWORDS("125 SW 39TH ST, Suite 100"), 3, 99)

Increasing or decreasing the Levenshtein distance in the expressions ( 3 ) allows you to adjust the degree of difference in the filtered values.

For more information about Levenshtein distance, see LEVDIST( ) function.

Remarks

Overview video

For a video providing an overview of the function, see Fuzzy Matching Using SORTWORDS() (English only).

The sort sequence used by SORTWORDS( )

The SORTWORDS( ) function uses whatever sort sequence is specified in the Sort Order option (Tools > Options > Table). The default sort sequences are shown below.

For detailed information, see The Sort Order option and sort sequences .

Analytics Edition	Sort Order default	Associated sort sequence
non-Unicode	System Default (ASCII)	Numbers, then uppercase, then lowercase: 0, 1, 2... A, B, C... a, b, c... For example, “Z” sorts before “a”.
Unicode	Mix Languages (UCA) (Unicode collation algorithm)	Numbers, then lowercase and uppercase intermixed: 0, 1, 2... a, A, b, B, c, C... For example, “a” sorts before “Z”.

Analytics Edition

Sort Order default

Associated sort sequence

non-Unicode

System Default

(ASCII)

Numbers, then uppercase, then lowercase:

0, 1, 2... A, B, C... a, b, c...

For example, “Z” sorts before “a”.

Unicode

Mix Languages (UCA)

(Unicode collation algorithm)

Numbers, then lowercase and uppercase intermixed:

0, 1, 2... a, A, b, B, c, C...

For example, “a” sorts before “Z”.

Case sensitivity

SORTWORDS( ) is case sensitive. Depending on which edition of Analytics you are using (non-Unicode or Unicode), casing in strings may affect sorting.

You can use the UPPER( ) function in conjunction with SORTWORDS( ) if you do not want case to affect sorting:

SORTWORDS(UPPER("string"))

SORTWORDS( ) can improve fuzzy matching

SORTWORDS( ) can improve the effectiveness of fuzzy matching commands, and filters or conditions that use fuzzy matching functions:

Caution

If you use SORTWORDS( ) in conjunction with any of the fuzzy matching commands or functions you must apply SORTWORDS( ) to both strings or both fields being compared. Applying the function to only one of the two strings or fields can seriously degrade the results of fuzzy matching.

Levenshtein distance algorithm

SORTWORDS( ) is especially helpful when used with commands and functions based on the Levenshtein distance algorithm, which returns greater difference values when identical or similar elements in two strings are in different positions. By sorting the elements first, you can significantly reduce the difference values returned by the Levenshtein algorithm. Result sets have fewer false positives and more true positives.

Dice coefficient algorithm

SORTWORDS( ) may be helpful when used with commands and functions based on the Dice coefficient algorithm. However, an improvement in effectiveness is not always the case. By design, the Dice coefficient algorithm minimizes the importance of the position of elements, so sorting elements has less impact than it does with the Levenshtein distance algorithm.

Depending on the nature of the data, SORTWORDS( ) may actually degrade effectiveness by causing the Dice coefficient algorithm to return lower scores. Test a set of sample data before deciding whether to use SORTWORDS( ) in conjunction with the Dice coefficient algorithm in a production setting.

A second consideration is that the benefit of using SORTWORDS( ) is more modest when the Dice coefficient n-gram length is shorter. As you reduce the n-gram length, the Dice coefficient algorithm increasingly minimizes the importance of the position of elements.