Using functions to clean data

You can use an Analytics function to clean data. Different functions allow you to perform different types of data cleansing, depending on your particular need.

Why do I need to clean data?

Frequently, data imported into Analytics is not clean — meaning perfectly formatted and standardized. Analytics commands do not work, or give inaccurate results, if you input badly formatted or non-standard data.

How cleaning data works

When you clean data you are not cleaning or modifying the actual source data. The source data always remains read-only.

Instead, you input the source data to a function that processes it and outputs appropriately formatted and standardized "virtual data". You then input the cleansed virtual data to an Analytics command, rather than the original source data.

Key point

Using one or more functions, you can perform a wide range of data cleansing or data preparation tasks that allow you to work effectively and accurately even if source data is inconsistent. Data preparation is a fundamental preliminary task for much data analysis.

Removing blank spaces

Inconsistent blank spaces in data are a common cause of inaccurate results. You can use the ALLTRIM( ) function to remove leading and trailing blank spaces and ensure accurate results.

Example: Blank spaces

You want to sort a vendor table by city, but leading spaces in some of the city names are causing inaccurate sorting.

Vendor_City sorted
[ ] [ ] [ ] [ ] Salt Lake City
[ ] [ ] Chicago
Ann Arbor
Austin
Englewood
[ ] = blank space

You can use the ALLTRIM( ) function to get rid of the leading spaces and ensure accurate sorting:

ALLTRIM(Vendor_City)

ALLTRIM(Vendor_City) sorted
Ann Arbor
Austin
Chicago
Englewood
Salt Lake City

Note

To apply the ALLTRIM( ) function to the Vendor_City field, you create a computed field that uses ALLTRIM( ). Computed fields are discussed in a subsequent tutorial.

Removing unwanted characters

Inconsistent characters, or non-critical characters, can often impede data analysis. You can use different functions to include only certain characters, or to exclude certain characters or strings of characters, prior to processing data with an Analytics command.

Trying things yourself

You can copy and paste any of the function examples below into the command line to verify the return value, or to experiment with different inputs.

In the command line, you must preface the example with DISPLAY and a space. Experimenting in the command line is explained in previous tutorials.

Example: Unwanted characters

You want to perform a duplicates test on a table, but inconsistent formatting of data is causing inaccurate results.

For example, running the duplicates command on an inconsistently formatted Phone Number field does not report these two phone numbers as duplicates, although they clearly are duplicates:

  • (604) 555-1212
  • Tel. No: 604-555-1212

To ensure all duplicates are found, you can use functions to standardize the data before performing the duplicates operation.

Task Function example

Standardize telephone numbers

INCLUDE("(604) 555-1212", "1234567890")

Returns 6045551212

INCLUDE("Tel. No: 604-555-1212", "1234567890")

Returns 6045551212

The INCLUDE( ) function includes only the specified characters in the output – in this case, only the numbers 0 to 9

Tip

Use INCLUDE( ) if the set of characters you want to include is small, and the set you want to exclude is large.

Standardize addresses

EXCLUDE("#1550-980 Howe St.", "#.")

Returns 1550-980 Howe St

EXCLUDE("1550-980 Howe St", "#.")

Returns 1550-980 Howe St

The EXCLUDE( ) function excludes the specified characters from the output – in this case, the hash sign (#) and the period (.)

Tip

Use EXCLUDE( ) if the set of characters you want to exclude is small, and the set you want to include is large.

Standardize addresses and remove street abbreviations

OMIT("#1550-980 Howe St.", " Street, St.,#")

Returns 1550-980 Howe

OMIT("1550-980 Howe Street", " Street, St.,#")

Returns 1550-980 Howe

The OMIT( ) function excludes the specified characters and strings of characters from the output – in this case, the hash sign (#), and the inconsistently formatted St. and Street

Tip

Use OMIT( ) if you want to exclude specific strings of characters, but not the individual characters that make up the string.

For example, exclude Street when it occurs as a unit, but not S, t, r, e, or t when they occur in other words.

Where to next?

Learn how to use functions to perform multiple tasks simultaneously: Cleaning and filtering data at the same time