Analyzing data

"Analyzing data" is a broad concept that encompasses a large range of different processes and techniques. There may be more than one way to achieve the same data analysis objective. The overall process is often iterative, requiring that you modify your initial approach based upon information you discover along the way.

Effective data analysis

At its most basic, analyzing data is the process of finding answers to questions about data. Analytics provides a number of commands and other tools that you can use to gain general insights about the data you are investigating, and to answer specific questions. However, you should resist the notion that you can click two or three buttons in Analytics and magically have all your data analysis answers.

Effective data analysis requires:

  • understanding the nature of the data
  • formulating specific analysis objectives
  • knowledgeably applying the tools

Analytics can significantly amplify your data analysis abilities but it does not replace them.

Data analysis commands and tools in Analytics

The table below categorizes Analytics commands and tools by data analysis area. The categories are not intended to be absolute, and you may find an effective use for a command outside its category. Some commands, such as Sort and Join, serve a primary purpose that is not analytical, but in some situations they can still provide analytical insight.

Note

Data analysis, beyond the most basic, typically requires using a series of commands to progressively work toward your analysis objective, rather than using a single command in isolation.

Each entry below lists the data analysis area, the associated commands or tools, and a description of their use.

General characteristics

Commands: Verify, Count, Total, Profile, Statistics, Outliers, Sort, Index

Use these commands to discover general characteristics of a data set, including:

  • data validity
  • record counts
  • total amounts
  • minimum, maximum, and average amounts
  • standard deviation, median, mode, and quartile values
  • outliers
  • ranges
  • distribution of negative and positive values
  • patterns
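These general measures are easy to picture in plain Python. The following is a conceptual sketch, not Analytics syntax, and the amounts are invented sample data:

```python
import statistics

# Hypothetical invoice amounts -- invented sample data
amounts = [120.0, -45.5, 300.0, 300.0, 87.25, 1200.0, 15.0]

profile = {
    "count": len(amounts),                             # record count
    "total": sum(amounts),                             # total amount
    "minimum": min(amounts),
    "maximum": max(amounts),
    "average": statistics.mean(amounts),
    "std_dev": statistics.stdev(amounts),
    "median": statistics.median(amounts),
    "mode": statistics.mode(amounts),
    "negatives": sum(1 for a in amounts if a < 0),     # value distribution
    "positives": sum(1 for a in amounts if a > 0),
}
```

A profile like this is usually the first pass; the results suggest which specific questions to pursue next.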
Reliability/accuracy
Tool: computed fields
Use computed fields to recalculate and test the accuracy of calculated amounts in a data set, such as total amounts including tax.
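As a conceptual sketch (not Analytics syntax), a computed-field test recalculates an amount from its components and flags disagreements; the records and the 10% tax rate here are invented:

```python
# Hypothetical records: each row stores a subtotal, a tax rate, and the
# total reported by the source system -- all values invented
records = [
    {"subtotal": 100.00, "tax_rate": 0.10, "reported_total": 110.00},
    {"subtotal": 250.00, "tax_rate": 0.10, "reported_total": 280.00},  # wrong
]

def recomputed_total(row):
    """Recalculate the total the way the source system should have."""
    return round(row["subtotal"] * (1 + row["tax_rate"]), 2)

# Flag rows where the stored total disagrees with the recalculation
mismatches = [r for r in records if recomputed_total(r) != r["reported_total"]]
```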
Isolation
Tools: filtering, searching
Use filtering to restrict a data set, or data processing, to a subset of records of interest. Use searching to locate specific values in a data set.
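Both techniques can be pictured as simple list comprehensions. This is a conceptual Python sketch over invented transaction records, not Analytics syntax:

```python
# Hypothetical transaction records -- invented data
transactions = [
    {"id": 1, "vendor": "Acme Corp", "amount": 5200.00},
    {"id": 2, "vendor": "Globex", "amount": 130.00},
    {"id": 3, "vendor": "Acme Corp", "amount": 75.00},
]

# Filtering: restrict processing to the subset of records of interest
large = [t for t in transactions if t["amount"] > 1000]

# Searching: locate records containing a specific value
acme = [t for t in transactions if "acme" in t["vendor"].lower()]
```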

Sequential order
Command: Sequence
Test whether data is sequentially ordered, and identify out-of-sequence items.
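The underlying test can be sketched in Python (conceptual only; the check numbers are invented):

```python
# Hypothetical check numbers as they appear in the table -- invented data
check_numbers = [1001, 1002, 1005, 1004, 1006]

# An item is out of sequence if it is not greater than its predecessor
out_of_sequence = [
    check_numbers[i]
    for i in range(1, len(check_numbers))
    if check_numbers[i] <= check_numbers[i - 1]
]
is_ordered = not out_of_sequence
```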
Completeness
Command: Gaps
Verify whether all records in a sequence, such as a sequentially ordered series of checks, are present, and identify the location of any gaps in the sequence.
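Gap detection amounts to scanning consecutive pairs for jumps. A conceptual Python sketch over invented check numbers:

```python
# Hypothetical sequentially numbered checks -- invented data
check_numbers = [2001, 2002, 2003, 2007, 2008]

# Report each missing run as a (first_missing, last_missing) pair
gaps = []
for prev, cur in zip(check_numbers, check_numbers[1:]):
    if cur - prev > 1:
        gaps.append((prev + 1, cur - 1))
```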
Uniqueness
Command: Duplicates
Identify duplicate values or items in a field, or entire duplicate records.
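Conceptually, duplicate testing is a frequency count. A Python sketch over invented invoice numbers:

```python
from collections import Counter

# Hypothetical vendor invoice numbers -- invented data
invoice_numbers = ["INV-100", "INV-101", "INV-100", "INV-102", "INV-101"]

# Any value that appears more than once is a duplicate
counts = Counter(invoice_numbers)
duplicates = sorted(v for v, n in counts.items() if n > 1)
```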
Inexactness
Command: Fuzzy Duplicates
Identify nearly identical values that may refer to the same real-world entity.
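One common way to score near-matches is string similarity. The sketch below uses Python's difflib as a stand-in rather than the Fuzzy Duplicates algorithm itself, and the vendor names and 0.8 threshold are invented:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical vendor names -- invented data
vendors = ["Acme Corp", "Acme Corp.", "Globex Ltd", "Initech"]

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairs above a chosen threshold may refer to the same real-world entity
THRESHOLD = 0.8
fuzzy_pairs = [
    (a, b) for a, b in combinations(vendors, 2) if similarity(a, b) >= THRESHOLD
]
```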

Frequency distribution / Concentration of materiality
Commands: Stratify, Age, Classify, Summarize, Cross-Tabulate, Histogram, Cluster
Group records and determine how many records, and how much value, are concentrated by numeric range or cluster, by time period, or by record identifiers such as location codes, vendor/customer numbers, or product identifiers. Also useful for identifying outliers.
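The classify and stratify ideas can be sketched in Python (conceptual only; the location codes, amounts, and strata boundaries are invented):

```python
from collections import defaultdict

# Hypothetical transactions: (location_code, amount) -- invented data
transactions = [
    ("NY", 500.0), ("NY", 1500.0), ("TX", 200.0),
    ("TX", 300.0), ("NY", 250.0),
]

# Classify: concentration of record counts and value by identifier
by_location = defaultdict(lambda: {"count": 0, "total": 0.0})
for code, amount in transactions:
    by_location[code]["count"] += 1
    by_location[code]["total"] += amount

# Stratify: concentration of records by numeric range (invented strata)
strata = {"0-999": 0, "1000+": 0}
for _, amount in transactions:
    strata["0-999" if amount < 1000 else "1000+"] += 1
```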

Compare
Commands: Join, Fuzzy Join, Relate
Combine tables to discover whether records are matched or unmatched across tables, such as an invoice table and a purchase order (PO) table.
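A matched/unmatched comparison reduces to a key lookup across tables. A conceptual Python sketch with invented invoice and PO records:

```python
# Hypothetical invoice and purchase-order tables -- invented data
invoices = [
    {"po": "PO-1", "amount": 100},
    {"po": "PO-2", "amount": 50},
    {"po": "PO-9", "amount": 75},
]
purchase_orders = [{"po": "PO-1"}, {"po": "PO-2"}, {"po": "PO-3"}]

po_keys = {p["po"] for p in purchase_orders}

# Matched invoices exist in both tables; unmatched invoices have no PO
matched = [inv for inv in invoices if inv["po"] in po_keys]
unmatched = [inv for inv in invoices if inv["po"] not in po_keys]
```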
Numeric anomaly
Command: Benford
Discover anomalous numeric data by testing leading digits for variance from the expected Benford distribution.
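The test compares observed leading-digit frequencies against the Benford expectation log10(1 + 1/d). A conceptual Python sketch over invented amounts:

```python
import math
from collections import Counter

# Hypothetical payment amounts -- invented data
amounts = [123.45, 189.0, 205.0, 1110.0, 175.0, 310.0, 95.0, 145.0]

# Observed count of each leading digit (strip any leading zeros/points)
leading = [int(str(abs(a)).lstrip("0.")[0]) for a in amounts]
observed = Counter(leading)

# Expected Benford proportion for leading digit d: log10(1 + 1/d)
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Variance of the observed proportion from the expected one, per digit
variance = {
    d: observed.get(d, 0) / len(amounts) - expected[d] for d in range(1, 10)
}
```

In a genuine Benford test the sample would be much larger; with only a handful of records the observed proportions are too noisy to interpret.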