Galvanize Unicode products

The Unicode editions of Galvanize products allow you to view and work with files that contain Unicode data.

Unicode is an industry-standard method of character encoding that supports most of the world's languages.

Should I install the non-Unicode or the Unicode edition of Analytics?

Analytics is available in non-Unicode and Unicode editions. Both editions are contained in the same installation package, and during the installation you specify which edition to install.

You should install the non-Unicode edition, unless you have a requirement to view or analyze Unicode data. Unicode data can only be opened in the Unicode edition of Analytics.

You are more likely to encounter Unicode data if you work in an environment with global information systems, or you analyze data that contains multiple languages.

When the Unicode edition is required

You need to install the Unicode edition to view or analyze data with:

  • Asian characters
  • a combination of non-Unicode, or traditional, character encodings

    For example, some combination of languages from at least two of these character encodings:

    • Latin 1 (English and Western European)
    • Latin 2 (Central European)
    • Cyrillic
    • Greek
    • Arabic

Note

If you want to use the Chinese, Japanese, or Polish Analytics user interface, the only option is to install the Unicode edition. This requirement is related to the language of the user interface, not to the language of the data.

Unilingual data

If the data you work with is English-only, or uses only one of the Western European languages, you should most likely install the non-Unicode edition. You should be aware, however, that it is possible for an English-only file to be Unicode.

Note

Contact your IT department if you are uncertain about the character encoding you might encounter when working with organizational data.

Using non-Unicode Analytics with Unicode data

In some situations it is possible, and preferable, to use non-Unicode Analytics with Unicode data.

If all the characters in the Unicode data you work with are supported by one of the traditional character encodings – for example, English-only data – there is no need to use Unicode Analytics. When you import this data into non-Unicode Analytics, text fields are automatically converted from Unicode to ASCII, with no loss or corruption of data.

For the reasons why this approach is preferable, see Drawbacks of the Unicode edition.

Note

Data corruption results if you import Unicode data to non-Unicode Analytics and the data contains characters not supported by the extended ASCII character set.
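Before importing Unicode data to non-Unicode Analytics, you may want to confirm that every character is supported by the single-byte character set in use. The following Python sketch runs outside Analytics and is not part of the product. It assumes the Windows-1252 code page, and the function name fits_code_page is hypothetical.

```python
def fits_code_page(text: str, code_page: str = "cp1252") -> bool:
    """Return True if every character in text exists in the single-byte code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        # At least one character has no equivalent in the code page,
        # so a conversion to single-byte data would lose or corrupt it.
        return False


print(fits_code_page("Invoice total"))  # English-only text: True
print(fits_code_page("Café Zürich"))    # Western European text: True
print(fits_code_page("請求書"))          # Asian characters: False
```

A result of False suggests the data needs the Unicode edition. A result of True only covers the text that was checked, so test a representative sample of the data.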

The language of the data is what matters

The language, or languages, of the data you work with is generally what dictates the edition of Analytics you should install, not the language of the Analytics user interface.

For example, your organization might use the Spanish Analytics interface, but the decision about whether to install the non-Unicode or Unicode edition depends on the language or languages you expect to encounter in the data.

The Chinese, Japanese, and Polish Analytics user interfaces are an exception to the general guideline about choosing an edition of Analytics. All three interfaces are available in the Unicode edition only. For information about localized Analytics interfaces, and Unicode support, see Language support.

Which edition of Analytics am I currently using?

To identify which edition of Analytics you are currently using, select Help > About to open the dialog box containing the product and subscription information. Unicode or non-Unicode appears after the version number.

Robots or Analytics Exchange users

You need to install the edition of Analytics that matches the edition of Robots or Analytics Exchange that your organization uses. Analytics cannot interact with Robots or Analytics Exchange if the editions are mismatched.

Drawbacks of the Unicode edition

The Unicode edition of Analytics has these drawbacks:

  • Larger data file sizes

    Unicode data requires approximately double the storage space of non-Unicode data because each character is represented using two bytes instead of one.

  • Possible slower performance

    With large data files, some Analytics commands may take noticeably longer to execute because twice the amount of data is being processed by the Unicode edition.

Because of these drawbacks, you should only install the Unicode edition if you actually need it to work with Unicode data.
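The doubling of storage space can be illustrated outside Analytics. This minimal Python sketch encodes the same text once as single-byte Windows-1252 and once as double-byte UTF-16 LE. The function name storage_bytes is hypothetical, and the sketch measures encoded text only, not the size of an actual Analytics data file.

```python
def storage_bytes(text: str) -> tuple[int, int]:
    """Return (single-byte size, double-byte size) for the same text."""
    single_byte = text.encode("cp1252")     # one byte per character
    double_byte = text.encode("utf-16-le")  # two bytes per character
    return len(single_byte), len(double_byte)


print(storage_bytes("Accounts payable"))  # (16, 32)
```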

Single-byte versus double-byte data in Analytics

Non-Unicode Analytics

When reading and writing data files, the non-Unicode edition of Analytics works with single-byte character sets (SBCS) only. In a single-byte character set, one byte of data is used to represent each character, and a maximum of 256 different characters are supported.

The single-byte character set used by non-Unicode Analytics depends on the language specified by your computer's system locale setting. If the system locale specifies English or one of the Western European languages, the Windows-1252 character set is used. Windows-1252 is also known as "Windows Latin 1". You can set your system locale in the Windows Control Panel.

Other common ways of referring to single-byte character sets are "ANSI", "ANSI character set", or "extended ASCII".
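Because a single-byte character set has only 256 positions, the same byte value represents different characters in different character sets. The Python sketch below, which runs outside Analytics, decodes one byte using four Windows code pages and then checks whether any single code page can hold text that mixes Spanish and Cyrillic characters. The function names are hypothetical, and the code pages were chosen for illustration.

```python
CODE_PAGES = ("cp1252", "cp1250", "cp1251", "cp1253")  # Latin 1, Latin 2, Cyrillic, Greek


def decode_in_code_pages(raw: bytes, code_pages=CODE_PAGES) -> dict[str, str]:
    """Decode the same bytes with each code page to show how the meaning changes."""
    return {cp: raw.decode(cp) for cp in code_pages}


def encodable_code_pages(text: str, code_pages=CODE_PAGES) -> list[str]:
    """Return the code pages that can represent every character in text."""
    usable = []
    for cp in code_pages:
        try:
            text.encode(cp)
            usable.append(cp)
        except UnicodeEncodeError:
            pass  # this code page lacks at least one character
    return usable


print(decode_in_code_pages(b"\xf1"))
# {'cp1252': 'ñ', 'cp1250': 'ń', 'cp1251': 'с', 'cp1253': 'ρ'}

print(encodable_code_pages("ñс"))          # [] : no single code page holds both
print(len("ñс".encode("utf-16-le")))       # 4 : Unicode represents both characters
```

An empty list is the situation in which a combination of traditional character encodings calls for the Unicode edition.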

Note

The character set that non-Unicode Analytics uses for processing data is not necessarily the same as the character set used by the text on the Analytics user interface.

Unicode Analytics

Reading data

The Unicode edition of Analytics can read double-byte or single-byte character sets. In double-byte Unicode data, two bytes of data represent each character. By using two (or more) bytes of data to encode characters, Unicode has the capacity to represent all the characters of the world's languages in a single character set.

Writing data

For write operations that create output files, Unicode Analytics typically uses double-byte UTF-16 character encoding. For some operations, the output file retains any single-byte character encoding that is present in the source file.

Number of bytes versus number of characters

When working with double-byte Unicode data, keep in mind the distinction between the length of a field in bytes, which appears in the Table Layout dialog box, and the length of a field in characters.

For example, if the length of a Unicode field is 44 bytes in the Table Layout dialog box, the field actually contains 22 characters.

Why bytes and characters matter in ACLScript

When you use functions such as STRING( ) and SUBSTRING( ), which include a field length parameter, you specify the length in characters, not bytes. Conversely, some commands, such as DEFINE FIELD, require that you specify field length in bytes, not characters.

In non-Unicode Analytics, one byte equals one character, so the distinction between bytes and characters does not matter. But in Unicode Analytics, when working with double-byte Unicode data, two bytes equal one character, so the distinction does matter.
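The arithmetic for moving between characters and bytes can be sketched as follows. This Python example is an illustration only and is not ACLScript. It assumes double-byte data in which every character occupies exactly two bytes, and the helper names are hypothetical.

```python
BYTES_PER_CHAR = 2  # assumption: double-byte Unicode data, two bytes per character


def chars_from_bytes(byte_length: int) -> int:
    """Convert a field length in bytes to a length in characters."""
    if byte_length % BYTES_PER_CHAR:
        raise ValueError("a double-byte field should have an even byte length")
    return byte_length // BYTES_PER_CHAR


def bytes_from_chars(char_length: int) -> int:
    """Convert a field length in characters to a length in bytes."""
    return char_length * BYTES_PER_CHAR


def byte_start(char_position: int) -> int:
    """Return the 1-based byte position at which a 1-based character position begins."""
    return (char_position - 1) * BYTES_PER_CHAR + 1


print(chars_from_bytes(44))  # 22 characters in a 44-byte field
print(bytes_from_chars(22))  # 44 bytes needed for 22 characters
print(byte_start(5))         # the 5th character begins at byte 9
```

With single-byte data, the same helpers apply with BYTES_PER_CHAR set to 1, in which case bytes and characters are interchangeable.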

For details about which type of unit to use for particular commands and functions, see the ACL Scripting Guide.

Importing text files to Unicode Analytics

The character encoding of a text file affects how it is imported to Unicode Analytics, and the data type used for character fields in the resulting Analytics table.

When importing ASCII and EBCDIC files to Unicode Analytics, you have two choices:

  • Convert the character data type to UNICODE and create an Analytics data file

    If you subsequently change the UNICODE data type to ASCII or EBCDIC, the characters in the fields will not display correctly.

  • Retain the ASCII or EBCDIC character encoding, and create an Analytics table layout only without an Analytics data file

    The Analytics table layout continues to be linked to the source text file.

| Text file character encoding | Data Definition Wizard option | Character data type in Analytics table | Character length |
|---|---|---|---|
| UTF-16 LE (Unicode) | Unicode Text | UNICODE | double-byte character |
| UTF-8 (Unicode) | Encoded Text + the appropriate character set (code page) for the data file | UNICODE | double-byte character |
| extended ASCII (ANSI character set) | ASCII > Delimited text file, or ASCII > Print Image (Report) file | UNICODE | double-byte character |
| extended ASCII (ANSI character set) | ASCII > Other file format | ASCII | single-byte character |
| EBCDIC | EBCDIC > Print Image (Report) File | UNICODE | double-byte character |
| EBCDIC | EBCDIC > Other file format | EBCDIC | single-byte character |

Little-endian and big-endian data

“Little-endian” (LE) and “big-endian” (BE) are terms that refer to two different methods of encoding Unicode data. Unicode data that originates from Microsoft Windows computers is typically encoded as little-endian. If you use Analytics on a Windows computer, you cannot analyze big-endian data.
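The difference between the two byte orders can be seen by encoding the same character both ways. The Python sketch below runs outside Analytics. It also checks for a byte order mark (BOM) at the start of UTF-16 data and re-encodes big-endian data as little-endian. The function names are hypothetical, the input is assumed to be UTF-16, and converting a file this way before import is an assumption for illustration, not a documented Analytics procedure.

```python
import codecs


def utf16_byte_order(raw: bytes) -> str:
    """Report the byte order indicated by a leading UTF-16 byte order mark."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return "unknown (no byte order mark)"


def to_little_endian(raw_be: bytes) -> bytes:
    """Re-encode big-endian UTF-16 bytes as little-endian UTF-16 bytes."""
    return raw_be.decode("utf-16-be").encode("utf-16-le")


print("A".encode("utf-16-le"))  # b'A\x00' : low-order byte first
print("A".encode("utf-16-be"))  # b'\x00A' : high-order byte first

sample = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
print(utf16_byte_order(sample))                    # big-endian
print(utf16_byte_order(to_little_endian(sample)))  # little-endian
```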

Conversion of non-Unicode Analytics projects to Unicode

You can open a non-Unicode Analytics project in the Unicode edition of Analytics, but you cannot do the reverse: open a Unicode Analytics project in non-Unicode Analytics.

|  | Open in non-Unicode Analytics | Open in Unicode Analytics |
|---|---|---|
| non-Unicode project | Yes | Yes |
| Unicode project | No | Yes |

Project conversion

When you open a non-Unicode Analytics project in Unicode Analytics, you are prompted to automatically convert the project and the associated log file to Unicode. If you proceed with the conversion, copies of the original non-Unicode project and the log file are saved with the file extension .OLD, and are not altered.

Note

Once you convert a non-Unicode Analytics project to Unicode, you can no longer open it in the non-Unicode edition of Analytics, and you cannot convert the project back to non-Unicode. If required, you can recover the non-Unicode version of the project using the .OLD file.

Analytics data files

When you convert a non-Unicode Analytics project to Unicode, any associated Analytics data files (.fil) are not converted to Unicode. They remain as single-byte ASCII (ANSI) data in the Unicode project.

Note

In Unicode Analytics, byte position and byte length of fields in the unconverted single-byte data work the same way as they do in non-Unicode Analytics: one byte equals one character. Keep this difference from double-byte Unicode data in mind if you execute any commands against the unconverted data that reference byte position or byte length.

Unicode-specific functions in Analytics

Analytics has six Unicode-specific functions to aid with data analysis and conversion. The functions are summarized in the table below. The functions are only included in the Unicode edition of Analytics.

For detailed information about these functions, see the ACL Scripting Guide.

| Function | Purpose |
|---|---|
| BINTOSTR( ) | Returns Unicode character data converted from ZONED or EBCDIC character data. Abbreviation for "Binary to String". This conversion ensures that values encoded in ZONED or EBCDIC can be displayed correctly. |
| DBYTE( ) | Returns the Unicode character located at the specified byte position in a record. |
| DHEX( ) | Converts a Unicode string to a hexadecimal string. The inverse of HTOU( ). |
| HTOU( ) | Converts a hexadecimal string to a Unicode string. Abbreviation for "Hexadecimal to Unicode". The inverse of DHEX( ). |
| DTOU( ) | Converts an Analytics date value to a Unicode string in the specified language and locale format. Abbreviation for "Date to Unicode". The inverse of UTOD( ). |
| UTOD( ) | Converts a Unicode string containing a formatted date to an Analytics date value. Abbreviation for "Unicode to Date". The inverse of DTOU( ). |

Analytics 14.1 Help