Converting analytic scripts to Unicode
If you are migrating from the non-Unicode edition of Analytics to the Unicode edition, existing regular scripts and analytic scripts are automatically converted to Unicode. However, you must verify that the logic of the scripts remains the same when applied to double-byte Unicode data.
What is Unicode?
Unicode is a standard for encoding text that uses two or more bytes to represent each character, and characters for all languages are contained in a single character set. The Unicode editions of Galvanize products allow you to view and work with files and databases that contain Unicode-encoded data in all modern languages.
Note
Analytics and the AX Engine support little-endian (LE) encoded Unicode data. These products cannot be used to analyze big-endian (BE) encoded data.
Migrating to Unicode Analytics Exchange
Note the following differences when migrating to the Unicode edition:
- encryption of Unicode scripts is not currently supported
- Analytics project files and log files are encoded as Unicode data (UTF-16 LE) and cannot be used with the non-Unicode edition of Analytics
- when you use Analytics to define print image and delimited files that contain ASCII or EBCDIC-encoded text, the fields in the Analytics table containing this data are assigned the Unicode data type by default
Required analytic script changes
Update any parameters that specify a value in bytes
Characters in the non-Unicode edition of Analytics are one byte in length. In the Unicode edition, characters in Unicode data are two bytes in length. When you specify a field length or starting position in bytes in the non-Unicode edition of Analytics, the number of bytes equals the number of characters. This is not true for Unicode data in the Unicode edition of Analytics.
To convert analytic scripts for use in Unicode Analytics, you must adjust the numeric value of any parameters that specify field length or starting position in bytes. For example, for an IMPORT DELIMITED command that specifies a WID value of 7 in non-Unicode Analytics, you must double the WID value to 14 to produce the same result in Unicode Analytics.
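For example, the doubled WID value in a hypothetical, abbreviated IMPORT DELIMITED fragment (the field name and the other parameters shown are illustrative):

Non-Unicode
IMPORT DELIMITED TO Employees ... FIELD "Last_Name" C AT 1 DEC 0 WID 7 PIC "" AS ""
Unicode
IMPORT DELIMITED TO Employees ... FIELD "Last_Name" C AT 1 DEC 0 WID 14 PIC "" AS ""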
In addition, for Unicode data, specify an odd-numbered starting byte position for fields, and an even number of bytes for field lengths. Specifying an even-numbered starting position, or an odd-numbered length, can cause characters to display incorrectly.
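For example, consider a hypothetical field that starts at byte 8 with a length of 7 bytes in the non-Unicode edition. Because each character occupies two bytes in Unicode data, character position n begins at byte (2 × n) − 1, so the converted field starts at byte 15 (odd) and its length doubles to 14 bytes (even):

Non-Unicode
DEFINE FIELD Last_Name ASCII 8 7
Unicode
DEFINE FIELD Last_Name UNICODE 15 14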
Recreate all instances of IMPORT PRINT and IMPORT DELIMITED
You need to recreate all instances of the IMPORT PRINT and IMPORT DELIMITED commands by importing the source data file using the Data Definition Wizard in the Unicode version of Analytics, and then reimporting the projects into AX Server. Using the Data Definition Wizard ensures that all syntax is valid.
Change all instances of the ZONED( ) and EBCDIC( ) functions
You need to change all instances of the ZONED() and EBCDIC() functions as follows so that the ASCII values returned by the functions are correctly converted to Unicode data:
- Computed fields: wrap the BINTOSTR() function around ZONED() or EBCDIC() instances
- Static expressions: wrap the BINTOSTR() function around ZONED() instances

For example, in a static expression:

BINTOSTR(ZONED(%result%, 5), 'A')
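Similarly, a hypothetical computed field expression (the field name invoice_total is illustrative) would be wrapped as follows; when wrapping EBCDIC() instances, use "E" instead of "A" as the second BINTOSTR() parameter:

Non-Unicode
ZONED(invoice_total, 10)
Unicode
BINTOSTR(ZONED(invoice_total, 10), "A")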
Change all instances of the OPEN FORMAT command
You need to modify all instances of the OPEN FORMAT command, using the SKIP parameter to skip the first two bytes of the Unicode file you are opening. This is required because the first two bytes of a UTF-16 encoded file are reserved as the byte order mark and are separate from the text in the file.
Non-Unicode
OPEN "ascii_test.txt" FORMAT template_table CRLF DEFINE FIELD full_rec ASCII 1 10
Unicode
OPEN "utf-16_test.txt" FORMAT template_table CRLF SKIP 2 DEFINE FIELD full_rec UNICODE 1 20
Verifying converted analytic scripts
Verify that the Unicode versions of the analytic scripts produce results that are identical to the results produced by the non-Unicode analytic scripts. The best way to do this is to use a Diff tool to compare the log files produced in the analysis. The Diff tool identifies any differences between the files.
What if the results are not the same?
If the Unicode version of an analytic script does not produce the same results as the non-Unicode version, you may be able to isolate the problem by comparing the log outputs of the two scripts at each step of the analysis.