Converting analytic scripts to Unicode
If you are migrating from the non-Unicode edition of Analytics to the Unicode edition, existing regular scripts and analytic scripts are automatically converted to Unicode. However, you must verify that the logic of the scripts remains the same when applied to double-byte Unicode data.
What is Unicode?
Unicode is a standard for encoding text that uses two or more bytes to represent each character, and characters for all languages are contained in a single character set. The Unicode editions of Galvanize products allow you to view and work with files and databases that contain Unicode-encoded data in all modern languages.
Note
Analytics and the AX Engine support little-endian (LE) encoded Unicode data. These products cannot be used to analyze big-endian (BE) encoded data.
Migrating to Unicode Analytics Exchange
Note the following differences when migrating to the Unicode edition:
- encryption of Unicode scripts is not currently supported
- Analytics project files and log files are encoded as Unicode data (UTF-16 LE) and cannot be used with the non-Unicode edition of Analytics
- when you use Analytics to define print image and delimited files that contain ASCII or EBCDIC-encoded text, the fields in the Analytics table containing this data are assigned the Unicode data type by default
Required analytic script changes
Update any parameters that specify a value in bytes
Characters in the non-Unicode edition of Analytics are one byte in length. In the Unicode edition, characters in Unicode data are two bytes in length. When you specify a field length or starting position in bytes in the non-Unicode edition of Analytics, the number of bytes equals the number of characters. This is not true for Unicode data in the Unicode edition of Analytics.
To convert analytic scripts for use in Unicode Analytics, you must adjust the numeric value of any parameters that specify field length or starting position in bytes. For example, for an IMPORT DELIMITED command that specifies a WID value of 7 in non-Unicode Analytics, you must double the WID value to 14 to produce the same result in Unicode Analytics.
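For example, the doubled WID value in a hypothetical, abbreviated IMPORT DELIMITED fragment (the field name and the other parameters shown are illustrative):

Non-Unicode
IMPORT DELIMITED TO Employees ... FIELD "Last_Name" C AT 1 DEC 0 WID 7 PIC "" AS ""
Unicode
IMPORT DELIMITED TO Employees ... FIELD "Last_Name" C AT 1 DEC 0 WID 14 PIC "" AS ""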
In addition, for Unicode data, specify an odd-numbered starting byte position for fields, and an even number of bytes for field lengths. Specifying an even-numbered starting position, or an odd-numbered length, can cause characters to display incorrectly.
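For example, consider a hypothetical field that starts at byte 8 with a length of 7 bytes in the non-Unicode edition. Because each character occupies two bytes in Unicode data, character position n begins at byte (2 × n) − 1, so the converted field starts at byte 15 (odd) and its length doubles to 14 bytes (even):

Non-Unicode
DEFINE FIELD Last_Name ASCII 8 7
Unicode
DEFINE FIELD Last_Name UNICODE 15 14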
Recreate all instances of IMPORT PRINT and IMPORT DELIMITED
You need to recreate all instances of the IMPORT PRINT and IMPORT DELIMITED commands by importing the source data file using the Data Definition Wizard in the Unicode version of Analytics, and then reimporting the projects into AX Server. Using the Data Definition Wizard ensures that all syntax is valid.
Change all instances of the ZONED( ) and EBCDIC( ) functions
You need to change all instances of the ZONED() and EBCDIC() functions as follows so that the ASCII values returned by the functions are correctly converted to Unicode data:
- Computed fields: wrap the BINTOSTR() function around ZONED() or EBCDIC() instances
- Static expressions: wrap the BINTOSTR() function around ZONED() instances

For example, in a static expression:

BINTOSTR(ZONED(%result%, 5), 'A')
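Similarly, a hypothetical computed field expression (the field name invoice_total is illustrative) would be wrapped as follows; when wrapping EBCDIC() instances, use "E" instead of "A" as the second BINTOSTR() parameter:

Non-Unicode
ZONED(invoice_total, 10)
Unicode
BINTOSTR(ZONED(invoice_total, 10), "A")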
Change all instances of the OPEN FORMAT command
You need to modify all instances of the OPEN FORMAT command, using the SKIP parameter to skip the first two bytes of the Unicode file you are opening. This is required because the first two bytes of a UTF-16 encoded file are reserved as the byte order mark and are separate from the text in the file.
Non-Unicode
OPEN "ascii_test.txt" FORMAT template_table CRLF DEFINE FIELD full_rec ASCII 1 10
Unicode
OPEN "utf-16_test.txt" FORMAT template_table CRLF SKIP 2 DEFINE FIELD full_rec UNICODE 1 20
Verifying converted analytic scripts
Verify that the Unicode versions of the analytic scripts produce results that are identical to the results produced by the non-Unicode analytic scripts. The best way to do this is to use a Diff tool to compare the log files produced in the analysis. The Diff tool identifies any differences between the files.
What if the results are not the same?
If the Unicode version of an analytic script does not produce the same results as the non-Unicode version, you may be able to isolate the problem by comparing the log outputs of the two scripts at each step of the analysis.