
Unicode

Unicode Data Types

Analyzer provides three distinct data types for reading variations of Unicode character data:

UTF8
UTF16
UTF16BE

Unicode supports many more characters than the older single-byte standards: ASCII defines 128 characters, and extended single-byte character sets (including EBCDIC code pages) support at most 256 distinct characters. The first 128 characters of ASCII and Unicode are the same, but larger character values are unique to Unicode.

The Unicode data types correspond to different encodings, each of which reduces the size of the raw Unicode data (which might otherwise take 4 bytes per character).

UTF-8 looks very much like ASCII until a character value above 127 occurs. Each such character is stored as a multi-byte sequence (two to four bytes) that encodes the actual Unicode character.
UTF-16 is the encoding most commonly associated with Unicode. Each character normally takes two bytes, although characters with very large values take four bytes (a surrogate pair). When you look at raw Western-language data you usually see a binary zero in every other byte. Technically, each two-byte value is stored as a MICRO value, but the whole field is treated as a character string.
UTF-16 Big Endian is exactly the same as UTF-16, except that each two-byte value is stored with the high-order byte first.
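
The byte layouts described above can be seen in a short Python sketch. It uses Python's built-in codecs, not Analyzer itself, and it assumes that the plain UTF16 type stores the low-order byte first (little endian), which is implied by the contrast with UTF16BE. The function name encoded_forms and the sample characters are illustrative only.

```python
# Illustrative only: Python's built-in codecs, not Analyzer's own reader.
# Assumption: plain UTF16 is little endian (low-order byte first).
samples = ["A", "é", "€", "😀"]

def encoded_forms(ch):
    """Return the raw bytes of one character in each of the three encodings."""
    return {
        "UTF8": ch.encode("utf-8"),
        "UTF16": ch.encode("utf-16-le"),
        "UTF16BE": ch.encode("utf-16-be"),
    }

for ch in samples:
    forms = encoded_forms(ch)
    # Show the code point, then the bytes of each encoding in hexadecimal.
    print(f"U+{ord(ch):04X}", {name: raw.hex(" ") for name, raw in forms.items()})
```

For "A" the UTF-8 form is the single ASCII byte 41, while the two UTF-16 forms are 41 00 and 00 41, which shows the binary zero in every other byte and the swapped byte order. The last sample lies outside the two-byte range, so both UTF-16 forms take four bytes.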

Analyzer reads, and usually auto-detects, UTF-8, UTF-16 and UTF-16 Big Endian Unicode data. In Analyzer, Unicode data is internally converted into ASCII. When processed through the Print Image Wizard, the text is displayed normally. When processed through the Data Definition Wizard as a standard file, the character data is displayed in its raw state, because the file may contain non-printable data.

Limitations of Unicode Data Types

Analyzer's Unicode data types provide only limited support for the following Unicode character sets:

Asian Unicode data
Non-Western (i.e. Eastern) European Unicode data (such as Cyrillic [Russian] or Greek)

This is because Analyzer’s Unicode data types internally convert each Unicode character into its equivalent ASCII value. Where there is no equivalent ASCII value, Analyzer substitutes a “?”.

However, Analyzer's Unicode data types fully support purely numeric fields in these types of Unicode data (such as quantity or amount fields, or ID and invoice number fields).
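
The substitution behaviour, and the reason numeric fields are unaffected, can be approximated in Python. This is a sketch of the general technique (ASCII conversion with a "?" placeholder), not Analyzer's actual conversion routine. The function name to_ascii_with_placeholder is hypothetical, and the sketch assumes the numeric data uses the ordinary digits 0 to 9.

```python
# A rough approximation of ASCII conversion with "?" substitution.
# Hypothetical helper: this is not Analyzer's internal routine.
def to_ascii_with_placeholder(text):
    # errors="replace" makes the ascii codec emit "?" for every
    # character that has no ASCII equivalent.
    return text.encode("ascii", errors="replace").decode("ascii")

for value in ["Invoice 1042", "Москва 1042", "東京", "000123.45"]:
    print(repr(value), "->", repr(to_ascii_with_placeholder(value)))
```

Cyrillic and Asian characters each become a single "?", while the digits and the decimal point pass through unchanged because they already have ASCII values.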

Additionally, when fields in these types of Unicode data are defined using one of Analyzer’s Unicode data types, all of their data is exported correctly (for example, to Excel), whether or not Analyzer's Unicode data types can fully process the field.

Changing Unicode Data Types

For data that is stored as Unicode text, Analyzer allows users to change the data type from UTF8 or UTF16 to NUMERIC, PRINT or DATE, as appropriate.

Note: Be sure to enter the correct decimal or date descriptions as appropriate.

Note: When changing the data type for UTF16 fields (i.e. double-byte character fields), be sure to specify a field length that captures an even number of bytes (i.e. captures both bytes of each stored character).
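
The even-length rule in the note above follows from simple arithmetic: a field of n two-byte characters occupies 2 × n bytes. The Python sketch below illustrates this. The helper names utf16_field_length and is_valid_utf16_length are hypothetical and are not Analyzer functions, and the sample date value is made up.

```python
# Hypothetical helpers illustrating the even-byte rule for UTF16 fields.
BYTES_PER_UTF16_UNIT = 2

def utf16_field_length(char_count):
    """Bytes needed to hold char_count two-byte characters."""
    if char_count < 0:
        raise ValueError("char_count must not be negative")
    return char_count * BYTES_PER_UTF16_UNIT

def is_valid_utf16_length(byte_length):
    """True when the length covers whole two-byte characters only."""
    return byte_length > 0 and byte_length % BYTES_PER_UTF16_UNIT == 0

# A 10-character date stored as UTF-16 (little endian) occupies 20 bytes.
raw = "2024-01-31".encode("utf-16-le")
print(len(raw), utf16_field_length(10))

# An even length decodes cleanly; an odd length cuts a character in half.
print(raw[:20].decode("utf-16-le"))
try:
    raw[:19].decode("utf-16-le")
except UnicodeDecodeError as err:
    print("odd field length:", err.reason)
```

A length of 19 bytes captures only the first byte of the final character, which is the situation the note warns against.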

The ability to change the UTF data type to NUMERIC, PRINT or DATE removes the need to convert the character-based (text) UTF data types into the desired data type by using computed fields and conversion functions.