Unicode

From den4b Wiki

Unicode is a universal character set that defines more than 100,000 characters, together with information about their properties. It covers the scripts of virtually all of the world's living languages, plus many other symbols and notations. Old operating systems such as Windows 95/98/Me could work only with 8-bit character sets of 256 characters (ANSI), so the system could handle only one or two alphabets at a time. Modern operating systems use Unicode in one form or another and can therefore work with almost any language and alphabet.

Many applications, however, cannot work with Unicode text or process Unicode filenames. They were originally designed for the old operating systems, and from the programmer's point of view it is not an easy task to redesign an application to work with Unicode on new operating systems.

ASCII vs ANSI

ASCII defines 128 characters with ordinal values 0-127 ($00-$7F hex): 33 control characters and 95 printable characters, consisting of the letters of the English alphabet, digits, and common symbols and punctuation marks.

ANSI, on the other hand, is an extension of ASCII. ASCII needs only 7 bits, but each character is stored in 1 byte (8 bits, ordinal values 0-255), which leaves 128 unused values (128-255). ANSI uses these values for the letters and symbols of other languages. Each language or group of languages can have its own extension of ASCII, and each such extension has its own name (code page), for example: Windows-1252 for Western European (Latin) languages, Windows-1251 for Cyrillic languages such as Russian, etc.
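
The effect of code pages can be illustrated with a minimal Python sketch. It is not part of the original article; the byte value 0xE4 is chosen only as an example, and `cp1252` and `cp1251` are Python's codec names for Windows-1252 and Windows-1251. The same byte decodes to a different letter under each code page, while bytes in the ASCII range decode identically.

```python
import unicodedata

# One byte above 0x7F (outside ASCII), chosen only for illustration.
raw = bytes([0xE4])

western = raw.decode("cp1252")   # Windows-1252 (Western European)
cyrillic = raw.decode("cp1251")  # Windows-1251 (Cyrillic)

# Print the Unicode names rather than the glyphs so the output is pure ASCII.
print(unicodedata.name(western))
print(unicodedata.name(cyrillic))

# Bytes in the ASCII range (0-127) mean the same thing under both code pages.
ascii_bytes = b"Hello"
same = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1251")
print(same)
```

This is why ANSI text is unreliable when the code page it was written in is not known: the bytes alone do not say which alphabet was intended.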

Encoding

Unicode text can be stored using a range of encoding schemes, including UTF-8, UTF-16, and UTF-32. UTF-16 and UTF-32 also come in byte order variants (little-endian and big-endian).
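
As a rough illustration of the byte order variants, the following Python sketch (an added example, not taken from the original article) encodes the letter "A" in both UTF-16 byte orders and in UTF-8. The codec names and the `codecs.BOM_*` constants come from Python's standard library.

```python
import codecs

text = "A"  # U+0041

little = text.encode("utf-16-le")  # low-order byte first
big = text.encode("utf-16-be")     # high-order byte first
utf8 = text.encode("utf-8")        # a single byte, so byte order does not apply

print(little.hex(), big.hex(), utf8.hex())

# A byte order mark (U+FEFF) at the start of a stream tells the reader
# which byte order was used to write it.
print(codecs.BOM_UTF16_LE.hex(), codecs.BOM_UTF16_BE.hex())
```

The two UTF-16 results contain the same two bytes in opposite order, which is why a reader needs either a byte order mark or prior knowledge of the variant.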

Each encoding offers a different balance between compactness (storage efficiency) and convenience (ease of manipulation). The choice depends heavily on the application needs.

Number of bytes each encoding uses to represent one Unicode character:

Encoding  Bytes
UTF-8     1-4
UTF-16    2 or 4
UTF-32    4
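
The sizes can be checked with a short Python sketch. The four sample characters are arbitrary choices for illustration, not from the original article, and the `-le` codec variants are used so that no byte order mark is added to the count.

```python
# Sample characters from different ranges of Unicode (chosen for illustration).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]


def byte_counts(ch):
    """Return the number of bytes each encoding needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in encodings}


# Map each code point (as U+XXXX) to its per-encoding byte counts.
table = {f"U+{ord(ch):04X}": byte_counts(ch) for ch in samples}

for code_point, counts in table.items():
    print(code_point, counts)
```

UTF-8 is the most compact for ASCII text, UTF-16 is often smaller for characters such as U+20AC, and UTF-32 always uses 4 bytes, which makes indexing by character simple at the cost of space.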

External References