Lesson 2 | Character sets |
Objective | Explore the problems posed by multilingual and multiplatform character sets |
Java Character Sets
This course has tried to steer clear of reading and writing text, that is, character-based data like Q or The quick brown fox jumped over the lazy dog. Reading and writing text is simple as long as you assume that everyone is reading and writing ASCII character data. However, in the modern world, that's rarely true. The Mac uses an extended 8-bit character set called MacRoman that contains many additional symbols like © and letters from non-English Latin alphabets like ç. Windows uses a different 8-bit character set, based on ISO Latin-1, that has most of these symbols but maps them to different numbers. The character ç is number 141 on the Mac but number 231 on Windows.
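The mismatch can be checked from Java itself. The following is a minimal sketch, not part of the original lesson: the class name CedillaCodes and the helper singleByteCode are made up for illustration, and it assumes the JDK at hand ships the optional MacRoman charset under the name x-MacRoman (the code skips that part if it does not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CedillaCodes {
    // Returns the unsigned value of the first byte that the charset produces
    // for the character. Meaningful for single-byte character sets only.
    static int singleByteCode(char c, Charset charset) {
        byte[] bytes = String.valueOf(c).getBytes(charset);
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        char cedilla = '\u00E7'; // the letter c with a cedilla

        // ISO-8859-1 (Latin-1) is guaranteed to be present on every JVM.
        System.out.println("ISO-8859-1: "
                + singleByteCode(cedilla, StandardCharsets.ISO_8859_1));

        // MacRoman is optional, so check before asking for it (assumed name).
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("MacRoman:   "
                    + singleByteCode(cedilla, Charset.forName("x-MacRoman")));
        } else {
            System.out.println("MacRoman is not available on this JVM");
        }
    }
}
```

The same character yields a different number under each character set, which is why bytes cannot be interpreted as text without knowing the encoding that produced them.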
Different Alphabets
The problem only gets worse when you attempt to incorporate non-Roman alphabets like Greek, Cyrillic, Hebrew, and Arabic. Character sets used for these languages often do not correspond to ASCII at all, and may not have ASCII character equivalents. When you consider languages written with ideographs, like Chinese, Japanese, and Korean, there's simply no longer any way to fit all the characters from even one of these languages into eight bits. You have to move to a multibyte character set. There is no single universally accepted standard for encoding these languages; several different encodings are in common use.
Chinese Character Sets
Chinese alone has about 80,000 pictographs.
Since a two-byte character set like the original 16-bit Unicode has only 2^16, or 65,536, different characters, it cannot possibly include all the characters from Chinese. Furthermore, Japanese and Korean can be squeezed into 16-bit Unicode only by sharing characters with Chinese. A full representation of Chinese plus separate representations for Japanese and Korean requires a four-byte character set.
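The arithmetic is easy to confirm in code. This is an illustrative sketch with a hypothetical class name (CodeUnitLimits); it relies only on java.lang.Character. It also shows that current Java releases store a character outside the 16-bit range, such as the ideograph U+20000, as two char values rather than one.

```java
class CodeUnitLimits {
    // Number of distinct values that fit in one 16-bit unit.
    static int sixteenBitCapacity() {
        return 1 << 16;
    }

    // Number of Java char values needed to hold the given Unicode code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(sixteenBitCapacity());       // 65536
        System.out.println((int) Character.MAX_VALUE);  // 65535, the largest char
        System.out.println(charsNeeded(0x4E2D));        // 1: fits in 16 bits
        System.out.println(charsNeeded(0x20000));       // 2: needs a pair of chars
    }
}
```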
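One such set is UCS-4, the four-byte form of the Universal Coded Character Set (UCS, ISO/IEC 10646). Java does not use it internally; since J2SE 5.0, Java represents characters beyond the 16-bit range as pairs of 16-bit char values (UTF-16 surrogate pairs).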
Character streams automatically adapt to the local character set and support internationalization.
Java comes with classes called
- InputStreamReader and
- OutputStreamWriter
that translate between Unicode and local encodings. Two of the supported encodings are GB2312 (Simplified Chinese) and Big5 (Traditional Chinese).
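A round trip through these two classes might look like the sketch below. The class EncodingRoundTrip and its helpers encode and decode are hypothetical names introduced for illustration. Big5 is an optional charset, so the example checks for it and falls back to UTF-8 if the JVM does not provide it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingRoundTrip {
    // Writes Java's Unicode text out as bytes in the given encoding.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads bytes in the given encoding back into Unicode text.
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader =
                new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // the two characters of the word "Chinese"
        // Big5 is optional; fall back to UTF-8 where it is missing.
        Charset charset = Charset.isSupported("Big5")
                ? Charset.forName("Big5")
                : StandardCharsets.UTF_8;
        byte[] data = encode(text, charset);
        System.out.println(charset.name() + " uses " + data.length + " bytes");
        System.out.println("round trip ok: " + decode(data, charset).equals(text));
    }
}
```

Passing the charset explicitly to both constructors avoids depending on whatever the platform default happens to be.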
Java 2 allows the programmer to directly access the fonts on the machine.
Before the introduction of Swing, Java could not display Chinese except on Chinese operating systems.
With Swing, you can display Chinese in any component, provided you have fonts that support Chinese on your system.
So the latest versions of Java can display
- Chinese,
- Japanese, and
- Korean
text directly if corresponding fonts are installed.
Java uses UTF-16 internally; however, every Java Virtual Machine (JVM) has a default charset.
The default charset of the JVM is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.
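To see which default a particular JVM has picked, something like the following sketch can be used. DefaultCharsetDemo and describeDefault are hypothetical names; Charset.defaultCharset() is the standard API call. The result varies from machine to machine, so no output is shown.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Returns the canonical name of the charset this JVM chose at startup.
    static String describeDefault() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + describeDefault());
    }
}
```

Code that must behave identically on every platform should name its charset explicitly instead of relying on this default.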
In addition, Java's older java.io and java.lang APIs refer to UTF-8 by the name UTF8.
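Both spellings resolve to the same charset, as this small sketch suggests. The class name CharsetAliasDemo and the helper canonicalName are illustrative only; Charset.forName and Charset.name are standard calls.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {
    // Looks up a charset by any of its names and returns the canonical one.
    static String canonicalName(String anyName) {
        return Charset.forName(anyName).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("UTF8"));   // UTF-8
        System.out.println(canonicalName("UTF-8"));  // UTF-8
    }
}
```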
Unicode originally attempted to provide encodings for all the most common characters in most of the world's current languages in two bytes (16 bits). Java uses Unicode internally, and it uses a slightly modified form of the UTF-8 encoding of Unicode in .class files for storing string literals. However, there are many Chinese, Japanese, Arabic, Hebrew, and other character sets in common use that are not Unicode. A means is needed for converting between the Unicode text Java supports and these other character sets.
Text Encoding (Unicode character set)
Java is a language for the Internet, and since the users of the Internet speak and write in many different human languages, Java must be able to handle a large number of languages. One of the ways in which Java supports internationalization is through the Unicode character set. Unicode is a worldwide standard that supports the scripts of most languages. Java 7 bases its character and string data on the Unicode 6.0 standard, and Java uses at least two bytes to represent each symbol internally.
Java source code can be written using Unicode and stored in any number of character encodings, ranging from a full binary form to plain ASCII in which Unicode characters are written as \uXXXX escapes. This makes Java a friendly language for non-English-speaking programmers, who can use their native language for class, method, and variable names just as they can for the text displayed by the application.
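As a small illustration of the ASCII-encoded form, the sketch below spells both an identifier and a string literal with the escape for é (U+00E9). The class and method names are made up for this example. The compiler treats the escape exactly as if the character itself had been typed, so the file stays pure ASCII while the program still uses a non-ASCII name.

```java
class UnicodeSourceDemo {
    static String greeting() {
        // The variable name and the literal both contain e-acute,
        // written as an escape so this file remains plain ASCII.
        String caf\u00E9 = "caf\u00E9";
        return caf\u00E9;
    }

    public static void main(String[] args) {
        String word = greeting();
        System.out.println(word.length());         // 4
        System.out.println((int) word.charAt(3));  // 233
    }
}
```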
The Java char type and String class natively support Unicode values. Internally, the text is stored as multibyte characters using the UTF-16 encoding. However, the Java language and APIs make this transparent to you, and you will not generally have to think about it. Unicode is also very ASCII-friendly (ASCII being the most common character encoding for English). The first 256 characters are defined to be identical to the first 256 characters in the ISO 8859-1 (Latin-1) character set, so Unicode is effectively backward-compatible with the most common English character sets. Furthermore, one of the most common file encodings for Unicode, called UTF-8, preserves ASCII values in their single-byte form. A slightly modified form of this encoding is used in compiled Java class files, so storage remains compact for English text.
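These compatibility properties can be observed directly. The following sketch uses hypothetical names (Utf8CompatDemo, utf8Length) and only standard library calls: an ASCII letter stays one byte in UTF-8, ç keeps its Latin-1 number 231 as a Unicode value but takes two bytes in UTF-8, and a Chinese character takes three.

```java
import java.nio.charset.StandardCharsets;

class Utf8CompatDemo {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII survives unchanged: 'A' is the single byte 65.
        System.out.println("A".getBytes(StandardCharsets.UTF_8)[0]);  // 65
        System.out.println(utf8Length("A"));                          // 1

        // The Unicode value of c-cedilla equals its Latin-1 number.
        System.out.println((int) '\u00E7');                           // 231
        System.out.println(utf8Length("\u00E7"));                     // 2

        // A Chinese ideograph from the 16-bit range.
        System.out.println(utf8Length("\u4E2D"));                     // 3
    }
}
```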