This example shows a document type declaration, indicated by <!DOCTYPE>. Following DOCTYPE is the name of the root element, the containing element for the XML document, which here is greeting. The element greeting is also the only element within the document, and it contains character data. The next lesson discusses how to create a DTD from an existing set of tags.
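As a sketch of such a document, the following snippet uses Python's standard xml.etree module to parse a minimal document whose DOCTYPE names greeting as the root element; the document text itself is illustrative.

```python
import xml.etree.ElementTree as ET

# A minimal document: a DOCTYPE naming the root element, followed by
# the root element itself, which contains only character data.
doc = """<?xml version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>"""

root = ET.fromstring(doc)
print(root.tag)   # the root (and only) element: greeting
print(root.text)  # its character data: Hello, world!
```

The parser reads past the DOCTYPE and hands back the single greeting element with its text content.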
Encoding with Unicode
Encoding is the process of turning characters into their equivalent binary representation. Some encodings use only a single byte, or eight bits; others use more. The disadvantage of using only one byte is that you are limited in how many characters can be encoded without resorting to workarounds, such as a special sequence of bits indicating that the next two bytes refer to a single character. When an XML processor reads a document, it has to know which encoding was used; but this is a chicken-and-egg situation: if it does not know the encoding, how can it read what you have put in the declaration? The answer lies in the fact that the first few bytes of a file can contain a byte order mark, or BOM. This gives the parser enough information to read the encoding specified in the declaration; once it knows that, it can decode the rest of the document. If, for some reason, the encoding specified is not the one actually used, you will most likely get an error, or mistakes will be made interpreting the content.
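A rough sketch of this bootstrap in Python: encoding a declaration with the "utf-16" codec prepends a BOM, and a processor can recognize that mark before it has decoded anything else.

```python
import codecs

xml_text = '<?xml version="1.0" encoding="UTF-16"?>\n<greeting>Hello</greeting>'

# Python's "utf-16" codec prepends a byte order mark (BOM) to the output.
encoded = xml_text.encode("utf-16")

# The first bytes identify the encoding family before anything is decoded;
# which of the two marks appears depends on the machine's byte order.
assert encoded.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Knowing the byte order, the processor can decode the declaration
# and confirm the encoding named there.
decoded = encoded.decode("utf-16")
assert decoded == xml_text
```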
Unicode is a text encoding specification designed from scratch with internationalization in mind.
It tries to define every possible character by giving it a name and a code point, which is a number that can be used to represent it. It also assigns various categories to each character such as whether it is a letter, a numeral, or a punctuation mark.
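Python's standard unicodedata module exposes exactly this information, which makes for a quick illustration of names, code points, and categories:

```python
import unicodedata

for ch in ["A", "7", "!"]:
    # Each character has a code point, a name, and a general category
    # (Lu = uppercase letter, Nd = decimal digit, Po = punctuation).
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"category={unicodedata.category(ch)}")
```

Running this prints, for example, U+0041 LATIN CAPITAL LETTER A category=Lu for the first character.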
Two main encoding systems use Unicode: UTF-8 and UTF-16.
UTF stands for UCS Transformation Format, and UCS itself means Universal Character Set.
The number refers to the size, in bits, of the basic unit (the code unit) used to represent a character: 8 or 16 (one or two bytes, respectively).
The reason UTF-8 manages with only one byte whereas UTF-16 needs two is that UTF-8 uses a single byte for the most commonly used characters and two, three, or four bytes for the less common ones. UTF-16 uses two bytes for the majority of characters and four bytes (a surrogate pair) for the rest.
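These per-character byte counts are easy to verify in Python; the utf-16-be codec is used here because it writes no BOM, so the length reflects the character alone.

```python
# Expected bytes per character as (UTF-8, UTF-16) pairs.
samples = {
    "A": (1, 2),  # common Latin letter: one UTF-8 byte
    "é": (2, 2),  # accented Latin letter: two bytes in both
    "€": (3, 2),  # euro sign: three UTF-8 bytes, still two in UTF-16
}
for ch, (u8, u16) in samples.items():
    assert len(ch.encode("utf-8")) == u8
    assert len(ch.encode("utf-16-be")) == u16
```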
It is similar to your keyboard in that the lowercase letters and digits require only one key press, but by using the Shift key you gain access to the uppercase letters and other symbols.
The advantage of UTF-16 is that it is easier to decode because most characters occupy a fixed two bytes. The disadvantage is that file sizes are typically larger than their UTF-8 equivalents if you are only using the Latin alphabet plus the standard numerals and punctuation marks.
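The size difference is easy to see for text in the ASCII range, where every character takes one UTF-8 byte but two UTF-16 bytes (the sample string here is illustrative):

```python
text = "Plain Latin text: letters, digits 0-9, and punctuation!"

utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-be"))  # BOM-free encoding

# For ASCII-range text, UTF-16 output is exactly twice the size of UTF-8.
assert utf16_size == 2 * utf8_size
```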
All XML processors are required to understand UTF-8 and UTF-16; these two may be the only encodings a given processor can read, but it must support both.
UTF-8 is the default for documents without encoding information. Despite the advantages of Unicode, many documents use other encodings such as ISO-8859-1, Windows-1252, or EBCDIC (an encoding found on many mainframes).
You will also come across files written using ASCII, a basic set of characters that at one time was used for almost all files created. ASCII is a subset of Unicode, though, so it can be read by any application that understands Unicode.
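This compatibility can be demonstrated directly: an ASCII byte sequence is already valid UTF-8, because the code points 0 through 127 are encoded identically in both.

```python
ascii_bytes = "Hello, XML!".encode("ascii")

# The same bytes are produced by UTF-8, so a Unicode-aware parser
# reads an ASCII file without any conversion step.
assert ascii_bytes == "Hello, XML!".encode("utf-8")
assert ascii_bytes.decode("utf-8") == "Hello, XML!"
```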