Lesson 6	Adding information to XML documents
Objective	Comments, CDATA sections, and encoding to add clarity

Adding Comments to XML documents

Use comments, CDATA sections, and encoding declarations to add clarity and information to XML documents. To make your documents as understandable to humans as they are to machines, consider adding comments to your XML document.
Use the same syntax you would for an HTML comment:

<!-- comment text here -->

Here is an example of a comment in an XML document:

<?xml version="1.0"?>
<inventory-items>
<!--Begin definition of inventory items-->
<item>
 <item-name>Computer Monitor</item-name>
 <item-serial-number>981</item-serial-number>
 <units-on-hand>50</units-on-hand>
</item>
</inventory-items>

Rules for Comments
There are some rules that you need to follow when creating comments.
The following table lists the rules and shows examples of correct and incorrect comment usage in an XML document.

Rule	Correct	Incorrect
Comments cannot come before the XML declaration.	<?xml version="1.0"?> <!-- XML document to describe inventory --> <inventory-items> <item> <item-name>Computer Monitor</item-name> </item> </inventory-items>	<!-- XML document to describe inventory --> <?xml version="1.0"?> <inventory-items> <item> <item-name>Computer Monitor</item-name> </item> </inventory-items>
Comments cannot be placed inside markup, that is, inside a tag.	<item> <!-- comment text here --> <item-name>Computer Monitor</item-name> </item>	<item <!-- comment text here -->> <item-name>Computer Monitor</item-name> </item>
Comments cannot use "--" inside the comment. The double hyphen is reserved for the comment delimiters, so an XML parser reports an error when it finds "--" before the closing "-->".	<item> <!-- comment text here --> <item-name>Computer Monitor</item-name> </item>	<item> <!-- comment -- text here --> <item-name>Computer Monitor</item-name> </item>
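
The rules in the table can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample strings are illustrative assumptions, not part of the lesson's files.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Comment placed after the XML declaration: allowed.
comment_after_declaration = (
    '<?xml version="1.0"?>'
    '<!-- XML document to describe inventory -->'
    '<inventory-items><item><item-name>Computer Monitor</item-name></item></inventory-items>'
)

# Comment placed before the XML declaration: not allowed.
comment_before_declaration = (
    '<!-- XML document to describe inventory -->'
    '<?xml version="1.0"?>'
    '<inventory-items/>'
)

# Comment placed inside a tag: not allowed.
comment_inside_tag = (
    '<item <!-- comment text here -->><item-name>Computer Monitor</item-name></item>'
)

# "--" inside the comment text: not allowed.
double_hyphen = (
    '<item><!-- comment -- text here --><item-name>Computer Monitor</item-name></item>'
)

samples = {
    "comment after declaration": comment_after_declaration,
    "comment before declaration": comment_before_declaration,
    "comment inside a tag": comment_inside_tag,
    "double hyphen in comment": double_hyphen,
}

for label, sample in samples.items():
    print(label, "->", "well-formed" if is_well_formed(sample) else "rejected")
```

Only the first sample is accepted; the parser rejects the other three as not well-formed.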

CDATA Sections

XML documents often need to contain special characters, such as < and &. Because these characters are part of the markup used in XML documents, an XML processor treats them as the start of a tag or an entity reference. If you want these characters to be read as ordinary text, you may escape them by writing &lt; for < and &amp; for &. This works, but the text becomes awkward and hard to read if you use many such characters. As an alternative, you can use a CDATA section. A CDATA section tells the XML processor that its content is character data, not markup, and should not be parsed for tags or entity references. As a result, you may include almost any kind of text in a CDATA section, including the special characters < and &.
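
The escaping approach described above can be sketched with Python's standard xml.sax.saxutils helpers. The sample text and the <note> element are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw_text = "Price < 100 & in stock"

# escape() replaces &, < and > with their entity references.
escaped_text = escape(raw_text)
print(escaped_text)

# unescape() reverses the substitution.
restored_text = unescape(escaped_text)

# The escaped text can be placed safely inside an element (a hypothetical <note>).
note = ET.fromstring("<note>" + escaped_text + "</note>")
print(note.text)
```

The parser hands back the original characters, so escaping changes only how the text is written in the file, not what it means.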

CDATA syntax
A CDATA section begins with <![CDATA[ and ends with ]]>.
When the XML processor encounters the markup <![CDATA[, it searches for ]]> to find the end of the section. As a result, you cannot include the markup ]]> anywhere else in a CDATA section. Also, CDATA sections may not be nested.
Here is an example of a CDATA section in an XML document:

<?xml version="1.0"?>
<inventory-items>
 <item>
  <item-name>Computer Monitor</item-name>
  <item-serial-number>981</item-serial-number>
 <![CDATA[ 
  all data included in this section is preserved
  as text. You may include the special characters
  < or > or elements such as <TEST>value</TEST>.
 ]]>
  <units-on-hand>50</units-on-hand>
 </item>
</inventory-items>
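
As a rough check of this behavior, the sketch below parses the same text written once with entity references and once in a CDATA section, using Python's standard xml.etree.ElementTree module. The <note> element and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The same text, written with entity references and with a CDATA section.
escaped_doc = "<note>if a &lt; b &amp;&amp; b &lt; c</note>"
cdata_doc = "<note><![CDATA[if a < b && b < c]]></note>"

escaped_text = ET.fromstring(escaped_doc).text
cdata_text = ET.fromstring(cdata_doc).text
print(cdata_text)

# The first ]]> ends the section, so a second ]]> is left stranded in the
# element content and the document is no longer well-formed.
broken_doc = "<note><![CDATA[first ]]> second ]]></note>"
try:
    ET.fromstring(broken_doc)
    broken_is_well_formed = True
except ET.ParseError:
    broken_is_well_formed = False
print(broken_is_well_formed)
```

Both forms give the application the same text; the CDATA form is simply easier to read when many special characters are involved.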

To embed a script (ECMAScript or any other scripting language) within an XML document, enclose the script in a CDATA section so that characters such as < and & in the code are not parsed as markup:
<SCRIPT><![CDATA[Script statements here]]></SCRIPT>

Encoding and Web Balkanization

Because the Internet is a global technology, you should state in the XML declaration which encoding scheme (in other words, the way the document's characters are stored as bytes) you used to create an XML file.
Balkanization of the Web?
In a now-famous article for Feed Magazine, Mark Pesce sounded a warning bell about XML.
As you have seen, there are many ways to decide how to tag content.
If every user does his or her own thing, we will lose the potential for interoperability between documents.
Pesce referred to this situation as the "balkanization" of the Web, and said that we would be in a sort of tag gumbo, with no way to discern valuable content. Others have argued that we are already in tag gumbo, and that without a new standard for defining content, the Web will continue to present billions of words but little intelligent data. The freedom to create tags is also the freedom to create chaos.
Fortunately, standardized industry vocabularies are being created. One related effort is the Resource Description Framework (RDF), a W3C standard for describing resources in a common way. Rather than continuing to reinvent the wheel, you are encouraged to seek out and pursue emerging industry languages created from XML.
The growing balkanization of the web is the worst enemy of the internet. As worldwide literacy grows exponentially on the web, such expansion results in increasing pressure from corporate interests and regulatory agencies. The net has become a symbol of borderless communication between individuals and of unlimited access to knowledge. The internet is about to become a heavily controlled environment, serving two classes of citizens: a dominant class that sets the rules (technological, legal and commercial) and the subclass of citizens and consumers.

The following XML declaration includes encoding information:

<?xml version="1.0" encoding="UTF-8"?>

You can include encoding information within the XML declaration by adding the encoding attribute along with the appropriate value. XML is case-sensitive, so the attribute name must be written in lowercase.
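
Many XML libraries can write this declaration for you. The sketch below uses Python's standard xml.etree.ElementTree module (the xml_declaration argument of tostring requires Python 3.8 or later). The element name and text are taken from the earlier inventory example; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

item_name = ET.Element("item-name")
item_name.text = "Computer Monitor"

# Serialize to bytes and ask for an XML declaration that names the encoding.
document_bytes = ET.tostring(item_name, encoding="UTF-8", xml_declaration=True)
print(document_bytes.decode("utf-8"))

# Parsing the bytes again recovers the same text.
round_trip_text = ET.fromstring(document_bytes).text
```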

When creating an XML document, it is useful for people in other countries or using other encoding schemes to know how our data is encoded. The standard English character set, ASCII, is a subset of UTF-8. UTF stands for UCS (Universal Character Set) Transformation Format. ASCII is a 7-bit character set covering the characters 0 through 127. UTF-8 stores those characters as single bytes and uses sequences of two to four bytes for all other Unicode characters.
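
The relationship between the English (ASCII) character set and UTF-8 can be seen by looking at the bytes directly. This is a small Python sketch; the sample characters are chosen only for illustration.

```python
# One ASCII character and two non-ASCII characters.
letter_a = "A"
e_acute = "\u00e9"    # é, LATIN SMALL LETTER E WITH ACUTE
euro_sign = "\u20ac"  # €, EURO SIGN

for character in (letter_a, e_acute, euro_sign):
    encoded = character.encode("utf-8")
    print(repr(character), "takes", len(encoded), "byte(s):", encoded)

# Text limited to characters 0 through 127 is byte-for-byte identical
# in ASCII and in UTF-8.
ascii_text = "Computer Monitor"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)
```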

Adding and Encoding information for XML Documents

The way that characters are represented by the underlying data stream is referred to as the encoding of a file. The encoding is sometimes signalled by the first few bytes of the file, known as a byte order mark. An application checks these bytes upon opening the file and then knows how to decode and display the data. If these bytes are absent and the XML declaration names no encoding, an XML processor assumes UTF-8. XML can also state the encoding explicitly in the XML declaration. Unicode is a superset of every other significant character set in use today, and UTF-8 is its most widely used byte encoding. Generating XML documents in UTF-8 wherever possible results in a more robust, more interoperable universe of documents.
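
The sketch below illustrates how a parser uses the declared encoding, the leading bytes, and the UTF-8 default. It relies on Python's standard xml.etree.ElementTree and codecs modules; the sample text and variable names are assumptions for illustration.

```python
import codecs
import xml.etree.ElementTree as ET

text = "Caf\u00e9"  # "Café", which contains one non-ASCII character

# Bytes stored as ISO-8859-1, with a declaration that says so.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><item-name>%s</item-name>' % text
).encode("iso-8859-1")

# Bytes stored as UTF-8 with no encoding named: the default applies.
utf8_doc = ('<?xml version="1.0"?><item-name>%s</item-name>' % text).encode("utf-8")

# The same UTF-8 bytes preceded by a byte order mark.
bom_doc = codecs.BOM_UTF8 + utf8_doc

# Bytes stored as ISO-8859-1 but not declared: the parser assumes UTF-8 and fails.
undeclared_doc = (
    '<?xml version="1.0"?><item-name>%s</item-name>' % text
).encode("iso-8859-1")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
bom_text = ET.fromstring(bom_doc).text

try:
    ET.fromstring(undeclared_doc)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False

print(latin1_text, utf8_text, bom_text, undeclared_parsed)
```

The last case shows why the declaration matters: the bytes are perfectly good ISO-8859-1, but without the encoding attribute the parser has no way to know that.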


The next lesson concludes this module. The following exercise checks your understanding of creating a well-formed document.
Adding XML Documents - Exercise