A DTD is a simple text file that contains the instructions for the elements contained within the corresponding XML document.
DTD declarations use a notation that closely resembles Extended Backus-Naur Form (EBNF).
Instructions in a DTD specify which elements are used and whether those elements contain parsed character data (#PCDATA),
other elements, or both. They also specify which attributes can be used with those elements and how the elements relate within the document's tree structure.
Document Root Element
The basic unit in a DTD is an element. XML documents are required to have a root element.
Every other element must appear between the beginning and ending tags for the root element.
Consider the most basic XML document, in which only one element is present: greeting.
A document that places two or more elements at the top level, by contrast, would not be well-formed. To use such a structure, you would need to create an additional, all-encompassing element to contain the others.
When an element named GREETINGS contains the others, GREETINGS is the root element, and greeting is simply another element that appears in the document.
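The three cases described above can be sketched with Python's standard xml.etree.ElementTree module. The original listings are not available, so the element content and the helper name is_well_formed are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Assumed content: one greeting element acting as the root.
single_root = "<greeting>Hello</greeting>"

# Two top-level elements: there is no single root, so parsing fails.
two_roots = "<greeting>Hello</greeting><greeting>Goodbye</greeting>"

# The same elements wrapped in an all-encompassing GREETINGS root.
wrapped = (
    "<GREETINGS>"
    "<greeting>Hello</greeting><greeting>Goodbye</greeting>"
    "</GREETINGS>"
)

for label, text in [("single", single_root), ("two roots", two_roots), ("wrapped", wrapped)]:
    print(label, is_well_formed(text))
```

Only the second string is rejected, because the parser finds content after the first element has closed.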
Let us return for a moment to the simplest of the previous XML examples. A DTD defines all tags used. Because only one tag is used here, the DTD for this well-formed XML document will consist of one piece of information.
The DTD for the previous example consists of a single element type declaration, for the element named greeting.
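As a sketch of what such a declaration could look like, the Python snippet below embeds a one-line DTD in the document's internal subset. The content model (#PCDATA) and the greeting text are assumptions, since the original listing is not shown. Note that xml.etree.ElementTree reads the declaration but does not validate against it.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction: a single element type declaration stating that
# greeting contains parsed character data only.
DOCUMENT = """<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>"""

# The parser accepts the DOCTYPE and returns the root element.
root = ET.fromstring(DOCUMENT)
print(root.tag, root.text)
```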
When the vocabulary and structure of potential XML documents for a given purpose are considered together, you can talk about the type of the documents: the elements and attributes in these documents, and the way they interrelate, are designed to cover a particular subject of interest. Generally speaking, this amounts to no more than using a specific XML language, for example Mathematical Markup Language (MathML) or X3D (for 3D graphics). For validation purposes, however, the nature of an XML language can be specified much more closely, and Document Type Definitions (DTDs) are a way to describe fairly precisely the shape of the language. This idea has parallels in human language. For example, if you want to read or write in English or German, you must have some understanding of the grammar of the language in question.
In a similar fashion, it is useful to make sure the structure and vocabulary of XML documents are valid against the grammatical rules of the appropriate XML language. Fortunately, XML languages are considerably simpler than human languages.
As you would expect, the grammars of XML languages are expressed with computer processing in mind.
XML Parsing
By convention, XML is serialized as a text document. A parser is a piece of software that reads the document and handles the intricacies of the XML format for the programmer. By using an existing parser, the programmer only has to be concerned with the data represented by the XML document, because the parser automatically handles difficult issues such as whitespace and entities. Some parsers can also validate the document against a schema.
Before it is parsed, an XML document is simply a series of characters making up a text file.
The XML parser, like most other parsers, converts the series of characters into a programmatic data structure, i.e. a tree.
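A minimal sketch of this conversion, using Python's standard xml.etree.ElementTree module, is shown below. The sample text and the outline helper are illustrative assumptions; the point is that the parser turns a flat string into nested element objects and resolves the &amp;amp; entity along the way.

```python
import xml.etree.ElementTree as ET

# Illustrative input: a flat series of characters.
TEXT = (
    "<GREETINGS>"
    "<greeting>Hello</greeting>"
    "<greeting>Tom &amp; Jerry</greeting>"
    "</GREETINGS>"
)

def outline(element, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(TEXT)
tree_lines = outline(root)
print("\n".join(tree_lines))

# The parser has already replaced the entity reference with the character.
second_text = root[1].text
print(second_text)
```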
Querying data held in XML files directly is challenging, so an XML parser is used to extract the data from the structured files.
In the context of data analytics, the extracted data is typically loaded into a relational database schema.
For this purpose, a good XML parser should automate the whole process and have the following features:
Analyzes the XML schema (XSD) or a significant XML sample
Creates an optimized relational target schema
Processes the XML files automatically
Handles arbitrarily complex XML files and supports the whole XSD specification
Provides data lineage in the form of a source-to-target map
Scales to very large volumes
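The basic idea of extracting XML data into a relational table can be sketched by hand with Python's standard library. This is a toy illustration, not an automated tool of the kind described above: the orders document, the table layout, and the load_orders function are all hypothetical, and the target schema is written manually rather than derived from an XSD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical source document.
XML_TEXT = """<orders>
  <order id="1"><customer>Ada</customer><total>19.50</total></order>
  <order id="2"><customer>Grace</customer><total>7.25</total></order>
</orders>"""

def load_orders(xml_text, connection):
    """Create a hand-written target table and insert one row per order element."""
    connection.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    rows = [
        (int(order.get("id")), order.findtext("customer"), float(order.findtext("total")))
        for order in ET.fromstring(xml_text).iter("order")
    ]
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return len(rows)

connection = sqlite3.connect(":memory:")
loaded = load_orders(XML_TEXT, connection)

# Once loaded, the data can be queried with ordinary SQL.
result = connection.execute(
    "SELECT customer, total FROM orders ORDER BY id"
).fetchall()
print(loaded, result)
```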
Human Parsing versus Machine Parsing
The breaking down of a human-language sentence into its grammatical components is known as parsing.
The same idea applies to XML, where the parsing is performed by software written for that purpose. Parsers are the software subsystems that read the information contained in XML documents into our programs.
The XML specification separates parsers into two categories:
validating and
nonvalidating.
Validating parsers must implement validity checking using DTDs, whereas nonvalidating parsers are only required to check that a document is well-formed.
With a validating parser, much of the content-checking code you might otherwise need in your application becomes unnecessary,
because you can depend on the parser to verify the content of the XML document against the DTD.
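Python's standard library does not include a validating parser, so the sketch below only imitates one small part of validity checking: it uses the nonvalidating xml.parsers.expat module to collect the element names declared in the internal DTD subset and reports any element used without a declaration. The function name undeclared_elements and both sample documents are assumptions; a real validating parser also checks content models, attributes, and more.

```python
import xml.parsers.expat as expat

def undeclared_elements(document):
    """Return the set of element names used in the document but not declared in its DTD."""
    declared, used = set(), set()
    parser = expat.ParserCreate()
    # Called once for each <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once for each start tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(document, True)
    return used - declared

DTD = "<!DOCTYPE greeting [<!ELEMENT greeting (#PCDATA)>]>"

# Uses only the declared element.
conforming = DTD + "<greeting>Hello</greeting>"

# Well-formed, but uses an element b that the DTD never declares.
nonconforming = DTD + "<greeting>Hello <b>there</b></greeting>"

print(undeclared_elements(conforming))
print(undeclared_elements(nonconforming))
```

Both documents are well-formed, so a nonvalidating parser accepts them; only a check against the DTD reveals the difference.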
The next lesson describes the process for creating a DTD.