Describe the Concepts of Well-formedness and Validity
Well-formedness in XML refers to adherence to the basic syntax rules that make an XML document structurally correct. For an XML document to be well-formed, it must follow several key principles: every opening tag must have a corresponding closing tag (or the element must be written as an empty-element tag such as <br/>), elements must be properly nested, meaning they are closed in the reverse order of their opening, tag names must match in case, and attribute values must be enclosed in quotes. Additionally, an XML document must have a single root element that contains all other elements. Well-formedness matters because it guarantees that an XML parser can read the document without hitting a fatal error that stops processing. This foundation lets different systems and applications interpret the same data consistently.
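The rules above can be tried out with a short Python sketch. It assumes that the non-validating parser in the standard library module xml.etree.ElementTree is an acceptable stand-in for "an XML parser"; the helper name is_well_formed and the sample snippets are illustrative and do not come from the lesson itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML without a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample snippets, one per rule discussed above.
samples = {
    "single root, tags closed": "<note><to>Ann</to></note>",
    "missing closing tag": "<note><to>Ann</note>",
    "unquoted attribute value": "<note id=1></note>",
    "two root elements": "<note></note><note></note>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first snippet should be reported as well-formed; each of the others breaks exactly one of the rules listed above.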
Validity, on the other hand, goes beyond well-formedness by ensuring that an XML document not only adheres to the basic syntax rules but also conforms to a specific structure defined by a Document Type Definition (DTD) or an XML Schema. A valid XML document must be well-formed, but it also must comply with the constraints and rules laid out in the schema, such as the sequence of elements, the types of data elements can contain, the number of occurrences of elements, and which elements are required or optional. Validity checks whether the document meets all the specifications set forth by the schema, ensuring that the document's content is meaningful within a particular context or application. This step is crucial for applications that depend on specific data structures to function correctly, like data exchange standards or database systems.
In practical terms, while well-formedness is mandatory for any XML document to be processed, validity is optional but often necessary for specific uses. An XML document can be well-formed but not valid if it does not adhere to the rules of its associated schema. However, all valid XML documents are necessarily well-formed. The process of validation involves comparing the structure and content of an XML document against the rules defined in a DTD or Schema. This validation can be performed by various tools or built into applications to ensure data integrity and consistency. For developers and users alike, understanding the difference between well-formedness and validity is crucial for effective XML usage, as it impacts how data is handled, whether it's for data storage, transmission, or processing in software applications.
Creating XML Documents and Programming
The XML specification defines a set of rules that XML documents must follow in order to be well-formed.
If an XML document is not well-formed according to the XML specification, a browser or any other processor should not attempt to correct the errors, as browsers did for HTML documents. The inevitable result of browsers correcting HTML documents was the creation of many versions of HTML.
Well-formedness constraint
The following text demonstrates a complete, well-formed XML document:
<greeting>Hello, World!</greeting>
This XML document is not large, but it is error free.
For reasons of forward compatibility, XML documents should, but are not required to, begin with an XML declaration that specifies the version of XML to which the document conforms. The following example begins with an XML declaration providing version information:
<?xml version="1.0"?>
<greeting>Hello, World!</greeting>
XML Meta Language:
XML is a meta language that allows you to create and format your own document markups. With HTML, existing markup is static: <HEAD> and <BODY>, for example, are tightly integrated into the HTML standard and cannot be changed or extended. XML, on the other hand, allows you to create your own markup tags and configure each to your liking.
Examples include <Heading>, <Sidebar>, and <Quote>. You can define the structure of each of these elements in your own document type definition (DTD) and its presentation in a stylesheet, and then apply both to one or more XML documents. XML schemas provide another way to define elements.
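As a minimal sketch of author-defined markup, the following Python code builds a small document from the element names mentioned above. The root element name Article and the text content are assumptions made for the illustration; xml.etree.ElementTree is part of the Python standard library.

```python
import xml.etree.ElementTree as ET

# Element names are chosen by the document author; XML predefines none of them.
# "Article" is a hypothetical root element used only for this illustration.
article = ET.Element("Article")
ET.SubElement(article, "Heading").text = "Well-formedness"
ET.SubElement(article, "Sidebar").text = "See also: validity"
ET.SubElement(article, "Quote").text = "XML requires attention and discipline."

serialized = ET.tostring(article, encoding="unicode")
print(serialized)

# The serialized text parses again, which shows that it is well-formed.
reparsed = ET.fromstring(serialized)
child_names = [child.tag for child in reparsed]
print(child_names)
```

Nothing in this sketch says what a Heading or a Quote means or how it should look; that is the job of a DTD, a schema, or a stylesheet.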
XML Stylesheet
Browsers use an XML stylesheet or transformation to display XML files. An XML stylesheet is a text file, itself written in XML, that can transform one format into another. Stylesheets are most commonly used to convert one XML format to another or to convert XML to HTML, but they can also be used to process plain text. When a browser displays an XML file, the original XML is transformed into HTML, which makes it possible to style elements in different colors and to expand and collapse sections using script.
XML Transformations
Transformations use XSLT to convert XML documents to other formats, such as HTML.
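The idea of a transformation can be sketched in Python. XSLT is a separate language, and the Python standard library does not include an XSLT processor, so the code below is only a hand-written analogue of what a very simple stylesheet might do. The catalog document and the function name catalog_to_html are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input document.
SOURCE = "<catalog><item>Pen</item><item>Ink &amp; paper</item></catalog>"


def catalog_to_html(xml_text: str) -> str:
    """Turn each <item> into an HTML list item (a stand-in for an XSLT template)."""
    root = ET.fromstring(xml_text)
    # Re-escape the text so characters such as & stay legal in the HTML output.
    items = "".join(f"<li>{escape(item.text or '')}</li>" for item in root.iter("item"))
    return f"<ul>{items}</ul>"


html_output = catalog_to_html(SOURCE)
print(html_output)
```

A real XSLT stylesheet would express the same mapping declaratively, as templates matched against elements of the source document.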
Before we move further, we need to standardize some terminology. An XML document consists of one or more elements. An element is marked with the following form:
<body>
This is text formatted according to the body tag.
</body>
This element consists of two tags: an opening tag, which places the name of the element between a less-than sign (<) and a greater-than sign (>), and a closing tag, which is identical except for the forward slash (/) that appears before the element name. As in HTML, the text between the opening and closing tags is considered part of the element and is processed according to the rules of that element.
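The parts of an element described above can be inspected with a few lines of Python. This is a sketch that assumes the standard library parser xml.etree.ElementTree; it is not tied to any particular XML tool used in this course.

```python
import xml.etree.ElementTree as ET

# The element from the text, written on one line.
element = ET.fromstring("<body>This is text formatted according to the body tag.</body>")

print(element.tag)   # the element name shared by the opening and closing tags
print(element.text)  # the content between the two tags
```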
Overview of Viewing XML Data
The basic idea of markup languages is the separation of the content of a document from its form (i.e., how it is presented to a user).
Therefore, extra processing is needed before marked-up information can be viewed on a Web front end. Two mechanisms are of particular interest here: formatting and transformation. As for transformation, XML may have to be converted into another format, such as an HTML document; transformation aspects will be covered later. The formatting of XML documents may be achieved with style sheets, which can be regarded as collections of rules that turn abstract XML information into formatted information to be passed on to an output device. This is done by a style sheet processor that reads the XML input along with the given style sheet and generates the output accordingly, provided the style sheet follows a notation the processor understands. This process may be performed on the server as well as on the client side of an Internet-based information system. The basic idea is illustrated in Figure 3.2 from the general perspective of SGML. Whether to process an XML document on the server or to pass it to the client depends on the application. If we only wish to view XML-based information, the processing may be performed on the server. If an application requires XML data for decentralized processing on the client side, the XML document has to be sent along with the style sheet. This may be the case in e-commerce applications, for example, when exchanging product information or ordering data.
XML, the Lazy Developer's Nightmare
HTML rendering programs have been very forgiving with developers. If you forget to close a tag, the closing tag is usually inferred. If you use the wrong tag, the browser disregards it and renders the document, no matter how ill-constructed it is. For that reason, HTML is great for the lazy developer. Half an effort can still produce attractive pages. XML, on the other hand, is far stricter. Any single error will prevent your page from rendering. XML is not for those who dislike detail or want to construct pages haphazardly. XML requires attention and discipline.
The syntax rules of XML are simple, logical, and easy to learn. Question: Are XML tags case sensitive? Answer: Yes, XML tags are case sensitive.
Opening and closing tags must be written with the same case: <Message> and <message> are different tags. In XML, all elements must also be properly nested within each other.
"Properly nested" simply means that if an <i> element is opened inside a <b> element, it must also be closed inside the <b> element.
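Both rules, case sensitivity and proper nesting, can be checked with a short Python sketch that also shows how a parser reports the first error it meets. The helper name first_error and the sample snippets are assumptions made for the illustration; ParseError and its position attribute belong to the standard library module xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def first_error(text: str):
    """Return (message, line) for the first syntax error, or None if `text` parses."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        line, _column = err.position
        return (str(err), line)


# Hypothetical snippets: each rule is shown once followed and once broken.
checks = {
    "matching case": first_error("<Message>Hi</Message>"),
    "mismatched case": first_error("<Message>Hi</message>"),
    "properly nested": first_error("<b><i>text</i></b>"),
    "improperly nested": first_error("<b><i>text</b></i>"),
}

for label, error in checks.items():
    print(label, "->", "ok" if error is None else error[0])
```

The parser stops at the first violation and does not try to guess what the author meant, which is the strictness described earlier in this lesson.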
What Makes an XML Document?
Consider what exactly constitutes an XML document, according to the W3C's Recommendation for XML 1.0:
A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.
This definition indicates that the minimum requirement for a document to be considered an XML document is that it be well-formed.
Standard ML (SML) is not a markup language; it is a general-purpose, modular, functional programming language with compile-time type checking and type inference. It is popular among compiler writers and programming language researchers, as well as in the development of theorem provers.
SML is a modern descendant of the ML programming language used in the LCF theorem-proving project. It is uncommon among widely used languages in that it has a formal specification, given as typing rules and operational semantics in The Definition of Standard ML (1990, revised and simplified as The Definition of Standard ML (Revised) in 1997).
Origins of Standard Generalized Markup Language
Some documents needed the ability to mark text as bold or italic, whereas others were more concerned with who the original author was, when the document was created, and who had subsequently modified it. To cope with this problem, a standard called Standard Generalized Markup Language, commonly shortened to SGML, was released. SGML is a step removed from defining an actual markup language, such as the HyperText Markup Language (HTML). Instead, it specifies how markup languages are to be defined. SGML allows you to create your own markup language and then define it using a standard syntax so that any SGML-aware application can consume documents written in that language and handle them accordingly. The most ubiquitous example of this is HTML. HTML uses angle brackets (< and >) to separate metadata from basic text and also defines a list of what can go into these brackets, such as em for emphasizing text, tr for a table row, and td for a table data cell.
Birth of XML
SGML, although well thought-out and capable of defining many different types of markup, suffered from one major failing: it was very complicated. All the flexibility came at a cost, and there were still relatively few applications that could read the SGML definition of a markup language and use it to correctly process documents. The concept was correct, but it needed to be simpler.
With this goal in mind, a small working group and a larger number of interested parties began working in the mid-1990s on a subset of SGML known as Extensible Markup Language (XML). The first working draft was published in 1996, and two years later, on February 10, 1998, the W3C published a revised version as a recommendation. XML is therefore a subset of SGML, whereas HTML is an application of SGML. XML does not dictate the overall format of a file or what metadata can be added; it just specifies a few rules.
That means it retains a lot of the flexibility of SGML without most of the complexity.
Validity constraint
An XML document is valid if it is well-formed and also complies with a Document Type Definition (DTD) or an XML Schema. A DTD is a text-based file that specifies, among other things, which elements may be used in an XML document. A validating XML parser checks the XML document against the specified DTD and reports errors if there are any inconsistencies.
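The difference between a well-formed document and a valid one can be illustrated with a toy Python sketch. The standard library parser xml.etree.ElementTree does not validate against a DTD, so the code below is not real DTD validation: it uses a hand-written content model, stored in a hypothetical dictionary named CONTENT_MODEL, that loosely mirrors what a few DTD element declarations would express. The note document is likewise an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model, loosely mirroring a DTD declaration such as
#   <!ELEMENT note (to, from, body)>
# Each key is an element name; the value is the exact child sequence it requires.
CONTENT_MODEL = {
    "note": ["to", "from", "body"],
    "to": [],
    "from": [],
    "body": [],
}


def check(text: str) -> str:
    """Classify `text` as 'not well-formed', 'well-formed but not valid', or 'valid'."""
    try:
        root = ET.fromstring(text)  # well-formedness is checked by the parser
    except ET.ParseError:
        return "not well-formed"
    for element in root.iter():
        expected = CONTENT_MODEL.get(element.tag)
        actual = [child.tag for child in element]
        if expected is None or actual != expected:
            return "well-formed but not valid"
    return "valid"


valid_doc = "<note><to>Ann</to><from>Bob</from><body>Hi</body></note>"
wrong_order = "<note><from>Bob</from><to>Ann</to><body>Hi</body></note>"
broken = "<note><to>Ann</note>"

for doc in (valid_doc, wrong_order, broken):
    print(check(doc))
```

The second document parses without error yet breaks the declared element order, which is the "well-formed but not valid" case. In practice this check would be delegated to a validating parser working from an actual DTD or schema.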
The next lesson defines the rules for constructing well-formed XML documents.