XML stands for eXtensible Markup Language. XML was designed to store and transport data. It plays an important role in many different IT systems and is often used for distributing data over the Internet. XML stores data in plain text format, which provides a software- and hardware-independent way of storing, transporting, and sharing data. All types of software developers need at least a basic understanding of XML. XML was designed to be self-descriptive and has been a W3C Recommendation since 1998.
A great place to start learning XML is w3schools.com. As they say there (referring to a sample document on their own page): “The XML language has no predefined tags. The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are ‘invented’ by the author of the XML document. HTML works with predefined tags like <p>, <h1>, <table>, etc. With XML, the author must define both the tags and the document structure.”
As Wikipedia says: “In computing, Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable…The design goals of XML emphasize simplicity, generality, and usability across the Internet.” Wikipedia goes on to describe XML as follows in the next paragraphs.
An XML document is a string of characters. Almost every legal Unicode character may appear in an XML document. The characters making up an XML document are divided into markup and content, which may be distinguished by the application of simple syntactic rules. A tag is a markup construct that begins with < and ends with >. Tags come in three flavors:
- start-tag, such as <section>
- end-tag, such as </section>
- empty-element tag, such as <line-break />
An element is a logical document component that either begins with a start-tag and ends with a matching end-tag or consists only of an empty-element tag. The characters between the start-tag and end-tag, if any, are the element’s content, and may contain markup, including other elements, which are called child elements.
An attribute is a markup construct consisting of a name–value pair that exists within a start-tag or empty-element tag. An example is <img src="tulips.jpg" alt="Tulips" />, where the names of the attributes are “src” and “alt”, and their values are “tulips.jpg” and “Tulips” respectively.
An XML attribute can only have a single value and each attribute can appear at most once on each element.
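The attribute rules above can be checked with a short sketch. It assumes Python and its standard-library xml.etree.ElementTree parser, neither of which is part of the original text; the second image name (roses.jpg) is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Parse the empty-element tag from the text and read its attributes.
img = ET.fromstring('<img src="tulips.jpg" alt="Tulips" />')
print(img.tag)     # img
print(img.attrib)  # {'src': 'tulips.jpg', 'alt': 'Tulips'}

# Repeating an attribute name on one element is not well-formed XML,
# so the parser rejects the input instead of picking one value.
try:
    ET.fromstring('<img src="tulips.jpg" src="roses.jpg" />')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Each attribute shows up exactly once in the attrib dictionary, which mirrors the rule that an attribute can appear at most once per element.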
Below is an example of an XML file from MSDN by Microsoft.
<?xml version='1.0'?>
<!-- This is a comment. They are ignored by the XML parser.
     They can cross more than one line. -->
<BookInfo>
  <Book>
    <ISBN>989-0-487-04641-2</ISBN>
    <!-- Comments can exist here also -->
    <Title>My World</Title>
    <Author>Nancy Davolio</Author>
    <Quantity>121</Quantity>
  </Book>
  <Book>
    <ISBN>981-0-776-05541-0</ISBN>
    <Title>Get Connected</Title>
    <Author>Janet Leverling</Author>
    <Quantity>435</Quantity>
  </Book>
  <Book>
    <ISBN>999-1-543-02345-2</ISBN>
    <Title>Honesty</Title>
    <Author>Robert Fuller</Author>
    <Quantity>315</Quantity>
  </Book>
</BookInfo>
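To see how a program reads such a file, here is a minimal sketch that parses the same book data. It assumes Python's standard-library xml.etree.ElementTree; the variable names are illustrative only, and the comments from the original file are omitted for brevity.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version='1.0'?>
<BookInfo>
  <Book>
    <ISBN>989-0-487-04641-2</ISBN>
    <Title>My World</Title>
    <Author>Nancy Davolio</Author>
    <Quantity>121</Quantity>
  </Book>
  <Book>
    <ISBN>981-0-776-05541-0</ISBN>
    <Title>Get Connected</Title>
    <Author>Janet Leverling</Author>
    <Quantity>435</Quantity>
  </Book>
  <Book>
    <ISBN>999-1-543-02345-2</ISBN>
    <Title>Honesty</Title>
    <Author>Robert Fuller</Author>
    <Quantity>315</Quantity>
  </Book>
</BookInfo>"""

# The root element is BookInfo; each Book is one of its child elements.
root = ET.fromstring(BOOK_XML)

# Turn every Book into a dictionary keyed by its child element names.
books = [
    {child.tag: child.text for child in book}
    for book in root.findall("Book")
]

for book in books:
    print(book["Title"], "by", book["Author"])

# Element text is always a string, so numbers must be converted explicitly.
total_quantity = sum(int(book["Quantity"]) for book in books)
print(total_quantity)  # 871
```

Note that XML itself carries no type information: the Quantity values arrive as text, and it is up to the application to interpret them as integers.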
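A well-formed document satisfies the following rules.
- if it has an XML declaration, the declaration must be the very first thing in the document (the declaration is optional in XML 1.0 but recommended)
- it must have exactly one root element
- start-tags must have matching end-tags
- element and attribute names are case sensitive
- all elements must be closed
- all elements must be properly nested
- all attribute values must be quoted with either double or single quotes
- entities must be used for the special characters < and & wherever they are not part of markup
- tag names may contain letters, digits, hyphens, underscores, periods, and colons (colons are reserved for namespaces), and must not start with a digit, hyphen, or period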
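Several of these rules can be tried out directly. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree; the helper is_well_formed and the sample documents are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "matching tags": "<note><to>Ann</to></note>",
    "case mismatch": "<note><To>Ann</to></note>",
    "bad nesting": "<a><b></a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<a id=1/>",
    "bare ampersand": "<a>Fish & chips</a>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

Only the first sample is accepted. This checks well-formedness only; it says nothing about whether a document is valid against a DTD or schema.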
You can also use MS Excel to open XML files. When you open the above XML file with three books in it as an XML table, Excel shows each book as a row.
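An XML document should begin with an XML declaration. In XML 1.0 the declaration is optional, but if it is present it must be the very first thing in the file. It can carry up to three pieces of information. The first is the version number, which is usually 1.0. The second is the character encoding; if it is omitted, the parser assumes UTF-8 (or UTF-16 when the file starts with a byte order mark). The third is standalone, which can be either yes or no. If yes, the document does not depend on external markup declarations, such as an external DTD, in order to be processed. If no, other documents may be required to parse this XML file. When standalone is omitted, no is assumed.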
<?xml version='1.0' encoding="UTF-8" standalone="no"?>
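How a parser treats the declaration can be observed with a small sketch, assuming Python's standard-library xml.etree.ElementTree. The sample titles and the ISO-8859-1 encoding are chosen purely for illustration. In practice, XML 1.0 parsers accept a document that has no declaration, honor a declared encoding, and reject a declaration that is not at the very start.

```python
import xml.etree.ElementTree as ET

# 1. No declaration at all: an XML 1.0 parser still accepts the document.
no_declaration = ET.fromstring("<Title>My World</Title>")
print(no_declaration.text)  # My World

# 2. A declared encoding tells the parser how to decode the raw bytes.
latin1_bytes = (
    "<?xml version='1.0' encoding='ISO-8859-1'?><Title>Caf\u00e9</Title>"
).encode("iso-8859-1")
decoded = ET.fromstring(latin1_bytes)
print(decoded.text == "Caf\u00e9")  # True

# 3. A declaration preceded by anything, even a newline, is an error.
try:
    ET.fromstring("\n<?xml version='1.0'?><Title>My World</Title>")
    misplaced_rejected = False
except ET.ParseError:
    misplaced_rejected = True
print(misplaced_rejected)  # True
```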
An empty element does not have any text between the tags. It may have attributes. If it has attributes, it is still considered to be empty. The following two elements are empty.
<book title="My XML Guide"></book>
<book title="My XML Guide" />
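That the two spellings are equivalent can be confirmed with a short sketch, again assuming Python's standard-library xml.etree.ElementTree (the variable names are illustrative).

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<book title="My XML Guide"></book>')
short_form = ET.fromstring('<book title="My XML Guide" />')

# Neither element has text or child elements, but both keep the attribute.
print(long_form.text, short_form.text)         # None None
print(len(long_form), len(short_form))         # 0 0
print(long_form.attrib == short_form.attrib)   # True

# When written back out, both serialize to the same empty-element tag.
long_out = ET.tostring(long_form, encoding="unicode")
short_out = ET.tostring(short_form, encoding="unicode")
print(long_out == short_out)  # True
```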
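Five characters have special meaning in XML markup: < > & ' ". The characters < and & may never appear literally in ordinary element content or attribute values, and a quote character may not appear literally inside an attribute value delimited by that same quote. XML predefines five entities to stand in for these characters: &lt; &gt; &amp; &apos; &quot;. Each entity reference starts with an ampersand and must end with a semicolon.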
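Rather than escaping by hand, a program can let a library do it. The sketch below assumes Python and its standard-library xml.sax.saxutils.escape function, which replaces &, < and > by default and accepts an optional mapping for the quote characters; the sample sentence is made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'Tom & Jerry say "1 < 2"'

# Safe for element content: &, < and > are replaced.
escaped = escape(raw)
print(escaped)  # Tom &amp; Jerry say "1 &lt; 2"

# Safe for attribute values too: also replace both quote characters.
attr_safe = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(attr_safe)  # Tom &amp; Jerry say &quot;1 &lt; 2&quot;

# The parser turns the entity references back into the original characters.
round_trip = ET.fromstring(f"<Title>{escaped}</Title>").text
print(round_trip == raw)  # True
```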
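The XML parser does not interpret markup inside a CDATA block; everything between <![CDATA[ and ]]> is passed through as literal character data. Unlike comments, the content is not discarded, but as with comments we can type almost whatever we want inside the block. CDATA blocks can appear anywhere character data is allowed, but we cannot nest them, because the first ]]> ends the block. They are usually used to show examples of elements and describe how they are used, perhaps to show how numbers or dates are formatted.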
<Book>
  <ISBN>999-1-543-02345-2</ISBN>
  <![CDATA[ <Quantity>315</Quantity> The quantity must be a positive integer ]]>
</Book>
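What a parser does with a CDATA block can be seen in the following sketch, which assumes Python's standard-library xml.etree.ElementTree and a closing </Book> tag so that the fragment is well-formed. The markup inside the block is not interpreted: no Quantity element is created, and the content is handed to the application as plain text.

```python
import xml.etree.ElementTree as ET

doc = """<Book>
  <ISBN>999-1-543-02345-2</ISBN>
  <![CDATA[ <Quantity>315</Quantity> The quantity must be a positive integer ]]>
</Book>"""

book = ET.fromstring(doc)

# Only ISBN is a real child element; the Quantity tags were just text.
child_tags = [child.tag for child in book]
print(child_tags)             # ['ISBN']
print(book.find("Quantity"))  # None

# ElementTree stores text that follows an element in that element's "tail".
cdata_text = book.find("ISBN").tail
print("<Quantity>315</Quantity>" in cdata_text)  # True
```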