XML (eXtensible Markup Language) is a versatile markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. Unlike HTML, which was designed specifically for web pages, XML is designed to store and transport data.
In this tutorial, we’ll explore the fundamentals of XML using a publishing database example, covering key concepts and demonstrating different ways to structure data in XML.
41.2 XML Basics
41.2.1 What is XML?
XML stands for eXtensible Markup Language. It’s a markup language designed to store and transport data that is both human-readable and machine-readable. Key characteristics of XML include:
It’s platform-independent
It supports Unicode, allowing almost any information in any language to be communicated
It’s self-descriptive through customizable tags
It separates data from presentation
It’s hierarchical, allowing complex data structures to be represented
41.2.2 XML Syntax Rules
XML documents must follow certain rules:
XML documents must have a root element
XML elements must have a closing tag or be self-closing
XML tags are case sensitive
XML elements must be properly nested
XML attribute values must be quoted
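These rules can be checked mechanically. The following is a minimal sketch in R, assuming the xml2 package is installed. The helper name is_well_formed and all of the test strings are invented for illustration. It relies on read_xml() raising an error when a string is not well-formed XML.

```r
library(xml2)

# Hypothetical helper: TRUE if the string parses as XML, FALSE if the parser errors
is_well_formed <- function(xml_string) {
  tryCatch({
    read_xml(xml_string)
    TRUE
  }, error = function(e) FALSE)
}

# One invented string per rule; only the first follows all of the rules
examples <- c(
  valid           = '<book id="1"><title>1977!</title></book>',
  two_roots       = '<book></book><book></book>',
  missing_end_tag = '<book><title>1977!</book>',
  case_mismatch   = '<Book></book>',
  bad_nesting     = '<book><title></book></title>',
  unquoted_attr   = '<book id=1></book>'
)

results <- vapply(examples, is_well_formed, logical(1))
results
```

Under these assumptions, only the entry named valid is expected to come back TRUE; each of the others breaks exactly one of the rules listed above.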
Let’s examine a sample XML document that will serve as our reference throughout this tutorial:
An XML document typically begins with an XML declaration that specifies the XML version and character encoding:
<?xml version="1.0" encoding="UTF-8"?>
This tells processing applications that this is an XML document that conforms to XML 1.0 specifications and uses UTF-8 character encoding.
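As a small illustration, the sketch below (assuming the xml2 package; the one-line document is made up) parses the same content with and without the declaration. The declaration is optional in XML 1.0, so both versions should parse and report the same root element.

```r
library(xml2)

# Invented document, once with and once without the XML declaration
with_decl    <- '<?xml version="1.0" encoding="UTF-8"?><books><book title="1977!"/></books>'
without_decl <- '<books><book title="1977!"/></books>'

doc_with    <- read_xml(with_decl)
doc_without <- read_xml(without_decl)

# The declaration is not an element, so the root element is the same in both
root_with    <- xml_name(doc_with)
root_without <- xml_name(doc_without)
c(root_with, root_without)
```

Omitting the declaration is legal, but including it states the encoding explicitly instead of leaving the parser to detect it.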
41.3.2 Elements
Elements are the building blocks of XML. Elements consist of “tags” (e.g. <book>) and content. For example, <book>War &amp; Peace</book> is all considered one “element”. (A literal & is not allowed in XML content, so it is written as &amp;.)
Tags come in a few different forms. The word between the < and > is known as the “name” of the tag. For example:
a start tag: <book>
an end tag (notice the / before the name): </book>
(there must NOT be a space between the “/” and the name of the tag)
an empty tag (notice the “ /” after the name): <book />
(The space between the name of the tag and the “/” is optional: <book/> and <book /> mean the same thing. Including the space is a convention carried over from XHTML, where it helped older HTML browsers cope with the tag.)
We will explain more below.
Rules for naming tags
Names of tags may ONLY contain
Letters
Numbers
Underscore (_)
Hyphen (-)
Period (.)
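A name must start with a letter or an underscore. It may not start with a number, a hyphen or a period.
Names MAY NOT contain spaces or other “special characters” such as !, @, #, $, %, ^, &, *, (, ), <, >, [, ], /, |, ;, ", ', `, ~, = or commas. (The colon is technically allowed but is reserved for XML namespaces, so avoid it in ordinary names.)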
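The naming rules can be approximated with a regular expression. This is a sketch only: the helper is_valid_tag_name and the candidate names are invented, the pattern covers ASCII letters only (the XML specification also permits many non-ASCII letters), and it applies the XML requirement that a name start with a letter or underscore.

```r
# Hypothetical helper: simplified, ASCII-only check of the naming rules
is_valid_tag_name <- function(name) {
  # first character: letter or underscore; rest: letters, digits, _ . -
  grepl("^[A-Za-z_][A-Za-z0-9_.-]*$", name)
}

# Invented candidate names
candidates <- c("book", "pub_date", "first-name", "v1.2",
                "2ndEdition", "first name", "price$", "-draft")

name_check <- setNames(is_valid_tag_name(candidates), candidates)
name_check
```

The first four candidates pass. The last four fail because they start with a digit, contain a space, contain a special character, or start with a hyphen.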
Type 1 - Elements with start and end tags, e.g. <city>New York</city>
Most elements have both a start tag and an end tag, with content between them. The start tag and end tag must have the same “name” (i.e. the word between the < and >). Remember that the name of a tag may not contain spaces.
For example:
<phone>718-496-7223</phone>
Note that all tags are surrounded by < and > (i.e. less-than and greater-than signs). In our example, most elements follow this pattern, including <publisher>, <phone>, and <author>. Elements of this kind consist of a start tag, content, and an end tag.
The “content” may be any of the following:
plain text, e.g.
<street>75 West 205 St</street>
other elements, e.g.
<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>
The elements inside are known as nested elements, inner elements or child elements.
The element outside is known as the outer element or parent element
In this example, the address element contains four child elements: street, city, state, and zip.
A combination of text and other elements. For example, the author element in our sample document contains plain text (Sarah Buchman) as well as other elements. Portions of plain text may appear anywhere in the outer element.
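Mixed content of this kind can be inspected from R. The sketch below assumes the xml2 package and uses a hypothetical author element: the exact children of the real author element are not shown in this section, so the phone and address children here are placeholders assembled from fragments quoted above.

```r
library(xml2)

# Hypothetical author element: plain text mixed with child elements
author <- read_xml(paste0(
  "<author>Sarah Buchman",
  "<phone>718-496-7223</phone>",
  "<address><city>Bronx</city><state>NY</state></address>",
  "</author>"
))

# Child elements only (the text portion is not an element)
child_names <- xml_name(xml_children(author))

# All content nodes, including the text node
n_content_nodes <- length(xml_contents(author))

# Only the author's own text, ignoring text inside the child elements
own_text <- xml_text(xml_find_first(author, "./text()"))

child_names
n_content_nodes
own_text
```

xml_children() sees two child elements, while xml_contents() also counts the text node, which is why the two disagree for mixed content.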
There must be one “root” element for the document that contains everything else. In the example above the <books> ... </books> element is the root element. It contains everything else in the XML document.
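The root element and the parent/child relationships can be explored from R. This is a sketch assuming the xml2 package; the two-book document is an invented stand-in that borrows only the <books> root and two titles mentioned in this tutorial.

```r
library(xml2)

# Invented stand-in document with <books> as the root element
doc <- read_xml('<books><book title="1977!"/><book title="200 Years of German Humor"/></books>')

# The root element and the number of elements directly inside it
root_name  <- xml_name(xml_root(doc))
n_children <- xml_length(doc)

# Starting from any nested element, the parent and the root can be recovered
first_book     <- xml_find_first(doc, "//book")
parent_name    <- xml_name(xml_parent(first_book))
root_from_book <- xml_name(xml_root(first_book))

c(root_name, parent_name, root_from_book)
n_children
```

Every element in the document leads back to the same single root, which is what the one-root rule guarantees.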
41.3.3 Attributes
Attributes provide additional information about elements and are always specified in the start tag (or in an empty tag). Each attribute is written in the form name="value". Quotes must surround the value of the attribute. Both 'single quotes' and "double quotes" are valid, as long as the opening and closing quotes match.
In this example, the book element has several attributes: id, title, type, pages, sales, and pubdate.
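Attributes can be read from R with xml_attr() and xml_attrs(). The sketch below assumes the xml2 package. Only the title comes from this tutorial; the id, type and pages values are invented placeholders, and the element deliberately mixes single and double quotes to show that both parse.

```r
library(xml2)

# Attribute values other than the title are invented placeholders
book <- read_xml('<book id=\'b1\' title="1977!" type="history" pages="107" />')

# One attribute by name
book_title <- xml_attr(book, "title")

# All attributes as a named character vector
all_attrs <- xml_attrs(book)

# Attribute values are always text; convert explicitly when a number is needed
pages <- as.integer(xml_attr(book, "pages"))

# A missing attribute comes back as NA rather than raising an error
missing_attr <- xml_attr(book, "isbn")

book_title
names(all_attrs)
pages
missing_attr
```

Because every attribute value arrives as a character string, numeric attributes such as pages need an explicit conversion before arithmetic.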
41.3.4 Empty Elements
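Empty elements are elements that don’t contain any content. An empty element can be written either as a single self-closing tag, e.g. <pulitzerPrizeWinner />, or as a start tag immediately followed by its end tag, e.g. <pulitzerPrizeWinner></pulitzerPrizeWinner>. The two forms are equivalent. Empty elements come in two kinds, with and without attributes: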
Empty Elements with Attributes
Some empty elements contain attributes but no content:
<format type="audiobook" price="2.99" />
In this example, the format element has attributes but no content. Note the / before the closing bracket, which indicates that this is a self-closing tag. There is NO end tag for this element.
Empty Elements without Attributes
Some empty elements have neither content nor attributes:
<pulitzerPrizeWinner />
These self-closing tags indicate the presence of an element without any additional information.
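Empty elements can be examined from R as well. The sketch below assumes the xml2 package and wraps the two empty elements shown above in a hypothetical, otherwise bare book element. It also checks that the self-closing form and the start-plus-end form parse to the same thing.

```r
library(xml2)

# Hypothetical wrapper around the two empty elements from this section
book <- read_xml('<book><format type="audiobook" price="2.99" /><pulitzerPrizeWinner /></book>')

# An empty element with attributes: no text, no children, but readable attributes
fmt        <- xml_find_first(book, "./format")
fmt_price  <- xml_attr(fmt, "price")
fmt_text   <- xml_text(fmt)
fmt_length <- xml_length(fmt)

# An empty element without attributes acts as a flag: test for its presence
has_prize <- length(xml_find_all(book, "./pulitzerPrizeWinner")) > 0
has_movie <- length(xml_find_all(book, "./movieAdaptation")) > 0  # invented name, not present

# Both ways of writing an empty element give an element with no content
same_form <- identical(xml_text(read_xml("<flag/>")),
                       xml_text(read_xml("<flag></flag>")))

c(fmt_price, fmt_text)
c(has_prize, has_movie, same_form)
```

The information in a flag element such as pulitzerPrizeWinner is carried entirely by whether the element is present.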
41.4 XML Design Patterns
There are different ways to structure data in XML. Let’s explore some common design patterns with examples from our publishing database.
41.4.1 Attribute-Centric vs. Element-Centric Design
Data in XML can be represented either as attributes or as child elements. Let’s compare these approaches:
Attribute-Centric
The book element uses an attribute-centric approach for its basic information:
This example shows a more complex structure:
Author-book relationship details as attributes
Author name as text content
Contact information as nested elements
This example shows a collection of empty elements with attributes, representing the available formats for a book.
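To make the two approaches concrete, the sketch below (assuming the xml2 package) stores the same two facts about a book first as attributes and then as child elements, and reads them back in R. The pages value and the element-centric layout are invented for illustration and are not taken from the sample file.

```r
library(xml2)

# Attribute-centric: the data lives in attributes of <book>
attr_style <- read_xml('<book title="1977!" pages="107" />')

# Element-centric: the same data lives in child elements of <book>
elem_style <- read_xml('<book><title>1977!</title><pages>107</pages></book>')

# Attributes are read with xml_attr()
title_from_attr <- xml_attr(attr_style, "title")

# Child elements are located first, then their text is read with xml_text()
title_from_elem <- xml_text(xml_find_first(elem_style, "./title"))

identical(title_from_attr, title_from_elem)
```

Both designs carry the same information. The element-centric form is longer, but a child element such as title could later gain its own attributes or children, which an attribute value cannot.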
41.6 Best Practices for XML Design
Based on our examples, here are some best practices for designing XML documents:
Choose appropriate element names: Use clear, descriptive names that reflect the data’s meaning.
Be consistent in your design approach: If you use attributes for certain types of information, do so consistently.
Use attributes for metadata: Attributes are good for information that describes the element but isn’t part of its primary content.
Use elements for structured data: When data has its own structure or might need to be extended in the future, use elements.
Keep it simple: Only make your XML as complex as it needs to be. Overly complex structures can be difficult to maintain and process.
Consider your use case: The optimal XML design depends on how the data will be used. Consider factors like readability, processing efficiency, and future extensibility.
41.7 Working with XML in R
You can read XML data into R and process it. The next section goes into detail about how to do this. The following is just a simple example.
library(xml2)
library(magrittr)  # for the %>% pipe operator used below

# Read XML file
books_xml <- read_xml("books-v025.xml")

# Extract book titles
book_titles <- xml_find_all(books_xml, "//book") %>%
  xml_attr("title")
book_titles
[1] "1977!" "200 Years of German Humor"
[3] "Ask Your System Administrator" "But I Did It Unconsciously"
[5] "Perhaps It's a Glandular Problem"
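If the file is not at hand, the same calls can be tried on an inline document. The sketch below assumes the xml2 package. The inline XML is a stand-in, not the contents of books-v025.xml: only the five title values are taken from the output above, and every other detail of the real file is left out.

```r
library(xml2)

# Inline stand-in for the file; only the titles come from the output above.
# &apos; is the XML entity for an apostrophe.
books_xml <- read_xml('
<books>
  <book title="1977!" />
  <book title="200 Years of German Humor" />
  <book title="Ask Your System Administrator" />
  <book title="But I Did It Unconsciously" />
  <book title="Perhaps It&apos;s a Glandular Problem" />
</books>')

# Find every <book> element anywhere in the document, then read its title attribute
book_nodes  <- xml_find_all(books_xml, "//book")
book_titles <- xml_attr(book_nodes, "title")
book_titles
```

xml_find_all() returns a set of nodes, and xml_attr() is applied to each node in that set, giving one title per book.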
41.8 Conclusion
XML is a powerful and flexible way to structure data. By understanding the basic elements, attributes, and design patterns, you can create XML documents that effectively represent complex data hierarchies.
Our publishing database example demonstrates various XML features and design approaches, from attribute-heavy elements to deeply nested structures. The best approach for your XML depends on your specific use case, considering factors like readability, processing efficiency, and future extensibility.
As you continue working with XML, remember that good design balances technical correctness with practical usability, creating documents that are both machine-processable and human-understandable.