41 Intro to XML

41.1 Introduction to XML

XML (eXtensible Markup Language) is a versatile markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. Unlike HTML, which was designed specifically for web pages, XML is designed to store and transport data.

In this tutorial, we’ll explore the fundamentals of XML using a publishing database example, covering key concepts and demonstrating different ways to structure data in XML.

41.2 XML Basics

41.2.1 What is XML?

XML stands for eXtensible Markup Language. It’s a markup language designed to store and transport data that is both human-readable and machine-readable. Key characteristics of XML include:

It’s platform-independent
It supports Unicode, allowing almost any information in any language to be communicated
It’s self-descriptive through customizable tags
It separates data from presentation
It’s hierarchical, allowing complex data structures to be represented

41.2.2 XML Syntax Rules

XML documents must follow certain rules:

XML documents must have a root element
XML elements must have a closing tag or be self-closing
XML tags are case sensitive
XML elements must be properly nested
XML attribute values must be quoted
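
These rules can be checked mechanically, because an XML parser rejects any document that breaks one of them. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree module. The helper name is_well_formed and the sample snippets are made up for illustration; only the first snippet follows all of the rules.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One hypothetical snippet per rule; only "valid" obeys all of them.
samples = {
    "valid": '<book id="B001"><title>1977!</title></book>',
    "two roots": "<book></book><book></book>",
    "missing end tag": "<book><title>1977!</book>",
    "case mismatch": "<Book></book>",
    "bad nesting": "<book><title></book></title>",
    "unquoted attribute": "<book id=B001></book>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

A parser only checks that the document is well-formed. It does not check whether the data itself makes sense.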

Let’s examine a sample XML document that will serve as our reference throughout this tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">
    <publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>
    <formats>
      <format type="hardcover" price="49.99" />
      <format type="paperback" price="21.99" />
    </formats>
    <pulitzerPrizeWinner />
    <royalty advance="10000" rate="0.05" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Sarah Buchman
        <phone>718-496-7223</phone>
        <address>
          <street>75 West 205 St</street>
          <city>Bronx</city>
          <state>NY</state>
          <zip>10468</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B002" title="200 Years of German Humor" type="history" pages="14" 
        sales="9566" pubdate="1998-04-01">
    <formats>
      <format type="hardcover" price="19.95" />
      <format type="audiobook" price="2.99" />
    </formats>
    <publisher city="Hamburg" state="" country="Germany">Schandenfreude Press</publisher>
    <royalty advance="1000" rate="0.06" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Sarah Buchman
        <phone>718-496-7223</phone>
        <address>
          <street>75 West 205 St</street>
          <city>Bronx</city>
          <state>NY</state>
          <zip>10468</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B003" title="Ask Your System Administrator" type="computer" pages="1226" 
        sales="25667" pubdate="2000-09-01">
    <formats>
      <format type="hardcover" price="39.99" />
      <format type="paperback" price="19.99" />
    </formats>
    <publisher city="San Francisco" state="CA" country="USA">Core Dump Books</publisher>
    <pulitzerPrizeWinner />
    <royalty advance="15000" rate="0.07" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Christian Kells
        <phone>212-771-4680</phone>
        <address>
          <street>114 Horatio St</street>
          <city>New York</city>
          <state>NY</state>
          <zip>10014</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B004" title="But I Did It Unconsciously" type="psychology" pages="510"
        sales="13001" pubdate="1999-05-31">
    <formats>
      <format type="paperback" price="12.99" />
      <format type="ebook" price="4.99" />
    </formats>

    <publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>
    <royalty advance="20000" rate="0.08" />
    <authors>
      <author order="1" royalty_shares="0.6" primary="true">Hallie Hull
        <phone>415-549-4278</phone>
        <address>
          <street>3800 Waldo Ave, #14F</street>
          <city>San Francisco</city>
          <state>CA</state>
          <zip>94123</zip>
        </address>
      </author>
      <author order="2" royalty_shares="0.4" primary="false">Klee Hull
        <phone>415-549-4278</phone>
        <address>
          <street>3800 Waldo Ave, #14F</street>
          <city>San Francisco</city>
          <state>CA</state>
          <zip>94123</zip>
        </address>
      </author>
    </authors>
  </book>

  <!-- Additional books omitted for brevity -->
  
</books>

41.3 XML Document Structure

41.3.1 XML Declaration

An XML document typically begins with an XML declaration that specifies the XML version and character encoding. When the declaration is present, it must be the very first thing in the document:

<?xml version="1.0" encoding="UTF-8"?>

This tells processing applications that this is an XML document that conforms to XML 1.0 specifications and uses UTF-8 character encoding.
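
As a rough illustration of what the declaration does, the sketch below parses a small document stored as UTF-8 bytes, using Python's standard-library xml.etree.ElementTree module. The city value is a made-up example chosen because it contains a non-ASCII letter. The second half shows that the parser rejects a declaration that is not at the very start of the document.

```python
import xml.etree.ElementTree as ET

# Encode the document as UTF-8 bytes, as it would be stored in a file.
# "Zürich" is a hypothetical value, not part of the publishing example.
data = '<?xml version="1.0" encoding="UTF-8"?>\n<city>Zürich</city>'.encode("utf-8")

root = ET.fromstring(data)
print(root.tag, root.text)

# The declaration is only allowed at the very start of the document,
# so even one leading space makes the parser reject it.
try:
    ET.fromstring(b" " + data)
    declaration_must_be_first = False
except ET.ParseError:
    declaration_must_be_first = True

print(declaration_must_be_first)
```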

41.3.2 Elements

Elements are the building blocks of XML. An element consists of “tags” (e.g. <book>) and content. For example, <book>War &amp; Peace</book> is all considered one “element”. (A literal & inside content must be written as &amp;.)

Tags come in a few different forms. The word between the < and > is known as the “name” of the tag. For example:

a start tag: <book>
an end tag (notice the / before the name): </book>

(there must NOT be a space between the “/” and the name of the tag)
an empty tag (notice the “ /” after the name): <book />

(The space before the “/” is optional in XML: <book/> and <book /> mean the same thing. Writing the space is a common convention that comes from compatibility guidelines for older web browsers.)

We will explain more below.

Rules for naming tags

Names of tags may ONLY contain

Letters
Numbers
Underscore (_)
Hyphen (-)
Period (.)

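A name must start with a letter or an underscore. It may not start with a number, a hyphen, or a period.

Names MAY NOT contain spaces or other “special characters” like !, @, #, $, %, ^, &, *, (, ), <, >, [, ], /, |, ;, “, ’, `, ~, =, or commas. The colon (:) is reserved for XML namespaces and should not be used in ordinary names.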
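
The naming rules can be approximated with a regular expression. The sketch below is a simplification: it handles only unaccented English letters, whereas the XML specification also allows many other Unicode letters. It also enforces that a name starts with a letter or an underscore, which XML requires. The function name is_simple_xml_name and the candidate names are made up for illustration.

```python
import re

# Simplified ASCII-only pattern: the first character is a letter or an
# underscore; the rest may be letters, digits, underscores, hyphens, or periods.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def is_simple_xml_name(name):
    """Return True if `name` fits the simplified naming rules above."""
    return NAME_PATTERN.fullmatch(name) is not None

candidates = ["book", "royalty_shares", "pub-date", "v1.0",
              "2ndEdition", "first name", "price$"]
for candidate in candidates:
    print(candidate, is_simple_xml_name(candidate))
```

The first four candidates are accepted. The last three are rejected because they start with a digit, contain a space, or contain a special character.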

Type 1 - Elements with start and end tags, e.g. <city>New York</city>

Most elements have both a start tag and an end tag, with content between them. The start tag and end tag must have the same “name” (i.e. the word between the < and >). Remember that the name of a tag may not contain spaces.

For example:

<phone>718-496-7223</phone>

Note that all tags are surrounded by < and > (i.e. less-than and greater-than signs). In our example, most elements follow this pattern, including <publisher>, <phone>, and <author>: each consists of a start tag, content, and an end tag.

The “content” may be any of the following:

plain text, e.g.
```
<street>75 West 205 St</street>
```
other elements, e.g.
```
<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>
```
The elements inside are known as nested elements, inner elements or child elements.

The element outside is known as the outer element or parent element.

In this example, the address element contains four child elements: street, city, state, and zip.

a combination of text and other elements. For example, the author element shown below contains plain text (Sarah Buchman) as well as other elements. Portions of plain text may appear anywhere in the outer element.

    <author order="1" royalty_shares="1" primary="true">Sarah Buchman
      <phone>718-496-7223</phone>
      <address>
        <street>75 West 205 St</street>
        <city>Bronx</city>
        <state>NY</state>
        <zip>10468</zip>
      </address>
    </author>
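
To make the parent/child relationship concrete, here is a small sketch that parses the author element above with Python's standard-library xml.etree.ElementTree module. It reads the plain text, the direct children, and the children of the nested address element. The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

author = ET.fromstring("""
<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <street>75 West 205 St</street>
    <city>Bronx</city>
    <state>NY</state>
    <zip>10468</zip>
  </address>
</author>
""")

# Plain text that appears before the first child element.
name = author.text.strip()

# Direct children of <author> (the nested elements one level down).
child_tags = [child.tag for child in author]

# Children of the nested <address> element, collected as tag -> text.
address = {child.tag: child.text for child in author.find("address")}

print(name)
print(child_tags)
print(address)
```

Note that street, city, state, and zip are children of address, not of author, so they do not appear in child_tags.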

The “root” element

There must be one “root” element for the document that contains everything else. In the example above the <books> ... </books> element is the root element. It contains everything else in the XML document.

41.3.3 Attributes

Attributes provide additional information about elements and are always specified in the start tag (or in the empty tag, for elements that have no content). Each attribute is written in the form name="value". The value must be surrounded by quotes. Both 'single quotes' and "double quotes" are valid, as long as the opening and closing quotes match.

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">

In this example, the book element has several attributes: id, title, type, pages, sales, and pubdate.
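
The sketch below shows one way a program might read these attributes, using Python's standard-library xml.etree.ElementTree module. An end tag is added so that the snippet is a complete element, and the isbn attribute is hypothetical: it is looked up only to show what happens when an attribute is absent. Attribute values always arrive as text, so numbers must be converted explicitly.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book id="B001" title="1977!" type="history" '
    'pages="107" sales="566" pubdate="2000-08-01"></book>'
)

# Attributes behave like a dictionary of strings.
title = book.attrib["title"]
pages = int(book.get("pages"))   # convert the text "107" to a number
isbn = book.get("isbn")          # hypothetical attribute; absent, so None

# Single and double quotes around a value are interchangeable.
single = ET.fromstring("<format type='hardcover' />")
double = ET.fromstring('<format type="hardcover" />')
same_attributes = single.attrib == double.attrib

print(title, pages, isbn, same_attributes)
```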

41.3.4 Empty Elements

Empty elements are elements that don’t contain any content. An empty element may or may not have attributes:

Empty Elements with Attributes

Some empty elements contain attributes but no content:

<format type="audiobook" price="2.99" />

In this example, the format element has attributes but no content. Note the / before the closing bracket, which indicates that this is a self-closing tag. There is NO end tag for this element.

Empty Elements without Attributes

Some empty elements have neither content nor attributes:

<pulitzerPrizeWinner />

These self-closing tags indicate the presence of an element without any additional information.
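
An empty element without attributes works like a flag: what matters is whether it is present. The sketch below, which uses Python's standard-library xml.etree.ElementTree module on a trimmed-down book element, tests for that presence. It also shows that a self-closing tag and a start tag immediately followed by an end tag are read the same way.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring("""
<book id="B001">
  <pulitzerPrizeWinner />
  <format type="audiobook" price="2.99" />
</book>
""")

# Compare with None: an element that has no children counts as "false"
# in a plain `if element:` test, which would give the wrong answer here.
is_winner = book.find("pulitzerPrizeWinner") is not None

# An empty element can still carry attributes.
price = book.find("format").get("price")

# The two spellings of an empty element are equivalent to the parser.
short_form = ET.fromstring("<pulitzerPrizeWinner />")
long_form = ET.fromstring("<pulitzerPrizeWinner></pulitzerPrizeWinner>")
equivalent = (short_form.text, len(short_form)) == (long_form.text, len(long_form))

print(is_winner, price, equivalent)
```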

41.4 XML Design Patterns

There are different ways to structure data in XML. Let’s explore some common design patterns with examples from our publishing database.

41.4.1 Attribute-Centric vs. Element-Centric Design

Data in XML can be represented either as attributes or as child elements. Let’s compare these approaches:

Attribute-Centric

The book element uses an attribute-centric approach for its basic information:

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">

Advantages:

More compact
Often easier to read for simple values
Can be slightly more efficient to parse

Element-Centric

The author’s address uses an element-centric approach:

<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>

Advantages:

Better for complex or structured data
Can contain mixed content (text and other elements)
More extensible for future changes
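
To compare the two styles side by side, the sketch below writes the same address both ways and reads each version with Python's standard-library xml.etree.ElementTree module. The attribute-centric version of the address is hypothetical; it does not appear in the sample document.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute-centric version of the address.
attribute_style = ET.fromstring(
    '<address street="75 West 205 St" city="Bronx" state="NY" zip="10468" />'
)

# Element-centric version, as used in the sample document.
element_style = ET.fromstring("""
<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>
""")

# Attributes are read from the element itself...
from_attributes = dict(attribute_style.attrib)

# ...while child elements are read by walking one level down.
from_elements = {child.tag: child.text for child in element_style}

print(from_attributes == from_elements)
```

Both versions carry the same four values. The difference is that a child element such as street could later be given children or attributes of its own, while an attribute value can only ever be a single piece of text.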

41.4.2 Mixed Content

XML can contain mixed content, where an element contains both text and child elements:

<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <!-- Address details -->
  </address>
</author>

In this example, the author element contains both the author’s name, Sarah Buchman, as text content and child elements for contact information.

41.5 XML Data Modeling Examples

Let’s explore some specific modeling examples from our publishing database:

41.5.1 Example 1: Book Information

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">
  <!-- Book contents -->
</book>

This example shows how basic book metadata is represented as attributes of the book element.

41.5.2 Example 2: Publisher Information

<publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>

This example demonstrates a mixed approach:

Publisher location details as attributes
Publisher name as text content

41.5.3 Example 3: Author Information

<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <street>75 West 205 St</street>
    <city>Bronx</city>
    <state>NY</state>
    <zip>10468</zip>
  </address>
</author>

This example shows a more complex structure:

Author-book relationship details as attributes
Author name as text content
Contact information as nested elements
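
The relationship attributes become more interesting when a book has several authors. The sketch below uses Python's standard-library xml.etree.ElementTree module on the two authors of book B004 from the sample document, with the phone and address details left out for brevity. It converts the attribute text into numbers and a true/false value, then checks that the royalty shares add up to 1. The record layout is just one possible choice.

```python
import math
import xml.etree.ElementTree as ET

# Authors of book B004, trimmed to the name and relationship attributes.
authors = ET.fromstring("""
<authors>
  <author order="1" royalty_shares="0.6" primary="true">Hallie Hull</author>
  <author order="2" royalty_shares="0.4" primary="false">Klee Hull</author>
</authors>
""")

records = [
    {
        "name": a.text.strip(),
        "order": int(a.get("order")),
        "share": float(a.get("royalty_shares")),
        "primary": a.get("primary") == "true",
    }
    for a in authors.findall("author")
]
records.sort(key=lambda r: r["order"])

total_share = sum(r["share"] for r in records)
shares_add_up = math.isclose(total_share, 1.0)
primary_author = next(r["name"] for r in records if r["primary"])

print(primary_author, shares_add_up)
```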

41.5.4 Example 4: Format Information

<formats>
  <format type="hardcover" price="29.99" />
  <format type="paperback" price="7.99" />
  <format type="ebook" price="4.99" />
  <format type="audiobook" price="2.99" />
</formats>

This example shows a collection of empty elements with attributes, representing the available formats for a book.
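
A collection like this is straightforward to turn into a lookup table. The sketch below uses Python's standard-library xml.etree.ElementTree module to read each format's type and price, then finds the cheapest format. Prices are converted from text to numbers before they are compared.

```python
import xml.etree.ElementTree as ET

formats = ET.fromstring("""
<formats>
  <format type="hardcover" price="29.99" />
  <format type="paperback" price="7.99" />
  <format type="ebook" price="4.99" />
  <format type="audiobook" price="2.99" />
</formats>
""")

# Map each format type to its price as a number.
prices = {f.get("type"): float(f.get("price")) for f in formats.findall("format")}

# Pick the type whose price is lowest.
cheapest = min(prices, key=prices.get)

print(prices)
print(cheapest)
```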

41.6 Best Practices for XML Design

Based on our examples, here are some best practices for designing XML documents:

Choose appropriate element names: Use clear, descriptive names that reflect the data’s meaning.
Be consistent in your design approach: If you use attributes for certain types of information, do so consistently.
Use attributes for metadata: Attributes are good for information that describes the element but isn’t part of its primary content.
Use elements for structured data: When data has its own structure or might need to be extended in the future, use elements.
Keep it simple: Only make your XML as complex as it needs to be. Overly complex structures can be difficult to maintain and process.
Consider your use case: The optimal XML design depends on how the data will be used. Consider factors like readability, processing efficiency, and future extensibility.

41.7 Working with XML in R

You can read XML data into R and process it. The next section goes into detail about how to do this. The following is just a simple example.

library(xml2)
library(magrittr)  # for the %>% pipe operator used below

# Read XML file
books_xml <- read_xml("books-v025.xml")

# Extract book titles
book_titles <- xml_find_all(books_xml, "//book") %>%
  xml_attr("title")

book_titles

[1] "1977!"                            "200 Years of German Humor"       
[3] "Ask Your System Administrator"    "But I Did It Unconsciously"      
[5] "Perhaps It's a Glandular Problem"

41.8 Conclusion

XML is a powerful and flexible way to structure data. By understanding the basic elements, attributes, and design patterns, you can create XML documents that effectively represent complex data hierarchies.

Our publishing database example demonstrates various XML features and design approaches, from attribute-heavy elements to deeply nested structures. The best approach for your XML depends on your specific use case, considering factors like readability, processing efficiency, and future extensibility.

As you continue working with XML, remember that good design balances technical correctness with practical usability, creating documents that are both machine-processable and human-understandable.