41 Intro to XML

41.1 Introduction to XML

XML (eXtensible Markup Language) is a versatile markup language that defines rules for encoding documents in a format that is both human-readable and machine-readable. Unlike HTML, which was designed specifically for web pages, XML is designed to store and transport data.

In this tutorial, we’ll explore the fundamentals of XML using a publishing database example, covering key concepts and demonstrating different ways to structure data in XML.

41.2 XML Basics

41.2.1 What is XML?

As noted above, XML is designed to store and transport data in a form that both people and programs can read. Its key characteristics include:

  • It’s platform-independent
  • It supports Unicode, allowing almost any information in any language to be communicated
  • It’s self-descriptive through customizable tags
  • It separates data from presentation
  • It’s hierarchical, allowing complex data structures to be represented

41.2.2 XML Syntax Rules

XML documents must follow certain rules:

  1. XML documents must have a root element
  2. XML elements must have a closing tag or be self-closing
  3. XML tags are case sensitive
  4. XML elements must be properly nested
  5. XML attribute values must be quoted
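One way to see these rules in action is to hand a parser small fragments that break them. The sketch below assumes the xml2 package (the same one used later in this chapter) and a hypothetical helper, is_well_formed(), which is not part of xml2; it only reports whether read_xml() can parse a string without raising an error.

```r
library(xml2)

# Hypothetical helper (not part of xml2): TRUE if the string parses as XML,
# FALSE if the parser raises an error.
is_well_formed <- function(x) {
  tryCatch({
    read_xml(x)
    TRUE
  }, error = function(e) FALSE)
}

is_well_formed('<book id="B001"><title>1977!</title></book>')  # follows all five rules
is_well_formed('<book/><book/>')                 # breaks rule 1: two root elements
is_well_formed('<book><title>1977!</book>')      # breaks rule 2: <title> is never closed
is_well_formed('<Book></book>')                  # breaks rule 3: tag case does not match
is_well_formed('<book><title></book></title>')   # breaks rule 4: improper nesting
is_well_formed('<book id=B001></book>')          # breaks rule 5: unquoted attribute value
```

The first call should return TRUE and the remaining five FALSE, since a conforming parser is expected to reject a document that violates any of these rules.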

Let’s examine a sample XML document that will serve as our reference throughout this tutorial:

<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">
    <publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>
    <formats>
      <format type="hardcover" price="49.99" />
      <format type="paperback" price="21.99" />
    </formats>
    <pulitzerPrizeWinner />
    <royalty advance="10000" rate="0.05" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Sarah Buchman
        <phone>718-496-7223</phone>
        <address>
          <street>75 West 205 St</street>
          <city>Bronx</city>
          <state>NY</state>
          <zip>10468</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B002" title="200 Years of German Humor" type="history" pages="14" 
        sales="9566" pubdate="1998-04-01">
    <formats>
      <format type="hardcover" price="19.95" />
      <format type="audiobook" price="2.99" />
    </formats>
    <publisher city="Hamburg" state="" country="Germany">Schandenfreude Press</publisher>
    <royalty advance="1000" rate="0.06" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Sarah Buchman
        <phone>718-496-7223</phone>
        <address>
          <street>75 West 205 St</street>
          <city>Bronx</city>
          <state>NY</state>
          <zip>10468</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B003" title="Ask Your System Administrator" type="computer" pages="1226" 
        sales="25667" pubdate="2000-09-01">
    <formats>
      <format type="hardcover" price="39.99" />
      <format type="paperback" price="19.99" />
    </formats>
    <publisher city="San Francisco" state="CA" country="USA">Core Dump Books</publisher>
    <pulitzerPrizeWinner />
    <royalty advance="15000" rate="0.07" />
    <authors>
      <author order="1" royalty_shares="1" primary="true">Christian Kells
        <phone>212-771-4680</phone>
        <address>
          <street>114 Horatio St</street>
          <city>New York</city>
          <state>NY</state>
          <zip>10014</zip>
        </address>
      </author>
    </authors>
  </book>
  
  <book id="B004" title="But I Did It Unconsciously" type="psychology" pages="510"
        sales="13001" pubdate="1999-05-31">
    <formats>
      <format type="paperback" price="12.99" />
      <format type="ebook" price="4.99" />
    </formats>

    <publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>
    <royalty advance="20000" rate="0.08" />
    <authors>
      <author order="1" royalty_shares="0.6" primary="true">Hallie Hull
        <phone>415-549-4278</phone>
        <address>
          <street>3800 Waldo Ave, #14F</street>
          <city>San Francisco</city>
          <state>CA</state>
          <zip>94123</zip>
        </address>
      </author>
      <author order="2" royalty_shares="0.4" primary="false">Klee Hull
        <phone>415-549-4278</phone>
        <address>
          <street>3800 Waldo Ave, #14F</street>
          <city>San Francisco</city>
          <state>CA</state>
          <zip>94123</zip>
        </address>
      </author>
    </authors>
  </book>

  <!-- Additional books omitted for brevity -->
  
</books>

41.3 XML Document Structure

41.3.1 XML Declaration

An XML document typically begins with an XML declaration, which specifies the XML version and character encoding. The declaration is optional in XML 1.0, but when present it must come first in the document:

<?xml version="1.0" encoding="UTF-8"?>

This tells processing applications that the document conforms to the XML 1.0 specification and uses UTF-8 character encoding.

41.3.2 Elements

Elements are the building blocks of XML. An element consists of a start tag, content, and an end tag.

Elements with Start and End Tags

Most elements have both a start tag and an end tag, with content between them:

<phone>718-496-7223</phone>

In our example, most elements follow this pattern, including <publisher>, <phone>, and <author>.

Nested Elements

XML elements can be nested within other elements, creating a hierarchical structure:

<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>

In this example, the address element contains four child elements: street, city, state, and zip.
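The parent-child relationship can also be inspected programmatically. The following is a minimal sketch, assuming the xml2 package, that parses the address fragment from a string and walks one level down and back up the hierarchy; the variable names are illustrative only.

```r
library(xml2)

address <- read_xml('<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>')

children <- xml_children(address)     # element children only

xml_name(children)                    # names of the child elements
xml_text(children)                    # text content of each child
xml_name(xml_parent(children[[1]]))   # step back up from <street> to its parent
```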

41.3.3 Attributes

Attributes provide additional information about elements and are always specified in the start tag:

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">

In this example, the book element has several attributes: id, title, type, pages, sales, and pubdate.

Attribute values must always be quoted, using either single (') or double (") quotes.
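As a brief illustration of how attributes look to a program, the sketch below (assuming the xml2 package) reads the book start tag as a self-closed element and pulls out its attributes. Attribute values always arrive as strings, so any numeric or date handling is a conversion step you add yourself; the "isbn" lookup is a made-up name used only to show what a missing attribute returns.

```r
library(xml2)

book <- read_xml(
  '<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01"/>'
)

xml_attrs(book)                       # named character vector of all attributes
xml_attr(book, "title")               # a single attribute by name
as.integer(xml_attr(book, "pages"))   # convert the string "107" to an integer
as.Date(xml_attr(book, "pubdate"))    # ISO-formatted dates convert directly
xml_attr(book, "isbn")                # hypothetical attribute that is absent: NA
```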

41.3.4 Empty Elements

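Empty elements are elements that don’t contain any content. An empty element can be written as a self-closing tag, as in the examples below, or as a start tag immediately followed by an end tag; the two forms are equivalent. The empty elements in our document fall into two groups: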

Empty Elements with Attributes

Some empty elements contain attributes but no content:

<format type="audiobook" price="2.99" />

In this example, the format element has attributes but no content. Note the / before the closing bracket, which indicates that this is a self-closing tag. There is NO end tag for this element.

Empty Elements without Attributes

Some empty elements have neither content nor attributes:

<pulitzerPrizeWinner />

These self-closing tags indicate the presence of an element without any additional information.
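Because a marker element such as pulitzerPrizeWinner carries its meaning purely through being present, code that reads the document has to test for existence rather than read a value. The sketch below assumes the xml2 package and uses a reduced, hypothetical excerpt of the books document; B003 is deliberately written with a start tag and an end tag to show that a parser treats that form the same as the self-closing one.

```r
library(xml2)

# Reduced excerpt for illustration; the real document has more content per book.
books <- read_xml('<books>
  <book id="B001"><pulitzerPrizeWinner /></book>
  <book id="B002"></book>
  <book id="B003"><pulitzerPrizeWinner></pulitzerPrizeWinner></book>
</books>')

# XPath predicate: keep only books that have a <pulitzerPrizeWinner> child
winners <- xml_find_all(books, "//book[pulitzerPrizeWinner]")
xml_attr(winners, "id")

# Turn presence/absence into a logical flag for every book
all_ids   <- xml_attr(xml_find_all(books, "//book"), "id")
has_prize <- all_ids %in% xml_attr(winners, "id")
setNames(has_prize, all_ids)
```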

41.4 XML Design Patterns

There are different ways to structure data in XML. Let’s explore some common design patterns with examples from our publishing database.

41.4.1 Attribute-Centric vs. Element-Centric Design

Data in XML can be represented either as attributes or as child elements. Let’s compare these approaches:

Attribute-Centric

The book element uses an attribute-centric approach for its basic information:

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">

Advantages:

  • More compact
  • Often easier to read for simple values
  • More efficient for parsing

Element-Centric

The author’s address uses an element-centric approach:

<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>

Advantages:

  • Better for complex or structured data
  • Can contain mixed content (text and other elements)
  • More extensible for future changes
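The choice between the two designs also changes how the data is read back. As a rough sketch, assuming the xml2 package, the same address is written both ways below; the attribute-centric version is hypothetical and does not appear in the sample document.

```r
library(xml2)

# Hypothetical attribute-centric version of the address
attr_style <- read_xml(
  '<address street="75 West 205 St" city="Bronx" state="NY" zip="10468"/>'
)

# Element-centric version, as used in the sample document
elem_style <- read_xml('<address>
  <street>75 West 205 St</street>
  <city>Bronx</city>
  <state>NY</state>
  <zip>10468</zip>
</address>')

# Attribute-centric: look the value up by attribute name
city_from_attr <- xml_attr(attr_style, "city")

# Element-centric: locate the child element, then take its text
city_from_elem <- xml_text(xml_find_first(elem_style, "./city"))

identical(city_from_attr, city_from_elem)
```

Both lookups yield the same string; the difference is that the city element could later gain attributes or children of its own, whereas an attribute value can only ever be a single string.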

41.4.2 Mixed Content

XML can contain mixed content, where an element contains both text and child elements:

<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <!-- Address details -->
  </address>
</author>

In this example, the author element contains both the author’s name, Sarah Buchman, as text content and child elements for contact information.
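Mixed content needs some care when it is read back, because asking for the text of an element returns the text of all its descendants as well. The sketch below, assuming the xml2 package and a shortened version of the author element, contrasts that with selecting only the element's own text node through the XPath text() node test.

```r
library(xml2)

# Shortened author element for illustration (address reduced to the city)
author <- read_xml('<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <city>Bronx</city>
  </address>
</author>')

# Text of the whole element: the name plus the phone number and city
all_text <- xml_text(author)

# Only the first text node directly under <author>, with whitespace trimmed
own_text    <- xml_find_first(author, "./text()")
author_name <- xml_text(own_text, trim = TRUE)

author_name
```

This extra step is one reason some designs avoid mixed content and place the name in its own child element instead.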

41.5 XML Data Modeling Examples

Let’s explore some specific modeling examples from our publishing database:

41.5.1 Example 1: Book Information

<book id="B001" title="1977!" type="history" pages="107" sales="566" pubdate="2000-08-01">
  <!-- Book contents -->
</book>

This example shows how basic book metadata is represented as attributes of the book element.

41.5.2 Example 2: Publisher Information

<publisher city="New York" state="NY" country="USA">Abatis Publishers</publisher>

This example demonstrates a mixed approach:

  • Publisher location details as attributes
  • Publisher name as text content

41.5.3 Example 3: Author Information

<author order="1" royalty_shares="1" primary="true">Sarah Buchman
  <phone>718-496-7223</phone>
  <address>
    <street>75 West 205 St</street>
    <city>Bronx</city>
    <state>NY</state>
    <zip>10468</zip>
  </address>
</author>

This example shows a more complex structure:

  • Author-book relationship details as attributes
  • Author name as text content
  • Contact information as nested elements
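The relationship attributes (order, royalty_shares, primary) are stored as strings in XML, so they usually need converting before use. The following sketch assumes the xml2 package and uses the two authors of book B004 from the sample document, with their contact details left out to keep the fragment short; the data frame column names are illustrative.

```r
library(xml2)

# Authors of B004, simplified: phone and address children omitted
authors <- read_xml('<authors>
  <author order="1" royalty_shares="0.6" primary="true">Hallie Hull</author>
  <author order="2" royalty_shares="0.4" primary="false">Klee Hull</author>
</authors>')

nodes <- xml_find_all(authors, "./author")

author_df <- data.frame(
  name    = xml_text(nodes, trim = TRUE),
  order   = as.integer(xml_attr(nodes, "order")),           # "1"     -> 1L
  share   = as.numeric(xml_attr(nodes, "royalty_shares")),  # "0.6"   -> 0.6
  primary = xml_attr(nodes, "primary") == "true",           # "true"  -> TRUE
  stringsAsFactors = FALSE
)

author_df
sum(author_df$share)   # the shares for one book would be expected to total 1
```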

41.5.4 Example 4: Format Information

<formats>
  <format type="hardcover" price="29.99" />
  <format type="paperback" price="7.99" />
  <format type="ebook" price="4.99" />
  <format type="audiobook" price="2.99" />
</formats>

This example shows a collection of empty elements with attributes, representing the available formats for a book.
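A collection of uniform elements like this maps naturally onto the rows of a table. As a small sketch, assuming the xml2 package, the formats fragment above can be turned into a data frame with one row per format element; the object names are illustrative.

```r
library(xml2)

formats <- read_xml('<formats>
  <format type="hardcover" price="29.99" />
  <format type="paperback" price="7.99" />
  <format type="ebook" price="4.99" />
  <format type="audiobook" price="2.99" />
</formats>')

fmt <- xml_find_all(formats, "./format")

# One row per <format>, one column per attribute
format_df <- data.frame(
  type  = xml_attr(fmt, "type"),
  price = as.numeric(xml_attr(fmt, "price")),
  stringsAsFactors = FALSE
)

format_df
format_df[which.min(format_df$price), ]   # the cheapest available format
```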

41.6 Best Practices for XML Design

Based on our examples, here are some best practices for designing XML documents:

  1. Choose appropriate element names: Use clear, descriptive names that reflect the data’s meaning.

  2. Be consistent in your design approach: If you use attributes for certain types of information, do so consistently.

  3. Use attributes for metadata: Attributes are good for information that describes the element but isn’t part of its primary content.

  4. Use elements for structured data: When data has its own structure or might need to be extended in the future, use elements.

  5. Keep it simple: Only make your XML as complex as it needs to be. Overly complex structures can be difficult to maintain and process.

  6. Consider your use case: The optimal XML design depends on how the data will be used. Consider factors like readability, processing efficiency, and future extensibility.

41.7 Working with XML in R

You can read XML data into R and process it. The next section covers this in detail; the following is just a simple example that extracts the title attribute of every book.

library(xml2)
library(magrittr)  # for the %>% pipe operator used below

# Read XML file
books_xml <- read_xml("books-v025.xml")

# Extract book titles
book_titles <- xml_find_all(books_xml, "//book") %>%
  xml_attr("title")

book_titles
[1] "1977!"                            "200 Years of German Humor"       
[3] "Ask Your System Administrator"    "But I Did It Unconsciously"      
[5] "Perhaps It's a Glandular Problem"

41.8 Conclusion

XML is a powerful and flexible way to structure data. By understanding the basic elements, attributes, and design patterns, you can create XML documents that effectively represent complex data hierarchies.

Our publishing database example demonstrates various XML features and design approaches, from attribute-heavy elements to deeply nested structures. The best approach for your XML depends on your specific use case, considering factors like readability, processing efficiency, and future extensibility.

As you continue working with XML, remember that good design balances technical correctness with practical usability, creating documents that are both machine-processable and human-understandable.

41.9 Resources for Further Learning