# Install packages if needed
# install.packages(c("xml2", "tidyverse"))
library(xml2) # For XML parsing
library(tidyverse) # For data manipulation
library(dplyr) # For data transformation
42 41. Parsing XML using R
This tutorial explains how to parse XML files in R and transform the data into usable dataframes. We’ll work with a books database XML file and create dataframes that mirror common relational database tables.
42.1 What is XML?
XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML files use tags to define elements and their attributes, creating a hierarchical structure.
Key components of XML:
- Elements: Defined by tags (e.g.,
<book>...</book>
) - Attributes: Properties within tags (e.g.,
id="B001"
) - Nesting: Elements can contain other elements
- Self-closing tags: Elements with no content (e.g.,
<pulitzerPrizeWinner />
)
42.2 Required Packages
For XML processing in R, we’ll use the xml2
package for parsing and the tidyverse
for data manipulation:
42.3 Loading the XML File
First, let’s load our XML file:
# Load the XML file
<- read_xml("books-v025.xml")
books_xml
# Quick look at the structure
xml_structure(books_xml)
<books>
<book [id, title, type, pages, price, sales, pubdate]>
<publisher [city, state, country]>
{text}
<pulitzerPrizeWinner>
<royalty [advance, rate]>
<authors>
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<book [id, title, type, pages, price, sales, pubdate]>
<publisher [city, state, country]>
{text}
<royalty [advance, rate]>
<authors>
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<book [id, title, type, pages, price, sales, pubdate]>
<publisher [city, state, country]>
{text}
<pulitzerPrizeWinner>
<royalty [advance, rate]>
<authors>
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<book [id, title, type, pages, price, sales, pubdate]>
<publisher [city, state, country]>
{text}
<royalty [advance, rate]>
<authors>
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<book [id, title, type, pages, price, sales, pubdate]>
<publisher [city, state, country]>
{text}
<royalty [advance, rate]>
<authors>
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<author [order, royalty_shares, primary]>
{text}
<phone>
{text}
{text}
<address>
<street>
{text}
<city>
{text}
<state>
{text}
<zip>
{text}
{text}
<bio>
{text}
<formats>
<format [type, available, price]>
<format [type, available, price]>
<format [type, available, price]>
<format [type, available]>
The read_xml()
function from the xml2
package reads the XML file into R, creating an XML document object that we can query and manipulate.
42.4 Understanding XML Node Selection
To extract data from XML, we need to understand how to select nodes. The xml2
package provides functions like:
xml_find_all()
: Find all nodes matching an XPath expressionxml_find_first()
: Find the first node matching an XPath expressionxml_attr()
: Extract an attribute valuexml_text()
: Extract the text content
XPath is a query language for selecting nodes in XML documents. Basic XPath patterns include:
/books/book
: All book elements that are children of the books element//author
: All author elements anywhere in the document/books/book[@id="B001"]
: The book element with attribute id=“B001”//book/authors/author[@primary="true"]
: All primary authors
42.5 Creating the Dataframes
Now let’s create dataframes that mirror the CSV files mentioned. We’ll extract the relevant data from the XML and ensure we don’t create duplicate entries.
42.5.1 1. Publishers Dataframe
Let’s start by creating a dataframe for publishers:
# Extract all publisher nodes
<- xml_find_all(books_xml, "//publisher")
publisher_nodes
# Extract book IDs to use as identifiers
<- xml_find_all(books_xml, "//book") %>%
book_ids xml_attr("id")
# Create publishers dataframe
<- tibble(
publishers_df pub_id = character(),
pub_name = character(),
city = character(),
state = character(),
country = character()
)
# Populate publishers dataframe
for (i in seq_along(publisher_nodes)) {
<- xml_text(publisher_nodes[i])
pub_name <- xml_attr(publisher_nodes[i], "city")
city <- xml_attr(publisher_nodes[i], "state")
state <- xml_attr(publisher_nodes[i], "country")
country
# Check if this publisher is already in the dataframe
if (!pub_name %in% publishers_df$pub_name) {
# Create a publisher ID (using P prefix instead of B)
<- paste0("P", sprintf("%03d", nrow(publishers_df) + 1))
pub_id
# Add to dataframe
<- publishers_df %>%
publishers_df add_row(
pub_id = pub_id,
pub_name = pub_name,
city = city,
state = state,
country = country
)
}
}
# Display the publishers dataframe
publishers_df
# A tibble: 4 × 5
pub_id pub_name city state country
<chr> <chr> <chr> <chr> <chr>
1 P001 Abatis Publishers New York "NY" USA
2 P002 Schandenfreude Press Hamburg "" Germany
3 P003 Core Dump Books San Francisco "CA" USA
4 P004 AAA Publishing Berkeley "CA" USA
42.5.3 3. Titles Dataframe
Next, let’s create the titles dataframe:
# Extract all book nodes
<- xml_find_all(books_xml, "//book")
book_nodes
# Create titles dataframe
<- tibble(
titles_df title_id = character(),
title_name = character(),
type = character(),
pub_id = character(),
pages = numeric(),
price = numeric(),
sales = numeric(),
pubdate = character()
)
# Process each book
for (i in seq_along(book_nodes)) {
<- xml_attr(book_nodes[i], "id")
title_id <- xml_attr(book_nodes[i], "title")
title_name <- xml_attr(book_nodes[i], "type")
book_type <- xml_attr(book_nodes[i], "pages") %>% as.numeric()
pages <- xml_attr(book_nodes[i], "price") %>% as.numeric()
price <- xml_attr(book_nodes[i], "sales") %>% as.numeric()
sales <- xml_attr(book_nodes[i], "pubdate")
pubdate
# Find the publisher and get its ID from our publishers dataframe
<- xml_find_first(book_nodes[i], "./publisher") %>% xml_text()
publisher_name <- publishers_df %>%
pub_id filter(pub_name == publisher_name) %>%
pull(pub_id)
# Add to dataframe
<- titles_df %>%
titles_df add_row(
title_id = title_id,
title_name = title_name,
type = book_type,
pub_id = pub_id,
pages = pages,
price = price,
sales = sales,
pubdate = pubdate
)
}
# Display the titles dataframe
titles_df
# A tibble: 5 × 8
title_id title_name type pub_id pages price sales pubdate
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 B001 1977! hist… P001 107 22.0 566 2000-0…
2 B002 200 Years of German Humor hist… P002 14 20.0 9566 1998-0…
3 B003 Ask Your System Administrator comp… P003 1226 40.0 25667 2000-0…
4 B004 But I Did It Unconsciously psyc… P001 510 13.0 13001 1999-0…
5 B005 Perhaps It's a Glandular Prob… psyc… P004 826 7.99 94123 2000-1…
42.5.4 4. Royalties Dataframe
Now, let’s extract the royalty information:
# Create royalties dataframe
<- tibble(
royalties_df title_id = character(),
advance = numeric(),
royalty_rate = numeric()
)
# Process each book for royalty information
for (i in seq_along(book_nodes)) {
<- xml_attr(book_nodes[i], "id")
title_id
# Extract royalty information
<- xml_find_first(book_nodes[i], "./royalty")
royalty_node <- xml_attr(royalty_node, "advance") %>% as.numeric()
advance <- xml_attr(royalty_node, "rate") %>% as.numeric()
royalty_rate
# Add to dataframe
<- royalties_df %>%
royalties_df add_row(
title_id = title_id,
advance = advance,
royalty_rate = royalty_rate
)
}
# Display the royalties dataframe
royalties_df
# A tibble: 5 × 3
title_id advance royalty_rate
<chr> <dbl> <dbl>
1 B001 10000 0.05
2 B002 1000 0.06
3 B003 15000 0.07
4 B004 20000 0.08
5 B005 100000 0.07
42.6 Complete Dataset
Now we have created five dataframes that mirror the structure of the CSV files:
publishers_df
: Information about publishersauthors_df
: Information about authorstitles_df
: Information about book titlesroyalties_df
: Information about royalty agreementstitle_authors_df
: Linking table between books and authors
42.7 Exporting the Dataframes
If you want to save these dataframes as CSV files:
# Export dataframes to CSV
write_csv(publishers_df, "publishers_extracted.csv")
write_csv(authors_df, "authors_extracted.csv")
write_csv(titles_df, "titles_extracted.csv")
write_csv(royalties_df, "royalties_extracted.csv")
write_csv(title_authors_df, "title_authors_extracted.csv")
42.8 Common Challenges When Parsing XML
When working with XML files, you may encounter these common challenges:
- Namespace issues: Some XML files use namespaces that can complicate node selection
- Encoding problems: Character encoding can cause issues with special characters
- Missing data: Elements or attributes that might be optional
- Complex hierarchies: Deeply nested structures that require careful navigation
42.9 Handling Missing Data
Let’s modify our approach to handle potential missing data:
# Function to safely extract node text
<- function(node, xpath = NULL, default = NA) {
safe_xml_text if (is.null(xpath)) {
<- tryCatch(
result xml_text(node),
error = function(e) default
)else {
} <- xml_find_first(node, xpath)
found_node <- if (length(found_node) == 0) default else xml_text(found_node)
result
}return(result)
}
# Function to safely extract attribute
<- function(node, attr, default = NA) {
safe_xml_attr <- tryCatch(
result xml_attr(node, attr),
error = function(e) default
)if (is.na(result) || result == "") result <- default
return(result)
}
42.10 Advanced XML Operations
For more complex XML operations, you might want to explore:
- XPath expressions: More complex queries to navigate the XML structure
- XML validation: Validating XML against a schema (DTD or XSD)
- XML transformation: Using XSLT to transform XML into other formats
42.11 Conclusion
This tutorial has demonstrated how to parse an XML file in R and transform its data into structured dataframes. We’ve created five separate dataframes that mirror the structure of typical relational database tables, making it easy to work with the data using familiar R tools.
XML parsing is a valuable skill for working with various data sources, including web APIs, configuration files, and legacy data systems.