Parsing XML Documents with Python

1 hour
  • 2 Learning Objectives

About this Hands-on Lab

Many APIs return XML, so you will likely work with it in your career. Python is well suited to parsing and writing XML, whether you are interacting with these APIs or handling XML in other situations.

In this lab we will use Python’s standard `xml.etree.ElementTree` module to write an XML file, let the intended user review and change that XML, and then use the `defusedxml` package to parse the returned XML for the changes.

You will need basic Python programming and SQL skills for this lab:
– [Certified Associate in Python Programming Certification](https://linuxacademy.com/cp/modules/view/id/470)

Learning Objectives

Successfully complete this lab by achieving the following learning objectives:

Write an XML File

create_catalog.py contains a skeleton of what we need for this objective. Run it to see that it returns an error:

python create_catalog.py

The example code showing how to make this work is shown below:

import sqlite3
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, ElementTree

DB_NAME = "author_contracts.db"

def get_db_data():
    """
    write code to execute the SQL query and return the results
    """

    sql_query = """SELECT author, title, genre FROM authors"""

    con = sqlite3.connect(DB_NAME)
    cur = con.cursor()

    cur.execute(sql_query)
    results = cur.fetchall()

    cur.close()
    con.close()

    return results

def create_book_entry(author, title, genre):
    """
    create the book entry as defined and return book
    """

    book = Element('book')

    author_tag = ET.SubElement(book, 'author')
    author_tag.text = author

    title_tag = ET.SubElement(book, 'title')
    title_tag.text = title

    genre_tag = ET.SubElement(book, 'genre')
    genre_tag.text = genre

    ET.SubElement(book, 'isbn')

    return book

# using the information from get_db_data()
# write code to create a root and then
# add each book to it, finally write
# data to "catalog.xml"

root = Element("catalog")

book_info = get_db_data()

for author, title, genre in book_info:
    book = create_book_entry(author, title, genre)
    root.append(book)

tree = ElementTree(element=root)
tree.write("catalog.xml", encoding="UTF-8", xml_declaration=True)

# test code
expected_catalog = b"<?xml version='1.0' encoding='UTF-8'?>\n<catalog><book><author>Thompson, Keith</author><title>Oh Python! My Python!</title><genre>biography</genre><isbn /></book><book><author>Fritts, Larry</author><title>Fun with Django</title><genre>satire</genre><isbn /></book><book><author>Applegate, John</author><title>When Bees Attack! The Horror!</title><genre>horror</genre><isbn /></book><book><author>Brown, James</author><title>Martin Buber's Philosophies</title><genre>guide</genre><isbn /></book><book><author>Smith, Jackson</author><title>The Sun Also Orbits</title><genre>mystery</genre><isbn /></book></catalog>"

try:
    with open("catalog.xml", "rb") as f:
        catalog = f.read()
except FileNotFoundError:
    catalog = b""

assert catalog == expected_catalog

Run python create_catalog.py.

Congratulations! The data is in a format the ISBN company can use.
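
To see how the pieces of create_catalog.py fit together on a small scale, the following is a minimal sketch, not the lab's solution. It builds a one-book catalog in memory and serializes it to a string. The title used here is a made-up placeholder, not a value from the lab database. Note how an element with no text, such as `isbn`, is written as an empty tag, which is why the expected output in the lab contains `<isbn />`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample value for illustration only.
book = ET.Element("book")
ET.SubElement(book, "title").text = "Sample Title"
ET.SubElement(book, "isbn")  # no text assigned, so it serializes as an empty tag

catalog = ET.Element("catalog")
catalog.append(book)

# encoding="unicode" returns a str without an XML declaration.
xml_text = ET.tostring(catalog, encoding="unicode")
print(xml_text)
```

The lab code uses `ElementTree.write()` instead of `tostring()` because it needs a file on disk with an XML declaration, but the element-building calls are the same.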

Parse XML Changes

First install defusedxml:

pip3 install defusedxml

Then we can work on parse_catalog.py. It contains a skeleton of what we need for this next objective. Run python parse_catalog.py before editing, to see that it returns an error.
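
Before editing, it may help to try the parsing calls on a small in-memory document. The sketch below assumes the returned file keeps the `catalog`/`book` layout with `title` and `isbn` children. The sample XML and its values are hypothetical stand-ins, and the real isbn.xml may differ. The import falls back to the standard library only so the sketch runs where `defusedxml` is not installed; the lab itself should use `defusedxml`.

```python
try:
    import defusedxml.ElementTree as ET  # hardened parser used in the lab
except ImportError:
    import xml.etree.ElementTree as ET  # fallback so this sketch still runs

# Hypothetical stand-in for the returned file.
SAMPLE = """<catalog>
  <book>
    <title>Sample Title</title>
    <isbn>000-0-000000-00-0</isbn>
  </book>
</catalog>"""

root = ET.fromstring(SAMPLE)

# findall("book") matches direct children of the root element;
# findtext() returns the text of the first matching child.
pairs = [
    (book.findtext("isbn"), book.findtext("title"))
    for book in root.findall("book")
]
print(pairs)
```

The pairs are ordered (isbn, title) so that they line up with the placeholders of an `UPDATE ... SET isbn = ? WHERE title = ?` statement.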

The example code showing how to accomplish this second objective is shown below:

import sqlite3
import sys
import defusedxml.ElementTree as ET

DB_NAME = "author_contracts.db"

def update_db(isbn_data_list):
    """ 
    add code to execute each sql_stmt in the order given
    results from sql_query_3 should be assigned to results
    """

    sql_query_1 = ''' ALTER TABLE authors ADD COLUMN isbn CHAR(20); '''

    sql_query_2 = "UPDATE authors SET isbn = ? WHERE title = ?;"

    sql_query_3 = ''' SELECT isbn FROM authors;'''

    con = sqlite3.connect(DB_NAME)
    cur = con.cursor()

    con.execute(sql_query_1)
    con.commit()

    con.executemany(sql_query_2, isbn_data_list)
    con.commit()

    cur.execute(sql_query_3)
    results = cur.fetchall()

    cur.close()
    con.close()  

    # test code
    expected_results = [('000-1-000000-00-1',), ('000-2-000000-00-2',), ('000-3-000000-00-3',), ('000-4-000000-00-4',), ('000-5-000000-00-5',)]

    assert results == expected_results

# using 'isbn.xml'
# loop through "book" in file and append isbn and title as a list object to isbn_data_list
# send isbn_data_list to function update_db

file_name = "isbn.xml"

try:
    tree = ET.parse(file_name)
except FileNotFoundError:
    print("File not found")
    sys.exit(1)

isbn_data_list = []
for book in tree.findall('book'):
    title = book.findtext('title')
    isbn = book.findtext('isbn')
    isbn_data_list.append([isbn, title])

update_db(isbn_data_list)

Now run python parse_catalog.py.

Awesome! You have mastered XML file parsing.
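
If you want to experiment with the database step without touching author_contracts.db, the sketch below repeats the same ALTER, UPDATE, and SELECT sequence against an in-memory SQLite database. The table layout is an assumption based only on the columns the lab queries (author, title, genre), and all rows are made-up sample data.

```python
import sqlite3

# In-memory database so nothing on disk is modified.
con = sqlite3.connect(":memory:")

# Assumed schema: only the three columns the lab's SELECT refers to.
con.execute("CREATE TABLE authors (author TEXT, title TEXT, genre TEXT)")
con.executemany(
    "INSERT INTO authors VALUES (?, ?, ?)",
    [
        ("Doe, Jane", "Sample Title", "guide"),
        ("Roe, Sam", "Another Title", "satire"),
    ],
)

con.execute("ALTER TABLE authors ADD COLUMN isbn CHAR(20)")

# Each item supplies the two placeholders in order: isbn first, then title.
isbn_data_list = [
    ("000-0-000000-00-0", "Sample Title"),
    ("000-0-000000-00-1", "Another Title"),
]
con.executemany("UPDATE authors SET isbn = ? WHERE title = ?", isbn_data_list)
con.commit()

rows = con.execute("SELECT title, isbn FROM authors ORDER BY title").fetchall()
con.close()
print(rows)
```

Using `?` placeholders rather than string formatting lets `sqlite3` handle quoting, which matters for titles that contain apostrophes.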

Additional Resources

Atlantic Publishing's Legal department is ready to get an ISBN for each book that will be published this year. You are the IT department. The Legal department tells you that the service they use to get the ISBN requests the information as XML. They have an example XML file for you to review:

<catalog>
  <book>
    <title>book title</title>
    <author>book author</author>
    <genre>book genre</genre>
  </book>
</catalog>

Legal asks you to prepare this file. It will come back with an ISBN added, and then they would like you to add the ISBN to the database.

Logging In

There are a couple of ways to get in and work with the code. One is to use the credentials provided in the hands-on lab page, log in with SSH, and use a text editor in the terminal.

The other is using VS Code in a web browser. If you'd like to go this route, then you will need to navigate to the public IP address of the workstation server (provided in the hands-on lab page) on port 8080 (example: http://PUBLIC_IP:8080). Your password will be the same password that you'd use to connect over SSH.

What are Hands-on Labs

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
