A guide to operating XML files in Python

We often need to parse data written in different formats. Python ships with standard-library modules, and also has many third-party libraries, for parsing such data. Today we will look at what Python's XML parsers can do.

Let's take a look.

What is XML?

XML (Extensible Markup Language) looks similar to HTML, but XML is used to represent and carry data, while HTML is used to define how data is displayed. XML is widely used for sending data back and forth between clients and servers. Take a look at the following example:

<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<food>
    <item name="breakfast">Idly</item>
    <price>$2.5</price>
    <description>
   Two idly's with chutney
   </description>
    <calories>553</calories>
</food>
<food>
    <item name="breakfast">Paper Dosa</item>
    <price>$2.7</price>
    <description>
    Plain Paper Dosa with chutney
    </description>
    <calories>700</calories>
</food>
<food>
    <item name="breakfast">Upma</item>
    <price>$3.65</price>
    <description>
    Rava upma with bajji
    </description>
    <calories>600</calories>
</food>
<food>
    <item name="breakfast">Bisi Bele Bath</item>
    <price>$4.50</price>
    <description>
   Bisi Bele Bath with sev
    </description>
    <calories>400</calories>
</food>
<food>
    <item name="breakfast">Kesari Bath</item>
    <price>$1.95</price>
    <description>
    Sweet rava with saffron
    </description>
    <calories>950</calories>
</food>
</metadata>

The above example shows the contents of a file named sample.xml. The following code examples are based on this XML example.

Python XML parsing modules

Python allows parsing these XML documents using two standard-library modules, xml.etree.ElementTree and xml.dom.minidom (a minimal DOM implementation). Parsing means reading a document and breaking it into its component parts (elements, attributes, and text) so that a program can work with them. Let's take a closer look at how to use these modules to parse XML data.

xml.etree.ElementTree module:

This module represents XML data as a tree structure, which is the most natural representation of hierarchical data. The Element type stores this hierarchical structure in memory and has the following properties:

Property: Description
Tag: A string identifying the type of data the element holds
Attributes: Any number of attributes, stored in a dictionary
Text: A string holding the element's text content
Tail: An optional string holding the text that follows the element's closing tag
Child elements: Any number of child elements, stored in a sequence
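
The properties listed above can be inspected directly on a parsed element. The following is a minimal sketch; the one-line document and its id attribute are made up for illustration and are not part of sample.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document, used only to illustrate the properties
snippet = '<food id="1"><item name="breakfast">Idly</item> served hot</food>'
food = ET.fromstring(snippet)
item = food[0]

print(food.tag)      # the element type
print(food.attrib)   # the attributes, as a dictionary
print(item.text)     # the text inside <item>...</item>
print(item.tail)     # the text that follows </item> inside the parent
print(len(food))     # the number of child elements
```

Note that the tail belongs to the child element it follows, not to the parent, which is why it is read from item rather than from food.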

ElementTree is a class that wraps the whole element structure and converts it to and from XML. Now let us parse the above XML file using this module.

There are two ways to parse files using the ElementTree module.

The first is the parse() function and the second is the fromstring() function. parse() parses an XML document supplied as a file (a file name or a file object), while fromstring() parses XML supplied as a string, for example a triple-quoted literal.

Use the parse() function:

As mentioned before, this function takes an XML file to parse. Look at the example below:

import xml.etree.ElementTree as ET
mytree = ET.parse('sample.xml')
myroot = mytree.getroot()

First we import the xml.etree.ElementTree module, then parse the sample.xml file using the parse() function. The getroot() method then returns the root element of sample.xml.

When the above code is executed, we will not see the output returned, but as long as there are no errors, it means that the code executed successfully. To check the root element, you can simply use a print statement like this:

import xml.etree.ElementTree as ET
mytree = ET.parse('sample.xml')
myroot = mytree.getroot()
print(myroot)

Output:

<Element 'metadata' at 0x033589F0>

The above output shows that the root element in our XML document is “metadata”.

Use the fromstring() function

We can also use the fromstring() function to parse string data. Here we pass the XML as a triple-quoted string like this:

import xml.etree.ElementTree as ET
data='''<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<food>
    <item name="breakfast">Idly</item>
    <price>$2.5</price>
    <description>
   Two idly's with chutney
   </description>
    <calories>553</calories>
</food>
</metadata>
'''
myroot = ET.fromstring(data)
#print(myroot)
print(myroot.tag)

The above code prints metadata, the tag of the same root element as in the previous example. The XML string used here is just a part of sample.xml, shortened for readability; the complete XML document can be used as well.
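
To check that parse() and fromstring() really produce the same tree, the two can be compared on identical input. This is only a sketch: the temporary file stands in for sample.xml, and the names xml_text, root_from_file, and root_from_string are illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = '<metadata><food><item name="breakfast">Idly</item></food></metadata>'

# Write the string to a temporary file so that parse() has a file to read
with tempfile.NamedTemporaryFile('w', suffix='.xml', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(xml_text)
    tmp_path = tmp.name

try:
    # parse() returns an ElementTree; getroot() gives its root Element
    root_from_file = ET.parse(tmp_path).getroot()
finally:
    os.remove(tmp_path)

# fromstring() returns the root Element directly
root_from_string = ET.fromstring(xml_text)

print(root_from_file.tag, root_from_string.tag)
same = ET.tostring(root_from_file) == ET.tostring(root_from_string)
print(same)
```

The practical difference is the return type: parse() gives an ElementTree (which has write()), while fromstring() gives only the root Element.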

The root tag can be retrieved using the tag attribute as follows:

print(myroot.tag)

Output:

metadata

Because the tag is an ordinary string, you can also slice it to show only the portion you want:

print(myroot.tag[0:4])

Output:

meta

As mentioned before, an element's attributes are stored in a dictionary. To check whether the root tag has any attributes, you can use the attrib attribute like this:

print(myroot.attrib)

Output:

{}

As you can see, the output is an empty dictionary because our root tag has no attributes.

Find elements of interest

The root also has child tags. To retrieve the first child tag of the root, you can index into it like this:

print(myroot[0].tag)

Output:

food

Now, if you want to retrieve all the child tags of the root's first child, you can iterate over them using a for loop like this:

for x in myroot[0]:
    print(x.tag, x.attrib)

Output:

item {'name': 'breakfast'}
price {}
description {}
calories {}

All of the items returned are child tags of the first food element, shown together with their attributes.

To extract text from XML using ElementTree, you can use the text attribute. For example, to retrieve all the information about the first food, use the following code:

for x in myroot[0]:
    print(x.text)

Output:

Idly
$2.5
Two idly's with chutney
553

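As can be seen, the text of each child of the first food element is returned. (The description text also carries the line breaks and indentation that surround it in the file; strip() removes them.) Now, if you want to display every item together with its price, you can use findall() to loop over the food elements and find() to pick out a child of each one. Attributes, by contrast, are read with the get() method.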

for x in myroot.findall('food'):
    item = x.find('item').text
    price = x.find('price').text
    print(item, price)

Output:

Idly $2.5
Paper Dosa $2.7
Upma $3.65
Bisi Bele Bath $4.50
Kesari Bath $1.95

The above output shows every item along with its price. Using ElementTree, the XML file can also be modified.
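
The same lookup pattern can be extended to filter the menu, for example by price. The sketch below embeds a shortened copy of the menu so that it runs on its own; the LIMIT threshold and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the menu, embedded so the sketch runs on its own
menu = ET.fromstring('''<metadata>
<food>
    <item name="breakfast">Idly</item>
    <price>$2.5</price>
</food>
<food>
    <item name="breakfast">Upma</item>
    <price>$3.65</price>
</food>
</metadata>''')

LIMIT = 3.0  # hypothetical price threshold

cheap = []
for food in menu.findall('food'):
    item = food.find('item')
    meal = item.get('name')                      # get() reads an attribute
    price_text = food.findtext('price').strip()  # e.g. "$2.5"
    price = float(price_text.lstrip('$'))        # "$2.5" becomes 2.5
    if price < LIMIT:
        cheap.append((item.text, meal, price))

print(cheap)
```

findtext() is a shortcut for find(...).text, and get() returns None instead of raising an error when the attribute is missing.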

Modify XML file

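The elements in our XML file can be modified as well; for example, the set() method adds or changes an attribute. The examples below assume that myroot is the root of the parsed file (myroot = mytree.getroot()), so that mytree.write() saves the changes. Let's first look at how to add something to the XML.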

Add to XML:

The following example shows how to add content to each item's description.

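for description in myroot.iter('description'):
    # append a note to the existing text (strip() drops the surrounding whitespace)
    new_desc = str(description.text).strip() + ' will be served'
    description.text = new_desc
    # mark the element with an updated="yes" attribute
    description.set('updated', 'yes')

mytree.write('new.xml')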

The write() method creates a new XML file containing the updated tree; passing the original file name instead overwrites the original file. After executing the above code, a new file named new.xml contains the updated results.
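
While experimenting, it can be handy to look at the result without opening a file. The sketch below uses ET.tostring() for that, and also writes to an in-memory buffer to show the xml_declaration option of write(). The one-element document is a shortened stand-in for sample.xml, and the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<metadata><food><description>Rava upma with bajji</description></food></metadata>'
)

for description in root.iter('description'):
    description.text = description.text.strip() + ' will be served'
    description.set('updated', 'yes')   # adds updated="yes" to the tag

# tostring() returns the serialised XML; encoding='unicode' yields a str
result = ET.tostring(root, encoding='unicode')
print(result)

# write() accepts a file name or a file object; an in-memory buffer is used here.
# xml_declaration=True asks for the <?xml ...?> line at the top.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding='utf-8', xml_declaration=True)
print(buffer.getvalue().decode('utf-8'))
```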

To add a new child tag, you can use the SubElement() function. For example, to add a new speciality tag to the first item, Idly, you can do the following:

# append a new <speciality> child to the first <food> and give it some text
speciality = ET.SubElement(myroot[0], 'speciality')
speciality.text = 'South Indian Special'

mytree.write('output5.xml')

As we can see, a new tag is added under the first food tag. You can add a tag to any other element by selecting that element with a subscript within [] brackets.
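
As a small sketch of choosing the parent by subscript, the example below adds the speciality tag to the second food element instead of the first. The two-item document is a shortened stand-in for sample.xml.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<metadata>'
    '<food><item name="breakfast">Idly</item></food>'
    '<food><item name="breakfast">Upma</item></food>'
    '</metadata>'
)

# root[1] selects the second <food>; SubElement() appends the new child to it
speciality = ET.SubElement(root[1], 'speciality')
speciality.text = 'South Indian Special'

first_children = [child.tag for child in root[0]]
second_children = [child.tag for child in root[1]]
print(first_children)
print(second_children)
```

SubElement() always appends the new tag as the last child of the chosen parent.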

Let's see how to delete items using this module.

Remove from XML:

To remove an attribute using ElementTree, call pop() on the element's attrib dictionary; it deletes the named attribute and leaves the element itself in place.

myroot[0][0].attrib.pop('name', None)
 
# create a new XML file with the results
mytree.write('output5.xml')

After this runs, the first item tag in output5.xml no longer has the name attribute. To remove a complete element, use the remove() method of its parent instead:

myroot[0].remove(myroot[0][0])
mytree.write('output6.xml')

In output6.xml, the first child element of the first food tag has been removed. If you want to remove everything inside an element, you can use the clear() method as follows:

myroot[0].clear()
mytree.write('output7.xml')

When the above code is executed, the first food tag is emptied: all of its child elements, text, and attributes are removed, and only an empty food tag remains.
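
The three removal operations can be compared side by side on a fresh copy of the same element. This is a sketch on a one-item stand-in for sample.xml; fresh_food() is a hypothetical helper introduced only for this comparison.

```python
import xml.etree.ElementTree as ET

def fresh_food():
    """Return a new copy of a small <food> element for each experiment."""
    return ET.fromstring(
        '<food><item name="breakfast">Idly</item><price>$2.5</price></food>'
    )

# 1. attrib.pop() drops a single attribute and keeps the element
food = fresh_food()
food[0].attrib.pop('name', None)
after_pop = ET.tostring(food, encoding='unicode')

# 2. remove() drops one child element from its parent
food = fresh_food()
food.remove(food[0])
after_remove = ET.tostring(food, encoding='unicode')

# 3. clear() drops all children, text, and attributes; the element itself stays
food = fresh_food()
food.clear()
after_clear = ET.tostring(food, encoding='unicode')

print(after_pop)
print(after_remove)
print(after_clear)
```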

So far, we have been using the xml.etree.ElementTree module from the Python XML parser. Now let’s see how to parse XML using Minidom.

xml.dom.minidom Module

This module is mainly used by people who are familiar with the DOM (Document Object Model); DOM applications usually start by parsing XML into a DOM. In xml.dom.minidom, this can be done in the following ways:

Use the parse() function:

The first method is to use the parse() function by providing the XML file to be parsed as a parameter. For example:

from xml.dom import minidom
p1 = minidom.parse("sample.xml")

After doing this, you can navigate the parsed document and get the required data. You can also use this function to parse an already open file.

with open('sample.xml') as dat:
    p2 = minidom.parse(dat)

In this case, the variable storing the open file is provided as an argument to the parse function.

Use parseString() method:

This method is used when we want to provide XML to be parsed as a string.

p3 = minidom.parseString('<myxml>Using<empty/> parseString</myxml>')

XML can be parsed using either of the above methods. Now let us try to get the data using this module.
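
As a quick sketch of what a parsed minidom document contains, the parseString() example above can be inspected through documentElement and childNodes. The variable names doc, root, and kinds are illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString('<myxml>Using<empty/> parseString</myxml>')

root = doc.documentElement   # the top-level <myxml> element
print(root.tagName)

# childNodes mixes text nodes and element nodes in document order;
# text nodes report the node name '#text'
kinds = [node.nodeName for node in root.childNodes]
print(kinds)
```

Unlike ElementTree, the DOM keeps text as separate child nodes, which is why the later examples read text through firstChild.data.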

Find elements of interest

After the file is parsed, printing the result shows that the variable storing the parsed data is a DOM Document object.

dat=minidom.parse('sample.xml')
print(dat)

Output:

<xml.dom.minidom.Document object at 0x03B5A308>

Access elements using getElementsByTagName

tagname= dat.getElementsByTagName('item')[0]
print(tagname)

If we get the first item element using the getElementsByTagName method, we see the following output:

<DOM Element: item at 0xc6bd00>

Note that only one element is returned because the [0] subscript selects the first match; the subscript is dropped in the following examples.

To access the value of an attribute, we use the value attribute of the attribute node, as shown below:

dat = minidom.parse('sample.xml')
tagname= dat.getElementsByTagName('item')
print(tagname[0].attributes['name'].value)

Output:

breakfast

To retrieve the text inside these tags, you can use the data attribute of the element's first child (a text node) as follows:

print(tagname[1].firstChild.data)

Output:

Paper Dosa

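The value attribute works the same way for any other item in the list. Here the list of item elements is stored in a variable named items, which the following examples reuse:

items = dat.getElementsByTagName('item')
print(items[1].attributes['name'].value)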

Output:

breakfast

To print out all the items available in our menu, we can iterate over the items and return them all.

for x in items:
    print(x.firstChild.data)

Output:

Idly
Paper Dosa
Upma
Bisi Bele Bath
Kesari Bath

To count the number of items on our menu, we can use the len() function as follows:

print(len(items))

Output:

5

The output specifies that our menu contains 5 items.
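
Putting the minidom calls together, the menu can be collected into a list of dictionaries. The sketch below embeds a shortened copy of the menu so that it runs on its own; the dictionary keys and variable names are assumptions made for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString('''<metadata>
<food>
    <item name="breakfast">Idly</item>
    <price>$2.5</price>
</food>
<food>
    <item name="breakfast">Paper Dosa</item>
    <price>$2.7</price>
</food>
</metadata>''')

menu = []
for food in doc.getElementsByTagName('food'):
    # getElementsByTagName also works on an element and searches below it
    item = food.getElementsByTagName('item')[0]
    price = food.getElementsByTagName('price')[0]
    menu.append({
        'item': item.firstChild.data,
        'meal': item.getAttribute('name'),  # same value as attributes['name'].value
        'price': price.firstChild.data,
    })

print(menu)
print(len(menu))
```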
