Nelson's Weblog: tech / python / xpath

XPath, XML, Python

I hate working with XML. It's easy to extract data from simple text files or CSV files, but XML is all nested, and has entities, and lots of pointy brackets. Regexps just don't cut it; you really need an XML parser. And for some reason Python is not so great at XML.

Python has too many XML choices. There's the stock Python install, which barely does anything. Then there's what you probably should use, PyXML, which has an ugly hack to confusingly install on top of the default Python libraries. But if you follow the advice of Python's most visible XML expert, Uche Ogbuji, you may think there's something wrong with PyXML and install 4Suite instead, which is the same as PyXML only different. Or should you use Amara instead? Then there's ElementTree which is brilliantly fast and simple to use, but limited, or xmltramp, which is even more hacky. On the other extreme there's libxml2, which is fast and powerful but has an awful API.

Mind you, this is all for the basic stuff, like parsing XML. There are lots more Python XML options too. But what's missing is a single, clear, simple library to use. PyXML seems the most standard, but it seems very slow and it tries to be more DOM-like than Python-like. I hate DOM.

All of this is a long-winded preamble to my attempt to do something simple with XPath in Python. I ended up with three tiny sample programs that extract the 'xmlUrl' attributes from an OPML file. Here they are.

PyXML

from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.opml').documentElement
for url in xpath.Evaluate('//@xmlUrl', doc):
    print url.value

This is pretty simple, if a bit wordy. While my goal was simplicity, it's worth noting this is really slow: about 2.5 seconds on a 14k file. The other versions take about 0.5 seconds.
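If you want to check timings like these yourself, here's a rough sketch of a harness. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in parser, since PyXML may not be installed. The best_time helper, the extract function and the synthetic OPML document are all made up for illustration, so the numbers it prints are not comparable to the ones quoted above.

```python
import time
import xml.etree.ElementTree as ET


def best_time(func, repeat=5):
    """Run func several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic OPML with 200 feeds (hypothetical data, not a real subscription list).
outlines = "".join(
    '<outline text="Feed %d" xmlUrl="http://example.com/%d.xml"/>' % (i, i)
    for i in range(200)
)
SAMPLE = "<opml><body>%s</body></opml>" % outlines


def extract():
    """Parse the sample and collect every xmlUrl attribute value."""
    root = ET.fromstring(SAMPLE)
    return [o.get("xmlUrl") for o in root.iter("outline")]


elapsed = best_time(extract)
print("best of 5: %.4f s for %d bytes" % (elapsed, len(SAMPLE)))
```

Taking the best of several runs is a design choice to damp noise from disk caches and other processes. Swap the body of extract for whichever library you want to measure.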

libxml2

import libxml2
doc = libxml2.parseFile('foo.opml')
for url in doc.xpathEval('//@xmlUrl'):
    print url.content

Looks simple enough. But this example hides the awfulness of the libxml2 API. For instance, when you're looking at a tag and you want a list of all its attributes, you can't just get a list. You call get_properties(), which returns only the first attribute; you then have to call .next on that to get the second one. This is Python, guys, not C. We have list as a datatype. The good thing about libxml2 is it's powerful and fast.
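One way to paper over that is a tiny helper that walks the chain and hands back a real list. This is only a sketch, assuming the attribute objects behave as described above (a .next attribute that is None at the end of the chain). The linked_to_list name is made up, and the demo uses stand-in objects rather than real libxml2 nodes so it runs without the library installed.

```python
from types import SimpleNamespace


def linked_to_list(first):
    """Collect a chain of objects linked by a .next attribute into a list."""
    items = []
    node = first
    while node is not None:
        items.append(node)
        node = node.next
    return items


# Stand-ins for two attribute nodes (hypothetical data, not real libxml2 objects).
second = SimpleNamespace(name="xmlUrl", content="http://example.com/a.xml", next=None)
first = SimpleNamespace(name="text", content="Feed A", next=second)

attrs = [(a.name, a.content) for a in linked_to_list(first)]
print(attrs)
```

With the real bindings the call would presumably be linked_to_list(node.get_properties()), but I haven't verified that against every libxml2 version.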

ElementTree

from elementtree import ElementTree
tree = ElementTree.parse("foo.opml")
for outline in tree.findall("//outline"):
    print outline.get('xmlUrl')

This is my favourite example, because it feels the simplest and most Pythonic. But ElementTree's XPath support is woefully incomplete. About all you can do is select nodes. You can't select attributes or do anything fancier.
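For what it's worth, here's a minimal sketch of the same extraction, assuming a modern Python 3 where ElementTree ships in the standard library as xml.etree.ElementTree rather than as the standalone elementtree package used above. The sample OPML string and the xml_urls helper are made up for illustration. It still selects elements rather than attributes, though the [@xmlUrl] predicate at least narrows the match to outlines that carry the attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; a real OPML file would be read from disk instead.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <head><title>Sample</title></head>
  <body>
    <outline text="Folder">
      <outline text="Feed A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Feed B" xmlUrl="http://example.com/b.xml"/>
    </outline>
  </body>
</opml>
"""


def xml_urls(opml_text):
    """Return the xmlUrl value of every outline element that has one."""
    root = ET.fromstring(opml_text)
    # ".//outline[@xmlUrl]" matches outline elements at any depth that carry
    # the attribute; the folder outline has none, so it is skipped.
    return [o.get("xmlUrl") for o in root.findall(".//outline[@xmlUrl]")]


for url in xml_urls(SAMPLE_OPML):
    print(url)
```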

Bottom line? Python is all about "batteries included". But the XML batteries are weak. There are some more powerful options but they've all got drawbacks. Either the APIs are awful, the libraries are slow, or else they lack features. Someone needs to put a clean new XML system into Python. Implement standard SAX and DOM because you have to, but then build a really nice Pythonic API as well and promote that.


See also this followup.

Nelson Minar, 2005-01-14 15:32 Z. Blog licensed under a Creative Commons License.