I hate working with XML. It's easy to extract data from simple text files or CSV files, but XML is all nested, and has entities, and lots of pointy brackets. Regexps just don't cut it; you really need an XML parser. And for some reason Python is not so great at XML.

Python has too many XML choices. There's the stock Python install, which barely does anything. Then there's what you probably should use, PyXML, which uses an ugly hack to install itself, confusingly, on top of the default Python libraries. But if you follow the advice of Python's most visible XML expert, Uche Ogbuji, you may think there's something wrong with PyXML and install 4Suite instead, which is the same as PyXML only different. Or should you use Amara instead? Then there's ElementTree, which is brilliantly fast and simple to use but limited, or xmltramp, which is even more hacky. At the other extreme there's libxml2, which is fast and powerful but has an awful API.

Mind you, this is all for the basic stuff, like parsing XML. There are lots more Python XML options too. But what's missing is a single, clear, simple library to use. PyXML seems the most standard, but it seems very slow and it tries to be more DOM-like than Python-like. I hate DOM.

All of this is a long-winded preamble to my attempt to do something simple with XPath in Python. I ended up with three tiny sample programs that extract the 'xmlUrl' attributes from an OPML file. Here they are.

PyXML

from xml.dom.ext.reader import Sax2
from xml import xpath
doc = Sax2.FromXmlFile('foo.opml').documentElement
for url in xpath.Evaluate('//@xmlUrl', doc):
  print url.value

This is pretty simple, if a bit wordy. My goal was simplicity, but it's worth noting that this is really slow: about 2.5 seconds on a 14k file. The other versions take about 0.5 seconds.

libxml2

import libxml2
doc = libxml2.parseFile('foo.opml')
for url in doc.xpathEval('//@xmlUrl'):
  print url.content

Looks simple enough. But this example hides the awfulness of the libxml2 API. For instance, when you're looking at a tag and you want a list of all its attributes, you can't just get a list. You call get_properties(), which returns only the first attribute, and then you have to follow .next from that one to get the second. This is Python, guys, not C. We have list as a datatype. The good thing about libxml2 is that it's powerful and fast.
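
To make the attribute complaint concrete, here is a rough sketch of the kind of helper you end up writing to turn that chain into a real list. It assumes only what is described above: a node with get_properties() that hands back the first attribute (or None), and attributes linked by .next. The helper name attributes_as_list and the FakeAttr and FakeNode classes are made up for illustration, so the sketch runs without libxml2 installed.

```python
def attributes_as_list(node):
    """Collect a node's attributes into a list by following the .next chain."""
    attrs = []
    attr = node.get_properties()  # first attribute, or None if there are none
    while attr is not None:
        attrs.append(attr)
        attr = attr.next          # step to the next attribute in the chain
    return attrs


# Hypothetical stand-ins that mimic the linked-list shape described above.
class FakeAttr(object):
    def __init__(self, name, content, next=None):
        self.name = name
        self.content = content
        self.next = next


class FakeNode(object):
    def __init__(self, first_attr):
        self._first = first_attr

    def get_properties(self):
        return self._first


second = FakeAttr("xmlUrl", "http://example.com/a.xml")
first = FakeAttr("text", "Feed A", next=second)
names = [a.name for a in attributes_as_list(FakeNode(first))]
print(names)
```

With a real libxml2 node you would pass the node itself in place of FakeNode. The point is only that the list has to be built by hand.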

ElementTree

from elementtree import ElementTree
tree = ElementTree.parse("foo.opml")
for outline in tree.findall("//outline"):
  print outline.get('xmlUrl')

This is my favourite example, because it feels the simplest and most Pythonic. But ElementTree's XPath support is woefully incomplete. About all you can do is select element nodes. You can't select attributes or do anything fancier.
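
The usual workaround is to select the elements and then pull the attribute out in plain Python. Here is a minimal sketch of that, with a few assumptions: it imports ElementTree from the standard library as xml.etree.ElementTree (where later Python versions ship it) rather than from the standalone elementtree package, the inline OPML is a made-up stand-in for foo.opml, and feed_urls is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample data standing in for foo.opml; the URLs are placeholders.
OPML = """<opml version="1.1">
  <head><title>Sample subscriptions</title></head>
  <body>
    <outline text="News">
      <outline text="Feed A" xmlUrl="http://example.com/a.xml"/>
      <outline text="Feed B" xmlUrl="http://example.com/b.xml"/>
    </outline>
  </body>
</opml>"""


def feed_urls(opml_text):
    """Return the xmlUrl of every outline element that has one."""
    root = ET.fromstring(opml_text)
    urls = []
    # The path can only select elements, so the attribute lookup is done here.
    for outline in root.findall(".//outline"):
        url = outline.get("xmlUrl")  # None for folder outlines with no feed
        if url is not None:
            urls.append(url)
    return urls


urls = feed_urls(OPML)
for url in urls:
    print(url)
```

The None check matters because OPML folders are outline elements too, and they carry no xmlUrl.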

Bottom line? Python is all about "batteries included". But the XML batteries are weak. There are some more powerful options but they've all got drawbacks. Either the APIs are awful, the libraries are slow, or else they lack features. Someone needs to put a clean new XML system into Python. Implement standard SAX and DOM because you have to, but then build a really nice Pythonic API as well and promote that.

See also this followup.
  2005-01-14 15:32 Z