Python ElementTree support for parsing unknown XML entities? - python

I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:
<Style name="admin-5678">
<Rule>
<Filter>[admin_level]='5'</Filter>
&maxscale_zoom11;
</Rule>
</Style>
There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned, everything gives errors:
#!/usr/bin/python
##
## Where's the entity support as documented at:
## http://effbot.org/elementtree/elementtree-xmlparser.htm
## In Python 2.7.1+ ?
##
from pprint import pprint
from xml.etree import ElementTree
from cStringIO import StringIO
parser = ElementTree.ElementTree()
#parser.entity["maxscale_zoom11"] = unichr(160)
testf = StringIO('<foo>&maxscale_zoom11;</foo>')
tree = parser.parse(testf)
#tree = parser.parse(testf,"XMLParser")
for node in tree.iter('foo'):
    print node.text
Which depending on how you adjust the comments gives:
xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5
or
AttributeError: 'ElementTree' object has no attribute 'entity'
or
AttributeError: 'str' object has no attribute 'feed'
For those curious the XML is from the OpenStreetMap's mapnik project.

As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.
I finally got it working. Quoted from this Q&A:
Inspired by this post, we can just prepend an XML DOCTYPE definition to the incoming raw HTML content, and then ElementTree works out of the box.
This works in Python 2.6, 2.7, 3.3, and 3.4.
import xml.etree.ElementTree as ET
html = '''<html>
<div>Some reasonably well-formed HTML content.</div>
<form action="login">
<input name="foo" value="bar"/>
<input name="username"/><input name="password"/>
<div>It is not unusual to see &nbsp; in an HTML page.</div>
</form></html>'''
magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY nbsp ' '>
]>''' # You can define more entities here, if needed
et = ET.fromstring(magic + html)
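Applied to the original question, the same DOCTYPE trick might look like the sketch below. It scans the text for entity references, declares each one in an internal DTD subset as a visible marker, and then parses as usual, so the entity names survive into the tree where you can act on them. The helper name parse_with_placeholder_entities and the "[[name]]" marker format are invented for illustration, and the sketch assumes the input has no XML declaration or DOCTYPE of its own (Python 3 syntax).

```python
import re
import xml.etree.ElementTree as ET

# Entities that XML predefines; these must not be redeclared as placeholders.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def parse_with_placeholder_entities(xml_text, root_name):
    """Declare every custom entity as a "[[name]]" marker, then parse.

    Hypothetical helper: assumes xml_text has no XML declaration or DOCTYPE.
    """
    # Named references only; numeric ones such as &#160; are not matched.
    found = set(re.findall(r"&([A-Za-z_][\w.-]*);", xml_text))
    names = sorted(found - PREDEFINED)
    decls = "".join('<!ENTITY %s "[[%s]]">' % (name, name) for name in names)
    doctype = "<!DOCTYPE %s [%s]>" % (root_name, decls)
    return ET.fromstring(doctype + xml_text)


root = parse_with_placeholder_entities(
    "<Rule><Filter>[admin_level]='5'</Filter>&maxscale_zoom11;</Rule>",
    "Rule",
)
# The entity followed </Filter>, so its marker lands in that element's tail.
marker = root.find("Filter").tail
print(marker)
```

Because the replacement text is ordinary character data, each marker shows up in the .text or .tail of a neighbouring element, depending on where the entity appeared.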

I'm not sure whether this is a bug in ElementTree, but you need to call UseForeignDTD(True) on the expat parser to make it behave the way it did in the past.
It's a bit hacky, but you can do this by creating your own instance of ElementTree.XMLParser, calling the method on its underlying xml.parsers.expat parser, and then passing it to ElementTree.parse():
from xml.etree import ElementTree
from cStringIO import StringIO
testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)
for node in tree.iter('foo'):
    print node.text
This outputs "MOOOOO"
Or using a mapping interface:
from xml.etree import ElementTree
from cStringIO import StringIO
class AllEntities:
    def __getitem__(self, key):
        # key is the entity name; you can do whatever you want with it here
        return key
testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)
for node in tree.iter('foo'):
    print node.text
This outputs "moo_1"
A more complex fix would be to subclass ElementTree.XMLParser and fix it there.

Related

Restore CDATA during lxml serialization

I know that I can preserve CDATA sections during XML parsing, using the following:
from lxml import etree
parser = etree.XMLParser(strip_cdata=False)
root = etree.XML('<root><![CDATA[test]]></root>', parser)
See APIs specific to lxml.etree
But, is there a simple way to "restore" CDATA section during serialization?
For example, by specifying a list of tag names…
For instance, I want to turn:
<CONFIG>
<BODY>This is a &lt;message&gt;.</BODY>
</CONFIG>
to:
<CONFIG>
<BODY><![CDATA[This is a <message>.]]></BODY>
</CONFIG>
Just by stating that BODY should contain CDATA…
Something like this?
from lxml import etree
parser = etree.XMLParser(strip_cdata=True)
root = etree.XML('<root><x><![CDATA[<test>]]></x></root>', parser)
print etree.tostring(root)
for elem in root.findall('x'):
    elem.text = etree.CDATA(elem.text)
print etree.tostring(root)
Produces:
<root><x>&lt;test&gt;</x></root>
<root><x><![CDATA[<test>]]></x></root>

Python LXML parse error on Evernote XML

I'm trying to parse Evernote Markup Language (ENML) with lxml in Python 2.7. ENML is a superset of XHTML.
from StringIO import StringIO
import lxml.etree as etree
if __name__ == '__main__':
    xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example.&nbsp; Another sentence.\n</en-note>')
    tree = etree.parse(xml_str)
The code above errors out with:
XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 32
How do I successfully parse ENML?
&nbsp; is understood by the HTML parser, not the XML parser:
from StringIO import StringIO
import lxml.html as LH
if __name__ == '__main__':
    xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example.&nbsp; Another sentence.\n</en-note>')
    tree = LH.parse(xml_str)
    print(LH.tostring(tree))
You can try replacing the entity names with their numeric character references (for example, &#160; instead of &nbsp;).
http://www.w3schools.com/tags/ref_entities.asp

How to print Element as correct xml with xml tag?

So I have this function in my view:
from django.http import HttpResponse
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
def helloworld(request):
    root_element = Element("root_element")
    comment = Comment("Hello World!!!")
    root_element.append(comment)
    foo_element = Element("foo")
    foo_element.text = "bar"
    bar_element = Element("bar")
    bar_element.text = "foo"
    root_element.append(foo_element)
    root_element.append(bar_element)
    return HttpResponse(tostring(root_element), "application/xml")
It prints something like this:
<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>
As you can see, the XML declaration is missing at the beginning. How do I output proper XML that starts with an XML declaration?
If you can add a dependency to your project, I suggest using lxml, which is more complete and better optimized than the basic xml module that comes with Python.
To do this, you just have to change your import statement to:
from lxml.etree import Element, SubElement, Comment, tostring
Then you'll have a tostring() with an xml_declaration option:
>>> tostring(root, xml_declaration=False)
'<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>'
>>> tostring(root, xml_declaration=True)
"<?xml version='1.0' encoding='ASCII'?>\n<root_element><!--Hello World!!!--><foo>bar</foo><bar>foo</bar></root_element>"
In the standard library, only the write() method of ElementTree has an xml_declaration option. Another solution would be to create a wrapper that uses ElementTree.write() to write into a StringIO and then returns the content of the StringIO.
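A minimal sketch of such a wrapper, written for Python 3 with io.BytesIO instead of StringIO because write() emits bytes when given a real encoding. The function name tostring_with_declaration is made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def tostring_with_declaration(element, encoding="utf-8"):
    """Serialize an element to bytes, including the XML declaration."""
    buffer = io.BytesIO()
    # Wrap the element in an ElementTree so that write() is available.
    ET.ElementTree(element).write(
        buffer, encoding=encoding, xml_declaration=True
    )
    return buffer.getvalue()


root_element = ET.Element("root_element")
foo_element = ET.SubElement(root_element, "foo")
foo_element.text = "bar"

xml_bytes = tostring_with_declaration(root_element)
print(xml_bytes.decode("utf-8"))
```

In a Django view, the returned bytes can be passed straight to HttpResponse. On Python 3.8 and later, the standard library's tostring() accepts xml_declaration directly, so the wrapper is only needed on older versions.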

What's the best way to handle &nbsp;-like entities in XML documents with lxml?

Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False doesn't ignore them, it just doesn't resolve them.
If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.
What's the best way to get a â text child under the aa tag with lxml?
You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa>&nbsp;&acirc;</aa>'
@Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:
entities = [
    ('&nbsp;', u'\u00a0'),
    ('&acirc;', u'\u00e2'),
    ...
]
for before, after in entities:
    x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
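If fixing the input is not an option, the pre-processing idea can be generalized a little. The sketch below is written for Python 3 and uses the standard library's html.entities.name2codepoint table instead of a hand-written list. It rewrites HTML named entities as numeric character references, which any XML parser accepts, and leaves XML's five predefined entities and unknown names untouched. The function name html_entities_to_numeric is hypothetical, and the regular expression will also rewrite matches inside CDATA sections or comments, so treat it as a rough illustration only.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML already defines these five, so they must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Rewrite HTML named entities as numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave predefined and unrecognized entities untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(replace, text)


cleaned = html_entities_to_numeric("<aa>&nbsp;&acirc;</aa>")
print(cleaned)
print(repr(ET.fromstring(cleaned).text))
```

The same cleaned string can be handed to lxml's etree.fromstring() in place of the standard-library parser used here.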
When I was trying to do something similar, I just used x.replace('&', '&amp;') before parsing the string.

_ElementInterface instance has no attribute 'tostring'

The code below generates this error. I can't figure out why. If ElementTree has parse, why doesn't it have tostring? http://docs.python.org/library/xml.etree.elementtree.html#xml.etree.ElementTree.ElementTree
from xml.etree.ElementTree import ElementTree
...
tree = ElementTree()
node = ElementTree()
node = tree.parse(open("my_xml.xml"))
text = node.tostring()
tostring is a function of the xml.etree.ElementTree module, not a method of the confusingly similarly-named xml.etree.ElementTree.ElementTree class.
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import tostring
tree = ElementTree()
node = tree.parse(open("my_xml.xml"))
text = tostring(node)
tostring() is actually a function of the ElementTree module, not a method of the ElementTree wrapper class.
>>> import xml.etree.ElementTree as ET
>>> x = ET.fromstring('<xml><one>one</one></xml>')
>>> x
<Element xml at 7f749572f710>
>>> ET.tostring(x)
'<xml><one>one</one></xml>'
The docs you've linked to do not support the existence of an ElementTree.tostring() method.
Also, your call to tree.parse() rebinds node.
