Python xml.dom.minidom.parse() function ignores DTDs

I have the following Python code:
import xml.dom.minidom
import xml.parsers.expat
try:
    domTree = xml.dom.minidom.parse(myXMLFileName)
except xml.parsers.expat.ExpatError as e:
    return e.args[0]
which I am using to parse an XML file. Although it quite happily spots simple XML errors like mismatched tags, it completely ignores the DTD specified at the top of the XML file:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<!DOCTYPE ServerConfig SYSTEM "ServerConfig.dtd">
so it doesn't notice when mandatory elements are missing, for example. How can I switch on DTD checking?

See this question - the accepted answer is to use lxml validation.

Just by way of explanation: Python xml.dom.minidom and xml.sax use the expat parser by default, which is a non-validating parser. It may read the DTD in order to do entity replacement, but it won't validate against the DTD.
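A quick way to see this behaviour with only the standard library is the sketch below. It embeds a small, made-up internal DTD (the Host element is hypothetical, not taken from the real ServerConfig.dtd) that requires a child element, then parses a document that omits it. minidom accepts the document without complaint.

```python
import xml.dom.minidom

# Hypothetical DTD: ServerConfig must contain exactly one Host element.
doc_text = """<?xml version="1.0"?>
<!DOCTYPE ServerConfig [
  <!ELEMENT ServerConfig (Host)>
  <!ELEMENT Host (#PCDATA)>
]>
<ServerConfig></ServerConfig>"""

# No exception is raised even though the mandatory Host element is missing,
# because expat checks well-formedness only, not validity.
dom = xml.dom.minidom.parseString(doc_text)
root_name = dom.documentElement.tagName
child_elements = [
    node for node in dom.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(root_name, len(child_elements))
```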
gimel and Tim recommend lxml, which is a nicely pythonic binding for the libxml2 and libxslt libraries. It supports validation against a DTD. I've been using lxml, and I like it a lot.

Just for the record, this is what my code looks like now:
from lxml import etree
try:
    parser = etree.XMLParser(dtd_validation=True)
    domTree = etree.parse(myXMLFileName, parser=parser)
except etree.XMLSyntaxError as e:
    return e.args[0]

I recommend lxml over xmlproc because the PyXML package (containing xmlproc) is not being developed any more; the latest Python version that PyXML can be used with is 2.4.

I believe you need to switch from expat to xmlproc.
See:
http://code.activestate.com/recipes/220472/

Related

Fetching child tags of a tag with a xmlns namespace [duplicate]

I want to use the findall method of the ElementTree module to locate some elements in a source XML file.
However, the source XML file (test.xml) has namespaces. Here is a truncated part of the file as a sample:
<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
<TYPE>Updates</TYPE>
<DATE>9/26/2012 10:30:34 AM</DATE>
<COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
<LICENSE>newlicense.htm</LICENSE>
<DEAL_LEVEL>
<PAID_OFF>N</PAID_OFF>
</DEAL_LEVEL>
</XML_HEADER>
The sample python code is below:
from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Returns [] (an empty list)
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Return <Element '{http://www.test.com}DEAL_LEVEL/PAID_OFF' at 0xb78b90>
Though using "{http://www.test.com}" works, it's very inconvenient to add a namespace in front of each tag.
How can I ignore the namespace when using functions like find, findall, ...?
Instead of modifying the XML document itself, it's best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:
from io import StringIO # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}') # strip ns
root = it.root
This is based on the discussion here.
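As a rough end-to-end check, here is the same technique applied to a shortened copy of the sample document from the question (the shortened string is an assumption made for brevity). After the tags are rewritten, plain paths work in findall.

```python
import io
import xml.etree.ElementTree as ET

# Shortened version of the sample document from the question.
xml_text = (
    '<XML_HEADER xmlns="http://www.test.com">'
    "<TYPE>Updates</TYPE>"
    "<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>"
    "</XML_HEADER>"
)

it = ET.iterparse(io.StringIO(xml_text))
for _, el in it:
    # Keep only the part of the tag after the closing brace of the namespace.
    _, _, el.tag = el.tag.rpartition("}")
root = it.root

# Unqualified paths now match because the tags no longer carry a namespace.
paid_off = [el.text for el in root.findall("DEAL_LEVEL/PAID_OFF")]
print(root.tag, paid_off)
```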
If you remove the xmlns attribute from the xml before parsing it then there won't be a namespace prepended to each tag in the tree.
import re
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
The answers so far explicitly put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the XML:
import re
def get_namespace(element):
    m = re.match(r'\{.*\}', element.tag)
    return m.group(0) if m else ''
Then use it in the find method:
namespace = get_namespace(tree.getroot())
print(tree.find('./{0}parent/{0}version'.format(namespace)).text)
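For a self-contained illustration, the sketch below runs the same helper against a shortened copy of the question's sample document instead of the parent/version path used above (the shortened string is an assumption for the example).

```python
import re
import xml.etree.ElementTree as ET

def get_namespace(element):
    # Returns '{uri}' if the tag is namespace-qualified, else ''.
    m = re.match(r"\{.*\}", element.tag)
    return m.group(0) if m else ""

root = ET.fromstring(
    '<XML_HEADER xmlns="http://www.test.com">'
    "<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>"
    "</XML_HEADER>"
)

namespace = get_namespace(root)
# The same '{uri}' prefix is substituted in front of every path step.
path = "./{0}DEAL_LEVEL/{0}PAID_OFF".format(namespace)
paid_off = root.find(path)
print(namespace, paid_off.text)
```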
Here's an extension to nonagon's answer (which removes the namespace from tags) that also removes the namespace from attributes:
import io
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(io.StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1] # strip all namespaces
    for at in list(el.attrib.keys()): # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root
Obviously this is a permanent defacing of the XML, but if that's acceptable (because tag names stay unique without their namespaces and you won't be writing a file that needs the original namespaces), then this can make accessing it a lot easier.
Improving on the answer by ericspod:
Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.
from xml.parsers import expat
class DisableXmlNamespaces:
    def __enter__(self):
        self.old_parser_create = expat.ParserCreate
        expat.ParserCreate = lambda encoding, sep: self.old_parser_create(encoding, None)
    def __exit__(self, type, value, traceback):
        # restore the original factory saved in __enter__
        expat.ParserCreate = self.old_parser_create
This can then be used as follows:
import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
    tree = ET.parse("test.xml")
The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.
You can use the elegant string formatting construct as well:
ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))
or, if you're sure that PAID_OFF appears at only one level in the tree:
el2 = tree.findall(".//{%s}PAID_OFF" % ns)
In Python 3.5, you can pass the namespaces as an argument to find() and findall().
For example:
ns = {'xml_test': 'http://www.test.com'}
tree = ET.parse(r"test.xml")
el1 = tree.findall("xml_test:DEAL_LEVEL/xml_test:PAID_OFF", ns)
Documentation: https://docs.python.org/3.5/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
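A runnable sketch of the namespaces-argument approach on an in-memory copy of the sample follows. The prefix name "t" is arbitrary and only has to match between the path and the dict. The second lookup uses the {*} wildcard, which ElementTree supports from Python 3.8 onward, so that line assumes a 3.8+ interpreter.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<XML_HEADER xmlns="http://www.test.com">'
    "<TYPE>Updates</TYPE>"
    "<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>"
    "</XML_HEADER>"
)

# Map a prefix of our own choosing to the namespace URI used in the document.
ns = {"t": "http://www.test.com"}
values = [el.text for el in root.findall("t:DEAL_LEVEL/t:PAID_OFF", ns)]

# Python 3.8+: '{*}' matches a tag in any namespace (or none).
wildcard_values = [el.text for el in root.findall("{*}DEAL_LEVEL/{*}PAID_OFF")]

print(values, wildcard_values)
```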
I might be late for this, but I don't think re.sub is a good solution.
However, the approach of replacing xml.parsers.expat.ParserCreate does not work on Python 3.x.
The main culprit is xml/etree/ElementTree.py; see the bottom of its source code:
# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element
    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass
Which is kinda sad.
The solution is to get rid of it first.
import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
Tested on Python 3.6.
The try statement is useful in case somewhere in your code you reload or import a module twice; otherwise you get strange errors like:
maximum recursion depth exceeded
AttributeError: XMLParser
btw damn the etree source code looks really messy.
If you're using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():
from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
ElementTree tries to use Expat by calling ParserCreate() but provides no option to omit the namespace separator string. The above code will cause the separator to be ignored, but be warned that this could break other things.
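To see what the separator argument changes at the expat level, here is a small sketch that calls expat directly. It does not patch ElementTree; the helper name element_names is made up for the illustration.

```python
from xml.parsers import expat

def element_names(xml_text, namespace_separator):
    # Collect the element names exactly as expat reports them.
    names = []
    parser = expat.ParserCreate(namespace_separator=namespace_separator)
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_text, True)
    return names

xml_text = '<XML_HEADER xmlns="http://www.test.com"><TYPE>Updates</TYPE></XML_HEADER>'

# With a separator, expat does namespace processing and reports 'uri}local'.
with_ns = element_names(xml_text, "}")
# With None, namespace processing is off and names are reported as written.
without_ns = element_names(xml_text, None)

print(with_ns)
print(without_ns)
```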
Let's combine nonagon's answer with mzjn's answer to a related question:
from pathlib import Path
from typing import Dict, Tuple
import xml.etree.ElementTree as ET

def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
    return xml_iter.root, xml_namespaces
Using this function we:
Create an iterator to get both namespaces and a parsed tree object.
Iterate over the created iterator to get the namespaces dict that we can later pass in each find() or findall() call, as suggested by iMom0.
Return the parsed tree's root element object and namespaces.
I think this is the best approach all around, as it involves no manipulation of either the source XML or the resulting parsed xml.etree.ElementTree output.
I'd also like to credit balmy's answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until then I actually traversed the XML tree twice in my application (once to get the namespaces, a second time for the root).
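A usage sketch for this approach follows. It reads from an in-memory buffer rather than a file path, and the helper name parse_with_namespaces and the "h" prefix in the sample document are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def parse_with_namespaces(source):
    # 'start-ns' events yield (prefix, uri) pairs while the tree is being built.
    xml_iter = ET.iterparse(source, events=["start-ns"])
    namespaces = dict(pair for _, pair in xml_iter)
    # The root is available on the iterator once it has been exhausted.
    return xml_iter.root, namespaces

sample = (
    b'<h:XML_HEADER xmlns:h="http://www.test.com">'
    b"<h:DEAL_LEVEL><h:PAID_OFF>N</h:PAID_OFF></h:DEAL_LEVEL>"
    b"</h:XML_HEADER>"
)

root, namespaces = parse_with_namespaces(io.BytesIO(sample))
# The collected dict can be passed straight to find() or findall().
paid_off = root.find("h:DEAL_LEVEL/h:PAID_OFF", namespaces)
print(namespaces, paid_off.text)
```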
Just by chance dropped into the answer here: XSD conditional type assignment default type confusion?. This is not the exact answer for the topic question but may be applicable if the namespace is not critical.
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="test.xsd">
<person version="1">
<firstname>toto</firstname>
<lastname>tutu</lastname>
</person>
</persons>
Also see: https://www.w3.org/TR/xmlschema-1/#xsi_schemaLocation
Works for me. I call an XML validation procedure in my application, but I also want to quickly see validation highlighting and autocompletion in PyCharm when editing the XML. This noNamespaceSchemaLocation attribute does what I need.
RECHECKED
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml")
el1 = tree.findall("person/firstname")
print(el1[0].text)
el2 = tree.find("person/lastname")
print(el2.text)
Returns
>python test.py
toto
tutu

how to parse xml without dtd validation and using lxml?

I've tried using the following code which has invalid dtd/xml
<city>
<address>
<zipcode>4455</zipcode>
</address>
I'm trying to parse it with lxml like this:
from lxml import etree as ET
parser = ET.XMLParser(dtd_validation=False)
tree = ET.fromstring(xml_data,parser)
print(tree.xpath('//zipcode'))
Unfortunately, this code still gives XML errors.
Any idea how I can get a non-validating parse of the above XML?
Assuming that by 'invalid dtd' you meant that the <city> tag is not closed in above XML sample, then your document is actually invalid XML or frankly it isn't XML at all because it doesn't follow XML rules.
You need to fix the document somehow to be able to treat it as an XML document. For this simple unclosed tag case, setting recover=True will do the job:
from lxml import etree as ET
parser = ET.XMLParser(recover=True)
tree = ET.fromstring(xml_data,parser)
print(tree.xpath('//zipcode'))
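The distinction between validation and well-formedness can be checked without lxml. The sketch below feeds the same unclosed document to the standard library parser, which has no DTD validation at all, and it is still rejected.

```python
import xml.etree.ElementTree as ET

# The <city> element is never closed, so this is not well-formed XML.
broken = "<city><address><zipcode>4455</zipcode></address>"

try:
    ET.fromstring(broken)
    error = None
except ET.ParseError as exc:
    # Raised for a well-formedness error; no DTD is involved.
    error = exc

print("rejected:", error is not None)
```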

ElementTree's iterparse() XML parsing error

I need to parse a 1.2GB XML file that has an encoding of "ISO-8859-1", and after reading a few articles on the net, it seems that Python's ElementTree iterparse() is preferred over SAX parsing.
I've written an extremely short piece of code just to test it out, but it's raising an error that I've no idea how to solve.
My Code (Python 2.7):
from xml.etree.ElementTree import iterparse
for (event, node) in iterparse('dblp.xml', events=['start']):
    print node.tag
    node.clear()
Edit: Ahh, as the file was really big and laggy, I typed out the XML line, and made a mistake. It's "& uuml;" without the space. I apologize for this.
This code works fine until it hits a line in the XML file that looks like this:
<Journal>Technical Report 248, ETH Z&uuml;rich, Dept of Computer Science</Journal>
which I guess means Zurich, but the parser does not seem to know this.
Running the code above gave me an error:
xml.etree.ElementTree.ParseError: undefined entity &uuml;
Is there anyway I could solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.
Try the following:
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
class CustomEntity:
    def __getitem__(self, key):
        if key == 'umml':
            key = 'uuml' # Fix invalid entity
        return unichr(htmlentitydefs.name2codepoint[key])
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()
OR
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()
Related question: Python ElementTree support for parsing unknown XML entities?
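Both variants are written for Python 2. The entity lookup they rely on can be tried on its own under Python 3, where the module is html.entities and unichr is spelled chr. The sketch below covers only that lookup, not the parser setup, and the helper name entity_to_char is made up for the example.

```python
from html.entities import name2codepoint

def entity_to_char(name):
    # Python 3 spelling of unichr(htmlentitydefs.name2codepoint[name]).
    return chr(name2codepoint[name])

uuml = entity_to_char("uuml")
print("Z" + uuml + "rich")
```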

How do I skip validating the URI in lxml?

I am using lxml to parse some xml files. I don't create them, I'm just parsing them. Some of the files contain invalid uri's for the namespaces. For instance:
'D:\Path\To\some\local\file.xsl'
I get an error when I try to process it:
lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI
Is there an easy way to replace any invalid uri's with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier way.
What the parser doesn't like are the backslashes in the namespace uri.
To parse the xml despite the invalid uris, you can instantiate an lxml.etree.XMLParser with the recover argument set to True and then use that to parse the file:
from lxml import etree
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse("xmlfile.xml", parser=recovering_parser)
...
If you are sure that those specific errors are not significant to your use case, you could just catch the exception:
try:
    # process your tree here
    SomeFn()
except lxml.etree.XMLSyntaxError as e:
    print("Ignoring", e)

XML parsing - ElementTree vs SAX and DOM

Python has several ways to parse XML...
I understand the very basics of parsing with SAX. It functions as a stream parser, with an event-driven API.
I understand the DOM parser also. It reads the XML into memory and converts it to objects that can be accessed with Python.
Generally speaking, it was easy to choose between the two depending on what you needed to do, memory constraints, performance, etc.
(Hopefully I'm correct so far.)
Since Python 2.5, we also have ElementTree. How does this compare to DOM and SAX? Which is it more similar to? Why is it better than the previous parsers?
ElementTree is much easier to use, because it represents an XML tree (basically) as a structure of lists, and attributes are represented as dictionaries.
ElementTree needs much less memory for XML trees than DOM (and thus is faster), and the parsing overhead via iterparse is comparable to SAX. Additionally, iterparse returns partial structures, and you can keep memory usage constant during parsing by discarding the structures as soon as you process them.
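A minimal sketch of that discard-as-you-go pattern, using a tiny in-memory catalog as stand-in data (a real use would pass a file name to iterparse):

```python
import io
import xml.etree.ElementTree as ET

catalog = (
    b"<catalog>"
    b"<book isdn='xxx-1'><title>T1</title></book>"
    b"<book isdn='xxx-2'><title>T2</title></book>"
    b"</catalog>"
)

titles = []
# 'end' events fire once an element and all of its children are complete.
for event, elem in ET.iterparse(io.BytesIO(catalog), events=("end",)):
    if elem.tag == "book":
        titles.append(elem.findtext("title"))
        # Drop the children, text and attributes we no longer need.
        elem.clear()

print(titles)
```

Cleared elements still remain as empty children of the root, so for very large files the root may need pruning as well.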
ElementTree, as in Python 2.5, has only a small feature set compared to full-blown XML libraries, but it's enough for many applications. If you need a validating parser or complete XPath support, lxml is the way to go. For a long time, it used to be quite unstable, but I haven't had any problems with it since 2.1.
ElementTree deviates from DOM, where nodes have access to their parent and siblings. Handling actual documents rather than data stores is also a bit cumbersome, because text nodes aren't treated as actual nodes. In the XML snippet
<a>This is <b>a</b> test</a>
the string " test" will be the so-called tail of element b.
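The text and tail attributes for that snippet can be inspected directly:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>This is <b>a</b> test</a>")
b = a.find("b")

# Text before the first child belongs to the parent's .text;
# text after a child's end tag belongs to that child's .tail.
print(repr(a.text), repr(b.text), repr(b.tail))
```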
In general, I recommend ElementTree as the default for all XML processing with Python, and DOM or SAX as the solutions for specific problems.
Minimal DOM implementation:
Python supplies a full, W3C-standard implementation of XML DOM (xml.dom) and a minimal one, xml.dom.minidom. This latter one is simpler and smaller than the full implementation. However, from a "parsing perspective", it has all the pros and cons of the standard DOM - i.e. it loads everything in memory.
Considering a basic XML file:
<?xml version="1.0"?>
<catalog>
<book isdn="xxx-1">
<author>A1</author>
<title>T1</title>
</book>
<book isdn="xxx-2">
<author>A2</author>
<title>T2</title>
</book>
</catalog>
A possible Python parser using minidom is:
import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError
#-------- Select the XML file: --------#
# Current file name and directory:
curpath = os.path.dirname(os.path.realpath(__file__))
filename = os.path.join(curpath, "sample.xml")
#print("Filename: %s" % filename)
#-------- Parse the XML file: --------#
try:
    # Parse the given XML file:
    xmldoc = minidom.parse(filename)
except ExpatError as e:
    print("[XML] Error (line %d): %d" % (e.lineno, e.code))
    print("[XML] Offset: %d" % e.offset)
    raise
except IOError as e:
    print("[IO] I/O Error %d: %s" % (e.errno, e.strerror))
    raise
else:
    catalog = xmldoc.documentElement
    books = catalog.getElementsByTagName("book")
    for book in books:
        print(book.getAttribute('isdn'))
        print(book.getElementsByTagName('author')[0].firstChild.data)
        print(book.getElementsByTagName('title')[0].firstChild.data)
Note that xml.parsers.expat is a Python interface to the Expat non-validating XML parser (docs.python.org/2/library/pyexpat.html).
The xml.dom package also supplies the exception class DOMException, but it is not supported in minidom!
The ElementTree XML API:
ElementTree is much easier to use and requires less memory than XML DOM. Furthermore, a C implementation is available (xml.etree.cElementTree in Python 2; since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically, and cElementTree was removed in Python 3.9).
A possible Python parser using ElementTree is:
import os
from xml.etree import ElementTree # uses the C accelerator automatically on Python 3.3+
from xml.etree.ElementTree import ParseError # XML formatting errors
#-------- Select the XML file: --------#
# Current file name and directory:
curpath = os.path.dirname(os.path.realpath(__file__))
filename = os.path.join(curpath, "sample.xml")
#print("Filename: %s" % filename)
#-------- Parse the XML file: --------#
try:
    # Parse the given XML file:
    tree = ElementTree.parse(filename)
except ParseError as e:
    # ParseError carries the expat error code and a (line, column) position
    line, column = e.position
    print("[XML] Error (line %d): %d" % (line, e.code))
    print("[XML] Offset: %d" % column)
    raise
except IOError as e:
    print("[XML] I/O Error %d: %s" % (e.errno, e.strerror))
    raise
else:
    catalogue = tree.getroot()
    for book in catalogue:
        print(book.attrib.get("isdn"))
        print(book.find('author').text)
        print(book.find('title').text)
ElementTree has a more Pythonic API. It is also in the standard library now, so using it reduces dependencies.
I actually prefer lxml, as it has an API like ElementTree's but also has nice additional features and performs well.
ElementTree's parse() is like DOM, whereas iterparse() is like SAX. In my opinion, ElementTree is better than DOM and SAX in that it provides an API that is easier to work with.
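For comparison, here is a rough sketch of the event-driven SAX style applied to the same catalog data. The handler class BookHandler is a made-up name; it collects the isdn attributes as the start-tag events arrive instead of building a tree.

```python
import xml.sax

class BookHandler(xml.sax.ContentHandler):
    """Collects the isdn attribute of every <book> start tag."""

    def __init__(self):
        super().__init__()
        self.isdns = []

    def startElement(self, name, attrs):
        # Called by the parser for each opening tag, in document order.
        if name == "book":
            self.isdns.append(attrs.getValue("isdn"))

catalog = b"""<?xml version="1.0"?>
<catalog>
<book isdn="xxx-1"><author>A1</author><title>T1</title></book>
<book isdn="xxx-2"><author>A2</author><title>T2</title></book>
</catalog>"""

handler = BookHandler()
xml.sax.parseString(catalog, handler)
print(handler.isdns)
```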
