How do I skip validating the URI in lxml?

How do I skip validating the URI in lxml? - python

I am using lxml to parse some xml files. I don't create them, I'm just parsing them. Some of the files contain invalid uri's for the namespaces. For instance:
'D:\Path\To\some\local\file.xsl'
I get an error when I try to process it:
lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI
Is there an easy way to replace any invalid uri's with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier way.

What the parser doesn't like are the backslashes in the namespace uri.
To parse the xml despite the invalid uris, you can instantiate an lxml.etree.XMLParser with the recover argument set to True and then use that to parse the file:
from lxml import etree
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse("xmlfile.xml", parser=recovering_parser)
...

If you are sure that those specific errors are not significant to your use case you could just catch it as an exeption:
try:
# process your tree here
SomeFn()
except lxml.etree.XMLSyntaxError, e:
print "Ignoring", e
pass

Related

Cannot parse XML files in Python - xml.etree.ElementTree.ParseError

I am trying to parse information from XML file using Python's xml module. Problem is that when I specify list of files and start parsing strategy, after first file being (supposedly) successfully parsed, I am getting following error:
Parsing 20586908.xml ..
Parsing 20586934.xml ..
Traceback (most recent call last):
File "<ipython-input-72-0efdae22e237>", line 11, in parse
xmlTree = ET.parse(xmlFilePath, parser = self.parser)
File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 601, in parse
parser.feed(data)
xml.etree.ElementTree.ParseError: parsing finished: line 1755, column 0
Here is the code I am using to parse XML files:
class INBreastXMLParser:
def __init__(self, xmlRootDir):
self.parser = ET.XMLParser(encoding="utf-8")
self.xmlAnnotations = [os.path.join(root, f)
for root, dirs, files in os.walk(xmlRootDir)
for f in files if f.endswith('.xml')]
def parse(self):
for xmlFilePath in self.xmlAnnotations:
logger.info(f"Parsing {os.path.basename(xmlFilePath)} ..")
try:
xmlTree = ET.parse(xmlFilePath, parser = self.parser)
root = xmlTree.getroot()
except Exception as err:
logging.error(f"Could not parse {xmlFilePath}. Reason - {err}")
traceback.print_exc()
And here is the screenshot of the part of the file where parsing fails:

The problem is that the ET.XMLParser instance is reused. The underlying XML library (Expat) that is used by ElementTree does not support this:
Due to limitations in the Expat library used by pyexpat, the xmlparser instance returned can only be used to parse a single XML document. Call ParserCreate for each document to provide unique parser instances.
You need to create a new parser for each XML file. Move
self.parser = ET.XMLParser(encoding="utf-8")
from the __init__ method to the parse method.

Parse errors can and do happen. They have exactly one reason: The parser errors. And even it's only one reason, the causes can be plenty. Three common ones:
The input is invalid (e.g. invalid XML in your example)
The parser is incompatible (e.g. the XML input is valid, but (encoded) in a form or variant the parser can not handle)
The parser has errors itself (e.g. Software Bugs)
As the parser you have in use is written in software and there is normally a bug in each ~173 lines of code, this could be worth a quick look.
But only if you can look fast. It might not be worth because more often the problem is with the input. So maybe worth to look into that first.
In any case you're lucky. It seems like you want to process XML and tooling exists! Check the validation of the file on disk, your program gives you a hint already that it might be invalid with the parse error.
Also move it out of that directory and start your script again. It might not be the only file that is invalid and you might want to find out how many of the remaining files cause an issue with your script as fast as possible, too.

Python ElementTree generate not well formed XML file with special character '\x0b'

I used ElementTree to generate xml with special character of '\x0b', then use minidom to parse it. It will throw not well-formed error.
import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)
Generated XML: <root>\x0b</root>
Error:
Traceback (most recent call last):
File "testXml.py", line 7, in <module>
pretty_tree = minidom.parseString(xml)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6

This behaviour has been raised as a bug in the past and resolved as "won't fix".
The author of the ElementTree module commented
For ET, [this behaviour is] very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.
The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
...
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
...
So in summary, ET.tostring will generate xml which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.

\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.

As a workaround for myself, I wrote a helper method to clean the restricted chars before saving to XML model:
def clean(str):
return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u10000-\u10FFF]+', '', str)

Fetching child tags of a tag with a xmlns namespace [duplicate]

I want to use the method of findall to locate some elements of the source xml file in the ElementTree module.
However, the source xml file (test.xml) has namespaces. I truncate part of xml file as sample:
<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
<TYPE>Updates</TYPE>
<DATE>9/26/2012 10:30:34 AM</DATE>
<COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
<LICENSE>newlicense.htm</LICENSE>
<DEAL_LEVEL>
<PAID_OFF>N</PAID_OFF>
</DEAL_LEVEL>
</XML_HEADER>
The sample python code is below:
from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Return None
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Return <Element '{http://www.test.com}DEAL_LEVEL/PAID_OFF' at 0xb78b90>
Though using "{http://www.test.com}" works, it's very inconvenient to add a namespace in front of each tag.
How can I ignore the namespace when using functions like find, findall, ...?

Instead of modifying the XML document itself, it's best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:
from io import StringIO # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
_, _, el.tag = el.tag.rpartition('}') # strip ns
root = it.root
This is based on the discussion here.

If you remove the xmlns attribute from the xml before parsing it then there won't be a namespace prepended to each tag in the tree.
import re
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)

The answers so far explicitely put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the xml:
import re
def get_namespace(element):
m = re.match('\{.*\}', element.tag)
return m.group(0) if m else ''
And use it in find method:
namespace = get_namespace(tree.getroot())
print tree.find('./{0}parent/{0}version'.format(namespace)).text

Here's an extension to #nonagon answer (which removes namespace from tags) to also remove namespace from attributes:
import io
import xml.etree.ElementTree as ET
# instead of ET.fromstring(xml)
it = ET.iterparse(io.StringIO(xml))
for _, el in it:
if '}' in el.tag:
el.tag = el.tag.split('}', 1)[1] # strip all namespaces
for at in list(el.attrib.keys()): # strip namespaces of attributes too
if '}' in at:
newat = at.split('}', 1)[1]
el.attrib[newat] = el.attrib[at]
del el.attrib[at]
root = it.root
Obviously this is a permanent defacing of the XML but if that's acceptable because there are no non-unique tag names and because you won't be writing the file needing the original namespaces then this can make accessing it a lot easier

Improving on the answer by ericspod:
Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.
from xml.parsers import expat
class DisableXmlNamespaces:
def __enter__(self):
self.old_parser_create = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: self.old_parser_create(encoding, None)
def __exit__(self, type, value, traceback):
expat.ParserCreate = self.oldcreate
This can then be used as follows
import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
tree = ET.parse("test.xml")
The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.

You can use the elegant string formatting construct as well:
ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))
or, if you're sure that PAID_OFF only appears in one level in tree:
el2 = tree.findall(".//{%s}PAID_OFF" % ns)

In python 3.5 , you can pass the namespace as an argument in find().
For example ,
ns= {'xml_test':'http://www.test.com'}
tree = ET.parse(r"test.xml")
el1 = tree.findall("xml_test:DEAL_LEVEL/xml_test:PAID_OFF",ns)
Documentation link :- https://docs.python.org/3.5/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

I might be late for this but I dont think re.sub is a good solution.
However the rewrite xml.parsers.expat does not work for Python 3.x versions,
The main culprit is the xml/etree/ElementTree.py see bottom of the source code
# Import the C accelerators
try:
# Element is going to be shadowed by the C implementation. We need to keep
# the Python version of it accessible for some "creative" by external code
# (see tests)
_Element_Py = Element
# Element, SubElement, ParseError, TreeBuilder, XMLParser
from _elementtree import *
except ImportError:
pass
Which is kinda sad.
The solution is to get rid of it first.
import _elementtree
try:
del _elementtree.XMLParser
except AttributeError:
# in case deleted twice
pass
else:
from xml.parsers import expat # NOQA: F811
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
Tested on Python 3.6.
Try try statement is useful in case somewhere in your code you reload or import a module twice you get some strange errors like
maximum recursion depth exceeded
AttributeError: XMLParser
btw damn the etree source code looks really messy.

If you're using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():
from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)
ElementTree tries to use Expat by calling ParserCreate() but provides no option to not provide a namespace separator string, the above code will cause it to be ignore but be warned this could break other things.

Let's combine nonagon's answer with mzjn's answer to a related question:
def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
xml_iter = ET.iterparse(xml_path, events=["start-ns"])
xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
return xml_iter.root, xml_namespaces
Using this function we:
Create an iterator to get both namespaces and a parsed tree object.
Iterate over the created iterator to get the namespaces dict that we can
later pass in each find() or findall() call as sugested by
iMom0.
Return the parsed tree's root element object and namespaces.
I think this is the best approach all around as there's no manipulation either of a source XML or resulting parsed xml.etree.ElementTree output whatsoever involved.
I'd like also to credit balmy's answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until that I actually traversed XML tree twice in my application (once to get namespaces, second for a root).

Just by chance dropped into the answer here: XSD conditional type assignment default type confusion?. This is not the exact answer for the topic question but may be applicable if the namespace is not critical.
<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="test.xsd">
<person version="1">
<firstname>toto</firstname>
<lastname>tutu</lastname>
</person>
</persons>
Also see: https://www.w3.org/TR/xmlschema-1/#xsi_schemaLocation
Works for me. I call an XML validation procedure in my application. But also I want to quickly see the validation highliting and autocompletion in PyCharm when editing the XML. This noNamespaceSchemaLocation attribute does what I need.
RECHECKED
from xml.etree import ElementTree as ET
tree = ET.parse("test.xml")
el1 = tree.findall("person/firstname")
print(el1[0].text)
el2 = tree.find("person/lastname")
print(el2.text)
Returnrs
>python test.py
toto
tutu

ElementTree's iterparse() XML parsing error

I need to parse a 1.2GB XML file that has an encoding of "ISO-8859-1", and after reading a few articles on the NET, it seems that Python's ElementTree's iterparse() is preferred as to SAX parsing.
I've written a extremely short piece of code just to test it out, but it's prompting out an error that I've no idea how to solve.
My Code (Python 2.7):
from xml.etree.ElementTree import iterparse
for (event, node) in iterparse('dblp.xml', events=['start']):
print node.tag
node.clear()
Edit: Ahh, as the file was really big and laggy, I typed out the XML line, and made a mistake. It's "& uuml;" without the space. I apologize for this.
This code works fine until it hits a line in the XML file that looks like this:
<Journal>Technical Report 248, ETH Zürich, Dept of Computer Science</Journal>
which I guess means Zurich, but the parser does not seem to know this.
Running the code above gave me an error:
xml.etree.ElementTree.ParseError: undefined entity ü
Is there anyway I could solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.

Try following:
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
class CustomEntity:
def __getitem__(self, key):
if key == 'umml':
key = 'uuml' # Fix invalid entity
return unichr(htmlentitydefs.name2codepoint[key])
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
print node.tag
node.clear()
OR
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
print node.tag
node.clear()
Related question: Python ElementTree support for parsing unknown XML entities?

Close all opened xml tags

I have a file, which change it content in a short time. But I'd like to read it before it is ready. The problem is, that it is an xml-file (log). So when you read it, it could be, that not all tags are closed.
I would like to know if there is a possibility to close all opened tags correctly, that there are no problems to show it in the browser (with xslt stylsheet). This should be made by using included features of python.

Some XML parsers allow incremental parsing of XML documents that is the parser can start working on the document without needing it to be fully loaded. The XMLTreeBuilder from the xml.etree.ElementTree module in the Python standard library is one such parser: Element Tree
As you can see in the example below you can feed data to the parser bit by bit as you read it from your input source. The appropriate hook methods in your handler class will get called when various XML "events" happen (tag started, tag data read, tag ended) allowing you to process the data as the XML document is loaded:
from xml.etree.ElementTree import XMLTreeBuilder
class MyHandler(object):
def start(self, tag, attrib):
# Called for each opening tag.
print tag + " started"
def end(self, tag):
# Called for each closing tag.
print tag + " ended"
def data(self, data):
# Called when data is read from a tag
print data + " data read"
def close(self):
# Called when all data has been parsed.
print "All data read"
handler = MyHandler()
parser = XMLTreeBuilder(target=handler)
parser.feed(<sometag>)
parser.feed(<sometag-child-tag>text)
parser.feed(</sometag-child-tag>)
parser.feed(</sometag>)
parser.close()
In this example the handler would receive five events and print:
sometag started
sometag-child started
"text" data read
sometag-child ended
sometag ended
All data read

If I am understanding your question correctly, you have a log file that is always being appended to so you get something like:
<root>
<entry> ... </entry>
<entry> ... </entry>
...
<entry> ... </entry
<!-- no closing root -->
In this case you DON'T want to use a DOM parser because it tries to read a complete document and would choke on the missing tag. Instead, a SAX or Pull parser would work because it reads the document like a stream of data rather than a complete tree. As Denis replied above, you could either close the missing tag at the end or ignore any incomplete tags before writing it out.
XML parsing on Wikipedia

You can use any SAX parser by feeding data available so far to it. Use SAX handler that just reconstructs source XML, keep the stack of tags opened and close them in reverse order at the end.

You could use BeautifulStoneSoup (XML part of BeautifulSoup).
www.crummy.com/software/BeautifulSoup
It's not ideal, but it would circumvent the problem if you cannot fix the file's output...
It's basically a previously implemented version of what Denis said.
You can just join whatever you need into the soup and it will do its best to fix it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I skip validating the URI in lxml? - python

If you are sure that those specific errors are not significant to your use case you could just catch it as an exeption: try: # process your tree here SomeFn() except lxml.etree.XMLSyntaxError, e: print "Ignoring", e pass

Related

Cannot parse XML files in Python - xml.etree.ElementTree.ParseError

Python ElementTree generate not well formed XML file with special character '\x0b'

Fetching child tags of a tag with a xmlns namespace [duplicate]

ElementTree's iterparse() XML parsing error

Close all opened xml tags

Categories

Resources