Cannot parse XML files in Python - xml.etree.ElementTree.ParseError - python

I am trying to parse information from an XML file using Python's xml module. When I specify a list of files and start parsing, the first file is (supposedly) parsed successfully, but then I get the following error:
Parsing 20586908.xml ..
Parsing 20586934.xml ..
Traceback (most recent call last):
File "<ipython-input-72-0efdae22e237>", line 11, in parse
xmlTree = ET.parse(xmlFilePath, parser = self.parser)
File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "C:\Users\StefanCepa995\miniconda3\envs\dl4cv\lib\xml\etree\ElementTree.py", line 601, in parse
parser.feed(data)
xml.etree.ElementTree.ParseError: parsing finished: line 1755, column 0
Here is the code I am using to parse XML files:
class INBreastXMLParser:
    def __init__(self, xmlRootDir):
        self.parser = ET.XMLParser(encoding="utf-8")
        self.xmlAnnotations = [os.path.join(root, f)
                               for root, dirs, files in os.walk(xmlRootDir)
                               for f in files if f.endswith('.xml')]
    def parse(self):
        for xmlFilePath in self.xmlAnnotations:
            logger.info(f"Parsing {os.path.basename(xmlFilePath)} ..")
            try:
                xmlTree = ET.parse(xmlFilePath, parser = self.parser)
                root = xmlTree.getroot()
            except Exception as err:
                logging.error(f"Could not parse {xmlFilePath}. Reason - {err}")
                traceback.print_exc()
And here is the screenshot of the part of the file where parsing fails:

The problem is that the ET.XMLParser instance is reused. The underlying XML library (Expat) that is used by ElementTree does not support this:
Due to limitations in the Expat library used by pyexpat, the xmlparser instance returned can only be used to parse a single XML document. Call ParserCreate for each document to provide unique parser instances.
You need to create a new parser for each XML file. Move
    self.parser = ET.XMLParser(encoding="utf-8")
from the __init__ method into the loop in the parse method, so that every file gets its own parser instance.
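A minimal sketch of that change, assuming a plain list of file paths rather than the asker's class (the function name parse_all and the returned dictionary are illustrative, not part of the original code):

```python
import xml.etree.ElementTree as ET


def parse_all(xml_paths):
    """Parse each file with its own parser; return {path: root element or None}."""
    roots = {}
    for path in xml_paths:
        # Create a fresh XMLParser per document: an Expat-backed parser
        # can only be used to parse a single XML document.
        parser = ET.XMLParser(encoding="utf-8")
        try:
            roots[path] = ET.parse(path, parser=parser).getroot()
        except ET.ParseError as err:
            print(f"Could not parse {path}. Reason - {err}")
            roots[path] = None
    return roots
```

If no custom parser options are needed, calling ET.parse(path) without the parser argument should have the same effect, since a default parser is then created for each call.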

Parse errors can and do happen. They all have the same immediate reason: the parser raised an error. The underlying causes, however, can be many. Three common ones:
The input is invalid (e.g. invalid XML, as in your example)
The parser is incompatible (e.g. the XML input is valid, but encoded in a form or variant the parser cannot handle)
The parser itself has errors (e.g. software bugs)
As the parser you are using is itself software, and there is normally a bug in every ~173 lines of code, this could be worth a quick look.
But only if you can look quickly. It may not be worth it, because more often the problem is with the input, so it is probably better to look into that first.
In any case you're lucky: you want to process XML, and tooling for that exists. Validate the file on disk; the parse error from your program already hints that it might be invalid.
Also move that file out of the directory and run your script again. It might not be the only invalid file, and you will want to find out as quickly as possible how many of the remaining files cause problems with your script.
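Rather than moving files by hand, the whole directory can be checked in one pass. This is only a sketch under the assumption that "valid" here means well-formed (no schema validation); the name find_malformed is made up for illustration:

```python
import os
import xml.etree.ElementTree as ET


def find_malformed(xml_root_dir):
    """Return a list of (path, error message) for .xml files that fail to parse."""
    failures = []
    for root, _dirs, files in os.walk(xml_root_dir):
        for name in sorted(files):
            if not name.endswith(".xml"):
                continue
            path = os.path.join(root, name)
            try:
                # No parser argument: a new default parser is used for each file.
                ET.parse(path)
            except ET.ParseError as err:
                failures.append((path, str(err)))
    return failures
```

An empty result suggests the files themselves are well-formed and that the problem lies in how the script parses them.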

Related

Python ElementTree generate not well formed XML file with special character '\x0b'

I used ElementTree to generate XML containing the special character '\x0b', then used minidom to parse it. It throws a not well-formed error.
import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)
Generated XML: <root>\x0b</root>
Error:
Traceback (most recent call last):
File "testXml.py", line 7, in <module>
pretty_tree = minidom.parseString(xml)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
This behaviour has been raised as a bug in the past and resolved as "won't fix".
The author of the ElementTree module commented
For ET, [this behaviour is] very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.
The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
...
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
...
So in summary, ET.tostring can generate XML which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.
\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.
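As a workaround, I wrote a helper to strip the restricted characters before saving to the XML model:
import re
def clean(text):
    # Keep only characters allowed in XML 1.0; the supplementary planes need \U (8 hex digits), not \u.
    return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)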
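For illustration, here is a sketch of how such a helper fits into the example from the question. The function name strip_restricted_chars is hypothetical, and the character class assumes the XML 1.0 definition of allowed characters:

```python
import re
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Everything outside the characters allowed in XML 1.0 is removed.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_restricted_chars(text):
    """Return text with characters that XML 1.0 cannot represent removed."""
    return _INVALID_XML_CHARS.sub("", text)


root = ET.Element("root")
root.text = strip_restricted_chars("a\x0bb")  # the vertical tab is dropped
xml_bytes = ET.tostring(root, "UTF-8")
pretty_tree = minidom.parseString(xml_bytes)  # parses now that \x0b is gone
```

Note that this silently discards data; whether that is acceptable depends on the application.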

How to check if a file is a Python file in Python language

I'm working on a web application where candidates submit Python code and the platform runs it against a unit test written by the recruiters.
I'm using Django and I'm having trouble validating the unit-test file.
How can I test, using Python, that a file is a Python file and that it is correctly formed (no syntax or lexical errors)?
Thanks
You can use the built-in function compile. Pass the content of the file as a string as the first argument, the name of the file the code was read from as the second (if it wasn't read from a file, you can make up a name), and the mode as the third.
Try:
with open("full/file/path.py") as f:
    try:
        compile(f.read(), 'your_code_name', 'exec')
    except Exception as ex:
        print(ex)
In case the content of the test file is:
print 1
The compile function will throw a SyntaxError and the following message will be printed:
Missing parentheses in call to 'print'. Did you mean print(1)? (your_code_name, line 1)
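For use inside a web application, it may be more convenient to wrap this in a function that returns a result instead of printing. A possible sketch (the name check_python_source and the default filename are assumptions for illustration):

```python
def check_python_source(source, filename="<submitted>"):
    """Return (True, None) if source compiles, else (False, error message).

    compile() only parses and compiles the code; it does not execute it.
    """
    try:
        compile(source, filename, "exec")
    except (SyntaxError, ValueError) as err:
        # SyntaxError covers syntax and indentation errors; ValueError can
        # be raised for source containing null bytes on some Python versions.
        return False, str(err)
    return True, None
```

This only shows that the text is syntactically valid for the Python version running the check; it says nothing about whether the code is safe to run or whether it actually contains unit tests.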

Errno 22 Invalid argument - Zipfile Is Skipped

I am working on a project in Python in which I am parsing data from a zipped folder containing log files. The code works fine for most zips, but occasionally this exception is thrown:
[Errno 22] Invalid argument
As a result, the entire file is skipped, thus excluding the data in the desired log files from the results. When I try to extract the zipped file using the default Windows utility, I am met with this error:
Zip error
However, when I try to extract the file with 7zip, it does so successfully, save 2 errors:
1 <path> Unexpected End of Data
2 Data error: x.csv
x.csv is totally unrelated to the log I am trying to parse, and as such, I need to write code that is resilient to the point where if an unrelated file is corrupted, it will still be able to parse the other logs that are not.
At the moment, I am using the zipfile module to extract the files into memory. Is there a robust way to do this without the entire file being skipped?
Update 1: I believe the error I am running into is that the zipfile is missing a footer. I realized this when looking at it in a hex editor. I do not really have any idea how to safely edit the actual file using Python.
Here is the code that I am using to extract zips into memory:
for zip in os.listdir(directory):
    try:
        if zip.lower().endswith('.zip'):
            if os.path.isfile(directory + "\\" + zip):
                logs = zipfile.ZipFile(directory + "\\" + zip)
                for log in logs.namelist():
                    if log.endswith('log.txt'):
                        data = logs.read(log)
Edit 2: Traceback for the error:
Traceback (most recent call last):
File "c:/Users/xxx/Desktop/Python Porjects/PE/logParse.py", line 28, in parse
logs = zipfile.ZipFile(directory + "\\" + zip)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1222, in __init__
self._RealGetContents()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python37\lib\zipfile.py", line 1289, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
The stack trace shows that the error is not raised by your own code but by Python's zipfile module while it reads the archive.
It looks like Python's zip handling is stricter than that of other programs (see this bug, where a user reports a difference between Python's behaviour and other programs such as GNOME Archive Manager).
It may be worth filing a bug report.
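Until then, the loop can at least be made resilient so that one bad archive or one corrupt member does not stop the others from being processed. The following is a sketch with hypothetical names (read_logs, the suffix parameter). It does not repair an archive whose central directory is missing, as in the traceback above; such a file is recorded as an error and skipped:

```python
import os
import zipfile
import zlib


def read_logs(directory, suffix="log.txt"):
    """Return ({(zip name, member name): bytes}, [(name, error message)])."""
    data, errors = {}, []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (name.lower().endswith(".zip") and os.path.isfile(path)):
            continue
        try:
            archive = zipfile.ZipFile(path)
        except (zipfile.BadZipFile, OSError) as err:
            # The archive as a whole is unreadable: record it and move on.
            errors.append((name, str(err)))
            continue
        with archive:
            for member in archive.namelist():
                if not member.endswith(suffix):
                    continue  # unrelated members (e.g. x.csv) are never read
                try:
                    data[(name, member)] = archive.read(member)
                except (zipfile.BadZipFile, OSError, EOFError, zlib.error) as err:
                    # Only this member is corrupt; keep the others.
                    errors.append((f"{name}/{member}", str(err)))
    return data, errors
```

Because only members ending in the wanted suffix are read, a corrupt unrelated file inside an otherwise readable archive should not affect the logs.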

ElementTree's iterparse() XML parsing error

I need to parse a 1.2GB XML file that has an encoding of "ISO-8859-1", and after reading a few articles on the net, it seems that Python's ElementTree's iterparse() is preferred to SAX parsing.
I've written an extremely short piece of code just to test it out, but it raises an error that I have no idea how to solve.
My Code (Python 2.7):
from xml.etree.ElementTree import iterparse
for (event, node) in iterparse('dblp.xml', events=['start']):
    print node.tag
    node.clear()
Edit: Ahh, as the file was really big and laggy, I typed out the XML line, and made a mistake. It's "& uuml;" without the space. I apologize for this.
This code works fine until it hits a line in the XML file that looks like this:
<Journal>Technical Report 248, ETH Z&uuml;rich, Dept of Computer Science</Journal>
which I guess means Zurich, but the parser does not seem to know this.
Running the code above gave me an error:
xml.etree.ElementTree.ParseError: undefined entity &uuml;
Is there any way I could solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.
Try the following:
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
class CustomEntity:
    def __getitem__(self, key):
        if key == 'umml':
            key = 'uuml' # Fix invalid entity
        return unichr(htmlentitydefs.name2codepoint[key])
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()
OR
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
    print node.tag
    node.clear()
Related question: Python ElementTree support for parsing unknown XML entities?

How do I skip validating the URI in lxml?

I am using lxml to parse some XML files. I don't create them, I'm just parsing them. Some of the files contain invalid URIs for the namespaces. For instance:
'D:\Path\To\some\local\file.xsl'
I get an error when I try to process it:
lxml.etree.XMLSyntaxError: xmlns:xsi: 'D:\Path\To\some\local\file.xsl' is not a valid URI
Is there an easy way to replace any invalid URIs with something (anything, such as 'http://www.googlefsdfsd.com/')? I thought of writing a regex but was hoping for an easier way.
What the parser doesn't like are the backslashes in the namespace uri.
To parse the xml despite the invalid uris, you can instantiate an lxml.etree.XMLParser with the recover argument set to True and then use that to parse the file:
from lxml import etree
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse("xmlfile.xml", parser=recovering_parser)
...
If you are sure that those specific errors are not significant to your use case, you could just catch the exception:
try:
    # process your tree here
    SomeFn()
except etree.XMLSyntaxError as e:
    # etree is the name imported above with "from lxml import etree"
    print("Ignoring", e)
