This question appears related to this one from 2013, but it didn't help me.
I'm about to parse a large (2GB) XML file, and plan to do it with Python 3.5.2 and ElementTree. I'm new to Python, but it works well until reaching any escape character, such as:
<author>Sanjeev Saxöna</author>
returning:
test.xml
File "<string>", line unknown
ParseError: undefined entity ö: line 5, column 19enter code here
My code looks something like this:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse('test_esc.xml'):
# do something with the node
What's the best way to deal with this? Parsing the unescaped 'ö' actually works fine:
<author>Sanjeev Saxöna</author>
Is there an easy way to programmatically unescape the whole XML file?
As suggested by the answer linked by Soulaimane Sahmi, I added an inline DTD to the XML file. It is maybe not the best solution out there, but it works for now.
Related
I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom
filename = 'Test.xml'
dom = minidom.parse(filename)
with open( filename.replace('.xml','_Generated.xml'), mode='w', encoding='utf8' ) as fh:
fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Text_Generated.xml file is:
<?xml version="1.0" ?><body>
</body>
A simple solution is to first open the document and change all the empty CDATA nodes to include some value before parsing then removing the value from the new file after generation but this seems like unnecessary work and time for execution as some of these documents include tens of thousands of lines.
I partially debugged the issue down to the explatbuilder.py and it's parser. The parser is installed with custom callbacks. The callback that handles the data from the CDATA nodes is the character_data_handler_cdata method. The data that is supplied to this method is already missing after parsing.
Anyone know what is going on with this?
Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so % and % and % and <!CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <!CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches signficance to CDATA tags then it's out of line with usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
I'm having an issue where I'm using beautifulsoup to parse the xml generated from a Tableau workbook and when I write the results to file it doesn't behave as expected. Chose bs4 and it's standard XML parser, because I find it easiest for my brain to comprehend and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that template workbook will go to. I've already gone through and built some functions and scripted out everything I need to get the data to do this, but when my script writes the xml to file it adds some encodings for ampersand. This results in the file being valid and able to be opened in Tableau, but the field is considered invalid, despite looking like it is valid. I'm thinking the XML is some how getting malformed somewhere in my process.
Code so far for where I think the issue is occuring:
import bs4 as bs
twb = open(Script_config['local_file_location'], 'r')
bs_content = bs(twb, 'xml')
# formula_final below comes from another script that handles getting the data I need to programmatically generate the formula I need.
# Here is what I use to generate the bulk of the formula for Tableau
# 'When '[{}]' then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName']))
# Does some other stuff and slaps together the formula I need as a string that can be written into my XML
# Verified that my result is coming over correctly and only changes once I do the replacement here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype':'string', 'name':'[Calculation_12345678910]'}):
calculation.find('calculation')['formula'] = formula_final
with open('test.twb', 'w') as file:
file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When '[Territory]' then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau is expecting the XML to be formatted without the & in front of the apos;. It should just be reading as '.
What I've tried:
Thinking that I could just escape the & character I put the necessary slashes in place to escape it before the apos; portion, but to no avail I can't figure out how to get my XML to be formed so that it doesn't always put the ampersand code as part of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as 'double escaping'. Your program is reading data which has already been serialized by an XML processor. That's why it contains '[{}]' and not '[{}]'
I think your program reads that XML value from a file as a simple string and assigns it to the value of a tag. But when BeautifulSoup's XML processor encounters the & in the tag value it must replace it with &. So you end up with '' instead of ' in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
I used ElementTree to generate xml with special character of '\x0b', then use minidom to parse it. It will throw not well-formed error.
import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)
Generated XML: <root>\x0b</root>
Error:
Traceback (most recent call last):
File "testXml.py", line 7, in <module>
pretty_tree = minidom.parseString(xml)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
This behaviour has been raised as a bug in the past and resolved as "won't fix".
The author of the ElementTree module commented
For ET, [this behaviour is] very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.
The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
...
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
...
So in summary, ET.tostring will generate xml which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.
\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.
As a workaround for myself, I wrote a helper method to clean the restricted chars before saving to XML model:
def clean(str):
return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u10000-\u10FFF]+', '', str)
I need to parse a 1.2GB XML file that has an encoding of "ISO-8859-1", and after reading a few articles on the NET, it seems that Python's ElementTree's iterparse() is preferred as to SAX parsing.
I've written a extremely short piece of code just to test it out, but it's prompting out an error that I've no idea how to solve.
My Code (Python 2.7):
from xml.etree.ElementTree import iterparse
for (event, node) in iterparse('dblp.xml', events=['start']):
print node.tag
node.clear()
Edit: Ahh, as the file was really big and laggy, I typed out the XML line, and made a mistake. It's "& uuml;" without the space. I apologize for this.
This code works fine until it hits a line in the XML file that looks like this:
<Journal>Technical Report 248, ETH Zürich, Dept of Computer Science</Journal>
which I guess means Zurich, but the parser does not seem to know this.
Running the code above gave me an error:
xml.etree.ElementTree.ParseError: undefined entity ü
Is there anyway I could solve this issue? I've googled quite a few solutions, but none seem to deal with this problem directly.
Try following:
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
class CustomEntity:
def __getitem__(self, key):
if key == 'umml':
key = 'uuml' # Fix invalid entity
return unichr(htmlentitydefs.name2codepoint[key])
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = CustomEntity()
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
print node.tag
node.clear()
OR
from xml.etree.ElementTree import iterparse, XMLParser
import htmlentitydefs
parser = XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = {'umml': unichr(htmlentitydefs.name2codepoint['uuml'])}
for (event, node) in iterparse('dblp.xml', events=['start'], parser=parser):
print node.tag
node.clear()
Related question: Python ElementTree support for parsing unknown XML entities?
I'm trying to query a database, then convert the file-like object it returns to an XML document. Here's what I've been doing:
>>> import urllib, xml.dom.minidom
>>> query = "http://sbol.bhi.washington.edu/openrdf-sesame/repositories/sbol_test?query=select%20distinct%20%3Fname%20%3Ffeaturename%20where%20%7B%3Fpart%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23annotation%3E%20%3Fannotation%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23status%3E%20'Available'%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Fname.%3Fannotation%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23feature%3E%20%3Ffeature.%3Ffeature%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23binding%3E%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Ffeaturename%7D"
>>> raw_result = urllib.urlopen(query)
>>> xml_result = xml.dom.minidom.parse(raw_result)
That last command gives me
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 4
Almost the same thing happens if I use xml.etree.ElementTree to do the parsing. I think they both use Expat. The weird part is, if instead of loading the file in python I just paste the query into Firefox, the resulting file can be read in perfectly well using open(path_to_file, "r").
Any ideas what this could be?
UPDATE:
This is the first line of the file:
<?xml version='1.0' encoding='UTF-8'?>
However that may not be what's in raw_result... that's what you get after downloading query-result.srx and changing the extension to .txt. The file extension doesn't matter does it? Also, I'm pretty new to this whole xml thing—why is column 4 the 8th character? – Jeff 0 secs ago edit
Your server is picky about the accept header in deciding what to send back and in which format. The following should work:
In [265]: import urllib2
In [266]: req = urllib2.Request(query, headers={'Accept':'application/xml'})
In [267]: rsp = urllib2.urlopen(req)
In [268]: xml = minidom.parse(rsp)
In [268]: xml.toxml()[:64]
Out[268]: u'<?xml version="1.0" ?><sparql xmlns="http://www.w3.org/2005/spar'
Note the accept header in urllib2.Request.
Any chance you could post the XML snippet? The parser is indicating that the error is happening at the very first line. My guess is the formatting is off or reporting incorrectly, which is causing EXPAT to pitch an exception right off the bat.
My guess is that first line violates something in the "well formed XML" content anwyay. For reference, you might compare against http://en.wikipedia.org/wiki/XML
Looks like something is wrong with your XML file, right about line 1, column 4.
I tried this, and what I got doesn't look like XML to me. Here are the first eight characters, as Alex suggested:
>>> raw_result.read(8)
'BRTR\x00\x00\x00\x03'
It seems that the RDF server is delivering plain text to your urllib.urlopen call.
You should be able, with setting the right header
Accept: application/sparql-results+xml, */*;q=0.5
, to get the xml response. You have to read the RDF protocol specification of openRDF for details - there is for openRDF more than one format.