Python XML parsing removing empty CDATA nodes - python

I'm using minidom from xml.dom to parse an xml document. I make some changes to it and then re-export it back to a new xml file. This file is generated by a program as an export and I use the changed document as an import. Upon importing, the program tells me that there are missing CDATA nodes and that it cannot import.
I simplified my code to test the process:
from xml.dom import minidom

filename = 'Test.xml'
dom = minidom.parse(filename)
with open(filename.replace('.xml', '_Generated.xml'), mode='w', encoding='utf8') as fh:
    fh.write(dom.toxml())
Using this for the Test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[]]>
</body>
This is what the Test_Generated.xml file contains:
<?xml version="1.0" ?><body>
</body>
A simple workaround would be to open the document first, change all the empty CDATA nodes to include some placeholder value before parsing, and then strip that value from the new file after generation. But this seems like unnecessary work and execution time, as some of these documents run to tens of thousands of lines.
I partially debugged the issue down to expatbuilder.py and its parser. The parser is installed with custom callbacks; the callback that handles data from CDATA nodes is the character_data_handler_cdata method. The data supplied to this method is already missing after parsing.
Anyone know what is going on with this?

Unfortunately the XML specification is not 100% explicit about what counts as significant information in a document and what counts as noise. But there's a fairly wide consensus that CDATA tags serve no purpose other than to delimit text that hasn't been escaped: so %, &#37;, &#x25; and <![CDATA[%]]> are different ways of writing the same content, and whichever of these you use in your input, the XML parser will produce the same output. On that assumption, an empty <![CDATA[]]> represents "no content" and a parser will remove it.
If your document design attaches significance to CDATA tags then it's out of line with the usual practice followed by most XML tooling, and it would be a good idea to revise the design to use element tags instead.
Having said that, many XML parsers do have an option to report CDATA tags to the application, so you may be able to find a way around this, but it's still not a good design choice.
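For example, lxml (a third-party library, not part of the standard library) exposes such an option. A minimal sketch, assuming lxml is installed; whether an empty CDATA section survives the round trip is still worth verifying against your import tool:

from lxml import etree

# strip_cdata=False tells lxml to keep CDATA sections as CDATA in the
# tree instead of merging them into plain text (the default behaviour).
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse('Test.xml', parser)
tree.write('Test_Generated.xml', xml_declaration=True, encoding='UTF-8')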

Related

Beautifulsoup4, parsing Tableau XML file, and writing to file

I'm having an issue where I'm using BeautifulSoup to parse the XML generated from a Tableau workbook, and when I write the results to file it doesn't behave as expected. I chose bs4 and its standard XML parser because I find it easiest for my brain to comprehend, and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that the template workbook will go to. I've already gone through and built some functions and scripted out everything I need to get the data to do this, but when my script writes the XML to file it adds extra encoding around the ampersands. This results in the file being valid and able to be opened in Tableau, but the field is considered invalid despite looking valid. I'm thinking the XML is somehow getting malformed in my process.
Code so far, for where I think the issue is occurring:
from bs4 import BeautifulSoup as bs

with open(Script_config['local_file_location'], 'r') as twb:
    bs_content = bs(twb, 'xml')

# formula_final below comes from another script that handles getting the data
# I need to programmatically generate the formula.
# Here is what I use to generate the bulk of the formula for Tableau:
# 'When &apos;[{}]&apos; then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName'])
# It does some other stuff and slaps together the formula I need as a string
# that can be written into my XML.
# Verified that my result is coming over correctly and only changes once I do
# the replacement here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype': 'string', 'name': '[Calculation_12345678910]'}):
    calculation.find('calculation')['formula'] = formula_final

with open('test.twb', 'w') as file:
    file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When &apos;[Territory]&apos; then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau is expecting the XML to be formatted without the &amp; in front of the apos; portion. In other words, my output contains &amp;apos; where it should just read &apos;.
What I've tried:
Thinking that I could just escape the & character, I put the necessary slashes in place to escape it before the apos; portion, but to no avail: I can't figure out how to form my XML so that it doesn't emit the ampersand entity code in front of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as 'double escaping'. Your program is reading data which has already been serialized by an XML processor. That's why it contains &apos;[{}]&apos; and not '[{}]'.
I think your program reads that XML value from a file as a simple string and assigns it to the value of a tag. But when BeautifulSoup's XML processor encounters the & in the tag value, it must replace it with &amp;. So you end up with &amp;apos; instead of &apos; in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
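A minimal sketch of the quick fix, assuming the formula arrives as an already-escaped string in formula_final; unescape() only handles &amp;, &lt; and &gt; by default, so &apos; and &quot; are passed as extra entities:

from xml.sax.saxutils import unescape

# Undo the pre-existing escaping before assigning the attribute, so
# BeautifulSoup escapes the value exactly once on output.
formula_final = unescape(formula_final, {'&apos;': "'", '&quot;': '"'})
calculation.find('calculation')['formula'] = formula_final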

Python ElementTree generate not well formed XML file with special character '\x0b'

I used ElementTree to generate XML containing the special character '\x0b', then used minidom to parse it. It throws a not-well-formed error.
import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)
Generated XML: <root>\x0b</root>
Error:
Traceback (most recent call last):
File "testXml.py", line 7, in <module>
pretty_tree = minidom.parseString(xml)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
This behaviour has been raised as a bug in the past and resolved as "won't fix".
The author of the ElementTree module commented
For ET, [this behaviour is] very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.
The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
...
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
...
So in summary, ET.tostring will generate xml which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.
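For example, a sketch of that round-trip check; re-parsing the serialised bytes raises the same error minidom reported above, before any file is written:

import xml.etree.ElementTree as ET

root = ET.Element('root')
root.text = '\x0b'
xml = ET.tostring(root, 'UTF-8')
try:
    ET.fromstring(xml)  # re-parse to verify the output is well-formed
except ET.ParseError as err:
    print('serialised XML is not well-formed:', err)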
\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.
As a workaround for myself, I wrote a helper function to strip the restricted characters before saving to the XML model:

import re

def clean(text):
    # Remove everything outside the XML 1.0 valid character ranges.
    return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
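For example, with a hypothetical input string, the vertical tab is stripped before serialisation:

root = ET.Element('root')
root.text = clean('bad \x0b text')
print(ET.tostring(root))  # b'<root>bad  text</root>'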

how to parse xml with multiple root elements

I need to parse both var & group root elements.
Code
import xml.etree.ElementTree as ET
tree_ownCloud = ET.parse('0020-syslog_rules.xml')
root = tree_ownCloud.getroot()
Error
xml.etree.ElementTree.ParseError: junk after document element: line 17, column 0
Sample XML
<var name="BAD_WORDS">core_dumped|failure|error|attack| bad |illegal |denied|refused|unauthorized|fatal|failed|Segmentation Fault|Corrupted</var>
<group name="syslog,errors,">
<rule id="1001" level="2">
<match>^Couldn't open /etc/securetty</match>
<description>File missing. Root access unrestricted.</description>
<group>pci_dss_10.2.4,gpg13_4.1,</group>
</rule>
<rule id="1002" level="2">
<match>$BAD_WORDS</match>
<options>alert_by_email</options>
<description>Unknown problem somewhere in the system.</description>
<group>gpg13_4.3,</group>
</rule>
</group>
I tried following a couple of other questions on Stack Overflow, but none helped.
I know the reason it is not getting parsed; people have usually tried hacks around it. IMO it's a very common use case to have multiple root elements in XML, and there must be something in the ET parsing library to get this done.
As mentioned in the comment, an XML file cannot have multiple roots. Simple as that.
If you do receive/store data in this format (then it's not proper XML), you could consider the hack of surrounding what you have with a fake root tag, e.g.

import xml.etree.ElementTree as ET

with open("0020-syslog_rules.xml", "r") as inputFile:
    fileContent = inputFile.read()

root = ET.fromstring("<fake>" + fileContent + "</fake>")
print(root)
Actually, the example data is not a well-formed XML document, but it is a well-formed XML entity. Some XML parsers have an option to accept an entity rather than a document, and in XPath 3.1 you can parse this using the parse-xml-fragment() function.
Another way to parse a fragment like this is to create a wrapper document which references it as an external entity:
<!DOCTYPE wrapper [
<!ENTITY e SYSTEM "fragment.xml">
]>
<wrapper>&e;</wrapper>
and then supply this wrapper document as the input to your XML parser.
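Python's built-in ElementTree will not expand the external entity (expat reports it as undefined), but lxml can. A hypothetical sketch, assuming the fragment is saved as fragment.xml next to a wrapper.xml containing the document above:

from lxml import etree

# load_dtd and resolve_entities make lxml read the DTD subset and
# substitute &e; with the contents of fragment.xml.
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('wrapper.xml', parser)
print([el.tag for el in tree.getroot()])  # e.g. ['var', 'group']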

Python ElementTree ParseError from iterparse when reaching escape character (XML)

This question appears related to this one from 2013, but it didn't help me.
I'm about to parse a large (2 GB) XML file and plan to do it with Python 3.5.2 and ElementTree. I'm new to Python, but it works well until it reaches an escaped character, such as:
<author>Sanjeev Sax&ouml;na</author>
returning:
test.xml
File "<string>", line unknown
ParseError: undefined entity &ouml;: line 5, column 19
My code looks something like this:
import xml.etree.ElementTree as etree

for event, elem in etree.iterparse('test_esc.xml'):
    pass  # do something with the node
What's the best way to deal with this? Parsing the unescaped 'ö' actually works fine:
<author>Sanjeev Saxöna</author>
Is there an easy way to programmatically unescape the whole XML file?
As suggested by the answer linked by Soulaimane Sahmi, I added an inline DTD to the XML file. It is maybe not the best solution out there, but it works for now.
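A minimal sketch of that workaround; the root element name authors is hypothetical, and the key point is that &ouml; is an HTML entity, not one of XML's five built-ins, so it has to be declared before expat will accept it:

import xml.etree.ElementTree as ET

doc = """<!DOCTYPE authors [
<!ENTITY ouml "&#246;">
]>
<authors><author>Sanjeev Sax&ouml;na</author></authors>"""

root = ET.fromstring(doc)
print(root.find('author').text)  # Sanjeev Saxöna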

Close all opened xml tags

I have a file whose content changes within a short time, and I'd like to read it before it is complete. The problem is that it is an XML file (a log), so when you read it, it may be that not all tags are closed yet.
I would like to know whether there is a way to close all opened tags correctly, so that the file can be displayed in a browser without problems (with an XSLT stylesheet). This should be done using features included with Python.
Some XML parsers allow incremental parsing of XML documents, that is, the parser can start working on the document without needing it to be fully loaded. The XMLTreeBuilder from the xml.etree.ElementTree module in the Python standard library (called XMLParser in current Python 3 releases) is one such parser: Element Tree
As you can see in the example below, you can feed data to the parser bit by bit as you read it from your input source. The appropriate hook methods in your handler class get called as the various XML "events" happen (tag started, tag data read, tag ended), allowing you to process the data while the XML document is loaded:
from xml.etree.ElementTree import XMLParser  # named XMLTreeBuilder in old releases

class MyHandler(object):
    def start(self, tag, attrib):
        # Called for each opening tag.
        print(tag + " started")

    def end(self, tag):
        # Called for each closing tag.
        print(tag + " ended")

    def data(self, data):
        # Called when text data is read from a tag.
        print(data + " data read")

    def close(self):
        # Called when all data has been parsed.
        print("All data read")

handler = MyHandler()
parser = XMLParser(target=handler)
parser.feed('<sometag>')
parser.feed('<sometag-child-tag>text')
parser.feed('</sometag-child-tag>')
parser.feed('</sometag>')
parser.close()
In this example the handler receives five tag events plus the final close, and prints:
sometag started
sometag-child-tag started
text data read
sometag-child-tag ended
sometag ended
All data read
If I am understanding your question correctly, you have a log file that is always being appended to so you get something like:
<root>
<entry> ... </entry>
<entry> ... </entry>
...
<entry> ... </entry
<!-- no closing root -->
In this case you DON'T want to use a DOM parser because it tries to read a complete document and would choke on the missing tag. Instead, a SAX or Pull parser would work because it reads the document like a stream of data rather than a complete tree. As Denis replied above, you could either close the missing tag at the end or ignore any incomplete tags before writing it out.
XML parsing on Wikipedia
You can use any SAX parser by feeding it the data available so far. Use a SAX handler that simply reconstructs the source XML, keep a stack of the tags opened, and close them in reverse order at the end.
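A minimal sketch of that approach; the handler and input names are illustrative, and the partial input must end on a tag boundary, since half-written trailing markup would still trip the parser (and trailing text may sit unreported in its buffer):

import xml.sax
from xml.sax.saxutils import escape, quoteattr

class RepairHandler(xml.sax.ContentHandler):
    # Re-emits the XML seen so far and tracks which tags are still open.
    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def startElement(self, name, attrs):
        attr_text = ''.join(' %s=%s' % (k, quoteattr(v)) for k, v in attrs.items())
        self.out.append('<%s%s>' % (name, attr_text))
        self.stack.append(name)

    def characters(self, content):
        self.out.append(escape(content))

    def endElement(self, name):
        self.out.append('</%s>' % name)
        self.stack.pop()

    def close_open_tags(self):
        # Close whatever is still open, innermost first.
        while self.stack:
            self.out.append('</%s>' % self.stack.pop())
        return ''.join(self.out)

handler = RepairHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.feed('<root><entry>one</entry><entry>')  # truncated log
print(handler.close_open_tags())  # <root><entry>one</entry><entry></entry></root>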
You could use BeautifulStoneSoup (XML part of BeautifulSoup).
www.crummy.com/software/BeautifulSoup
It's not ideal, but it would circumvent the problem if you cannot fix the file's output...
It's basically a ready-made implementation of what Denis described.
You can just feed whatever you have into the soup and it will do its best to fix it.
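Note that BeautifulStoneSoup belongs to the old BeautifulSoup 3; in bs4 the same idea is spelled differently. A hypothetical sketch, assuming bs4 with lxml installed (its 'xml' mode parses in recovery mode, which closes dangling tags):

from bs4 import BeautifulSoup

broken = '<root><entry>one</entry><entry>two'
soup = BeautifulSoup(broken, 'xml')  # lxml's recovering XML parser under the hood
print(str(soup))  # dangling tags are closed in the output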
