Python xml.etree.ElementTree, getting HTML entities - python

I am trying to analyze XML data and ran into an issue with HTML entities when I use:
import xml.etree.ElementTree as ET
tree = ET.parse(my_xml_file)
root = tree.getroot()
for regex_rule in root.findall('.//regex_rule'):
    print(regex_rule.get('input'))  # this ".get()" method turns &lt; into <, but I want to get &lt; as written
    print(regex_rule.get('input') == "(?&lt;!\S)hello(?!\S)")  # prints False because ElementTree's get method turns &lt; into <, is that right?
And here are the XML file contents:
<rules>
    <regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>
I would appreciate it if anybody could direct me to getting the string as is from the XML attribute for the input, without converting &lt; into <.

xml.etree.ElementTree is doing exactly the standards-compliant thing, which is to decode XML character entities with the understanding that they do in fact encode the referenced character and should be interpreted as such.
The preferred course of action, if you do need to encode the literal &lt;, is to change your input file to use &amp;lt; instead (i.e. XML-encode the &).
If you can't change your input file format then you'll probably need to use a different module, or write your own parser: xml.etree.ElementTree translates entities well before you can do anything meaningful with the output.
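A minimal sketch of both points, using only the standard library. The XML strings are inlined here for illustration (the question reads them from a file). The last step, re-escaping with xml.sax.saxutils.escape, is an added suggestion rather than part of the answer above: it gives back the &lt; spelling, but it cannot tell you whether the file originally used &lt; or some other reference such as &#60;.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Attribute as written in the question: the parser decodes &lt; to "<".
plain = ET.fromstring(r'<rules><regex_rule input="(?&lt;!\S)hello(?!\S)"/></rules>')
plain_value = plain.find('.//regex_rule').get('input')

# Same attribute with the ampersand itself escaped: &amp;lt; decodes to "&lt;".
escaped = ET.fromstring(r'<rules><regex_rule input="(?&amp;lt;!\S)hello(?!\S)"/></rules>')
escaped_value = escaped.find('.//regex_rule').get('input')

# If the file cannot be changed, re-escaping the decoded value is one workaround.
re_escaped = escape(plain_value)

print(plain_value)    # (?<!\S)hello(?!\S)
print(escaped_value)  # (?&lt;!\S)hello(?!\S)
print(re_escaped)     # (?&lt;!\S)hello(?!\S)
```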

Related

Elementtree and Unicode or UTF-8 confusion

Okay, I feel a bit lost right now. I have some problems with unicode (or utf-8 ?)
I am using Python3.3 on linux (But I have the same problem on windows).
I try to create an XML file with Elementtree.
item = ET.Element("item")
item_title = ET.SubElement(item, "title")
That is of course not everything, just an example.
So now I want the 'title' tag to have a text like this (replace the ##Content## with random content, it doesn't matter so much):
# That's how I create the text for the tag
item.title.text = u'<![CDATA[##CONTENT##]>'
# This is how I want it to look
<title><![CDATA[##CONTENT##]></title>
# That's what I get
<title>&lt;![CDATA[##CONTENT##]&gt;</title>
# These are some of the things I tried for writing it to an XML file
ET.ElementTree(item).write(myOutputFile, encoding="unicode")
myOutputFile.write(ET.tostring(item, encoding='unicode', method='xml'))
myOutputFile.write(str(ET.tostring(item, encoding='utf-8', method='xml')))
myOutputFile.write(str(ET.tostring(item)))
# Oh, and that's how I open the file for writing
myOutputFile = codecs.open(HereIsMyFile, 'w', encoding='utf-8')
I tried to search and found some similar sounding problems (some of the things I tried are from SO already), but none seems to work. They changed some stuff in the output, but never showed the < or >.
I also noticed, if I use utf-8 I have to use str() when writing to the file. That got me also confused about the difference in unicode and utf-8, I tried to read some stuff about that but that didn't really help me in my actual problem.
At this point I don't really know where to look for my error and I would love a hint where to look.
Is it the way I write to the file? How I open it?
Or is it Elementtree causing the error? (I didn't try something else, like lxml, because well, that would mean rewriting a lot of stuff I guess).
I hope you can help me and if something isn't clear I will try to explain it a bit better!
Edit: Oh and I also tried to open the file without codecs, because I somewhere read it is not needed anymore in Python3.x but I wasn't so sure anymore, so I tried it.
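The correct way to write an XML document with ElementTree is:
with open(HereIsMyFile, 'wb') as myOutputFile:
    ET.ElementTree(item).write(myOutputFile, encoding='utf-8')
If you specify an encoding for write(), it should be one that the XML standard defines. Unicode is a standard rather than a byte encoding; in Python 3, ElementTree accepts encoding="unicode" only as a special value that produces text (str) instead of bytes.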
ElementTree doesn't support CDATA. The effect you're seeing is that ElementTree notices special characters in the text of the node and it escapes them; there is no way to prevent that.
This answer contains the implementation of a CDATA element: How to output CDATA using ElementTree
There seem to be a couple of layers of confusion here.
Taking the lower level first: encodings such as UTF-8 convert Unicode characters into bytes. Your problem is that the characters in your generated XML aren’t the ones you want, not with how those characters are stored as bytes, so there isn’t anything to fix there.
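To make the characters-versus-bytes distinction concrete, here is a small sketch for Python 3 (the element content is made up for illustration). It also shows why wrapping the UTF-8 result in str() is a trap: str() on a bytes object gives its repr, not the decoded text.

```python
import xml.etree.ElementTree as ET

item = ET.Element("item")
ET.SubElement(item, "title").text = "caf\u00e9"

# encoding="unicode" is a special value: the result is text (str).
as_text = ET.tostring(item, encoding="unicode")

# A real encoding such as utf-8 gives bytes.
as_bytes = ET.tostring(item, encoding="utf-8")

# str() on bytes yields the repr ("b'...'"), which is not what belongs in a file.
wrong = str(as_bytes)

# Decoding is the way back from bytes to characters.
decoded = as_bytes.decode("utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```

Either form can be written out: str to a file opened in text mode with encoding='utf-8', bytes to a file opened in binary mode.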
Secondly, you seem to be expecting the wrong thing from this line:
item.title.text = u'<![CDATA[##CONTENT##]>'
This tells ElementTree that you want that text in the parsed document. Consider this:
item.title.text = u'I <3 ASCII art.'
ElementTree won’t store that directly in the markup: it’ll turn it into
<title>I &lt;3 ASCII art.</title>
Likewise:
item.title.text = u"This </title> isn’t the end of the title"
becomes
<title>This &lt;/title&gt; isn’t the end of the title</title>
Hopefully you can see the value of this: no matter what text you put in there, it won’t break the element markup, or indeed affect it in any way.
Note that because of this automatic conversion, you very likely don’t need CDATA sections at all.
If for some reason you do, though, you can do it by stating it explicitly (using lxml.etree):
import lxml.etree
title = lxml.etree.Element('title')
title.text = lxml.etree.CDATA('###CONTENT###')
print(lxml.etree.tostring(title))
outputs:
<title><![CDATA[###CONTENT###]]></title>
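The automatic escaping described in this answer can be checked with the standard library alone. This is a sketch for Python 3 with a made-up element tree; it shows that the escaped text comes back unchanged when the document is parsed again, which is why a CDATA section is rarely needed.

```python
import xml.etree.ElementTree as ET

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = "This </title> isn't the end of the title"

# Serialising escapes the markup characters inside the text node.
serialized = ET.tostring(item, encoding="unicode")
print(serialized)
# <item><title>This &lt;/title&gt; isn't the end of the title</title></item>

# Parsing the result gives back exactly the text that was assigned.
round_tripped = ET.fromstring(serialized).find("title").text
```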

How to iterate through all XML Elements and apply logic to each Element's value with ElementTree for Python

I am currently trying to apply logic to Element values in a XML file. Specifically I am trying to encode all the values to UTF-8 while not touching any of the element names/attributes themselves.
Here is the sample XML:
<?xml version="1.0"?>
<sd_1>
<sd_2>
<sd_3>\311 is a fancy kind of E</sd_3>
</sd_2>
</sd_1>
Currently I have tried 3 methods to achieve this with no success:
First I tried looping through each element, retrieving the values with .text and using .parse:
import xml.etree.ElementTree as ET
et = ET.parse('xml/test.xml')
for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.encode('utf-8')
et.write('output.xml')
This results in an XML file that does not have the text \311 altered correctly, it just stays as it is.
Next I tried the .iterparse with cElementTree to no avail:
import xml.etree.cElementTree as etree
xml_file_path = 'xml/test.xml'
with open(xml_file_path) as xml_file:
    tree = etree.iterparse(xml_file)
    for items in tree:
        for item in items:
            print item.text
etree.write('output1.xml')
This results in:
"...print item.text\n', "AttributeError: 'str' object has no attribute 'text'..."
Not sure what I am doing wrong there, I have seen multiple examples with the same arrangement, but when I print through the elements without the .text I see the tuple with a string value of 'end' at the start and I think that is causing the issue with this method.
How do I properly iterate through my elements, and without specifying the element names e.g. .findall(), apply logic to the values housed in each Element so that when I write the xml to file it saves the changes made when the program was iterating through element values?
Is this what you are looking for?
import xml.etree.ElementTree as ET
et = ET.parse('xml/test.xml')
for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.decode('unicode-escape')
et.write('output.xml')
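On the AttributeError from the iterparse attempt: iterparse yields (event, element) pairs, so looping over each yielded item walks the tuple and hits the event string ('end') first. A short Python 3 sketch with inlined sample XML (the content is made up) that unpacks the pair instead:

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b'<sd_1><sd_2><sd_3>some text</sd_3></sd_2></sd_1>'

seen = []
# Each item is a 2-tuple: the event name and the element it refers to.
for event, element in ET.iterparse(io.BytesIO(xml_bytes), events=('end',)):
    seen.append((event, element.tag, element.text))

for entry in seen:
    print(entry)
```

Elements arrive innermost first, because an 'end' event fires when the closing tag is read.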
This is an interesting question. Let's focus on the first method you proposed, as that should be a totally fine way to approach this problem. When I print out the lines one by one, here is what I get:
>>> core_value
'\\311 is a fancy kind of E'
What happened for me is that the character was read as a literal '\', which must be escaped to be printed as such. If we change the escaped character (\\) to a non-escaped character (\), we get the following:
>>> cv = core_value.replace('\\311','\311')
>>> cv
'\xc9 is a fancy kind of E'
>>> print cv
É is a fancy kind of E
The awkward part here is that you can't tell whether \311 in the original file is "supposed to be" one character or four. If you know for a fact that every such sequence is one character, you can write some vile code based on this answer, "Python Unicode, have unicode number in normal string, want to print unicode", to transform everything that comes after a \ into the correct Unicode character and delete the \.
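For Python 3, one way to do this without naming any elements is to walk the tree with iter() and rewrite each text node. This is only a sketch: the helper name decode_octal_escapes is made up, and it assumes every backslash followed by three octal digits is meant as a single character, which is exactly the assumption flagged above.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a backslash plus three octal digits always encodes one character.
OCTAL_ESCAPE = re.compile(r'\\([0-7]{3})')

def decode_octal_escapes(text):
    """Replace sequences like \\311 with the character they number (hypothetical helper)."""
    return OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), text)

root = ET.fromstring(r'<sd_1><sd_2><sd_3>\311 is a fancy kind of E</sd_3></sd_2></sd_1>')

# iter() visits the root and every descendant, whatever the tag names are.
for element in root.iter():
    if element.text:
        element.text = decode_octal_escapes(element.text)

result = root.find('.//sd_3').text
print(result)
```

Writing the tree back with ET.ElementTree(root).write('output.xml', encoding='utf-8') then stores the changed text as UTF-8.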

Is there a way to recover iterparse on invalid Char values?

I'm using lxml's iterparse to parse some big XML files (3-5Gig). Since some of these files have invalid characters a lxml.etree.XMLSyntaxError is thrown.
When using lxml.etree.parse I can provide a parser which recovers on invalid characters:
parser = lxml.etree.XMLParser(recover=True)
root = lxml.etree.parse(open("myMalformed.xml"), parser)
Is there a way to get the same functionality for iterparse?
Edit:
Encoding is not an issue here. There are invalid characters in these XML files, which can be sanitized by defining an XMLParser with recover=True. Since I need to use iterparse for this, I can't use a custom parser. So I'm looking for the functionality provided in my snippet above, but for this:
context = etree.iterparse(open("myMalformed.xml"), events=('end',), tag="Foo")  # <-- can't recover
When you say invalid characters, do you mean unicode characters? If so you can try
lxml.etree.XMLParser(encoding='UTF-8', recover=True)
If you mean malformed XML then this obviously won't work. If you can post your traceback, we can see the nature of the XMLSyntaxError which will provide more information.
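If recovery inside the parser is not available, another option is to clean the byte stream before the parser sees it. The sketch below uses the standard library's iterparse rather than lxml, and it rests on two assumptions: the offending characters are control bytes that XML 1.0 forbids, and the file uses an ASCII-compatible encoding such as UTF-8 (so those byte values never occur inside a multi-byte character). SanitizingReader is a made-up name, not an existing API.

```python
import io
import xml.etree.ElementTree as ET

# Bytes below 0x20 that XML 1.0 does not allow (tab, LF and CR are fine).
_INVALID = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))

class SanitizingReader:
    """Hypothetical file-like wrapper that drops forbidden control bytes on read."""

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            cleaned = chunk.translate(None, _INVALID)
            # Return only at real end of file or when something is left after
            # cleaning, so an all-garbage chunk is not mistaken for EOF.
            if cleaned or not chunk:
                return cleaned

# Inlined sample standing in for open("myMalformed.xml", "rb").
dirty = io.BytesIO(b'<root><Foo>ok\x00\x08</Foo><Foo>fine</Foo></root>')

texts = []
for _, element in ET.iterparse(SanitizingReader(dirty), events=('end',)):
    if element.tag == 'Foo':
        texts.append(element.text)
        element.clear()  # free memory as we go, useful for multi-gigabyte files

print(texts)
```

The wrapper reads in the same chunks the parser asks for, so memory use stays flat regardless of file size.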

How can I disable 'output escaping' in minidom

I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '&#174;' #Both accepted xml encodings for registered trademark
data = '&reg;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because the ampersands are escaped when I print text.toxml(), I get this output:
&reg;
&amp;reg;
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &#174; or &reg; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped, in the sense that it actually caused the ® symbol itself to be output to my XML. That is not the result I am looking for, though. As you may have guessed by now (and perhaps I should have specified earlier), this XML document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this led to the same original problem, where all that happens is that the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having &#174; or &reg; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping &#174; and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as entities (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
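For reference, the same approach in Python 3 looks roughly like this. It is a sketch with a made-up text value; the node is created through a Document here, which is one reasonable way to obtain a Text node.

```python
import xml.dom.minidom as mdom

doc = mdom.Document()
text = doc.createTextNode('Acme\u00ae & Co')

# toxml() returns a str and escapes only the markup characters (& here).
as_text = text.toxml()

# Encoding to ASCII with xmlcharrefreplace turns every non-ASCII character
# into a numeric character reference, without touching the existing &amp;.
as_ascii = as_text.encode('ascii', 'xmlcharrefreplace')

print(as_ascii)  # b'Acme&#174; &amp; Co'
```

Because the replacement happens after serialisation, the ampersand of &#174; is never seen by minidom and so is not escaped a second time.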
Default unescape:
from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
The result is,
'< & >'
And, unescape more:
unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
Check details here, https://wiki.python.org/moin/EscapingXml

How to write ampersand in node attribute?

I need to have the following attribute value in my XML node:
CommandLine="copy $(TargetPath) ..\..\&#x0D;&#x0A;echo dummy > dummy.txt"
Actually this is part of a .vcproj file generated in VS2008.
&#x0D;&#x0A; means a line break, as there should be 2 separate commands.
I'm using Python 2.5 with minidom to parse XML, but unfortunately I don't know how to store sequences like &#x0D;&#x0A;; the best thing I can get is &amp;#x0D;.
How can I store exactly &#x0D;&#x0A;?
UPD: Strictly speaking, I have to store not &, but the \r\n sequence in the form of &#x0D;&#x0A;.
Well, you can't specify that you want hex escapes specifically, but according to the DOM LS standard, implementations should change \r\n in attribute values to character references automatically.
Unfortunately, minidom doesn't:
>>> from xml.dom import minidom
>>> document= minidom.parseString('<a/>')
>>> document.documentElement.setAttribute('a', 'a\r\nb')
>>> document.toxml()
u'<?xml version="1.0" ?><a a="a\r\nb"/>'
This is a bug in minidom. Try the same in another DOM (eg. pxdom):
>>> import pxdom
>>> document= pxdom.parseString('<a/>')
>>> document.documentElement.setAttribute('a', 'a\r\nb')
>>> document.pxdomContent
u'<?xml version="1.0" ?><a a="a&#13;&#10;b"/>'
You should try storing the actual characters (ASCII 13 and ASCII 10) in the attribute value, instead of their already-escaped counterparts.
EDIT: It looks like minidom does not handle newlines in attribute values correctly.
A literal line break in an attribute value is allowed, but it is normalized when the document is parsed, at which point it is converted to a space.
I filed a bug in this regard: http://bugs.python.org/issue5752
An ampersand is a special character in XML, and most XML parsers require valid XML in order to function. Let minidom escape the ampersand for you (really it should already be escaped) and then, when you need to display the escaped value, unescape it.
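A short Python 3 sketch of that round trip, with a made-up attribute value: the raw ampersand goes in through setAttribute, minidom escapes it on output, and parsing the output gives the raw value back.

```python
from xml.dom import minidom

document = minidom.parseString('<a/>')

# Store the real character; do not pre-escape it yourself.
document.documentElement.setAttribute('CommandLine', 'echo a & echo b')

# Serialising escapes the ampersand so the markup stays valid.
serialized = document.documentElement.toxml()
print(serialized)  # <a CommandLine="echo a &amp; echo b"/>

# Parsing the serialised form returns the unescaped value.
reparsed = minidom.parseString(serialized).documentElement.getAttribute('CommandLine')
```

Passing an already-escaped string to setAttribute is what produces doubled forms such as &amp;#x0D; in the output.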
