Elementtree and Unicode or UTF-8 confusion - python

Okay, I feel a bit lost right now. I have some problems with unicode (or utf-8 ?)
I am using Python3.3 on linux (But I have the same problem on windows).
I try to create an XML file with Elementtree.
item = ET.Element("item")
item_title = ET.SubElement(item, "title")
That is of course not everything, just an example.
So now I want to have the tag 'title' have a text like this (replace the ##Content## with random content, doesn't matter so much):
# That's how I create the text for the tag
item.title.text = u'<![CDATA[##CONTENT##]>'
# This is how I want it to look
<title><![CDATA[##CONTENT##]></title>
# That's what I get
<title>&lt;![CDATA[##CONTENT##]&gt;</title>
# These are some of the things I tried for writing it to an xml file
ET.ElementTree(item).write(myOutputFile, encoding="unicode")
myOutputFile.write(ET.tostring(item, encoding='unicode', method='xml'))
myOutputFile.write(str(ET.tostring(item, encoding='utf-8', method='xml')))
myOutputFile.write(str(ET.tostring(item)))
# Oh and that's how I open the file for writing
myOutputFile = codecs.open(HereIsMyFile, 'w', encoding='utf-8')
I tried to search and found some similar sounding problems (some of the things I tried are from SO already), but none seems to work. They changed some stuff in the output, but never showed the < or >.
I also noticed, if I use utf-8 I have to use str() when writing to the file. That got me also confused about the difference in unicode and utf-8, I tried to read some stuff about that but that didn't really help me in my actual problem.
At this point I don't really know where to look for my error and I would love a hint where to look.
Is it the way I write to the file? How I open it?
Or is it Elementtree causing the error? (I didn't try something else, like lxml, because well, that would mean rewriting a lot of stuff I guess).
I hope you can help me and if something isn't clear I will try to explain it a bit better!
Edit: Oh and I also tried to open the file without codecs, because I somewhere read it is not needed anymore in Python3.x but I wasn't so sure anymore, so I tried it.

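The correct way to write an XML document with ElementTree on Python 3 is to open the file in binary mode and let ElementTree do the encoding (root here is an ElementTree, not an Element):
with open(HereIsMyFile, 'wb') as myOutputFile:
    root.write(myOutputFile, encoding='utf-8')
If you specify an encoding for write(), it must be one the XML standard defines, such as UTF-8. Unicode itself is a standard, not an encoding; Python 3's ElementTree accepts encoding="unicode" only as a special value that makes it produce text (str) instead of bytes, which is why the str() calls were needed with 'utf-8'.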
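As a rough sketch of the file-writing part on Python 3 (the sample text and the temporary path are made up for illustration), passing a filename to write() lets ElementTree handle the encoding itself, while tostring(..., encoding="unicode") returns a str that can go to a text-mode file:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = "Gr\u00fc\u00dfe <b> & more"  # sample text, not from the question

# Hypothetical output location, only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "out.xml")

# ElementTree opens the file itself and writes UTF-8 encoded bytes.
ET.ElementTree(item).write(path, encoding="utf-8", xml_declaration=True)

with open(path, "rb") as handle:
    raw = handle.read()

# encoding="unicode" yields a str, so no str() wrapper is needed.
as_text = ET.tostring(item, encoding="unicode")
print(raw)
print(as_text)
```

Either way the special characters in the text are escaped; only the byte encoding differs.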
ElementTree doesn't support CDATA. The effect you're seeing is that ElementTree notices special characters in the text of the node and it escapes them; there is no way to prevent that.
This answer contains the implementation of a CDATA element: How to output CDATA using ElementTree

There seem to be a couple of layers of confusion here.
Taking the lower level first: encodings such as UTF-8 convert Unicode characters into bytes. Your problem is with which characters end up in your generated XML, not with how those characters are stored as bytes, so there isn’t anything to fix at the encoding level.
Secondly, you seem to be expecting the wrong thing from this line:
item.title.text = u'<![CDATA[##CONTENT##]>'
This tells ElementTree that you want that text in the parsed document. Consider this:
item.title.text = u'I <3 ASCII art.'
ElementTree won’t store that directly in the markup: it’ll turn it into
<title>I &lt;3 ASCII art.</title>
Likewise:
item.title.text = u"This </title> isn’t the end of the title"
becomes
<title>This &lt;/title&gt; isn’t the end of the title</title>
Hopefully you can see the value of this: no matter what text you put in there, it won’t break the element markup, or indeed affect it in any way.
Note that because of this automatic conversion, you very likely don’t need CDATA sections at all.
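A small sketch of that round trip using only the standard library (the sample text is invented): the markup contains escaped characters, but parsing it gives back exactly the text that was assigned.

```python
import xml.etree.ElementTree as ET

item = ET.Element("item")
title = ET.SubElement(item, "title")
title.text = "I <3 ASCII art & </title> tags"

# Serialising escapes &, < and > in the text.
serialized = ET.tostring(item, encoding="unicode")
print(serialized)

# Parsing undoes the escaping; the text survives unchanged.
parsed = ET.fromstring(serialized)
recovered = parsed.find("title").text
print(recovered)
```

Any conforming XML consumer sees the original text, which is why a CDATA section is rarely needed.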
If for some reason you do, though, you can do it by stating it explicitly (using lxml.etree):
import lxml.etree
title = lxml.etree.Element('title')
title.text = lxml.etree.CDATA('###CONTENT###')
print(lxml.etree.tostring(title))
outputs:
<title><![CDATA[###CONTENT###]]></title>

Related

Formatting text that is meant to be replaced

This is a rather generic question, but I have a textfile that I want to edit using a script.
What are some ways to format text, so that it will visually stand out but still be recognized by my script?
It works fine when I use text_to_be_replaced, but it is hard to find when you have a large file.
Tried searching, and it seems that the common ways are:
%text_to_be_replaced%
<text_to_be_replaced>
$(text_to_be_replaced)
But maybe there is a commonly used/widely accepted way to format text for visibility?
The language the script is written in is python, if that matters... but I'm looking for a more-or-less generic solution which will work 90% of the time.
I'm not aware of any generic standard here, but if it's meant to be replaced, you can use the new string formatting method as follows:
string = 'some text {add_text_here} some more text'
Then to replace it when you need to:
value = 'formatted'
string = string.format(add_text_here=value)
Now print it out:
>>> string
'some text formatted some more text'
In fact, this is quite neat: the curly {brackets} around the text to be replaced may also make it stand out a little.
At first I thought that {{curly braces}} would be fine, but then I went with $ALLCAPS.
First of all, caps really stands out, while lowercase may be confused with the rest of the code.
And while it $REALLYSTANDSOUT, it shouldn't cause any problems, since it's just a "bookmark" in a text file, and will be replaced with the appropriate stuff determined by the script.
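If the script is in Python, $ALLCAPS markers happen to match the placeholder syntax of the standard library's string.Template, so no custom parsing is needed. A minimal sketch, with made-up placeholder names and values:

```python
from string import Template

# Placeholder names (NAME, ORDER_ID, DATE) are invented for this example.
template_text = "Dear $NAME, your order $ORDER_ID ships on $DATE."

# substitute() raises KeyError if any placeholder is left without a value.
filled = Template(template_text).substitute(
    NAME="Ada", ORDER_ID="42", DATE="Monday"
)

# safe_substitute() leaves unknown placeholders in place instead.
partial = Template(template_text).safe_substitute(NAME="Ada")

print(filled)
print(partial)
```

safe_substitute() is handy when a file is filled in over several passes.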

Writing unicode symbols to files (as opposed to unicode code)

I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open the file with e.g. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string. I know those two methods are opposites, but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to a text file, then read the file with python, re-write what I've just read back to the same file and save, the symbols get turned into the code \u9a6c. Not sure if this is important, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects' repr method, which turned out to be where the problem lay. I didn't mention it because it just didn't occur to me it could be related to the problem somehow. Thanks again for your help!
From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead; codecs.open() handles encoding to UTF-8 correctly if you do.
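The same distinction, sketched for Python 3 under a couple of assumptions: ascii() stands in for what Python 2's repr() did to unicode objects, and the file path is a throwaway temporary one.

```python
import os
import tempfile

my_string = "马/馬"

# ascii() mimics Python 2's repr(): a quoted, backslash-escaped debug form.
debug_form = ascii(my_string)
print(debug_form)

# Hypothetical output path, only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "hanzi.txt")

# Writing the string itself stores the real characters as UTF-8 bytes.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(my_string)

with open(path, "rb") as handle:
    raw = handle.read()
print(raw.decode("utf-8"))
```

Writing debug_form instead of my_string is what puts the literal \u9a6c sequences into the file.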

How can I disable 'output escaping' in minidom

I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '&#174;' #Both accepted xml encodings for registered trademark
data = '&reg;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because when I print text.toxml(), the ampersands are being escaped, I get this output:
&reg;
&amp;reg;
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &#174; or &reg; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped, in the sense that it caused the ® symbol itself to be output to my xml. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier), this xml document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this led to the same original problem, where all that happens is the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having &#174; or &reg; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping (&#174;) and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as character references (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
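A Python 3 sketch of that approach (the sample text is invented): toxml() returns a str with the markup characters escaped, and encoding it with the xmlcharrefreplace error handler turns every non-ASCII character into a numeric character reference.

```python
import xml.dom.minidom as mdom

doc = mdom.Document()
node = doc.createTextNode("Acme\u00ae & Co")  # sample text

# minidom escapes the ampersand but leaves the (R) sign as a character.
unicode_xml = node.toxml()

# Option 1: pure ASCII output, non-ASCII characters become &#NNN; references.
ascii_xml = unicode_xml.encode("ascii", "xmlcharrefreplace")

# Option 2: keep the character and encode the document as UTF-8.
utf8_xml = unicode_xml.encode("utf-8")

print(ascii_xml)
print(utf8_xml)
```

Option 2 only displays correctly if the page is actually served or declared as UTF-8.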
Default unescape:
from xml.sax.saxutils import unescape
unescape("&lt; &amp; &gt;")
The result is,
'< & >'
And, unescape more:
unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})
Check details here, https://wiki.python.org/moin/EscapingXml
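For reference, a short runnable version of both calls, together with escape() as the inverse operation (the variable names are only for illustration):

```python
from xml.sax.saxutils import escape, unescape

raw = "< & >"

# escape() handles &, < and > by default.
escaped = escape(raw)

# unescape() reverses exactly those three by default.
restored = unescape(escaped)

# Other entities have to be listed explicitly in a mapping.
quotes = unescape("&apos; &quot;", {"&apos;": "'", "&quot;": '"'})

print(escaped)
print(restored)
print(quotes)
```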

Cleaning an XML file in Python before parsing

I'm using minidom to parse an xml file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไอเฟล &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or one of the </> characters, but it isn't quite working.
Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything except 0x20-0x7F range.
You may start from \x01 if you want to keep control characters like tabs and line breaks.
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)
Take a look at µTidyLib, a Python wrapper to TidyLib.
If you do need the data with the strange characters you could, instead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata package, especially the normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"ไภเฟล &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'
It looks like you're dealing with data which are saved with some kind of encoding "as if" they were ASCII. XML file should normally be UTF8, and SAX (the underlying parser used by minidom) should handle that, so it looks like something's wrong in that part of the processing chain. Instead of focusing on "cleaning up" I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML directive? Can you edit your Q to show the first few lines of the file, especially the <?xml ... directive at the very start?
I'd throw out all non-ASCII characters which can be identified by having the 8th bit (0x80) set (128 .. 255 respectively 0x80 .. 0xff).
You could read the file into a Python string named old_str, then filter out every character whose code point is 0x80 or above:
new_str = ''.join(c for c in old_str if ord(c) < 0x80)
Then parse new_str.
Many ways exist to accomplish stripping non-ASCII characters from a string.
This question might be related: How to check if a string in Python is in ASCII?
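Putting the stripping suggestions together as a rough Python 3 sketch: the function name, the sample document and the ampersand regex are all assumptions made for illustration. Stripping non-ASCII characters does nothing about a bare &, so the sketch escapes those separately with a simple heuristic.

```python
import re
from xml.dom import minidom

def clean_xml_text(xml_text):
    """Hypothetical helper: make lightly damaged XML text parseable."""
    # Keep tab, LF, CR and printable ASCII; drop everything else.
    ascii_only = re.sub(r"[^\x09\x0a\x0d\x20-\x7e]+", "", xml_text)
    # Heuristic: escape ampersands that do not start an entity such as
    # &amp; or a character reference such as &#174;.
    return re.sub(r"&(?!#?\w+;)", "&amp;", ascii_only)

# Invented sample resembling the data described in the question.
dirty = "<page><name>Eiffel \u0e44\u0e2d\u0e40\u0e1f\u0e25 & Co</name></page>"

cleaned = clean_xml_text(dirty)
doc = minidom.parseString(cleaned)
name = doc.getElementsByTagName("name")[0].firstChild.data
print(cleaned)
print(name)
```

This throws the non-ASCII data away, so it only makes sense when that data is not needed; fixing the encoding at the source is the cleaner route.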

Should I strip the XML declaration from suds output before parsing with lxml?

I’m trying to implement a SOAP webservice in Python 2.6 using the suds library. That is working well, but I’ve run into a problem when trying to parse the output with lxml.
Suds returns a suds.sax.text.Text object with the reply from the SOAP service. The suds.sax.text.Text class is a subclass of the Python built-in Unicode class. In essence, it would be comparable with this Python statement:
u'<?xml version="1.0" encoding="utf-8" ?><root><lotsofelements /></root>'
Which is incongruous, since if the XML declaration is correct, the contents are UTF-8 encoded, and thus not a Python Unicode object (because those are stored in some internal encoding like UCS4).
lxml will refuse to parse this, as documented, since there is no clear answer to what encoding it should be interpreted as.
As I see it, there are two ways out of this bind:
Strip the <?xml> declaration, including the encoding.
Convert the output from Suds into a bytestring, using the specified encoding.
Currently, the data I’m receiving from the webservice is within the ASCII range, so either way will work, but both feel very much like ugly hacks to me, and I’m not quite sure what would happen if I start to receive data that needs a wider range of Unicode characters.
Any good ideas? I can’t imagine I’m the first one in this position…
You and lxml are correct; a valid XML document must be a stream of bytes encoded as declared in the <?xml ..... header (default: UTF-8).
I'd suggest a third option: leave it in unicode with an XML header that omits the encoding declaration but leaves the version in there (future-safe). That will keep lxml happy and avoid the overhead of you encoding it again.
I'd also suggest some gentle enquiry at the suds site and having a poke around in their source.
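The options discussed here can be sketched as two small helpers. Everything in this block is hypothetical: the helper names, the regular expression and the sample reply are made up, and the standard library's xml.etree.ElementTree stands in for lxml so the sketch runs without extra packages. lxml.etree.fromstring() accepts the same two kinds of input.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Matches the encoding pseudo-attribute inside a leading XML declaration.
DECL_ENCODING = re.compile(
    r'^(\s*<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])([^"\']*)\2',
    re.IGNORECASE,
)

def drop_declared_encoding(xml_text):
    """Keep the declaration and its version, remove only the encoding."""
    return DECL_ENCODING.sub(r"\1", xml_text, count=1)

def encode_as_declared(xml_text, default="utf-8"):
    """Turn the text into bytes using the encoding the declaration names."""
    match = DECL_ENCODING.match(xml_text)
    encoding = match.group(3) if match else default
    return xml_text.encode(encoding)

# Invented reply with one non-ASCII character.
reply = '<?xml version="1.0" encoding="utf-8" ?><root><item>caf\u00e9</item></root>'

without_encoding = drop_declared_encoding(reply)
as_bytes = encode_as_declared(reply)

print(ET.fromstring(without_encoding).find("item").text)
print(ET.fromstring(as_bytes).find("item").text)
```

Both routes keep non-ASCII data intact, which addresses the worry about replies outside the ASCII range.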
Hmm, I'm currently implementing my first Suds-based solution and parsing my responses with lxml without a problem, but I think this could be because I'm doing it in a pretty blunt and dumb way. Here's what my code looks like:
try:
    result = self.client.service.ExportOwnersDetails(fAccess=self.access_id, fParams=params)
except URLError:
    # TODO: Log timeout here, handle
    return
response = str(result.fReturn)
if len(response) == 0 or response.find('<?xml ') == -1:
    # TODO: Log import error here, handle
    return
response = StringIO(response)
xml = etree.parse(response)
Like I said, not very clever (and obviously I still have some logging to do), but that's my approach. The fAccess, fParams, fReturn nonsense is the naming convention at the third-party provider I'm integrating with.
