Unicode decode error using codecs.open() - python

I have run into a character encoding problem as follows:
rating = 'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating))
The error I get is:
File "./assetshare.py", line 314, in write_file
</ratings>""" % (values['rating_system'], rating))
I know that the encoding error is related to Barntillåten, because if I replace that word with test, the function works fine.
Why is this encoding error happening and what do I need to do to fix it?

rating must be a Unicode string in order to contain Unicode codepoints.
rating = u'Barntillåten'
Otherwise, in Python 2, the non-Unicode string 'Barntillåten' contains bytes (encoded with whatever your source encoding was), not codepoints.

In Python 2, codecs.open expects to read and write unicode objects. You're passing it a str.
The fix is to ensure that the data you pass it is unicode:
new_file.write((
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating)
).decode('utf-8'))
If you use unicode literals (u"...") then Python will try to ensure that all data is unicode. Here it would be sufficient to have rating = u'Barntillåten':
rating = u'Barntillåten'
new_file = codecs.open(os.path.join(folder, "metadata.xml"), 'w', 'utf-8')
new_file.write(
"""<?xml version="1.0" encoding="UTF-8"?>
<ratings>
<rating system="%s">%s</rating>
</ratings>""" % (values['rating_system'], rating))
You can write a str object to a codecs.open file, but only if the str is encoded in the default encoding, which in practice means it is only safe if the str is plain ASCII. The default encoding is, and should be left as, ASCII; see Changing default encoding of Python?
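A minimal sketch of the same fix using io.open, which behaves the same way on Python 2 and Python 3. The values dict, the 'se-tv' rating system and the temporary folder are placeholders made up for illustration, not the asker's real data:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

import io
import os
import tempfile

# Hypothetical stand-ins for the asker's data.
values = {'rating_system': 'se-tv'}
rating = 'Barntillåten'  # a text (unicode) string, not bytes

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'metadata.xml')

# io.open encodes text to UTF-8 as it is written.
with io.open(path, 'w', encoding='utf-8') as new_file:
    new_file.write(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<ratings>\n'
        '<rating system="%s">%s</rating>\n'
        '</ratings>' % (values['rating_system'], rating))

# Read the raw bytes back to confirm the word was stored as UTF-8.
with io.open(path, 'rb') as check:
    raw = check.read()
```

The point of the sketch is that the text stays unicode all the way to the write call, and the file object is the only place where encoding happens.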

You need to use unicode literals.
u'...'
u"..."
u'''......'''
u"""......"""

Related

Remove "encoding" attribute from XML in Python

I am using python to do some conditional changes to an XML document. The incoming document has <?xml version="1.0" ?> at the top.
I'm using xml.etree.ElementTree.
How I'm serialising the changed XML:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
The output has this at the top:
<?xml version='1.0' encoding='utf8'?>
The client wants the "encoding" tag removed but if I remove it then it either doesn't include the line at all or it puts in encoding= 'us-ascii'
Can this be done so the output matches: <?xml version="1.0" ?>?
(I don't know why it matters honestly but that's what I was told needed to happen)
As pointed out in this answer there is no way to make ElementTree omit the encoding attribute. However, as @James suggested in a comment, it can be stripped from the resulting output like this:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
filter_update_body = filter_update_body.replace(b"encoding='utf8'", b"", 1)
The b prefixes are required because ET.tostring() will return a bytes object if encoding != "unicode". In turn, we need to call bytes.replace().
With encoding="unicode" (note that this is the literal string "unicode"), it will return a regular str. In this case, the b prefixes can be omitted and we use good old str.replace().
It's worth noting that the choice between bytes and str also affects how the XML will eventually be written to a file. A bytes object should be written in binary mode, a str in text mode.
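A short sketch of both variants, assuming Python 3 (where encoding="unicode" is available). The tiny root document is a placeholder:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><item>value</item></root>')  # placeholder document

# Bytes output: strip the encoding attribute from the declaration, if one
# was emitted, using bytes.replace() with bytes arguments.
as_bytes = ET.tostring(root, encoding="utf8", method="xml")
as_bytes = as_bytes.replace(b"encoding='utf8'", b"", 1)

# Text output: encoding="unicode" returns a str instead of bytes.
as_text = ET.tostring(root, encoding="unicode", method="xml")
```

as_bytes would then be written to a file opened in binary mode ('wb'), and as_text to a file opened in text mode ('w').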

ElementTree Unicode Encode Error about korean words in python2.7 [duplicate]

I have this char in an xml file:
<data>
<products>
<color>fumè</color>
</products>
</data>
I try to generate an instance of ElementTree with the following code:
string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))
and I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)
(NOTE: The position is not exact, I sampled the xml from a larger one).
How to solve it? Thanks
If you ran into this problem while using Requests (HTTP for Humans): response.text decodes the response by default. Use response.content to get the undecoded bytes, so that ElementTree can decode them itself. Just remember to use the correct encoding.
More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
You need to decode utf-8 strings into a unicode object. So
string_data.encode('utf-8')
should be
string_data.decode('utf-8')
assuming string_data is actually a UTF-8 string.
So to summarize: to get a UTF-8 string from a unicode object you encode the unicode (using the UTF-8 encoding), and to turn a string into a unicode object you decode the string using the respective encoding.
For more details on the concepts I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python specific).
You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8) and ElementTree does the work for you, outputting unicode:
>>> data = '''\
... <data>
... <products>
... <color>fumè</color>
... </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'
If your data is contained in a file or file-like object, just pass the filename or file object directly to the ElementTree.parse() function:
x = ElementTree.parse('file.xml')
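As a sketch of the same idea on raw bytes (the form the data has when read from a file in binary mode), the parser can be handed the UTF-8 bytes directly. The sample document is made up for illustration:

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded bytes, as they would come from a file opened in binary mode.
data = u'<data><products><color>fum\xe8</color></products></data>'.encode('utf-8')

root = ET.fromstring(data)   # the parser decodes the bytes itself
color = root[0][0].text      # already a text (unicode) string
```

No explicit encode() or decode() call is needed on the way in.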
Have you tried using the parse function, instead of opening the file... (which BTW would require a .read() after it for the .fromstring() to work...)
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
# etc...
Most likely your file is not UTF-8. The è character may come from some other encoding, Latin-1 for example.
Function open() does not return a string.
Instead use open('file.xml').read().

Remove non Unicode characters from xml database with Python

So I have a 9000 line xml database, saved as a txt, which I want to load in python, so I can do some formatting and remove unnecessary tags (I only need some of the tags, but there is a lot of unnecessary information) to make it readable. However, I am getting a UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 608814: character maps to <undefined>, which I assume means that the program ran into a non-Unicode character. I am quite positive that these characters are not important to the program (the data I am looking for is all plain text, with no special symbols), so how can I remove all of these from the txt file, when I can't read the file without getting the UnicodeDecodeError?
One crude workaround is to decode the bytes from the file yourself and specify the error handling. For example:
for line in somefile:
    uline = line.decode('ascii', errors='ignore')
That will turn the line into a Unicode object in which any non-ascii bytes have been dropped. This is not a generally recommended approach - ideally you'd want to process XML with a proper parser, or at least know your file's encoding and open it appropriately (the exact details depend on your Python version). But if you're entirely certain you only care about ascii characters this is a simple fallback.
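A small sketch of what that fallback does, using made-up sample lines held in memory rather than a real file. Reading a file in binary mode ('rb') would yield bytes lines like these:

```python
# Made-up sample: the second line contains a stray non-ASCII byte (0x8d).
raw_lines = [b'<name>plain text</name>\n', b'<note>caf\x8d</note>\n']

# Decode as ASCII and silently drop every byte that is not ASCII.
cleaned = [line.decode('ascii', errors='ignore') for line in raw_lines]
```

The offending byte simply disappears from the second line; anything that was meaningful non-ASCII text is lost as well, which is why this is only a fallback.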
The error suggests that you're using open() function without specifying an explicit character encoding. locale.getpreferredencoding(False) is used in this case (e.g., cp1252). The error says that it is not an appropriate encoding for the input.
An xml document may contain a declaration at the very beginning that specifies the encoding used explicitly. Otherwise the encoding is defined by BOM or it is utf-8. If your copy-pasting and saving the file hasn't messed up the encoding and you don't see a line such as <?xml version="1.0" encoding="iso-8859-1" ?> then open the file using utf-8:
with open('input-xml-like.txt', encoding='utf-8', errors='ignore') as file:
    ...
If the input is an actual XML then just pass it to an XML parser instead:
import xml.etree.ElementTree as etree
tree = etree.parse('input.xml')
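To illustrate why handing the bytes to a parser is the more robust route, here is a sketch in which the document declares iso-8859-1 and the parser honours that declaration. The sample bytes are made up, and io.BytesIO stands in for a real file:

```python
import io
import xml.etree.ElementTree as etree

# 0xe8 is "è" in iso-8859-1; the declaration tells the parser how to decode it.
raw = b'<?xml version="1.0" encoding="iso-8859-1"?><color>fum\xe8</color>'

tree = etree.parse(io.BytesIO(raw))
text = tree.getroot().text
```

The caller never has to name the encoding; it is read from the XML declaration.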

Using unicode character in XML with python : 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

I use Django and in my view I need to send a request as XML with some Unicode characters received from an HTML page via the POST method. I tried these (note that I save that input in the fname variable):
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = u"%s".encode('utf8') % (fname)
xml = r"""my XML code with unicode {0} """.format(fname)
And
fname = fname.encode('ascii', 'ignore').decode('ascii')
xml = r"""my XML code with unicode {0} """.format(fname)
And every time i got this error:
'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
You could reproduce the error with this code:
>>> "{0}".format(u"\U0001F384"*4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
To fix this particular error, just use Unicode format string:
>>> u"{0}".format(u"\U0001F384"*4)
u'\U0001f384\U0001f384\U0001f384\U0001f384'
You could use xml.etree.ElementTree module to build your xml document instead of string formatting. xml is a complex format; it is easy to get it wrong. ElementTree will also serialize your Unicode string into bytes correctly making sure that the character encoding in the xml declaration is consistent with the actual encoding that is used in the document.
xml = r"""my XML code with unicode {0} """.format(fname)
The .format method always produces the same output string type as the input format string. In this case your format string is a byte string r"""...""", so if fname is a Unicode string Python tries to force it into being a byte string. If fname contains characters that do not exist in the default encoding (ASCII) then bang.
Note that this differs from the old string formatting operator %, which tries to promote to Unicode string when either the format string or any of the arguments used are Unicode, which would work in this case as long as the my XML code was ASCII-compatible. This is a common problem when you convert code that uses % to .format().
This should work fine:
xml = ur"""my XML code with unicode {0} """.format(fname)
However the output will be a Unicode string so whatever you do next needs to cope with that (for example if you are writing it to a byte stream/file, you would probably want to .encode('utf-8') the whole thing). Alternatively encode it in place to get a byte string:
xml = r"""my XML code with unicode {0} """.format(fname.encode('utf-8'))
Note that this above:
fname = u"%s".encode('utf8') % (fname)
does not work because you are encoding the format string to bytes, not the fname argument. This is identical to saying just fname = '%s' % fname, which is effectively fname = fname.
I Solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
This smells bad. For input hello ☃, you are now generating hello &#9731; instead of the normal output hello ☃.
If both &#9731; and ☃ look the same to you in the output then probably you are doing something like this:
xml = '<element>{0}</element>'.format(some_text)
which is broken for XML-special characters like & and <. When you are generating XML you should take care to escape special characters (&<>"', to &amp;, &lt; etc), otherwise at best your output will break for these characters; at worst, when some_text includes user input you have an XML-injection vulnerability which may break the logic of your system in a security-sensitive way.
As J F Sebastian said (+1), it's a good idea to use existing known-good XML serialisation libraries like etree instead of trying to roll your own.
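A hedged sketch of both routes, assuming Python 3: escaping by hand with xml.sax.saxutils.escape, and letting ElementTree do the serialisation. The sample text is made up:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

some_text = u'hello \u2603 & <b>'  # made-up input with XML-special characters

# Manual route: escape &, < and > before formatting into the template.
manual = u'<element>{0}</element>'.format(escape(some_text))

# Library route: ElementTree escapes the text while serialising.
element = ET.Element('element')
element.text = some_text
serialised = ET.tostring(element, encoding='unicode')
```

Both routes keep the snowman as a real character and only rewrite the characters that have special meaning in XML.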
You could do something like:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
but it's an ugly hack that only works in Python 2.7.
or
fname.encode('GB18030').decode('utf-8')
it will bypass the error but may still look messy. If you're posting to an html file then make the charset utf-8
I Solved that with this code:
fname = fname.encode('ascii', 'xmlcharrefreplace')
xml = r"""my XML code with unicode {0} """.format(fname)
Thank you for your help.
Update :
And you can remove or replace special characters like >, & and < with this (thanks to @bobince for noticing this):
fname = fname.replace("<", "")
fname = fname.replace(">", "")
fname = fname.replace("&", "")

Reading UTF-8 XML and writing it to a file with Python

I'm trying to parse UTF-8 XML file and save some parts of it to another file. Problem is, that this is my first Python script ever and I'm totally confused about the character encoding problems I'm finding.
My script fails immediately when it tries to write non-ascii character to a file, but it can print it to command prompt (at least in some level)
Here's the XML (from the parts that matter at least, it's a *.resx file which contains UI strings)
<?xml version="1.0" encoding="utf-8"?>
<root>
<resheader name="foo">
<value>bar</value>
</resheader>
<data name="lorem" xml:space="preserve">
<value>ipsum öä</value>
</data>
</root>
And here's my python script
from xml.dom.minidom import parse
names = []
values = []
def getStrings(path):
    dom = parse(path)
    data = dom.getElementsByTagName("data")
    for i in range(len(data)):
        name = data[i].getAttribute("name")
        names.append(name)
        value = data[i].getElementsByTagName("value")
        values.append(value[0].firstChild.nodeValue.encode("utf-8"))

def writeToFile():
    with open("uiStrings-fi.py", "w") as f:
        for i in range(len(names)):
            line = names[i] + '="'+ values[i] + '"' #varName='varValue'
            f.write(line)
            f.write("\n")

getStrings("ResourceFile.fi-FI.resx")
writeToFile()
And here's the traceback:
Traceback (most recent call last):
File "GenerateLanguageFiles.py", line 24, in <module>
writeToFile()
File "GenerateLanguageFiles.py", line 19, in writeToFile
line = names[i] + '="'+ values[i] + '"' #varName='varValue'
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
How should I fix my script so that it reads and writes UTF-8 characters properly? The files I'm trying to generate would be used in test automation with Robot Framework.
You'll need to remove the call to encode() - that is, replace nodeValue.encode("utf-8") with nodeValue - and then change the call to open() to
with open("uiStrings-fi.py", "w", "utf-8") as f:
This uses a "Unicode-aware" version of open() which you will need to import from the codecs module, so also add
from codecs import open
to the top of the file.
The issue is that when you were calling nodeValue.encode("utf-8"), you were converting a Unicode string (Python's internal representation that can store all Unicode characters) into a regular string (which can only store single-byte characters 0-255). Later on, when you construct the line to write to the output file, names[i] is still a Unicode string but values[i] is a regular string. Python tries to convert the regular string to Unicode, which is the more general type, but because you don't specify an explicit conversion, it uses the ASCII codec, which is the default, and ASCII can't handle characters with byte values greater than 127. Unfortunately, several of those do occur in the string values[i] because the UTF-8 encoding uses those upper-range bytes frequently. So Python complains that it sees a character it can't handle. The solution, as I said above, is to defer the conversion from Unicode to bytes until the last possible moment, and you do that by using the Unicode-aware version of open (which will handle the encoding for you).
Now that I think about it, instead of what I said above, an alternate solution would be to replace names[i] with names[i].encode("utf-8"). That way, you convert names[i] into a regular string as well, and Python has no reason to try to convert values[i] back to Unicode. Although, one could make the argument that it's good practice to keep your strings as Unicode objects until you write them out to the file... if nothing else, I believe unicode becomes the default in Python 3.
The XML parser decodes the UTF-8 encoding of the input when it reads the file and all the text nodes and attributes of the resulting DOM are then unicode objects. When you select the interesting data from the DOM, you re-encode the values as UTF-8, but you don't encode the names. The resulting values array contains encoded byte strings while the names array still contains unicode objects.
In the line where the encoding error is thrown, Python tries to concatenate such a unicode name and a byte string value. To do so, both values have to be of the same type and Python tries to convert the byte string values[i] to unicode, but it doesn't know that it's UTF-8 encoded and fails when it tries to use the ASCII codec.
The easiest way to work around this would be to keep all the strings as Unicode objects and just encode them to UTF-8 when they are written to the file:
values.append(value[0].firstChild.nodeValue) # encode not yet
...
f.write(line.encode('utf-8')) # but now
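Put together, the "keep everything unicode, encode only on write" approach might look like the following sketch. It uses io.open so that it runs on both Python 2 and Python 3, and it parses an in-memory copy of the sample .resx and writes to a temporary folder; both are stand-ins for the asker's real files:

```python
import io
import os
import tempfile
from xml.dom.minidom import parseString

# In-memory stand-in for ResourceFile.fi-FI.resx, as UTF-8 bytes.
resx = u'''<?xml version="1.0" encoding="utf-8"?>
<root>
<data name="lorem" xml:space="preserve">
<value>ipsum \xf6\xe4</value>
</data>
</root>'''.encode('utf-8')

dom = parseString(resx)
lines = []
for node in dom.getElementsByTagName('data'):
    name = node.getAttribute('name')
    value = node.getElementsByTagName('value')[0].firstChild.nodeValue
    lines.append(u'%s="%s"' % (name, value))  # stays text throughout

# Encoding happens only here, at the file boundary.
out_path = os.path.join(tempfile.mkdtemp(), 'uiStrings-fi.py')
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(lines) + u'\n')

with io.open(out_path, encoding='utf-8') as f:
    written = f.read()
```

Because names and values are both text, the concatenation never triggers an implicit ASCII conversion.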
