The confusion on python encoding

The confusion on python encoding - python

I retrieved the data encoded in big5 from database,and I want to send the data as email of html content, the code is like this:
html += """<tr><td>"""
html += unicode(rs[0], 'big5') # rs[0] is data encoded in big5
I run the script, but the error raised: UnicodeDecodeError: 'ascii' codec can't decode byte...... However, I tried the code in interactive python command line, there are no errors raised, could you give me the clue?

If html is not already a unicode object but a normal string, it is converted to unicode when it is concatenated with the converted version of rs[0]. If html now contains special characters you can get a unicode error.
So the other contents of html also need to be correctly decoded to unicode. If the special characters come from string literals, you could use unicode literals (like u"abcä") instead.

Your call to unicode() is working correctly. It is the concatenation, which is adding a unicode object to a byte string, that is causing trouble. If you change the first line to u'''<tr><td>''', (or u'<tr><td>') it should work fine.
Edit: This means your error lies in the data that is already in html by the time python reaches this snippet:
>>> '\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 0: ordinal not in range(128)
>>> u'\x9f<tr><td>' + unicode('\xc3\x60', 'big5')
u'\x9f<tr><td>\u56a5'
>>>

Related

Python UnicodeEncodeError for u'\u2019' while trying to create a CSV or export

I'm trying to export some data to CSV from out of a database, and I'm struggling to understand the following UnicodeEncodeError:
>>> sample
u'I\u2019m now'
>>> type(sample)
<type 'unicode'>
>>> str(sample)
Traceback (most recent call last):
File "<console>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 1: ordinal not in range(128)
>>> print sample
I’m now
>>> sample.encode('utf-8', 'ignore')
'I\xe2\x80\x99m now'
I'm confused. Is it unicode or not? What does the UnicodeEncodeError actually mean in this context? Why does print work just fine? If I want to be able to save this data to a CSV file, how can I handle the encoding so that it does not generate an error when I try to use csv.writer's writerow?
Thanks for your help.

It is a Python unicode object, you used type(sample) to verify that. Also, it contains Unicode, so you can serialize it to a file that has one of the Unicode encodings.
The encoding error needs to be read carefully: It is the "ascii" codec that can't represent that string. ASCII is just the Unicode subset with codepoints below 127. Your string uses codepoint 0x2019, so it can't be encoded with ASCII.
print works because it is correctly implemented and it doesn't try to encode the string as ASCII. I think you would get similar errors if stdout was set up with e.g. Latin-1 as encoding, but it seems your system can handle a wider range of Unicode than that.
In order to write a CSV file, you could just use UTF-8 as encoding for that file. I haven't used the CSV module though, so I'm not sure exactly how. In any case, if it doesn't work, you should provide the exact code that doesn't as MCVE in a different question.
BTW: Please upgrade to Python 3! It has many improvements over the 2.x series, also concerning string/Unicode handling.

UnicodeEncodeError: 'charmap' codec can't encode character '\u2264'

I'm using python3.6 in windows7 and django 1.9
I got this error when i run my code.
In my code i'm parsing xml data to write a html page.
I got that some character is unable to encode properly thats why its throwing an error.
\u2264 this is the character(less than or equal) which is the root cause of the error.
My question is how to encode this properly in python3
Detailed Error Log:
Traceback (most recent call last):
File "C:\Dev\EXE\TEMP\cookie\crumbs\views.py", line 1520, in parser
html_file.write(html_text)
File "C:\Users\Cookie1\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 389078: character maps to <undefined>

The error message indicates that you're trying to encode to the Windows-1252 character encoding. That encoding doesn't have a representation of the less than or equal symbol.
>>> "\u2264".encode("cp1252")
>>> Traceback... [as above]
The answer is to use UTF-8, an unrestricted encoding, instead of Windows-1252, a very restricted encoding.
Your question doesn't include much context, but the line html_file.write(html_text) makes me think you're using Python's file protocol. The documentation for open() shows how to set the encoding, e.g.
html_file = open("file.html", mode="w", encoding="utf8")
Note that "the default encoding is platform dependent (whatever locale.getpreferredencoding() returns)", which is why you're getting Windows-1252 on Windows 7.

python cleaning high or non-ascii out of a file [duplicate]

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguemnts. Can I pass extra arguments to open() or use something in the codec module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.

Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1257 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.

You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails

All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3

I got this string 'Velcro Back Rest \xa36.99'. Note it does not have u in the front. Its just plain ascii.
How do I convert it to unicode?
I tried this,
>>> unicode('Velcro Back Rest \xa36.99')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 17: ordinal not in range(128)
This answer explain it nicely. But I have same question as the OP of that question. In the answer to that comment Winston says "You should not encoding a string object ..."
But the framework I am working requires that it should be converted unicode string. I use scrapy and I have this line.
loader.add_value('name', product_name)
Here product_name contains that problematic string and it throws the error.

You need to specify an encoding to decode the bytes to Unicode with:
>>> 'Velcro Back Rest \xa36.99'.decode('latin1')
u'Velcro Back Rest \xa36.99'
>>> print 'Velcro Back Rest \xa36.99'.decode('latin1')
Velcro Back Rest £6.99
In this case, I was able to guess the encoding from experience, you need to provide the correct codec used for each encoding you encounter. For web data, that is usually included in the from of the content-type header:
Content-Type: text/html; charset=iso-8859-1
where iso-8859-1 is the official standard name for the Latin 1 encoding, for example. Python recognizes latin1 as an alias for iso-8859-1.
Note that your input data is not plain ASCII. If it was, it'd only use bytes in the range 0 through to 127; \xa3 is 163 decimal, so outside of the ASCII range.

UnicodeDecodeError

I got an UnicodeDecodeError,
'utf8' codec can't decode byte 0xe5 in position 1923: invalid continuation byte
I have use Danish letter "å" in my template. How can I solve the problem, then I can use non-English letter in my Django project and database?

I can get a similar error (mentioning the same byte value) doing this:
>>> 'å'.encode('latin-1')
b'\xe5'
>>> _.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
_.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
This implies that your data is encoded in latin-1 rather than utf-8. In general, there are two solutions to this: if you have control over your input data, re-save it as UTF-8. Otherwise, when you read the data in Python, set the encoding to latin-1. For a django template, you should be able to use the first - the editor you use should have an 'encoding' option somewhere, change it to utf-8, resave, and everything should work.

this helped me https://stackoverflow.com/a/23278373/2571607
Basically, open C:\Python27\Lib\mimetypes.py
replace
‘default_encoding = sys.getdefaultencoding()’
with
if sys.getdefaultencoding() != 'gbk':
reload(sys)
sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

The confusion on python encoding - python

Related

Python UnicodeEncodeError for u'\u2019' while trying to create a CSV or export

UnicodeEncodeError: 'charmap' codec can't encode character '\u2264'

python cleaning high or non-ascii out of a file [duplicate]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3

UnicodeDecodeError

Categories

Resources