UnicodeDecodeError - python

I got an UnicodeDecodeError,
'utf8' codec can't decode byte 0xe5 in position 1923: invalid continuation byte
I have use Danish letter "å" in my template. How can I solve the problem, then I can use non-English letter in my Django project and database?

I can get a similar error (mentioning the same byte value) doing this:
>>> 'å'.encode('latin-1')
b'\xe5'
>>> _.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
_.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
This implies that your data is encoded in latin-1 rather than utf-8. In general, there are two solutions to this: if you have control over your input data, re-save it as UTF-8. Otherwise, when you read the data in Python, set the encoding to latin-1. For a django template, you should be able to use the first - the editor you use should have an 'encoding' option somewhere, change it to utf-8, resave, and everything should work.

this helped me https://stackoverflow.com/a/23278373/2571607
Basically, open C:\Python27\Lib\mimetypes.py
replace
‘default_encoding = sys.getdefaultencoding()’
with
if sys.getdefaultencoding() != 'gbk':
reload(sys)
sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u2264'

I'm using python3.6 in windows7 and django 1.9
I got this error when i run my code.
In my code i'm parsing xml data to write a html page.
I got that some character is unable to encode properly thats why its throwing an error.
\u2264 this is the character(less than or equal) which is the root cause of the error.
My question is how to encode this properly in python3
Detailed Error Log:
Traceback (most recent call last):
File "C:\Dev\EXE\TEMP\cookie\crumbs\views.py", line 1520, in parser
html_file.write(html_text)
File "C:\Users\Cookie1\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 389078: character maps to <undefined>
The error message indicates that you're trying to encode to the Windows-1252 character encoding. That encoding doesn't have a representation of the less than or equal symbol.
>>> "\u2264".encode("cp1252")
>>> Traceback... [as above]
The answer is to use UTF-8, an unrestricted encoding, instead of Windows-1252, a very restricted encoding.
Your question doesn't include much context, but the line html_file.write(html_text) makes me think you're using Python's file protocol. The documentation for open() shows how to set the encoding, e.g.
html_file = open("file.html", mode="w", encoding="utf8")
Note that "the default encoding is platform dependent (whatever locale.getpreferredencoding() returns)", which is why you're getting Windows-1252 on Windows 7.

python cleaning high or non-ascii out of a file [duplicate]

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguemnts. Can I pass extra arguments to open() or use something in the codec module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
Position 0x81 is unassigned in Windows-1252 (aka cp1252). It is assigned to U+0081 HIGH OCTET PRESET (HOP) control character in Latin-1 (aka ISO 8859-1). I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between Windows-1257 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.

Why does str.encode('utf-8') produce UnicodeDecodeError in my python script?

When running the following code (which just prints out file names):
print filename
It throws the following error:
File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
So to fix this, I tried changing print filename to print filename.encode('utf-8') which didn't fix the problem.
The script only fails when trying read a filename such as Coé.jpg.
Any ideas how I can modify filename so the script continues to work when it comes acorss a special character?
NB. I'm a python noob
filename is already encoded. It is already a byte string and doesn't need encoding again.
But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:
>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').
You probably need to read up on Python and Unicode. I recommend:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
The rule of thumb is: decode as early as you can, encode as late as you can. That means when you receive data, decode to Unicode objects, when you need to pass that information to something else, encode only then. Many APIs can do the decoding and encoding as part of their job; print will encode to the codec used by the terminal, for example.

(python)In my database mix some wrong ascii code, how to converte those string without errors

In my database mix some wrong ascii code, how to make concatenate those string without errors?
my example situation is like(some ascii character is larger than 128):
>>> s=b'\xb0'
>>> addstr='read '+s
>>> print addstr
read ░
>>> addstr.encode('ascii','ignore')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal
not in range(128)
>>> addstr.encode('utf_8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal
not in range(128)
I can do:
>>> addstr.decode("windows-1252").encode('utf-8')
'read \xc2\xb0'
but you can see the windows-1252 coding will change my character.
I would like convert the addstr to unicode? how to do it?
addstrUnicode = addstr.decode("unicode-escape")
You should not be concerned about the character changing, it is just that the utf-8 encoding requires two bytes, not one byte, for characters between 0x80 and 0x7FF, so when you encode as utf-8, an extra byte (0xC2) is added.
This is a useful link to read to help understand different types of encodings.
Additionally, make sure you know the original encoding of the character before you start trying to decode it. While you mentioned that it was "ascii code", the ascii character set only extends up to 127, which means the character cannot be ascii-encoded. I'm assuming here it's just Unicode point \u00B0.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3

I got this string 'Velcro Back Rest \xa36.99'. Note it does not have u in the front. Its just plain ascii.
How do I convert it to unicode?
I tried this,
>>> unicode('Velcro Back Rest \xa36.99')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 17: ordinal not in range(128)
This answer explain it nicely. But I have same question as the OP of that question. In the answer to that comment Winston says "You should not encoding a string object ..."
But the framework I am working requires that it should be converted unicode string. I use scrapy and I have this line.
loader.add_value('name', product_name)
Here product_name contains that problematic string and it throws the error.
You need to specify an encoding to decode the bytes to Unicode with:
>>> 'Velcro Back Rest \xa36.99'.decode('latin1')
u'Velcro Back Rest \xa36.99'
>>> print 'Velcro Back Rest \xa36.99'.decode('latin1')
Velcro Back Rest £6.99
In this case, I was able to guess the encoding from experience, you need to provide the correct codec used for each encoding you encounter. For web data, that is usually included in the from of the content-type header:
Content-Type: text/html; charset=iso-8859-1
where iso-8859-1 is the official standard name for the Latin 1 encoding, for example. Python recognizes latin1 as an alias for iso-8859-1.
Note that your input data is not plain ASCII. If it was, it'd only use bytes in the range 0 through to 127; \xa3 is 163 decimal, so outside of the ASCII range.

Categories

Resources