UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' - python

I'm using Python 3.6 on Windows 7 with Django 1.9.
I get this error when I run my code.
My code parses XML data and writes an HTML page.
One character cannot be encoded, and that is what raises the error.
The character is \u2264 (less than or equal to).
My question is how to encode this properly in Python 3.
Detailed Error Log:
Traceback (most recent call last):
File "C:\Dev\EXE\TEMP\cookie\crumbs\views.py", line 1520, in parser
html_file.write(html_text)
File "C:\Users\Cookie1\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 389078: character maps to <undefined>

The error message indicates that you're trying to encode to the Windows-1252 character encoding. That encoding doesn't have a representation of the less than or equal symbol.
>>> "\u2264".encode("cp1252")
Traceback... [as above]
The answer is to use UTF-8, which can encode every Unicode character, instead of Windows-1252, a single-byte encoding that covers fewer than 256 characters.
Your question doesn't include much context, but the line html_file.write(html_text) makes me think you're using Python's built-in file objects. The documentation for open() shows how to set the encoding, e.g.
html_file = open("file.html", mode="w", encoding="utf8")
Note that "the default encoding is platform dependent (whatever locale.getpreferredencoding() returns)", which is why you're getting Windows-1252 on Windows 7.
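A minimal sketch of that fix follows. The file name out.html and the sample text are assumptions made for illustration; the only essential part is the explicit encoding argument.

```python
import os
import tempfile

# Sample text containing U+2264 (assumed for illustration)
html_text = "<p>x \u2264 y</p>"

# Hypothetical output path inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "out.html")

# An explicit encoding avoids falling back to the platform default
with open(path, mode="w", encoding="utf-8") as html_file:
    html_file.write(html_text)

# Read the file back with the same encoding to confirm the round trip
with open(path, encoding="utf-8") as html_file:
    round_trip = html_file.read()

print(round_trip == html_text)
```

If the page is served to a browser, the HTML should also declare the same encoding so that the reader decodes it the way it was written.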

Related

python cleaning high or non-ascii out of a file [duplicate]

I'm working on a series of parsers where I get a bunch of tracebacks from my unit tests like:
File "c:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 112: character maps to <undefined>
The files are opened with open() with no extra arguments. Can I pass extra arguments to open() or use something in the codecs module to open these differently?
This came up with code that was written in Python 2 and converted to 3 with the 2to3 tool.
UPDATE: it turns out this is a result of feeding a zipfile into the parser. The unit test actually expects this to happen. The parser should recognize it as something that can't be parsed. So, I need to change my exception handling. In the process of doing that now.
Byte 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it maps to the U+0081 HIGH OCTET PRESET (HOP) control character. I can reproduce your error in Python 3.1 like this:
>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>
or with an actual file:
>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte
Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:
>>> open('test.txt', encoding='latin-1').read()
'\x81\n'
Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
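To make that difference concrete, here is a small sketch that decodes the same byte with both codecs. The byte 0x93 is chosen for illustration only; it is not taken from the asker's file.

```python
# 0x93 is a "smart quote" in Windows-1252 but a C1 control character in Latin-1
smart_quote_byte = b"\x93"

as_cp1252 = smart_quote_byte.decode("cp1252")    # U+201C LEFT DOUBLE QUOTATION MARK
as_latin1 = smart_quote_byte.decode("latin-1")   # U+0093, an invisible control character

# ascii() shows the code points without depending on the console encoding
print(ascii(as_cp1252), ascii(as_latin1))
```

Latin-1 maps every byte value to some code point, so decoding with it never fails, which also means it cannot tell you when the guess is wrong.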
You can relax the error handling.
For instance:
f = open(filename, encoding="...", errors="replace")
Or:
f = open(filename, encoding="...", errors="ignore")
See the docs.
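As a rough illustration of what the two error handlers do, the sketch below decodes an in-memory byte string rather than a file; the errors argument of open() accepts the same handler names. The sample bytes are an assumption.

```python
# Sample data with one byte (0x81) that cp1252 leaves unassigned
raw = b"caf\x81 ok"

# Default strict handling raises on the bad byte
strict_failed = False
try:
    raw.decode("cp1252")
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte
replaced = raw.decode("cp1252", errors="replace")

# "ignore" silently drops the bad byte
ignored = raw.decode("cp1252", errors="ignore")

print(strict_failed, ascii(replaced), ascii(ignored))
```

Both handlers lose information, so they suit log scraping or diagnostics better than data that has to round-trip.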
EDIT:
But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
As the traceback and error message indicate, the file in question is NOT encoded in cp1252.
If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.
You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?
A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
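A short sketch of that point, using made-up XML documents: when the parser receives bytes, it reads the encoding from the XML declaration, or assumes UTF-8 when there is none, so no encoding argument is needed in your code.

```python
import xml.dom.minidom

# Hypothetical document that declares ISO-8859-1; 0xE5 is "å" in that encoding
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><r>\xe5</r>'

# Hypothetical document with no declaration; the bytes are UTF-8 for "å"
undeclared = b"<r>\xc3\xa5</r>"

# Pass bytes, not str, so the parser does the decoding itself
text_declared = xml.dom.minidom.parseString(declared).documentElement.firstChild.data
text_undeclared = xml.dom.minidom.parseString(undeclared).documentElement.firstChild.data

print(ascii(text_declared), ascii(text_undeclared))
```

Opening the file in text mode and handing the parser an already decoded string bypasses this mechanism, which is one common way to end up with a decode error before the parser even runs.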
"others read directly as iterables" -- sample code please.
Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?
Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?
Note: please edit your question with clarifying information; don't answer in the comments.

unicode error printing \u2002 using Python 3

I am getting an error that Python can't encode the character \u2002 when trying to print a block of text:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 355: character maps to <undefined>
What I don't understand is that, as far as I can tell, this is a valid Unicode character (EN SPACE), so I'm not sure why it won't print.
For reference, the content was read in using file_content = open (file_name, encoding="utf8")
Works for me (on a Linux terminal):
>>> print("\u2002")
The output looks empty because U+2002 is EN SPACE, which is invisible.
If you are on Windows, however, you are likely using a 125x code page in your terminal and...
>>> "\u2002".encode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/cp1250.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 0: character maps to <undefined>
There is no problem using that character in Unicode (as a unicode string in Python). But when you write it out ("print it") it needs to be encoded into an encoding. Some encodings don't support some characters. The encoding you are using to print does not support that particular character.
Probably you are using the Windows console, which typically uses a code page like 850 or 437 that doesn't include this character.
There are ways to change the Windows console code page (chcp), or you could try IDLE or some other IDE.
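If changing the console is not an option, one workaround is to escape whatever the console encoding cannot represent before printing. The helper name to_printable below is hypothetical; it is a sketch, not part of any library.

```python
import sys

def to_printable(text, encoding=None):
    """Return text with characters the target encoding lacks shown as backslash escapes."""
    # Fall back to ASCII if stdout reports no encoding (for example when replaced by a test harness)
    encoding = encoding or sys.stdout.encoding or "ascii"
    # backslashreplace turns unencodable characters into escapes instead of raising
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# On a cp1252 console the EN SPACE is shown as an escape rather than raising
print(to_printable("a\u2002b", encoding="cp1252"))
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="backslashreplace") should have a similar effect for every print call without a wrapper function.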

Why does str.encode('utf-8') produce UnicodeDecodeError in my python script?

When running the following code (which just prints out file names):
print filename
It throws the following error:
File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
So to fix this, I tried changing print filename to print filename.encode('utf-8') which didn't fix the problem.
The script only fails when trying to read a filename such as Coé.jpg.
Any ideas how I can modify filename so the script continues to work when it comes across a special character?
NB. I'm a python noob
filename is already encoded. It is already a byte string and doesn't need encoding again.
But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:
>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').
You probably need to read up on Python and Unicode. I recommend:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
The rule of thumb is: decode as early as you can, encode as late as you can. That means when you receive data, decode to Unicode objects, when you need to pass that information to something else, encode only then. Many APIs can do the decoding and encoding as part of their job; print will encode to the codec used by the terminal, for example.
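The rule can be sketched as a small function. This version is written for Python 3, where bytes and str are separate types, unlike the Python 2 session above; the function name shout and its default encodings are assumptions for illustration.

```python
def shout(data: bytes, source_encoding: str = "utf-8", target_encoding: str = "utf-8") -> bytes:
    """Upper-case encoded text, decoding on the way in and encoding on the way out."""
    text = data.decode(source_encoding)    # decode as early as possible
    text = text.upper()                    # all processing happens on str
    return text.encode(target_encoding)    # encode as late as possible

result = shout("Coé.jpg".encode("utf-8"))
print(result)
```

Keeping every intermediate value as str means the only places an encoding error can occur are the two boundaries, which makes such errors much easier to locate.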

UnicodeDecodeError

I got a UnicodeDecodeError:
'utf8' codec can't decode byte 0xe5 in position 1923: invalid continuation byte
I have used the Danish letter "å" in my template. How can I solve this so that I can use non-English letters in my Django project and database?
I can get a similar error (mentioning the same byte value) doing this:
>>> 'å'.encode('latin-1')
b'\xe5'
>>> _.decode('utf-8')
Traceback (most recent call last):
File "<pyshell#18>", line 1, in <module>
_.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
This implies that your data is encoded in latin-1 rather than utf-8. In general, there are two solutions to this: if you have control over your input data, re-save it as UTF-8. Otherwise, when you read the data in Python, set the encoding to latin-1. For a django template, you should be able to use the first - the editor you use should have an 'encoding' option somewhere, change it to utf-8, resave, and everything should work.
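If the editor offers no encoding option, the re-save can also be done with a few lines of Python. This is a sketch under the assumption that the file really is Latin-1; the helper name resave_as_utf8 and the template file name are hypothetical.

```python
import os
import tempfile

def resave_as_utf8(path, source_encoding="latin-1"):
    """Rewrite a text file in place as UTF-8, assuming it is currently in source_encoding."""
    # newline="" keeps the original line endings untouched in both directions
    with open(path, encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

# Demonstration on a throwaway file containing "å" encoded as Latin-1
demo_path = os.path.join(tempfile.mkdtemp(), "template.html")
with open(demo_path, "wb") as f:
    f.write(b"\xe5")

resave_as_utf8(demo_path)

with open(demo_path, "rb") as f:
    converted = f.read()
print(converted)
```

Run it only once per file: converting a file that is already UTF-8 as if it were Latin-1 produces mojibake rather than an error.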
This helped me: https://stackoverflow.com/a/23278373/2571607
Basically, open C:\Python27\Lib\mimetypes.py and replace
default_encoding = sys.getdefaultencoding()
with
if sys.getdefaultencoding() != 'gbk':
    reload(sys)
    sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()

printing hebrew in python works in eclipse but not shell

I have some code that converts a Unicode-escaped representation of a Hebrew text file into Hebrew for display,
for example:
f = open(sys.argv[1])
for line in f:
    print eval('u"' + line +'"')
This works fine when I run it in PyDev (Eclipse), but when I run it from the command line, I get
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)
An example line from the input file is:
\u05d9\u05d5\u05dd
What is the problem? How can I solve this?
Do not use eval(); instead use the unicode_escape codec to interpret that data:
for line in f:
    line = line.decode('unicode_escape')
The unicode_escape encoding interprets \uabcd character sequences the same way Python would when parsing a unicode literal in the source code:
>>> '\u05d9\u05d5\u05dd'.decode('unicode_escape')
u'\u05d9\u05d5\u05dd'
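The session above is Python 2, where str has a decode() method. In Python 3, str has no decode(), so a rough equivalent (a sketch, assuming the input line contains only ASCII escape sequences) goes through bytes first:

```python
# One line as it would be read from the input file: literal backslash-u escapes
line = r"\u05d9\u05d5\u05dd"

# Encode to ASCII bytes, then let the unicode_escape codec interpret the escapes
decoded = line.encode("ascii").decode("unicode_escape")

# ascii() shows the resulting code points regardless of the console encoding
print(ascii(decoded), len(decoded))
```

The ASCII step raises UnicodeEncodeError if the line already contains non-ASCII characters, which is a useful signal that the input is not purely escaped text.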
The exception you see is not caused by the eval() call though; I suspect it is caused by the attempt to print the result. Python encodes unicode values automatically when printing and detects which encoding the current terminal uses.
Your Eclipse output window uses a different encoding from your terminal; if the latter is configured to support Latin-1 then you'll see that exact exception, as Python tries to encode Hebrew codepoints to an encoding that doesn't support those:
>>> u'\u05d9\u05d5\u05dd'.encode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
The solution is to reconfigure your terminal (UTF-8 would be a good choice), or to not print unicode values with codepoints that cannot be encoded to Latin-1.
If you are redirecting output from Python to a file, then Python cannot determine the output encoding automatically. In that case you can use the PYTHONIOENCODING environment variable to tell Python what encoding to use for standard I/O:
PYTHONIOENCODING=utf-8 python yourscript.py > outputfile.txt
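The same choice can be made inside the script instead of through the environment. The sketch below is Python 3 and uses an in-memory buffer as a stand-in for sys.stdout.buffer so that it can run anywhere; in a real script the wrapper would go around sys.stdout.buffer.

```python
import io

# Stand-in for sys.stdout.buffer (assumption made so the example is self-contained)
buffer = io.BytesIO()

# Wrap the byte stream with an explicit encoding; newline="\n" disables newline translation
out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")

print("\u05d9\u05d5\u05dd", file=out)
out.flush()

written = buffer.getvalue()
print(written)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") achieves the same thing without building a wrapper by hand.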
Thank you, this solved my problem.
line.decode('unicode_escape')
did the trick.
Followup - This now works, but if I try to send the output to a file:
python myScript.py > textfile.txt
The file itself has the error:
'ascii' codec can't encode characters in position 42-44: ordinal not in range(128)
