unicode error printing \u2002 using Python 3

I am getting an error saying that Python can't encode the character \u2002 when trying to print a block of text:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 355: character maps to <undefined>
What I don't understand is that, as far as I can tell, this is a valid Unicode character (EN SPACE), so I'm not sure why it won't print.
For reference, the content was read in using file_content = open(file_name, encoding="utf8")

Works for me (on a Linux terminal):
>>> print("\u2002")
The output looks empty because EN SPACE is an invisible character.
If you are on Windows, however, your terminal is likely using a code page such as cp125x, and encoding to that fails:
>>> "\u2002".encode("cp1250")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/encodings/cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 0: character maps to <undefined>

There is no problem using that character in Unicode (as a str in Python 3). But when you write it out ("print it"), it has to be encoded into bytes using some encoding. Some encodings don't support some characters, and the encoding you are using to print does not support that particular character.
You are probably using the Windows console, which typically uses a code page like 850 or 437 that doesn't include this character.
There are ways to change the Windows console code page (chcp), or you could try running the code in IDLE or some other IDE.
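If changing the console code page is not an option, one possible workaround is to escape whatever the target encoding cannot represent before printing. The following is only a sketch for Python 3: the helper name printable and the choice of cp437 as the console encoding are assumptions made for illustration, not something taken from the question.

```python
def printable(text, encoding, errors="backslashreplace"):
    """Return text with characters the encoding lacks replaced by escapes.

    `printable` is a hypothetical helper; `encoding` should be whatever
    the console actually uses (cp437 below is only an assumption).
    """
    # Encode with an error handler instead of the default "strict",
    # then decode again so the result is a normal str.
    return text.encode(encoding, errors=errors).decode(encoding)


sample = "en\u2002space"
print(printable(sample, "cp437"))             # EN SPACE shown as a \u2002 escape
print(printable(sample, "cp437", "replace"))  # EN SPACE shown as "?"
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another option, provided the terminal can actually display UTF-8.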

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u2264'

I'm using Python 3.6 on Windows 7 with Django 1.9.
I get this error when I run my code.
In my code I'm parsing XML data to write an HTML page.
Some character cannot be encoded properly, which is why an error is thrown.
The character \u2264 (less than or equal to) is the root cause of the error.
My question is how to encode this properly in Python 3.
Detailed error log:
Traceback (most recent call last):
  File "C:\Dev\EXE\TEMP\cookie\crumbs\views.py", line 1520, in parser
    html_file.write(html_text)
  File "C:\Users\Cookie1\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 389078: character maps to <undefined>

The error message indicates that you're trying to encode to the Windows-1252 character encoding. That encoding has no representation of the less-than-or-equal symbol:
>>> "\u2264".encode("cp1252")
Traceback... [as above]
The answer is to use UTF-8, which can encode every Unicode character, instead of Windows-1252, a single-byte encoding that covers only a small set of characters.
Your question doesn't include much context, but the line html_file.write(html_text) makes me think you're using Python's built-in file objects. The documentation for open() shows how to set the encoding, e.g.
html_file = open("file.html", mode="w", encoding="utf8")
Note that "the default encoding is platform dependent (whatever locale.getpreferredencoding() returns)", which is why you're getting Windows-1252 on Windows 7.
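As a rough illustration of the point about open(), the sketch below writes a string containing \u2264 with an explicit encoding and reads it back. The html_text content and the temporary file location are made up for the example; they are not from the question.

```python
import os
import tempfile

html_text = "<p>x \u2264 y</p>"  # hypothetical content containing U+2264

# Use a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "file.html")

# An explicit encoding avoids the platform-dependent default (cp1252 here).
with open(path, mode="w", encoding="utf-8") as html_file:
    html_file.write(html_text)

# Read back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as html_file:
    round_trip = html_file.read()

print(round_trip == html_text)
```

If the file is served as a web page, it probably also needs to declare UTF-8 (for instance in a meta charset tag) so that browsers decode it the same way.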

Error while importing Wikipedia using pip

I'm using Python 2.7 and trying to work with the following code
import wikipedia
input = raw_input("Question: ")
print wikipedia.summary(input)
I see this error when the code is run:
Traceback (most recent call last):
  File "wik.py", line 5, in
    print wikipedia.summary(input)
  File "C:\Anaconda2\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 38: character maps to undefined
How can I fix this? Thanks in advance.

Python 2 defaults to ASCII, which only maps characters between \u0000 and \u007F. Your traceback shows the console codec cp437, which has no mapping for this character either. You need to use a different encoding in order to properly output this character (\u2013 is an EN DASH) and many others outside of ASCII.
Using UTF-8 should work for you, and I believe this print statement will properly output text:
print wikipedia.summary(input).encode("utf8")
For more information on this, check this similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128).
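To see which codecs can represent a given character, a small check like the following may help. It is a sketch written for Python 3 (the question uses Python 2.7), and can_encode is a hypothetical helper name, not part of any library.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# U+2013 is EN DASH, the character from the traceback.
for name in ("ascii", "cp437", "cp1252", "utf-8"):
    print(name, can_encode("\u2013", name))
```

ASCII and cp437 have no slot for EN DASH, while cp1252 and UTF-8 do, which is why the same print can succeed on one machine and fail on another.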

Why does str.encode('utf-8') produce UnicodeDecodeError in my python script?

When running the following code (which just prints out file names):
print filename
It throws the following error:
File "myscript.py", line 78, in __listfilenames
print filename
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
So to fix this, I tried changing print filename to print filename.encode('utf-8'), which didn't fix the problem.
The script only fails when trying to read a filename such as Coé.jpg.
Any ideas how I can modify filename so the script continues to work when it comes across a special character?
NB. I'm a python noob

filename is already encoded. It is already a byte string and doesn't need encoding again.
But since you asked it to be encoded, Python first has to decode it for you, and it can only do that with the default ASCII encoding. That implicit decoding fails:
>>> 'Coé.jpg'
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.decode('utf8')
u'Co\xe9.jpg'
>>> 'Coé.jpg'.decode('utf8').encode('utf8')
'Co\xc3\xa9.jpg'
>>> 'Coé.jpg'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If you wanted encoded bytestrings, you don't have to do any encoding at all. Remove the .encode('utf8').
You probably need to read up on Python and Unicode. I recommend:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO
The rule of thumb is: decode as early as you can, encode as late as you can. That means when you receive data, decode to Unicode objects, when you need to pass that information to something else, encode only then. Many APIs can do the decoding and encoding as part of their job; print will encode to the codec used by the terminal, for example.
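The rule of thumb can be sketched as follows. This is Python 3 code, where bytes and text are separate types, and it assumes the incoming bytes are UTF-8; the variable names are illustrative only.

```python
raw_name = b"Co\xc3\xa9.jpg"      # bytes as received; assumed to be UTF-8

name = raw_name.decode("utf-8")   # decode once, at the input boundary
label = name.upper()              # do all processing on text (str)

encoded = label.encode("utf-8")   # encode once, at the output boundary
print(encoded)                    # repr of the bytes, safe on any console
```

In Python 3 the bytes type has no encode() method at all, so the implicit ASCII decode described above cannot happen by accident.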

UnicodeDecodeError

I got a UnicodeDecodeError:
'utf8' codec can't decode byte 0xe5 in position 1923: invalid continuation byte
I have used the Danish letter "å" in my template. How can I solve this problem so that I can use non-English letters in my Django project and database?

I can get a similar error (mentioning the same byte value) doing this:
>>> 'å'.encode('latin-1')
b'\xe5'
>>> _.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    _.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 0: unexpected end of data
This implies that your data is encoded in latin-1 rather than UTF-8. In general, there are two solutions to this: if you have control over your input data, re-save it as UTF-8. Otherwise, when you read the data in Python, set the encoding to latin-1. For a Django template, you should be able to use the first: the editor you use should have an 'encoding' option somewhere; change it to UTF-8, re-save, and everything should work.
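For the second option (data you cannot re-save), a fallback decode might look like the sketch below. It is Python 3, decode_with_fallback is a hypothetical helper, and the choice of latin-1 as the fallback is an assumption based on the byte 0xe5 in the error.

```python
def decode_with_fallback(data, primary="utf-8", fallback="latin-1"):
    """Try the primary codec first, then the fallback (hypothetical helper)."""
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises,
        # but the result is only right if the data really is latin-1.
        return data.decode(fallback)


print(ascii(decode_with_fallback(b"\xe5")))      # latin-1 bytes for "å"
print(ascii(decode_with_fallback(b"\xc3\xa5")))  # UTF-8 bytes for "å"
```

Both calls yield the same one-character string, but guessing like this is a stopgap; re-saving the source as UTF-8 remains the cleaner fix.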

This helped me: https://stackoverflow.com/a/23278373/2571607
Basically, open C:\Python27\Lib\mimetypes.py and replace
default_encoding = sys.getdefaultencoding()
with
if sys.getdefaultencoding() != 'gbk':
    reload(sys)
    sys.setdefaultencoding('gbk')
default_encoding = sys.getdefaultencoding()

printing hebrew in python works in eclipse but not shell

I have some code that converts a Unicode-escaped representation of a Hebrew text file into Hebrew for display.
For example:
f = open(sys.argv[1])
for line in f:
    print eval('u"' + line +'"')
This works fine when I run it in PyDev (Eclipse), but when I run it from the command line, I get
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 9-10: ordinal not in range(256)
An example line from the input file is:
\u05d9\u05d5\u05dd
What is the problem? How can I solve this?

Do not use eval(); instead use the unicode_escape codec to interpret that data:
for line in f:
    line = line.decode('unicode_escape')
The unicode_escape encoding interprets \uabcd character sequences the same way Python would when parsing a unicode literal in the source code:
>>> '\u05d9\u05d5\u05dd'.decode('unicode_escape')
u'\u05d9\u05d5\u05dd'
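The snippets above are Python 2. In Python 3, str has no decode() method, so a rough equivalent is to go through bytes first. This is a sketch that assumes each line contains only ASCII characters (the escape sequences themselves), as in the sample input.

```python
# A raw string keeps the backslashes literal, as they would be read from the file.
line = r"\u05d9\u05d5\u05dd" + "\n"

# Strip the newline, turn the ASCII text into bytes, then let the
# unicode_escape codec interpret the \uXXXX sequences.
text = line.rstrip("\n").encode("ascii").decode("unicode_escape")

print(len(text))    # number of decoded characters
print(ascii(text))  # escaped repr, safe to print on any console
```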
The exception you see is not caused by the eval() call, though; I suspect it is caused by the attempt to print the result. Python tries to encode unicode values automatically, using the encoding it detects for the current terminal.
Your Eclipse output window uses a different encoding from your terminal; if the latter is configured to support Latin-1 then you'll see that exact exception, as Python tries to encode Hebrew codepoints to an encoding that doesn't support those:
>>> u'\u05d9\u05d5\u05dd'.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
The solution is to reconfigure your terminal (UTF-8 would be a good choice), or to not print unicode values with codepoints that cannot be encoded to Latin-1.
If you are redirecting output from Python to a file, then Python cannot determine the output encoding automatically. In that case you can use the PYTHONIOENCODING environment variable to tell Python what encoding to use for standard I/O:
PYTHONIOENCODING=utf-8 python yourscript.py > outputfile.txt
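An in-script alternative to the environment variable is to wrap the binary output stream with an explicit encoding. The sketch below is Python 3 and uses an in-memory BytesIO as a stand-in for a redirected stdout; that stand-in is an assumption made so the example is self-contained.

```python
import io

buffer = io.BytesIO()  # stands in for a redirected, binary stdout

# The text wrapper encodes everything written to it as UTF-8,
# regardless of what encoding Python would otherwise detect.
out = io.TextIOWrapper(buffer, encoding="utf-8", newline="\n")

print("\u05d9\u05d5\u05dd", file=out)
out.flush()

data = buffer.getvalue()
print(data)  # the UTF-8 bytes that would end up in the output file
```

In a real script the same wrapper could be placed around sys.stdout.buffer instead of the BytesIO object.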

Thank you, this solved my problem.
line.decode('unicode_escape')
did the trick.
Followup - This now works, but if I try to send the output to a file:
python myScript.py > textfile.txt
The file itself has the error:
'ascii' codec can't encode characters in position 42-44: ordinal not in range(128)
