Error Encoding non-BMP characters - python

I've developed a little program in Python 3.4, but when I try to run it, at the end it says:
File "C:\Python34\lib\idlelib\PyShell.py", line 1352, in write
return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 39559-39559: Non-BMP character not supported in Tk
I've tried everything, but found nothing. Help, please!

I presume you did the equivalent of the following.
>>> print('\U00011111')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
print('\U00011111')
File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U00011111' in position 0: Non-BMP character not supported in Tk
The problem is as stated: Idle uses the tkinter interface to tcl/tk and tk cannot display non-BMP supplementary chars (ord(char) > 0xFFFF).
Saving a string with non-BMP chars to a file will work fine as long as you encode with utf-8 (or -16, or -32).
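For example, this minimal sketch (the file name astral.txt is illustrative) round-trips a non-BMP string through a UTF-8 file without any error:
s = 'BMP text plus \U00011111'
with open('astral.txt', 'w', encoding='utf-8') as f:
    f.write(s)
with open('astral.txt', encoding='utf-8') as f:
    assert f.read() == s  # round-trips fine; only displaying it in Tk fails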
On Windows, the console interpreter gives the same error with 'UCS-2' replaced by 'charmap'. The console interpreter is actually worse in that it raises an error even for some BMP chars, depending on the code page being used. I do not know what the situation is on other systems.
EDIT
I forgot the best alternative, at least on Windows. Either of the following will print any string on any ASCII terminal.
>>> repr('\U00011111')
"'\U00011111'"
>>> ascii('\U00011111')
"'\\U00011111'"
repr() does not double backslashes when echoed, ascii() does. These escape more chars than needed for Idle, but will not raise an exception at the >>> prompt. However, for reasons I do not understand, print(repr('\U00011111')) fails, so print(ascii(s)) is needed within a program to print s.
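A short sketch of that workaround inside a program (the sample string is illustrative):
s = 'plain text with \U00011111 inside'
print(ascii(s))  # the non-BMP character comes out escaped, so nothing raises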

Related

Which encoding should Python open function use?

I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:
import sys
print(sys.getdefaultencoding())
f = open("input.txt", "r")
f.readline()
This script fails reading the first line even if the right quotation mark is not on the first line. The exception looks like this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>
The input file is in UTF-8 encoding; I've tried both with and without a BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.
The script fails on a machine with Python 3.6.5 but works fine on another with Python 3.6.0. Both machines run Windows.
My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads a file that I don't wish to change. What could differ between these machines other than the Python patch version? Why does vanilla open use cp1252 if the system default is utf-8?
As clearly stated in Python's open documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.
Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
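A minimal sketch of the fix for the question's script, assuming the file really is UTF-8 (utf-8-sig transparently strips a BOM if one is present and otherwise behaves like plain utf-8):
with open("input.txt", "r", encoding="utf-8-sig") as f:
    first_line = f.readline()
print(first_line)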

Python 2.7: Printing out a decoded string

I have a file called Abrázame.txt.
I want to decode this so that Python understands what the 'á' character is and prints Abrázame.txt.
This is the code I have in a scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error I get from the above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the UTF-16 character set and it does indeed have the 'á' character in it. So why can't this string be decoded with UTF-16?
Also, I know that 'latin-1' will work and produce the string I'm looking for. However, since this is for an automation project, I want to ensure that any filename with any registered character can be decoded and used for other things within the project, for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs (mind you, I believe there are 93 codecs) to find whichever one can decode the string the best way of getting the result I'm looking for? I figure there is something far better than that solution.
You want to decode at the edge when you first read a string so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import os
import sys

fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
                 os.listdir(r'C:\Test\AutoTest')[0].decode(encoding=fs_encoding, errors='strict'))
print(s)
Note that once you decode you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. It's an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of its API for many years, Python 2 is single-byte character based and doesn't use them.
Long live Python 3.
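A hedged aside (not part of the answer above, and assuming the directory from the question exists): in Python 2, passing a unicode path to os.listdir usually makes it return unicode file names directly, which sidesteps the manual decode at the edge.
import os
names = os.listdir(u'C:\\Test\\AutoTest')   # u'' argument -> unicode names back
s = os.path.join(u'C:\\Test\\AutoTest', names[0])
print(s)  # already unicode; no .decode() needed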

unicode error printing \u2002 using Python 3

I am getting an error that Python can't encode the character \u2002 when trying to print a block of text:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 355: character maps to <undefined>
What I don't understand is that, as far as I can tell, this is a valid Unicode character (the EN SPACE character), so I'm not sure why it won't print.
For reference, the content was read in using file_content = open(file_name, encoding="utf8")
Works for me! (on a Linux terminal)
>>> print("\u2002")
It's invisible, as it's an EN SPACE.
If you are on Windows, however, you are likely using codepage 125X in your terminal, and...
>>> "\u2002".encode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/cp1250.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 0: character maps to <undefined>
There is no problem using that character in Unicode (as a unicode string in Python). But when you write it out ("print it"), it needs to be encoded to some specific encoding. Some encodings don't support some characters, and the encoding you are using to print does not support that particular character.
Probably you are using the Windows console which typically uses a codepage like 850 or 437 which doesn't include this character.
There are ways to change the Windows console code page (chcp), or you could try IDLE or some other IDE.
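If you just need the print not to blow up, one hedged workaround (the sample text and the utf-8 fallback are illustrative) is to re-encode with a lossy error handler before printing:
import sys

text = "indented with\u2002an EN SPACE"
enc = sys.stdout.encoding or "utf-8"   # console code page, e.g. cp850 or cp437
print(text.encode(enc, errors="replace").decode(enc))  # prints '?' instead of raising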

Can't handle strings in Windows

I have written a Python 2.7 script on Linux and it worked fine.
It uses
os.listdir(os.getcwd())
to read folder names as variables and uses them later in some parts.
On Linux I used a simple conversion trick to manually convert the non-ASCII characters into ASCII ones:
str(str(tfile)[0:-4]).replace('\xc4\xb0', 'I').replace("\xc4\x9e", 'G').replace("\xc3\x9c", 'U').replace("\xc3\x87", 'C').replace("\xc3\x96", 'O').replace("\xc5\x9e", 'S') ,str(line.split(";")[0]).replace(" ", "").rjust(13, "0"),a))
This approach failed on Windows. I tried:
udata = str(str(str(tfile)[0:-4])).decode("UTF-8")
asci = udata.encode("ascii","ignore")
Which also failed with the following:
DEM¦-RTEPE  # the error occurred at this string
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python27\lib\lib-tk\Tkinter.py", line 1532, in __call__
return self.func(*args)
File "C:\Users\benhur.satir\workspace\Soykan\tkinter.py", line 178, in SparisDerle
udata = str(str(str(tfile)[0:-4])).decode("utf=8")
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa6 in position 3: invalid start byte
How can I handle such characters on Windows?
NOTE: Leaving them as UTF-8 causes the xlswriter module to fail, so I need to convert them to ASCII. Missing characters are not desirable, but acceptable.
Windows does not like UTF-8. You probably get the folder names in the default system encoding, generally win1252 (a variant of ISO-8859-1).
That's the reason why you could not find UTF-8 sequences in the file names. By the way, the exception says it found a character with code 0xa6, which in win1252 is '¦' (BROKEN BAR).
It does not say exactly which encoding your Windows system uses, as that depends on the localization, but it proves the data is not UTF-8 encoded.
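A hedged sketch of the edge conversion this diagnosis implies, for Python 2 on Windows (the helper name, the sample bytes, and the accent-stripping step are assumptions about what "convert to ASCII" should do for Turkish names):
import locale
import unicodedata

def to_ascii(raw_name):
    codepage = locale.getpreferredencoding()   # e.g. cp1254 on Turkish Windows
    udata = raw_name.decode(codepage)          # decode with the code page, not UTF-8
    # NFKD splits letters such as I-with-dot-above into a base letter plus a
    # combining mark, which the 'ignore' handler then drops.
    return unicodedata.normalize('NFKD', udata).encode('ascii', 'ignore')

print(to_ascii('DEM\xddRTEPE'))   # -> DEMIRTEPE when the code page is cp1254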
How about this? You can use these characters for an optional .replace().
The string module provides sets of characters that can be used:
>>> import string
>>> string.digits+string.punctuation
'0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>
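A hedged sketch of how that could be used as a whitelist (the sample name is illustrative; anything outside the set is simply dropped, which matches the "missing characters are acceptable" note in the question):
import string

allowed = set(string.ascii_letters + string.digits + string.punctuation + ' ')
name = u'DEM\u0130RTEPE 2017'   # illustrative name containing Turkish I-with-dot (U+0130)
cleaned = ''.join(c for c in name if c in allowed)
print(cleaned)                  # DEMRTEPE 2017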

A UnicodeDecodeError that occurs with json in python on Windows, but not Mac

On windows, I have the following problem:
>>> string = "Don´t Forget To Breathe"
>>> import json,os,codecs
>>> f = codecs.open("C:\\temp.txt","w","UTF-8")
>>> json.dump(string,f)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\json\__init__.py", line 180, in dump
for chunk in iterable:
File "C:\Python26\lib\json\encoder.py", line 294, in _iterencode
yield encoder(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 3-5: invalid data
(Notice the non-ascii apostrophe in the string.)
However, my friend, on his Mac (also using Python 2.6), can run through this like a breeze:
> string = "Don´t Forget To Breathe"
> import json,os,codecs
> f = codecs.open("/tmp/temp.txt","w","UTF-8")
> json.dump(string,f)
> f.close(); open('/tmp/temp.txt').read()
'"Don\\u00b4t Forget To Breathe"'
Why is this? I've also tried using UTF-16 and UTF-32 with json and codecs, but to no avail.
What does repr(string) show on each machine? On my Mac the apostrophe shows as \xc2\xb4 (utf8 coding, 2 bytes) so of course the utf8 codec can deal with it; on your Windows it clearly isn't doing that since it talks about three bytes being a problem - so on Windows you must have some other, non-utf8 encoding set for your console.
Your general problem is that, in Python pre-3, you should not enter a byte string ("...." as you used, rather than u"....") with non-ASCII content (unless it is written with escape sequences). Depending on how the session is set up, this may fail directly, or it may produce bytes, according to whatever codec is set as the default, that are not the exact bytes you expect (because you're not aware of the exact default codec in use). Use explicit Unicode literals:
string = u"Don´t Forget To Breathe"
and you should be OK (or if you have any problem it will emerge right at the time of this assignment, at which point we may go into the issue of "how do I set a default encoding for my interactive sessions", if that's what you require).
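A hedged sketch of the suggested fix applied to the Windows session from the question (Python 2.6; json escapes the non-ASCII apostrophe itself because ensure_ascii defaults to True):
import codecs, json

string = u"Don\u00b4t Forget To Breathe"   # explicit unicode literal
f = codecs.open("C:\\temp.txt", "w", "UTF-8")
json.dump(string, f)
f.close()
print(open("C:\\temp.txt").read())         # '"Don\u00b4t Forget To Breathe"'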
