I have a large project where problematic implicit Unicode conversions (coercions) are used in various places, e.g. in the form of:
someDynamicStr = "bar" # could come from various sources
# works
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
someDynamicStr = "\xff" # uh-oh
# raises UnicodeDecodeError
u"foo" + someDynamicStr
u"foo{}".format(someDynamicStr)
(Possibly other forms as well.)
Now I would like to track down those usages, especially those in actively used code.
It would be great if I could easily replace the unicode constructor with a wrapper which checks whether the input is of type str and the encoding/errors parameters are set to the default values and then notifies me (prints traceback or such).
/edit:
While not directly related to what I am looking for I came across this gloriously horrible hack for how to make the decode exception go away altogether (the decode one only, i.e. str to unicode, but not the other way around, see https://mail.python.org/pipermail/python-list/2012-July/627506.html).
I don't plan on using it but it might be interesting for those battling problems with invalid Unicode input and looking for a quick fix (but please think about the side effects):
import codecs
codecs.register_error("strict", codecs.ignore_errors)
codecs.register_error("strict", lambda x: (u"", x.end)) # alternatively
(An internet search for codecs.register_error("strict" revealed that apparently it's used in some real projects.)
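If the goal is to be notified rather than to silence the error, a less invasive variation is to register a handler under a new name instead of overriding "strict". The following is only a sketch under assumptions: the handler name "log_and_replace" and the seen_errors list are made up for illustration, and the handler only runs for calls that pass errors="log_and_replace" explicitly, so it does not catch implicit conversions.

```python
import codecs

# Hypothetical collector for the problems we see; not part of any library.
seen_errors = []

def log_and_replace(exc):
    """Record a decode error and substitute U+FFFD for the bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        # Only decoding is handled in this sketch; re-raise everything else.
        raise exc
    seen_errors.append((exc.encoding, exc.start, exc.end))
    # Return the replacement text and the position where decoding resumes.
    return (u"\ufffd", exc.end)

# Register under a new name so that "strict" keeps its normal behavior.
codecs.register_error("log_and_replace", log_and_replace)

result = b"foo\xff".decode("ascii", "log_and_replace")
```

Because "strict" is left untouched, code that does not opt in to the new handler behaves exactly as before.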
/edit #2:
For explicit conversions I made a snippet with the help of a SO post on monkeypatching:
class PatchedUnicode(unicode):
    def __init__(self, obj=None, encoding=None, *args, **kwargs):
        if encoding in (None, "ascii", "646", "us-ascii"):
            print("Problematic unicode() usage detected!")
        super(PatchedUnicode, self).__init__(obj, encoding, *args, **kwargs)

import __builtin__
__builtin__.unicode = PatchedUnicode
This only affects explicit conversions using the unicode() constructor directly so it's not something I need.
/edit #3:
The thread "Extension method for python built-in types!" makes me think that it might actually not be easily possible (in CPython at least).
/edit #4:
It's nice to see many good answers here, too bad I can only give out the bounty once.
In the meantime I came across a somewhat similar question, at least in the sense of what the person tried to achieve: Can I turn off implicit Python unicode conversions to find my mixed-strings bugs?
Please note, though, that throwing an exception would not have been OK in my case. I was looking for something that points me to the different locations of problematic code (e.g. by printing something) but that does not exit the program or change its behavior, because that way I can prioritize what to fix.
On another note, the people working on the Mypy project (who include Guido van Rossum) might also come up with something similarly helpful in the future; see the discussions at https://github.com/python/mypy/issues/1141 and, more recently, https://github.com/python/typing/issues/208.
/edit #5
I also came across the following but haven't had the time to test it yet: https://pypi.python.org/pypi/unicode-nazi
You can register a custom encoding which prints a message whenever it's used:
Code in ourencoding.py:
import sys
import codecs
import traceback
# Define a function to print out a stack frame and a message:
def printWarning(s):
    sys.stderr.write(s)
    sys.stderr.write("\n")
    l = traceback.extract_stack()
    # cut off the frames pointing to printWarning and our_encode
    l = traceback.format_list(l[:-2])
    sys.stderr.write("".join(l))

# Define our encoding:
originalencoding = sys.getdefaultencoding()

def our_encode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.encode(s, originalencoding, errors), len(s))

def our_decode(s, errors='strict'):
    printWarning("Default encoding used")
    return (codecs.decode(s, originalencoding, errors), len(s))

def our_search(name):
    if name == 'our_encoding':
        return codecs.CodecInfo(
            name='our_encoding',
            encode=our_encode,
            decode=our_decode)
    return None

# register our search and set the default encoding:
codecs.register(our_search)
reload(sys)
sys.setdefaultencoding('our_encoding')
If you import this file at the start of your script, you'll see warnings for implicit conversions:
#!python2
# coding: utf-8
import ourencoding
print("test 1")
a = "hello " + u"world"
print("test 2")
a = "hello ☺ " + u"world"
print("test 3")
b = u" ".join(["hello", u"☺"])
print("test 4")
c = unicode("hello ☺")
output:
test 1
test 2
Default encoding used
File "test.py", line 10, in <module>
a = "hello ☺ " + u"world"
test 3
Default encoding used
File "test.py", line 13, in <module>
b = u" ".join(["hello", u"☺"])
test 4
Default encoding used
File "test.py", line 16, in <module>
c = unicode("hello ☺")
It's not perfect: as test 1 shows, if the converted string contains only ASCII characters, you sometimes won't see a warning.
What you can do is the following:
First create a custom encoding. I will call it "lascii" for "logging ASCII":
import codecs
import traceback
def lascii_encode(input, errors='strict'):
    print("ENCODED:")
    traceback.print_stack()
    return codecs.ascii_encode(input)

def lascii_decode(input, errors='strict'):
    print("DECODED:")
    traceback.print_stack()
    return codecs.ascii_decode(input)

class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return lascii_encode(input, errors)
    def decode(self, input, errors='strict'):
        return lascii_decode(input, errors)

class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        print("Incremental ENCODED:")
        traceback.print_stack()
        return codecs.ascii_encode(input)

class IncrementalDecoder(codecs.IncrementalDecoder):
    def decode(self, input, final=False):
        print("Incremental DECODED:")
        traceback.print_stack()
        return codecs.ascii_decode(input)

class StreamWriter(Codec, codecs.StreamWriter):
    pass

class StreamReader(Codec, codecs.StreamReader):
    pass

def getregentry():
    return codecs.CodecInfo(
        name='lascii',
        encode=lascii_encode,
        decode=lascii_decode,
        incrementalencoder=IncrementalEncoder,
        incrementaldecoder=IncrementalDecoder,
        streamwriter=StreamWriter,
        streamreader=StreamReader,
    )
This does basically the same as the ASCII codec, except that it prints a message and the current stack trace every time it encodes unicode to lascii or decodes lascii to unicode.
Now you need to make it available to the codecs module so that it can be found by the name "lascii". For this you need to create a search function that returns the lascii-codec when it's fed with the string "lascii". This is then registered to the codecs module:
def searchFunc(name):
    if name == "lascii":
        return getregentry()
    else:
        return None
codecs.register(searchFunc)
The last thing that is now left to do is to tell the sys module to use 'lascii' as default encoding:
import sys
reload(sys) # necessary, because sys.setdefaultencoding is deleted on start of Python
sys.setdefaultencoding('lascii')
Warning:
This uses some deprecated or otherwise unrecommended features. It might not be efficient or bug-free. Do not use in production, only for testing and/or debugging.
Just add:
from __future__ import unicode_literals
at the beginning of your source code files. It has to be the first import, and it has to be in every affected source file; then the headache of using unicode in Python 2.7 goes away. If you didn't do anything super weird with strings, it should get rid of the problem out of the box.
Check out the following Copy&Paste from my console - I tried with the sample from your question:
user#linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>> u"foo{}".format(someDynamicStr)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
>>>
And now with __future__ magic:
user#linux2:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import unicode_literals
>>> someDynamicStr = "bar" # could come from various sources
>>>
>>> # works
... u"foo" + someDynamicStr
u'foobar'
>>> u"foo{}".format(someDynamicStr)
u'foobar'
>>>
>>> someDynamicStr = "\xff" # uh-oh
>>>
>>> # raises UnicodeDecodeError
... u"foo" + someDynamicStr
u'foo\xff'
>>> u"foo{}".format(someDynamicStr)
u'foo\xff'
>>>
I see you have a lot of edits relating to solutions you may have encountered. I'm just going to address your original post which I believe to be: "I want to create a wrapper around the unicode constructor that checks input".
unicode is a built-in type in Python 2. You can wrap it in a function that adds the checks:
import traceback

def add_checks(fxn):
    def resulting_fxn(*args, **kwargs):
        # only str input goes through the implicit default-encoding path
        if args and type(args[0]) is str:
            # print the call site so the usage can be tracked down
            traceback.print_stack()
            # set the encoding explicitly instead of relying on the default
            encoding = 'utf-8'
            # set the error behavior
            error = 'ignore'
            return fxn(args[0], encoding, error)
        # everything else is passed through unchanged
        return fxn(*args, **kwargs)
    return resulting_fxn
Using this will look like this:
unicode = add_checks(unicode)
We overwrite the existing function name so that you don't have to change all the calls in the large project. You want to do this very early on in the runtime so that subsequent calls have the new behavior.
Related
I migrated my Python code from a Win10 host to WS2012R2. Surprisingly, it stopped working correctly and now shows the error message: "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>"
I've tried running this command:
set PYTHONLEGACYWINDOWSSTDIO=yes
My code:
import logging
import sys
def get_console_handler():
    console_handler = logging.StreamHandler(sys.stdout)
    return console_handler

def get_logger():
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logger.addHandler(get_console_handler())
    return logger

my_logger = get_logger()
my_logger.debug("Это отладочное сообщение".encode("cp1252"))
What should I do to get rid of this warning?
Update
Colleagues, I am sorry for misleading you! I was obviously tired after long hours of bug tracking )
The problem is not related to calling .encode() as such; it is related to Python's default encoding for console I/O (I suppose so)! The original code makes some requests to a DB in the cp1251 charset, but the problem appears when Python tries to convert the result to cp1252.
Here is another example of how to trigger the error.
Create a plain text file, e.g. test.txt, with the text "Это отладочное сообщение" and save it as cp1252.
Start a Python console and enter:
f = open("test.txt")
f.read()
Output:
f = open("test.txt")
f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\project\venv\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 29: character maps to <undefined>
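For what it's worth, the traceback above can be reproduced without a file. The following is a minimal sketch, assuming Python 3 and assuming the file on disk actually contains UTF-8 bytes (a byte 0x81 at position 29 is what the UTF-8 encoding of this text would produce). read_text is a hypothetical helper name, not an existing API.

```python
import io

TEXT = u"Это отладочное сообщение"

# Assumption: the editor really wrote UTF-8 bytes to disk.
raw = TEXT.encode("utf-8")

try:
    raw.decode("cp1252")   # what open() does when cp1252 is the default
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte

def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding instead of the locale default."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()
```

Passing the encoding explicitly makes the result independent of whatever locale.getpreferredencoding() returns on the host.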
Use encode("utf-8"). Here is a list of python encodings: https://docs.python.org/2.4/lib/standard-encodings.html
my_logger.debug("Это отладочное сообщение".encode("utf-8"))
Then use .decode("utf-8") to see the printable value of your string.
The problem lies in how logging.StreamHandler performs console output: unlike FileHandler, it does not let you choose the encoding.
If the default system encoding doesn't match the one you need, you can run into this issue.
In my case, I wanted to output cp1251 lines, while the system default encoding was:
import locale
locale.getpreferredencoding()
'cp1252'
I solved this by changing the system locale (see https://stackoverflow.com/a/11234956/9851754). Choose "Change system locale..." for non-Unicode programs. No code changes are needed.
import locale
locale.getpreferredencoding()
'cp1251'
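If changing the system locale is not an option, another approach is to hand StreamHandler a stream whose encoding is set explicitly. This is only a sketch, assuming Python 3; make_console_handler is a hypothetical helper, and the choice of cp1251 with errors="replace" is an assumption for illustration.

```python
import io
import logging
import sys

def make_console_handler(binary_stream=None, encoding="cp1251"):
    """Build a StreamHandler that writes text using an explicit encoding."""
    if binary_stream is None:
        # sys.stdout.buffer is the underlying byte stream in Python 3.
        binary_stream = sys.stdout.buffer
    text_stream = io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="replace",      # unencodable characters become "?"
        newline="",            # do not translate line endings
        line_buffering=True,
    )
    return logging.StreamHandler(text_stream)
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding=...) is a shorter way to get a similar effect for the standard stream itself.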
I have tested your code with Python 3.6.8 and it worked for me (I didn't change anything).
Python 3.6.8:
>>> python3 -V
Python 3.6.8
>>> python3 test.py
Это отладочное сообщение
But when I tested it with Python 2.7.15+, I got an error similar to yours.
Python 2.7.15+ with your implementation:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
File "test.py", line 17
SyntaxError: Non-ASCII character '\xd0' in file test.py on line 17, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Then I put the following line at the top of the file, and it worked for me.
Beginning of code:
# -*- coding: utf-8 -*-
import logging
import sys
...
Output with Python 2.7.15+ and with modified code:
>>> python2 -V
Python 2.7.15+
>>> python2 test.py
Это отладочное сообщение
I have a colleague whose computer won't run a Python script that uses the dateutil.tz module; there is a timezone name '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4' that shows up and in dateutil.tz there is this code:
def tzname_in_python2(namefunc):
    """Change unicode output into bytestrings in Python 2

    tzname() API changed in Python 3. It used to return bytes, but was changed
    to unicode strings
    """
    def adjust_encoding(*args, **kwargs):
        name = namefunc(*args, **kwargs)
        if name is not None and not PY3:
            name = name.encode()
        return name
    return adjust_encoding
which breaks because the string in question is not ASCII. What is this string? It doesn't look like valid Unicode:
>>> a = '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> a.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\app\python\anaconda\2\envs\emblaze\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte
My python script contains
timezone = dateutil.tz.tzlocal()
and the resulting object fails to run timezone.tzname(some_timestamp) because of the non-ASCII nature of the timezone name.
If this happens again, there is a python module for this:
>>> import chardet
>>> z = b'\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
>>> chardet.detect(z)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
Aha, I figured it out after a bunch of searching on the net. It's not UTF8 or UTF16; it seems to be GB2312 (or GBK) encoding, which can be decoded in Python (on MS Windows, at least) with the gbk codec:
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u6807\u51c6\u65f6\u95f4'
>>> '\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xcf\xc4\xc1\xee\xca\xb1'.decode('gbk')
u'\u7f8e\u56fd\u5c71\u5730\u590f\u4ee4\u65f6'
which print out (in IPython Notebook) as
美国山地标准时间
美国山地夏令时
which Google Translate tells me represents "American Mountain Standard Time" and "American Mountain Summertime", respectively.
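The trial and error above can be written down as a small helper. This is a sketch only: decode_with_candidates and the candidate list are assumptions for illustration, and a successful decode does not prove the guess is right, since many byte strings decode without error under several encodings.

```python
def decode_with_candidates(raw, candidates=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# The timezone name from the question, as a bytestring.
tz_bytes = b'\xc3\xc0\xb9\xfa\xc9\xbd\xb5\xd8\xb1\xea\xd7\xbc\xca\xb1\xbc\xe4'
text, encoding = decode_with_candidates(tz_bytes)
```

UTF-8 is tried first because it is strict and rejects these bytes, so the loop falls through to gbk.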
I wrote a small example of the issue for everybody to see what's going on using Python 2.7 and Django 1.10.8
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, unicode_literals, print_function
import time
from django import setup
setup()
from django.contrib.auth.models import Group
group = Group(name='schön')
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
time.sleep(1.0)
print('%s' % group)
print('%r' % group) # fails
print('%s' % [group]) # fails
print('%r' % [group]) # fails
Exits with the following output + traceback
$ python .PyCharmCE2017.2/config/scratches/scratch.py
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
Traceback (most recent call last):
File "/home/srkunze/.PyCharmCE2017.2/config/scratches/scratch.py", line 22, in <module>
print('%r' % group) # fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
Does anybody have an idea what's going on here?
At issue here is that you are interpolating UTF-8 bytestrings into a Unicode string. Your '%r' string is a Unicode string because you used from __future__ import unicode_literals, but repr(group) (used by the %r placeholder) returns a bytestring. For Django models, repr() can include Unicode data in the representation, encoded to a bytestring using UTF-8. Such representations are not ASCII safe.
For your specific example, repr() on your Group instance produces the bytestring '<Group: sch\xc3\xb6n>'. Interpolating that into a Unicode string triggers the implicit decoding:
>>> u'%s' % '<Group: sch\xc3\xb6n>'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
Note that I did not use from __future__ import unicode_literals in my Python session, so the '<Group: sch\xc3\xb6n>' string is not a unicode object, it is a str bytestring object!
In Python 2, you should avoid mixing Unicode and byte strings. Always explicitly normalise your data (encoding Unicode to bytes or decoding bytes to Unicode).
If you must use from __future__ import unicode_literals, you can still create bytestrings by using a b prefix:
>>> from __future__ import unicode_literals
>>> type('') # empty unicode string
<type 'unicode'>
>>> type(b'') # empty bytestring, note the b prefix
<type 'str'>
>>> b'%s' % b'<Group: sch\xc3\xb6n>' # two bytestrings
'<Group: sch\xc3\xb6n>'
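The advice about explicitly normalising data can be sketched with two small helpers. The names to_text and to_bytes are hypothetical, and UTF-8 is assumed as the encoding because that is what the repr() output above uses; adjust it to your data.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Decode bytestrings to Unicode text; pass text through unchanged."""
    if isinstance(value, bytes):   # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding, errors)
    return value

def to_bytes(value, encoding="utf-8", errors="strict"):
    """Encode Unicode text to a bytestring; pass bytestrings through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding, errors)

repr_bytes = b'<Group: sch\xc3\xb6n>'
# Decode first, then interpolate: no implicit ASCII decoding is involved.
message = u'%s' % to_text(repr_bytes)
```

Calling such helpers at the boundaries (input, output, third-party return values) keeps the rest of the code working with one string type only.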
I had a hard time finding a general solution to your problem.
As I understand it, __repr__() is supposed to return str; any effort to change that seems to cause new problems.
Even though the __repr__() method is defined outside your project, you can override it. For example:
def new_repr(self):
    return 'My representation of self {}'.format(self.name)

Group.add_to_class("__repr__", new_repr)
The only solution I could find that works is to explicitly tell the interpreter how to handle the strings.
from __future__ import unicode_literals
from django.contrib.auth.models import Group
group = Group(name='schön')
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
print('%s' % group)
print('%r' % repr(group))
print('%s' % [str(group)])
print('%r' % [repr(group)])
# added
print('{}'.format([repr(group).decode("utf-8")]))
print('{}'.format([repr(group)]))
print('{}'.format(group))
Working with strings in python 2.x is a mess.
I hope this sheds some light on how to work around the problem, which is the only approach I could find.
I think the real issue is in the django code.
It was reported six years ago:
https://code.djangoproject.com/ticket/18063
I think a patch to Django would solve it:
def __repr__(self):
    return self.....encode('ascii', 'replace')
I think the repr() method should return "7 bit ascii".
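To illustrate what a "7-bit ASCII" representation could look like, here is a minimal sketch. It is not Django's actual code or patch; ascii_safe is a hypothetical helper, and the choice between the "replace" and "backslashreplace" error handlers is an assumption about what you want to see in the output.

```python
def ascii_safe(text, errors="backslashreplace"):
    """Encode Unicode text to ASCII-only bytes, escaping everything else."""
    return text.encode("ascii", errors)

# Keeps the information about the original character as an escape sequence.
escaped = ascii_safe(u"<Group: sch\xf6n>")
# Loses the original character and substitutes a question mark.
replaced = ascii_safe(u"<Group: sch\xf6n>", errors="replace")
```

Either result can be interpolated into a Unicode string in Python 2 without triggering a UnicodeDecodeError, because it contains only ASCII bytes.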
If that's the case, we need to override the __unicode__ method with our own. Try the code below; I have tested it.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from django.contrib.auth.models import Group
def custom_unicode(self):
    return u"%s" % (self.name.encode('utf-8', 'ignore'))
Group.__unicode__ = custom_unicode
group = Group(name='schön')
# Tests
print(type(repr(group)))
print(type(str(group)))
print(type(unicode(group)))
print(group)
print(repr(group))
print(str(group))
print(unicode(group))
print('%s' % group)
print('%r' % group)
print('%s' % [group])
print('%r' % [group])
# output:
<type 'str'>
<type 'str'>
<type 'unicode'>
schön
<Group: schön>
schön
schön
schön
<Group: schön>
[<Group: schön>]
[<Group: schön>]
Reference: https://docs.python.org/2/howto/unicode.html
I am not familiar with Django. Your issue seems to be representing text data in ASCII when it is actually Unicode. Please try the unidecode module in Python.
from unidecode import unidecode
#print(string) is replaced with
print(unidecode(string))
See the Unidecode package.
I have a function that prints some information called print_info(). Can I use it to print this info when raising an exception?
raise ValueError('This is invalid. Check the valid items here %s' % str(self.print_info()))
I can imagine this would be possible in two ways:
1- Call the print_info() function to print to stdout instead of providing a string
2- Convert the output of the print_info() function to a string and pass it as an argument
I am not sure if this is possible, and if it is, I'm not sure of the correct way to implement it.
print is a statement (in Py2.7) that prints something to standard output. Unless your print_info function also returns the same string, this won't work.
If you need to use your info string in multiple ways (printing, exceptions, etc.) then it'd be better to have the string creation and the output be separated:
def make_info():
    return 'This is my info string.'

def print_info():
    print make_info()

def raise_info():
    raise ValueError('Something happened. See info: {}'.format(make_info()))
A cleaner approach:
def print_info():
    print 'This is invalid. Check the valid items here'

try:
    a = int(raw_input())
except ValueError as p:
    print_info()
else:
    print a
Result
Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
10
10
>>> ================================ RESTART ================================
>>>
wdfdsnj
This is invalid. Check the valid items here
>>>
In my opinion it's perfectly fine to do it this way, as long as self.print_info() returns a string (rather than printing one!).
An alternative (which I prefer, and which handles the str conversion for you) is this:
raise ValueError('This is invalid. Check the valid items here {}'.format(self.print_info()))
Example 1:
class Foo(object):
    def gg(self):
        return 'Hello!'
>>> f = Foo()
>>> raise ValueError('123... {}'.format(f.gg()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: 123... Hello!
Example 2:
class foo(object):
    def gg(self):
        return 'Hello!'
    def xx(self):
        raise ValueError('123... {}'.format(self.gg()))
>>> f = foo()
>>> f.xx()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in xx
ValueError: 123... Hello!
I use this code to deal with Chinese:
# -*- coding: utf-8 -*-
strInFilNname = u'%s' % raw_input("input fileName:").decode('utf-8')
pathName = u'%s' % raw_input("input filePath:").decode('utf-8')
When I run this in PyCharm everything is OK. But when I run it in the Windows CMD, I get this error:
Traceback (most recent call last):
File "E:\Sites\GetAllFile.py", line 23, in <module>
strInFilNname = u'%s' % raw_input("input filename:").decode('utf-8')
File "E:\Portable Python 2.7.5.1\App\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
I have read the Python Unicode HOWTO, but couldn't find an effective solution.
I'd really like to know why this happens.
The Windows console encoding is not UTF-8. I will assume you are using a Chinese-localized version of Windows since you mention the errors go away in Python 3.3 and suggest trying sys.stdin.encoding instead of utf-8.
Below is an example from my US-localized Windows using characters from the cp437 code page, which is what the US console uses (Python 2.7.9):
This returns a byte string in the console encoding:
>>> raw_input('test? ')
test? │┤╡╢╖╕╣
'\xb3\xb4\xb5\xb6\xb7\xb8\xb9'
Convert to Unicode:
>>> import sys
>>> sys.stdin.encoding
'cp437'
>>> raw_input('test? ').decode(sys.stdin.encoding)
test? │┤╡╢╖╕╣║╗╝╜╛
u'\u2502\u2524\u2561\u2562\u2556\u2555\u2563\u2551\u2557\u255d\u255c\u255b'
Note it prints correctly:
>>> print(raw_input('test? ').decode(sys.stdin.encoding))
test? │┤╡╢╖╕╣║╗
│┤╡╢╖╕╣║╗
This works correctly for a Chinese Windows console as well as it will use the correct console encoding for Chinese. Here's the same code after switching my system to use Chinese:
>>> raw_input('Test? ')
Test? 我是美国人。
'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
>>> import sys
>>> sys.stdin.encoding
'cp936'
>>> raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
u'\u6211\u662f\u7f8e\u56fd\u4eba\u3002'
>>> print raw_input('Test? ').decode(sys.stdin.encoding)
Test? 我是美国人。
我是美国人。
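The decoding step above can be wrapped in a helper so that the console encoding is looked up in one place. This is a sketch under assumptions: decode_console_bytes and FakeConsole are hypothetical names, FakeConsole only stands in for sys.stdin so the example runs anywhere, and the UTF-8 fallback is a guess for the case where no encoding is reported.

```python
import sys

def decode_console_bytes(raw, stream=None, fallback="utf-8"):
    """Decode bytes read from a console using that console's own encoding."""
    if stream is None:
        stream = sys.stdin
    # stream.encoding can be missing or None, e.g. in Python 2 with piped input.
    encoding = getattr(stream, "encoding", None) or fallback
    return raw.decode(encoding)

class FakeConsole(object):
    """Stand-in for sys.stdin on a Chinese-localized Windows console."""
    encoding = "cp936"

# The bytes that raw_input() returned in the session above.
typed = b'\xce\xd2\xca\xc7\xc3\xc0\xb9\xfa\xc8\xcb\xa1\xa3'
text = decode_console_bytes(typed, FakeConsole())
```

In the script from the question, the call would replace .decode('utf-8') on the value returned by raw_input().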
Python 3.3 makes this much simpler:
>>> input('Test? ')
Test? 我是美国人。
'我是美国人。'