I am updating a hobby app, written in Python 2.7 on Ubuntu 14.04, that stores railway history data in JSON. Up to now I have used it to work on British data.
When starting with French data I encountered a problem which puzzles me. I have a class CompaniesCache which implements __str__(). Inside that implementation everything uses str objects. Let's say I instantiate a CompaniesCache and assign it to a variable companies. When I give the command print companies in IPython2, I get an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 184: ordinal not in range(128)
Alright, that is not strange. Testing further: str(companies) reproduces the error, as expected. But companies.__str__() succeeds without problems, as does print companies.__str__(). What is wrong here?
Here is the code of the __str__ method of the CompaniesCache object:
class CompaniesCache(object):
    def __init__(self, railrefdatapath):
        self.cache = restoreCompanies(railrefdatapath)

    def __getitem__(self, compcode):
        return self.cache[compcode.upper()]

    def __str__(self):
        s = ''
        for k in sorted(self.cache.keys()):
            s += '\n%s: %s' % (k, self[k].title)
        return s
This is the code for the CompaniesCache object, which contains Company objects in its cache dict. The Company object does not implement the __str__() method.
str doesn't just call __str__. Among other things, it validates the return type, it falls back to __repr__ if __str__ isn't available, and it tries to convert unicode return values to str with the ASCII codec.
Your __str__ method is returning a unicode instance with non-ASCII characters. When str tries to convert that to a bytestring, it fails, producing the error you're seeing.
Don't return a unicode object from __str__. You can implement a __unicode__ method to define how unicode(your_object) behaves, and return an appropriately-encoded bytestring from __str__.
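As a rough illustration of that advice, here is a minimal sketch. It assumes the cache can be reduced to a plain dict mapping company codes to titles (the `titles` argument is a hypothetical stand-in for whatever restoreCompanies returns), and the `PY2` switch is only there so the same sketch also runs on Python 3, where `__str__` must return text rather than bytes.

```python
import sys

PY2 = sys.version_info[0] == 2


class CompaniesCache(object):
    def __init__(self, titles):
        # Hypothetical stand-in for restoreCompanies(): code -> unicode title
        self.cache = dict(titles)

    def __unicode__(self):
        # Build the text representation entirely out of unicode objects
        return u''.join(u'\n%s: %s' % (k, self.cache[k])
                        for k in sorted(self.cache))

    def __str__(self):
        text = self.__unicode__()
        # Python 2: str() must get a bytestring, so encode explicitly.
        # Python 3: str is already text, so return it unchanged.
        return text.encode('utf-8') if PY2 else text


companies = CompaniesCache({u'PLM': u'Paris \xe0 Lyon et \xe0 la M\xe9diterran\xe9e'})
rendered = str(companies)
```

With this split, str(companies) no longer depends on the implicit ASCII conversion, because the encoding is chosen explicitly inside __str__.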
Using maxpolk's answer:
I think all you need to do is set your environment variable:
export LC_ALL='en_US.utf8'
All in all, I think you can find your answer in this post.
Related
I have an instance of a class A passed as the value to the format specifier %d in the string formatting using the % operator. Without any preparation, this will result in the following error message: TypeError: %d format: a number is required, not A:
>>> class A: pass
...
>>> '%d' % A()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: %d format: a number is required, not A
If the class A defines a method called __int__, this gets called:
>>> class A:
...     def __int__(self): return 42
...
>>> '%d' % A()
'42'
In my use case I would like formatting with %d to create string representations for instances of my class that do not look like a number (but instead an arbitrary string like n/a, ²³, or similar).
Is this possible?
I was considering returning another special object in the __int__ method but that resulted in a warning (only returning basic ints is allowed, anything else might become illegal in later versions; I'm trying on Python 3.7.4, btw) and no success eventually.
I know it is an easy task using the __format__ method in combination with the '{0}'.format(a) way of formatting strings, but that's not what I'm asking for. I'm specifically and only asking about formatting using the %d specifier in formatting string used with the % operator.
Formatting a string with the %d specifier will not work, and as far as I know there is no built-in way to change this. However, if you want to print a string representation of an object, there are three conversions you can use with the modulo operator (%):
%s - Returns a string using the built-in str() function
%r - Returns a string using the built-in repr() function
%a - Returns a string using the built-in ascii() function
With these three you can customize the string through the corresponding dunder methods. For example, %s uses the built-in str() function; to change what str() returns, put this within your class definition:
def __str__(self):
    return "String Representation"
The Python documentation describes exactly why %d can't print strings. Make sure to scroll down to the explanation of % (about halfway down the page).
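The three conversions can be compared side by side in a small sketch for Python 3. The class name Reading and its value attribute are made up for illustration; the point is only which dunder method each conversion ends up calling, and that %d still refuses an object with no numeric conversion.

```python
class Reading(object):
    """Hypothetical class used only to illustrate the conversions."""

    def __init__(self, value=None):
        self.value = value

    def __str__(self):
        # Used by %s
        return 'n/a' if self.value is None else str(self.value)

    def __repr__(self):
        # Used by %r, and by %a after non-ASCII characters are escaped
        return 'Reading(%r)' % (self.value,)


missing = Reading()
as_s = '%s' % missing
as_r = '%r' % missing
as_a = '%a' % Reading(u'\xb2\xb3')

try:
    '%d' % missing
    d_failed = False
except TypeError:
    # No __int__ (or similar) is defined, so %d rejects the object
    d_failed = True
```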
I'm using ruamel in the following way:
from ruamel.yaml import YAML
yaml = YAML()
print yaml.load('!!python/unicode aa')
Wanted output:
u'aa'
Actual output:
<ruamel.yaml.comments.TaggedScalar at 0x106557150>
I know of a hack that could be used with the SafeLoader to give me this behavior:
SafeLoader.add_constructor('tag:yaml.org,2002:python/unicode', lambda _, node: node.value)
This returns the value of the node, which is what I want. However, this hack doesn't seem to work with the RoundTripLoader.
The leading u marks a unicode string literal in Python 2; it is part of how the value is displayed, not part of the string itself, so if you pass u'aa' into the function, it just feeds it the string aa. To get the output u'aa', you can pass the quoted string "u'aa'" instead.
There seems to be something funny with IPython's handling of printing classes, in that it doesn't take into account the __str__ method on the class TaggedScalar.
The RoundTripConstructor (used when doing a round-trip load) is based on the SafeConstructor, and for that the python/unicode tag is not defined (it is defined for the non-safe Constructor). Therefore you fall back to the construct_undefined method of the RoundTripConstructor, which creates this TaggedScalar and yields it as part of the normal two-step creation process.
This TaggedScalar has a __str__ method which, in a normal CPython, returns the actual string value (stored in the value attribute). IPython doesn't seem to call that method.
If you change the name of the __str__ method you get the same erroneous result in CPython as you are getting in IPython.
You might be able to trick IPython assuming it does use the __repr__ method when print-ing:
from ruamel.yaml import YAML
from ruamel.yaml.comments import TaggedScalar

def my_representer(x):
    try:
        if isinstance(x.value, unicode):
            return "u'{}'".format(x.value)
    except:
        pass
    return x.value

TaggedScalar.__repr__ = my_representer

yaml = YAML()
print yaml.load('!!python/unicode aa')
which gives
u'aa'
on my Linux based CPython when the __str__ method is deactivated (i.e. __str__ should be used by print in favor of __repr__, but IPython doesn't seem to do that).
Just wondering if I can convert everything into a string in Python 3.5.
For instance, will this ever throw an error?
def test(arg):
    try:
        arg=str(arg)
    except ValueError:
Every type in the standard library can be converted to a string using the str() function, but not every object in Python has to come from the standard library.
In theory, the following will raise an error, because it overrides the default __str__ method with one that does not return a string:
class O:
    def __str__(self):
        pass

o = O()
print(str(o))
# TypeError: __str__ returned non-string (type NoneType)
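Since a user-defined __str__ can return the wrong type or raise anything at all, catching only ValueError would not cover every failure. A defensive wrapper might look like the sketch below; safe_str, Broken, and Angry are hypothetical names, and catching Exception this broadly is a deliberate choice for the illustration rather than a general recommendation.

```python
class Broken(object):
    def __str__(self):
        # Not a string: str() raises TypeError
        return None


class Angry(object):
    def __str__(self):
        # __str__ is ordinary code and may raise any exception
        raise RuntimeError("no string form")


def safe_str(arg, fallback='<unprintable>'):
    """Return str(arg), or a fallback if the conversion fails."""
    try:
        return str(arg)
    except Exception:
        return fallback


results = [safe_str(42), safe_str(Broken()), safe_str(Angry())]
```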
Ok, so the general background for this question is that I'm trying to make a custom dictionary class that will create a string representation of the dictionary which is just a lookup of one of the values (which are all unicode values). In the real code, depending on some internal logic, one of the keys is chosen as the current default for the lookup, so that unicode(dict_obj) will return a single value within the dictionary such as u'Some value' or if the value doesn't exist for the current default key: u'None'
This functionality is working no problem. The real problem lies when using it within the application from the zope page templates which wrap the object in a security proxy. The proxied object doesn't behave the same as the original object.
Here is the boiled down code of the custom dictionary class:
class IDefaultKeyDict(Interface):

    def __unicode__():
        """Create a unicode representation of the dictionary."""

    def __str__():
        """Create a string representation of the dictionary."""


class DefaultKeyDict(dict):
    """A custom dictionary for handling default values"""
    implements(IDefaultKeyDict)

    def __init__(self, default, *args, **kwargs):
        super(DefaultKeyDict, self).__init__(*args, **kwargs)
        self._default = default

    def __unicode__(self):
        print "In DefaultKeyDict.__unicode__"
        key = self.get_current_default()
        result = self.get(key)
        return unicode(result)

    def __str__(self):
        print "In DefaultKeyDict.__str__"
        return unicode(self).encode('utf-8')

    def get_current_default(self):
        return self._default
And the associated zcml permissions for this class:
<class class=".utils.DefaultKeyDict">
  <require
      interface=".utils.IDefaultKeyDict"
      permission="zope.View" />
</class>
I've left the print statements in both the __unicode__ and __str__ methods to show the different behavior with the proxied objects. So creating a dummy dictionary class with a pre-defined default key:
>>> dummy = DefaultKeyDict(u'key2', {u'key1': u'Normal ascii text', u'key2': u'Espa\xf1ol'})
>>> dummy
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> str(dummy)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(dummy)
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
>>> print dummy
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
Español
Everything works as expected. Now I can wrap the object in a security proxy from the zope.security package and do the same tests to show the error:
>>> from zope.security.checker import ProxyFactory
>>> prox = ProxyFactory(dummy)
>>> prox
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> type(prox)
<type 'zope.security._proxy._Proxy'>
>>> str(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
As you can see, calling unicode on the proxied object isn't possible anymore if it contains any special characters. I can see the proxy object from zope.security is mostly defined with C code and I'm not familiar at all with the C Python API, but it seems that the __str__ and __repr__ methods are defined in the C code but not __unicode__. So to me, what seems to be happening is that when it is trying to create a unicode representation of this proxied object, instead of calling the __unicode__ method directly, it calls the __str__ method (as you can see from the last few print statements above), which returns a utf-8 encoded byte string, but that is then being converted to unicode (using the default ascii encoding). So what is happening seems to be the equivalent of this:
>>> unicode(prox.__str__())
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
So of course it will result in a UnicodeDecodeError in this case, trying to decode a utf-8 string with ascii. As expected, if I could specify the encoding of utf-8 there wouldn't be a problem.
>>> unicode(prox.__str__(), encoding='utf-8')
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
But I can't change that since we are talking about the zope.pagetemplate and zope.tales packages that are creating the unicode representation out of all types of objects, and they always seem to be working with the security proxied objects (from zope.security). Also of note, there is no problem calling the __unicode__ method directly on the object.
>>> prox.__unicode__()
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
So the real problem is that unicode(prox) calls the __str__ method. I've been spinning my wheels on this for a while and don't know where else to go now. Any insights would be much appreciated.
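The decoding failure described above can be reproduced without Zope at all. The following sketch only assumes the same sample value, Español, and uses bytes.decode so that it behaves the same on Python 2 and Python 3: decoding the UTF-8 bytes as ASCII fails at the first non-ASCII byte, while naming the encoding explicitly succeeds.

```python
# The UTF-8 bytes that __str__ returns for the sample value
raw = u'Espa\xf1ol'.encode('utf-8')

try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    # Same failure as the implicit default-encoding conversion
    ascii_error = exc

# Decoding with the encoding that was actually used works
text = raw.decode('utf-8')
```

The error reports position 4, which matches the traceback in the question: the first four bytes are plain ASCII and byte 0xc3 starts the two-byte sequence for ñ.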
Judging by what you've said about the C API defining __str__ and __repr__ methods but not __unicode__ methods, I suspect whatever C library you're using has been written to be compatible with python 3. I'm not familiar with zope, but I'm relatively confident this should be the case.
In Python 2, the object model specifies __str__() and __unicode__() methods. If these methods exist, they must return str (bytes) and unicode (text) respectively.
In Python 3, there's simply __str__(), which must return str (text).
I may be slightly missing the point of your program, but do you really need the __unicode__ method defined? As you've said, everything in the dict is unicode text. Calling the __str__ method will encode that as UTF-8, and if you want to see the raw bytes of the string, why not just encode it?
Note that decode() returns a string object, whilst encode() returns a bytes object.
If you could, please post an edit/comment so I can understand a little bit more what you're trying to do.
In case anybody is looking for a temporary solution to this problem, I can share the monkeypatch fixes that we've implemented. Patching these two methods from zope.tal and zope.tales seems to do the trick. This will work well as long as you know that the encoding will always be utf-8.
from zope.tal import talinterpreter


def do_insertStructure_tal(self, (expr, repldict, block)):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way to
    fix this is to monkey patch this method which calls 'unicode'.
    """
    structure = self.engine.evaluateStructure(expr)
    if structure is None:
        return
    if structure is self.Default:
        self.interpret(block)
        return
    if isinstance(structure, talinterpreter.I18nMessageTypes):
        text = self.translate(structure)
    else:
        try:
            text = unicode(structure)
        except UnicodeDecodeError:
            text = unicode(str(structure), encoding='utf-8')
    if not (repldict or self.strictinsert):
        # Take a shortcut, no error checking
        self.stream_write(text)
        return
    if self.html:
        self.insertHTMLStructure(text, repldict)
    else:
        self.insertXMLStructure(text, repldict)

talinterpreter.TALInterpreter.do_insertStructure_tal = do_insertStructure_tal
talinterpreter.TALInterpreter.bytecode_handlers_tal["insertStructure"] = \
    do_insertStructure_tal
and this one:
from zope.tales import tales


def evaluateText(self, expr):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way to
    fix this is to monkey patch this method which calls 'unicode'.
    """
    text = self.evaluate(expr)
    if text is self.getDefault() or text is None:
        return text
    if isinstance(text, basestring):
        # text could already be something text-ish, e.g. a Message object
        return text
    try:
        return unicode(text)
    except UnicodeDecodeError:
        return unicode(str(text), encoding='utf-8')

tales.Context.evaluateText = evaluateText
I'm starting to port some code from Python2.x to Python3.x, but before I make the jump I'm trying to modernise it to recent 2.7. I'm making good progress with the various tools (e.g. futurize), but one area they leave alone is the use of buffer(). In Python3.x buffer() has been removed and replaced with memoryview() which in general looks to be cleaner, but it's not a 1-to-1 swap.
One way in which they differ is:
In [1]: a = "abcdef"
In [2]: b = buffer(a)
In [3]: m = memoryview(a)
In [4]: print b, m
abcdef <memory at 0x101b600e8>
That is, str(<buffer object>) returns a byte-string containing the contents of the object, whereas memoryviews return their repr(). I think the new behaviour is better, but it's causing issues.
In particular I've got some code which is throwing an exception because it's receiving a byte-string containing <memory at 0x1016c95a8>. That suggests that there's a piece of code somewhere else that is relying on this behaviour to work, but I'm having real trouble finding it.
Does anybody have a good debugging trick for this type of problem?
One possible trick is to write a subclass of memoryview and temporarily change all your memoryview instances to, let's say, memoryview_debug versions:
class memoryview_debug(memoryview):
    def __init__(self, string):
        memoryview.__init__(self, string)

    def __str__(self):
        # ... place a breakpoint, log the call, print stack trace, etc.
        return memoryview.__str__(self)
EDIT:
As noted by the OP, it is apparently impossible to subclass memoryview. Fortunately, thanks to dynamic typing, that's not a big problem in Python; it is just more inconvenient. You can change inheritance to composition:
class memoryview_debug:
    def __init__(self, string):
        self.innerMemoryView = memoryview(string)

    def tobytes(self):
        return self.innerMemoryView.tobytes()

    def tolist(self):
        return self.innerMemoryView.tolist()

    # some other methods if used by your code
    # and if overridden in memoryview implementation (e.g. __len__?)

    def __str__(self):
        # ... place a breakpoint, log the call, print stack trace, etc.
        return self.innerMemoryView.__str__()
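To make the "log the call" part concrete, here is one possible sketch of the composition approach that records a stack trace each time the wrapper is converted to a string. The names MemoryViewDebug, str_callers, and careless_consumer are hypothetical, and the wrapper only forwards the few operations shown, so real code may need more of them delegated.

```python
import traceback


class MemoryViewDebug(object):
    """Hypothetical wrapper that records who converts the view to a string."""

    def __init__(self, data):
        self._view = memoryview(data)
        self.str_callers = []  # one formatted stack per str() call

    def __getattr__(self, name):
        # Only called for attributes missing on the wrapper itself,
        # e.g. tobytes() and tolist(), which are forwarded to the real view
        return getattr(self._view, name)

    def __len__(self):
        return len(self._view)

    def __getitem__(self, index):
        return self._view[index]

    def __str__(self):
        # Record the call stack so the offending caller can be identified
        self.str_callers.append(traceback.format_stack())
        return str(self._view)


def careless_consumer(obj):
    # Stand-in for the unknown code that relies on str(buffer) semantics
    return "payload=%s" % obj


view = MemoryViewDebug(b"abcdef")
result = careless_consumer(view)
```

After running the suspect code path, inspecting view.str_callers shows the chain of function names that led to the implicit string conversion.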