OK, so the general background for this question: I'm writing a custom dictionary class whose string representation is simply a lookup of one of its values (all of which are unicode). In the real code, internal logic chooses one of the keys as the current default for the lookup, so that unicode(dict_obj) returns a single value from the dictionary, such as u'Some value', or u'None' if no value exists for the current default key.
This functionality works without problem. The real problem arises when using it within the application from Zope page templates, which wrap the object in a security proxy. The proxied object doesn't behave the same as the original object.
Here is the boiled down code of the custom dictionary class:
class IDefaultKeyDict(Interface):

    def __unicode__():
        """Create a unicode representation of the dictionary."""

    def __str__():
        """Create a string representation of the dictionary."""


class DefaultKeyDict(dict):
    """A custom dictionary for handling default values."""

    implements(IDefaultKeyDict)

    def __init__(self, default, *args, **kwargs):
        super(DefaultKeyDict, self).__init__(*args, **kwargs)
        self._default = default

    def __unicode__(self):
        print "In DefaultKeyDict.__unicode__"
        key = self.get_current_default()
        result = self.get(key)
        return unicode(result)

    def __str__(self):
        print "In DefaultKeyDict.__str__"
        return unicode(self).encode('utf-8')

    def get_current_default(self):
        return self._default
And the associated zcml permissions for this class:
<class class=".utils.DefaultKeyDict">
  <require
      interface=".utils.IDefaultKeyDict"
      permission="zope.View" />
</class>
I've left the print statements in both the __unicode__ and __str__ methods to show the different behavior with the proxied objects. So, creating a dummy dictionary with a pre-defined default key:
>>> dummy = DefaultKeyDict(u'key2', {u'key1': u'Normal ascii text', u'key2': u'Espa\xf1ol'})
>>> dummy
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> str(dummy)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(dummy)
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
>>> print dummy
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
Español
Everything works as expected. Now I can wrap the object in a security proxy from the zope.security package and do the same tests to show the error:
>>> from zope.security.checker import ProxyFactory
>>> prox = ProxyFactory(dummy)
>>> prox
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> type(prox)
<type 'zope.security._proxy._Proxy'>
>>> str(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
As you can see, calling unicode on the proxied object is no longer possible if it contains any special characters. The proxy object from zope.security is mostly defined in C code, and I'm not at all familiar with the C Python API, but it seems that the __str__ and __repr__ methods are defined in the C code while __unicode__ is not. So what appears to be happening is that when a unicode representation of the proxied object is requested, instead of calling the __unicode__ method directly, the proxy calls the __str__ method (as the print statements above show), which returns a utf-8 encoded byte string, and that byte string is then converted to unicode using the default ascii encoding. In other words, the equivalent of this:
>>> unicode(prox.__str__())
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
So of course it will result in a UnicodeDecodeError in this case, trying to decode a utf-8 string with ascii. As expected, if I could specify the encoding of utf-8 there wouldn't be a problem.
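Stripped of the proxy machinery, the failure is just this encode/decode round-trip (a minimal sketch, independent of zope; the direct calls behave the same under Python 2 and 3):

```python
text = u'Espa\xf1ol'
data = text.encode('utf-8')          # what __str__ returns: UTF-8 bytes

try:
    data.decode('ascii')             # what unicode() does by default
except UnicodeDecodeError as exc:
    print('decode failed: %s' % exc.reason)

print(data.decode('utf-8') == text)  # True: specifying utf-8 works
```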
>>> unicode(prox.__str__(), encoding='utf-8')
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
But I can't change that, since it is the zope.pagetemplate and zope.tales packages that create the unicode representation of all types of objects, and they always seem to be working with the security-proxied objects (from zope.security). Also of note: there is no problem calling the __unicode__ method directly on the proxied object.
>>> prox.__unicode__()
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
So the real problem is that unicode(prox) calls the __str__ method. I've been spinning my wheels on this for a while and don't know where else to go now. Any insights would be much appreciated.
Judging by what you've said about the C API defining __str__ and __repr__ methods but not a __unicode__ method, I suspect the C library you're using has been written to be compatible with Python 3. I'm not familiar with Zope, but I'm relatively confident this is the case.
In Python 2, the object model specifies __str__() and __unicode__() methods. If these methods exist, they must return str (bytes) and unicode (text) respectively.
In Python 3, there's simply __str__(), which must return str (text).
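The Python 2 idiom that follows from this model is to implement __unicode__ for the text and derive __str__ from it by encoding (a sketch with an illustrative class, not code from the question; under Python 3 you would implement only __str__, returning text):

```python
class Greeting(object):
    def __unicode__(self):
        # text representation
        return u'Espa\xf1ol'

    def __str__(self):
        # Python 2 convention: __str__ returns UTF-8 encoded bytes
        return self.__unicode__().encode('utf-8')

g = Greeting()
print(g.__unicode__() == u'Espa\xf1ol')   # True
print(g.__str__() == b'Espa\xc3\xb1ol')   # True
```

The direct method calls shown behave identically on both Python versions; only str(g) and unicode(g) differ between them.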
I may be slightly missing the point of your program, but do you really need the __unicode__ method at all? As you've said, everything in the dict already belongs to the unicode character set, so calling the __str__ method will encode that into utf-8; and if you wanted to see the bytes of the string, why not just encode it explicitly?
Note that decode() returns a (text) string object, whilst encode() returns a bytes object.
If you could, please post an edit/comment so I can understand a little bit more what you're trying to do.
In case anybody is looking for a temporary solution to this problem, I can share the monkeypatch fixes that we've implemented. Patching these two methods from zope.tal and zope.tales seems to do the trick. This will work well as long as you know that the encoding will always be utf-8.
from zope.tal import talinterpreter

def do_insertStructure_tal(self, (expr, repldict, block)):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way
    to fix this is to monkey patch this method, which calls 'unicode'.
    """
    structure = self.engine.evaluateStructure(expr)
    if structure is None:
        return
    if structure is self.Default:
        self.interpret(block)
        return
    if isinstance(structure, talinterpreter.I18nMessageTypes):
        text = self.translate(structure)
    else:
        try:
            text = unicode(structure)
        except UnicodeDecodeError:
            text = unicode(str(structure), encoding='utf-8')
    if not (repldict or self.strictinsert):
        # Take a shortcut, no error checking
        self.stream_write(text)
        return
    if self.html:
        self.insertHTMLStructure(text, repldict)
    else:
        self.insertXMLStructure(text, repldict)

talinterpreter.TALInterpreter.do_insertStructure_tal = do_insertStructure_tal
talinterpreter.TALInterpreter.bytecode_handlers_tal["insertStructure"] = \
    do_insertStructure_tal
and this one:
from zope.tales import tales

def evaluateText(self, expr):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way
    to fix this is to monkey patch this method, which calls 'unicode'.
    """
    text = self.evaluate(expr)
    if text is self.getDefault() or text is None:
        return text
    if isinstance(text, basestring):
        # text could already be something text-ish, e.g. a Message object
        return text
    try:
        return unicode(text)
    except UnicodeDecodeError:
        return unicode(str(text), encoding='utf-8')

tales.Context.evaluateText = evaluateText
Related
I'm using ruamel in the following way:
from ruamel.yaml import YAML
yaml = YAML()
print yaml.load('!!python/unicode aa')
Wanted output:
u'aa'
Actual output:
<ruamel.yaml.comments.TaggedScalar at 0x106557150>
I know of a hack that could be used with the SafeLoader to give me this behavior:
SafeLoader.add_constructor('tag:yaml.org,2002:python/unicode', lambda _, node: node.value)
This returns the value of the node, which is what I want. However, this hack doesn't seem to work with the RoundTripLoader.
The leading u is not part of the string's content; it is just Python 2's prefix marking a unicode literal, shown in the repr. The loaded value 'aa' and u'aa' contain the same two characters, so getting the text 'aa' back is already the result you want.
There seems to be something funny with IPython's handling of printing instances, in that it doesn't take the __str__ method on the TaggedScalar class into account.
The RoundTripConstructor (used when doing a round-trip load) is based on the SafeConstructor, and for that the python/unicode tag is not defined (it is defined for the non-safe Constructor). You therefore fall back to the construct_undefined method of the RoundTripConstructor, which creates this TaggedScalar and yields it as part of the normal two-step creation process.
This TaggedScalar has a __str__ method which, in normal CPython, returns the actual string value (stored in the value attribute). IPython doesn't seem to call that method.
If you rename the __str__ method, you get the same erroneous result in CPython as you are getting in IPython.
You might be able to trick IPython assuming it does use the __repr__ method when print-ing:
from ruamel.yaml import YAML
from ruamel.yaml.comments import TaggedScalar

def my_representer(x):
    try:
        if isinstance(x.value, unicode):
            return "u'{}'".format(x.value)
    except:
        pass
    return x.value

TaggedScalar.__repr__ = my_representer

yaml = YAML()
print yaml.load('!!python/unicode aa')
which gives
u'aa'
on my Linux-based CPython when the __str__ method is deactivated (print should prefer __str__ over __repr__, but IPython doesn't seem to do that).
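The __repr__ trick works because print falls back to __repr__ when no __str__ is defined. A stand-in class (not the real TaggedScalar) shows the mechanism:

```python
class FakeTagged(object):              # stand-in for TaggedScalar
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "u'{}'".format(self.value)

print(FakeTagged('aa'))                # no __str__ defined, so repr is used: u'aa'
```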
I am updating a hobby app, written in Python 2.7 on Ubuntu 14.04, that stores railway history data in JSON. Up to now I have used it to work on British data.
When starting on French data I encountered a problem which puzzles me. I have a class CompaniesCache which implements __str__(). Inside that implementation everything uses strs. Say I instantiate a CompaniesCache and assign it to a variable companies. When, in IPython2, I give the command print companies, I get an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 184: ordinal not in range(128)
Alright, that is not strange. Testing: str(companies) reproduces the error, as expected. But companies.__str__() succeeds without problems, as does print companies.__str__(). What is wrong here?
Here the code of the __str__ method of the CompaniesCache object:
class CompaniesCache(object):
    def __init__(self, railrefdatapath):
        self.cache = restoreCompanies(railrefdatapath)

    def __getitem__(self, compcode):
        return self.cache[compcode.upper()]

    def __str__(self):
        s = ''
        for k in sorted(self.cache.keys()):
            s += '\n%s: %s' % (k, self[k].title)
        return s
This is the code for the CompaniesCache object, which contains Company objects in its cache dict. The Company object does not implement the __str__() method.
str doesn't just call __str__. Among other things, it validates the return type, it falls back to __repr__ if __str__ isn't available, and it tries to convert unicode return values to str with the ASCII codec.
Your __str__ method is returning a unicode instance with non-ASCII characters. When str tries to convert that to a bytestring, it fails, producing the error you're seeing.
Don't return a unicode object from __str__. You can implement a __unicode__ method to define how unicode(your_object) behaves, and return an appropriately-encoded bytestring from __str__.
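The error itself is just the implicit ASCII encode of the unicode return value, reproduced directly here (a minimal sketch with an illustrative string):

```python
s = u'Caf\xe9 \xe0 Paris'                # contains non-ASCII characters

try:
    s.encode('ascii')                    # what Python 2's str() does implicitly
except UnicodeEncodeError as exc:
    print('encode failed: %s' % exc.reason)

print(s.encode('utf-8').decode('utf-8') == s)  # True: an explicit codec round-trips
```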
Building on maxpolk's answer, I think all you need to do is set your environment variable:
export LC_ALL='en_US.utf8'
All in all, I think you can find your answer in this post.
I wanted to use Whoosh in my application and followed the tutorial here, which was written in 2011.
When I try to unpickle data in this block:
def results_to_instances(request, results):
    instances = []
    for r in results:
        cls = pickle.loads('{0}'.format(r.get('cls')))
        id = r.get('id')
        instance = request.db.query(cls).get(id)
        instances.append(instance)
    return instances
I get an error from the pickle.loads() command:
TypeError: 'str' does not support the buffer interface
When I check what '{0}'.format(r.get('cls')) returns, it is type str, but the value is "b'foo'".
How do I get the bytes object out of the string? Encoding it just returns b"b'foo'".
The values are pickled in this block:
def first_index(self, writer):
    oid = u'{0}'.format(self.id)
    cls = u'{0}'.format(pickle.dumps(self.__class__))
    attributes = []
    for attr in self.__whoosh_value__.split(','):
        if getattr(self, attr) is not None:
            attributes.append(str(getattr(self, attr)))
    value = u' '.join(attributes)
    writer.add_document(id=oid, cls=cls, value=value)
So if there is a way to fix it at the root, that would be better.
Just use r.get('cls'). Wrapping it in '{0}'.format() turns the bytes into a str in the first place, which is not what you want at all. The same goes for wrapping pickle.dumps (immediately converting the useful bytes it returns into a useless formatted version). Basically, none of your uses of '{0}'.format() make sense, because they produce str when you're trying to work with the raw data.
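The effect of the format call is easy to see in Python 3 (illustrative values, not the question's Whoosh objects):

```python
import pickle

blob = pickle.dumps(dict)              # real bytes: round-trips cleanly
assert pickle.loads(blob) is dict

wrapped = u'{0}'.format(blob)          # str(bytes): the repr leaks into the text
print(type(wrapped).__name__)          # 'str', not 'bytes'
print(wrapped[:2])                     # "b'" -- quotes and prefix baked in
```

Once the repr has been baked into a str like this, no amount of encoding recovers the original bytes, which is why the fix has to happen at the point where the data is stored.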
I'm starting to port some code from Python2.x to Python3.x, but before I make the jump I'm trying to modernise it to recent 2.7. I'm making good progress with the various tools (e.g. futurize), but one area they leave alone is the use of buffer(). In Python3.x buffer() has been removed and replaced with memoryview() which in general looks to be cleaner, but it's not a 1-to-1 swap.
One way in which they differ is:
In [1]: a = "abcdef"
In [2]: b = buffer(a)
In [3]: m = memoryview(a)
In [4]: print b, m
abcdef <memory at 0x101b600e8>
That is, str(<buffer object>) returns a byte-string containing the contents of the object, whereas memoryviews return their repr(). I think the new behaviour is better, but it's causing issues.
In particular I've got some code which is throwing an exception because it's receiving a byte-string containing <memory at 0x1016c95a8>. That suggests that there's a piece of code somewhere else that is relying on this behaviour to work, but I'm having real trouble finding it.
Does anybody have a good debugging trick for this type of problem?
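The asymmetry itself is easy to confirm (Python 3; tobytes() or bytes(mv) is the explicit replacement for the old implicit str(buffer) contents):

```python
m = memoryview(b'abcdef')

print(str(m)[:10])      # '<memory at' -- repr-style, contents not shown
print(m.tobytes())      # b'abcdef'    -- the actual contents
print(bytes(m))         # b'abcdef'    -- equivalent explicit conversion
```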
One possible trick is to write a subclass of memoryview and temporarily change all your memoryview instances to, let's say, memoryview_debug versions:
class memoryview_debug(memoryview):
    def __init__(self, string):
        memoryview.__init__(self, string)

    def __str__(self):
        # ... place a breakpoint, log the call, print stack trace, etc.
        return memoryview.__str__(self)
EDIT:
As noted by the OP, it is apparently impossible to subclass memoryview. Fortunately, thanks to dynamic typing, that's not a big problem in Python; it will just be more inconvenient. You can change inheritance to composition:
class memoryview_debug(object):
    def __init__(self, string):
        self.innerMemoryView = memoryview(string)

    def tobytes(self):
        return self.innerMemoryView.tobytes()

    def tolist(self):
        return self.innerMemoryView.tolist()

    # some other methods if used by your code,
    # and if overridden in the memoryview implementation (e.g. __len__?)

    def __str__(self):
        # ... place a breakpoint, log the call, print stack trace, etc.
        return self.innerMemoryView.__str__()
I have a simple spyne service:
class JiraAdapter(ServiceBase):

    @srpc(Unicode, String, Unicode, _returns=Status)
    def CreateJiraIssueWithBase64Attachment(summary, base64attachment,
                                            attachment_filename):
        status = Status
        try:
            newkey = jira_client.createWithBase64Attachment(
                summary, base64attachment, attachment_filename)
            status.Code = StatusCodes.IssueCreated
            status.Message = unicode(newkey)
        except Exception as e:
            status.Code = StatusCodes.InternalError
            status.Message = u'Internal Exception: %s' % e.message
        return status
The problem is that some programs will insert '\n' into the generated base64 string after every 60th character or so, and it comes into the service's method escaped ('\\n'), causing things to behave oddly. Is there a setting or something to avoid this?
First, some comments about the code you posted:
You must instantiate your types (i.e. status = Status() instead of status = Status). As it is, you're setting class attributes on the Status class. Not only is this just wrong, you're also creating race conditions by altering global state without proper locking.
Does Jira have a way of creating issues with binary data? You can use ByteArray that handles base64 encoding/decoding for you. Note that ByteArray gets deserialized as a sequence of strings.
You can define a custom base64 type:
Base64String = String(pattern='[0-9a-zA-Z/+=]+')
... and use it instead of plain String together with validation to effortlessly reject invalid input.
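As a quick sanity check of that pattern outside spyne (plain re, using the standard base64 alphabet from the snippet above):

```python
import re

base64_chars = re.compile(r'\A[0-9a-zA-Z/+=]+\Z')

print(bool(base64_chars.match('SGVsbG8gd29ybGQ=')))     # True: clean base64
print(bool(base64_chars.match('SGVsbG8=\\nd29ybGQ=')))  # False: stray backslash-n rejected
```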
Instead of returning a "Status" object, I'd return nothing but raise an exception when needed (or you can just let the original exception bubble up). Exceptions also get serialized just like normal objects. But that's your decision to make as it depends on how you want your API to be consumed.
Now for your original question:
You'll agree that the right thing to do here is to fix whatever's escaping '\n' (i.e. 0x0a) as r"\n" (i.e. 0x5c 0x6e).
If you want to deal with it anyway, I guess the solution in your comment (i.e. base64attachment = base64attachment.decode('string-escape')) would be the best one.
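The 'string-escape' codec is Python 2 only; 'unicode_escape' behaves similarly and shows the idea (illustrative payload, not real attachment data):

```python
import codecs

escaped = 'SGVsbG8=\\nd29ybGQ='            # literal backslash-n, as received
fixed = codecs.decode(escaped, 'unicode_escape')

print('\n' in fixed)                       # True: a real newline was restored
print('\\n' in fixed)                      # False: the two-character escape is gone
```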
I hope that helps.