Get bytes object from bytes object in string quotes - python

I wanted to use Whoosh in my application and followed the tutorial here, which was written in 2011.
When I try to unpickle data in this block:
def results_to_instances(request, results):
    instances = []
    for r in results:
        cls = pickle.loads('{0}'.format(r.get('cls')))
        id = r.get('id')
        instance = request.db.query(cls).get(id)
        instances.append(instance)
    return instances
I get an error from the pickle.loads() command:
TypeError: 'str' does not support the buffer interface
When I check what '{0}'.format(r.get('cls')) returns, it is type str, but the value is "b'foo'".
How do I get the bytes object out of the string? Encoding it just returns b"b'foo'".
The values are pickled in this block:
def first_index(self, writer):
    oid = u'{0}'.format(self.id)
    cls = u'{0}'.format(pickle.dumps(self.__class__))
    attributes = []
    for attr in self.__whoosh_value__.split(','):
        if getattr(self, attr) is not None:
            attributes.append(str(getattr(self, attr)))
    value = u' '.join(attributes)
    writer.add_document(id=oid, cls=cls, value=value)
So if there is a way to fix it at the root, that would be better.

Just use r.get('cls'). Wrapping it in '{0}'.format() converts the bytes object into its str representation ("b'foo'") in the first place, which is not what you want at all. The same goes for wrapping pickle.dumps: it immediately converts the useful bytes it returns into that useless formatted version. All of your uses of '{0}'.format() defeat the purpose, because they produce str where you need the raw bytes.


How do I use the second Value constructor in gdb's python API?

In gdb's Values From Inferior documentation, there's a second constructor for creating objects within python. It states:
Function: Value.__init__ (val, type)
This second form of the gdb.Value constructor returns a gdb.Value of type type where the value contents are taken from the Python buffer object specified by val. The number of bytes in the Python buffer object must be greater than or equal to the size of type.
My question is, how do I create a buffer object that I can pass into the constructor? For instance, if I wanted to create a string (yes, I know that the first Value constructor can do this, but this is an example) I wrote the following function:
def make_str(self, str):
    str += '\0'
    s = bytearray(str.encode())
    return gdb.Value(s, gdb.lookup_type('char').array(len(str)))
However, when I tried to use it, I got the message:
Python Exception <class 'ValueError'> Size of type is larger than that of buffer object.:
How would I make a buffer object that I could pass into the Value constructor? What would I need to do to generate a Value object?
Hmmmm. Seems to be an off by one error since this worked:
def make_str(self, str):
    str += '\0'
    s = bytearray(str.encode())
    return gdb.Value(s, gdb.lookup_type('char').array(len(s)-1))
Which is strange, I would have expected the array length to be the length of the string, not one less. (The explanation is that gdb's Type.array(n) takes an inclusive upper bound: array(n) describes elements indexed 0 through n, i.e. n + 1 elements, so array(len(s) - 1) is exactly len(s) chars, matching the buffer.)

Determine the type of the result of `file.read()` from `file` in Python

I have some code that operates on a file object in Python.
Following Python3's string/bytes revolution, if file was opened in binary mode, file.read() returns bytes.
Conversely if file was opened in text mode, file.read() returns str.
In my code, file.read() is called multiple times and therefore it is not practical to check for the result-type every time I call file.read(), e.g.:
def foo(file_obj):
    while True:
        data = file_obj.read(1)
        if not data:
            break
        if isinstance(data, bytes):
            # do something for bytes
            ...
        else:  # isinstance(data, str)
            # do something for str
            ...
What I would like to have instead is some ways of reliably checking what the result of file.read() will be, e.g.:
def foo(file_obj):
    if is_binary_file(file_obj):
        # do something for bytes
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...
    else:
        # do something for str
        while True:
            data = file_obj.read(1)
            if not data:
                break
            ...
A possible way would be to check file_obj.mode e.g.:
import io

def is_binary_file(file_obj):
    return 'b' in file_obj.mode

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# AttributeError: '_io.StringIO' object has no attribute 'mode'
print(is_binary_file(io.BytesIO(b'ciao')))
# AttributeError: '_io.BytesIO' object has no attribute 'mode'
which would fail for the objects from io like io.StringIO() and io.BytesIO().
Another way, which would also work for io objects, would be to check for the encoding attribute, e.g:
import io

def is_binary_file(file_obj):
    return not hasattr(file_obj, 'encoding')

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False
print(is_binary_file(io.BytesIO(b'ciao')))
# True
Is there a cleaner way to perform this check?
I have a version of this in astropy (for Python 3, though a Python 2 version can be found in older versions of Astropy if needed for some reason).
It's not pretty, but it works reliably enough for most cases (I took out the part that checks for a .binary attribute since that's only applicable to a class in Astropy):
def fileobj_is_binary(f):
    """
    Returns True if the given file or file-like object has a file open in
    binary mode. When in doubt, returns True by default.
    """
    if isinstance(f, io.TextIOBase):
        return False

    mode = fileobj_mode(f)
    if mode:
        return 'b' in mode
    else:
        return True
where fileobj_mode is:
def fileobj_mode(f):
    """
    Returns the 'mode' string of a file-like object if such a thing exists.
    Otherwise returns None.
    """
    # Go from most to least specific--for example gzip objects have a 'mode'
    # attribute, but it's not analogous to the file.mode attribute

    # gzip.GzipFile -like
    if hasattr(f, 'fileobj') and hasattr(f.fileobj, 'mode'):
        fileobj = f.fileobj
    # astropy.io.fits._File -like, doesn't need additional checks because it's
    # already validated
    elif hasattr(f, 'fileobj_mode'):
        return f.fileobj_mode
    # PIL-Image -like, investigate the fp (filebuffer)
    elif hasattr(f, 'fp') and hasattr(f.fp, 'mode'):
        fileobj = f.fp
    # FILEIO -like (normal open(...)), keep as is
    elif hasattr(f, 'mode'):
        fileobj = f
    # Doesn't look like a file-like object, for example strings, urls or paths
    else:
        return None

    return _fileobj_normalize_mode(fileobj)
def _fileobj_normalize_mode(f):
    """Takes care of some corner cases in Python where the mode string
    is either oddly formatted or does not truly represent the file mode.
    """
    mode = f.mode

    # Special case: Gzip modes:
    if isinstance(f, gzip.GzipFile):
        # GzipFiles can be either readonly or writeonly
        if mode == gzip.READ:
            return 'rb'
        elif mode == gzip.WRITE:
            return 'wb'
        else:
            return None  # This shouldn't happen?

    # Sometimes Python can produce modes like 'r+b' which will be normalized
    # here to 'rb+'
    if '+' in mode:
        mode = mode.replace('+', '')
        mode += '+'

    return mode
You might also want to add a special case for io.BytesIO. Again, ugly, but works for most cases. Would be great if there were a simpler way.
After a bit more homework, I can probably answer my own question.
First of all, a general remark: checking for the presence/absence of an attribute/method as a hallmark for the whole API is not a good idea because it will lead to more complex and still relatively unsafe code.
Following the EAFP/duck-typing mindset it may be OK to check for a specific method, but it should be the one used subsequently in the code.
The problem with file.read() (and even more so with file.write()) is that it comes with side effects that make it impractical to just try using it and see what happens.
For this specific case, while still following the duck-typing mindset, one could exploit the fact that the first parameter of read() can be set to 0.
This will not actually read anything from the buffer (and it will not change the result of file.tell()), but it will give an empty str or bytes.
Hence, one could write something like:
def is_reading_bytes(file_obj):
    return isinstance(file_obj.read(0), bytes)

print(is_reading_bytes(open('test_file', 'r')))
# False
print(is_reading_bytes(open('test_file', 'rb')))
# True
print(is_reading_bytes(io.StringIO('ciao')))
# False
print(is_reading_bytes(io.BytesIO(b'ciao')))
# True
Similarly, one could try writing an empty bytes string b'' for the write() method:
def is_writing_bytes(file_obj):
    try:
        file_obj.write(b'')
    except TypeError:
        return False
    else:
        return True

print(is_writing_bytes(open('test_file', 'w')))
# False
print(is_writing_bytes(open('test_file', 'wb')))
# True
print(is_writing_bytes(io.StringIO('ciao')))
# False
print(is_writing_bytes(io.BytesIO(b'ciao')))
# True
Note that those methods will not check for readability / writability.
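As a side note (an extra illustration, not part of the original answer, reusing the same throw-away test_file): the read(0) probe additionally assumes the stream is readable at all; on a write-only file it raises io.UnsupportedOperation instead of returning an empty result:

```python
import io

# read(0) assumes the stream is readable.  On a write-only file it
# raises io.UnsupportedOperation rather than returning '' / b''.
f = open('test_file', 'w')
try:
    f.read(0)
except io.UnsupportedOperation as exc:
    print(type(exc).__name__)
    # UnsupportedOperation
finally:
    f.close()
```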
Finally, one could implement a proper type-checking approach by inspecting the file-like object API.
A file-like object in Python must support the API described in the io module.
In the documentation it is mentioned that TextIOBase is used for files opened in text mode, while BufferedIOBase (or RawIOBase for unbuffered streams) is used for files opened in binary mode.
The class hierarchy summary indicates that both are subclasses of IOBase.
Hence the following will do the trick (remember that isinstance() checks for subclasses too):
def is_binary_file(file_obj):
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)

print(is_binary_file(open('test_file', 'w')))
# False
print(is_binary_file(open('test_file', 'wb')))
# True
print(is_binary_file(open('test_file', 'r')))
# False
print(is_binary_file(open('test_file', 'rb')))
# True
print(is_binary_file(io.StringIO('ciao')))
# False
print(is_binary_file(io.BytesIO(b'ciao')))
# True
Note that the documentation explicitly says that TextIOBase has an encoding attribute, which is not required (i.e. it is not there) for binary file objects.
Hence, with the current API, checking the encoding attribute may be a handy hack to check whether a file object is binary for the standard classes, under the assumption that the tested object is file-like.
Checking the mode attribute would only work for FileIO objects: mode is not part of the IOBase / RawIOBase interface, which is why it does not work on io.StringIO() / io.BytesIO() objects.
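As an extra illustration (not part of the original answer), the isinstance() check also handles file-like objects that lack a useful mode attribute, such as gzip streams and text wrappers:

```python
import gzip
import io

def is_binary_file(file_obj):
    # Binary streams are IOBase instances that are not TextIOBase.
    return isinstance(file_obj, io.IOBase) and not isinstance(file_obj, io.TextIOBase)

# gzip.GzipFile subclasses io.BufferedIOBase, so it is caught too.
gz = gzip.GzipFile(fileobj=io.BytesIO(), mode='wb')
print(is_binary_file(gz))
# True

# A text wrapper around a binary buffer is a TextIOBase, hence not binary.
wrapper = io.TextIOWrapper(io.BytesIO(b'ciao'), encoding='utf-8')
print(is_binary_file(wrapper))
# False
```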

Working with Your Own Types - Python

I am trying to understand the following topic and have some outstanding questions. Can anyone help me?:
class MyObj(object):
    def __init__(self, s):
        self.s = s
    def __repr__(self):
        return '<MyObj(%s)>' % self.s
====================================
import json
import json_myobj

obj = json_myobj.MyObj('instance value goes here')

print 'First attempt'
try:
    print json.dumps(obj)
except TypeError, err:
    print 'ERROR:', err

def convert_to_builtin_type(obj):
    print 'default(', repr(obj), ')'
    # Convert objects to a dictionary of their representation
    d = {'__class__': obj.__class__.__name__,
         '__module__': obj.__module__,
         }
    d.update(obj.__dict__)
    return d

print
print 'With default'
print json.dumps(obj, default=convert_to_builtin_type)
Question: what is the purpose of the following code?
d = {'__class__': obj.__class__.__name__,
     '__module__': obj.__module__,
     }
d.update(obj.__dict__)
I think there are two things you need to know to understand this code snippet.
JSON serialization and deserialization.
JSON is a data-exchange format. In particular, it is text-based, which means that if you want to save your data into a text file, you have to determine how to represent your data as text (the serialization process). Of course, when you load data from a text file, you also need to determine how to parse the text back into the in-memory structure (the deserialization process). Luckily, by default, Python's json module handles most of the built-in data types, e.g. scalars, lists, and dicts. But in your case, you have created your own data type, so you have to specify how to serialize it. This is what the function convert_to_builtin_type does.
Python data model
Now we come to the problem of how to serialize the self-defined object MyObj. There is no universal answer to this question, but the baseline is that you must be able to recover your object (deserialize it) from the serialized text. In your case:
d = {'__class__': obj.__class__.__name__,
     '__module__': obj.__module__,
     }
d.update(obj.__dict__)
The obj.__dict__ is a built-in dictionary that stores the attributes of obj. You may read the Python documentation on the Data Model to understand it. The intention here is to give enough information to recover obj. For example:
__class__=<c> provides the name of the class
__module__=<m> provides the module in which to find the class
s=<v> provides the attribute and value of MyObj.s
With these three, you can recover the object you previously stored. For these hidden (built-in) attributes starting with __, check the Python documentation.
Hope this would be helpful.
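To make the recovery step concrete, here is a Python 3 sketch of the matching deserialization side, using the object_hook parameter of json.loads (dict_to_object is a name made up here for illustration; it is not part of the original example):

```python
import importlib
import json

class MyObj(object):
    def __init__(self, s):
        self.s = s

def convert_to_builtin_type(obj):
    # Record the class name and module alongside the instance attributes.
    d = {'__class__': obj.__class__.__name__,
         '__module__': obj.__module__}
    d.update(obj.__dict__)
    return d

def dict_to_object(d):
    # Reverse of convert_to_builtin_type: re-import the class and
    # rebuild the instance from the remaining keys.
    if '__class__' in d:
        module = importlib.import_module(d.pop('__module__'))
        cls = getattr(module, d.pop('__class__'))
        obj = cls.__new__(cls)      # skip __init__, restore attributes directly
        obj.__dict__.update(d)
        return obj
    return d

text = json.dumps(MyObj('hello'), default=convert_to_builtin_type)
restored = json.loads(text, object_hook=dict_to_object)
print(restored.s)
# hello
```

This only works when the class is importable at load time, and bypassing __init__ via cls.__new__ assumes the attributes in __dict__ fully describe the instance.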

Class that returns json, python

I have a python class that should return a json, it looks something like this:
class ScanThis():
    def __init__(self, test):
        data = {}
        if test > 5:
            data["amount"] = test
            self.json_data = json.dumps(data)
        else:
            data["fail"] = test
            self.json_data = json.dumps(data)
    def __str__(self):
        return self.json_data
and I'm trying to call it like so:
output = json.loads(ScanThis(8))
print(output["command"])
But I get this error:
TypeError: the JSON object must be str, bytes or bytearray, not 'ScanThis'
I believe my earlier class returns an object of type ScanThis() rather than a JSON string like I wanted. I just want to know how I'd fix this.
Thank you
PS: I apologise if this code is rough or invalid, it's not the actual code, just something similar I made up
Update: Again, this isn't the real code, it's just a small basic fragment of the actual code. There's a good reason I'm using a class, and a json is used cause data transfer over the internet is involved
Use str(..)
You can't call json.loads on a ScanThis object directly. So that won't work. Like the error says, json.loads expects a str, bytes or bytearray object.
You can however use str(..) to invoke the __str__(self) method, and thus obtain the JSON data:
output = json.loads(str(ScanThis(8)))
#                   ^ get the __str__ result
Use another method
That being said, it is usually a better idea to define a method, for instance to_json to obtain the JSON data, since now you have made str(..) return a JSON object. So perhaps a more elegant way to do this is the following:
class ScanThis():
    def __init__(self, test):
        data = {}
        if test > 5:
            data["amount"] = test
            self.json_data = json.dumps(data)
        else:
            data["fail"] = test
            self.json_data = json.dumps(data)
    def to_json(self):
        return self.json_data
and call it with:
output = json.loads(ScanThis(8).to_json())
Now you can still use __str__ for another purpose. Furthermore, by using to_json you make it explicit that the result is a JSON string. Using str for JSON conversion is of course not forbidden, but str(..), as a name, does not provide many guarantees about the format of the result, whereas to_json (or a similar name) strongly hints that you will obtain JSON data.
I don't think you want to use a class here at all.
Instead, try using a function that returns a string. For example:
def scan_this(test):
    data = {}
    if test > 5:
        data["amount"] = test
        json_data = json.dumps(data)
    else:
        data["fail"] = test
        json_data = json.dumps(data)
    return json_data

output = json.loads(scan_this(8))
However!! Now you are just doing extra work for nothing. Why serialize a Python dictionary to a JSON-formatted string, only to load it straight back into a Python dictionary? While you are working with data in Python, it's best to keep it as native data types, and only use the json module either to load from a string/file you already have, or to serialize to a string/file for storage or transfer (e.g. sending over the internet).
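A minimal sketch of that boundary pattern, reusing the made-up scan_this example: keep a plain dict internally and call json.dumps only when the data actually leaves the program:

```python
import json

def scan_this(test):
    # Work with native data types internally -- no json.dumps here.
    if test > 5:
        return {"amount": test}
    return {"fail": test}

result = scan_this(8)          # plain dict, ready to use directly
print(result["amount"])
# 8

payload = json.dumps(result)   # serialize only at the transfer boundary
print(payload)
# {"amount": 8}
```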

Unicode errors on zope security proxied objects

Ok, so the general background for this question is that I'm trying to make a custom dictionary class that will create a string representation of the dictionary which is just a lookup of one of the values (which are all unicode values). In the real code, depending on some internal logic, one of the keys is chosen as the current default for the lookup, so that unicode(dict_obj) will return a single value within the dictionary such as u'Some value' or if the value doesn't exist for the current default key: u'None'
This functionality is working no problem. The real problem lies when using it within the application from the zope page templates which wrap the object in a security proxy. The proxied object doesn't behave the same as the original object.
Here is the boiled down code of the custom dictionary class:
class IDefaultKeyDict(Interface):

    def __unicode__():
        """Create a unicode representation of the dictionary."""

    def __str__():
        """Create a string representation of the dictionary."""


class DefaultKeyDict(dict):
    """A custom dictionary for handling default values"""

    implements(IDefaultKeyDict)

    def __init__(self, default, *args, **kwargs):
        super(DefaultKeyDict, self).__init__(*args, **kwargs)
        self._default = default

    def __unicode__(self):
        print "In DefaultKeyDict.__unicode__"
        key = self.get_current_default()
        result = self.get(key)
        return unicode(result)

    def __str__(self):
        print "In DefaultKeyDict.__str__"
        return unicode(self).encode('utf-8')

    def get_current_default(self):
        return self._default
And the associated zcml permissions for this class:
<class class=".utils.DefaultKeyDict">
  <require
      interface=".utils.IDefaultKeyDict"
      permission="zope.View" />
</class>
I've left the print statements in both the __unicode__ and __str__ methods to show the different behavior with the proxied objects. So creating a dummy dictionary class with a pre-defined default key:
>>> dummy = DefaultKeyDict(u'key2', {u'key1': u'Normal ascii text', u'key2': u'Espa\xf1ol'})
>>> dummy
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> str(dummy)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(dummy)
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
>>> print dummy
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
Español
Everything works as expected. Now I can wrap the object in a security proxy from the zope.security package and do the same tests to show the error:
>>> from zope.security.checker import ProxyFactory
>>> prox = ProxyFactory(dummy)
>>> prox
{u'key2': u'Espa\xf1ol', u'key1': u'Normal ascii text'}
>>> type(prox)
<type 'zope.security._proxy._Proxy'>
>>> str(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
'Espa\xc3\xb1ol'
>>> unicode(prox)
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
As you can see, calling unicode on the proxied object is no longer possible if it contains any special characters. The proxy object from zope.security is mostly defined in C, and I'm not at all familiar with the CPython C API, but it seems that the __str__ and __repr__ methods are defined in the C code while __unicode__ is not. So what appears to be happening is that when a unicode representation of the proxied object is requested, instead of calling the __unicode__ method directly, it calls the __str__ method (as you can see from the last few print statements above), which returns a utf-8 encoded byte string that is then converted to unicode using the default ascii encoding. In other words, the equivalent of this:
>>> unicode(prox.__str__())
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
So of course this results in a UnicodeDecodeError, since it tries to decode a utf-8 string as ascii. As expected, if I could specify utf-8 as the encoding there would be no problem.
>>> unicode(prox.__str__(), encoding='utf-8')
In DefaultKeyDict.__str__
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
But I can't change that since we are talking about the zope.pagetemplate and zope.tales packages that are creating the unicode representation out of all types of objects, and they always seem to be working with the security proxied objects (from zope.security). Also of note, there is no problem calling the __unicode__ method directly on the object.
>>> prox.__unicode__()
In DefaultKeyDict.__unicode__
u'Espa\xf1ol'
So the real problem is that unicode(prox) calls the __str__ method. I've been spinning my wheels on this for a while and don't know where else to go now. Any insights would be much appreciated.
Judging by what you've said about the C API defining __str__ and __repr__ methods but not a __unicode__ method, I suspect whatever C library you're using has been written to be compatible with Python 3. I'm not familiar with Zope, but I'm relatively confident this is the case.
In Python 2, the object model specifies str() and unicode()
methods. If these methods exist, they must return str (bytes) and
unicode (text) respectively.
In Python 3, there’s simply str(), which must return str (text).
I may be slightly missing the point of your program, but do you really need the __unicode__ method defined? As you've said, everything in the dict belongs to the unicode character set, so calling the __str__ method will encode that into utf-8; and if you wanted to see the raw bytes of the string, why not just encode it?
Note that decode() returns a string object, whilst encode() returns a bytes object.
If you could, please post an edit/comment so I can understand a little bit more what you're trying to do.
In case anybody is looking for a temporary solution to this problem, I can share the monkeypatch fixes that we've implemented. Patching these two methods from zope.tal and zope.tales seems to do the trick. This will work well as long as you know that the encoding will always be utf-8.
from zope.tal import talinterpreter

def do_insertStructure_tal(self, (expr, repldict, block)):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way to
    fix this is to monkey patch this method which calls 'unicode'.
    """
    structure = self.engine.evaluateStructure(expr)
    if structure is None:
        return
    if structure is self.Default:
        self.interpret(block)
        return
    if isinstance(structure, talinterpreter.I18nMessageTypes):
        text = self.translate(structure)
    else:
        try:
            text = unicode(structure)
        except UnicodeDecodeError:
            text = unicode(str(structure), encoding='utf-8')
    if not (repldict or self.strictinsert):
        # Take a shortcut, no error checking
        self.stream_write(text)
        return
    if self.html:
        self.insertHTMLStructure(text, repldict)
    else:
        self.insertXMLStructure(text, repldict)

talinterpreter.TALInterpreter.do_insertStructure_tal = do_insertStructure_tal
talinterpreter.TALInterpreter.bytecode_handlers_tal["insertStructure"] = \
    do_insertStructure_tal
and this one:
from zope.tales import tales

def evaluateText(self, expr):
    """Patch for zope.security proxied I18NDicts.

    The Proxy wrapper doesn't support a unicode hook for now. The only way to
    fix this is to monkey patch this method which calls 'unicode'.
    """
    text = self.evaluate(expr)
    if text is self.getDefault() or text is None:
        return text
    if isinstance(text, basestring):
        # text could already be something text-ish, e.g. a Message object
        return text
    try:
        return unicode(text)
    except UnicodeDecodeError:
        return unicode(str(text), encoding='utf-8')

tales.Context.evaluateText = evaluateText
