numpy ndarray hashability - python

I have some problems understanding how the hashability of numpy objects is managed.
>>> import numpy as np
>>> class Vector(np.ndarray):
...     pass
>>> nparray = np.array([0.])
>>> vector = Vector(shape=(1,), buffer=nparray)
>>> ndarray = np.ndarray(shape=(1,), buffer=nparray)
>>> nparray
array([ 0.])
>>> ndarray
array([ 0.])
>>> vector
Vector([ 0.])
>>> '__hash__' in dir(nparray)
True
>>> '__hash__' in dir(ndarray)
True
>>> '__hash__' in dir(vector)
True
>>> hash(nparray)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> hash(ndarray)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> hash(vector)
-9223372036586049780
>>> nparray.__hash__()
269709177
>>> ndarray.__hash__()
269702147
>>> vector.__hash__()
-9223372036586049780
>>> id(nparray)
4315346832
>>> id(ndarray)
4315234352
>>> id(vector)
4299616456
>>> nparray.__hash__() == id(nparray)
False
>>> ndarray.__hash__() == id(ndarray)
False
>>> vector.__hash__() == id(vector)
False
>>> hash(vector) == vector.__hash__()
True
How come
numpy objects define a __hash__ method but are nevertheless not hashable, while
a class deriving from numpy.ndarray defines __hash__ and is hashable?
Am I missing something?
I'm using Python 2.7.1 and numpy 1.6.1
Thanks for any help!
EDIT: added objects ids
EDIT2:
And following deinonychusaur's comment, trying to figure out whether hashing is based on content, I played with numpy.ndarray.dtype and found something I find quite strange:
>>> [Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype) for mytype in ('float', 'int', 'float128')]
[Vector([ 1.]), Vector([1]), Vector([ 1.0], dtype=float128)]
>>> [id(Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype)) for mytype in ('float', 'int', 'float128')]
[4317742576, 4317742576, 4317742576]
>>> [hash(Vector(shape=(1,), buffer=np.array([1], dtype=mytype), dtype=mytype)) for mytype in ('float', 'int', 'float128')]
[269858911, 269858911, 269858911]
I'm puzzled... is there some (type-independent) caching mechanism in numpy?

I get the same results in Python 2.6.6 and numpy 1.3.0. According to the Python glossary, an object should be hashable if __hash__ is defined (and is not None), and either __eq__ or __cmp__ is defined. ndarray.__eq__ and ndarray.__hash__ are both defined and return something meaningful, so I don't see why hash should fail. After a quick Google search, I found this post on the python.scientific.devel mailing list, which states that arrays have never been intended to be hashable - so why ndarray.__hash__ is defined, I have no idea. Note that isinstance(nparray, collections.Hashable) returns True.
EDIT: Note that nparray.__hash__() returns the same as id(nparray) on my machine, so this is just the default implementation. Maybe it was difficult or impossible to remove the implementation of __hash__ in earlier versions of Python (the __hash__ = None technique was apparently introduced in 2.6), so they used some kind of C API magic to achieve this in a way that wouldn't propagate to subclasses, and wouldn't stop you from calling ndarray.__hash__ explicitly?
Things are different in Python 3.2.2 and the current numpy 2.0.0 from the repo. The __cmp__ method no longer exists, so hashability now requires __hash__ and __eq__ (see the Python 3 glossary). In this version of numpy, ndarray.__hash__ is defined, but it is just None, so it cannot be called. hash(nparray) fails and isinstance(nparray, collections.Hashable) returns False, as expected. hash(vector) also fails.
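To illustrate the __hash__ = None technique mentioned above, here is a minimal sketch (the class name is made up) showing that it produces exactly the behavior seen with numpy on Python 3:
>>> class NotHashable(object):
...     def __eq__(self, other):
...         return NotImplemented
...     __hash__ = None    # mark the class as unhashable
...
>>> hash(NotHashable())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'NotHashable'
>>> import collections
>>> isinstance(NotHashable(), collections.Hashable)
False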

This is not a complete answer, but here are some leads to follow to understand this behavior.
I refer here to the numpy code of the 1.6.1 release.
According to the numpy.ndarray object implementation (see numpy/core/src/multiarray/arrayobject.c), the tp_hash slot is set to NULL:
NPY_NO_EXPORT PyTypeObject PyArray_Type = {
#if defined(NPY_PY3K)
    PyVarObject_HEAD_INIT(NULL, 0)
#else
    PyObject_HEAD_INIT(NULL)
    0,                                          /* ob_size */
#endif
    "numpy.ndarray",                            /* tp_name */
    sizeof(PyArrayObject),                      /* tp_basicsize */
    /* ... */
    &array_as_mapping,                          /* tp_as_mapping */
    (hashfunc)0,                                /* tp_hash */
This tp_hash slot seems to be overridden in numpy/core/src/multiarray/multiarraymodule.c. See the DUAL_INHERIT and DUAL_INHERIT2 macros and the initmultiarray function, where the tp_hash attribute is modified.
For example:
PyArrayDescr_Type.tp_hash = PyArray_DescrHash
According to hashdescr.c, the hash is implemented as follows:
* How does this work ? The hash is computed from a list which contains all the
* information specific to a type. The hard work is to build the list
* (_array_descr_walk). The list is built as follows:
* * If the dtype is builtin (no fields, no subarray), then the list
* contains 6 items which uniquely define one dtype (_array_descr_builtin)
* * If the dtype is a compound array, one walk on each field. For each
* field, we append title, names, offset to the final list used for
* hashing, and then append the list recursively built for each
* corresponding dtype (_array_descr_walk_fields)
* * If the dtype is a subarray, one adds the shape tuple to the list, and
* then append the list recursively built for each corresponding type
* (_array_descr_walk_subarray)
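Note that this describes the hash of dtype objects, not arrays: dtypes themselves are hashable, and the walk above is what makes equal dtypes hash equally. A quick sanity check, as a sketch run against a recent numpy:
>>> import numpy as np
>>> hash(np.dtype('float64')) == hash(np.dtype(np.float64))
True
>>> d = {np.dtype('int32'): 'i4'}    # dtypes work fine as dict keys
>>> d[np.dtype('int32')]
'i4'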

Related

Why don't python dict keys/values quack like a duck?

Python is duck typed, and generally this avoids casting faff when dealing with primitive objects.
The canonical example (and the reason behind the name) is the duck test: If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
However one notable exception is dict keys/values, which look like a duck and swim like a duck, but notably do not quack like a duck.
>>> ls = ['hello']
>>> d = {'foo': 'bar'}
>>> for key in d.keys():
...     print(key)
...
foo
>>> ls + d.keys()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "dict_keys") to list
Can someone enlighten me as to why this is?
Dict keys actually implement the set interface rather than the list interface, so you can perform set operations with dict keys directly against other sets:
d.keys() & {'foo', 'bar'} # returns {'foo'}
But they don't implement the __getitem__, __setitem__, __delitem__, and insert methods, which are required to "quack" like a list, so they cannot take part in any list operation without being explicitly converted to a list first:
ls + list(d.keys()) # returns ['hello', 'foo']
There is an explicit check for the list type (or its subclasses) in the Python source code, so even a tuple doesn't qualify:
static PyObject *
list_concat(PyListObject *a, PyObject *bb)
{
    Py_ssize_t size;
    Py_ssize_t i;
    PyObject **src, **dest;
    PyListObject *np;
    if (!PyList_Check(bb)) {
        PyErr_Format(PyExc_TypeError,
                     "can only concatenate list (not \"%.200s\") to list",
                     bb->ob_type->tp_name);
        return NULL;
    }
This way Python can compute the size very quickly and allocate the result without trying all container types or iterating over the right-hand side to find out its length, providing very fast list addition:
#define b ((PyListObject *)bb)
    size = Py_SIZE(a) + Py_SIZE(b);
    if (size < 0)
        return PyErr_NoMemory();
    np = (PyListObject *) PyList_New(size);
    if (np == NULL) {
        return NULL;
    }
One way to work around this is to use in-place extension/addition:
my_list += my_dict # adding .keys() is useless
because in that case, in-place add iterates over the right-hand side, so every collection qualifies.
(or of course force iteration of the right hand: + list(my_dict))
So it could have accepted any type, but I suspect the makers of Python didn't find it worth the trouble and were satisfied with a simple & fast implementation that covers 99% of the use cases.
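A minimal sketch of the workaround in action, showing that += accepts the view while + keeps rejecting anything that isn't a list:
>>> ls = ['hello']
>>> d = {'foo': 'bar'}
>>> ls += d.keys()    # list.__iadd__ behaves like extend(), which takes any iterable
>>> ls
['hello', 'foo']
>>> ls + ()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "tuple") to list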
If you go into the definition of d.keys(), then you can see the following:
def keys(self): # real signature unknown; restored from __doc__
    """ D.keys() -> a set-like object providing a view on D's keys """
    pass
Or use this statement:
print(d.keys.__doc__)
It clearly states that the output is a set-like object.
You are trying to concatenate that set-like object with a list.
You need to convert it into a list first:
x = ls + list(d.keys())
print(x)
# ['hello', 'foo']

Python - What is the cheapest data type to be used as "dummy value" in dict

I would like to ask what the cheapest data type is (in terms of memory consumption and the cost of holding/processing it) to use as a dummy value in a Python dict (only the keys of the dict matter to me; the values are just placeholders).
For examples :
d1 = {1: None, 2: None, 3: None}
d2 = {1: -1, 2: -1, 3: -1}
d3 = {1: False, 2: False, 3: False}
Here only the keys (1, 2, 3) are useful to me; the values are not, so they can be anything (they're just placeholders). What I want to know is which dummy value I should use here. For now I use None, but I'm not sure it is the "cheapest" one.
P.S.: I know the best option to store only keys may be to use a set instead of a dict (with dummy values). However, the reason I am doing this is that I want to exchange data between Python and C++ using SWIG. For now I have figured out how to pass a Python dict to C++ as a std::map using SWIG, but I cannot find anything about how to pass a Python set to C++ as a std::set...
Any help/guidance is much appreciated!
Python 3.4, 64-bit:
>>> import sys
>>> sys.getsizeof(None)
16
>>> sys.getsizeof(False)
24
>>> sys.getsizeof(1)
28
>>>
So None would appear to be the best choice (I've only listed immutable objects, and disregarded strings and tuples). Note that it doesn't matter much, as those objects are usually cached, so the size isn't multiplied by the number of elements of your dictionary (furthermore, None is guaranteed to be a singleton).
That said, the cost of the actual object is negligible compared to the cost of storing a reference to it for each key/value pair. If your dictionary holds 1000 values, you have 1000 references to store, whatever the size of the value object.
Conclusion: it doesn't matter much as long as you use the same reference everywhere, and the dict is going to cost much more than a set anyway, because of the references stored as the value of each entry.
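To make that concrete, here is a sketch comparing the container overhead of a keys-only dict against a set (exact sizes vary across Python versions, so only the comparison is shown):
>>> import sys
>>> keys = list(range(1000))
>>> d = dict.fromkeys(keys)    # all values are None, like d1 above
>>> s = set(keys)
>>> sys.getsizeof(d) > sys.getsizeof(s)    # the dict pays for one value slot per entry
True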
One possible alternative would be to pass the set as its JSON representation (as a list, then), i.e. as a pointer to characters, to the C++ side, which would parse it using a good JSON parser. Unless your values are big floating-point numbers (or huge integers), that would save memory all right, because the per-object overhead is eliminated by the serialization.
>>> json.dumps(list(set(range(4,10))))
'[4, 5, 6, 7, 8, 9]' # hard to beat that in terms of size!
You can use a set, but without writing your own typemap SWIG seems to only support passing Python lists (or the named template type) as the std::set parameter. Example (Windows):
test.i:
%module test
%include <std_set.i>
%template(seti) std::set<int>;
%inline %{
#include <set>
#include <iostream>
void func(std::set<int> a)
{
    for(auto i : a)
        std::cout << i << std::endl;
}
%}
Output:
>>> import test
>>> s = test.seti([1,1,2,2,3,3]) # pass named template
>>> test.func(s)
1
2
3
>>> test.func([1,2,3,3,4,4]) # pass a list that converts to a set
1
2
3
4
>>> test.func({1,1,2,2,3}) # Actual set doesn't work.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: in method 'func', argument 1 of type 'std::set< int,std::less< int >,std::allocator< int > >'

Ctypes read data from a double pointer

I am working on a C++ DLL with a C wrapper, and I am creating a Python wrapper for future users (I discovered ctypes on Monday). One of the methods of my DLL (it is a class) returns an unsigned short **, called data, which corresponds to an image. In C++, I get the value of a pixel using data[row][column].
In Python, I created the function on the following model:
mydll.cMyFunction.argtypes = [c_void_p]
mydll.cMyFunction.restype = POINTER(POINTER(c_ushort))
When I call this function, I get result = <__main__.LP_LP_c_ushort at 0x577fac8>, and when I try to see the data at this address (using result.contents.contents) I get the correct value of the first pixel. But I don't know how to access the values of the rest of my image. Is there an easy way to do something like C++'s data[i][j]?
Yes, just use result[i][j]. Here's a contrived example:
>>> from ctypes import *
>>> ppus = POINTER(POINTER(c_ushort))
>>> ppus
<class '__main__.LP_LP_c_ushort'>
>>> # This creates an array of pointers to ushort[5] arrays
>>> x=(POINTER(c_ushort)*5)(*[cast((c_ushort*5)(n,n+1,n+2,n+3,n+4),POINTER(c_ushort)) for n in range(0,25,5)])
>>> a = cast(x,ppus) # gets a ushort**
>>> a
<__main__.LP_LP_c_ushort object at 0x00000000026F39C8>
>>> a[0] # deref to get the first ushort[5] array
<__main__.LP_c_ushort object at 0x00000000026F33C8>
>>> a[0][0] # get an item from a row
0
>>> a[0][1]
1
>>>
>>> a[1][0]
5
So if you are returning the ushort** correctly from C, it should "just work".
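If you then want to pull the whole image into Python objects, a plain nested comprehension over the two indices works. A sketch, where rows, cols and handle are hypothetical values you would obtain from the DLL's own API:
>>> rows, cols = 480, 640                 # hypothetical image dimensions
>>> result = mydll.cMyFunction(handle)    # handle: hypothetical pointer to the C++ object
>>> image = [[result[i][j] for j in range(cols)] for i in range(rows)]
>>> image[0][0] == result.contents.contents.value
True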

Why does SimpleNamespace have a different size than that of an empty class?

Consider the following:
In [1]: import sys, types
In [2]: class A:
   ...:     pass
   ...:
In [3]: a1 = A()
In [4]: a1.a, a1.b, a1.c = 1, 2, 3
In [5]: a2 = types.SimpleNamespace(a=1,b=2,c=3)
In [6]: sys.getsizeof(a1)
Out[6]: 56
In [7]: sys.getsizeof(a2)
Out[7]: 48
Where is this size discrepancy coming from? Looking at:
In [10]: types.__file__
Out[10]: '/Users/juan/anaconda3/lib/python3.5/types.py'
I find:
import sys
# Iterators in Python aren't a matter of type but of protocol. A large
# and changing number of builtin types implement *some* flavor of
# iterator. Don't check the type! Use hasattr to check for both
# "__iter__" and "__next__" attributes instead.
def _f(): pass
FunctionType = type(_f)
LambdaType = type(lambda: None) # Same as FunctionType
CodeType = type(_f.__code__)
MappingProxyType = type(type.__dict__)
SimpleNamespace = type(sys.implementation)
Ok, well, here goes nothing:
>>> import sys
>>> sys.implementation
namespace(cache_tag='cpython-35', hexversion=50660080, name='cpython', version=sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0))
>>> type(sys.implementation)
<class 'types.SimpleNamespace'>
I seem to be chasing my own tail here.
I was able to find this related question, but no answer to my particular query.
I am using CPython 3.5 on a 64-bit system. These 8 bytes seem just the right size for some errant reference that I cannot pinpoint.
Consider the following classes, which have different sizes:
class A_dict:
    pass

class A_slot_0:
    __slots__ = []

class A_slot_1:
    __slots__ = ["a"]

class A_slot_2:
    __slots__ = ["a", "b"]
Each of these has a different fundamental memory footprint:
>>> [cls.__basicsize__ for cls in [A_dict, A_slot_0, A_slot_1, A_slot_2]]
[32, 16, 24, 32]
Why? In the source of type_new (in typeobject.c), which is responsible for creating the underlying type and computing the basic size of an instance, we see that tp_basicsize is computed as:
the tp_basicsize of the underlying type (object ... 16 bytes);
another sizeof(PyObject *) for each slot;
a sizeof(PyObject *) if a __dict__ is required;
a sizeof(PyObject *) if a __weakref__ is defined;
A plain class such as A_dict will have a __dict__ and a __weakref__ defined, whereas a class with slots has no __weakref__ by default. Hence the size of plain A_dict is 32 bytes. You could consider it to effectively consist of PyObject_HEAD plus two pointers.
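You can verify this bookkeeping from Python; a sketch assuming a 64-bit CPython and the classes defined above:
>>> import weakref
>>> object.__basicsize__    # the bare PyObject_HEAD baseline
16
>>> A_dict.__basicsize__    # + __dict__ pointer + __weakref__ pointer
32
>>> weakref.ref(A_dict())   # plain classes support weak references
<weakref at 0x7f...; dead>
>>> weakref.ref(A_slot_0()) # slotted classes don't, unless __weakref__ is listed in __slots__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot create weak reference to 'A_slot_0' object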
Now, consider a SimpleNamespace, which is defined in namespaceobject.c. Here the type is simply:
typedef struct {
    PyObject_HEAD
    PyObject *ns_dict;
} _PyNamespaceObject;
and tp_basicsize is defined as sizeof(_PyNamespaceObject), making it one pointer larger than a plain object, and thus 24 bytes.
NOTE:
The difference here is effectively that A_dict provides support for taking weak references, while types.SimpleNamespace does not.
>>> weakref.ref(types.SimpleNamespace())
TypeError: cannot create weak reference to 'types.SimpleNamespace' object
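As a final cross-check (a sketch; a 64-bit CPython 3.5 is assumed): sys.getsizeof reports the basic size plus the garbage collector's per-object header, which is where the question's 56 and 48 come from.
>>> types.SimpleNamespace.__basicsize__    # PyObject_HEAD + ns_dict pointer
24
>>> A.__basicsize__                        # PyObject_HEAD + __dict__ + __weakref__
32
On this build the GC header is 24 bytes, so the two getsizeof results are 24 + 24 = 48 and 32 + 24 = 56: the single-pointer difference in tp_basicsize is exactly the 8 bytes the question was chasing.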

When does ctypes free memory?

In Python I'm using ctypes to exchange data with a C library, and the call interface involves nested pointers-to-structs.
If the memory was allocated in C, then the Python side should (deeply) extract a copy of any needed values and then explicitly ask the C library to deallocate the memory.
If the memory was allocated in Python, presumably it will be deallocated soon after the corresponding ctypes object goes out of scope. How does this work for pointers? If I create a pointer object from a string buffer, do I need to keep a variable referencing the original buffer object in scope, to prevent the pointer from dangling? Or does the pointer object itself automatically do this for me (even though it won't return the original object)? Does it make any difference whether I'm using pointer, POINTER, cast, c_void_p, or from_address(addressof)?
Nested pointers to simple objects seem fine. The documentation is explicit that ctypes doesn't support "original object return", but implies that a pointer does store a Python reference to keep its target object alive (the precise mechanics might be implementation-specific).
>>> from ctypes import *
>>> x = c_int(7)
>>> triple_ptr = pointer(pointer(pointer(x)))
>>> triple_ptr.contents.contents.contents.value == x.value
True
>>> triple_ptr.contents.contents.contents is x
False
>>> triple_ptr._objects['1']._objects['1']._objects['1'] is x # CPython 3.5
True
It looks like the pointer function is no different from the POINTER template constructor (similar to how create_string_buffer relates to c_char * size).
>>> type(pointer(x)) is type(POINTER(c_int)(x))
True
Casting to void also seems to keep the reference (though I'm not sure why it modifies the original pointer's _objects?).
>>> ptr = pointer(x)
>>> ptr._objects
{'1': c_int(7)}
>>> pvoid = cast(ptr, c_void_p)
>>> pvoid._objects is ptr._objects
True
>>> pvoid._objects
{139665053613048: <__main__.LP_c_int object at 0x7f064de87bf8>, '1': c_int(7)}
>>> pvoid._objects['1'] is x
True
Creating an object directly from a memory buffer (or address thereof) looks more fraught.
>>> v = c_void_p.from_buffer(triple_ptr)
>>> v2 = c_void_p.from_buffer_copy(triple_ptr)
>>> type(v._objects)
<class 'memoryview'>
>>> POINTER(POINTER(POINTER(c_int))).from_buffer(v)[0][0][0] == x.value
True
>>> p3 = POINTER(POINTER(POINTER(c_int))).from_address(addressof(triple_ptr))
>>> v2._objects is None is p3._objects is p3._b_base_
True
Incidentally, byref probably keeps the memory it references alive.
>>> byref(x)._obj is x
True
