NumPy tries to be helpful when creating arrays: if the first argument to numpy.array has __getitem__ and __len__ methods, it is treated as if it might be a valid sequence.
Unfortunately, I want to create an array with dtype=object without NumPy being "helpful".
Broken down to a minimal example, the class looks like this:
import numpy as np

class Test(object):
    def __init__(self, iterable):
        self.data = iterable

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        return '{}({})'.format(self.__class__.__name__, self.data)
If the "iterables" have different lengths, everything is fine and I get exactly the result I want:
>>> np.array([Test([1,2,3]), Test([3,2])], dtype=object)
array([Test([1, 2, 3]), Test([3, 2])], dtype=object)
but NumPy creates a multidimensional array if these happen to have the same length:
>>> np.array([Test([1,2,3]), Test([3,2,1])], dtype=object)
array([[1, 2, 3],
       [3, 2, 1]], dtype=object)
Unfortunately there is only an ndmin argument, so I was wondering: is there a way to enforce an ndmax, or to somehow prevent NumPy from interpreting the custom classes as another dimension (without deleting __len__ or __getitem__)?
This behavior has been discussed a number of times before (e.g. Override a dict with numpy support). np.array tries to make as high-dimensional an array as it can. The model case is nested lists: if it can iterate and the sublists are equal in length, it will 'drill' on down.
Here it went down 2 levels before encountering lists of different length:
In [250]: np.array([[[1,2],[3]],[1,2]],dtype=object)
Out[250]:
array([[[1, 2], [3]],
       [1, 2]], dtype=object)
In [251]: _.shape
Out[251]: (2, 2)
Without a shape or ndmax parameter it has no way of knowing whether I want it to be (2,) or (2,2). Both of those would work with the dtype.
It's compiled code, so it isn't easy to see exactly what tests it uses. It tries to iterate on lists and tuples, but not on sets or dictionaries.
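For example, a quick sketch of that difference (nothing special about the values; any set will do):

np.array([1, 2, 3]).shape          # (3,)  - a list is iterated into a 1-D array
np.array({1, 2, 3}).shape          # ()    - a set becomes a single 0-d object
np.array({1, 2, 3}).dtype          # dtype('O')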
The surest way to make an object array with a given dimension is to start with an empty one and fill it:
In [266]: A=np.empty((2,3),object)
In [267]: A.fill([[1,'one']])
In [276]: A[:]={1,2}
In [277]: A[:]=[1,2] # broadcast error
Another way is to start with at least one different element (e.g. a None), and then replace that.
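For example, a rough sketch of that trick with the Test class from the question - the None breaks the regularity, so np.array stops at one dimension:

lst = [Test([1, 2, 3]), Test([3, 2, 1])]
arr = np.array([lst[0], None], dtype=object)   # irregular, so shape is (2,)
arr[1] = lst[1]                                # put the real element back
arr
# array([Test([1, 2, 3]), Test([3, 2, 1])], dtype=object)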
There is a more primitive creator, np.ndarray, that takes a shape:
In [280]: np.ndarray((2,3),dtype=object)
Out[280]:
array([[None, None, None],
       [None, None, None]], dtype=object)
But that's basically the same as np.empty (unless I give it a buffer).
These are fudges, but they aren't expensive (time wise).
================ (edit)
https://github.com/numpy/numpy/issues/5933 ("Enh: Object array creation function") is an enhancement request. See also https://github.com/numpy/numpy/issues/5303 ("the error message for accidentally irregular arrays is confusing").
The developer sentiment seems to favor a separate function to create dtype=object arrays, one with more control over the initial dimensions and depth of iteration. They might even strengthen the error checking to keep np.array from creating 'irregular' arrays.
Such a function could detect the shape of a regular nested iterable down to a specified depth, and build an object type array to be filled.
def objarray(alist, depth=1):
    shape = []
    l = alist
    for _ in range(depth):
        shape.append(len(l))
        l = l[0]
    arr = np.empty(shape, dtype=object)
    arr[:] = alist
    return arr
With various depths:
In [528]: alist=[[Test([1,2,3])], [Test([3,2,1])]]
In [529]: objarray(alist,1)
Out[529]: array([[Test([1, 2, 3])], [Test([3, 2, 1])]], dtype=object)
In [530]: objarray(alist,2)
Out[530]:
array([[Test([1, 2, 3])],
       [Test([3, 2, 1])]], dtype=object)
In [531]: objarray(alist,3)
Out[531]:
array([[[1, 2, 3]],
       [[3, 2, 1]]], dtype=object)
In [532]: objarray(alist,4)
...
TypeError: object of type 'int' has no len()
A workaround is of course to create an array of the desired shape and then copy the data:
In [19]: lst = [Test([1, 2, 3]), Test([3, 2, 1])]
In [20]: arr = np.empty(len(lst), dtype=object)
In [21]: arr[:] = lst[:]
In [22]: arr
Out[22]: array([Test([1, 2, 3]), Test([3, 2, 1])], dtype=object)
Note that, in any case, I would not be surprised if NumPy's behavior with respect to interpreting iterable objects (which is what you want to use, right?) depends on the NumPy version. It may also be buggy, or maybe some of those bugs are actually features. Either way, I'd be wary of breakage when the NumPy version changes.
By contrast, copying into a pre-created array should be far more robust.
This workaround may not be the most efficient, but I like it for its clarity:
test_list = [Test([1,2,3]), Test([3,2,1])]
test_list.append(None)
test_array = np.array(test_list, dtype=object)[:-1]
Summary: You take your list, append None, then convert to a numpy array, preventing numpy from converting to a multidimensional array. Finally you just remove the last entry to get the structure you want.
Workaround using pandas
This might not be what the OP is looking for, but in case anyone is looking for a way to prevent numpy from constructing multidimensional arrays, this might be useful.
Pass your list to pd.Series and then get the elements as a numpy array using .values.
import pandas as pd
pd.Series([Test([1,2,3]), Test([3,2,1])]).values
# array([Test([1, 2, 3]), Test([3, 2, 1])], dtype=object)
Or, if dealing with numpy arrays:
np.array([np.random.randn(2,2), np.random.randn(2,2)]).shape
(2, 2, 2)
Using pd.Series:
pd.Series([np.random.randn(2,2), np.random.randn(2,2)]).values.shape
#(2,)
Related
Consider the following example:
>>> a=np.array([1,2,3,4])
>>> a
array([1, 2, 3, 4])
>>> a[np.newaxis,:,np.newaxis]
array([[[1],
        [2],
        [3],
        [4]]])
How is it possible for Numpy to use the : (normally used for slicing arrays) as an index when using comma-separated subscripting?
If I try to use comma-separated subscripting with either a Python list or a Python list-of-lists, I get a TypeError:
>>> [[1,2],[3,4]][0,:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not tuple
Why does NumPy accept this kind of subscript when plain Python lists do not?
Define a simple class with a __getitem__ (indexing) method:
In [128]: class Foo():
     ...:     def __getitem__(self, arg):
     ...:         print(type(arg), arg)
     ...:
In [129]: f = Foo()
And look at what different indexes produce:
In [130]: f[:]
<class 'slice'> slice(None, None, None)
In [131]: f[1:2:3]
<class 'slice'> slice(1, 2, 3)
In [132]: f[:, [1,2,3]]
<class 'tuple'> (slice(None, None, None), [1, 2, 3])
In [133]: f[:, :3]
<class 'tuple'> (slice(None, None, None), slice(None, 3, None))
In [134]: f[(slice(1,None),3)]
<class 'tuple'> (slice(1, None, None), 3)
For builtin classes like list, a tuple argument raises an error. But that's a class-dependent issue, not a syntax one. numpy.ndarray accepts a tuple, as long as it's compatible with its shape.
The syntax for a tuple index was added to Python to meet the needs of numpy. I don't think there are any builtin classes that use it.
The numpy.lib.index_tricks.py module has several classes that take advantage of this behavior. Look at its code for more ideas.
In [137]: np.s_[3:]
Out[137]: slice(3, None, None)
In [139]: np.r_['0,2,1',[1,2,3],[4,5,6]]
Out[139]:
array([[1, 2, 3],
       [4, 5, 6]])
In [140]: np.c_[[1,2,3],[4,5,6]]
Out[140]:
array([[1, 4],
       [2, 5],
       [3, 6]])
other "indexing" examples:
In [141]: f[...]
<class 'ellipsis'> Ellipsis
In [142]: f[[1,2,3]]
<class 'list'> [1, 2, 3]
In [143]: f[10]
<class 'int'> 10
In [144]: f[{1:12}]
<class 'dict'> {1: 12}
I don't know of any class that makes use of a dict argument, but the syntax allows it.
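For illustration only, here is a hypothetical class (not any real library API) whose __getitem__ chooses to interpret a dict index as a name-based lookup:

class NamedLookup:
    def __init__(self, data):
        self.data = data                  # expects a dict of {name: list}
    def __getitem__(self, arg):
        if isinstance(arg, dict):
            # treat {'name': position} as a per-name positional lookup
            return {k: self.data[k][v] for k, v in arg.items()}
        return self.data[arg]

nl = NamedLookup({'row': [10, 20, 30]})
nl[{'row': 1}]        # {'row': 20}
nl['row']             # [10, 20, 30]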
Lists are 1D, and you can only pass a single index or a slice; the : used inside an index is shorthand for creating a slice object. In [[1,2],[3,4]][0,:] you are passing 0,:, which Python packs into the tuple (0, slice(None, None, None)); the parentheses are optional, because values separated by commas already form a tuple.
But numpy is different: since an ndarray is N-dimensional and not just 1D like a list, you can pass multiple indexes to index the different dimensions. Therefore, passing a tuple to index numpy is allowed as long as the number of elements in the tuple is not greater than the number of dimensions of the indexed array. Consider the array below:
arr = np.random.randn(5,10, 3)
It has 3 dimensions and we can index it like arr[0,1,0], which is the same as arr[(0,1,0)]. That is, we are passing a tuple to index the array. Each tuple element can itself be an integer or a slice, and numpy will do the appropriate indexing. Numpy accepts tuples for indexing, but lists don't.
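A small check of that equivalence with the arr defined above (result shapes noted in the comments):

arr[0, 1, 0] == arr[(0, 1, 0)]     # True - the comma form is just the tuple form
arr[0, 2:5, :].shape               # (3, 3) - tuple elements can mix ints and slices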
However, when you write a[np.newaxis,:,np.newaxis] there is more going on than plain indexing. First, note that np.newaxis is just None. When you use None to index a dimension, numpy inserts a new length-1 dimension at that position. The a array in your example is 1D, but a[np.newaxis,:,np.newaxis] is understood by numpy as shorthand for "give me an array with an extra axis wherever I index with np.newaxis, whose elements come from my original array indexed as specified".
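For instance, with a as in the question:

a = np.array([1, 2, 3, 4])
np.newaxis is None              # True
a[None, :, None].shape          # (1, 4, 1) - same as a[np.newaxis, :, np.newaxis]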
So, the TL;DR answer is that numpy indexing is more general and powerful than list indexing.
I didn't expect them to be different, until it just cost me 2 hours to find a bug. Here is an example showing the difference I noticed, but I couldn't make sense of it.
>>> a = np.array([[1, 2], [3, 4]])
>>> a[0][0]
1
>>> a[np.array(0)][np.array(0)]
1
>>> a[0][0] = 5
>>> a
array([[5, 2],
       [3, 4]])
>>> a[np.array(0)][np.array(0)] = 6
>>> a
array([[5, 2],
       [3, 4]])
It looks like when a numpy scalar is used as the index, the element can't be changed. Is a copy of the original array element being returned instead of a reference?
However, with tuple indexing, the problem is gone.
>>> a[np.array(0), np.array(0)] = 6
>>> a
array([[6, 2],
       [3, 4]])
What's happening here? I understand that semantically chained bracket indexing and tuple indexing are different, but in principle shouldn't they both access the same element regardless?
Out of curiosity, I tried it with one dimensional array. The result is different.
>>> a = np.array([1, 2])
>>> a[np.array(0)] = 3
>>> a
array([3, 2])
This time the element has been modified.
The lesson I learned is that I should use tuple indexing for numpy arrays as much as possible, just to be safe. But I would really like an explanation for these inconsistent effects. Thanks!
Looking at the data buffer location:
In [45]: a.__array_interface__['data']
Out[45]: (44666160, False)
In [46]: a[0].__array_interface__['data']
Out[46]: (44666160, False)
Same location for the a[0] case. Modifying a[0] will modify a.
But with the array index, the data buffer is different - this is a copy. Modifying this copy will not affect a.
In [47]: a[np.array(0)].__array_interface__['data']
Out[47]: (43467872, False)
a[i,j] indexing is more idiomatic than a[i][j]. In some cases they are the same. But there are enough cases where they differ that it is wise to avoid the latter unless you really know what it does, and why.
In [49]: a[0]
Out[49]: array([1, 2])
In [50]: a[np.array(0)]
Out[50]: array([1, 2])
In [51]: a[np.array([0])]
Out[51]: array([[1, 2]])
Indexing with np.array(0), a 0d array, is advanced indexing, just like indexing with np.array([0]), a 1d array. Both produce a copy; with the 1d index, the result's first dimension is sized like the index.
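One way to see the view/copy difference directly (np.shares_memory should be available in any reasonably recent NumPy):

a = np.array([[1, 2], [3, 4]])
np.shares_memory(a, a[0])              # True  - basic indexing returns a view
np.shares_memory(a, a[np.array(0)])    # False - advanced indexing returns a copy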
Admittedly this is tricky, and probably doesn't show up except when doing this sort of assignment.
When using np.matrix the choice of [i][j] versus [i,j] affects shape as well - python difference between the two form of matrix x[i,j] and x[i][j]
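A quick illustration with np.matrix (a legacy class, but it makes the point):

m = np.matrix([[1, 2], [3, 4]])
m[0, 0]        # 1 - the scalar element
m[0]           # matrix([[1, 2]]) - still 2-D, shape (1, 2)
m[0][0]        # matrix([[1, 2]]) - row 0 of that (1, 2) matrix, not the element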
I have a defaultdict which maps certain integers to a numpy array of size 20.
In addition, I have an existing array of indices. I want to turn that array of indices into a 2D array, where each original index is converted into an array via my defaultdict.
Finally, in the case that an index isn't found in the defaultdict, I want to create an array of zeros for that index.
Here's what I have so far
converter = lambda x: np.zeros((d), dtype='float32') if x == -1 else cVf[x]
vfunc = np.vectorize(converter)
cvf = vfunc(indices)
np.zeros((d), dtype='float32') and cVf[x] have identical dtypes and shapes:
(Pdb) np.shape(cVf[0])
(20,)
Yet I get the error in the title (*** ValueError: setting an array element with a sequence.) when I try to run this code.
Any ideas?
You should give us some sample arrays or dictionaries (in the case of cVf), so we can make a test run.
Read what vectorize has to say about the return value. Since you don't define otypes, it makes a test calculation to determine the dtype of the returned array. My first thought was that the test calc and subsequent one might be returning different things. But you claim converter will always be returning the same dtype and shape array.
But let's try something simpler:
In [609]: fv = np.vectorize(lambda x: np.array([x,x]))
In [610]: fv([1,2,3])
...
ValueError: setting an array element with a sequence.
It's having trouble with returning any array.
But if I specify otypes, it works:
In [611]: fv = np.vectorize(lambda x: np.array([x,x]), otypes=[object])
In [612]: fv([1,2,3])
Out[612]: array([array([1, 1]), array([2, 2]), array([3, 3])], dtype=object)
In fact in this case I could use frompyfunc, which returns object dtype, and is the underlying function for vectorize (and a bit faster).
In [613]: fv = np.frompyfunc(lambda x: np.array([x,x]), 1,1)
In [614]: fv([1,2,3])
Out[614]: array([array([1, 1]), array([2, 2]), array([3, 3])], dtype=object)
vectorize and frompyfunc are designed for functions that are scalar in, scalar out. That scalar may be an object, even an array, but it is still treated as a scalar.
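Applied to the question's setup, here is a rough sketch (d, cVf and indices are made-up stand-ins for the question's data); the object-dtype result can then be stacked into the desired 2-D float array:

import numpy as np

d = 20
cVf = {0: np.arange(d, dtype='float32'), 1: np.ones(d, dtype='float32')}
indices = np.array([0, -1, 1])

converter = lambda x: np.zeros(d, dtype='float32') if x == -1 else cVf[x]

obj = np.frompyfunc(converter, 1, 1)(indices)   # object array of (20,) vectors
cvf = np.stack(obj)                             # assemble into one float array
cvf.shape                                       # (3, 20)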
I have a numpy array which looks like:
myArray = np.array([[1,2],[3]])
But I cannot flatten it:
In: myArray.flatten()
Out: array([[1, 2], [3]], dtype=object)
If I change the array so that its elements have the same length in the second axis, then I can flatten it.
In: myArray2 = np.array([[1,2],[3,4]])
In: myArray2.flatten()
Out: array([1, 2, 3, 4])
My Question is:
Can I use something like myArray.flatten(), regardless of the dimension of the array and the length of its elements, and get the output array([1, 2, 3])?
myArray is a 1-dimensional array of objects. Your list objects will simply remain in the same order with flatten() or ravel(). You can use hstack to stack the arrays in sequence horizontally:
>>> np.hstack(myArray)
array([1, 2, 3])
Note that for 1-D inputs like these, hstack is basically equivalent to concatenate along axis 0:
>>> np.concatenate(myArray, axis=0)
array([1, 2, 3])
If you don't have this issue, however, and can merge the items, it is always preferable to use flatten() or ravel() for performance:
In [1]: u = timeit.Timer('np.hstack(np.array([[1,2],[3,4]]))'\
....: , setup = 'import numpy as np')
In [2]: print u.timeit()
11.0124390125
In [3]: u = timeit.Timer('np.array([[1,2],[3,4]]).flatten()'\
....: , setup = 'import numpy as np')
In [4]: print u.timeit()
3.05757689476
Iluengo's answer also has you covered for further information as to why you cannot use flatten() or ravel() given your array type.
Well, I agree with the other answers that hstack or concatenate do the job in this case. However, I would like to point out that even if this 'fixes' the problem, the problem isn't being addressed properly.
The problem is that even though it looks like the second axis has rows of different lengths, that is not what the array actually is. If you try:
>>> myArray.shape
(2,)
>>> myArray.dtype
dtype('O') # stands for Object
>>> myArray[0]
[1, 2]
This shows that your array is not a 2D array with variable row sizes (as you might think); it is just a 1D array of objects. In your case the elements are lists: the first element of the array is a 2-element list and the second is a 1-element list.
So flatten and ravel won't work, because flattening a 1D array gives back exactly the same 1D array. With an object array, numpy doesn't care what you put inside; it treats the individual items as opaque objects and can't decide how to merge them.
What you should consider is whether this is the behaviour you want for your application. Numpy arrays are especially efficient with fixed-size numeric data. If you are working with arrays of objects, I don't see why you would want to use numpy instead of regular Python lists.
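If a plain Python result is acceptable, itertools gives the flattened list directly (a small sketch; it assumes the elements really are lists, as in the question):

import itertools

list(itertools.chain.from_iterable(myArray))   # [1, 2, 3]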
np.hstack works in this case
In [69]: np.hstack(myArray)
Out[69]: array([1, 2, 3])