In numpy, I have an array that can be either 2-D or 3-D, and I would like to reduce it to 2-D while squaring each element. So I tried this and it doesn't work:
A = np.random.rand(5, 3, 3)
np.einsum('...ij,...ij->ij', A, A)
It returns this error:
ValueError: output has more dimensions than subscripts given in einstein sum, but no '...' ellipsis provided to broadcast the extra dimensions.
I suppose einsum doesn't assume that when the ellipsis goes away in the right hand side, I want to sum over the ellipsis dimension(s), if they exist. Is there some "elegant" way (i.e. without checking the number of dimensions and using an if statement) to tell it that I want to do this for 3-D:
A = np.random.rand(5, 3, 3)
np.einsum('aij,aij->ij', A, A)
and this for 2-D?
A = np.random.rand(3, 3)
np.einsum('ij,ij->ij', A, A)
Sometimes the 'elegant' way to handle variable dimensions is to use a set of if tests and hide them in a function call. Look for example at np.atleast_3d; it has a 4-way if/else clause. I'd recommend it here, except that it adds the extra dimension at the end, not the start. if clauses using reshape are not expensive (time-wise), so don't be afraid to use them. Even if you find some magical function, look at its code; you may be surprised what is hidden.
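A minimal sketch of that wrapper idea, assuming you always want to keep the last two axes (the name squared_sum is mine, not numpy's):

import numpy as np

def squared_sum(A):
    # 3d: sum the elementwise squares over the leading axis
    if A.ndim == 3:
        return np.einsum('aij,aij->ij', A, A)
    # 2d: just the elementwise square
    return A * A

squared_sum(np.random.rand(5, 3, 3)).shape   # (3, 3)
squared_sum(np.random.rand(3, 3)).shape      # (3, 3)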
ellipsis is used for dimensions that 'go along for the ride', not ones where you want specific control. Here you want to sum over the initial dimension, so you need to index it explicitly:
In [161]: np.einsum('i...,i...',A,A)
Out[161]:
array([[ 1.26942035, 1.32052776, 1.74118617],
[ 1.59679765, 1.49331565, 2.04573002],
[ 2.29027005, 1.48351522, 1.36679208]])
In [162]: np.einsum('aij,aij->ij',A,A)
Out[162]:
array([[ 1.26942035, 1.32052776, 1.74118617],
[ 1.59679765, 1.49331565, 2.04573002],
[ 2.29027005, 1.48351522, 1.36679208]])
For a 2d array:
In [165]: np.einsum('ij,ij->ij',A[0],A[0])
Out[165]:
array([[ 0.20497776, 0.11632197, 0.65396968],
[ 0.0529767 , 0.24723351, 0.27559647],
[ 0.62806525, 0.33081124, 0.57070406]])
In [166]: A[0]*A[0]
Out[166]:
array([[ 0.20497776, 0.11632197, 0.65396968],
[ 0.0529767 , 0.24723351, 0.27559647],
[ 0.62806525, 0.33081124, 0.57070406]])
In [167]: np.einsum('...,...',A[0],A[0])
Out[167]:
array([[ 0.20497776, 0.11632197, 0.65396968],
[ 0.0529767 , 0.24723351, 0.27559647],
[ 0.62806525, 0.33081124, 0.57070406]])
I don't think you can handle both cases with one expression.
Another way to get the first sum:
In [168]: (A*A).sum(axis=0)
Out[168]:
array([[ 1.26942035, 1.32052776, 1.74118617],
[ 1.59679765, 1.49331565, 2.04573002],
[ 2.29027005, 1.48351522, 1.36679208]])
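For what it's worth, that sum-based form can be made rank-agnostic in one expression, assuming the last two axes are always the ones you want to keep; summing over an empty tuple of axes is a no-op, so the 2d case falls out naturally:

(A * A).sum(axis=tuple(range(A.ndim - 2)))   # (3, 3) for both 2d and 3d A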
I contributed the patch that fixed the handling of ellipsis, but that was a couple of years ago, so the details aren't super fresh in my mind. As part of that work I reverse-engineered the parsing of the string expression (the original is compiled), and could review that code (or refer you to it) if we need a more definitive answer.
In [172]: np.einsum('...ij,...ij->ij',A,A)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-172-dfe39e268402> in <module>()
----> 1 np.einsum('...ij,...ij->ij',A,A)
ValueError: output has more dimensions than subscripts given in
einstein sum, but no '...' ellipsis provided to broadcast
the extra dimensions.
In [173]: np.einsum('...ij,...ij->...ij',A,A).shape
Out[173]: (5, 3, 3)
The error message says that it is trying to pass the ... dimensions to the output, and can't, because the output subscripts lack those dimensions and contain no '...' to receive them. In other words, einsum does not perform summation over the ... dimensions; they pass to the output unchanged (broadcasting rules apply).
I'm new to the NumPy library for Python and I'm not sure what I'm doing wrong here; could you help me with this, please?
So, I initialize my ndarray like this.
A = np.array([])
And then I'm trying to append into this array A a new array X which has a shape like (1000,32,32), if that's of any importance.
np.insert(A, X)
The problem is that if I check the ndarray A after that, it's empty, even though the ndarray X has elements inside.
Could you explain to me what exactly I'm doing wrong, please?
Make sure to write back to A if you use np.append, as in A = np.append(A, X). The top-level numpy functions like np.insert and np.append don't modify their arguments in place; they return a new array, so it's your job to store the result. Also note that np.append flattens its inputs when no axis is given. Honestly, I think you just want a regular Python list for A: list.append modifies the list in place, so there's no need to write it back.
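A quick check of that flattening behavior; with no axis argument, np.append ravels both inputs before joining them:

>>> np.append(np.array([]), np.ones((2, 3))).shape
(6,)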
>>> A = []
>>> X = np.ndarray((1000,32,32))
>>> A.append(X)
>>> print(A)
[array([[[1.43351171e-316, 4.32573840e-317, 4.58492919e-320, ...,
1.14551501e-259, 6.01347002e-154, 1.39804329e-076],
[1.39803697e-076, 1.39804328e-076, 1.39642638e-076, ...,
1.18295070e-076, 7.06474122e-096, 6.01347002e-154],
[1.39804328e-076, 1.39642638e-076, 1.39804065e-076, ...,
1.05118732e-153, 6.01334510e-154, 3.24245662e-086],
...
In [10]: A = np.array([])
In [11]: A.shape
Out[11]: (0,)
In [13]: np.concatenate([A, np.ones((2,3))])
---------------------------------------------------------------------------
...
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)
So one of the first things you need to learn about numpy arrays is that they have a shape and a number of dimensions. Hopefully that error message is clear.
Concatenate with another 1d array does work:
In [14]: np.concatenate([A, np.arange(3)])
Out[14]: array([0., 1., 2.])
But that is just np.arange(3); the concatenate does nothing for us. You might imagine starting a loop like this, but don't: it is not efficient (see the sketch after the next example).
You could easily concatenate a list of arrays, as long as the dimensions obey the rules specified in the docs. Those rules are logical, as long as you take the dimensions of the arrays seriously.
In [15]: X = np.ones((1000,32,32))
In [16]: np.concatenate([X,X,X], axis=1).shape
Out[16]: (1000, 96, 32)
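If you really must accumulate arrays in a loop, the usual pattern, sketched here with placeholder data, is to collect them in a plain Python list and join once at the end:

chunks = []
for _ in range(3):
    chunks.append(np.ones((1000, 32, 32)))   # one (1000,32,32) block per pass
np.concatenate(chunks, axis=0).shape   # (3000, 32, 32)
np.stack(chunks).shape                 # (3, 1000, 32, 32), new leading axis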
Related to this question, I came across an indexing behaviour via Boolean arrays and broadcasting that I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2] # or a[c1[:, None], c2]
array([[ 4, 6],
[ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])
a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
As @Divakar noted in the comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together; see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently the indexing expression a[b1[:,None], b2] would be equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which would pass a length-3 tuple of index arrays to a, corresponding to a 3d array index. Hence the error you see about "too many indices".
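You can verify that equivalence directly with the arrays from the question; the concatenated nonzero() tuples form a 3d index, which the 2d array rejects:

idx = b1[:, None].nonzero() + b2.nonzero()
len(idx)   # 3 index arrays
a[idx]     # IndexError: too many indices for array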
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array you would've ended up with a length-2 advanced index, i.e. a 1d array of size (2,). This is almost never what you'd want, which leads us to another piece of trivia on the subject.
There's been a lot of discussion about plans to revamp advanced indexing; see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake, by producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and
only less problematic because the current behavior is so useless that
it is rarely encountered in practice. When a boolean array index is
mixed with another boolean or integer array, boolean array is
converted to integer array indices (equivalent to np.nonzero()) and
then broadcast. For example, indexing a 2D array of size (2, 2) like
x[[True, False], [True, False]] produces a 1D vector with shape (1,),
not a 2D sub-matrix with shape (1, 1).
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:
Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
Are there any sources or guidelines for safe, bug-free numerical programming with numpy?
I'm asking because I've painfully learned that numpy does many things that seem to really ask for bugs to happen, such as...
Adding matrices of different sizes ("broadcasting") without complaining:
In: np.array([1]) + np.identity(2)
Out: array([[ 2., 1.],
[ 1., 2.]])
Returning different data types depending on the input:
In: scalar1 = 1
In: scalar2 = 1.
In: np.array(scalar1).dtype
Out: dtype('int32')
In: np.array(scalar2).dtype
Out: dtype('float64')
Or simply not performing a desired operation (again, depending on the data type) without raising any warnings:
In: np.squeeze(np.array([[1, 1]])).ndim
Out: 1
In: np.squeeze(np.matrix([[1, 1]])).ndim
Out: 2
These are all very hard to discover bugs, since they do not raise any exceptions or warnings and often return results of the valid data types / shapes. Therefore my question: Are there any general guidelines for improving the safety and preventing bugs in mathematical programming with numpy?
[Note that I don't believe this question will attract "opinionated answers and discussions", since it is not asking for personal recommendations, but rather whether there are any existing guidelines or sources on the subject at all - of which I could not find any.]
Frequently I ask SO questioners: what's the shape? the dtype? even the type. Keeping track of those properties is a big part of good numpy programming. Even in MATLAB I found that getting the size right was 80% of debugging.
type
The squeeze example revolves around type, the ndarray class versus the np.matrix subclass:
In [160]: np.squeeze(np.array([[1, 1]]))
Out[160]: array([1, 1])
In [161]: np.squeeze(np.matrix([[1, 1]]))
Out[161]: matrix([[1, 1]])
An np.matrix object is, by definition, always 2d. That's the core of how it redefines ndarray operations.
Many numpy functions delegate their work to methods. The code for np.squeeze is:
try:
squeeze = a.squeeze
except AttributeError:
return _wrapit(a, 'squeeze')
try:
# First try to use the new axis= parameter
return squeeze(axis=axis)
except TypeError:
# For backwards compatibility
return squeeze()
So In [161] is really:
In [163]: np.matrix([[1, 1]]).squeeze()
Out[163]: matrix([[1, 1]])
np.matrix.squeeze has its own documentation.
As a general rule we discourage the use of np.matrix. It was created years ago to make things easier for wayward MATLAB programmers. Back in those days MATLAB only had 2d matrices (even now MATLAB 'scalars' are 2d).
dtype
np.array is a powerful function. Usually its behavior is intuitive, but sometimes it makes too many assumptions.
Usually it takes clues from the input, whether integer, float, string, and/or lists:
In [170]: np.array(1).dtype
Out[170]: dtype('int64')
In [171]: np.array(1.0).dtype
Out[171]: dtype('float64')
But it provides a number of parameters. Use those if you need more control:
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
In [173]: np.array(1, float).dtype
Out[173]: dtype('float64')
In [174]: np.array('1', float).dtype
Out[174]: dtype('float64')
In [177]: np.array('1', dtype=float,ndmin=2)
Out[177]: array([[1.]])
Look at its docs, and also at the https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html page, which lists many other array creation functions. Look at some of their code as well.
For example np.atleast_2d does a lot of shape checking:
def atleast_2d(*arys):
res = []
for ary in arys:
ary = asanyarray(ary)
if ary.ndim == 0:
result = ary.reshape(1, 1)
elif ary.ndim == 1:
result = ary[newaxis,:]
else:
result = ary
res.append(result)
if len(res) == 1:
return res[0]
else:
return res
Functions like this are good examples of defensive programming.
We get a lot of SO questions about 1d arrays with dtype=object.
In [272]: np.array([[1,2,3],[2,3]])
Out[272]: array([list([1, 2, 3]), list([2, 3])], dtype=object)
np.array tries to create a multidimensional array with a uniform dtype. But if the elements differ in size or can't be cast to the same dtype, it will fall back on object dtype. This is one of those situations where we need to pay attention to shape and dtype.
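A quick shape/dtype check makes the problem obvious; note that recent numpy versions require an explicit dtype=object for ragged input like this:

arr = np.array([[1, 2, 3], [2, 3]], dtype=object)
arr.shape   # (2,)  -- one dimension, not the (2, 3) you might expect
arr.dtype   # dtype('O')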
broadcasting
Broadcasting has been a part of numpy forever, and there's no way of turning it off. Octave and MATLAB added it later, and do provide warning switches.
The first defensive step is to understand the broadcasting principles, namely
it can prepend size-1 dimensions at the start of a shape to match
it stretches size-1 (unitary) dimensions to match.
So a basic example is:
In [180]: np.arange(3)[:,None] + np.arange(4)
Out[180]:
array([[0, 1, 2, 3],
[1, 2, 3, 4],
[2, 3, 4, 5]])
The first term is (3,) expanded to (3,1). The second is (4,) which, by broadcasting expands to (1,4). Together (3,1) and (1,4) broadcast to (3,4).
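numpy can do this bookkeeping for you; np.broadcast_shapes (added in numpy 1.20) reports the broadcast result without computing anything:

np.broadcast_shapes((3, 1), (1, 4))   # (3, 4)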
Many numpy functions have parameters that make keeping track of dimensions easier. For example sum (and others) has a keepdims parameter:
In [181]: arr = _
In [182]: arr.sum(axis=0)
Out[182]: array([ 3, 6, 9, 12]) # (4,) shape
In [183]: arr.sum(axis=0,keepdims=True)
Out[183]: array([[ 3, 6, 9, 12]]) # (1,4) shape
In [184]: arr/_ # (3,4) / (1,4) => (3,4)
Out[184]:
array([[0. , 0.16666667, 0.22222222, 0.25 ],
[0.33333333, 0.33333333, 0.33333333, 0.33333333],
[0.66666667, 0.5 , 0.44444444, 0.41666667]])
In this case the keepdims isn't essential since (3,4)/(4,) works. But with axis=1 sum the shape becomes (3,) which can't broadcast with (3,4). But (3,1) can:
In [185]: arr/arr.sum(axis=1,keepdims=True)
Out[185]:
array([[0. , 0.16666667, 0.33333333, 0.5 ],
[0.1 , 0.2 , 0.3 , 0.4 ],
[0.14285714, 0.21428571, 0.28571429, 0.35714286]])
To manage shapes I like to:
display shape while debugging
test snippets interactively
test with diagnostic shapes, e.g. np.arange(24).reshape(2,3,4)
use assertion statements in functions, e.g. assert arr.ndim == 1 (a sketch follows this list)
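A minimal sketch of the assertion habit (normalize_rows is a made-up example, not a numpy function):

def normalize_rows(arr):
    # fail fast with a readable message instead of a distant broadcast error
    assert arr.ndim == 2, f"expected 2d, got shape {arr.shape}"
    return arr / arr.sum(axis=1, keepdims=True)

normalize_rows(np.arange(12, dtype=float).reshape(3, 4))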
typing
Recent Python 3 versions have added a typing module
https://docs.python.org/3/library/typing.html
Even for built-in Python types it's provisional, and I'm not sure how much has been added for numpy.
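Plain np.ndarray annotations do work and can document intent, though nothing checks shape or dtype at runtime; a sketch:

import numpy as np

def row_sums(arr: np.ndarray) -> np.ndarray:
    # the annotation documents the contract; it enforces nothing by itself
    return arr.sum(axis=1)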
In some ways, an answer to this question is no different than general guidelines for safe programming:
Check and sanitise inputs early, in every function
Maintain relevant unit tests.
Yes, this may sound like extra overhead, but the reality is that you're probably already doing such checks and tests by hand anyway, so it's good practice to write them down and formalise / automate the process. E.g., while you may never have expected a matrix output specifically, any unit test that checked that your output is the expected array would have failed reliably.
You might also want to have a look at specialised testing tools that are specific to scientific code, e.g. the Hypothesis package
One thing that is specific to numpy is the handling of Floating Errors; the default simply 'prints' a warning statement to stdout, which can be missed easily (and does not cater for proper exception handling workflows). You can convert this functionality to throw proper warnings / exceptions that you can capture, via the numpy.seterr method -- e.g. numpy.seterr(all='raise').
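A minimal sketch of that conversion:

import numpy as np

np.seterr(all='raise')   # promote floating-point warnings to exceptions
try:
    np.array([1.0]) / np.array([0.0])
except FloatingPointError as e:
    print('caught:', e)  # e.g. 'divide by zero encountered in divide'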
If you want to use numpy in a "safer" way, you'll probably have to create your own safety net. One way to do so would be to define wrappers that enforce the rules you want your code to obey. You can come up with your own wrappers and tests as you go along and/or stumble upon behaviour that you consider problematic.
Some toy examples:
Always have float arrays:
def arrayf64(*args, **kwargs):
kwargs.setdefault("dtype", np.float64)
return np.array(*args, **kwargs)
Disable broadcasting:
def without_broadcasting(op, arr1, arr2):
assert arr1.ndim == arr2.ndim
return op(arr1, arr2)
Warn when using np.matrix:
import warnings

_np_matrix = np.matrix  # keep a reference to the original, or the wrapper would call itself

def safe_np_matrix(*args, **kwargs):
    warnings.warn('Unsafe code detected. Usage of np.matrix found.')
    return _np_matrix(*args, **kwargs)

np.matrix = safe_np_matrix
There is a simple function, which is intended to accept a scalar parameter, but which also works for a numpy matrix. Why does the function fun work for a matrix?
>>> import numpy as np
>>> def fun(a):
...     return 1.0 / a
>>> b = 2
>>> c = np.mat([1,2,3])
>>> c
matrix([[1, 2, 3]])
>>> fun(b)
0.5
>>> fun(c)
matrix([[ 1. , 0.5 , 0.33333333]])
>>> v_fun = np.vectorize(fun)
>>> v_fun(b)
array(0.5)
>>> v_fun(c)
matrix([[ 1. , 0.5 , 0.33333333]])
It seems like fun is vectorized somehow, because the explicitly vectorized function v_fun behaves the same on matrix c. But they give different outputs for scalar b. Could anybody explain it? Thanks.
What happens in the case of fun is called broadcasting.
General Broadcasting Rules
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1
If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
fun already works for both scalars and arrays, because elementwise division is defined for both (via their own methods). fun(b) does not involve numpy at all; that's just a Python operation.
np.vectorize is meant to take a function that only works with scalars and feed it elements from an array. In your example it first converts b into an array, np.array(b). For both c and this modified b, the result is an array of matching size. c is a 2d np.matrix, and the result is the same. Notice that v_fun(b) is of type array, not matrix.
This is not a good example of using np.vectorize, nor an example of broadcasting. np.vectorize is a rather 'simple minded' function and doesn't handle scalars in a special way.
1/c or even b/c works because c, an array, 'knows' about division. Similarly, array multiplication and addition are defined: 1+c or 2*c.
I'm tempted to mark this as a duplicate of
Python function that handles scalar or arrays
I was growing confused during the development of a small Python script involving matrix operations, so I fired up a shell to play around with a toy example and develop a better understanding of matrix indexing in Numpy.
This is what I did:
>>> import numpy as np
>>> A = np.matrix([1,2,3])
>>> A
matrix([[1, 2, 3]])
>>> A[0]
matrix([[1, 2, 3]])
>>> A[0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0][0]
matrix([[1, 2, 3]])
As you can imagine, this has not helped me develop a better understanding of matrix indexing in Numpy. This behavior would make sense for something that I would describe as "An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
PS: I know how to obtain the first entry of the matrix. What I am interested in is the logic behind this design decision.
EDIT: I'm not asking how to access a matrix element, or why a matrix row behaves like a matrix. I'm asking for a definition of the behavior of a matrix when indexed with a single number. It's an action typical of arrays, but the resulting behavior is nothing like the one you would expect from an array. I would like to know how this is implemented and what's the logic behind the design decision.
Look at the shape after indexing:
In [295]: A=np.matrix([1,2,3])
In [296]: A.shape
Out[296]: (1, 3)
In [297]: A[0]
Out[297]: matrix([[1, 2, 3]])
In [298]: A[0].shape
Out[298]: (1, 3)
The key to this behavior is that np.matrix is always 2d. So even if you select one row (A[0,:]), the result is still 2d, shape (1,3). So you can string along as many [0] as you like, and nothing new happens.
What are you trying to accomplish with A[0][0]? The same as A[0,0]?
For the base np.ndarray class these are equivalent.
Note that the Python interpreter translates indexing to __getitem__ calls.
A.__getitem__(0).__getitem__(0)
A.__getitem__((0,0))
[0][0] is 2 indexing operations, not one. So the effect of the second [0] depends on what the first produces.
For an array A[0,0] is equivalent to A[0,:][0]. But for a matrix, you need to do:
In [299]: A[0,:][:,0]
Out[299]: matrix([[1]]) # still 2d
=============================
"An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
In addition, A[0,:] is not the same as A[0]
In light of these comments let me suggest some clarifications.
A[0] does not mean 'return the 1st element'. It means select along the 1st axis. For a 1d array that means the 1st item. For a 2d array it means the 1st row. For ndarray that would be a 1d array, but for a matrix it is another matrix. So for a 2d array or matrix, A[i,:] is the same thing as A[i].
A[0] does not just return itself. It returns a new matrix. Different id:
In [303]: id(A)
Out[303]: 2994367932
In [304]: id(A[0])
Out[304]: 2994532108
It may have the same data, shape and strides, but it's a new object, just as unique as the ith row of a many-row matrix.
Most of the unique matrix activity is defined in: numpy/matrixlib/defmatrix.py. I was going to suggest looking at the matrix.__getitem__ method, but most of the action is performed in np.ndarray.__getitem__.
np.matrix class was added to numpy as a convenience for old-school MATLAB programmers. numpy arrays can have almost any number of dimensions, 0, 1, .... MATLAB allowed only 2, though a release around 2000 generalized it to 2 or more.
Imagine you have the following
>> A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
If you want to get the second column, you can index the transpose:
>> A.T[1]
array([ 2, 6, 10])
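Plain column indexing gives the same result without the transpose:

>> A[:, 1]
array([ 2,  6, 10])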