Are there any sources or guidelines for safe, bug-free numerical programming with numpy?
I'm asking because I've painfully learned that numpy does many things that practically invite bugs, such as...
Adding matrices of different sizes ("broadcasting") without complaining:
In: np.array([1]) + np.identity(2)
Out: array([[ 2.,  1.],
            [ 1.,  2.]])
Returning different data types depending on the input:
In: scalar1 = 1
In: scalar2 = 1.
In: np.array(scalar1).dtype
Out: dtype('int32')
In: np.array(scalar2).dtype
Out: dtype('float64')
Or simply not performing a desired operation (again, depending on the data type) without raising any warnings:
In: np.squeeze(np.array([[1, 1]])).ndim
Out: 1
In: np.squeeze(np.matrix([[1, 1]])).ndim
Out: 2
These are all bugs that are very hard to discover, since they raise no exceptions or warnings and often return results with valid-looking data types / shapes. Therefore my question: are there any general guidelines for improving safety and preventing bugs in mathematical programming with numpy?
[Note that I don't believe this question will attract "opinionated answers and discussions", since it is not about personal recommendations, but rather about whether there are any existing guidelines or sources on the subject at all - of which I could not find any.]
Frequently I ask SO questioners: what's the shape? the dtype? even the type? Keeping track of those properties is a big part of good numpy programming. Even in MATLAB I found that getting the size right was 80% of debugging.
type
The squeeze example revolves around type, the ndarray class versus the np.matrix subclass:
In [160]: np.squeeze(np.array([[1, 1]]))
Out[160]: array([1, 1])
In [161]: np.squeeze(np.matrix([[1, 1]]))
Out[161]: matrix([[1, 1]])
A np.matrix object is, by definition, always 2d. That's the core of how it redefines ndarray operations.
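A quick way to see that rule in action (a minimal sketch comparing the two classes):
M = np.matrix([[1, 2], [3, 4]])
M.sum(axis=1)                            # matrix([[3], [7]]) -- still 2d
np.array([[1, 2], [3, 4]]).sum(axis=1)   # array([3, 7]) -- reduced to 1d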
Many numpy functions delegate their work to methods. The code for np.squeeze is:
try:
    squeeze = a.squeeze
except AttributeError:
    return _wrapit(a, 'squeeze')
try:
    # First try to use the new axis= parameter
    return squeeze(axis=axis)
except TypeError:
    # For backwards compatibility
    return squeeze()
So In [161] is really:
In [163]: np.matrix([[1, 1]]).squeeze()
Out[163]: matrix([[1, 1]])
np.matrix.squeeze has its own documentation.
As a general rule we discourage the use of np.matrix. It was created years ago to make things easier for wayward MATLAB programmers. Back in those days MATLAB only had 2d matrices (even now MATLAB 'scalars' are 2d).
dtype
np.array is a powerful function. Usually its behavior is intuitive, but sometimes it makes too many assumptions.
Usually it takes clues from the input, whether integer, float, string, and/or lists:
In [170]: np.array(1).dtype
Out[170]: dtype('int64')
In [171]: np.array(1.0).dtype
Out[171]: dtype('float64')
But it provides a number of parameters. Use those if you need more control:
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
In [173]: np.array(1, float).dtype
Out[173]: dtype('float64')
In [174]: np.array('1', float).dtype
Out[174]: dtype('float64')
In [177]: np.array('1', dtype=float,ndmin=2)
Out[177]: array([[1.]])
Look at its docs, and also at the https://docs.scipy.org/doc/numpy/reference/routines.array-creation.html page, which lists many other array-creation functions. Look at some of their code as well.
For example np.atleast_2d does a lot of shape checking:
def atleast_2d(*arys):
    res = []
    for ary in arys:
        ary = asanyarray(ary)
        if ary.ndim == 0:
            result = ary.reshape(1, 1)
        elif ary.ndim == 1:
            result = ary[newaxis, :]
        else:
            result = ary
        res.append(result)
    if len(res) == 1:
        return res[0]
    else:
        return res
Functions like this are good examples of defensive programming.
We get a lot of SO questions about 1d arrays with dtype=object.
In [272]: np.array([[1,2,3],[2,3]])
Out[272]: array([list([1, 2, 3]), list([2, 3])], dtype=object)
np.array tries to create a multidimensional array with a uniform dtype. But if the elements differ in size or can't be cast to the same dtype, it will fall back on object dtype. This is one of those situations where we need to pay attention to shape and dtype.
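A small guard can make that fallback loud instead of silent. This is a sketch; checked_array is a made-up helper, and note that recent numpy versions raise an error for ragged input unless dtype=object is passed explicitly:
def checked_array(obj):
    # hypothetical guard: refuse the silent object-dtype fallback
    arr = np.asarray(obj)
    if arr.dtype == object:
        raise TypeError(f"input produced object dtype, shape {arr.shape}")
    return arr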
broadcasting
Broadcasting has been a part of numpy forever, and there's no way of turning it off. Octave and MATLAB added it later, and they do provide warning switches.
The first defensive step is to understand the broadcasting principles, namely
it can add leading dimensions as needed to make the number of dimensions match
it stretches size-1 dimensions to match the others
So a basic example is:
In [180]: np.arange(3)[:,None] + np.arange(4)
Out[180]:
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 3, 4, 5]])
The first term is (3,) expanded to (3,1). The second is (4,) which, by broadcasting expands to (1,4). Together (3,1) and (1,4) broadcast to (3,4).
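In newer numpy versions (1.20+), np.broadcast_shapes lets you check what shapes will broadcast to without building any arrays, and raises an error when they are incompatible:
np.broadcast_shapes((3, 1), (1, 4))   # -> (3, 4)
np.broadcast_shapes((3,), (4,))       # raises ValueError: incompatible shapes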
Many numpy functions have parameters that make keeping track of dimensions easier. For example sum (and others) has a keepdims parameter:
In [181]: arr = _
In [182]: arr.sum(axis=0)
Out[182]: array([ 3, 6, 9, 12]) # (4,) shape
In [183]: arr.sum(axis=0,keepdims=True)
Out[183]: array([[ 3, 6, 9, 12]]) # (1,4) shape
In [184]: arr/_ # (3,4) / (1,4) => (3,4)
Out[184]:
array([[0.        , 0.16666667, 0.22222222, 0.25      ],
       [0.33333333, 0.33333333, 0.33333333, 0.33333333],
       [0.66666667, 0.5       , 0.44444444, 0.41666667]])
In this case the keepdims isn't essential, since (3,4)/(4,) works. But with an axis=1 sum the shape becomes (3,), which can't broadcast with (3,4). (3,1) can:
In [185]: arr/arr.sum(axis=1,keepdims=True)
Out[185]:
array([[0.        , 0.16666667, 0.33333333, 0.5       ],
       [0.1       , 0.2       , 0.3       , 0.4       ],
       [0.14285714, 0.21428571, 0.28571429, 0.35714286]])
To manage shapes I like to (a sketch combining these habits follows the list):
display shapes while debugging
test snippets interactively
test with diagnostic shapes, e.g. np.arange(24).reshape(2,3,4)
use assertion statements in functions, e.g. assert arr.ndim == 1
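Here's what those habits look like combined (a minimal sketch; row_normalize is a made-up function):
def row_normalize(arr):
    # assert the contract up front; the message shows the offending shape
    assert arr.ndim == 2, f"expected 2d, got shape {arr.shape}"
    totals = arr.sum(axis=1, keepdims=True)   # (n, 1) broadcasts against (n, m)
    return arr / totals

row_normalize(np.arange(24).reshape(4, 6)).shape   # diagnostic shape -> (4, 6)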
typing
Recent Python 3 versions have added a typing module:
https://docs.python.org/3/library/typing.html
Even for built-in Python types it's provisional, and I'm not sure how much has been added for numpy.
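Since this answer was written, numpy has grown a numpy.typing module (NDArray is available from numpy 1.21). A minimal sketch of annotating the dtype contract of a function:
import numpy as np
import numpy.typing as npt

def scale(arr: npt.NDArray[np.float64], factor: float) -> npt.NDArray[np.float64]:
    # a static checker such as mypy can verify callers against the annotation
    return arr * factor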
In some ways, an answer to this question is no different than general guidelines for safe programming:
Check and sanitise inputs early, in every function
Maintain relevant unit tests.
Yes, this may sound like extra overhead, but the reality is that you're probably already doing such checks and tests by hand anyway, so it's good practice to put them down on paper and formalise / automate the process. E.g., while you may never have expected a matrix output specifically, any unit test that checked that your output is the expected array would have failed reliably.
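For instance, a minimal test that pins down shape, dtype, and values at once might look like this (a sketch using numpy's own testing helpers):
import numpy as np

def test_squeeze_returns_1d():
    result = np.squeeze(np.array([[1, 1]]))
    assert result.ndim == 1   # would have failed reliably for a matrix input
    np.testing.assert_array_equal(result, np.array([1, 1]))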
You might also want to have a look at specialised testing tools that are specific to scientific code, e.g. the Hypothesis package
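Hypothesis ships a numpy extra that generates whole arrays for property-based tests. A sketch, assuming the hypothesis package is installed:
import numpy as np
import hypothesis.strategies as st
from hypothesis import given
from hypothesis.extra.numpy import arrays

@given(arrays(np.float64, (3, 4), elements=st.floats(1, 10)))
def test_keepdims_shape(arr):
    # the property must hold for arbitrary generated inputs, not hand-picked ones
    assert arr.sum(axis=1, keepdims=True).shape == (3, 1)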
One thing that is specific to numpy is the handling of floating-point errors; by default it simply prints a warning, which can easily be missed (and does not cater for proper exception-handling workflows). You can convert this behaviour to throw proper warnings / exceptions that you can capture, via the numpy.seterr function -- e.g. numpy.seterr(all='raise').
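For example, with all='raise' a silent divide-by-zero becomes a catchable FloatingPointError:
import numpy as np

np.seterr(all='raise')
try:
    np.array([1.0]) / np.array([0.0])
except FloatingPointError as e:
    print('caught:', e)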
If you want to use numpy in a "safer" way, you'll probably have to create your own safety net. One way to do so would be to define wrappers that enforce the rules you want your code to obey. You can come up with your own wrappers and tests as you go along and/or stumble upon behaviour that you consider problematic.
Some toy examples:
Always have float arrays:
def arrayf64(*args, **kwargs):
    kwargs.setdefault("dtype", np.float64)
    return np.array(*args, **kwargs)
Disable broadcasting:
def without_broadcasting(op, arr1, arr2):
    # note: equal ndim still lets size-1 dimensions broadcast;
    # a stricter version would check arr1.shape == arr2.shape
    assert arr1.ndim == arr2.ndim
    return op(arr1, arr2)
Warn when using np.matrix:
import warnings

_original_matrix = np.matrix  # keep a reference, or the wrapper would call itself

def safe_np_matrix(*args, **kwargs):
    warnings.warn('Unsafe code detected. Usage of np.matrix found.')
    return _original_matrix(*args, **kwargs)

np.matrix = safe_np_matrix
Related
Why is the empty list [] being inferred as float type when using np.append?
np.append([1,2,3], [0])
# output: array([1, 2, 3, 0]), dtype = np.int64
np.append([1,2,3], [])
# output: array([1., 2., 3.]), dtype = np.float64
This persists even when using np.array([1, 2, 3], dtype=np.int32) as arr.
It's not possible to specify a dtype for append, so I am just curious about why this happens. Numpy's concatenate does the same thing, but when I try to specify the dtype I get an error:
np.concatenate([[1,2,3], []], dtype=np.int64)
Error:
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
But finally if I set the unsafe casting rule it works:
np.concatenate([[1,2,3], []], dtype=np.int64, casting='unsafe')
Why is [] considered a float?
np.append is subject to well-defined semantic rules like any Numpy binary operation. As a result, it first converts the input operands to Numpy arrays if they are not arrays already (typically with np.array), then applies the semantic rules to find the type of the resulting array and checks that the operation is valid before applying it (here the concatenation). The array type returned by np.array is "determined as the minimum type required to hold the objects in the sequence" according to the documentation. When the list is empty, like in your case, the default type is numpy.float64, as stated in the documentation of np.empty. This arbitrary choice was made long ago and has not been changed since, in order not to break old code. Note that it seems not all Numpy developers agree with the current choice, so this is a matter of debate. For more information, you can read this open issue.
The rule of thumb is to use either existing Numpy arrays or to perform an explicit conversion to a Numpy array using np.array with a fixed dtype parameter (as described in the above comments).
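A sketch of that rule of thumb applied to the question's example:
import numpy as np

empty = np.array([], dtype=np.int64)   # fix the dtype at creation
np.concatenate([np.array([1, 2, 3], dtype=np.int64), empty])
# -> array([1, 2, 3]) with dtype int64, no casting rules needed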
Look at the code for np.append (via docs link or ipython):
def append(arr, values, axis=None):
    arr = asanyarray(arr)
    if axis is None:
        if arr.ndim != 1:
            arr = arr.ravel()
        values = ravel(values)
        axis = arr.ndim - 1
    return concatenate((arr, values), axis=axis)
The first argument is turned into an array, if it isn't one already.
You don't specify the axis, so both arr and values are ravelled - turned into 1d arrays. np.ravel is also python code, and does asanyarray(a).ravel(order=order)
So the dtype inference is done by np.asanyarray.
The rest of the action is np.concatenate. It too will convert the inputs to arrays if necessary. The result dtype is the "highest" of the inputs.
np.append is a poorly conceived (IMO) alternative way of using np.concatenate. It is not a list append clone.
Also be careful about "empty" arrays:
In [73]: np.array([])
Out[73]: array([], dtype=float64)
In [74]: np.empty((0))
Out[74]: array([], dtype=float64)
In [75]: np.empty((0),int)
Out[75]: array([], dtype=int64)
The common list idiom
alist = []
for i in range(10):
    alist.append(i)
does not translate well into numpy. Build a list of arrays, and do one concatenate/vstack at the end (a sketch below). Don't iteratively grow "empty" arrays, however they were created.
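A sketch of the numpy-friendly version of that idiom:
import numpy as np

chunks = []
for i in range(10):
    chunks.append(np.arange(i, i + 3))   # cheap list append while collecting
arr = np.vstack(chunks)                  # one vstack at the end
arr.shape                                # (10, 3)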
Related to this question, I came across an indexing behaviour via Boolean arrays and broadcasting I do not understand. We know it's possible to index a NumPy array in 2 dimensions using integer indices and broadcasting. This is specified in the docs:
a = np.array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
c1 = np.where(b1)[0] # i.e. [1, 2]
c2 = np.where(b2)[0] # i.e. [0, 2]
a[c1[:, np.newaxis], c2] # or a[c1[:, None], c2]
array([[ 4,  6],
       [ 8, 10]])
However, the same does not work for Boolean arrays.
a[b1[:, None], b2]
IndexError: too many indices for array
The alternative numpy.ix_ works for both integer and Boolean arrays. This seems to be because ix_ performs specific manipulation for Boolean arrays to ensure consistent treatment.
assert np.array_equal(a[np.ix_(b1, b2)], a[np.ix_(c1, c2)])
a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])
So my question is: why does broadcasting work with integers, but not with Boolean arrays? Is this behaviour documented? Or am I misunderstanding a more fundamental issue?
As #Divakar noted in comments, Boolean advanced indices behave as if they were first fed through np.nonzero and then broadcast together, see the relevant documentation for extensive explanations. To quote the docs,
In general if an index includes a Boolean array, the result will be identical to inserting obj.nonzero() into the same position and using the integer array indexing mechanism described above. x[ind_1, boolean_array, ind_2] is equivalent to x[(ind_1,) + boolean_array.nonzero() + (ind_2,)].
[...]
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
In your case broadcasting would not necessarily be a problem, since both arrays have only two nonzero elements. The problem is the number of dimensions in the result:
>>> len(b1[:,None].nonzero())
2
>>> len(b2.nonzero())
1
Consequently, the indexing expression a[b1[:,None], b2] is equivalent to a[b1[:,None].nonzero() + b2.nonzero()], which passes a length-3 index tuple to a, corresponding to a 3d array index. Hence the error you see about "too many indices".
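You can see that length-3 tuple directly by spelling out the documented equivalence with the question's arrays:
>>> idx = b1[:, None].nonzero() + b2.nonzero()
>>> len(idx)   # a 3d index for the 2d array a
3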
The surprises mentioned in the docs are very close to your example: what if you hadn't injected that singleton dimension? Starting from a length-3 and a length-4 Boolean array, you would've ended up with a length-2 advanced index, i.e. a 1d array of shape (2,). This is never what you'd want, which leads us to another piece of trivia on the subject.
There's been a lot of discussion in planning to revamp advanced indexing, see the work-in-progress draft NEP 21. The gist of the issue is that fancy indexing in numpy, while clearly documented, has some very quirky features which aren't practically useful for anything, but which can bite you if you make a mistake by producing surprising results rather than errors.
A relevant quote from the NEP:
Mixed cases involving multiple array indices are also surprising, and
only less problematic because the current behavior is so useless that
it is rarely encountered in practice. When a boolean array index is
mixed with another boolean or integer array, boolean array is
converted to integer array indices (equivalent to np.nonzero()) and
then broadcast. For example, indexing a 2D array of size (2, 2) like
x[[True, False], [True, False]] produces a 1D vector with shape (1,),
not a 2D sub-matrix with shape (1, 1).
Now, I emphasize that the NEP is very much work-in-progress, but one of the suggestions in the current state of the NEP is to forbid Boolean arrays in advanced indexing cases such as the above, and only allow them in "outer indexing" scenarios, i.e. exactly what np.ix_ would help you do with your Boolean array:
Boolean indexing is conceptionally outer indexing. Broadcasting together with other advanced indices in the manner of legacy indexing [i.e. the current behaviour] is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually.
My point is that the behaviour of Boolean advanced indices and their deprecation status (or lack thereof) may change in the not-so-distant future.
In numpy, I have an array that can be either 2-D or 3-D, and I would like to reduce it to 2-D while squaring each element. So I tried this and it doesn't work:
A = np.random.rand(5, 3, 3)
np.einsum('...ij,...ij->ij', A, A)
It returns this error:
ValueError: output has more dimensions than subscripts given in einstein sum, but no '...' ellipsis provided to broadcast the extra dimensions.
I suppose einsum doesn't assume that when the ellipsis goes away in the right hand side, I want to sum over the ellipsis dimension(s), if they exist. Is there some "elegant" way (i.e. without checking the number of dimensions and using an if statement) to tell it that I want to do this for 3-D:
A = np.random.rand(5, 3, 3)
np.einsum('aij,aij->ij', A, A)
and this for 2-D?
A = np.random.rand(3, 3)
np.einsum('ij,ij->ij', A, A)
Sometimes the 'elegant' way to handle variable dimensions is to use a set of if tests, and hide them in a function call. Look for example at np.atleast_3d; it has a 4-way if/else clause. I'd recommend it here, except it adds the extra dimension at the end, not the start. if clauses using reshape are not expensive (time-wise), so don't be afraid to use them (a sketch follows). Even if you find some magical function, look at its code; you may be surprised what is hidden.
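One sketch along those lines: reshape unconditionally to 3d with a leading axis, so a single einsum string covers both the 2d and the 3d case (squared_reduce is a made-up name):
def squared_reduce(A):
    A = np.asarray(A)
    A3 = A.reshape((-1,) + A.shape[-2:])      # a 2d input becomes (1, i, j)
    return np.einsum('aij,aij->ij', A3, A3)   # sums the (possibly size-1) leading axis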
ellipsis is used for dimensions that 'go along for the ride', not ones where you want specific control. Here you want to sum over the initial dimension, so you need to index it explicitly:
In [161]: np.einsum('i...,i...',A,A)
Out[161]:
array([[ 1.26942035,  1.32052776,  1.74118617],
       [ 1.59679765,  1.49331565,  2.04573002],
       [ 2.29027005,  1.48351522,  1.36679208]])
In [162]: np.einsum('aij,aij->ij',A,A)
Out[162]:
array([[ 1.26942035,  1.32052776,  1.74118617],
       [ 1.59679765,  1.49331565,  2.04573002],
       [ 2.29027005,  1.48351522,  1.36679208]])
For 2d array:
In [165]: np.einsum('ij,ij->ij',A[0],A[0])
Out[165]:
array([[ 0.20497776,  0.11632197,  0.65396968],
       [ 0.0529767 ,  0.24723351,  0.27559647],
       [ 0.62806525,  0.33081124,  0.57070406]])
In [166]: A[0]*A[0]
Out[166]:
array([[ 0.20497776,  0.11632197,  0.65396968],
       [ 0.0529767 ,  0.24723351,  0.27559647],
       [ 0.62806525,  0.33081124,  0.57070406]])
In [167]: np.einsum('...,...',A[0],A[0])
Out[167]:
array([[ 0.20497776,  0.11632197,  0.65396968],
       [ 0.0529767 ,  0.24723351,  0.27559647],
       [ 0.62806525,  0.33081124,  0.57070406]])
I don't think you can handle both cases with one expression.
Another way to get the first sum
In [168]: (A*A).sum(axis=0)
Out[168]:
array([[ 1.26942035,  1.32052776,  1.74118617],
       [ 1.59679765,  1.49331565,  2.04573002],
       [ 2.29027005,  1.48351522,  1.36679208]])
I contributed the patch that fixed the handling of ellipsis, but that was a couple of years ago, so the details aren't super fresh in my mind. As part of that I reverse-engineered the parsing of the string expression (the original is compiled), and could review that code (or refer you to it) if we need a more definitive answer.
In [172]: np.einsum('...ij,...ij->ij',A,A)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-172-dfe39e268402> in <module>()
----> 1 np.einsum('...ij,...ij->ij',A,A)
ValueError: output has more dimensions than subscripts given in
einstein sum, but no '...' ellipsis provided to broadcast
the extra dimensions.
In [173]: np.einsum('...ij,...ij->...ij',A,A).shape
Out[173]: (5, 3, 3)
The error message says that it is trying to pass the ... dimensions to the output, and can't, because the output specification lacks those dimensions (there is no ... on the right-hand side). In other words, einsum does not perform summation over the ... dimensions; they pass to the output unchanged (broadcasting rules apply).
I was growing confused during the development of a small Python script involving matrix operations, so I fired up a shell to play around with a toy example and develop a better understanding of matrix indexing in Numpy.
This is what I did:
>>> import numpy as np
>>> A = np.matrix([1,2,3])
>>> A
matrix([[1, 2, 3]])
>>> A[0]
matrix([[1, 2, 3]])
>>> A[0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0][0]
matrix([[1, 2, 3]])
As you can imagine, this has not helped me develop a better understanding of matrix indexing in Numpy. This behavior would make sense for something that I would describe as "An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
PS: I know how to obtain the first entry of the matrix. What I am interested in is the logic behind this design decision.
EDIT: I'm not asking how to access a matrix element, or why a matrix row behaves like a matrix. I'm asking for a definition of the behavior of a matrix when indexed with a single number. It's an action typical of arrays, but the resulting behavior is nothing like the one you would expect from an array. I would like to know how this is implemented and what's the logic behind the design decision.
Look at the shape after indexing:
In [295]: A=np.matrix([1,2,3])
In [296]: A.shape
Out[296]: (1, 3)
In [297]: A[0]
Out[297]: matrix([[1, 2, 3]])
In [298]: A[0].shape
Out[298]: (1, 3)
The key to this behavior is that np.matrix is always 2d. So even if you select one row (A[0,:]), the result is still 2d, shape (1,3). So you can string along as many [0] as you like, and nothing new happens.
What are you trying to accomplish with A[0][0]? The same as A[0,0]?
For the base np.ndarray class these are equivalent.
Note that Python interpreter translates indexing to __getitem__ calls.
A.__getitem__(0).__getitem__(0)
A.__getitem__((0,0))
[0][0] is 2 indexing operations, not one. So the effect of the second [0] depends on what the first produces.
For an array A[0,0] is equivalent to A[0,:][0]. But for a matrix, you need to do:
In [299]: A[0,:][:,0]
Out[299]: matrix([[1]]) # still 2d
=============================
"An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
In addition, A[0,:] is not the same as A[0]
In light of these comments let me suggest some clarifications.
A[0] does not mean 'return the 1st element'. It means: select along the 1st axis. For a 1d array that means the 1st item. For a 2d array it means the 1st row. For an ndarray that row would be a 1d array, but for a matrix it is another matrix. So for a 2d array or matrix, A[i,:] is the same thing as A[i].
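A side-by-side sketch with the same data as a plain ndarray makes the rule concrete:
In [300]: a = np.asarray(A)    # base-class view of the same data
In [301]: a[0]
Out[301]: array([1, 2, 3])     # 1d row
In [302]: a[0][0]
Out[302]: 1                    # a scalar at last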
A[0] does not just return itself. It returns a new matrix. Different id:
In [303]: id(A)
Out[303]: 2994367932
In [304]: id(A[0])
Out[304]: 2994532108
It may have the same data, shape and strides, but it's a new object. It's just as unique as the ith row of a many row matrix.
Most of the unique matrix activity is defined in: numpy/matrixlib/defmatrix.py. I was going to suggest looking at the matrix.__getitem__ method, but most of the action is performed in np.ndarray.__getitem__.
The np.matrix class was added to numpy as a convenience for old-school MATLAB programmers. numpy arrays can have almost any number of dimensions: 0, 1, .... MATLAB allowed only 2, though a release around 2000 generalized it to 2 or more.
Imagine you have the following:
>>> A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
If you want to get the values of the second column, use the following:
>>> A.T[1]
array([ 2,  6, 10])
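For comparison, indexing the column axis directly gives the same result without the transpose:
>>> A[:, 1]
array([ 2,  6, 10])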
Given the following code, I expect the last two lines to behave the same; however, they don't.
import numpy as np
C = np.matrix(np.zeros((4,4)))
C[0, 0:2] = np.matrix([[1, 2]]) # Works as expected.
C[0, [0,1]] = np.matrix([[1, 2]]) # Throws an "array is not broadcastable to correct shape" error.
When using an ndarray instead, things work as expected (adjusting the right-hand-side of the assignment to a one-dimensional ndarray):
D = np.zeros((4,4))
D[0, 0:2] = np.array([1, 2]) # Works as expected.
D[0, [0,1]] = np.array([1, 2]) # Works too.
And to make things even weirder, if one only indexes the matrix C (as opposed to assigning to it), slice indices and a list seem to return the same thing:
C[0, 0:2] # => matrix([[ 1., 2.]])
C[0, [0, 1]] # => matrix([[ 1., 2.]])
The question is, why is the behavior of the two approaches in assignment different? What am I missing?
(Edit: typo)
It appears to be a bug in numpy: http://projects.scipy.org/numpy/ticket/803 . The solution is to assign an ordinary list or numpy array instead of assigning a matrix to the selected elements.
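A sketch of that workaround on the question's matrix (per the ticket, it is the matrix-shaped right-hand side that trips the shape check):
C[0, [0, 1]] = np.array([1, 2])   # a plain 1d array works
C[1, [0, 1]] = [3, 4]             # so does an ordinary list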
Edit: I had to realize that while what I write below is true, the fact that D[0, 0:2] = ... differs from D[0, [0,1]] = ... (i.e. for plain arrays) may be a real inconsistency (and related).
Maybe an explanation of why this happens, as far as I can see. Check this:
D[0,[0,1]] = np.array([[1,2]])
Gives the same error. The thing is that internally the slicing operation takes place before the matrix shape is "fixed" to 2D again; since matrix is a subclass, that fixing occurs whenever a new view is created, but here no view is created because normally none is necessary!
This means that when you are setting elements like this, it always behaves like:
C.A[0,[0,1]] = matrix([[1,2]]) # Note the C.A giving normal array view of C.
Which fails, because the matrix is 2D but C.A[0, [0,1]] is 1D (since it is not "fixed" to be at least 2D by the matrix object). One could say that since this just removes a length-1 axis from the right-hand side, numpy could perhaps tolerate it; but as long as it doesn't, the matrix object would need a full custom set of in-place/assignment operators, which would probably not be very elegant either.
But maybe the use of C.A, etc. can help get around this inconvenience. On a general note, however, in numpy it is better to always use base-class arrays unless you are doing a lot of matrix multiplications, etc. (in which case, if that is limited to one part of the program, it's likely better to view your arrays as matrices just there and work with arrays in the rest).
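A sketch of the C.A route, assuming (as described above) that .A returns a base-class ndarray view sharing the matrix's data:
import numpy as np

C = np.matrix(np.zeros((4, 4)))
C.A[0, [0, 1]] = np.array([1, 2])   # 1d assignment into the 1d selection
C[0, :2]                            # matrix([[1., 2.]])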