input/output validation/casting of a numpy calculation - python

This is a situation that happens quite often in my code. Say I have a function do_sth(a,b) that, only for the sake of this example, simply calculates a+b, with a, b either 1D numpy arrays or scalars. On many occasions, I need the function to broadcast the operation, so that if both a and b are 1D arrays, the result will be a 2D array. An example of what I mean follows:
do_sth(1,2) -> 3
do_sth([1,2],0) -> array([1, 2])
do_sth(0,[3,4]) -> array([3, 4])
do_sth([1,2],[3,4]) -> array([[4, 5], [5, 6]])
This is a bit similar to how a numpy ufunc behaves. A possible implementation follows:
from numpy import newaxis, atleast_1d

def do_sth(a, b):
    "a, b should be either 1D numpy arrays or scalars"
    a, b = map(atleast_1d, [a, b])
    # the line below mocks a more complicated calculation
    res = a[:, newaxis] + b[newaxis]
    conds = [a.size == 1, b.size == 1]
    if all(conds):
        return res[0, 0]
    elif any(conds):
        return res.ravel()
    else:
        return res
As you can see, there's quite a lot of boilerplate. The first question is: is this the right way to do this input/output casting? Is there any reason not to use a decorator to deal with a situation like this? Is there any guideline on the matter?
Moreover, the more complicated calculation, here mocked by the addition, often fails badly if a or b are numpy arrays with, for example, a 2D or 3D shape. I say badly in the sense that the point where the calculation fails is not obvious, may change between revisions of the code, and it is hard to see the connection between the error and the wrong input shape. I therefore think it is NOT advisable to put the complicated calculation in a try/except block (following Python's EAFP). In this case, is it correct to check the shape of the two arrays at the beginning of the function? Is there any alternative? Is there a numpy function that both converts the input to a numpy array and checks that it is compatible with a certain number of dimensions, something like asarray_withdim(arr,ndim=5)?

Regarding the use of decorators - I haven't seen much use of decorators in numpy code, but I think that's because most of the functionality was developed before decorators became common in Python. If you can make it work, there shouldn't be any downside (but I'm not an expert with either decorators or ufunc); a rough sketch of that approach follows below.
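A minimal sketch of that decorator idea, under the same assumptions as the question (the name broadcast_1d is purely illustrative):

import numpy as np
from functools import wraps

def broadcast_1d(func):
    "cast scalar/1D inputs to 1D arrays, then squeeze the 2D result back down"
    @wraps(func)
    def wrapper(a, b):
        a, b = np.atleast_1d(a), np.atleast_1d(b)
        res = func(a, b)
        if a.size == 1 and b.size == 1:
            return res[0, 0]
        if a.size == 1 or b.size == 1:
            return res.ravel()
        return res
    return wrapper

@broadcast_1d
def do_sth(a, b):
    # mock of the more complicated calculation
    return a[:, np.newaxis] + b[np.newaxis]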
Non-compiled numpy functions often have a lot of code that massages the inputs into convenient dimensions. Then they do the core action, followed by final reshaping and type wrapping. They might use functions like np.atleast_2d to ensure there are enough dimensions, and .reshape(-1,1,1) to compress excess dimensions.
np.tensordot is an example of one that performs axes transpose plus reshape on the inputs so it can apply the compiled np.dot. np.insert starts with a number of ndim and isinstance tests. Special cases are handled early, while the general one is left to the end. np.einsum is compiled, but there's a lot of preprocessing being done in C code, before it finally creates an nditer object and does the calculation.
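As for the explicit shape check asked about in the question, a small helper along these lines could be written (asarray_withdim is the question's own hypothetical name, not an existing numpy function):

import numpy as np

def asarray_withdim(arr, ndim):
    "convert to an ndarray and fail fast on the wrong number of dimensions"
    arr = np.asarray(arr)
    if arr.ndim != ndim:
        raise ValueError("expected a %d-D array, got ndim=%d" % (ndim, arr.ndim))
    return arr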


Reinterpreting NumPy arrays as a different dtype

Say I have a large NumPy array of dtype int32
import numpy as np
N = 1000 # (large) number of elements
a = np.random.randint(0, 100, N, dtype=np.int32)
but now I want the data to be uint32. I could do
b = a.astype(np.uint32)
or even
b = a.astype(np.uint32, copy=False)
but in both cases b is a copy of a, whereas I want to simply reinterpret the data in a as being uint32, so as not to duplicate the memory. Similarly, using np.asarray() does not help.
What does work is
a.dtype = np.uint32
which simply changes the dtype without altering the data at all. Here's a striking example:
import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
print(a)
a.dtype = np.uint32
print(a) # shows "overflow", which is what I want
My questions are about the solution of simply overwriting the dtype of the array:
Is this legitimate? Can you point me to where this feature is documented?
Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
What if I want two arrays a and b sharing the same data, but view it as different dtypes? I've found the following to work, but again I'm concerned if this is really OK to do:
import numpy as np
a = np.array([0, 1, 2, 3], dtype=np.int32)
b = a.view(np.uint32)
print(a) # [0 1 2 3]
print(b) # [0 1 2 3]
a[0] = -1
print(a) # [-1 1 2 3]
print(b) # [4294967295 1 2 3]
Though this seems to work, I find it weird that the underlying data of the two arrays does not seem to be located at the same place in memory:
print(a.data)
print(b.data)
Actually, it seems that the above gives different results each time it is run, so I don't understand what's going on there at all.
This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats:
import numpy as np
a = np.array([0, 1, 2, np.pi], dtype=np.float32)
b = a.view(np.float64)
print(a) # [0. 1. 2. 3.1415927]
print(b) # [0.0078125 50.12387848]
b[0] = 8
print(a) # [0. 2.5 2. 3.1415927]
print(b) # [8. 50.12387848]
Again, is this condoned, if the obtained behaviour is really what I'm after?
Is this legitimate? Can you point me to where this feature is documented?
This is legitimate. However, using ndarray.view (which is equivalent) is better since it is compatible with static analysers (so it is somewhat safer). Indeed, the documentation states:
It’s possible to mutate the dtype of an array at runtime. [...]
This sort of mutation is not allowed by the types. Users who want to write statically typed code should instead use the numpy.ndarray.view method to create a view of the array with a different dtype.
Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
Yes, since the array is still a view of the same internal memory buffer (a plain byte buffer); Numpy just reinterprets it differently (this is done directly in the C code of each Numpy computing function).
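One quick way to verify that no copy is made is np.shares_memory (a small sketch; the addresses printed by a.data in the question are those of freshly created memoryview wrapper objects, which is presumably why they differ between runs):

import numpy as np

a = np.array([0, 1, 2, 3], dtype=np.int32)
b = a.view(np.uint32)

print(np.shares_memory(a, b))  # True - both arrays use the same buffer
print(a.__array_interface__['data'][0] == b.__array_interface__['data'][0])  # True - same base address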
What if I want two arrays a and b sharing the same data, but view it as different dtypes? [...]
ndarray.view can be used in this case, as you did in your example. However, the result is platform dependent. Indeed, Numpy just reinterprets bytes of memory, and theoretically the representation of negative numbers can change from one machine to another. Fortunately, nowadays all mainstream modern processors use two's complement (source). This means that an np.int32 value like -1 will be reinterpreted as 2**32-1 = 4294967295 with a view of type np.uint32. Positive signed values are unchanged. As long as you are aware of this, this is fine and the behaviour is predictable.
This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats.
Well, to put it shortly, this is really like playing with fire. In this case it is certainly unsafe, although it may work on your specific machine. Let us venture into troubled waters.
First of all, the documentation of ndarray.view states:
The behavior of the view cannot be predicted just from the superficial appearance of a. It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
The thing is that Numpy reinterprets the pointer in C code. Thus, AFAIK, the strict aliasing rule applies. This means that reinterpreting an np.float32 value as an np.float64 causes undefined behaviour. One reason is that the alignment requirements are not the same for np.float32 (typically 4) and np.float64 (typically 8), so reading an unaligned np.float64 value from memory can cause a crash on some architectures (e.g. POWER), although x86-64 processors support it. Another reason comes from the compiler, which can over-optimize the code because of the strict aliasing rule by making assumptions that are wrong in your case (for example, that an np.float32 value and an np.float64 value cannot overlap in memory, so the modification of the view should not change the original array). However, since Numpy is called from CPython and no function calls are inlined by the interpreter (probably not with Cython either), this last point should not be a problem (it may be if you use Numba or any JIT, though). Note that it is safe to get an np.uint8 view of an np.float32 array, since that does not break the strict aliasing rule (and the alignment is fine). This could be useful to efficiently serialize Numpy arrays. The opposite operation is not safe (especially due to the alignment).
Update about the last section: a deeper analysis of the Numpy code shows that some parts of the code, such as the type-conversion functions, perform safe type punning using the memmove C call, while some other functions, such as all the basic unary and binary operators, do not appear to do proper type punning yet! Moreover, this feature is barely tested by users, and tricky corner cases are likely to cause weird bugs (especially if you read and write in two views of the same array). Thus, use it at your own risk.
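As a small illustration of the np.uint8 view mentioned above (a sketch; the round trip assumes the same dtype and byte order on both ends):

import numpy as np

a = np.array([0.0, 1.0, 2.0, np.pi], dtype=np.float32)

raw = a.view(np.uint8)       # byte-level view: no copy, no aliasing problem
print(raw.size)              # 16 bytes (4 floats * 4 bytes each)

b = raw.view(np.float32)     # reinterpret the same bytes as float32 again
print(np.array_equal(a, b))  # True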

An issue with parallelising function broadcasting over a mesh using dask

I am looking to parallelise a function which takes multiple 1-dimensional ranges (of the form np.linspace(x,y,t)) of numerical input values (the number is variable, but let's say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function over this mesh. In its current form it looks something like this:
import itertools

def func_5d(a, b, c, d, e):
    return a + b + c + d + e

def range_search(a_range, b_range, c_range, d_range, e_range):
    mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
    func_eval = map(lambda x: (func_5d(*x), x), mesh)
    return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked, and then mapped through to our cost function using either multi-threading or multi-core processing. Looking through the dask documentation, it does not appear that dask.array contains any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but this does not support chunking. Additionally, dask.array does not seem to contain a parallelised map function. However, there is one in dask.bag. But the documentation seems to suggest that dask.bag is intended only as a module for preliminary processing of raw data (in formats such as CSV, JSON, etc.). Dask.bag objects do also have a method called product() which seems to imitate itertools.product; however, this only takes one other dask.bag object as an argument. So meshing 5 arrays requires this method call to be stacked (4 times), which, aside from being hideously ugly, is also inefficient when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter notebooks that the dask developers have put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to parallelising functions of the above form would be much appreciated.
I would use Numpy Slicing for this
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
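A minimal sketch of that approach with dask.array, assuming three 1-D inputs for brevity (the chunk sizes are only illustrative):

import numpy as np
import dask.array as da

a = da.from_array(np.linspace(0, 1, 1000), chunks=200)
b = da.from_array(np.linspace(0, 1, 1000), chunks=200)
c = da.from_array(np.linspace(0, 1, 1000), chunks=200)

# broadcasting builds the full mesh lazily, chunk by chunk
cost = a[:, None, None] + b[None, :, None] + c[None, None, :]

result = cost.sum().compute()  # or cost.compute() if the full mesh fits in memory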

Does numba np.diff have a bug?

I'm having this problem where the numba implementation of np.diff does not work on a slice of a matrix. Is this a bug or am I doing something wrong?
import numpy as np
from numba import njit
v = np.ones((2,2))
np.diff(v[:,0])
array([0.])
@njit
def numbadiff(x):
    return np.diff(x)
numbadiff(v[:,0])
This last call results in an error, but I'm not sure why.
The problem is that np.diff in Numba does an internal reshape, which is only supported for contiguous arrays. The slice that you are making, v[:, 0], is not contiguous, hence the error. You can get it to work using np.ascontiguousarray, which returns a contiguous copy of the given array if it is not already contiguous:
numbadiff(np.ascontiguousarray(v[:, 0]))
Note you can also just avoid np.diff and redefine numbadiff as:
@njit
def numbadiff(x):
    return x[1:] - x[:-1]
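A quick way to see why the original slice fails is to look at the contiguity flag that the internal reshape relies on (a small sketch):

import numpy as np

v = np.ones((2, 2))
print(v[:, 0].flags['C_CONTIGUOUS'])  # False - a column slice strides over rows
print(v[0, :].flags['C_CONTIGUOUS'])  # True  - a row slice is contiguous
print(np.ascontiguousarray(v[:, 0]).flags['C_CONTIGUOUS'])  # True - a contiguous copy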
When you encounter an error, the polite thing is to show the error. Sometimes the full error with traceback is appropriate. For numba that may be too much, but you should try to post a summary. It makes it easier for us, especially if we aren't in a position to run your code and see the error ourselves. You might even learn something.
I ran your example and got (in part):
In [428]: numbadiff(np.ones((2,2))[:,0])
---------------------------------------------------------------------------
TypingError
...
TypeError: reshape() supports contiguous array only
...
def diff_impl(a, n=1):
<source elided>
# To make things easier, normalize input and output into 2d arrays
a2 = a.reshape((-1, size))
...
TypeError: reshape() supports contiguous array only
....
This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.
This supports the diagnosis and fix that @jdehesa provides. It's not a bug in numba; it's a problem with your input.
One disadvantage with using numba is that the errors are harder to understand. Another apparently is that it isn't quite as flexible about inputs such as this array view. If you seriously want the speed advantages, you need to be willing to dig into the error messages yourself.

"Direct" numpy functions on an array vs numpy array functions

I have a question about the design of Python. I have realised that some functions are implemented directly on container classes (e.g. numpy arrays) while other functions that act on these containers must be called from numpy itself. An example would be:
import numpy as np
y = np.array([4,7,9,1])
m1 = np.mean(y) # Ok
m2 = y.mean() # Ok
print(m1 == m2) # True
x = [2,3]
r1 = np.concatenate([x, y]) # Ok
r2 = y.concatenate(x) # AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
print(r1 == r2)
Why can the mean be calculated directly from the array while the array has no method to concatenate another one to it? Is there a general rule for which functions can be called directly on the array and which ones cannot? And if both are possible, what is the pythonic way to do it?
The overview of NumPy history gives an indication of why not everything is consistent: it has two predecessors that were developed independently. Backward compatibility requires the project to keep array methods like max. Ongoing development favors the function syntax np.fun(array). I suppose one reason for the latter is that it allows array_like input (the term used throughout NumPy documentation): anything that NumPy can turn into an ndarray.
The question of why there are both methods and functions of the same name has been discussed and links provided.
But to focus on your two examples:
mean uses just one array. Logically it can be an ndarray method.
concatenate takes a list of arrays, and doesn't give priority to any one of them.
There is an np.append function that looks superficially like the list .append method. But it just passes the task on to concatenate with just a few modifications. And it causes all kinds of newbie errors - it isn't in-place, it ravels, and it is slow compared to the list method.
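A short sketch of those np.append pitfalls:

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(np.append(a, 5))  # [1 2 3 4 5] - ravels and returns a new array
print(a.shape)          # (2, 2) - the original is untouched; np.append is not in-place

lst = [1, 2, 3]
lst.append(5)           # the list method works in place and never copies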
Or consider the large family of ufunc. Those are functions, some take one array, others two. They share a common ufunc functionality.
np.add(a,b) <=> a+b <=> a.__add__(b)
np.sin(a) # no a.sin()
I suspect the choice to make sin a ufunc rather than a method has been influenced by common mathematical notation.
To me a big plus to the function approach is that it can be applied to a list or scalar. np.sin(1) works just as well as np.sin([0,.5,1]) or np.sin(np.arange(0,1,.5)).
Yes, history goes a long way toward excusing the mix of functions and methods, but many of the choices are logical.

Why can't I use x.unique() in numpy, while x.sum() or x.mean() works?

I'm learning numpy, however, I don't understand that, for example:
import numpy as np
ints = np.array([3,3,3,2,2,1,1,4,4])
ints.unique() # this won't work
np.unique(ints) # this works
however, some functions work both ways
ints.sum()
np.sum(ints)
And I was reading the numpy documents; what's the difference between attributes and methods? Attributes return something, just as methods do.
unique, unlike sum, is a free function only and not a class (instance, to be precise) method. The difference between the two is:
obj.foo() # instance method, obj is implicitly passed to foo()
foo(obj) # free function, obj is explicitly passed to foo()
Have a look here for some explanation of the different variants of methods. In NumPy, this is mainly a design decision, I believe; however, there are certain reasons for some functions to be free functions. One reason that comes to mind is that, unlike in other technical languages (such as MATLAB), numpy arrays can be structured or unstructured and can be flexible in terms of containing objects of different types, for example:
a = np.array([[1,2],[3,4]]) # structured array
b = np.array([[1,2],[3,4,5]]) # unstructured array
c = np.array([[1,2],["abc",True]]) # unstructured array with flexible data type
In such scenarios, having to make every function/method an instance method would lead to confusing behaviour. Even the sum function behaves differently with structured and unstructured arrays:
In [18]: a.sum() # sums all elements of the array
Out[18]: 10
In [19]: b.sum() # concatenates all elements of the array
Out[19]: [1, 2, 3, 4, 5]
In contrast, some functions like unique have a much narrower scope in terms of their applications. For example, unique only works for structured arrays/buffers of uniform data type and operates on the flattened (1-dimensional) version of the arrays.
attributes of numpy arrays typically tell you about the underlying data type, shape, dimensionality, memory layout/strides and data ownership of the array, for instance:
In [20]: a=np.random.rand(3,4)
In [21]: a.flags
Out[21]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [22]: a.shape
Out[22]: (3, 4)
In [23]: a.dtype
Out[23]: dtype('float64')
are all attributes and not array methods per se; in other words, they are properties.
np.sum is a function that takes an array, or anything that can be turned into an array, and applies its sum method. See np.source(np.sum) for details.
arr.sum is a method of the arr array. For an ndarray it is compiled code. A subclassed array may have a different sum method.
In most cases where there are like-named functions and methods, a relationship like this holds.
Look at the source for np.unique to see a different design. One difference that comes to mind is that unique only works with 1d arrays, or with a flattened array. It's not as general-purpose a function as sum or mean.
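A small sketch of that difference in scope:

import numpy as np

a = np.array([[3, 1], [1, 2]])
print(np.unique(a))   # [1 2 3] - operates on the flattened array
print(a.sum(axis=0))  # [4 3]   - sum/mean generalize over axes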
Some of these differences follow a pattern, or are explained, others are probably more the result of a development history. Often it is easier to add new functionality by writing a 'stand-alone' function, rather than adding a method to an existing class. The method is more closely integrated with the class.
To get into more details you'll have to spend time reading the development archives. For roughly the last 5 years, much of that can be found by searching the respective github repository and its issues.
