"Direct" numpy functions on an array vs numpy array functions - python

I have a question about the design of Python. I have realised that some functions are implemented directly on container classes (e.g. numpy arrays), while other functions that act on these containers must be called from numpy itself. An example would be:
import numpy as np
y = np.array([4,7,9,1])
m1 = np.mean(y) # Ok
m2 = y.mean() # Ok
print(m1 == m2) # True
x = [2,3]
r1 = np.concatenate([x, y]) # Ok
r2 = y.concatenate(x) # AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
print(r1 == r2)
Why can the mean be calculated directly from the array, while the array has no method to concatenate another one to it? Is there a general rule for which functions can be called directly on the array and which ones cannot? And if both are possible, what is the pythonic way to do it?

The overview of NumPy history gives an indication of why not everything is consistent: it has two predecessors that were developed independently. Backward compatibility requires the project to keep array methods like max. Ongoing development favors the function syntax np.fun(array). I suppose one reason for the latter is that it allows array_like input (the term used throughout NumPy documentation): anything that NumPy can turn into an ndarray.
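As a quick sketch (reusing the numbers from the question), the function form accepts a plain list, while the method form obviously requires an ndarray:
import numpy as np
np.mean([4, 7, 9, 1])   # 5.25 -- a plain list is accepted as array_like
# [4, 7, 9, 1].mean()   # AttributeError: plain lists have no mean method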

The question of why there are both methods and functions of the same name has been discussed before, and links have been provided.
But to focus on your two examples:
mean uses just one array. Logically it can be an ndarray method.
concatenate takes a list of arrays, and doesn't give priority to any one of them.
There is an np.append function that superficially looks like the list .append method. But it just passes the task on to concatenate with a few modifications, and it causes all kinds of newbie errors: it isn't in-place, it ravels by default, and it is slow compared to the list method.
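A small sketch of those pitfalls (nothing here is specific to the question's arrays):
import numpy as np
a = np.zeros((2, 2))
np.append(a, [1, 2])            # array([0., 0., 0., 0., 1., 2.]) -- the 2-D input is raveled
a                               # still all zeros: np.append is not in-place
np.append(a, [[1, 2]], axis=0)  # passing axis keeps the 2-D shape, but a full copy is still made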
Or consider the large family of ufuncs. Those are functions; some take one array, others two. They share common ufunc machinery.
np.add(a,b) <=> a+b <=> a.__add__(b)
np.sin(a) # no a.sin()
I suspect the choice to make sin a ufunc rather than a method has been influenced by common mathematical notation.
To me a big plus to the function approach is that it can be applied to a list or scalar. np.sin(1) works just as well as np.sin([0,.5,1]) or np.sin(np.arange(0,1,.5)).
Yes, history goes a long way toward explaining the mix of functions and methods, but many of the choices are logical.

Related

An issue with parallelising function broadcasting over a mesh using dask

I am looking to parallelise a function which takes multiple 1-dimensional ranges (of the form np.linspace(x, y, t)) of numerical input values (the number is variable, but let's say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function over this mesh. In its current form it looks something like this:
import itertools

def func_5d(a, b, c, d, e):
    return a + b + c + d + e

def range_search(a_range, b_range, c_range, d_range, e_range):
    mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
    # each mesh point x is a 5-tuple, so unpack it into the cost function
    func_eval = map(lambda x: (func_5d(*x), x), mesh)
    return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked, and then mapped through to our cost function using either multi-threading or multi-core processing. Looking through the dask documentation, it does not appear that dask.array contains any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but it does not support chunking. Additionally, dask.array does not seem to contain a parallelised map function. There is one in dask.bag, but the documentation seems to suggest that dask.bag is intended only for preliminary processing of raw data (in formats such as CSV, JSON, etc.). Dask.bag objects also have a method called product() which seems to imitate itertools.product; however, it only takes one other dask.bag object as an argument, so meshing 5 arrays requires this method call to be stacked (4 times), which aside from being hideously ugly is also inefficient when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter notebooks that the Dask project has put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to parallelising functions of the above form would be much appreciated.
I would use NumPy broadcasting (inserting new axes with None / np.newaxis) for this:
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
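As a minimal sketch (the vector lengths and chunk sizes here are placeholders, not recommendations), the same broadcasting pattern carries over to dask.array, which gives you the chunked, lazily evaluated mesh the question asks for:
import numpy as np
import dask.array as da

a = da.from_array(np.linspace(0, 1, 1000), chunks=100)
b = da.from_array(np.linspace(0, 1, 1000), chunks=100)
c = da.from_array(np.linspace(0, 1, 1000), chunks=100)

# broadcasting builds the 1000**3 mesh lazily, one chunk at a time
cost = a[:, None, None] + b[None, :, None] + c[None, None, :]
result = cost.min().compute()   # or whatever reduction/evaluation you actually need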

What is the difference between NumPy object methods and NumPy function calls?

If I have a NumPy array,
>>> x = np.arange(10)
what is the difference between getting information about that array using the object method
>>> x.mean()
4.5
compared to using the NumPy functions
>>> np.mean(x)
4.5
I expect the object method is calling the function, but there are examples where a function is not included as a method, such as
>>> np.median(x)
4.5
>>> x.median()
AttributeError: 'numpy.ndarray' object has no attribute 'median'
The exclusion of some functions seems to indicate that the functional approach is more complete, or preferred over the object-oriented approach, because it eliminates the need to switch back and forth. Is the exclusion of some methods intentional? Is there an inherent advantage to one approach over the other?
There is a notable difference between numpy.sort and ndarray.sort: the former returns a copy of the array, the latter sorts in-place.
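A short illustration of that difference:
import numpy as np
x = np.array([3, 1, 2])
y = np.sort(x)   # returns a sorted copy; x is unchanged
x.sort()         # sorts x in place and returns None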
For other methods, one can use equivalent functions just as well. The function form will accept array-like collections that can be converted to a NumPy array; this is sometimes convenient. On the other hand, this comes at the expense of a few more checks and function calls, so the method form should be a tiny bit faster. In practice, this is probably negligible; for me the deciding factor is usually that methods take fewer characters to type.
Some mathematical operations are more naturally written in method or attribute form: compare np.transpose(A) with A.T, when A is a 2-dimensional array.

Why can't I use x.unique() in numpy, while x.sum() or x.mean() work?

I'm learning numpy, but I don't understand why, for example:
import numpy as np
ints = np.array([3,3,3,2,2,1,1,4,4])
ints.unique() # this won't work
np.unique(ints) # this works
However, some functions work both ways:
ints.sum()
np.sum(ints)
And I was reading the numpy documentation: what's the difference between attributes and methods? Attributes return something, just as methods do.
unique, unlike sum, is only a free function and not a class (instance, to be precise) method. The difference between the two is
obj.foo() # instance method, obj is implicitly passed to foo()
foo(obj) # free function, obj is explicitly passed to foo()
Have a look here for some explanation of the different variants of methods. In NumPy this is mainly a design decision, I believe, but there are reasons for some functions to be free functions. One reason that comes to mind is that, unlike in other technical languages (such as MATLAB), numpy arrays can be structured or unstructured (regular vs. ragged) and can contain objects of different types, for example:
a = np.array([[1, 2], [3, 4]])                       # structured (regular 2x2) array
b = np.array([[1, 2], [3, 4, 5]], dtype=object)      # unstructured (ragged) array; recent NumPy requires dtype=object here
c = np.array([[1, 2], ["abc", True]], dtype=object)  # unstructured array with flexible data types
In such scenarios, having to make every function an instance method would lead to confusing behaviour. Even the sum function behaves differently with structured and unstructured arrays:
In [18]: a.sum() # sums all elements of the array
Out[18]: 10
In [19]: b.sum() # concatenates all elements of the array
Out[19]: [1, 2, 3, 4, 5]
In contrast, some functions like unique have a much narrower scope. For example, unique only works with arrays/buffers of a uniform data type and operates on the flattened (1-D) version of the array.
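A small sketch of that behaviour (the axis keyword shown on the last line exists only in newer NumPy versions):
import numpy as np
m = np.array([[3, 1], [2, 3]])
np.unique(m)           # array([1, 2, 3]) -- operates on the flattened array
np.unique(m, axis=0)   # unique rows instead, if your NumPy is recent enough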
The attributes of numpy arrays, on the other hand, typically tell you about the underlying data type, shape, dimensionality, memory layout/strides, and data ownership of the array, for instance:
In [20]: a=np.random.rand(3,4)
In [21]: a.flags
Out[21]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [22]: a.shape
Out[22]: (3, 4)
In [23]: a.dtype
Out[23]: dtype('float64')
These are all attributes and not array methods per se; in other words, they are properties.
np.sum is a function that takes an array, or anything that can be turned into an array, and applies its sum method. See np.source(np.sum) for details.
arr.sum is a method of the arr array. For an ndarray it is compiled code. A subclassed array may have a different sum method.
In most cases where there are like-named functions and methods, a relationship like this holds.
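As a rough sketch of that relationship (LoggingArray is a made-up subclass for illustration; in recent NumPy versions np.sum typically delegates to the array's own sum method when the argument is not a plain ndarray):
import numpy as np

class LoggingArray(np.ndarray):
    def sum(self, *args, **kwargs):
        print("sum called")
        return super().sum(*args, **kwargs)

x = np.arange(5).view(LoggingArray)
x.sum()     # prints "sum called"; the result equals 10
np.sum(x)   # typically dispatches to the subclass's sum method, printing again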
Look at the source for np.unique to see a different design. One difference that comes to mind is that unique only works with 1d arrays, or with a flattened array. It's not as general-purpose as a method like sum or mean.
Some of these differences follow a pattern or have an explanation; others are probably more the result of development history. It is often easier to add new functionality by writing a 'stand-alone' function than by adding a method to an existing class, since a method is more closely integrated with the class.
To get into more detail you'll have to spend time reading the development archives. For roughly the last 5 years, much of that can be found by searching the respective GitHub repository and its issues.

Confusion about numpy's apply along axis and list comprehensions

Alright, so I apologize ahead of time if I'm asking something silly, but I really thought I understood how apply_along_axis worked. I just ran into what might be an edge case I hadn't considered, and it's baffling me. In short, this is the code that is confusing me:
import numpy as np

class Leaf(object):
    def __init__(self, location):
        self.location = location

    def __len__(self):
        return self.location.shape[0]

def bulk_leaves(child_array, axis=0):
    test = np.array([Leaf(location) for location in child_array])  # This is what I want
    check = np.apply_along_axis(Leaf, 0, child_array)  # This returns an array of individual leaves with the same shape as child_array
    return test, check

if __name__ == "__main__":
    test, check = bulk_leaves(np.random.rand(100, 50))
    test == check  # False
I always feel silly using a list comprehension with numpy and then casting back to an array, but I'm just not sure of another way to do this. Am I missing something obvious?
apply_along_axis is pure Python code that you can look at and decode yourself. In this case it essentially does:
check = np.empty(child_array.shape, dtype=object)
for i in range(child_array.shape[1]):
    check[:, i] = Leaf(child_array[:, i])
In other words, it preallocates the container array, and then fills in the values with an iteration. That certainly is better than appending to the array, but rarely better than appending values to a list (which is what the comprehension is doing).
You could take the above template and adjust it to produce the array that you really want.
check = np.empty(child_array.shape[0], dtype=object)  # 1-D container this time
for i in range(check.shape[0]):
    check[i] = Leaf(child_array[i, :])
In quick tests this iteration times about the same as the comprehension. apply_along_axis, besides giving the wrong result here, is slower.
The problem seems to be that apply_along_axis uses isscalar to determine whether the returned object is a scalar, but isscalar returns False for user-defined classes. The documentation for apply_along_axis says:
The shape of outarr is identical to the shape of arr, except along the axis dimension, where the length of outarr is equal to the size of the return value of func1d.
Since your class's __len__ returns the length of the array it wraps, numpy "expands" the resulting array into the original shape. If you don't define a __len__, you'll get an error, because numpy doesn't think user-defined types are scalars, so it will still try to call len on it.
As far as I can see, there is no way to make this work with a user-defined class. You can return 1 from __len__, but then you'll still get an Nx1 2D result, not a 1D array of length N. I don't see any way to make Numpy see a user-defined instance as a scalar.
There is a numpy bug about the apply_along_axis behavior, but surprisingly I can't find any discussion of the underlying issue that isscalar returns False for non-numpy objects. It may be that numpy just decided to punt and not guess whether user-defined types are vector or scalar. Still, it might be worth asking about this on the numpy list, as it seems odd to me that things like isscalar(object()) return False.
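For what it's worth, a quick check of that claim (Dummy is just a throwaway class for illustration):
import numpy as np

class Dummy:
    pass

np.isscalar(3.0)       # True
np.isscalar(Dummy())   # False -- user-defined instances are not treated as scalars
np.isscalar(object())  # False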
However, if as you say you don't care about performance anyway, it doesn't really matter. Just use your first way with the list comprehension, which already does what you want.

input/output validation/casting of a numpy calculation

This is a situation that happens quite often in my code. Say I have a function do_sth(a, b) that, only for the sake of this example, simply calculates a + b, with a, b either 1D numpy arrays or scalars. On many occasions I need the function to broadcast the operation, so that if both a and b are 1D arrays, the result will be a 2D array. An example of what I mean follows:
do_sth(1,2) -> 3
do_sth([1,2],0) -> array([1, 2])
do_sth(0,[3,4]) -> array([3, 4])
do_sth([1,2],[3,4]) -> array([[4, 5], [5, 6]])
This is a bit similar to how a numpy ufunc behaves. A possible implementation follows:
from numpy import newaxis, atleast_1d

def do_sth(a, b):
    "a, b should be either 1d numpy arrays or scalars"
    a, b = map(atleast_1d, [a, b])
    # the line below mocks a more complicated calculation
    res = a[:, newaxis] + b[newaxis]
    conds = [a.size == 1, b.size == 1]
    if all(conds):
        return res[0, 0]
    elif any(conds):
        return res.ravel()
    else:
        return res
As you can see, there's quite a lot of boilerplate. The first question is: is this the right way to do this input/output casting? Is there any reason not to use a decorator to deal with a situation like this? Is there any guideline on the matter?
Moreover, the more complicated calculation, here mocked by the addition, often fails badly if a or b are numpy arrays with, say, 2D or 3D shapes. I say badly in the sense that the point where the calculation fails is not obvious, may change between revisions of the code, and the connection between the error and the wrong input shape is hard to see. I think it is then NOT advisable to put the complicated calculation in a try/except block (following Python's EAFP). In this case, is it correct to check the shapes of the two arrays at the beginning of the function? Is there any alternative? Is there a numpy function that can both convert the input to a numpy array and check that the input is compatible with a certain number of dimensions, something like asarray_withdim(arr, ndim=5)?
Regarding the use of decorators: I haven't seen much use of decorators in numpy code, but I think that's because most of the functionality was developed before decorators became common in Python. If you can make it work, there shouldn't be any downside (but I'm not an expert on either decorators or ufuncs).
Non-compiled numpy functions often have a lot of code that massages the inputs into convenient dimensions. Then they do the core action, followed by final reshaping and type wrapping. They might use functions like np.atleast_2d to ensure there are enough dimensions, and .reshape(-1, 1, 1) to collapse excess dimensions.
np.tensordot is an example of one that performs axes transpose plus reshape on the inputs so it can apply the compiled np.dot. np.insert starts with a number of ndim and isinstance tests. Special cases are handled early, while the general one is left to the end. np.einsum is compiled, but there's a lot of preprocessing being done in C code, before it finally creates an nditer object and does the calculation.
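As far as I know there is no single built-in that both converts the input and enforces a dimensionality, but the same input-massaging pattern can be wrapped in a tiny helper; check_asarray below is a hypothetical name, just a sketch of the idea:
import numpy as np

def check_asarray(x, ndim):
    """Convert x to an ndarray and fail loudly if its dimensionality is wrong."""
    arr = np.asarray(x)
    if arr.ndim != ndim:
        raise ValueError("expected a %d-D input, got shape %s" % (ndim, arr.shape))
    return arr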
