Are numpy matrix functions buffered? - python

Are numpy array-specific functions, such as x.max(), buffered (cached) when applied several times?
So should one write:
bincount=np.apply_along_axis(lambda x: np.bincount(x, minlength=data.max()+1), axis=0, arr=data)
or better
data_max=data.max()+1
bincount=np.apply_along_axis(lambda x: np.bincount(x, minlength=data_max), axis=0, arr=data)
where data is e.g.
data = np.array([[1, 2, 5, 4, 8, 7, 8, 9, 14, 8, 14, 5, 2, 1],
                 [5, 8, 7, 13, 7, 8, 9, 21, 5, 7, 9, 24, 3, 2]])
or, of course, much larger.

After updating the question, it seems that you are asking whether numpy implements some form of caching of its results. While there is no general response to this question, for a method like ndarray.max, it is clear that no caching is done.
How can we know that without looking at the implementation? Consider that a caching scheme must resolve two problems:
find a place to store the cached result(s);
have a strategy to invalidate the cache once it no longer applies.
Although the first issue is non-trivial, the second one is the real killer. Not only can a numpy array be changed at any time, but the contents of the array can be shared by many objects. Additionally, C code can obtain the address of the internal buffers, and implement its own modifications to the underlying memory. Caching results would effectively disable many interesting uses of numpy.
You can consider numpy as a low-level library that doesn't concern itself with optimizations of that nature. If caching is needed, it should be implemented at a higher level, such as shown in your second example.
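To see in practice that nothing is cached, note that an in-place modification is immediately reflected by the next call to max() (a minimal sketch):

import numpy as np

a = np.array([1, 5, 3])
print(a.max())   # 5
a[1] = 100       # mutate the underlying buffer in place
print(a.max())   # 100 -- recomputed from the current contents; no cached value survives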

As Slater Tyranus pointed out, only a benchmark will show real results:
import numpy as np
import timeit

def func_a(data):
    return np.apply_along_axis(lambda x: np.bincount(x, minlength=data.max()+1), axis=0, arr=data)

def func_b(data):
    data_max = data.max() + 1
    return np.apply_along_axis(lambda x: np.bincount(x, minlength=data_max), axis=0, arr=data)

setup = '''import numpy as np
data = np.array([[1, 2, 5, 4, 8, 7, 8, 9, 14, 8, 14, 5, 2, 1],
                 [5, 8, 7, 13, 7, 8, 9, 21, 5, 7, 9, 24, 3, 2]])
from __main__ import func_a, func_b'''

min(timeit.Timer('func_a(data)', setup=setup).repeat(100, 100))
0.02922797203063965
min(timeit.Timer('func_b(data)', setup=setup).repeat(100, 100))
0.018524169921875
I also tested with much larger data. Overall, it pays off to compute data_max = data.max() + 1 beforehand, and with much bigger arrays the gap becomes even larger.

Related

Very large in-place numpy array operations: numba, pythran or other?

I need to perform operations on very large arrays (several million entries), with the cumulated size of these arrays close to the available memory.
I understand that when doing a naive operation with numpy, like a = a*3 + b - c**2, several temporary arrays are created and thus occupy extra memory.
As I'm planning to work at the limit of memory occupancy, I'm afraid this simple approach won't work, so I'd like to start my development with the right approach.
I know that packages like numba or pythran can help with improving performance when manipulating arrays, but it is not clear to me whether they can automatically deal with in-place operations and avoid temporary objects.
As a simple example, here's one function I'll have to use on large arrays:
def find_bins(a, indices):
    global offset, width, nstep
    i = (a - offset) * nstep / width
    i = np.where(i < 0, 0, i)
    i = np.where(i >= nstep, nstep, i)
    indices[:] = i.astype(int)
So something that mixes arithmetic operations and calls to numpy functions.
How easy would it be to write such functions using numba or pythran (or something else)?
What would be the pros and cons in each case?
Thanks for any hint!
PS: I know about numexpr, but I'm not sure it is convenient or well adapted to functions more complex than a single arithmetic expression.
You could use numexpr. For example:
import numexpr
numexpr.evaluate("a+b*c", out=a)
This could help you avoid the temporary variables; for more background you could refer to High Performance Python by M. Gorelick and I. Ozsvald.
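As a rough sketch (not from the original answer), the question's find_bins could be expressed with numexpr roughly like this; the parameters are passed explicitly instead of using globals:

import numpy as np
import numexpr

def find_bins_ne(a, indices, offset, width, nstep):
    # each evaluate runs in a single pass over the data, without the
    # per-operation temporaries that plain numpy would create
    i = numexpr.evaluate("(a - offset) * nstep / width")
    # clamp to [0, nstep], reusing the buffer of i via out=
    numexpr.evaluate("where(i < 0, 0, where(i >= nstep, nstep, i))", out=i)
    indices[:] = i.astype(int)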
Pythran avoids many temporary arrays by design. For the simple expression you're pointing at, that would be
#pythran export find_bins(float[], int[], float, float, int)
import numpy as np

def find_bins(a, indices, offset, width, nstep):
    i = (a - offset) * nstep / width
    i = np.where(i < 0, 0, i)
    i = np.where(i >= nstep, nstep, i)
    indices[:] = i.astype(int)
This both avoids temporaries and speeds up the computation.
Note that you could use the np.clip function here instead; it is supported by Pythran as well.
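A minimal sketch of the np.clip variant (same logic, not taken from the original answer):

#pythran export find_bins_clip(float[], int[], float, float, int)
import numpy as np

def find_bins_clip(a, indices, offset, width, nstep):
    # np.clip replaces the two np.where calls with a single bounded operation
    i = np.clip((a - offset) * nstep / width, 0, nstep)
    indices[:] = i.astype(int)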

An issue with parallelising function broadcasting over a mesh using dask

I am looking to parallelise a function which takes multiple 1-dimensional ranges (of the form np.linspace(x, y, t)) of numerical input values (this is variable, but let's say it takes five), creates a mesh out of these ranges, and then evaluates some (5-dimensional) cost function over this mesh. In its current form it looks something like this:
import itertools

def func_5d(a, b, c, d, e):
    return a + b + c + d + e

def range_search(a_range, b_range, c_range, d_range, e_range):
    mesh = itertools.product(a_range, b_range, c_range, d_range, e_range)
    func_eval = map(lambda x: (func_5d(*x), x), mesh)
    return func_eval
So, here I would be looking to parallelise the function range_search using dask. Ideally, this would be done by creating a dask mesh, which could then be chunked and mapped through to our cost function using either multi-threading or multi-core processing.
Looking through the dask documentation, dask.array does not appear to contain any suitable mechanism to achieve this. There is a dask.array.meshgrid function, extended from the numpy library, but it does not support chunking. Additionally, dask.array does not seem to contain a parallelised map function. There is one in dask.bag, but the documentation seems to suggest that dask.bag is intended only for preliminary processing of raw data (in formats such as CSV, JSON, etc.). dask.bag objects do also have a product() method which seems to imitate itertools.product; however, it only takes one other dask.bag object as an argument, so meshing 5 arrays requires this method call to be stacked (4 times), which, aside from being hideously ugly, is also inefficient when the number of inputs is variable.
From here, I don't really know where to go. I have worked through the Jupyter notebooks that Dask has put together, but they do not seem to hold an answer to my question. Any suggestions on the best approach to parallelising functions of the above form would be much appreciated.
I would use numpy-style slicing with broadcasting for this:
a[:, None, None] + b[None, :, None] + c[None, None, :]
You will want to make sure that your input vectors are chunked finely enough that the products of them will still fit comfortably in memory.
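A minimal sketch of what this looks like with chunked dask arrays (three dimensions instead of five, and the sizes and chunking are purely illustrative):

import numpy as np
import dask.array as da

# each 1-D range becomes a chunked dask array
a = da.from_array(np.linspace(0.0, 1.0, 400), chunks=100)
b = da.from_array(np.linspace(0.0, 1.0, 400), chunks=100)
c = da.from_array(np.linspace(0.0, 1.0, 400), chunks=100)

# broadcasting builds the full mesh lazily, one chunk at a time
cost = a[:, None, None] + b[None, :, None] + c[None, None, :]

# the reduction is evaluated in parallel across chunks
best = cost.min().compute()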

"Direct" numpy functions on an array vs numpy array functions

I have a question about the design of Python. I have realised that some functions are implemented directly on container classes (e.g. numpy arrays), while other functions that act on these containers must be called from numpy itself. An example would be:
import numpy as np
y = np.array([4,7,9,1])
m1 = np.mean(y) # Ok
m2 = y.mean() # Ok
print(m1 == m2) # True
x = [2,3]
r1 = np.concatenate([x, y]) # Ok
r2 = y.concatenate(x) # AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
print(r1 == r2)
Why can the mean be calculated directly from the array, while the array has no method to concatenate another one to it? Is there a general rule for which functions can be called directly on the array and which cannot? And if both are possible, what is the pythonic way to do it?
The overview of NumPy history gives an indication of why not everything is consistent: it has two predecessors that were developed independently. Backward compatibility requires the project to keep array methods like max. Ongoing development favors the function syntax np.fun(array). I suppose one reason for the latter is that it allows array_like input (the term used throughout NumPy documentation): anything that NumPy can turn into an ndarray.
The question of why there are both methods and functions with the same name has been discussed before, and links have been provided.
But to focus on your two examples:
mean uses just one array. Logically it can be an ndarray method.
concatenate takes a list of arrays, and doesn't give priority to any one of them.
There is a np.append function that looks superficially like the list .append method. But it just passes the task on to concatenate with a few modifications. And it causes all kinds of newbie errors - it isn't in-place, it ravels, and it is slow compared to the list method.
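A quick illustration of those pitfalls (a sketch, not from the original answer):

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.append(a, [5, 6])   # returns a NEW array, unlike list.append
print(b)                   # [1 2 3 4 5 6] -- the result is raveled to 1-D
print(a)                   # [[1 2] [3 4]] -- the original is unchanged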
Or consider the large family of ufunc. Those are functions, some take one array, others two. They share a common ufunc functionality.
np.add(a,b) <=> a+b <=> a.__add__(b)
np.sin(a) # no a.sin()
I suspect the choice to make sin a ufunc rather than a method has been influenced by common mathematical notation.
To me a big plus to the function approach is that it can be applied to a list or scalar. np.sin(1) works just as well as np.sin([0,.5,1]) or np.sin(np.arange(0,1,.5)).
Yes, history goes a long way toward excusing the mix of functions and methods, but many of the choices are logical.

Calculate mean of hue angles

I have been struggling with this for some time, despite there being related questions on SO (e.g. this one).
def circmean(arr):
    arr = np.deg2rad(arr)
    return np.rad2deg(np.arctan2(np.mean(np.sin(arr)), np.mean(np.cos(arr))))
But the results I'm getting don't make sense! I regularly get negative values, e.g.:
test = np.array([323.64,161.29])
circmean(test)
>> -117.53500000000004
I don't know if (a) my function is incorrect, (b) the method I'm using is incorrect, or (c) I just have to do a transformation to the negative values (add 360 degrees?). My research suggests that the problem isn't (a), and I've seen implementations (e.g. here) matching my own, so I'm leaning towards (c), but I really don't know.
Following this question, I've done some research that led me to find the circmean function in the scipy library.
Considering you're using the numpy library, I thought that a proper implementation in the scipy library should suit your needs.
As noted in my answer to the aforementioned question, I haven't found any documentation of that function, but inspecting its source code revealed the proper way it should be invoked:
>>> import numpy as np
>>> from scipy import stats
>>>
>>> test = np.array([323.64,161.29])
>>> stats.circmean(test, high=360)
242.46499999999995
>>>
>>> test = np.array([5, 350])
>>> stats.circmean(test, high=360)
357.49999999999994
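For reference, the asker's own circmean agrees with scipy once the (-180, 180] output of arctan2 is wrapped into [0, 360), which is exactly option (c) from the question (a sketch, not part of the original answer):

import numpy as np
from scipy import stats

def circmean_deg(arr):
    arr = np.deg2rad(arr)
    # arctan2 returns angles in (-180, 180]; the modulo wraps them into [0, 360)
    return np.rad2deg(np.arctan2(np.mean(np.sin(arr)), np.mean(np.cos(arr)))) % 360

test = np.array([323.64, 161.29])
print(circmean_deg(test))              # 242.465...
print(stats.circmean(test, high=360))  # 242.46499999999995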
This might not be of any use to you, since some time has passed since you posted your question and you've already implemented the function yourself, but I hope it may benefit future readers who are struggling with the same issue.

numpy.max or max? Which one is faster?

In Python, which one is faster?
numpy.max(), numpy.min()
or
max(), min()
My list/array length varies from 2 to 600. Which one should I use to save some run time?
Well, from my timings it follows that if you already have a numpy array a, you should use a.max (the source tells us it's the same as np.max when a.max is available). But if you have a built-in list, most of the time is spent converting it into an np.ndarray, which is why max is better in your timings.
In essence: if you have an np.ndarray, use a.max; if you have a list and don't need all the machinery of np.ndarray, use the standard max.
I was also interested in this and tested the three variants with perfplot (a little project of mine). Result: You're not going wrong with a.max().
Code to reproduce the plot:
import numpy as np
import perfplot

b = perfplot.bench(
    setup=np.random.rand,
    kernels=[max, np.max, lambda a: a.max()],
    labels=["max(a)", "np.max(a)", "a.max()"],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.show()
It's probably best if you use something like the Python timeit module to test it for yourself. That way you can test your own data in your own environment, rather than relying on third parties with various test data and environments which aren't necessarily representative of yours.
numpy.min and numpy.max have slightly different semantics (and call signatures) to the builtins, so the choice shouldn't be to do with speed. Use the numpy versions if you need to be able to handle multidimensional data sanely. If you're just using Python lists or other things that don't know about dimensionality, use the builtins.
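A small sketch of that semantic difference on multidimensional data (hypothetical example, not from the original answers):

import numpy as np

a = np.array([[1, 2], [3, 4]])

print(np.max(a))          # 4 -- flattens the array by default
print(np.max(a, axis=0))  # [3 4] -- per-column maxima via the axis argument

# the builtin max() iterates over the rows and tries to compare whole arrays,
# which raises "The truth value of an array ... is ambiguous"
try:
    max(a)
except ValueError as err:
    print(err)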
