I need to minimize a cost function for a large number (1000s) of different inputs. Obviously, this can be implemented by looping over scipy.optimize.minimize or any other minimization routine. Here is an example:
import numpy as np
import scipy as sp
def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)
a = np.random.randn(500, 40)
b = np.array(np.arange(500))
x = []
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, np.zeros(40), args=(a[None, i], b[None, i]))
    x.append(res.x)
It finds x[i, :] that minimize cost for each a[i, :] and b[i], but this is very slow. I guess looping over minimize causes considerable overhead.
A partial solution is to solve for all x simultaneously:
res = sp.optimize.minimize(cost, np.zeros_like(a), args=(a, b))
This is even slower than the loop. minimize does not know that elements in x are group-wise independent. So it computes the full hessian although a block-diagonal matrix would be sufficient, considering the problem structure. This is slow and overflows my computer's memory.
Is there any way to inform minimize or another optimization function about the problem structure so that it can solve multiple independent optimizations in a single function call? (Similar to certain options supported by Matlab's fsolve.)
First, a solution:
Turns out scipy.optimize.least_squares supports exploiting the structure of the jacobian by setting the jac_sparsity argument.
The least_squares function works slightly differently from minimize, so the cost function needs to be rewritten to return residuals instead:
def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b
The jacobian has block-diagonal sparsity structure, given by
jacs = sp.sparse.block_diag([np.ones((1, 40), dtype=bool)]*500)
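For intuition, this is what the pattern looks like at a toy scale (2 blocks of width 3 instead of 500 blocks of width 40):
import numpy as np
import scipy as sp
small = sp.sparse.block_diag([np.ones((1, 3), dtype=bool)] * 2)
print(small.toarray().astype(int))
# [[1 1 1 0 0 0]
#  [0 0 0 1 1 1]]
Each residual depends only on its own block of 40 parameters, which is exactly what jac_sparsity communicates to the solver.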
And calling the optimization routine:
res = sp.optimize.least_squares(residuals, np.zeros(500*40),
                                jac_sparsity=jacs, args=(a, b))
x = res.x.reshape(500, 40)
But is it really faster?
%timeit opt1_loopy_min(a, b) # 1 loop, best of 3: 2.43 s per loop
%timeit opt2_loopy_min_start(a, b) # 1 loop, best of 3: 2.55 s per loop
%timeit opt3_loopy_lsq(a, b) # 1 loop, best of 3: 13.7 s per loop
%timeit opt4_dense_lsq(a, b) # ValueError: array is too big; ...
%timeit opt5_jacs_lsq(a, b) # 1 loop, best of 3: 1.04 s per loop
Conclusions:
There is no obvious difference between the original solution (opt1) and re-use of the starting point (opt2) without sorting.
looping over least_squares (opt3) is considerably slower than looping over minimize (opt1, opt2).
The problem is too big to naively run with least_squares because the jacobian matrix does not fit in memory.
Exploiting the sparsity structure of the jacobian in least_squares (opt5) seems to be the fastest approach.
This is the timing test environment:
import numpy as np
import scipy as sp
def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b
a = np.random.randn(500, 40)
b = np.arange(500)
def opt1_loopy_min(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return np.stack(x)

def opt2_loopy_min_start(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
        x0 = res.x
    return np.stack(x)

def opt3_loopy_lsq(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.least_squares(residuals, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return x

def opt4_dense_lsq(a, b):
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), args=(a, b))
    return res.x.reshape(a.shape)

def opt5_jacs_lsq(a, b):
    jacs = sp.sparse.block_diag([np.ones((1, a.shape[1]), dtype=bool)]*a.shape[0])
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), jac_sparsity=jacs, args=(a, b))
    return res.x.reshape(a.shape)
I guess looping over minimize causes considerable overhead.
Wrong guess. The time required for minimizing a function dwarfs any loop overhead. There is no vectorization magic for this problem.
Some time can be saved by using a better starting point of minimization. First, sort the parameters so that consecutive loops have similar parameters. Then use the end point of previous minimization as a starting point of the next one:
a = np.sort(np.random.randn(500, 40), axis=0)  # sorted parameters
b = np.arange(500)  # no need for np.array here, np.arange is already an ndarray
x0 = np.zeros(40)
for i in range(a.shape[0]):
    res = minimize(cost, x0, args=(a[None, i], b[None, i]))
    x.append(res.x)
    x0 = res.x
This saves 30-40 percent of execution time in my test.
Another, minor, optimization to do is to preallocate an ndarray of appropriate size for resulting x-values, instead of using a list and append method. Before the loop: x = np.zeros((500, 40)); within the loop, x[i, :] = res.x.
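Putting the warm start and the preallocation together, a minimal sketch (reusing the cost function defined in the question):
import numpy as np
import scipy as sp
a = np.sort(np.random.randn(500, 40), axis=0)  # sorted parameters
b = np.arange(500)
x = np.zeros((a.shape[0], a.shape[1]))  # preallocated result instead of list + append
x0 = np.zeros(a.shape[1])
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
    x[i, :] = res.x  # store into the preallocated array
    x0 = res.x       # warm start for the next minimization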
Related
I have a time series of vectors: Y = [v1, v2, ..., vn]. At each time t, I want to compute the distance between vector t and the average of the vectors before t. So for example, at t=3 I want to compute the cosine distance between v3 and (v1+v2)/2.
I have a script to do it but wondering if there's any way to do this faster via numpy's convolve feature or something like that?
import numpy as np
from scipy.spatial.distance import cosine
np.random.seed(10)
# Generate `T` vectors of dimension `vector_dim`
# NOTE: In practice, the vector is a very large column vector!
T = 3
vector_dim = 2
y = [np.random.rand(1, vector_dim)[0] for t in range(T)]
def moving_distance(v):
    moving_dists = []
    for t in range(len(v)):
        if t == 0:
            pass
        else:
            # Create moving average of values up until time t
            prior_vals = v[:t]
            m_avg = np.add.reduce(prior_vals) / len(prior_vals)
            # Now compute distance between this moving average and vector t
            moving_dists.append(cosine(m_avg, v[t]))
    return moving_dists
d = moving_distance(y)
For this dataset, it should return: [0.3337342770170698, 0.0029993196890111262]
ndarray.cumsum or np.add.accumulate can be used to calculate the cumulative sum:
>>> y
array([[0.77132064, 0.02075195],
[0.63364823, 0.74880388],
[0.49850701, 0.22479665]])
>>> y.cumsum(0)
array([[0.77132064, 0.02075195],
[1.40496888, 0.76955583],
[1.90347589, 0.99435248]])
Therefore, the equivalent code of the function you provide is as follows:
>>> means = y.cumsum(0)[:-1] / np.arange(1, len(y))[:, None]
>>> [cosine(avg, vec) for avg, vec in zip(means, y[1:])]
[0.3337342770170698, 0.0029993196890111262]
Referring to the implementation of cosine, the more vectorized code is as follows:
>>> y_ = y[1:]
>>> uv = (means * y_).mean(1)
>>> uu = (means ** 2).mean(1)
>>> vv = (y_ ** 2).mean(1)
>>> np.clip(np.abs(1 - uv / np.sqrt(uu * vv)), 0, 2)
array([0.33373428, 0.00299932])
TL;DR
This is a much faster approach using NumPy (speedups above ~100x for even modest input sizes like 64x16):
import numpy as np
def cos_dist(a, b, axis=None):
    ab = np.sum(a * b, axis=axis)
    aa = np.sum(a * a, axis=axis)
    bb = np.sum(b * b, axis=axis)
    return 1 - (ab / np.sqrt(aa * bb))

def moving_dist_cumsum_np(arr, dist=cos_dist):
    return dist(np.cumsum(arr, axis=0)[:-1], arr[1:], axis=1)
which uses a custom definition of cosine distance and is much more efficient than OP's approach as it is fully vectorized.
A slightly faster and more memory efficient (O(1) instead of O(n)) approach involves using Numba-accelerated explicit looping:
import numba as nb
@nb.njit
def cos_dist_nb(a, b):
    a = a.ravel()
    b = b.ravel()
    ab = aa = bb = 0
    n = len(a)
    for i in range(n):
        ab += a[i] * b[i]
        aa += a[i] * a[i]
        bb += b[i] * b[i]
    return 1 - (ab / (aa * bb) ** 0.5)

@nb.njit
def moving_dist_nb(arr, dist=cos_dist_nb):
    n, m = arr.shape
    result = np.empty(n - 1)
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        result[i] = dist(moving, arr[i + 1, :])
    return result
Long Answer
The computation delineated in the OP can be sped up further with various optimizations.
OP's code is significantly more complex than needed.
Let us start with an adaptation that essentially just:
renames the main input
exposes the dist function
returns a NumPy array
replaces len(prior_vals) with t as it is the same value by construction
def moving_dist_OP(arr, dist=sp.spatial.distance.cosine):
    moving_dists = []
    for t in range(len(arr)):
        if t == 0:
            pass
        else:
            # Create moving average of values up until time t
            prior_vals = arr[:t]
            m_avg = np.add.reduce(prior_vals) / t
            # Now compute distance between this moving average and vector t
            moving_dists.append(dist(m_avg, arr[t]))
    return np.array(moving_dists)
Now, this can be further simplified to this:
def moving_dist_simpler(arr, dist=sp.spatial.distance.cosine):
    return np.array([dist(np.add.reduce(arr[:t]), arr[t]) for t in range(1, len(arr))])
This relies on the following observations:
the loop appending can be rewritten as a list comprehension
the range can be made to start from 1 rather than skipping
the division by the length (a non-negative number) can be factored out in the cosine distance
This last observation stems from the definition of the cosine distance for two vectors a and b of identical size, where a . b is the dot product of a and b and |a| = √(a . a) is the norm induced by said dot product:
cos_dist(a, b) = 1 - (a . b) / (|a| |b|)
if a is replaced with k * a with k > 0 (and |k| is the absolute value of k), this becomes:
1 - ((k * a) . b) / (|k * a| |b|)
-> 1 - (k * (a . b)) / (|k| |a| |b|)
-> 1 - sign(k) * (a . b) / (|a| |b|)
-> 1 - (a . b) / (|a| |b|)
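A quick numeric check of this invariance (using scipy's cosine; the vector size and the scale factor k below are arbitrary):
import numpy as np
from scipy.spatial.distance import cosine
a = np.random.rand(5)
b = np.random.rand(5)
k = 7.3  # any positive scale factor
print(np.isclose(cosine(k * a, b), cosine(a, b)))  # True: positive scaling leaves the distance unchanged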
The np.add.reduce() computation is not very efficient: the sum needed at each iteration could be obtained from the previous iteration's result, but instead an ever-growing number of elements is summed from scratch at every step.
Instead, re-written with partial sums, this becomes:
def moving_dist_part(arr, dist=sp.spatial.distance.cosine):
    n, m = arr.shape
    moving_dists = []
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        moving_dists.append(dist(moving, arr[i + 1]))
    return np.array(moving_dists)
It has already been noted (in @MechanicPig's answer) that the np.add.reduce() computation can also be rewritten with np.cumsum(), which is also more efficient than np.add.reduce() and of similar efficiency to the partial sums, but it uses more temporary memory (O(n) for np.cumsum() versus O(1) for partial sums):
def moving_dist_cumsum(arr, dist=sp.spatial.distance.cosine):
    movings = np.cumsum(arr, axis=0)[:-1]
    return np.array([dist(moving, arr[i]) for i, moving in enumerate(movings, 1)])
It is beneficial to rewrite this either fully vectorized or with simpler loops to be accelerated with Numba.
For the fully vectorized version, np.cumsum() is very helpful as it provides some of the partial computation in vector form.
Unfortunately, scipy.spatial.distance.cosine() does not accept higher dimensional input.
However, based on its definition, it is relatively simple to write a vectorized version of the cosine distance:
def cos_dist(a, b, axis=None):
    ab = np.sum(a * b, axis=axis)
    aa = np.sum(a * a, axis=axis)
    bb = np.sum(b * b, axis=axis)
    return 1 - (ab / np.sqrt(aa * bb))
With this, one can define a fully vectorized approach:
def moving_dist_cumsum_np(arr, dist=cos_dist):
    return dist(np.cumsum(arr, axis=0)[:-1], arr[1:], axis=1)
Note that the new definition of the cosine distance can be used just about anywhere else scipy.spatial.distance.cosine() was used, e.g.:
def moving_dist_cumsum2(arr, dist=cos_dist):
    movings = np.cumsum(arr, axis=0)[:-1]
    return np.array([dist(moving, arr[i]) for i, moving in enumerate(movings, 1)])
However, the vectorized version still has the shortcoming of requiring a potentially large (O(n)) temporary object to store the result of np.cumsum().
Fortunately, with a little more adaptation it is possible to write a Numba-accelerated version of this (similar to moving_dist_part()) that requires only O(1) temporary memory:
import numba as nb
@nb.njit
def cos_dist_nb(a, b):
    a = a.ravel()
    b = b.ravel()
    ab = aa = bb = 0
    n = len(a)
    for i in range(n):
        ab += a[i] * b[i]
        aa += a[i] * a[i]
        bb += b[i] * b[i]
    return 1 - (ab / (aa * bb) ** 0.5)

@nb.njit
def moving_dist_nb(arr, dist=cos_dist_nb):
    n, m = arr.shape
    result = np.empty(n - 1)
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        result[i] = dist(moving, arr[i + 1, :])
    return result
The above approaches can be benchmarked and plotted with the following (where smaller inputs are tested multiple times for more stable results):
import pandas as pd
import matplotlib.pyplot as plt
def benchmark(
    funcs,
    args=None,
    kws=None,
    ii=range(4, 15),
    m=16,
    kk=1024,
    is_equal=np.allclose,
    seed=0,
    unit="ms",
    verbose=True
):
    labels = [func.__name__ for func in funcs]
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    args = tuple(args) if args else ()
    kws = dict(kws) if kws else {}
    assert unit in units
    np.random.seed(seed)
    timings = {}
    for i in ii:
        n = 2 ** i
        k = 1 + i * kk // n
        if verbose:
            print(f"i={i}, n={n}, m={m}, k={k}")
        arrs = np.random.random((k, n, m))
        base = np.array([funcs[0](arr, *args, **kws) for arr in arrs])
        timings[n] = []
        for func in funcs:
            res = np.array([func(arr, *args, **kws) for arr in arrs])
            is_good = is_equal(base, res)
            timed = %timeit -n 1 -r 1 -q -o [func(arr, *args, **kws) for arr in arrs]
            timing = timed.best / k
            timings[n].append(timing if is_good else None)
            if verbose:
                print(
                    f"{func.__name__:>24}"
                    f"  {is_good!s:5}"
                    f"  {timing * (10 ** units[unit]):10.3f} {unit}"
                    f"  {timings[n][0] / timing:5.1f}x")
    return timings, labels
def plot(timings, labels, xlabel="Input Size / #", unit="ms"):
    n_rows = 1
    n_cols = 3
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 6 * n_rows), squeeze=False)
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    df = pd.DataFrame(data=timings, index=labels).transpose()
    base = df[[labels[0]]].to_numpy()
    (df * 10 ** units[unit]).plot(marker="o", xlabel=xlabel, ylabel=f"Best timing / {unit}", ax=axs[0, 0])
    (df / base * 100).plot(marker='o', xlabel=xlabel, ylabel='Relative speed / %', logx=True, ax=axs[0, 1])
    (base / df).plot(marker='o', xlabel=xlabel, ylabel='Speed Gain / x', ax=axs[0, 2])
    fig.patch.set_facecolor('white')
to be used as:
funcs = moving_dist_OP, moving_dist_simpler, moving_dist_part, moving_dist_cumsum, moving_dist_cumsum2, moving_dist_cumsum_np, moving_dist_nb
timings, labels = benchmark(funcs, unit="ms", verbose=True)
plot(timings, labels, "Benchmarks", unit="ms")
to obtain the benchmark plots.
These results indicate that the Numba approach is by far the fastest, but the vectorized approach is still reasonably fast.
When it comes to explicit non-accelerated looping, it is still beneficial to use the custom-defined cos_dist() in place of scipy.spatial.distance.cosine() (see moving_dist_cumsum() vs moving_dist_cumsum2()), while np.cumsum() is reasonably faster than np.add.reduce() but only marginally faster than computing the partial sums. Finally, moving_dist_OP() and moving_dist_simpler() are effectively equivalent (as expected).
The answer for two matrices was given in this question, but I'm not sure how to apply this logic to three pairwise connected matrices since there are no 'free' indices. I want to maximize the following function:
f(i, j, k) = min(A(i, j), B(j, k), C(i,k))
Where A, B and C are matrices and i, j and k are indices that range up to the respective dimensions of the matrices. I would like to find (i, j, k) such that f(i, j, k) is maximized. I am currently doing that as follows:
import numpy as np
import itertools
I = 100
J = 150
K = 200
A = np.random.rand(I, J)
B = np.random.rand(J, K)
C = np.random.rand(I, K)
# All the different i,j,k
combinations = itertools.product(np.arange(I), np.arange(J), np.arange(K))
combinations = np.asarray(list(combinations))
A_vals = A[combinations[:,0], combinations[:,1]]
B_vals = B[combinations[:,1], combinations[:,2]]
C_vals = C[combinations[:,0], combinations[:,2]]
f = np.min([A_vals,B_vals,C_vals],axis=0)
best_indices = combinations[np.argmax(f)]
print(best_indices)
[ 49 14 136]
This is faster than iterating over all (i, j, k), but a lot of (in fact, most of) the time is spent constructing the _vals arrays. This is unfortunate because they contain many duplicate values, as the same i, j and k appear multiple times. Is there a way to do this where (1) the speed of numpy's matrix computation can be preserved and (2) I don't have to construct the memory-intensive _vals arrays?
In other languages you could maybe construct the matrices so that they contain pointers to A, B and C, but I do not see how to achieve this in Python.
Edit: see a follow-up question for more indices here
We can either brute force it using numpy broadcasting or try a bit of smart branch cutting:
import numpy as np
def bf(A,B,C):
    I,J = A.shape
    J,K = B.shape
    return np.unravel_index((np.minimum(np.minimum(A[:,:,None],C[:,None,:]),B[None,:,:])).argmax(),(I,J,K))

def cut(A,B,C):
    gmx = min(A.min(),B.min(),C.min())
    I,J = A.shape
    J,K = B.shape
    Y,X = np.unravel_index(A.argsort(axis=None)[::-1],A.shape)
    for y,x in zip(Y,X):
        if A[y,x] <= gmx:
            return gamx
        curr = np.minimum(B[x,:],C[y,:])
        camx = curr.argmax()
        cmx = curr[camx]
        if cmx >= A[y,x]:
            return y,x,camx
        if gmx < cmx:
            gmx = cmx
            gamx = y,x,camx
    return gamx
from timeit import timeit
I = 100
J = 150
K = 200
for rep in range(4):
    print("trial",rep+1)
    A = np.random.rand(I, J)
    B = np.random.rand(J, K)
    C = np.random.rand(I, K)
    print("results identical",cut(A,B,C)==bf(A,B,C))
    print("brute force",timeit(lambda:bf(A,B,C),number=2)*500,"ms")
    print("branch cut",timeit(lambda:cut(A,B,C),number=10)*100,"ms")
It turns out that at the given sizes branch cutting is well worth it:
trial 1
results identical True
brute force 169.74265850149095 ms
branch cut 1.951422297861427 ms
trial 2
results identical True
brute force 180.37619898677804 ms
branch cut 2.1000938024371862 ms
trial 3
results identical True
brute force 181.6371419990901 ms
branch cut 1.999850495485589 ms
trial 4
results identical True
brute force 217.75578951928765 ms
branch cut 1.5871295996475965 ms
How does the branch cutting work?
We pick one array (A, say) and sort it from largest to smallest. We then go through the array one by one comparing each value to the appropriate values from the other arrays and keeping track of the running maximum of minima. As soon as the maximum is no smaller than the remaining values in A we are done. As this will typically happen rather soonish we get a huge saving.
Instead of using itertools, you can "build" the combinations with repeats and tiles:
A_=np.repeat(A.reshape((-1,1)),K,axis=0).T
B_=np.tile(B.reshape((-1,1)),(I,1)).T
C_=np.tile(C,J).reshape((-1,1)).T
And passing them to np.min:
print((t:=np.argmax(np.min([A_,B_,C_],axis=0)) , t//(K*J),(t//K)%J, t%K,))
With timeit, 10 repetitions of your code take around 18 seconds, while with numpy they take only about 1 second.
Building upon the great answer of loopy walt, you can get a slight speed-up (~20%) by using numba:
import numba
@numba.jit(nopython=True)
def find_gamx(A, B, C, X, Y, gmx):
    gamx = (0, 0, 0)
    for y, x in zip(Y, X):
        if A[y, x] <= gmx:
            return gamx
        curr = np.minimum(B[x, :], C[y, :])
        camx = curr.argmax()
        cmx = curr[camx]
        if cmx >= A[y, x]:
            return y, x, camx
        if gmx < cmx:
            gmx = cmx
            gamx = y, x, camx
    return gamx

def cut_numba(A, B, C):
    gmx = min(A.min(), B.min(), C.min())
    I, J = A.shape
    J, K = B.shape
    Y, X = np.unravel_index(A.argsort(axis=None)[::-1], A.shape)
    gamx = find_gamx(A, B, C, X, Y, gmx)
    return gamx
from timeit import timeit
I = 100
J = 150
K = 200
for rep in range(40):
    print("trial", rep + 1)
    A = np.random.rand(I, J)
    B = np.random.rand(J, K)
    C = np.random.rand(I, K)
    print("results identical", cut(A, B, C) == bf(A, B, C))
    print("results identical", cut_numba(A, B, C) == bf(A, B, C))
    print("brute force", timeit(lambda: bf(A, B, C), number=2) * 500, "ms")
    print("branch cut", timeit(lambda: cut(A, B, C), number=10) * 100, "ms")
    print("branch cut_numba", timeit(lambda: cut_numba(A, B, C), number=10) * 100, "ms")
trial 1
results identical True
results identical True
brute force 38.774325 ms
branch cut 1.7196750999999955 ms
branch cut_numba 1.3950291999999864 ms
trial 2
results identical True
results identical True
brute force 38.77167049999996 ms
branch cut 1.8655760999999993 ms
branch cut_numba 1.4977325999999902 ms
trial 3
results identical True
results identical True
brute force 39.69611449999999 ms
branch cut 1.8876490000000024 ms
branch cut_numba 1.421615300000001 ms
trial 4
results identical True
results identical True
brute force 44.338816499999936 ms
branch cut 1.614051399999994 ms
branch cut_numba 1.3842962000000014 ms
Firstly, I'd like to apologize for the badly worded title - I can't currently think of a better way to phrase it. Basically, I'm wondering if there's a faster way to implement an array operation in Python where each operation depends on previous outputs in an iterative fashion (e.g. forward differencing operations, filtering, etc.). Basically, operations that are of a form like:
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]
Where X is an array of values, and Y is the output. In this case, Y[0] is assumed to be known or calculated separately before the above loop. My question is: Is there a NumPy functionality to speed up this sort of self-referential loop? This is the major bottleneck in almost all the scripts I have. I know NumPy routines benefit from being executed from C routines, so I was curious if anyone knew of any numpy routines that would help here. Else failing that, are there better ways to program this loop (in Python) that would speed up its execution for large array sizes? (>500,000 data points).
Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!
But you have some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but stay with me - I'll present some benchmarks at the end of the answer [1]):
X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]
If you also use direct iteration over the lists, it could be even faster:
X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
    Y[idx] = X_n + X_n_m1 + Y_n_m1
Then there are more sophisticated options that require additional packages. Most notably Cython and Numba, these are designed to work on the array-elements directly and avoid Python overhead whenever possible. For example with Numba you could just use your approach inside a jitted (just-in-time compiled) function:
import numba as nb
@nb.njit
def func(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
There X and Y can be NumPy arrays but numba will work on the buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. But without conda it's hard to install numba... YMMV
However here's also a Cython version of the code (compiled using IPython magic, it's a bit different if you're not using IPython):
In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cimport cython
   ...:
   ...: @cython.boundscheck(False)
   ...: @cython.wraparound(False)
   ...: cpdef cython_indexing(double[:] X, double[:] Y):
   ...:     cdef Py_ssize_t n
   ...:     for n in range(1, len(X)):
   ...:         Y[n] = X[n] + X[n - 1] + Y[n-1]
   ...:     return Y
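If you are not using IPython, one possible route (a sketch; the file/module name cython_indexing is hypothetical) is to put the cpdef function into a .pyx file and let pyximport compile it on import:
import pyximport
pyximport.install()  # compiles .pyx modules transparently when they are imported
from cython_indexing import cython_indexing  # assumes the code above lives in cython_indexing.pyx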
Just to give an example (based on the timing framework from my answer to another question), regarding the timings:
import numpy as np
import numba as nb
import scipy.signal
def numpy_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_indexing(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_direct(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

@nb.njit
def numba_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def numpy_cumsum(X, Y):
    Y[1:] = X[1:] + X[:-1]
    np.cumsum(Y, out=Y)
    return Y

def scipy_lfilter(X, Y):
    a = [1, -1]
    b = [1, 1]
    return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)
# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))
# Timing setup
timings = {numpy_indexing: [],
           list_indexing: [],
           list_direct: [],
           numba_indexing: [],
           numpy_cumsum: [],
           scipy_lfilter: [],
           cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
    X = np.random.random(size=size)
    Y = np.zeros(size)
    Y[0] = np.random.random()
    for func in timings:
        res = %timeit -o func(X, Y)
        timings[func].append(res)
# Plotting absolute times
%matplotlib notebook
import matplotlib.pyplot as plt
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
    ax.plot(sizes,
            [time.best for time in timings[func]],
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
# Plotting relative times
fig = plt.figure(1)
ax = plt.subplot(111)
baseline = numba_indexing # choose one function as baseline
for func in timings:
    ax.plot(sizes,
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
            label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()
plt.tight_layout()
With the following results:
Absolute runtimes
Relative runtimes (compared to the numba function)
So, just by converting it to a list you will be roughly 3 times faster! By iterating directly over these lists you get another (yet smaller) speedup, only 20% in this benchmark, but we're almost 4 times faster now compared to the original solution. With numba you can speed it up by a factor of more than 100 compared to the list operations! And Cython is only a bit slower than numba (~40-50%, usually not more than 10-20% slower), probably because I haven't squeezed out every possible optimization you could do with Cython. However, for large arrays the difference gets smaller.
[1] I went into more detail in another answer. That Q+A was about converting to a set, but because set uses (hidden) "manual iteration" it also applies here.
I included the timings for the NumPy cumsum and SciPy lfilter approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the numba function. However, if I interpret the question correctly, you are looking for general approaches, not only ones that apply to the example. Not every self-referencing loop can be implemented using cum* functions from NumPy or SciPy's filters. But even then it seems like they can't compete with Cython and/or numba.
It's pretty simple using np.cumsum:
#!/usr/bin/env python3
import numpy as np
import random
def r():
    return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()
# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)
# original method
for n in range(1, len(X)):
    slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]
assert (fast_Y == slow_Y).all()
The situation you describe is basically a discrete filter operation. This is implemented in scipy.signal.lfilter. The particular condition you describe corresponds to a = [1, -1] and b = [1, 1].
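For reference, with coefficient vectors of length two, lfilter implements the recurrence a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] - a[1]*y[n-1], so with a = [1, -1] and b = [1, 1] (i.e. a[0] = 1) this is exactly y[n] = x[n] + x[n-1] + y[n-1], the loop from the question.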
import numpy as np
import scipy.signal
a = [1, -1]
b = [1, 1]
X = np.random.random(10000)
Y = np.zeros(10000)
newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
On my computer, the timings work out as follows:
%timeit func4(X, Y.copy())
# 100000 loops, best of 3: 14.6 µs per loop
%timeit newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
# 10000 loops, best of 3: 68.1 µs per loop
I need to return the sin and cos values of every element in a large array. At the moment I am doing:
a,b=np.sin(x),np.cos(x)
where x is some large array. I need to keep the sign information for each result, so:
a=np.sin(x)
b=(1-a**2)**0.5
is not an option. Is there any faster way to return both sin and cos at once?
I compared the suggested solutions with perfplot and found that nothing beats calling sin and cos explicitly.
Code to reproduce the plot:
import perfplot
import numpy as np
def sin_cos(x):
    return np.sin(x), np.cos(x)

def exp_ix(x):
    eix = np.exp(1j * x)
    return eix.imag, eix.real

def cos_from_sin(x):
    sin = np.sin(x)
    abs_cos = np.sqrt(1 - sin ** 2)
    sgn_cos = np.sign(((x - np.pi / 2) % (2 * np.pi)) - np.pi)
    cos = abs_cos * sgn_cos
    return sin, cos
perfplot.save(
    "out.png",
    setup=lambda n: np.linspace(0.0, 2 * np.pi, n),
    kernels=[sin_cos, exp_ix, cos_from_sin],
    n_range=[2 ** k for k in range(20)],
    xlabel="n",
)
You can use complex numbers and the fact that e^(i·φ) = cos(φ) + i·sin(φ).
import numpy as np
from cmath import rect
nprect = np.vectorize(rect)
x = np.arange(2 * np.pi, step=0.01)
c = nprect(1, x)
a, b = c.imag, c.real
I'm using here the trick from https://stackoverflow.com/a/27788291/674064 to make a version of cmath.rect() that'll accept and return NumPy arrays.
This doesn't gain any speedup on my machine, though:
c = nprect(1, x)
a, b = c.imag, c.real
takes about three times the time (160μs) that
a, b = np.sin(x), np.cos(x)
took in my measurement (50.4μs).
A pure numpy version via complex numbers, e^(iφ) = cos(φ) + i·sin(φ),
inspired by the answer from das-g.
x = np.arange(2 * np.pi, step=0.01)
eix = np.exp(1j*x)
cosx, sinx = eix.real, eix.imag
This is faster than the nprect, but still slower than sin and cos calls:
In [6]: timeit c = nprect(1, x); cosx, sinx = cos(x), sin(x)
1000 loops, best of 3: 242 us per loop
In [7]: timeit eix = np.exp(1j*x); cosx, sinx = eix.real, eix.imag
10000 loops, best of 3: 49.1 us per loop
In [8]: timeit cosx, sinx = cos(x), sin(x)
10000 loops, best of 3: 32.7 us per loop
For completeness, another way to combine this down to a single cos() call is to prepare an angle array where the second half has a phase shift of pi/2.
Borrowing the profiling code from Nico Schlömer, we get:
import perfplot
import numpy as np
def sin_cos(x):
    return np.sin(x), np.cos(x)

def exp_ix(x):
    eix = np.exp(1j * x)
    return eix.imag, eix.real

def cos_shift(x):
    angles = x[np.newaxis, :] + np.array(((-np.pi/2,), (0,)))
    return tuple(np.cos(angles))
perfplot.save(
    "out.png",
    setup=lambda n: np.linspace(0.0, 2 * np.pi, n),
    kernels=[sin_cos, exp_ix, cos_shift],
    n_range=[2 ** k for k in range(1, 16)],
    xlabel="n",
)
So it's slower than the separate sin/cos calls, but in some (narrow) contexts might be more convenient because - from the cos() onward - it only needs to deal with a single array.
You could take advantage of the fact that tan(x) contains both the sin(x) and cos(x) functions. So you could use tan(x) and retrieve cos(x) and sin(x) using the common transformation functions; the sign of cos(x) is recovered from x itself:
from numpy import absolute, pi, sign, sin

def cosfromsin(x, sinx):
    cosx = absolute((1 - sinx**2)**0.5)
    signx = sign(((x - pi/2) % (2*pi)) - pi)
    return cosx*signx

a = sin(x)
b = cosfromsin(x, a)
I've just timed this and it is about 25% faster than using sin and cos.
I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in python using numpy and in matlab. The code is almost the same, but the performance is very different. The time matlab takes to complete the task is 0.252454 seconds while numpy takes 0.973672151566, that is almost four times more. I will call this code several times later in a minimization problem so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in numpy?
Numpy implementation:
Sparse.rho is a tuning parameter, sparse.nodes is the number of nodes in the hidden layer (25), sparse.input (64) is the number of nodes in the input layer, theta1 and theta2 are the weight matrices for the first and second layer respectively with dimensions 25x64 and 64x25, m is equal to 10000, rhoest has a dimension of (25,), x has a dimension of 10000x64, a3 is 10000x64 and a2 is 10000x25.
UPDATE: I have introduced changes in the code following some of the ideas of the responses. The performance is now numpy: 0.65 vs matlab: 0.25.
partial_j1 = np.zeros(sparse.theta1.shape)
partial_j2 = np.zeros(sparse.theta2.shape)
partial_b1 = np.zeros(sparse.b1.shape)
partial_b2 = np.zeros(sparse.b2.shape)
t = time.time()
delta3t = (-(x-a3)*a3*(1-a3)).T
for i in range(m):
    delta3 = delta3t[:,i:(i+1)]
    sum1 = np.dot(sparse.theta2.T,delta3)
    delta2 = ( sum1 + sum2 ) * a2[i:(i+1),:].T* (1 - a2[i:(i+1),:].T)
    partial_j1 += np.dot(delta2, a1[i:(i+1),:])
    partial_j2 += np.dot(delta3, a2[i:(i+1),:])
    partial_b1 += delta2
    partial_b2 += delta3
print("Backprop time:", time.time() - t)
Matlab implementation:
tic
for i = 1:m
    delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
    delta3 = delta3.';
    sum1 = W2.'*delta3;
    sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest) );
    delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
    W1grad = W1grad + delta2* a1(i,:);
    W2grad = W2grad + delta3* a2(i,:);
    b1grad = b1grad + delta2;
    b2grad = b2grad + delta3;
end
toc
It would be wrong to say "Matlab is always faster than NumPy" or vice
versa. Often their performance is comparable. When using NumPy, to get good
performance you have to keep in mind that NumPy's speed comes from calling
underlying functions written in C/C++/Fortran. It performs well when you apply
those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on smaller arrays or scalars in a Python loop.
What's wrong with a Python loop you ask? Every iteration through the Python loop is
a call to a next method. Every use of [] indexing is a call to a
__getitem__ method. Every += is a call to __iadd__. Every dotted attribute
lookup (such as in np.dot) involves function calls. Those function calls
add up to a significant hindrance to speed. These hooks give Python
expressive power -- indexing for strings means something different than indexing
for dicts for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.
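A tiny illustration of that point, the same [] syntax dispatching to different __getitem__ methods:
s = "hello"
d = {"hello": 42}
print(s[0])        # 'h' -> str.__getitem__, indexing by position
print(d["hello"])  # 42  -> dict.__getitem__, indexing by key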
But that expressive power comes at a cost in speed. So when you don't need all
that dynamic expressivity, to get better performance, try to limit yourself to
NumPy function calls on whole arrays.
So, remove the for-loop; use "vectorized" equations when possible. For example, instead of
for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])
you can compute delta3 for each i all at once:
delta3 = -(x-a3)*a3*(1-a3)
Whereas in the for-loop delta3 is a vector, using the vectorized equation delta3 is a matrix.
Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:
sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )
Here is a runnable example with an alternative implementation (alt) of your code (orig).
My timeit benchmark shows a 6.8x improvement in speed:
In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop
In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop
import numpy as np
class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)
m, n, p = 10 ** 4, 64, 25
sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)
x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]
def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
    # delta3: (64, 1)
    # sum1: (25, 1)
    # delta2: (25, 1)
    # a1[i:(i+1),:]: (1, 64)
    # partial_j1: (25, 64)
    # partial_j2: (64, 25)
    # partial_b1: (25, 1)
    # partial_b2: (64, 1)
    # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2

def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise
Tip: Notice that I left in the comments the shape of all the intermediate arrays. Knowing the shape of the arrays helped me understand what your code was doing. The shape of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know if an operation is sensible. For example, when you compute
np.dot(A, B)
and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) will be an array of shape (n, p).
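For instance:
import numpy as np
A = np.ones((3, 4))        # (n, m)
B = np.ones((4, 5))        # (m, p)
print(np.dot(A, B).shape)  # (3, 5), i.e. (n, p)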
It can help to build the arrays in C_CONTIGUOUS-order (at least, if using np.dot). There might be as much as a 3x speed up by doing so:
Below, x is the same as xf except that x is C_CONTIGUOUS and
xf is F_CONTIGUOUS -- and the same relationship for y and yf.
import numpy as np
m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')
y = np.random.random((m, n))
yf = np.asarray(y, order='F')
assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))
%timeit benchmarks show the difference in speed:
In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop
In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop
In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop
In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop
Regarding benchmarking in Python:
It can be misleading to use the difference in pairs of time.time() calls to benchmark the speed of code in Python.
You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds worth of repetitions) to avoid errors due to poor resolution in the clock timer and to reduce the significance of time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I'm essentially using that to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.
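A minimal sketch of such a measurement (the array sizes here are arbitrary):
import timeit
setup = "import numpy as np; x = np.random.random((64, 64)); y = np.random.random((64, 64))"
# repeat the whole measurement 5 times, 10000 calls each, and keep the best run;
# timeit temporarily disables garbage collection while timing
best = min(timeit.repeat("np.dot(x, y)", setup=setup, repeat=5, number=10000))
print(best / 10000, "seconds per call")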
I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
I would start with in-place operations to avoid allocating new arrays every time:
partial_j1 += np.dot(delta2, a1[i,:].reshape(1,a1.shape[1]))
partial_j2 += np.dot(delta3, a2[i,:].reshape(1,a2.shape[1]))
partial_b1 += delta2
partial_b2 += delta3
You can replace this expression:
a1[i,:].reshape(1,a1.shape[1])
with a simpler and faster (thanks to Bi Rico):
a1[i:i+1]
Also, this line:
sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest))
seems to be the same at each iteration; you don't need to recompute it.
And, a probably minor optimization, you can replace all the occurrences of
x[i,:] with x[i].
Finally, if you can afford to allocate m times more memory, you can follow unutbu's suggestion and vectorize the loop, replacing:
for i in range(m):
    delta3 = -(x[i]-a3[i])*a3[i]* (1 - a3[i])
with:
delta3 = -(x-a3)*a3*(1-a3)
And you can always use Numba and gain in speed significantly without vectorizing (and without using more memory).
Differences in performance between numpy and matlab have always frustrated me. They often boil down, in the end, to the underlying lapack libraries. As far as I know matlab uses the full atlas lapack by default while numpy uses a lapack light. Matlab reckons people don't care about space and bulk, while numpy reckons people do. There is a similar question with a good answer.
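To see which BLAS/LAPACK your own NumPy build is actually linked against, you can print its build configuration:
import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built with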