The answer for two matrices was given in this question, but I'm not sure how to apply this logic to three pairwise connected matrices since there are no 'free' indices. I want to maximize the following function:
f(i, j, k) = min(A(i, j), B(j, k), C(i,k))
Where A, B and C are matrices and i, j and k are indices that range up to the respective dimensions of the matrices. I would like to find (i, j, k) such that f(i, j, k) is maximized. I am currently doing that as follows:
import numpy as np
import itertools
I = 100
J = 150
K = 200
A = np.random.rand(I, J)
B = np.random.rand(J, K)
C = np.random.rand(I, K)
# All the different i,j,k
combinations = itertools.product(np.arange(I), np.arange(J), np.arange(K))
combinations = np.asarray(list(combinations))
A_vals = A[combinations[:,0], combinations[:,1]]
B_vals = B[combinations[:,1], combinations[:,2]]
C_vals = C[combinations[:,0], combinations[:,2]]
f = np.min([A_vals,B_vals,C_vals],axis=0)
best_indices = combinations[np.argmax(f)]
print(best_indices)
[ 49 14 136]
This is faster than iterating over all (i, j, k), but most of the time is spent constructing the _vals arrays. This is unfortunate, because they contain many duplicate values, as the same i, j and k appear multiple times. Is there a way to do this where (1) the speed of numpy's vectorized computation is preserved and (2) I don't have to construct the memory-intensive _vals arrays?
In other languages you could maybe construct the matrices so that they contain pointers to A, B and C, but I do not see how to achieve this in Python.
Edit: see a follow-up question for more indices here
We can either brute force it using numpy broadcasting or try a bit of smart branch cutting:
import numpy as np
def bf(A, B, C):
    I, J = A.shape
    J, K = B.shape
    return np.unravel_index(
        np.minimum(np.minimum(A[:, :, None], C[:, None, :]), B[None, :, :]).argmax(),
        (I, J, K))
def cut(A, B, C):
    gmx = min(A.min(), B.min(), C.min())
    I, J = A.shape
    J, K = B.shape
    Y, X = np.unravel_index(A.argsort(axis=None)[::-1], A.shape)
    for y, x in zip(Y, X):
        if A[y, x] <= gmx:
            return gamx
        curr = np.minimum(B[x, :], C[y, :])
        camx = curr.argmax()
        cmx = curr[camx]
        if cmx >= A[y, x]:
            return y, x, camx
        if gmx < cmx:
            gmx = cmx
            gamx = y, x, camx
    return gamx
from timeit import timeit
I = 100
J = 150
K = 200
for rep in range(4):
    print("trial", rep + 1)
    A = np.random.rand(I, J)
    B = np.random.rand(J, K)
    C = np.random.rand(I, K)
    print("results identical", cut(A, B, C) == bf(A, B, C))
    print("brute force", timeit(lambda: bf(A, B, C), number=2) * 500, "ms")
    print("branch cut", timeit(lambda: cut(A, B, C), number=10) * 100, "ms")
It turns out that at the given sizes branch cutting is well worth it:
trial 1
results identical True
brute force 169.74265850149095 ms
branch cut 1.951422297861427 ms
trial 2
results identical True
brute force 180.37619898677804 ms
branch cut 2.1000938024371862 ms
trial 3
results identical True
brute force 181.6371419990901 ms
branch cut 1.999850495485589 ms
trial 4
results identical True
brute force 217.75578951928765 ms
branch cut 1.5871295996475965 ms
How does the branch cutting work?
We pick one array (A, say) and sort it from largest to smallest. We then go through the array one by one comparing each value to the appropriate values from the other arrays and keeping track of the running maximum of minima. As soon as the maximum is no smaller than the remaining values in A we are done. As this will typically happen rather soonish we get a huge saving.
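To make the early-exit bound concrete (this small check is mine, not part of the answer): for a fixed (i, j), every candidate f(i, j, k) is capped by A[i, j], so once the running best gmx reaches the next A value in descending order, nothing that remains can improve on it.
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.random((3, 4)), rng.random((4, 5)), rng.random((3, 5))

i, j = 1, 2
f_ij = np.minimum(np.minimum(A[i, j], B[j, :]), C[i, :])  # f(i, j, k) for all k
assert f_ij.max() <= A[i, j]  # the A entry is an upper bound for this (i, j)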
Instead of using itertools, you can "build" the combinations with repeats and tiles:
A_=np.repeat(A.reshape((-1,1)),K,axis=0).T
B_=np.tile(B.reshape((-1,1)),(I,1)).T
C_=np.tile(C,J).reshape((-1,1)).T
And passing them to np.min:
print((t:=np.argmax(np.min([A_,B_,C_],axis=0)) , t//(K*J),(t//K)%J, t%K,))
With timeit 10 repetitions of your code takes around 18 seconds and with numpy only about 1 second.
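As a small side note (my addition), the flat-index arithmetic above is equivalent to np.unravel_index with the same (I, J, K) ordering, which may read more clearly:
flat = np.argmax(np.min([A_, B_, C_], axis=0))
i, j, k = np.unravel_index(flat, (I, J, K))  # same as flat//(K*J), (flat//K)%J, flat%K
print(i, j, k)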
Building upon the great answer by loopy walt, you can get a slight speed-up (~20%) by using numba:
import numba
@numba.jit(nopython=True)
def find_gamx(A, B, C, X, Y, gmx):
    gamx = (0, 0, 0)
    for y, x in zip(Y, X):
        if A[y, x] <= gmx:
            return gamx
        curr = np.minimum(B[x, :], C[y, :])
        camx = curr.argmax()
        cmx = curr[camx]
        if cmx >= A[y, x]:
            return y, x, camx
        if gmx < cmx:
            gmx = cmx
            gamx = y, x, camx
    return gamx

def cut_numba(A, B, C):
    gmx = min(A.min(), B.min(), C.min())
    I, J = A.shape
    J, K = B.shape
    Y, X = np.unravel_index(A.argsort(axis=None)[::-1], A.shape)
    gamx = find_gamx(A, B, C, X, Y, gmx)
    return gamx
from timeit import timeit
I = 100
J = 150
K = 200
for rep in range(40):
    print("trial", rep + 1)
    A = np.random.rand(I, J)
    B = np.random.rand(J, K)
    C = np.random.rand(I, K)
    print("results identical", cut(A, B, C) == bf(A, B, C))
    print("results identical", cut_numba(A, B, C) == bf(A, B, C))
    print("brute force", timeit(lambda: bf(A, B, C), number=2) * 500, "ms")
    print("branch cut", timeit(lambda: cut(A, B, C), number=10) * 100, "ms")
    print("branch cut_numba", timeit(lambda: cut_numba(A, B, C), number=10) * 100, "ms")
trial 1
results identical True
results identical True
brute force 38.774325 ms
branch cut 1.7196750999999955 ms
branch cut_numba 1.3950291999999864 ms
trial 2
results identical True
results identical True
brute force 38.77167049999996 ms
branch cut 1.8655760999999993 ms
branch cut_numba 1.4977325999999902 ms
trial 3
results identical True
results identical True
brute force 39.69611449999999 ms
branch cut 1.8876490000000024 ms
branch cut_numba 1.421615300000001 ms
trial 4
results identical True
results identical True
brute force 44.338816499999936 ms
branch cut 1.614051399999994 ms
branch cut_numba 1.3842962000000014 ms
I'm trying to figure out how to multiply matrices really fast in Python without using NumPy. For this reason, I've recreated the Strassen algorithm and compared it with the standard loop-based multiplication. I compare only square matrices of size NxN where N is 2^k. Surprisingly, my Strassen algorithm is 3.5 times slower than the standard one (I use perf_counter from the time module for timing). The average values are approximately as follows:
matrix size    Strassen      standard mult
16x16          0.006 sec     0.002 sec
32x32          0.036 sec     0.013 sec
64x64          0.26 sec      0.07 sec
128x128        1.69 sec      0.49 sec
1024x1024      771.42 sec    221.09 sec
I generated the matrix elements with randint(1, 9), but in any case I started timing only after creating the test matrices and stopped it before printing the result matrix with loops, so only my functions are timed. I've seen some other posts about the same problem, like this one:
Why is Strassen matrix multiplication so much slower than standard matrix multiplication?
but I can't say it was really helpful for me.
Also, it would be really nice to just change something in my current algorithm to optimize it, instead of writing a new one.
Strassen algorithm
t_start = perf_counter()

def submatrices(n, matrix):  # dividing the matrix into pieces
    A = [[j for j in matrix[i][:int(n / 2)]] for i in range(int(n / 2))]
    B = [[j for j in matrix[i][int(n / 2):]] for i in range(int(n / 2))]
    C = [[j for j in matrix[i][:int(n / 2)]] for i in range(int(n / 2), n)]
    D = [[j for j in matrix[i][int(n / 2):]] for i in range(int(n / 2), n)]
    return [A, B, C, D]

def addition(n, matrix1, matrix2):  # just addition
    res = [[matrix1[i][j] + matrix2[i][j] for j in range(n)] for i in range(n)]
    return res

def subtraction(n, matrix1, matrix2):  # just subtraction
    res = [[matrix1[i][j] - matrix2[i][j] for j in range(n)] for i in range(n)]
    return res

def strassen(n, matrix1, matrix2):
    if n == 2:  # the last step of the algorithm is just standard multiplication
        xy = [[0] * n for i in range(n)]
        for i in range(n):
            for j in range(n):
                for x in range(n):
                    xy[i][j] += matrix1[i][x] * matrix2[x][j]
    else:
        A, B, C, D = submatrices(n, matrix1)  # divide the original matrix1
        E, F, G, H = submatrices(n, matrix2)  # divide the original matrix2
        n = int(n / 2)  # the matrix size is halved now
        p1 = strassen(n, A, subtraction(n, F, H))
        p2 = strassen(n, addition(n, A, B), H)
        p3 = strassen(n, addition(n, C, D), E)
        p4 = strassen(n, D, subtraction(n, G, E))
        p5 = strassen(n, addition(n, A, D), addition(n, E, H))
        p6 = strassen(n, subtraction(n, B, D), addition(n, G, H))
        p7 = strassen(n, subtraction(n, A, C), addition(n, E, F))
        xy1 = addition(n, addition(n, p5, p6), subtraction(n, p4, p2))  # making new blocks of the matrix
        xy2 = addition(n, p1, p2)
        xy3 = addition(n, p3, p4)
        xy4 = subtraction(n, addition(n, p1, p5), addition(n, p3, p7))
        xy = [xy1[i] + xy2[i] for i in range(n)] + [xy3[i] + xy4[i] for i in range(n)]  # assembling a matrix of blocks
    return xy

print(f'Time: {perf_counter() - t_start} sec')

# printing result
for raw in strassen(n, matrix1, matrix2):
    print(*raw)
Standard multiplication
t_start = perf_counter()

def multiply(n, matrix1, matrix2):
    res = [[0] * n for i in range(n)]
    for i in range(n):
        for j in range(n):
            for x in range(n):
                res[i][j] += matrix1[i][x] * matrix2[x][j]
    return res

print(f'Time: {perf_counter() - t_start} sec')

for raw in multiply(n, matrix1, matrix2):
    print(*raw)
Both are horribly inefficient (certainly at least 3-4 orders of magnitude slower than optimized implementations). The standard implementation of Python is the CPython interpreter, which is clearly not designed for this kind of computation. It is mainly meant to execute glue code that calls C functions, like the ones of BLAS libraries. In practice, accessing a list with lst[i][j] causes many function calls, memory indirections, object allocations/destructions, etc. All these overheads are huge compared to the same operation in a natively compiled language (like C/C++), and they are also hard to track without a good understanding of the interpreter (another Python implementation will certainly give completely different results). One issue with the Strassen implementation is recursion: recursion is pretty slow with CPython (you can write a naive recursive Fibonacci implementation to easily check that). Another big issue is the creation of the many temporary sub-lists of lists, which is very slow too (a lot of objects need to be allocated).
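For example, a quick illustrative way to see the cost of CPython recursion and function calls (this example is mine; the exact numbers will vary by machine):
from timeit import timeit

def fib(n):
    # naive recursion: an exponential number of Python-level calls
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def fib_iter(n):
    # same result with a simple loop and no call overhead
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(timeit(lambda: fib(25), number=10))       # dominated by call overhead
print(timeit(lambda: fib_iter(25), number=10))  # several orders of magnitude faster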
If you want to compare such computationally intensive algorithms, please use a natively compiled language. The outcome of such a comparison in Python will be irrelevant for other languages and very dependent on the details of the target Python implementation. Note that Strassen tends to pay off only for huge matrices (generally much bigger than 1024x1024).
Besides, a first step could be to use NumPy to create views of the sub-matrices with a much smaller overhead. The operations on matrices would also be far faster and even simpler to implement.
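A rough sketch of that first step (my illustration of the suggestion above, not a full solution; it assumes power-of-two sizes as in the question, and the quadrants/strassen_np/leaf names are mine): NumPy slices are views, so splitting into quadrants copies no data, and the block additions and subtractions become single vectorized expressions:
import numpy as np

def quadrants(m):
    # The four quadrant views -- no data is copied here.
    h = m.shape[0] // 2
    return m[:h, :h], m[:h, h:], m[h:, :h], m[h:, h:]

def strassen_np(m1, m2, leaf=64):
    # Below the cutoff size, fall back to NumPy's (BLAS-backed) matmul.
    n = m1.shape[0]
    if n <= leaf:
        return m1 @ m2
    A, B, C, D = quadrants(m1)
    E, F, G, H = quadrants(m2)
    p1 = strassen_np(A, F - H, leaf)
    p2 = strassen_np(A + B, H, leaf)
    p3 = strassen_np(C + D, E, leaf)
    p4 = strassen_np(D, G - E, leaf)
    p5 = strassen_np(A + D, E + H, leaf)
    p6 = strassen_np(B - D, G + H, leaf)
    p7 = strassen_np(A - C, E + F, leaf)
    return np.block([[p5 + p4 - p2 + p6, p1 + p2],
                     [p3 + p4, p1 + p5 - p3 - p7]])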
I have a time series of vectors: Y = [v1, v2, ..., vn]. At each time t, I want to compute the distance between vector t and the average of the vectors before t. So for example, at t=3 I want to compute the cosine distance between v3 and (v1+v2)/2.
I have a script to do it, but I'm wondering if there's any way to do this faster via numpy's convolve feature or something like that?
import numpy as np
from scipy.spatial.distance import cosine
np.random.seed(10)
# Generate `T` vectors of dimension `vector_dim`
# NOTE: In practice, the vector is a very large column vector!
T = 3
vector_dim = 2
y = [np.random.rand(1, vector_dim)[0] for t in range(T)]
def moving_distance(v):
    moving_dists = []
    for t in range(len(v)):
        if t == 0:
            pass
        else:
            # Create moving average of values up until time t
            prior_vals = v[:t]
            m_avg = np.add.reduce(prior_vals) / len(prior_vals)
            # Now compute distance between this moving average and vector t
            moving_dists.append(cosine(m_avg, v[t]))
    return moving_dists
d = moving_distance(y)
For this dataset, it should return: [0.3337342770170698, 0.0029993196890111262]
ndarray.cumsum or np.add.accumulate can be used to calculate the cumulative sum:
>>> y
array([[0.77132064, 0.02075195],
[0.63364823, 0.74880388],
[0.49850701, 0.22479665]])
>>> y.cumsum(0)
array([[0.77132064, 0.02075195],
[1.40496888, 0.76955583],
[1.90347589, 0.99435248]])
Therefore, the equivalent code of the function you provide is as follows:
>>> means = y.cumsum(0)[:-1] / np.arange(1, len(y))[:, None]
>>> [cosine(avg, vec) for avg, vec in zip(means, y[1:])]
[0.3337342770170698, 0.0029993196890111262]
Referring to the implementation of cosine, the more vectorized code is as follows:
>>> y_ = y[1:]
>>> uv = (means * y_).mean(1)
>>> uu = (means ** 2).mean(1)
>>> vv = (y_ ** 2).mean(1)
>>> np.clip(np.abs(1 - uv / np.sqrt(uu * vv)), 0, 2)
array([0.33373428, 0.00299932])
TL;DR
This is a much faster approach using NumPy (speedups above ~100x for even modest input sizes like 64x16):
import numpy as np
def cos_dist(a, b, axis=None):
    ab = np.sum(a * b, axis=axis)
    aa = np.sum(a * a, axis=axis)
    bb = np.sum(b * b, axis=axis)
    return 1 - (ab / np.sqrt(aa * bb))

def moving_dist_cumsum_np(arr, dist=cos_dist):
    return dist(np.cumsum(arr, axis=0)[:-1], arr[1:], axis=1)
which uses a custom definition of cosine distance and is much more efficient than OP's approach as it is fully vectorized.
A slightly faster and more memory efficient (O(1) instead of O(n)) approach involves using Numba-accelerated explicit looping:
import numba as nb
@nb.njit
def cos_dist_nb(a, b):
    a = a.ravel()
    b = b.ravel()
    ab = aa = bb = 0
    n = len(a)
    for i in range(n):
        ab += a[i] * b[i]
        aa += a[i] * a[i]
        bb += b[i] * b[i]
    return 1 - (ab / (aa * bb) ** 0.5)

@nb.njit
def moving_dist_nb(arr, dist=cos_dist_nb):
    n, m = arr.shape
    result = np.empty(n - 1)
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        result[i] = dist(moving, arr[i + 1, :])
    return result
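For reference (this usage snippet is mine, not part of the original answer), both TL;DR functions can be applied to the OP's example data once it is stacked into a 2-D array; the values should match the OP's list of vectors for the same seed:
import numpy as np

np.random.seed(10)
y = np.random.rand(3, 2)          # same draws as the OP's T=3, vector_dim=2 example

print(moving_dist_cumsum_np(y))   # fully vectorized NumPy version
print(moving_dist_nb(y))          # Numba version; the first call includes JIT compilation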
Long Answer
The computation delineated in the OP can be further sped up with various optimizations.
OP's code is significantly more complex than needed.
Let us start with an adaptation that essentially just:
renames the main input
exposes the dist function
returns a NumPy array
replaces len(prior_vals) with t as it is the same value by construction
def moving_dist_OP(arr, dist=sp.spatial.distance.cosine):
    moving_dists = []
    for t in range(len(arr)):
        if t == 0:
            pass
        else:
            # Create moving average of values up until time t
            prior_vals = arr[:t]
            m_avg = np.add.reduce(prior_vals) / t
            # Now compute distance between this moving average and vector t
            moving_dists.append(dist(m_avg, arr[t]))
    return np.array(moving_dists)
Now, this can be further simplified to this:
def moving_dist_simpler(arr, dist=sp.spatial.distance.cosine):
    return np.array([dist(np.add.reduce(arr[:t]), arr[t]) for t in range(1, len(arr))])
provided that:
the loop appending can be rewritten as a list comprehension
the range can be made to start from 1 rather than skipping
the division by the length (a non-negative number) can be factored out in the cosine distance
This last observation stems from the definition of the cosine distance for two vectors a and b of identical size, where a . b is the dot product of a and b and |a| = √(a . a) is the norm induced by said dot product:
cos_dist(a, b) = 1 - (a . b) / (|a| |b|)
if a is replaced with k * a with k > 0 (and |k| is the absolute value of k), this becomes:
1 - ((k * a) . b) / (|k * a| |b|)
-> 1 - (k * (a . b)) / (|k| |a| |b|)
-> 1 - sign(k) * (a . b) / (|a| |b|)
-> 1 - (a . b) / (|a| |b|)
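A quick numerical check of this scale invariance (my own addition, using SciPy's cosine for reference):
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)
a, b = rng.random(5), rng.random(5)
k = 3.7  # any positive scale factor

# Scaling one argument by k > 0 leaves the cosine distance unchanged.
print(np.isclose(cosine(k * a, b), cosine(a, b)))  # True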
The np.add.reduce() computation is not very efficient: the value for the next iteration could be computed from the result of the previous one, but instead an ever-growing number of elements is summed from scratch at each iteration.
Instead, re-written with partial sums, this becomes:
def moving_dist_part(arr, dist=sp.spatial.distance.cosine):
    n, m = arr.shape
    moving_dists = []
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        moving_dists.append(dist(moving, arr[i + 1]))
    return np.array(moving_dists)
It has been already noted (in #MechanicPig's answer) that the np.add.reduce() computation can also be rewritten with np.cumsum(), which is also more efficient than np.add.reduce() and of similar efficiency as the partial sum, but it uses more temporary memory (O(n) for np.cumsum() versus O(1) for partial sums):
def moving_dist_cumsum(arr, dist=sp.spatial.distance.cosine):
    movings = np.cumsum(arr, axis=0)[:-1]
    return np.array([dist(moving, arr[i]) for i, moving in enumerate(movings, 1)])
It is beneficial to rewrite this either fully vectorized or with simpler loops to be accelerated with Numba.
For the fully vectorized version, np.cumsum() is very helpful as it provides some of the partial computation in vector form.
Unfortunately, scipy.spatial.distance.cosine() does not accept higher dimensional input.
However, based on its definition, it is relatively simple to write a vectorized version of the cosine distance:
def cos_dist(a, b, axis=None):
    ab = np.sum(a * b, axis=axis)
    aa = np.sum(a * a, axis=axis)
    bb = np.sum(b * b, axis=axis)
    return 1 - (ab / np.sqrt(aa * bb))
With this, one can define a fully vectorized approach:
def moving_dist_cumsum_np(arr, dist=cos_dist):
    return dist(np.cumsum(arr, axis=0)[:-1], arr[1:], axis=1)
Note that the new definition of the cosine distance can be used just about anywhere else scipy.spatial.distance.cosine() was used, e.g.:
def moving_dist_cumsum2(arr, dist=cos_dist):
    movings = np.cumsum(arr, axis=0)[:-1]
    return np.array([dist(moving, arr[i]) for i, moving in enumerate(movings, 1)])
However, the vectorized version still has the shortcoming of requiring a potentially large (O(n)) temporary object to store the result of np.cumsum().
Fortunately, with a little more adaptation it is possible to write a Numba-accelerated version of this (similar to moving_dist_part()) that requires only O(1) temporary memory:
import numba as nb
@nb.njit
def cos_dist_nb(a, b):
    a = a.ravel()
    b = b.ravel()
    ab = aa = bb = 0
    n = len(a)
    for i in range(n):
        ab += a[i] * b[i]
        aa += a[i] * a[i]
        bb += b[i] * b[i]
    return 1 - (ab / (aa * bb) ** 0.5)

@nb.njit
def moving_dist_nb(arr, dist=cos_dist_nb):
    n, m = arr.shape
    result = np.empty(n - 1)
    moving = np.zeros(m)
    for i in range(n - 1):
        moving += arr[i, :]
        result[i] = dist(moving, arr[i + 1, :])
    return result
The above approaches can be benchmarked and plotted with the following (where smaller inputs are tested multiple times for more stable results):
import pandas as pd
import matplotlib.pyplot as plt
def benchmark(
        funcs,
        args=None,
        kws=None,
        ii=range(4, 15),
        m=16,
        kk=1024,
        is_equal=np.allclose,
        seed=0,
        unit="ms",
        verbose=True
):
    labels = [func.__name__ for func in funcs]
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    args = tuple(args) if args else ()
    kws = dict(kws) if kws else {}
    assert unit in units
    np.random.seed(seed)
    timings = {}
    for i in ii:
        n = 2 ** i
        k = 1 + i * kk // n
        if verbose:
            print(f"i={i}, n={n}, m={m}, k={k}")
        arrs = np.random.random((k, n, m))
        base = np.array([funcs[0](arr, *args, **kws) for arr in arrs])
        timings[n] = []
        for func in funcs:
            res = np.array([func(arr, *args, **kws) for arr in arrs])
            is_good = is_equal(base, res)
            timed = %timeit -n 1 -r 1 -q -o [func(arr, *args, **kws) for arr in arrs]
            timing = timed.best / k
            timings[n].append(timing if is_good else None)
            if verbose:
                print(
                    f"{func.__name__:>24}"
                    f" {is_good!s:5}"
                    f" {timing * (10 ** units[unit]):10.3f} {unit}"
                    f" {timings[n][0] / timing:5.1f}x")
    return timings, labels

def plot(timings, labels, xlabel="Input Size / #", unit="ms"):
    n_rows = 1
    n_cols = 3
    fig, axs = plt.subplots(n_rows, n_cols, figsize=(8 * n_cols, 6 * n_rows), squeeze=False)
    units = {"s": 0, "ms": 3, "µs": 6, "ns": 9}
    df = pd.DataFrame(data=timings, index=labels).transpose()
    base = df[[labels[0]]].to_numpy()
    (df * 10 ** units[unit]).plot(marker="o", xlabel=xlabel, ylabel=f"Best timing / {unit}", ax=axs[0, 0])
    (df / base * 100).plot(marker='o', xlabel=xlabel, ylabel='Relative speed / %', logx=True, ax=axs[0, 1])
    (base / df).plot(marker='o', xlabel=xlabel, ylabel='Speed Gain / x', ax=axs[0, 2])
    fig.patch.set_facecolor('white')
to be used as:
funcs = moving_dist_OP, moving_dist_simpler, moving_dist_part, moving_dist_cumsum, moving_dist_cumsum2, moving_dist_cumsum_np, moving_dist_nb
timings, labels = benchmark(funcs, unit="ms", verbose=True)
plot(timings, labels, "Benchmarks", unit="ms")
to obtain:
These results indicate that the Numba approach is by far the fastest, but the vectorized approach is still reasonably fast.
When it comes to explicit non-accelerated looping, it is still beneficial to use the custom-defined cos_dist() in place of scipy.spatial.distance.cosine() (see moving_dist_cumsum() vs moving_dist_cumsum2()), while np.cumsum() is reasonably faster than np.add.reduce() but only marginally faster than computing the partial sums. Finally, moving_dist_OP() and moving_dist_simpler() are effectively equivalent (as expected).
Given a numpy array items of shape (D, N, Q) and another array of indices ids of shape (N, P), how can I make a new array my_items of shape (D, N, P) by using the indices ids, like the following:
# How can these loops be avoided?
my_items = np.zeros((D, N, P))
for n in range(N):
    for p in range(P):
        my_items[:, n, p] = items[:, n, ids[n, p]]
with numpy magic instead of using any explicit loops? Here is a minimal example:
import numpy as np
D, N, Q, P = 2, 5, 4, 3 # Reduced problem dimensions.
items = 1.0 * np.arange(D * N * Q).reshape((D, N, Q)) # Example data
ids = np.arange(0, N * P).reshape(N, P) % Q # Example ids
# How can these loops be avoided?
my_items = np.zeros((D, N, P))
for n in range(N):
    for p in range(P):
        my_items[:, n, p] = items[:, n, ids[n, p]]
# print('items', items)
# print('ids', ids)
# print('my_items', my_items)
I would also like to preserve the element order if possible.
This should work now, returning the exact same ndarray as your loop:
np.stack([np.take(items[:, i, :], ids[i, :], axis=1)
          for i in range(ids.shape[0])], axis=2).transpose((0, 2, 1))
However, @hpaulj's method is faster (5 µs vs 23.5 µs for the above), so use that.
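For reference, the fully vectorized advanced-indexing approach being alluded to presumably looks something like the following sketch (my reconstruction, not a quote of that answer):
# Broadcasting a column of row indices against the (N, P) ids array picks
# items[:, n, ids[n, p]] for every (n, p) in one vectorized operation.
my_items = items[:, np.arange(N)[:, None], ids]

# Equivalent formulation (assuming a NumPy version that broadcasts the indices):
my_items = np.take_along_axis(items, ids[None, :, :], axis=2)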
I have a square matrix A and I want to create a matrix Z whose elements are zero everywhere except for the i'th row, which is the j'th row of matrix A.
I am aware of two ways to accomplish this. The first one is fairly straightforward and seems to be the most effective performance-wise:
def do_this(mx: np.array, i: int, j: int):
    Z = np.zeros_like(mx)
    Z[i, :] = mx[j, :]
    return Z
The other way, less straightforward and seemingly much less efficient, is to prepare a ref_mx matrix beforehand, which is a zero matrix of the same shape as A but has a 1 in its (i, j) position, and then to calculate Z as ref_mx @ A.
def do_this_other_way(mx: np.array, ref_mx: np.array):
    return ref_mx @ mx
I decided to benchmark both approaches:
from time import time
import numpy as np

n = 20
num_iters = 5000

A = np.random.rand(n, n)
i, j = 5, 10

t = time()
for _ in range(num_iters):
    Z = do_this(A, i, j)
print((time() - t) / num_iters)

ref_mx = np.zeros_like(A)
ref_mx[i, j] = 1

t = time()
for _ in range(num_iters):
    Z = do_this_other_way(A, ref_mx)
print((time() - t) / num_iters)
However, when A is relatively small (on my laptop that means a size below roughly 40), do_this_other_way wins, and when A has a size around 20 it wins by an order of magnitude.
That's it: I have doubts that I am doing it in the most efficient way possible in numpy. Is it possible to do it better without resorting to writing your own low-level implementation of do_this?
I need to minimize a cost function for a large number (1000s) of different inputs. Obviously, this can be implemented by looping over scipy.optimize.minimize or any other minimization routine. Here is an example:
import numpy as np
import scipy as sp
def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

a = np.random.randn(500, 40)
b = np.array(np.arange(500))

x = []
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, np.zeros(40), args=(a[None, i], b[None, i]))
    x.append(res.x)
It finds x[i, :] that minimize cost for each a[i, :] and b[i], but this is very slow. I guess looping over minimize causes considerable overhead.
A partial solution is to solve for all x simultaneously:
res = sp.optimize.minimize(cost, np.zeros_like(a), args=(a, b))
This is even slower than the loop. minimize does not know that the elements in x are group-wise independent, so it computes the full hessian even though a block-diagonal matrix would be sufficient given the problem structure. This is slow and overflows my computer's memory.
Is there any way to inform minimize or another optimization function about the problem structure so that it can solve multiple independent optimizations in a single function call? (Similar to certain options supported by Matlab's fsolve.)
First, a solution:
Turns out scipy.optimize.least_squares supports exploiting the structure of the jacobian by setting the jac_sparsity argument.
The least_squares function works slightly different than minimize so the cost function needs to be rewritten to return residuals instead:
def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b
The jacobian has block-diagonal sparsity structure, given by
jacs = sp.sparse.block_diag([np.ones((1, 40), dtype=bool)]*500)
And calling the optimization routine:
res = sp.optimize.least_squares(residuals, np.zeros(500 * 40),
                                jac_sparsity=jacs, args=(a, b))
x = res.x.reshape(500, 40)
But is it really faster?
%timeit opt1_loopy_min(a, b) # 1 loop, best of 3: 2.43 s per loop
%timeit opt2_loopy_min_start(a, b) # 1 loop, best of 3: 2.55 s per loop
%timeit opt3_loopy_lsq(a, b) # 1 loop, best of 3: 13.7 s per loop
%timeit opt4_dense_lsq(a, b) # ValueError: array is too big; ...
%timeit opt5_jacs_lsq(a, b) # 1 loop, best of 3: 1.04 s per loop
Conclusions:
There is no obvious difference between the original solution (opt1) and re-use of the starting point (opt2) without sorting.
looping over least_squares (opt3) is considerably slower than looping over minimize (opt1, opt2).
The problem is too big to naively run with least_squares because the jacobian matrix does not fit in memory.
Exploiting the sparsity structure of the jacobian in least_squares (opt5) seems to be the fastest approach.
This is the timing test environment:
import numpy as np
import scipy as sp
def cost(x, a, b):
    return np.sum((np.sum(a * x.reshape(a.shape), axis=1) - b)**2)

def residuals(x, a, b):
    return np.sum(a * x.reshape(a.shape), axis=1) - b

a = np.random.randn(500, 40)
b = np.arange(500)

def opt1_loopy_min(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return np.stack(x)

def opt2_loopy_min_start(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
        x0 = res.x
    return np.stack(x)

def opt3_loopy_lsq(a, b):
    x = []
    x0 = np.zeros(a.shape[1])
    for i in range(a.shape[0]):
        res = sp.optimize.least_squares(residuals, x0, args=(a[None, i], b[None, i]))
        x.append(res.x)
    return x

def opt4_dense_lsq(a, b):
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), args=(a, b))
    return res.x.reshape(a.shape)

def opt5_jacs_lsq(a, b):
    jacs = sp.sparse.block_diag([np.ones((1, a.shape[1]), dtype=bool)] * a.shape[0])
    res = sp.optimize.least_squares(residuals, np.zeros(a.size), jac_sparsity=jacs, args=(a, b))
    return res.x.reshape(a.shape)
I guess looping over minimize causes considerable overhead.
Wrong guess. The time required for minimizing a function dwarfs any loop overhead. There is no vectorization magic for this problem.
Some time can be saved by using a better starting point of minimization. First, sort the parameters so that consecutive loops have similar parameters. Then use the end point of previous minimization as a starting point of the next one:
a = np.sort(np.random.randn(500, 40), axis=0) # sorted parameters
b = np.arange(500) # no need for np.array here, np.arange is already an ndarray
x0 = np.zeros(40)
for i in range(a.shape[0]):
res = minimize(cost, x0, args=(a[None, i], b[None, i]))
x.append(res.x)
x0 = res.x
This saves 30-40 percent of execution time in my test.
Another, minor, optimization is to preallocate an ndarray of the appropriate size for the resulting x values, instead of using a list and its append method. Before the loop: x = np.zeros((500, 40)); within the loop: x[i, :] = res.x.
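Putting both suggestions together, a minimal sketch of the warm-started, preallocated loop could look like this (it assumes the cost function defined above; the exact speed-up will vary):
import numpy as np
import scipy as sp

a = np.sort(np.random.randn(500, 40), axis=0)  # sort so consecutive problems are similar
b = np.arange(500)

x = np.zeros_like(a)                 # preallocated result array instead of list + append
x0 = np.zeros(a.shape[1])            # starting point, reused across iterations
for i in range(a.shape[0]):
    res = sp.optimize.minimize(cost, x0, args=(a[None, i], b[None, i]))
    x[i, :] = res.x                  # write the solution in place
    x0 = res.x                       # warm start the next minimization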