Firstly, I'd like to apologize for the badly worded title - I can't currently think of a better way to phrase it. Basically, I'm wondering if there's a faster way to implement an array operation in Python where each operation depends on previous outputs in an iterative fashion (e.g. forward differencing operations, filtering, etc.) - in other words, operations of a form like:
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
Where X is an array of values and Y is the output. In this case, Y[0] is assumed to be known or calculated separately before the above loop. My question is: is there any NumPy functionality to speed up this sort of self-referential loop? It is the major bottleneck in almost all my scripts. I know NumPy routines benefit from being implemented in C, so I was curious if anyone knew of any NumPy routines that would help here. Failing that, are there better ways to program this loop (in Python) that would speed up its execution for large array sizes (>500,000 data points)?
Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!
But you have some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but stay with me - I'll present some benchmarks at the end of the answer 1):
X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
If you also use direct iteration over the lists, it could be even faster:
X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
Y[idx] = X_n + X_n_m1 + Y_n_m1
Then there are more sophisticated options that require additional packages, most notably Cython and Numba, which are designed to work on the array elements directly and avoid Python overhead wherever possible. For example, with Numba you can just use your approach inside a jitted (just-in-time compiled) function:
import numba as nb
@nb.njit
def func(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
Here X and Y can be NumPy arrays, but numba will work on the buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
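For completeness, here is a small usage sketch of my own (not from the original answer); the jitted function above modifies Y in place, so only the initial value has to be set beforehand:

import numpy as np

X = np.random.random(500_000)
Y = np.empty_like(X)
Y[0] = 0.0        # assumed known / computed separately, as in the question
func(X, Y)        # first call compiles; subsequent calls run the compiled loop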
Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. Without conda, though, it can be hard to install numba... YMMV
However here's also a Cython version of the code (compiled using IPython magic, it's a bit different if you're not using IPython):
In [1]: %load_ext cython
In [2]: %%cython
...:
...: cimport cython
...:
...: @cython.boundscheck(False)
...: @cython.wraparound(False)
...: cpdef cython_indexing(double[:] X, double[:] Y):
...: cdef Py_ssize_t n
...: for n in range(1, len(X)):
...: Y[n] = X[n] + X[n - 1] + Y[n-1]
...: return Y
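If you are not working in IPython, a minimal setup.py along these lines can compile the same code from a .pyx file (the file name cython_indexing.pyx is just an assumption for illustration):

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("cython_indexing.pyx"))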
Just to give an example of the timings (based on the timing framework from my answer to another question):
import numpy as np
import numba as nb
import scipy.signal
def numpy_indexing(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def list_indexing(X, Y):
X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def list_direct(X, Y):
X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
Y[idx] = X_n + X_n_m1 + Y_n_m1
return Y
@nb.njit
def numba_indexing(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def numpy_cumsum(X, Y):
Y[1:] = X[1:] + X[:-1]
np.cumsum(Y, out=Y)
return Y
def scipy_lfilter(X, Y):
a = [1, -1]
b = [1, 1]
return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)
# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))
# Timing setup
timings = {numpy_indexing: [],
list_indexing: [],
list_direct: [],
numba_indexing: [],
numpy_cumsum: [],
scipy_lfilter: [],
cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
X = np.random.random(size=size)
Y = np.zeros(size)
Y[0] = np.random.random()
for func in timings:
res = %timeit -o func(X, Y)
timings[func].append(res)
# Plotting absolute times
%matplotlib notebook
import matplotlib.pyplot as plt
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
ax.plot(sizes,
[time.best for time in timings[func]],
label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
# Plotting relative times
fig = plt.figure(1)
ax = plt.subplot(111)
baseline = numba_indexing # choose one function as baseline
for func in timings:
ax.plot(sizes,
[time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()
plt.tight_layout()
With the following results:
[Plot: absolute runtimes]
[Plot: relative runtimes, compared to the numba function]
So, just by converting the arrays to lists you will be roughly 3 times faster! By iterating directly over these lists you get another (smaller) speedup - only 20% in this benchmark - but now we're almost 4 times faster than the original solution. With numba you can speed it up by a factor of more than 100 compared to the list operations! And Cython is only a bit slower than numba (~40-50%), probably because I haven't squeezed out every possible optimization you could with Cython (usually it's not more than 10-20% slower). However, for large arrays the difference gets smaller.
1 I went into more detail in another answer. That Q+A was about converting to a set, but because set uses a (hidden) "manual iteration" it also applies here.
I included the timings for the NumPy cumsum and SciPy lfilter approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the numba function. However, if I interpret the question correctly, you are looking for general approaches, not only ones that apply to this specific example. Not every self-referencing loop can be implemented using cum* functions from NumPy or SciPy's filters. But even where they can, it seems like they can't compete with Cython and/or numba.
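To illustrate that last point, here is a hypothetical recurrence of my own (not from the question) that is nonlinear in Y and therefore has no cumsum/cumprod or lfilter equivalent, while the numba approach handles it unchanged:

import numba as nb

@nb.njit
def nonlinear_recurrence(X, Y):
    # Y[n] depends on the square of the previous output -> no cum* shortcut
    for n in range(1, len(X)):
        Y[n] = X[n] + 0.5 * Y[n - 1] ** 2
    return Y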
It's pretty simple using np.cumsum: since Y[n] - Y[n-1] = X[n] + X[n-1], Y is just the cumulative sum of the pairwise sums X[n] + X[n-1], with the known initial value placed in the first slot:
#!/usr/bin/env python3
import numpy as np
import random
def r():
return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()
# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)
# original method
for n in range(1, len(X)):
slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]
assert (fast_Y == slow_Y).all()
The situation you describe is basically a discrete filter operation. This is implemented in scipy.signal.lfilter, which for first-order coefficients computes a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] - a[1]*y[n-1]. The particular recurrence you describe therefore corresponds to a = [1, -1] and b = [1, 1].
import numpy as np
import scipy.signal
a = [1, -1]
b = [1, 1]
X = np.random.random(10000)
Y = np.zeros(10000)
newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
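A quick sanity check (my addition, not part of the original answer) that this reproduces the loop from the question:

Y_loop = np.empty_like(X)
Y_loop[0] = Y[0]
for n in range(1, len(X)):
    Y_loop[n] = X[n] + X[n - 1] + Y_loop[n - 1]
np.testing.assert_allclose(newY, Y_loop)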
On my computer, the timings work out as follows:
%timeit func4(X, Y.copy())
# 100000 loops, best of 3: 14.6 µs per loop
%timeit newY = scipy.signal.lfilter(b, a, X) + (Y[0] - X[0])
# 10000 loops, best of 3: 68.1 µs per loop
I'm trying to implement a differential in python via numpy that can accept a scalar, a vector, or a matrix.
import numpy as np
def foo_scalar(x):
f = x * x
df = 2 * x
return f, df
def foo_vector(x):
f = x * x
n = x.size
df = np.zeros((n, n))
for mu in range(n):
for i in range(n):
if mu == i:
df[mu, i] = 2 * x[i]
return f, df
def foo_matrix(x):
f = x * x
m, n = x.shape
df = np.zeros((m, n, m, n))
for mu in range(m):
for nu in range(n):
for i in range(m):
for j in range(n):
if (mu == i) and (nu == j):
df[mu, nu, i, j] = 2 * x[i, j]
return f, df
This works fine, but it seems like there should be a way to do this in a single function, and let numpy "figure out" the correct dimensions. I could force everything into a 2-D array form with something like
x = np.array(x)
if len(x.shape) == 0:
x = x.reshape(1, 1)
elif len(x.shape) == 1:
x = x.reshape(-1, 1)
if len(f.shape) == 0:
f = f.reshape(1, 1)
elif len(f.shape) == 1:
f = f.reshape(-1, 1)
and always have 4 nested for loops, but this doesn't scale if I need to generalize to higher-order tensors.
Is what I'm trying to do possible, and if so, how?
I highly doubt there is a NumPy function that generates the second array returned by your functions directly. That being said, you can play with NumPy and Python features to vectorize this and make the function faster. You first need to generate the indices, then generate the target matrix and set it. Note that operating on generic N-dimensional arrays tends to be slow and tricky in non-trivial cases. The * unpacking operator is used to pass N parameters.
def foo_generic(x):
f = x ** 2
idx = np.stack(np.meshgrid(*[np.arange(e) for e in x.shape], indexing='ij'))
idx = tuple(np.concatenate((idx, idx)).reshape(2*x.ndim, -1))
df = np.zeros([*x.shape, *x.shape])
df[idx] = 2 * x.ravel()
return f, df
Note that foo_generic does not support scalars (and it would be very inefficient to use it for them anyway), but you can add a condition to handle this special case separately, as sketched below.
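A minimal sketch of such a wrapper (the name foo_any is hypothetical, not from the original answer):

def foo_any(x):
    x = np.asarray(x, dtype=float)
    if x.ndim == 0:              # scalar special case handled separately
        return x ** 2, 2 * x
    return foo_generic(x)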
The df array very quickly becomes huge for higher orders, so I strongly advise you not to use dense arrays for this: the number of zeros is huge compared to the number of non-zero values already in the matrix case (for a 5x5 matrix, more than 95% of the entries are zeros). Not to mention that the array quickly becomes enormous, and filling a huge array full of zeros is not efficient. Sparse matrices fix this.
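Since the Jacobian of the elementwise square with respect to the flattened input is simply a diagonal matrix, one way to avoid the dense N-dimensional array is to return it in sparse form. A sketch of my own, assuming SciPy is available:

import numpy as np
from scipy import sparse

def foo_sparse(x):
    # df with respect to the flattened x is diag(2 * x.ravel());
    # reshape to (*x.shape, *x.shape) only if the dense form is really needed.
    f = x ** 2
    df_flat = sparse.diags(2 * np.asarray(x, dtype=float).ravel())
    return f, df_flat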
I tried to model voxels of 3D cylinder with the following code:
import math
import numpy as np
R0 = 500
hz = 1
x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)
xx, yy, zz = np.meshgrid(x, y, z)
def density_f(x, y, z):
r_xy = math.sqrt(x ** 2 + y ** 2)
if r_xy <= R0 and -hz <= z <= hz:
return 1
else:
return 0
density = np.vectorize(density_f)(xx, yy, zz)
and it took many minutes to compute.
Equivalent suboptimal Java code runs 10-15 seconds.
How to make Python compute voxels at the same speed? Where to optimize?
Please do not use .vectorize(..); it is not efficient, since it still does the processing at the Python level. .vectorize() should only be used as a last resort if, for example, the function cannot be calculated in "bulk" because its "structure" is too complex.
But you do not need .vectorize here; you can implement your function to work over whole arrays with:
r_xy = np.sqrt(xx ** 2 + yy ** 2)
density = (r_xy <= R0) & (-hz <= zz) & (zz <= hz)
or even a bit faster:
r_xy = xx * xx + yy * yy
density = (r_xy <= R0 * R0) & (-hz <= zz) & (zz <= hz)
This will construct a 2000×2000×20 array of booleans. We can use:
intdens = density.astype(int)
to construct an array of ints.
Printing the array here is quite cumbersome, but it contains a total of 2'356'047 ones:
>>> density.astype(int).sum()
2356047
Benchmarks: if I run the two variants above (wrapped in functions f and f2) 10 times each locally, I get:
>>> timeit(f, number=10)
18.040479518999973
>>> timeit(f2, number=10) # f2 is the optimized variant
13.287886952000008
So on average, we calculate this matrix (including casting it to ints) in 1.3-1.8 seconds.
You can also use a compiled version of the function to calculate the density. You can use cython or numba for that. I use numba to JIT-compile the density calculation function in this answer, as it is as easy as adding a decorator.
Pros:
You can write if conditions, as you mention in your comments.
Slightly faster than the numpy version mentioned in the answer by @Willem Van Onsem, since there we have to iterate through the boolean array to calculate the sum in density.astype(int).sum().
Cons:
You have to write an ugly three-level loop, losing the beauty of the one-liner numpy solution.
Code:
import numba as nb
@nb.jit(nopython=True, cache=True)
def calc_density(xx, yy, zz, R0, hz):
threshold = R0 * R0
dimensions = xx.shape
density = 0
for i in range(dimensions[0]):
for j in range(dimensions[1]):
for k in range(dimensions[2]):
r_xy = xx[i][j][k] ** 2 + yy[i][j][k] ** 2
if(r_xy <= threshold and -hz <= zz[i][j][k] <= hz):
density+=1
return density
Running times:
Willem Van Onsem's solution, f2 variant: 1.28 s without the sum, 2.01 s with the sum.
Numba solution (calc_density, on the second run, to discount the compile time): 0.48 s.
As suggested in the comments, we need not calculate the meshgrid at all. We can pass x, y, z directly to the function. Thus:
@nb.jit(nopython=True, cache=True)
def calc_density2(x, y, z, R0, hz):
threshold = R0 * R0
dimensions = len(x), len(y), len(z)
density = 0
for i in range(dimensions[0]):
for j in range(dimensions[1]):
for k in range(dimensions[2]):
r_xy = x[i] ** 2 + y[j] ** 2
if(r_xy <= threshold and -hz <= z[k] <= hz):
density+=1
return density
Now, for a fair comparison, we also include the time of np.meshgrid in @Willem Van Onsem's answer.
Running times:
Willem Van Onsem's solution, f2 variant (np.meshgrid time included): 2.24 s
Numba solution (calc_density2, on the second run, to discount the compile time): 0.079 s.
This is meant as a lengthy comment on the answer of Deepak Saini.
The main change is not to use the coordinates generated by np.meshgrid, which contain unnecessary repetitions. This isn't advisable if you can avoid it (both in terms of memory usage and performance).
Code
import numba as nb
import numpy as np
@nb.jit(nopython=True, parallel=True)
def calc_density_2(x, y, z,R0,hz):
threshold = R0 * R0
density = 0
for i in nb.prange(y.shape[0]):
for j in range(x.shape[0]):
r_xy = x[j] ** 2 + y[i] ** 2
for k in range(z.shape[0]):
if(r_xy <= threshold and -hz <= z[k] <= hz):
density+=1
return density
Timings
R0 = 500
hz = 1
x = np.arange(-1000, 1000, 1)
y = np.arange(-1000, 1000, 1)
z = np.arange(-10, 10, 1)
xx, yy, zz = np.meshgrid(x, y, z)
#after the first call (compilation overhead)
#calc_density_2 9.7 ms
#calc_density_2 parallel 3.9 ms
#@Deepak Saini 115 ms
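For reference (my own usage note, not part of the original comment), the parallel function is called with the 1-D coordinate arrays directly, and its result should equal density.astype(int).sum() from the earlier answer:

print(calc_density_2(x, y, z, R0, hz))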
In the following code I have written two methods that, theoretically (in my mind), should do the same thing. Unfortunately they don't, and I am unable to work out from the numpy documentation why they don't.
import numpy as np
dW = np.zeros((20, 10))
y = [1 for _ in range(100)]
X = np.ones((100, 20))
# ===================
# Method 1 (works!)
# ===================
for i in range(len(y)):
dW[:, y[i]] -= X[i]
# ===================
# Method 2 (does not work)
# ===================
dW[:, y] -= X.T
As indicated, in principle you cannot operate multiple times on the same element within a single operation, due to how buffering works in NumPy. For that purpose there is the at method, which is available on just about any NumPy ufunc (np.add, np.subtract, etc.). For your case, you can do:
import numpy as np
dW = np.zeros((20, 10))
y = [1 for _ in range(100)]
X = np.ones((100, 20))
# at modifies in place dW, does not return a new array
np.subtract.at(dW, (slice(None), y), X.T)
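A quick check (my addition) that the unbuffered np.subtract.at update matches the explicit loop from Method 1:

dW_loop = np.zeros((20, 10))
for i in range(len(y)):
    dW_loop[:, y[i]] -= X[i]
assert np.allclose(dW, dW_loop)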
This is a column-wise version of this question.
The answer there can be adapted to work column-wise as follows:
Approach 1: np.<ufunc>.at
>>> np.subtract.at(dW, (slice(None), y), X.T)
Approach 2: np.bincount
>>> m, n = dW.shape
>>> dW -= np.bincount(np.add.outer(np.arange(m) * n, y).ravel(), (X.T).ravel(), dW.size).reshape(m, n)
Please note that the bincount based solution - even though it involves more steps - is faster by a factor of ~6.
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=5000)
>>>
>>> repeat('np.subtract.at(dW, (slice(None), y), X.T); np.add.at(dW, (slice(None), y), X.T)', **kwds)
[1.590626839082688, 1.5769231889862567, 1.5802007300080732]
>>> repeat('_= dW; _ -= np.bincount(np.add.outer(np.arange(m) * n, y).ravel(), (X.T).ravel(), dW.size).reshape(m, n); _ += np.bincount(np.add.outer(np.arange(m) * n, y).ravel(), (X.T).ravel(), dW.size).reshape(m, n)', **kwds)
[0.2582490430213511, 0.25572817400097847, 0.25478115503210574]
I found a third solution to this problem, using ordinary matrix multiplication:
ind = np.zeros((X.shape[0],dW.shape[1]))
ind[range(X.shape[0]),y] = -1
dW = X.T.dot(ind)
I did some experiments using the methods proposed above on some neural network data. In my example X.shape = (500, 3073), W.shape = (3073, 10) and ind.shape = (500, 10).
The subtract version takes about 0.2 s (slowest). The matrix multiplication method takes 0.01 s (fastest), the normal loop 0.015 s, and the bincount method 0.04 s. Note that in the question y is a vector of ones; that is not my case. The case with only ones can be solved with a simple sum.
Option 1:
for i in range(len(y)):
dW[:, y[i]] -= X[i]
This works because you are looping through, and each update sees the value written by the previous iteration.
Option 2:
dW[:, [1,1,1,1,....1,1,1]] -= [[1,1,1,1...1],
[1,1,1,1...1],
.
.
[1,1,1,1...1]]
It does not work because the updates to column 1 all happen at the same time ("in parallel"), not serially: each subtraction starts from the original value rather than the accumulated one. Initially all entries are 0, so subtracting leaves -1. See the small demonstration below.
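A minimal demonstration of that buffering behaviour (my own example, not from the original answer):

import numpy as np

a = np.zeros(3)
a[[0, 0, 0]] -= 1                 # buffered fancy indexing: applied only once
print(a)                          # [-1.  0.  0.]

np.subtract.at(a, [0, 0, 0], 1)   # unbuffered: applied three times
print(a)                          # [-4.  0.  0.]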
I have the following code. It is taking forever in Python. There must be a way to translate this calculation into a broadcast...
def euclidean_square(a,b):
squares = np.zeros((a.shape[0],b.shape[0]))
for i in range(squares.shape[0]):
for j in range(squares.shape[1]):
diff = a[i,:] - b[j,:]
sqr = diff**2.0
squares[i,j] = np.sum(sqr)
return squares
You can use np.einsum after calculating the differences in a broadcasted way, like so -
ab = a[:,None,:] - b
out = np.einsum('ijk,ijk->ij',ab,ab)
Or use scipy's cdist with its optional metric argument set as 'sqeuclidean' to give us the squared euclidean distances as needed for our problem, like so -
from scipy.spatial.distance import cdist
out = cdist(a,b,'sqeuclidean')
I collected the different methods proposed here, and in two other questions, and measured the speed of the different methods:
import numpy as np
import scipy.spatial
import sklearn.metrics
def dist_direct(x, y):
d = np.expand_dims(x, -2) - y
return np.sum(np.square(d), axis=-1)
def dist_einsum(x, y):
d = np.expand_dims(x, -2) - y
return np.einsum('ijk,ijk->ij', d, d)
def dist_scipy(x, y):
return scipy.spatial.distance.cdist(x, y, "sqeuclidean")
def dist_sklearn(x, y):
return sklearn.metrics.pairwise.pairwise_distances(x, y, "sqeuclidean")
def dist_layers(x, y):
res = np.zeros((x.shape[0], y.shape[0]))
for i in range(x.shape[1]):
res += np.subtract.outer(x[:, i], y[:, i])**2
return res
# inspired by the excellent https://github.com/droyed/eucl_dist
def dist_ext1(x, y):
nx, p = x.shape
x_ext = np.empty((nx, 3*p))
x_ext[:, :p] = 1
x_ext[:, p:2*p] = x
x_ext[:, 2*p:] = np.square(x)
ny = y.shape[0]
y_ext = np.empty((3*p, ny))
y_ext[:p] = np.square(y).T
y_ext[p:2*p] = -2*y.T
y_ext[2*p:] = 1
return x_ext.dot(y_ext)
# https://stackoverflow.com/a/47877630/648741
def dist_ext2(x, y):
return np.einsum('ij,ij->i', x, x)[:,None] + np.einsum('ij,ij->i', y, y) - 2 * x.dot(y.T)
I use timeit to compare the speed of the different methods. For the comparison, I use vectors of length 10, with 100 vectors in the first group, and 1000 vectors in the second group.
import timeit
p = 10
x = np.random.standard_normal((100, p))
y = np.random.standard_normal((1000, p))
for method in dir():
if not method.startswith("dist_"):
continue
t = timeit.timeit(f"{method}(x, y)", number=1000, globals=globals())
print(f"{method:12} {t:5.2f}ms")
On my laptop, the results are as follows:
dist_direct 5.07ms
dist_einsum 3.43ms
dist_ext1 0.20ms <-- fastest
dist_ext2 0.35ms
dist_layers 2.82ms
dist_scipy 0.60ms
dist_sklearn 0.67ms
While the two methods dist_ext1 and dist_ext2, both based on the idea of writing (x-y)**2 as x**2 - 2*x*y + y**2, are very fast, there is a downside: When the distance between x and y is very small, due to cancellation error the numerical result can sometimes be (very slightly) negative.
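If that matters downstream (for example when taking square roots), a common workaround, offered here as my own suggestion rather than part of the benchmark, is to clamp the result at zero:

out = dist_ext2(x, y)
out = np.maximum(out, 0.0)   # remove tiny negative values caused by cancellation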
Another solution besides using cdist is the following
difference_squared = np.zeros((a.shape[0], b.shape[0]))
for dimension_iterator in range(a.shape[1]):
difference_squared = difference_squared + np.subtract.outer(a[:, dimension_iterator], b[:, dimension_iterator])**2.
This problem is similar to an earlier question, Fast interpolation over 3D array, but the answers there cannot solve my problem.
I have a 4D array with dimensions (time, altitude, latitude, longitude), marked as y.shape = (nt, nalt, nlat, nlon). The x is altitude and changes with (time, latitude, longitude), which means x.shape = (nt, nalt, nlat, nlon). I want to interpolate in altitude for every (nt, nlat, nlon). The interpolated x_new should be 1-D and not change with (time, latitude, longitude).
I use numpy.interp, which works the same as scipy.interpolate.interp1d, and I have considered the answers in the former post, but I cannot reduce the loops with those answers.
I can only do like this:
# y is a 4D ndarray
# x is a 4D ndarray
# new_y is a 4D array
for i in range(nlon):
for j in range(nlat):
for k in range(nt):
y_new[k,:,j,i] = np.interp(new_x, x[k,:,j,i], y[k,:,j,i])
These loops make this interpolation too slow to be practical. Would someone have good ideas? Help will be highly appreciated.
Here is my solution using numba; it's about 3x faster.
Create the test data first; x needs to be in ascending order:
import numpy as np
rows = 200000
cols = 66
new_cols = 69
x = np.random.rand(rows, cols)
x.sort(axis=-1)
y = np.random.rand(rows, cols)
nx = np.random.rand(new_cols)
nx.sort()
Do the interp 200000 times in numpy:
%%time
ny = np.empty((x.shape[0], len(nx)))
for i in range(len(x)):
ny[i] = np.interp(nx, x[i], y[i])
I use a merge method instead of a binary search method, because nx is in order, and the length of nx is about the same as that of x.
interp() uses binary search; the time complexity is O(len(nx) * log2(len(x))).
Merge method: the time complexity is O(len(nx) + len(x)).
Here is the numba code:
import numba
@numba.jit("f8[::1](f8[::1], f8[::1], f8[::1], f8[::1])")
def interp2(x, xp, fp, f):
n = len(x)
n2 = len(xp)
j = 0
i = 0
while x[i] <= xp[0]:
f[i] = fp[0]
i += 1
slope = (fp[j+1] - fp[j])/(xp[j+1] - xp[j])
while i < n:
if x[i] >= xp[j] and x[i] < xp[j+1]:
f[i] = slope*(x[i] - xp[j]) + fp[j]
i += 1
continue
j += 1
if j + 1 == n2:
break
slope = (fp[j+1] - fp[j])/(xp[j+1] - xp[j])
while i < n:
f[i] = fp[n2-1]
i += 1
@numba.jit("f8[:, ::1](f8[::1], f8[:, ::1], f8[:, ::1])")
def multi_interp(x, xp, fp):
nrows = xp.shape[0]
f = np.empty((nrows, x.shape[0]))
for i in range(nrows):
interp2(x, xp[i, :], fp[i, :], f[i, :])
return f
Then call the numba function:
%%time
ny2 = multi_interp(nx, x, y)
To check the result:
np.allclose(ny, ny2)
On my pc, the time is:
python version: 3.41 s
numba version: 1.04 s
This method needs arrays whose last axis is the axis to be interpolated over (see the sketch below for adapting it to the 4-D case).
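As a sketch of how that maps back to the 4-D problem from the question (my own adaptation, with small illustrative sizes and hypothetical array names x4/y4): move the altitude axis last, flatten the remaining axes, and reshape afterwards.

import numpy as np

nt, nalt, nlat, nlon = 4, 66, 8, 10                            # illustrative sizes
x4 = np.sort(np.random.rand(nt, nalt, nlat, nlon), axis=1)     # altitude ascending
y4 = np.random.rand(nt, nalt, nlat, nlon)
new_x = np.sort(np.random.rand(69))

# Move altitude last; reshape produces C-contiguous copies, as the jit signature expects.
x2 = np.moveaxis(x4, 1, -1).reshape(-1, nalt)
y2 = np.moveaxis(y4, 1, -1).reshape(-1, nalt)

ny = multi_interp(new_x, x2, y2)                               # (nt*nlat*nlon, len(new_x))
y_new = np.moveaxis(ny.reshape(nt, nlat, nlon, -1), -1, 1)     # (nt, len(new_x), nlat, nlon)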