RegularGridInterpolator excruciatingly slower than interp2d - python

Consider the following code example:
# %%
import numpy as np
from scipy.interpolate import interp2d, RegularGridInterpolator
x = np.arange(9000)
y = np.arange(9000)
z = np.random.randint(-1000, high=1000, size=(9000, 9000))
f = interp2d(x, y, z, kind='linear', copy=False)
f2 = RegularGridInterpolator((x, y), z, "linear")
mx, my = np.meshgrid(x, y)
M = np.stack([mx, my], axis=-1)
# %%
%timeit f(x, y)
# %%
%timeit f2(M)
It sets up some example interpolators using scipy.interpolate.interp2d and scipy.interpolate.RegularGridInterpolator. The output of the two cells above is
1.09 s ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
10 s ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
respectively.
The RegularGridInterpolator call is about 10 times slower than interp2d. The problem is that interp2d was marked as deprecated in scipy 1.10.0, and new code is supposed to use RegularGridInterpolator instead. That seems strange to me, since it would be such a poor replacement. Is there perhaps a problem in my code example above? How can I speed this interpolation up?

There is no problem with your code; this is most likely a performance bug in scipy.
I've reported it on GitHub.
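If you need the grid-on-grid evaluation that interp2d performed, one possible workaround in the meantime (my suggestion, not part of the original answer) is scipy.interpolate.RectBivariateSpline with kx=1, ky=1, which does bilinear interpolation and evaluates directly on a rectangular grid, without building the huge (9000*9000, 2) point array:
from scipy.interpolate import RectBivariateSpline

# Bilinear spline over the same data; z is expected with shape (len(x), len(y)).
f3 = RectBivariateSpline(x, y, z, kx=1, ky=1)
# Evaluates on the full x-by-y grid (note that the axis-order convention
# differs from interp2d, which indexes z as z[y, x]).
grid_values = f3(x, y)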


Can this matrix calculation be implemented or approximated without an intermediate 3D matrix?

Given an NxN matrix W, I'm looking to calculate an NxN matrix C given by the equation in this link: https://i.stack.imgur.com/dY7rY.png, or in LaTeX
$$C_{ij} = \max_k \bigg\{ \sum_l \bigg( W_{ik}W_{kl}W_{lj} - W_{ik}W_{kj} \bigg) \bigg\}.$$
I have tried to implement this in PyTorch, but I either run into memory problems by constructing an intermediate NxNxN 3D tensor which, for large N, causes my GPU to run out of memory, or I use a for-loop over k, which is then very slow. I can't work out how to get around these. How might I implement this calculation, or an approximation of it, without a large intermediate matrix like this?
Suggestions, pseudocode in any language or an implementation in any of Python/Numpy/PyTorch would be much appreciated.
The formula can be simplified as
$$C_{ij} = \max_k \big( W_{ik} M_{kj} \big),$$
where
$$M = WW - N\,W,$$
with $N$ the size of the matrix $W$ and $WW$ the usual matrix product.
Then, in the formula above, for every i, j there is an independent maximum to be computed. Without knowing further properties of W, it is in general not possible to further simplify the problem. So, after computing the matrix M, you can do a loop over i and j, and compute the maximum.
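As an illustration (my own sketch, not part of the original answers), the simplification can also be applied directly in NumPy by chunking over rows of W, so the intermediate array stays at chunk x N x N instead of N x N x N:
import numpy as np

def max_product_chunked(W, row_chunk=64):
    # C[i, j] = max_k W[i, k] * M[k, j], computed in row chunks to avoid
    # materializing the full N x N x N intermediate.
    N = W.shape[0]
    M = W @ W - N * W
    C = np.empty_like(W)
    for start in range(0, N, row_chunk):
        stop = min(start + row_chunk, N)
        # (chunk, N, 1) * (1, N, N) -> (chunk, N, N), reduced over k
        C[start:stop] = np.max(W[start:stop, :, None] * M[None, :, :], axis=1)
    return C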
A first solution using Numba (you could do the same with Cython or plain C) is to formulate the problem with simple loops.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_1(W):
    C = np.empty_like(W)
    N = W.shape[0]
    for i in nb.prange(N):
        TMP = np.empty(N, dtype=W.dtype)
        for j in range(N):
            for k in range(N):
                acc = 0
                for l in range(N):
                    acc += W[i, k] * W[k, l] * W[l, j] - W[i, k] * W[k, j]
                TMP[k] = acc
            C[i, j] = np.max(TMP)
    return C
Francesco provided a simplification which scales far better for larger array sizes. This leads to the following, where I also optimized away a small temporary array.
@nb.njit(fastmath=True, parallel=True)
def calc_2(W):
    C = np.empty_like(W)
    N = W.shape[0]
    M = np.dot(W, W) - N * W
    for i in nb.prange(N):
        for j in range(N):
            val = W[i, 0] * M[0, j]
            for k in range(1, N):
                TMP = W[i, k] * M[k, j]
                if TMP > val:
                    val = TMP
            C[i, j] = val
    return C
This can be optimized further by partial loop unrolling and optimizing the array access. Some compilers may do this automatically.
@nb.njit(fastmath=True, parallel=True)
def calc_3(W):
    C = np.empty_like(W)
    N = W.shape[0]
    W = np.ascontiguousarray(W)
    # Work with the transposed M so the inner loop walks contiguous memory
    M = np.dot(W.T, W.T) - W.shape[0] * W.T
    for i in nb.prange(N // 4):
        for j in range(N):
            val_1 = W[i * 4 + 0, 0] * M[j, 0]
            val_2 = W[i * 4 + 1, 0] * M[j, 0]
            val_3 = W[i * 4 + 2, 0] * M[j, 0]
            val_4 = W[i * 4 + 3, 0] * M[j, 0]
            for k in range(1, N):
                TMP_1 = W[i * 4 + 0, k] * M[j, k]
                TMP_2 = W[i * 4 + 1, k] * M[j, k]
                TMP_3 = W[i * 4 + 2, k] * M[j, k]
                TMP_4 = W[i * 4 + 3, k] * M[j, k]
                if TMP_1 > val_1:
                    val_1 = TMP_1
                if TMP_2 > val_2:
                    val_2 = TMP_2
                if TMP_3 > val_3:
                    val_3 = TMP_3
                if TMP_4 > val_4:
                    val_4 = TMP_4
            C[i * 4 + 0, j] = val_1
            C[i * 4 + 1, j] = val_2
            C[i * 4 + 2, j] = val_3
            C[i * 4 + 3, j] = val_4
    # Remainder rows not covered by the unrolled loop
    for i in range(N // 4 * 4, N):
        for j in range(N):
            val = W[i, 0] * M[j, 0]
            for k in range(1, N):
                TMP = W[i, k] * M[j, k]
                if TMP > val:
                    val = TMP
            C[i, j] = val
    return C
Timings
W=np.random.rand(100,100)
%timeit calc_1(W)
#16.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(W)
#449 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#259 µs ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
W=np.random.rand(2000,2000)
#Temporary array would be 64GB in this case
%timeit calc_2(W)
#5.37 s ± 174 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#596 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
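A quick consistency check (my addition, not part of the original answer) confirms the three variants agree up to floating-point rounding:
W = np.random.rand(100, 100)
assert np.allclose(calc_1(W), calc_2(W))
assert np.allclose(calc_2(W), calc_3(W))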

Why random write of numpy array element is much slower than random read?

The following code is part of an image binarization algorithm. I am trying to improve the performance of this algorithm, but I find the bottleneck in a strange place. (Irrelevant parts of the algorithm are omitted.)
import numpy as np
from numba import jit

@jit(nopython=True)
def make_buckets(img):
    height, width = img.shape
    buckets = np.zeros(256)
    # buckets = np.full(256, 0)  # I tried both; sometimes one works better than the other
    for h in range(height):
        for w in range(width):
            g = img[h, w]
            a = buckets[g] + 1
            # buckets[g] = 1
            # buckets[g] = a

img = np.random.randint(0, 256, (224, 224))
%timeit make_buckets(img)
The runtime of the above code is:
539 ns ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
which is fine. It means that getting/accessing buckets[g] inside a loop is not demanding at all. Let's see what happens if I uncomment this line: buckets[g] = 1
for h in range(height):
    for w in range(width):
        g = img[h, w]
        a = buckets[g] + 1
        buckets[g] = 1
        # buckets[g] = a
The runtime is,
27.8 µs ± 865 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This means setting buckets[g] inside a loop increases runtime drastically.
Now, if I uncomment the buckets[g] = a line like this:
for h in range(height):
    for w in range(width):
        g = img[h, w]
        a = buckets[g] + 1
        # buckets[g] = 1
        buckets[g] = a
The runtime increases to:
37.8 µs ± 1.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
This increase in runtime is consistent, and depending on the img matrix it can take >100 µs. It also means that setting a matrix element to a variable is much slower than setting it to a constant. Can someone explain why setting a matrix element to a variable in a loop increases the runtime so drastically? Also, can you suggest whether there is any way to improve the performance beyond this?
I am doing machine learning with millions of images (of handwritten documents), so my data generator should be fast enough that it does not bottleneck the GPU. The data-generator pipeline performs multiple image manipulations, so I want to optimize each operation as much as possible. This binarization is applied to every image, so I want to bring the runtime down to a few microseconds.
I think you can replace your code with this:
def make_buckets(img):
    return np.bincount(img.reshape(-1), minlength=256)
Please try that and let us know the timings.
Ref: https://numpy.org/doc/stable/reference/generated/numpy.bincount.html
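As a quick sanity check (my addition, not part of the original answer), the bincount version should return one count per grey level, and the counts should sum to the number of pixels:
import numpy as np

img = np.random.randint(0, 256, (224, 224))
buckets = make_buckets(img)
assert buckets.shape == (256,)
assert buckets.sum() == img.size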

Python: Fastest way of packing a 2d array of binary values into UINT64 array

I have a 2D UINT8 numpy array of size (149797, 64). Each element is either 0 or 1. I want to pack the binary values in each row into a UINT64 value, so that I get a UINT64 array of shape (149797,) as a result. I tried the following code using the numpy packbits function.
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
col_pack=np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
The packbits call takes about 10 ms to execute. Simply reshaping this array already seems to take around 7 ms. I also tried iterating over the 2D numpy array using shift operations to achieve the same result, but there was no speed improvement.
Finally, I also want to compile it for the CPU using numba.
from numba import njit
import numpy as np

@njit
def shifting(bitlist):
    x = np.zeros(149797, dtype=np.uint64)
    rows, cols = bitlist.shape
    for i in range(0, rows):
        out = 0
        for bit in range(0, cols):
            out = (out << 1) | bitlist[i][bit]  # If I comment out bitlist, time = 190 microsec
        x[i] = np.uint64(out)  # Time drops to microseconds if this line is commented out in njit
    return x
It takes about 6 ms using njit.
Here is the parallel njit version
from numba import njit, prange

@njit(parallel=True)
def shifting(bitlist):
    rows, cols = 149797, 64
    out = 0
    z = np.zeros(rows, dtype=np.uint64)
    for i in prange(rows):
        for bit in range(cols):
            z[i] = (z[i] * 2) + bitlist[i, bit]  # Time becomes 100 microsec if I use 'out' instead of the z[i] array
    return z
It's slightly better, with 3.24 ms execution time (Google Colab, dual core 2.2 GHz).
Currently, the Python solution using the byteswap method (Paul's answer) seems to be the best one, i.e. 1.74 ms.
How can we further speed up this conversion? Is there scope for using any vectorization (or parallelization), bitarrays, etc., to achieve a speedup?
Ref: numpy packbits pack to uint16 array
On a 12-core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz):
Paul's method: 1595.0 microseconds (it does not use multiple cores, I suppose)
Numba code: 146.0 microseconds (the aforementioned parallel Numba version)
i.e. around a 10x speedup!
You can get a sizeable speedup by using byteswap instead of reshaping etc.:
from timeit import timeit

test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)
np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
timeit(lambda: np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64), number=100)
# 1.1054180909413844
timeit(lambda: np.packbits(test).view(np.uint64).byteswap(), number=100)
# 0.18370431219227612
A bit faster Numba solution (version 0.46/Windows).
Code
import numpy as np
import numba as nb

# With memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1] == 64
    x = np.empty(bitlist.shape[0], dtype=np.uint64)
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x

# Without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist, x):
    assert bitlist.shape[1] == 64
    for i in nb.prange(bitlist.shape[0]):
        out = np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i, bit]
        x[i] = out
    return x
Timings
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
# If you call this function multiple times, allocating the output
# array only once may be enough
x = np.empty(test.shape[0], dtype=np.uint64)
# Warmup: the first call takes significantly longer (compilation)
res = shifting(test)
res = shifting_2(test, x)
%timeit res=shifting(test)
#976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=shifting_2(test,x)
#764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.packbits(test).view(np.uint64).byteswap()
#8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
#17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
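As a sanity check (my addition, not from the original answer), the Numba loop and the packbits/byteswap route should agree bit for bit on a little-endian machine:
test = np.random.randint(0, 2, (149797, 64), dtype=np.uint8)
ref = np.packbits(test).view(np.uint64).byteswap()  # byteswap assumes a little-endian host
assert np.array_equal(shifting(test), ref)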

Distance matrix for custom distance

From what I understand, the scipy function scipy.spatial.distance_matrix returns the Minkowski distance for any pair of vectors from the provided matrices of vectors. Is there a way to get the same result for a different distance? Something that would look like distance_matrix(X, Y, distance_function) ?
I assume that scipy does some sort of optimization under the hood. Since I am dealing with very large vectors, I would rather not lose the benefit of these optimizations by implementing my own distance_matrix function.
It is quite straightforward to implement this yourself, and the performance will very likely be better than the distance functions already implemented in scipy.
Most of the distance functions apply one function to all component pairs and sum the results, e.g. (A_ik-B_jk)**n for the Minkowski distance; at the end another function is applied to the sum, e.g. acc**(1/n).
Template function
You don't have to change anything here to implement various distance functions.
import numpy as np
import numba as nb

def gen_cust_dist_func(kernel_inner, kernel_outer, parallel=True):
    kernel_inner_nb = nb.njit(kernel_inner, fastmath=True, inline='always')
    kernel_outer_nb = nb.njit(kernel_outer, fastmath=True, inline='always')

    def cust_dot_T(A, B):
        assert B.shape[1] == A.shape[1]
        out = np.empty((A.shape[0], B.shape[0]), dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                acc = 0
                for k in range(A.shape[1]):
                    acc += kernel_inner_nb(A[i, k], B[j, k])
                out[i, j] = kernel_outer_nb(acc)
        return out

    if parallel == True:
        return nb.njit(cust_dot_T, fastmath=True, parallel=True)
    else:
        return nb.njit(cust_dot_T, fastmath=True, parallel=False)
Examples and Timings
#Implement, for example, a Minkowski distance and a Euclidean distance
#Minkowski distance p=20
inner=lambda A,B:(A-B)**20
outer=lambda acc:acc**(1./20)
my_minkowski_dist=gen_cust_dist_func(inner,outer,parallel=True)
#Euclidean distance
inner=lambda A,B:(A-B)**2
outer=lambda acc:np.sqrt(acc)
my_euclidian_dist=gen_cust_dist_func(inner,outer,parallel=True)
from scipy.spatial.distance import cdist
A=np.random.rand(1000,50)
B=np.random.rand(1000,50)
#Minkowski p=20
%timeit res_1=cdist(A,B,'m',p=20)
#1.44 s ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=my_minkowski_dist(A,B)
#10.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res_1=cdist(A,B,'m',p=20)
res_2=my_minkowski_dist(A,B)
print(np.allclose(res_1,res_2))
#True
#Euclidean
%timeit res_1=cdist(A,B,'euclidean')
#39.3 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2=my_euclidian_dist(A,B)
#3.61 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res_1=cdist(A,B,'euclidean')
res_2=my_euclidian_dist(A,B)
print(np.allclose(res_1,res_2))
#True
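The same template covers other distances as well; for instance, a Manhattan (cityblock) distance only needs a different pair of kernels (my own illustration, not from the original answer):
# Manhattan / cityblock distance with the same template
inner = lambda A, B: abs(A - B)
outer = lambda acc: acc
my_cityblock_dist = gen_cust_dist_func(inner, outer, parallel=True)
print(np.allclose(my_cityblock_dist(A, B), cdist(A, B, 'cityblock')))  # expected: True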

A faster numpy.polynomial?

I have a very simple problem: in my python toolbox, I have to compute the values of polynomials (usually degree 3 or 2, seldom others, always integer degree) from a large vector (size >> 10^6). Storing the result in a buffer is not an option because I have several of these vectors so I would quickly run out of memory, and I usually have to compute it only once in any case. The performance of numpy.polyval is actually quite good, but still this is my bottleneck. Can I somehow make the evaluation of the polynomial faster?
Addendum
I think that the pure-numpy solution of Joe Kington is good for me, in particular because it avoids potential issues at installation time of other libraries or cython. For those who asked, the numbers in the vector are large (order 10^4), so I don't think that the suggested approximations would work.
You actually can speed it up slightly by doing the operations in place (or by using numexpr or numba, which will automatically do what I'm doing manually below).
numpy.polyval is a very short function. Leaving out a few type checks, etc, it amounts to:
def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.
What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:
def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y
As an example, let's say we have the following input:
# Third-order polynomial
p = [4.5, 9.8, -9.2, 1.2]
# One-million-element array
x = np.linspace(-10, 10, int(1e6))
The results are identical:
In [3]: np_result = np.polyval(p, x)
In [4]: new_result = faster_polyval(p, x)
In [5]: np.allclose(np_result, new_result)
Out[5]: True
And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):
In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop
In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop
For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
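As noted at the start of this answer, numba can do essentially the same in-place Horner evaluation; a minimal sketch of how that might look (my own assumption of a typical formulation, not code from either answer):
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def numba_polyval(p, x):
    # Horner's rule per element; p holds coefficients from highest to
    # lowest degree, like np.polyval.
    y = np.empty_like(x)
    for i in nb.prange(x.shape[0]):
        acc = 0.0
        for c in p:
            acc = acc * x[i] + c
        y[i] = acc
    return y

# Usage: p must be a float array for Numba, e.g.
# y = numba_polyval(np.array([4.5, 9.8, -9.2, 1.2]), x)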
I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
It is interesting to see that simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimisation by numpy.)
So here is my comparison with np.polynomial.polynomial.polyval, plus a slightly faster version.
def fastest_polyval(x, a):
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y
It avoids the initial zero array and runs the loop one iteration fewer.
y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1] # reverse coefficients
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)
np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)
# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
