From what I understand, the scipy function scipy.spatial.distance_matrix returns the Minkowski distance for any pair of vectors from the provided matrices of vectors. Is there a way to get the same result for a different distance? Something that would look like distance_matrix(X, Y, distance_function) ?
I assume that scipy does some sort of optimization under the hood. Since I am dealing with very large vectors, I would rather not lose the benefit of these optimizations by implementing my own distance_matrix function.
It is quite straightforward to implement yourself, and the performance will very likely be better than the distance functions already implemented in scipy.
Most distance functions apply one function to all component pairs and sum the results, e.g. (A_ik-B_jk)**n for the Minkowski distance, and at the end apply some other function to the accumulated value, e.g. acc**(1/n).
Template function
You don't have to change anything here to implement various distance functions.
import numpy as np
import numba as nb

def gen_cust_dist_func(kernel_inner,kernel_outer,parallel=True):
    kernel_inner_nb=nb.njit(kernel_inner,fastmath=True,inline='always')
    kernel_outer_nb=nb.njit(kernel_outer,fastmath=True,inline='always')
    def cust_dot_T(A,B):
        assert B.shape[1]==A.shape[1]
        out=np.empty((A.shape[0],B.shape[0]),dtype=A.dtype)
        for i in nb.prange(A.shape[0]):
            for j in range(B.shape[0]):
                acc=0
                for k in range(A.shape[1]):
                    acc+=kernel_inner_nb(A[i,k],B[j,k])
                out[i,j]=kernel_outer_nb(acc)
        return out
    if parallel==True:
        return nb.njit(cust_dot_T,fastmath=True,parallel=True)
    else:
        return nb.njit(cust_dot_T,fastmath=True,parallel=False)
Examples and Timings
#Implement, for example, a Minkowski distance and a Euclidean distance

#Minkowski distance p=20
inner=lambda A,B:(A-B)**20
outer=lambda acc:acc**(1./20)
my_minkowski_dist=gen_cust_dist_func(inner,outer,parallel=True)

#Euclidean distance
inner=lambda A,B:(A-B)**2
outer=lambda acc:np.sqrt(acc)
my_euclidian_dist=gen_cust_dist_func(inner,outer,parallel=True)
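Any other distance that fits this inner/outer pattern works the same way. As a small illustrative sketch of my own (not part of the timings below; the name my_manhattan_dist is hypothetical):

#Manhattan (L1) distance: absolute difference as inner kernel, identity as outer kernel
inner=lambda A,B:abs(A-B)
outer=lambda acc:acc
my_manhattan_dist=gen_cust_dist_func(inner,outer,parallel=True)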
from scipy.spatial.distance import cdist
A=np.random.rand(1000,50)
B=np.random.rand(1000,50)
#Minkowski p=20
%timeit res_1=cdist(A,B,'m',p=20)
#1.44 s ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=my_minkowski_dist(A,B)
#10.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res_1=cdist(A,B,'m',p=20)
res_2=my_minkowski_dist(A,B)
print(np.allclose(res_1,res_2))
#True
#Euclidean
%timeit res_1=cdist(A,B,'euclidean')
#39.3 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2=my_euclidian_dist(A,B)
#3.61 ms ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
res_1=cdist(A,B,'euclidean')
res_2=my_euclidian_dist(A,B)
print(np.allclose(res_1,res_2))
#True
Consider the following code example:
# %%
import numpy as np
from scipy.interpolate import interp2d, RegularGridInterpolator

x = np.arange(9000)
y = np.arange(9000)
z = np.random.randint(-1000, high=1000, size=(9000, 9000))

f = interp2d(x, y, z, kind='linear', copy=False)
f2 = RegularGridInterpolator((x, y), z, "linear")

mx, my = np.meshgrid(x, y)
M = np.stack([mx, my], axis=-1)
# %%
%timeit f(x, y)
# %%
%timeit f2(M)
It sets up some example interpolators using scipy.interpolate.interp2d and scipy.interpolate.RegularGridInterpolator. The output of the two cells above is
1.09 s ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and
10 s ± 17.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
respectively.
The RegularGridInterpolator is about 10 times slower than interp2d. The problem is that interp2d has been marked as deprecated since scipy 1.10.0, and new code is supposed to use RegularGridInterpolator. This seems strange to me, since that would make it a rather poor replacement. Is there perhaps a problem in my code example above? How can I speed this interpolation up?
There is no problem with your code; it's probably a bug in scipy.
I've reported it on GitHub.
Given an NxN matrix W, I'm looking to calculate an NxN matrix C given by the equation in this link: https://i.stack.imgur.com/dY7rY.png, or in LaTeX
$$C_{ij} = \max_k \bigg\{ \sum_l \bigg( W_{ik}W_{kl}W_{lj} - W_{ik}W_{kj} \bigg) \bigg\}.$$
I have tried to implement this in PyTorch, but I have either run into memory problems, by constructing an intermediate NxNxN tensor which, for large N, causes my GPU to run out of memory, or used a for-loop over k, which is then very slow. I can't work out how to get around these. How might I implement this calculation, or an approximation of it, without such a large intermediate?
Suggestions, pseudocode in any language, or an implementation in any of Python/Numpy/PyTorch would be much appreciated.
The formula can be simplified to
C_ij = max_k ( W_ik M_kj )
where
M = W @ W - N * W
with N the size of the matrix W and W @ W the usual matrix product: the sum over l turns W_ik*W_kl*W_lj into W_ik*(W @ W)_kj, and the l-independent term W_ik*W_kj is summed N times, giving N*W_ik*W_kj.
Then, in the formula above, there is an independent maximum to compute for every i, j. Without knowing further properties of W, it is in general not possible to simplify the problem further. So, after computing the matrix M, you can loop over i and j and compute the maximum.
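To check the simplification numerically, here is a small sketch of my own (small N only, since the direct formula still builds the NxNxN intermediate):

import numpy as np

rng = np.random.default_rng(0)
N = 20
W = rng.random((N, N))

#Direct evaluation: C_ij = max_k sum_l (W_ik*W_kl*W_lj - W_ik*W_kj)
direct = (np.einsum('ik,kl,lj->ikj', W, W, W)
          - N * np.einsum('ik,kj->ikj', W, W)).max(axis=1)

#Simplified evaluation via M = W @ W - N * W
M = W @ W - N * W
simplified = (W[:, :, None] * M[None, :, :]).max(axis=1)

print(np.allclose(direct, simplified)) # True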
A first solution using Numba (you can do the same using Cython or plain C) is to formulate the problem with simple loops.
import numpy as np
import numba as nb

@nb.njit(fastmath=True,parallel=True)
def calc_1(W):
    C=np.empty_like(W)
    N=W.shape[0]
    for i in nb.prange(N):
        TMP=np.empty(N,dtype=W.dtype)
        for j in range(N):
            for k in range(N):
                acc=0
                for l in range(N):
                    acc+=W[i,k]*W[k,l]*W[l,j]-W[i,k]*W[k,j]
                TMP[k]=acc
            C[i,j]=np.max(TMP)
    return C
Francesco provided a simplification which scales far better for larger array sizes. This leads to the following, where I also optimized away a small temporary array.
@nb.njit(fastmath=True,parallel=True)
def calc_2(W):
    C=np.empty_like(W)
    N=W.shape[0]
    M = np.dot(W,W) - N * W
    for i in nb.prange(N):
        for j in range(N):
            val=W[i,0]*M[0,j]
            for k in range(1,N):
                TMP=W[i,k]*M[k,j]
                if TMP>val:
                    val=TMP
            C[i,j]=val
    return C
This can be optimized further by partial loop unrolling and optimizing the array access. Some compilers may do this automatically.
@nb.njit(fastmath=True,parallel=True)
def calc_3(W):
    C=np.empty_like(W)
    N=W.shape[0]
    W=np.ascontiguousarray(W)
    #Work on the transposed M so that M[j,k] is a contiguous access
    M = np.dot(W.T,W.T) - W.shape[0] * W.T
    for i in nb.prange(N//4):
        for j in range(N):
            val_1=W[i*4+0,0]*M[j,0]
            val_2=W[i*4+1,0]*M[j,0]
            val_3=W[i*4+2,0]*M[j,0]
            val_4=W[i*4+3,0]*M[j,0]
            for k in range(1,N):
                TMP_1=W[i*4+0,k]*M[j,k]
                TMP_2=W[i*4+1,k]*M[j,k]
                TMP_3=W[i*4+2,k]*M[j,k]
                TMP_4=W[i*4+3,k]*M[j,k]
                if TMP_1>val_1:
                    val_1=TMP_1
                if TMP_2>val_2:
                    val_2=TMP_2
                if TMP_3>val_3:
                    val_3=TMP_3
                if TMP_4>val_4:
                    val_4=TMP_4
            C[i*4+0,j]=val_1
            C[i*4+1,j]=val_2
            C[i*4+2,j]=val_3
            C[i*4+3,j]=val_4
    #Remainder loop for the rows not covered by the unrolling
    for i in range(N//4*4,N):
        for j in range(N):
            val=W[i,0]*M[j,0]
            for k in range(1,N):
                TMP=W[i,k]*M[j,k]
                if TMP>val:
                    val=TMP
            C[i,j]=val
    return C
Timings
W=np.random.rand(100,100)
%timeit calc_1(W)
#16.8 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_2(W)
#449 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#259 µs ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
W=np.random.rand(2000,2000)
#Temporary array would be 64GB in this case
%timeit calc_2(W)
#5.37 s ± 174 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit calc_3(W)
#596 ms ± 30.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a 2D UINT8 numpy array of shape (149797, 64). Each element is either 0 or 1. I want to pack the binary values in each row into a UINT64 value, so that I get a UINT64 array of shape (149797,) as a result. I tried the following code using the numpy packbits function.
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
col_pack=np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
The packbits call takes about 10 ms to execute; a simple reshape of this array alone already seems to take around 7 ms. I also tried iterating over the 2D array using shift operations to achieve the same result, but there was no speed improvement.
Finally, I also want to compile it using Numba for the CPU.
from numba import njit
import numpy as np

@njit
def shifting(bitlist):
    x=np.zeros(149797,dtype=np.uint64)
    rows,cols=bitlist.shape
    for i in range(0,rows):
        out=0
        for bit in range(0,cols):
            out = (out << 1) | bitlist[i][bit] # If I comment out the bitlist access, time=190 microsec
        x[i]=np.uint64(out) # Time drops to microseconds if this line is commented out
    return x
It takes about 6 ms using njit.
Here is the parallel njit version:

from numba import njit, prange

@njit(parallel=True)
def shifting(bitlist):
    rows,cols=149797,64
    z=np.zeros(rows,dtype=np.uint64)
    for i in prange(rows):
        for bit in range(cols):
            z[i] = (z[i] * 2) + bitlist[i,bit] # Time becomes ~100 microsec if a scalar 'out' is used instead of 'z[i]'
    return z
It's slightly better, with 3.24 ms execution time (Google Colab, dual core, 2.2 GHz).
Currently, the numpy solution with the byteswap method (Paul's, below) seems to be the best one, i.e. 1.74 ms.
How can we further speed up this conversion? Is there scope for using vectorization (or parallelization), bitarrays, etc., to achieve a speedup?
Ref: numpy packbits pack to uint16 array
On a 12-core machine (Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz):
Paul's method: 1595.0 microseconds (it does not use multiple cores, I suppose)
Numba code: 146.0 microseconds (the aforementioned parallel Numba version)
i.e. around a 10x speedup!
You can get a sizeable speedup by using byteswap instead of reshaping etc.:
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
np.packbits(test).view(np.uint64).byteswap()
# array([ 1079982015491401631,   246233595099746297, 16216705265283876830,
#        ...,  1943876987915462704, 14189483758685514703,
#        12753669247696755125], dtype=uint64)
from timeit import timeit
timeit(lambda:np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64),number=100)
# 1.1054180909413844
timeit(lambda:np.packbits(test).view(np.uint64).byteswap(),number=100)
# 0.18370431219227612
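To see why the byteswap is needed (a small sanity check of my own, assuming a little-endian machine such as x86): packbits puts the first bit of each row into the most significant bit of the first byte, so viewing the eight bytes as a little-endian uint64 reverses their significance, and byteswap() restores the intended value.

row = np.zeros((1, 64), dtype=np.uint8)
row[0, 63] = 1 # only the last bit set, so the packed value should be 1
np.packbits(row).view(np.uint64)            # array([72057594037927936]), i.e. 1 << 56
np.packbits(row).view(np.uint64).byteswap() # array([1], dtype=uint64)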
A bit faster Numba solution (version 0.46, Windows).
Code
import numpy as np
import numba as nb

#with memory allocation
@nb.njit(parallel=True)
def shifting(bitlist):
    assert bitlist.shape[1]==64
    x=np.empty(bitlist.shape[0],dtype=np.uint64)
    for i in nb.prange(bitlist.shape[0]):
        out=np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i,bit]
        x[i]=out
    return x

#without memory allocation
@nb.njit(parallel=True)
def shifting_2(bitlist,x):
    assert bitlist.shape[1]==64
    for i in nb.prange(bitlist.shape[0]):
        out=np.uint64(0)
        for bit in range(bitlist.shape[1]):
            out = (out << 1) | bitlist[i,bit]
        x[i]=out
    return x
Timings
test = np.random.randint(0, 2, (149797, 64),dtype=np.uint8)
#If you call this function multiple times, only allocating memory
#once may be enough
x=np.empty(test.shape[0],dtype=np.uint64)
#Warmup: the first call takes significantly longer (compilation)
res=shifting(test)
res=shifting_2(test,x)
%timeit res=shifting(test)
#976 µs ± 41.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit res=shifting_2(test,x)
#764 µs ± 63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.packbits(test).view(np.uint64).byteswap()
#8.07 ms ± 52.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.packbits(test.reshape(-1, 8, 8)[:, ::-1]).view(np.uint64)
#17.9 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I am trying to generate a bunch of random numbers quickly to do a MCMC.
I have the following benchmarks:
from numba import njit, prange
import numpy as np

@njit
def getRandos(n):
    for i in prange(n):
        a = np.random.rand()
%timeit np.random.rand(1000000000)
13.1 s ± 287 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit getRandos(1000000000)
1.97 s ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Clearly the parallelization improves my runtime. However, I don't know how the seeding of the random number generation works. How can I ensure that these numbers are truly random? Do I have to randomly choose a seed somehow?
You don't have an apples-to-apples comparison. The first call, np.random.rand(1000000000), spends a ton of time allocating space and storing the random numbers, while the second call, getRandos(1000000000), just generates values and drops them.
Here is the apples-to-apples comparison (which runs at about the same speed):
from numba import prange, njit
import numpy as np
@njit
def getRandos(n):
    a = np.zeros(n)
    for i in prange(n):
        a[i] = np.random.rand()
    return a
%timeit -n 100 getRandos(100000)
%timeit -n 100 np.random.rand(100000)
To answer your question, however, see the numba documentation here.
Numba doesn't let you create individual RandomState instances, but you can set the seed inside the jitted function.
@njit
def getRandos(n):
    np.random.seed(1111)
    a = np.zeros(n)
    for i in prange(n):
        a[i] = np.random.rand()
    return a
values = getRandos(100000)
values2 = getRandos(100000)
print(all(values == values2)) # True
I have a very simple problem: in my Python toolbox, I have to compute the values of polynomials (usually of degree 2 or 3, seldom others, always of integer degree) on a large vector (size >> 10^6). Storing the result in a buffer is not an option, because I have several of these vectors and would quickly run out of memory, and I usually have to compute each polynomial only once in any case. The performance of numpy.polyval is actually quite good, but it is still my bottleneck. Can I somehow make the evaluation of the polynomial faster?
Addendum
I think the pure-numpy solution of Joe Kington is good for me, in particular because it avoids potential installation issues with other libraries or Cython. For those who asked: the numbers in the vector are large (on the order of 10^4), so I don't think the suggested approximations would work.
You actually can speed it up slightly by doing the operations in-place (or using numexpr or numba which will automatically do what I'm doing manually below).
numpy.polyval is a very short function. Leaving out a few type checks, etc, it amounts to:
def polyval(p, x):
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.
What I'm about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we'll have to assume floating-point output instead of letting the upcasting rules determine the output's dtype. However, it will speed this up slightly and make it use less memory:
def faster_polyval(p, x):
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y
As an example, let's say we have the following input:
# Third-order polynomial
p = [4.5, 9.8, -9.2, 1.2]

# One-million element array
x = np.linspace(-10, 10, int(1e6))  # num must be an integer in current numpy
The results are identical:
In [3]: np_result = np.polyval(p, x)
In [4]: new_result = faster_polyval(p, x)
In [5]: np.allclose(np_result, new_result)
Out[5]: True
And we get a modest 2-3x speedup (which is mostly independent of array size, as it relates to memory allocation, not number of operations):
In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop
In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop
For really huge inputs, the memory usage difference will matter more than the speed differences. The "bare" numpy version will use ~2x more memory at peak usage than the faster_polyval version.
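As mentioned at the top, numexpr can do this kind of loop fusion automatically. Here is a rough sketch of my own for a degree-3 polynomial (assuming numexpr is installed; it was not part of the benchmarks above):

import numexpr as ne
import numpy as np

p = [4.5, 9.8, -9.2, 1.2]
x = np.linspace(-10, 10, int(1e6))

#Horner's scheme as a single fused expression; numexpr evaluates it
#blockwise, so no full-size temporary arrays are created.
y = ne.evaluate("((p0 * x + p1) * x + p2) * x + p3",
                local_dict={'x': x, 'p0': p[0], 'p1': p[1],
                            'p2': p[2], 'p3': p[3]})
print(np.allclose(y, np.polyval(p, x))) # True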
I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
And it is interesting to see that simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimization by numpy.)
So here is my comparison with np.polynomial.polynomial.polyval, plus a slightly faster version.
def fastest_polyval(x, a):
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y
It avoids the initial zero array and needs one fewer loop iteration.
y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1] # reversed coefficients: np.polynomial.polynomial.polyval expects lowest order first
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)
np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)
# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)