Is it possible to "vectorize" successive matrix multiplications? - python

I am in the process of "vectorizing" a pet project of mine. The function propagateXi propagates (solves) a time-varying linear system of equations and accounts for roughly 20% of the project's running time.
The sizes vary a lot between instances of the problem. A "typical" set would be: N = 501, n = 4, s = 3.
Here is a simplification of the code:
import numpy

class Propagator():
    def __init__(self, mode):
        # lots of code here
        pass

    def propagateXi(self, Xi, terms):
        """ This function propagates the linear system, called many times with different initial conditions.
        The initial conditions were previously set.
        The "arc" represents a kind of "realization";
        the arcs are essentially independent from one another.
        Sizes: "vector dimension": 2*n, "time dimension": N, "arcs dimension": s
        - Xi (N, 2*n, s) stores the (vector) state at each time and in each "arc".
        - self.MainPropMat (N, 2*n, 2*n, s) stores the 2n x 2n matrices
        - terms (N, 2*n, s) stores the "forcing terms" of the equation.
        The equation would be something like:
        \\xi_{k+1, arc} = MainPropMat_{k, arc} * \\xi_{k, arc} + terms_{k, arc} \\forall k, arc
        """
        for k in range(self.N - 1):
            Xi[k+1, :, :] = numpy.einsum('ijs,js->is', self.MainPropMat[k, :, :, :], Xi[k, :, :]) + \
                            terms[k, :, :]
        return Xi
I have tried to re-interpret the problem as the multiplication of a huge 2nN x 2nN matrix by a long 2nN column containing the initial conditions and the forcing terms. Under that interpretation, the code looks something like this:
import numpy

class Propagator():
    def __init__(self, mode):
        # lots of code here
        self.prepareBigMat()

    def prepareBigMat(self):
        """
        Prepare the "big matrix" for Xi propagation. Called only once.
        This function assembles the "big matrix" (s x 2*n*N x 2*n*N !) for propagating
        the Xi differential equation in a single step.
        :return: None
        """
        n2 = 2 * self.n
        BigMat = numpy.empty((self.s, self.N * n2, self.N * n2))
        I = numpy.eye(n2 * self.N)
        for arc in range(self.s):
            BigMat[arc, :, :] = I
        for k in range(1, self.N):
            kn2 = k * n2
            BigMat[:, kn2:(kn2 + n2), 0:kn2] = numpy.einsum('ijs,sjk->sik',
                                                            self.MainPropMat[k-1, :, :, :],
                                                            BigMat[:, (kn2 - n2):kn2, 0:kn2])
        self.BigMat = BigMat

    def propagateXi(self, Xi, terms):
        n2 = self.n * 2
        # Assemble the big column
        BigCol = numpy.empty((n2 * self.N, self.s))
        # Initial conditions
        BigCol[:n2, :] = Xi[0, :, :]
        # Forcing terms
        BigCol[n2:, :] = terms.reshape((n2 * (self.N - 1), self.s))
        # Perform the multiplication and reshaping of the Xi array
        Xi = numpy.einsum('sij,js->is', self.BigMat, BigCol).reshape((self.N, n2, self.s))
        return Xi
The second version does precisely the same thing as the first one. Sadly, this second version takes almost 5x the running time of the previous one. My guess is that the cost of assembly of the Big Matrix is too great...
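One quick way to check that guess is to time the per-call work of the two formulations on synthetic stand-in data (a sketch only; I shrink N from 501 to 201 so the throwaway dense matrix stays small, and the random arrays merely mimic the real shapes):
# Sketch with synthetic data: per-call cost of the sequential einsum loop
# versus a single dense "big matrix" einsum. All arrays are random stand-ins.
import numpy
from timeit import timeit

N, n, s = 201, 4, 3
n2 = 2 * n
rng = numpy.random.default_rng(0)
MainPropMat = rng.standard_normal((N, n2, n2, s)) * 0.1  # damped so the recursion stays bounded
Xi = rng.standard_normal((N, n2, s))
terms = rng.standard_normal((N, n2, s))
BigMat = rng.standard_normal((s, N * n2, N * n2))        # stand-in for self.BigMat
BigCol = rng.standard_normal((N * n2, s))

def loop_version():
    for k in range(N - 1):
        Xi[k + 1] = numpy.einsum('ijs,js->is', MainPropMat[k], Xi[k]) + terms[k]

def big_version():
    return numpy.einsum('sij,js->is', BigMat, BigCol)

print("sequential loop:", timeit(loop_version, number=20))
print("big-matrix call:", timeit(big_version, number=20))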
Any ideas?
Thanks!

Related

Output for an implementation of a Monte Carlo numerical solver for a 1D potential well doesn't match expected output

I am trying to implement a numerical solver for a 1D harmonic well's ground state using the Metropolis algorithm and the Feynman path integral technique in Python. When I run my program, I end up with a distribution of the different points that my particle has visited; this distribution ought to match that of a particle trapped in a 1D harmonic well. It does not. I have gone through and rewritten my code; I have checked it against similar code used for the same purpose; it all looks like it should work, yet it doesn't.
In blue is the histogram of my results, with Density set to True; the orange line is the function describing the expected distribution
As can be seen in the image, what I have ended up with is a distribution that isn't dissimilar to what I was expecting, but it isn't the correct distribution. The code I used (see below) is based on Lepage's (2005) work on the same topic, although I used a slightly different formula to describe the same physical system.
import numpy as np
import random
import matplotlib.pyplot as plt
time = 4 #time over which we evolve our function
steps = 7 #number of steps we take
epsilon = 3 #the pos & neg bounds of our rand variable
N_cor = 100 #the number of times we need to thermalise our function before we take a path
N_cf = 20000 #the number of paths we take
def S(x, j, t, s): #the action of our potential well
    e = t / s
    return (1/(2*e))*(x[j] - x[j - 1])**2 + ((x[j] + x[j-1])/2)**2/2

def update(x, t, s, eps):
    for j in range(0, s):
        old_x = x[j] #old x value
        old_Sj = S(x, j, t, s) #original action value
        x[j] = x[j] + random.uniform(-eps, eps) #new x value
        dS = S(x, j, t, s) - old_Sj #change in action
        if dS > 0 and np.exp(-dS) < random.uniform(0, 1): #check for Metropolis alg
            x[j] = old_x
    return x

def gamma(t, s, eps, thermal_num, num_paths):
    zeros = np.zeros(s) #our initial path with s steps
    gamma_arr = np.empty(0) #our initial empty result
    for i in range(0, 10*thermal_num): #thermalisation
        zeros = update(zeros, t, s, eps)
    for j in range(0, num_paths):
        for i in range(0, thermal_num): #thermalising again
            zeros = update(zeros, t, s, eps)
        gamma_arr = np.append(gamma_arr, zeros) #add new path post thermalising
        #print(zeros)
        #print(gamma_arr)
    return gamma_arr
test = gamma(time, steps, epsilon, N_cor, N_cf)
x = np.arange(-4, 4, 0.1)
y = 1/np.sqrt(np.pi)*np.exp(-(x**2)) #expected result
plt.hist(test, bins= 500, density = True)
plt.plot(x, y)
plt.show()

Is there a DP solution for my subset average problem?

I have a combinatorics problem that I can't solve.
Given a set of vectors and a target vector, return a scalar for each vector, so that the average of the scaled vectors in the set is closest to the target.
Edit: Weights w_i are in range [0, 1]. This is a constrained optimisation problem:
minimise d(avg(w_i * x_i), target)
subject to sum(w_i) - 1 = 0
If I had to name this problem, it would be "unbounded subset average".
I have looked at the unbounded knapsack and similar problems, but a dynamic programming implementation seems to be impossible due to the interdependence of the numbers.
I also implemented a genetic algorithm that is able to approximate the weights moderately well, but it takes too long and I was initially hoping to solve the problem using dynamic programming.
Is there any hope?
Visualization
In a 2D space the solution to the problem can be represented like this
Problem class identification
As recognized by others, this is an optimization problem. You have linear constraints and a convex objective function, so it can be cast as a quadratic program (see the Least squares section).
Casting to standard form
If you want to minimize the average of w[i] * x[i], this is sum(w[i] * x[i]) / N. If you arrange w[i] as the elements of a (1 x N_vectors) matrix, and each vector x[i] as the i-th row of a (N_vectors x DIM) matrix, it becomes w @ X / N_vectors (with @ being the matrix product operator).
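As a tiny shape check of that arrangement (made-up numbers, 3 vectors in 2D):
# Weighted average of the rows of X, written as a matrix product.
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])       # one vector per row, shape (N_vectors, DIM)
w = np.array([[0.2, 0.3, 0.5]])  # weights, shape (1, N_vectors)
print((w @ X) / X.shape[0])      # weighted average, shape (1, DIM)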
To cast the problem to that standard form you would have to construct a matrix A and vector b so that each row of A*w < b expresses one constraint: -w[i] < 0 for every i, while the equality sum(w) = 1 becomes the pair sum(w) < 1 and -sum(w) < -1. But there are amazing tools to automate this part.
Implementation
This can be readily implemented using cvxpy, and you don't have to care about expanding all the constraints.
The following function solves the problem and, if the vectors have dimension 2, plots the result.
import cvxpy
import numpy as np
import matplotlib.pyplot as plt

def place_there(X, target):
    # some linear algebra arrangements
    target = target.reshape((1, -1))
    ncols = target.shape[1]
    X = np.array(X).reshape((-1, ncols))
    N_vectors = X.shape[0]
    # variable of the problem
    w = cvxpy.Variable((1, X.shape[0]))
    # solve the problem with the objective of minimizing the norm of w @ X - target (@ is the matrix product)
    P = cvxpy.Problem(cvxpy.Minimize(cvxpy.norm((w @ X) / N_vectors - target)), [w >= 0, cvxpy.sum(w) == 1])
    # here it is solved
    print('Distance from target is: ', P.solve())
    # show the solution in a nice plot
    # w.value is the w that gave the optimal solution
    Y = w.value.transpose() * X / N_vectors
    path = np.zeros((X.shape[0] + 1, 2))
    path[1:, :] = np.cumsum(Y, axis=0)
    randColors = np.random.rand(3 * X.shape[0], 3).reshape((-1, 3)) * 0.7
    plt.quiver(path[:-1, 0], path[:-1, 1], Y[:, 0], Y[:, 1], color=randColors, angles='xy', scale_units='xy', scale=1)
    plt.plot(target[:, 0], target[:, 1], 'or')
And you can run it like this
target = np.array([[1.234, 0.456]])
plt.figure(figsize=(12, 4))
for i in [1, 2, 3]:
    X = np.random.randn(20) * 100
    plt.subplot(1, 3, i)
    place_there(X, target)
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
    plt.grid()
plt.show()

I am newbie in python and doing coding for my physics project which requires to generate a matrix with a variable E

I am a newbie in Python and I am writing code for my physics project, which requires generating a matrix with a variable E for which the first element of the matrix has to be solved. Please help me. Thanks in advance.
Here is the part of code
import numpy as np
import pylab as pl
import math
import cmath
import sympy as sy
from scipy.optimize import fsolve
#Constants(Values at temp 10K)
hbar = 1.055E-34
m0=9.1095E-31 #free mass of electron
q= 1.602E-19
v = [0.510,0,0.510] # conduction band offset in eV
m1= 0.043 #effective mass in In_0.53Ga_0.47As
m2 = 0.072 #effective mass in Al_0.48In_0.52As
d = [-math.inf,100,math.inf] # dimension of structure in nanometers
'''scaling factor so that E is in eV, mass is in units of the free electron mass, and length is
in nanometers'''
s = (2*q*m0*1E-18)/(hbar)**2
#print('scaling factor is ',s)
E = sy.symbols('E') #Suppose energy of incoming particle is 0.3eV
m = [0.043,0.072,0.043] #effective mass of electrons in layers
for i in range(3):
    print('Effective mass of e in layer', i, 'is', m[i])

k = []  # Defining an array for wavevectors in different layers
for i in range(3):
    k.append(sy.sqrt(s*m[i]*(E-v[i])))
    print('Wave vector in layer', i, 'is', k[i])

x = []
for i in range(2):
    x.append((k[i+1]*m[i])/(k[i]*m[i+1]))
    # print(x[i])
#Define Boundary condition matrix for two interfaces.
D0 = (1/2)*sy.Matrix([[1+x[0],1-x[0]], [1-x[0], 1+x[0]]], dtype = complex)
#print(D0)
#A = sy.matrix2numpy(D0,dtype=complex)
D1 = (1/2)*sy.Matrix([[1+x[1],1-x[1]], [1-x[1], 1+x[1]]], dtype = complex)
#print(D1)
#a=eye(3,3)
#print(a)
#Define Propagation matrix for 2nd layer or quantum well
#print(d[1])
#print(k[1])
P1 = 1*sy.Matrix([[sy.exp(-1j*k[1]*d[1]), 0],[0, sy.exp(1j*k[1]*d[1])]], dtype = complex)
#print(P1)
print("abs")
T= D0*P1*D1
#print('Transfer Matrix is given by:',T)
#print('Dimension of tranfer matrix T is' ,T.shape)
#print(T[0,0]
# I want to solve T{0,0} = 0 equation for E
def f(x):
    return T[0,0]

x0 = 0.5  # initial guess
x = fsolve(f, x0)
print("E is", x)
'''
y=sy.Eq(T[0,0],0)
z=sy.solve(y,E)
print('z',z)
'''
The main part, I guess, is the part of the code where I am trying to solve the equation. The steps I am following:
Defining a symbol E using SymPy.
Generating three matrices that involve sum formulae and the variable E.
Generating a matrix T by multiplying those 3 matrices; note that the elements are complex and involve square roots of negative numbers.
I need to solve the first element of this matrix, T[0,0] = 0, for the variable E and find its value. I used fsolve for solving T[0,0] = 0.
Just a note for future questions, please leave out unused imports such as numpy and leave out zombie code like # a = eye(3,3). This helps keep the code as clean and short as possible. Also, the sample code would not run because of indentation problems, so when you copy and paste code, make sure it works before you do so. Always try to make your questions as short and modular as possible.
The expression of T[0,0] is too complex to solve analytically by SymPy so numerical approximation is needed. This leaves 2 options:
using SciPy's solvers which are advanced but require type casting to float values since SciPy does not deal with SymPy objects in any way.
using SymPy's root solvers which are less advanced but are probably simpler to use.
Both of these will only ever produce a single number as output since you can't expect numeric solvers to find every root. If you wanted to find more than one, then I advise that you use a list of points that you want to use as initial values, input each of them into the solvers and keep track of the distinct outputs. This will however never guarantee that you have obtained every root.
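A minimal sketch of that idea (the function here is only a stand-in for the numeric evaluation of T[0,0] shown further below; rounding is used to collapse near-identical roots):
# Sweep several initial guesses and keep the distinct roots that are found.
import numpy as np
from scipy.optimize import newton

def f(e):
    return np.cos(3 * e) - 0.1 * e   # placeholder function, not the real T[0,0]

roots = set()
for guess in np.linspace(0.0, 2.0, 9):   # list of initial values to try
    try:
        r = newton(f, guess, maxiter=50)
    except RuntimeError:
        continue                         # this guess did not converge
    roots.add(round(float(r), 6))        # de-duplicate near-identical roots
print(sorted(roots))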
Only mix SciPy and SymPy if you are comfortable using both with no problems. SciPy doesn't play well with SymPy at all, and you should only have list, float, and complex instances when working with SciPy.
import math
import sympy as sy
from scipy.optimize import newton
# Constants(Values at temp 10K)
hbar = 1.055E-34
m0 = 9.1095E-31 # free mass of electron
q = 1.602E-19
v = [0.510, 0, 0.510] # conduction band offset in eV
m1 = 0.043 # effective mass in In_0.53Ga_0.47As
m2 = 0.072 # effective mass in Al_0.48In_0.52As
d = [-math.inf, 100, math.inf] # dimension of structure in nanometers
'''scaling factor so that E is in eV, mass is in units of the free electron mass, and length is
in nanometers'''
s = (2 * q * m0 * 1E-18) / hbar ** 2
E = sy.symbols('E') # Suppose energy of incoming particle is 0.3eV
m = [0.043, 0.072, 0.043] # effective mass of electrons in layers
for i in range(3):
    print('Effective mass of e in layer', i, 'is', m[i])

k = []  # Defining an array for wavevectors in different layers
for i in range(3):
    k.append(sy.sqrt(s * m[i] * (E - v[i])))
    print('Wave vector in layer', i, 'is', k[i])

x = []
for i in range(2):
    x.append((k[i + 1] * m[i]) / (k[i] * m[i + 1]))
# Define Boundary condition matrix for two interfaces.
D0 = (1 / 2) * sy.Matrix([[1 + x[0], 1 - x[0]], [1 - x[0], 1 + x[0]]], dtype=complex)
D1 = (1 / 2) * sy.Matrix([[1 + x[1], 1 - x[1]], [1 - x[1], 1 + x[1]]], dtype=complex)
# Define Propagation matrix for 2nd layer or quantum well
P1 = 1 * sy.Matrix([[sy.exp(-1j * k[1] * d[1]), 0], [0, sy.exp(1j * k[1] * d[1])]], dtype=complex)
print("abs")
T = D0 * P1 * D1
# did not converge for 0.5
x0 = 0.75
# method 1:
def f(e):
    # evaluate T[0,0] at e and remove all sympy related things.
    result = complex(T[0, 0].replace(E, e))
    return result
solution1 = newton(f, x0)
print(solution1)
# method 2:
solution2 = sy.nsolve(T[0,0], E, x0)
print(solution2)
This prints:
(0.7533104353644469-0.023775286117722193j)
1.00808496181754 - 0.0444042144405285*I
Note that the first line is a native Python complex instance while the second is an instance of SymPy's complex number. One can convert the second simply with print(complex(solution2)).
Now, you'll notice that they produce different numbers but both are correct. This function seems to have a lot of zeros as can be shown from the Geogebra plot:
The red axis is Re(E), green is Im(E) and blue is |T[0,0]|. Each of those "spikes" are probably zeros.

Fastest code to calculate distance between points in numpy array with cyclic (periodic) boundary conditions

I know how to calculate the Euclidean distance between points in an array using
scipy.spatial.distance.cdist
Similar to answers to this question:
Calculate Distances Between One Point in Matrix From All Other Points
However, I would like to make the calculation assuming cyclic boundary conditions, e.g. so that point [0,0] is distance 1 from point [0,n-1] in this case, not a distance of n-1. (I will then make a mask for all points within a threshold distance of my target cells, but that is not central to the question).
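Concretely, the per-coordinate "wrapped" separation I am after looks like this (toy numbers):
# Minimum-image separation along one periodic axis of length n:
# the distance wraps around the boundary instead of spanning the box.
n = 5
x1, x2 = 0, n - 1
dx = min(abs(x1 - x2), n - abs(x1 - x2))
print(dx)   # 1, not 4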
The only way I can think of is to repeat the calculation 9 times, with the domain indices having n added/subtracted in the x, y and then x&y directions, and then stacking the results and finding the minimum across the 9 slices. To illustrate the need for 9 repetitions, I put together a simple schematic with just one J-point (marked with a circle); it shows an example where the cell marked by the triangle has its nearest neighbour in the domain reflected to the top-left.
this is the code I developed for this using cdist:
import numpy as np
from scipy import spatial
n=5 # size of 2D box (n X n points)
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
i=np.argwhere(a>-1) # all points, for each loc we want distance to nearest J
j=np.argwhere(a>0.85) # set of J locations to find distance to.
# this will be used in the KDtree soln
global maxdist
maxdist=2.0
def dist_v1(i, j):
    dist = []
    # 3x3 search required for periodic boundaries.
    for xoff in [-n, 0, n]:
        for yoff in [-n, 0, n]:
            jo = j.copy()
            jo[:, 0] -= xoff
            jo[:, 1] -= yoff
            dist.append(np.amin(spatial.distance.cdist(i, jo, metric='euclidean'), 1))
    dist = np.amin(np.stack(dist), 0).reshape([n, n])
    return dist
This works, and produces e.g. :
print(dist_v1(i,j))
[[1.41421356 1. 1.41421356 1.41421356 1. ]
[2.23606798 2. 1.41421356 1. 1.41421356]
[2. 2. 1. 0. 1. ]
[1.41421356 1. 1.41421356 1. 1. ]
[1. 0. 1. 1. 0. ]]
The zeros obviously mark the J points, and the distances are correct (this EDIT corrects my earlier attempts, which were incorrect).
Note that if you change the last two lines to stack the raw distances and then only use one minimum like this :
def dist_v2(i, j):
    dist = []
    # 3x3 search required for periodic boundaries.
    for xoff in [-n, 0, n]:
        for yoff in [-n, 0, n]:
            jo = j.copy()
            jo[:, 0] -= xoff
            jo[:, 1] -= yoff
            dist.append(spatial.distance.cdist(i, jo, metric='euclidean'))
    dist = np.amin(np.dstack(dist), (1, 2)).reshape([n, n])
    return dist
it is faster for small n (<10) but considerably slower for larger arrays (n>10)
...but either way, it is slow for my large arrays (n=500 and around 70 J points); this search is taking up about 99% of the calculation time (and it is a bit ugly too, using the loops). Is there a better/faster way?
The other options I thought of were:
scipy.spatial.KDTree.query_ball_point
With further searching I have found that there is a function
scipy.spatial.KDTree.query_ball_point which directly calculates the coordinates within a radius of my J-points, but it doesn't seem to have any facility to use periodic boundaries, so I presume one would still need to somehow use a 3x3 loop, stack and then use amin as I do above, so I'm not sure if this will be any faster.
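An aside I have not benchmarked: newer SciPy versions let cKDTree handle the periodicity itself via a boxsize argument, which treats the domain as a torus, so the 3x3 copying might be avoidable entirely. A small sketch of that idea:
# Sketch only: build the tree with boxsize=(n, n) so queries use periodic
# (toroidal) distances directly, without tiling the J points 3x3.
import numpy as np
from scipy import spatial

n = 5
np.random.seed(1)
a = np.random.uniform(size=(n, n))
i = np.argwhere(a > -1)      # all points
j = np.argwhere(a > 0.85)    # J locations
tree = spatial.cKDTree(j, boxsize=(n, n))
dist, _ = tree.query(i, k=1)
print(dist.reshape(n, n))    # periodic nearest-J distance for every point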
I coded up a solution using query_ball_point WITHOUT worrying about the periodic boundary conditions (i.e. this doesn't answer my question):
def dist_v3(n, j):
    x, y = np.mgrid[0:n, 0:n]
    points = np.c_[x.ravel(), y.ravel()]
    tree = spatial.KDTree(points)
    mask = np.zeros([n, n])
    for results in tree.query_ball_point((j), maxdist):
        mask[points[results][:, 0], points[results][:, 1]] = 1
    return mask
Maybe I'm not using it in the most efficient way, but this is already as slow as my cdist-based solutions even without the periodic boundaries. Including the mask function in the two cdist solutions, i.e. replacing the return(dist) with return(np.where(dist<=maxdist,1,0)) in those functions, and then using timeit, I get the following timings for n=100:
from timeit import timeit
print("cdist v1:",timeit(lambda: dist_v1(i,j), number=3)*100)
print("cdist v2:",timeit(lambda: dist_v2(i,j), number=3)*100)
print("KDtree:", timeit(lambda: dist_v3(n,j), number=3)*100)
cdist v1: 181.80927299981704
cdist v2: 554.8205785999016
KDtree: 605.119637199823
Make an array of relative coordinates for points within a set distance of [0,0] and then manually loop over the J points setting up a mask with this list of relative points - This has the advantage that the "relative distance" calculation is only performed once (my J points change each timestep), but I suspect the looping will be very slow.
Precalculate a set of masks for EVERY point in the 2D domain, so in each timestep of the model integration I just pick out the mask for the J-point and apply. This would use a LOT of memory (proportional to n^4) and perhaps is still slow as you need to loop over J points to combine the masks.
I'll show an alternative approach from an image processing perspective, which may be of interest to you, regardless of whether it's the fastest or not. For convenience, I've only implemented it for an odd n.
Rather than considering a set of nxn points i, let's instead take the nxn box. We can consider this as a binary image. Let each point in j be a positive pixel in this image. For n=5 this would look something like:
Now let's think about another concept from image processing: Dilation. For any input pixel, if it has a positive pixel in its neighborhood, the output pixel will be 1. This neighborhood is defined by what is called the Structuring Element: a boolean kernel where the ones will show which neighbors to consider.
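For intuition only, this is the same operation that scipy.ndimage.binary_dilation performs; a toy sketch with a hypothetical cross-shaped structuring element:
# One positive pixel dilated by a cross-shaped structuring element: every
# pixel whose neighborhood (as defined by the SE) contains a 1 becomes 1.
import numpy as np
from scipy.ndimage import binary_dilation

img = np.zeros((5, 5), dtype=bool)
img[2, 2] = True
se = np.array([[0, 1, 0],
               [1, 1, 1],
               [0, 1, 0]], dtype=bool)
print(binary_dilation(img, structure=se).astype(int))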
Here's how I define the SE for this problem:
Y, X = np.ogrid[-n:n+1, -n:n+1]
SQ = X*X+Y*Y
H = SQ == r
Intuitively, H is a mask denoting all the points around the center that satisfy the equation x*x + y*y = r. That is, all points in H lie at distance sqrt(r) from the center. Another visualization and it'll be absolutely clear:
It is an ever expanding pixel circle. Each white pixel in each mask denotes a point where the distance from the center pixel is exactly sqrt(r). You might also be able to tell that if we iteratively increase the value of r, we're actually steadily covering all the pixel locations around a particular location, eventually covering the entire image. (Some values of r don't give responses, because no such distance sqrt(r) exists for any pair of points. We skip those r values - like 3.)
So here's what the main algorithm does.
We will incrementally increase the value of r starting from 0 to some high upper limit.
At each step, if any position (x,y) in the image gives a response to dilation, that means that there is a j point at exactly sqrt(r) distance from it!
We can find a match multiple times; we'll only keep the first match and discard further matches for points. We do this till all pixels (x,y) have found their minimum distance / first match.
So you could say that this algorithm is dependent on the number of unique distance pairs in the nxn image.
This also implies that if you have more and more points in j, the algorithm will actually get faster, which goes against common sense!
The worst case for this dilation algorithm is when you have the minimum number of points (exactly one point in j), because then it would need to iterate r to a very high value to get a match from points far away.
In terms of implementing:
import numpy as np
import matplotlib.pyplot as plt

n=5 # size of 2D box (n X n points)
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
I=np.argwhere(a>-1) # all points, for each loc we want distance to nearest J
J=np.argwhere(a>0.85)
Y, X = np.ogrid[-n:n+1, -n:n+1]
SQ = X*X+Y*Y
point_space = np.zeros((n, n))
point_space[J[:,0], J[:,1]] = 1
C1 = point_space[:, :n//2]
C2 = point_space[:, n//2+1:]
C = np.hstack([C2, point_space, C1])
D1 = point_space[:n//2, :]
D2 = point_space[n//2+1:, :]
D2_ = np.hstack([point_space[n//2+1:, n//2+1:],D2,point_space[n//2+1:, :n//2]])
D1_ = np.hstack([point_space[:n//2:, n//2+1:],D1,point_space[:n//2, :n//2]])
D = np.vstack([D2_, C, D1_])
p = (3*n-len(D))//2
D = np.pad(D, (p,p), constant_values=(0,0))
plt.imshow(D, cmap='gray')
plt.title(f'n={n}')
If you look at the image for n=5, you can tell what I've done; I've simply padded the image with its four quadrants in a way to represent the cyclic space, and then added some additional zero padding to account for the worst case search boundary.
import numba as nb

@nb.jit
def dilation(image, output, kernel, N, i0, i1):
    for i in range(i0, i1):
        for j in range(i0, i1):
            a_0 = i-(N//2)
            a_1 = a_0+N
            b_0 = j-(N//2)
            b_1 = b_0+N
            neighborhood = image[a_0:a_1, b_0:b_1]*kernel
            if np.any(neighborhood):
                output[i-i0, j-i0] = 1
    return output

@nb.njit(cache=True)
def progressive_dilation(point_space, out, total, dist, SQ, n, N_):
    for i in range(N_):
        if not np.any(total):
            break
        H = SQ == i
        rows, cols = np.nonzero(H)
        if len(rows) == 0: continue
        rmin, rmax = rows.min(), rows.max()
        cmin, cmax = cols.min(), cols.max()
        H_ = H[rmin:rmax+1, cmin:cmax+1]
        out[:] = False
        out = dilation(point_space, out, H_, len(H_), n, 2*n)
        idx = np.logical_and(out, total)
        for a, b in zip(*np.where(idx)):
            dist[a, b] = i
        total = total * np.logical_not(out)
    return dist

def dilateWrap(D, SQ, n):
    out = np.zeros((n, n), dtype=bool)
    total = np.ones((n, n), dtype=bool)
    dist = -1*np.ones((n, n))
    dist = progressive_dilation(D, out, total, dist, SQ, n, 2*n*n+1)
    return dist

dout = dilateWrap(D, SQ, n)
dout = dilateWrap(D, SQ, n)
If we visualize dout, we can actually get an awesome visual representation of the distances.
The dark spots are basically positions where j points were present. The brightest spots naturally mean points farthest away from any j. Note that I've kept the values in squared form to get an integer image. The actual distance is still one square root away. The results match the outputs of the ball tree (distBP) approach below.
# after resetting n = 501 and rerunning the first block
N = J.copy()
NE = J.copy()
E = J.copy()
SE = J.copy()
S = J.copy()
SW = J.copy()
W = J.copy()
NW = J.copy()
N[:,1] = N[:,1] - n
NE[:,0] = NE[:,0] - n
NE[:,1] = NE[:,1] - n
E[:,0] = E[:,0] - n
SE[:,0] = SE[:,0] - n
SE[:,1] = SE[:,1] + n
S[:,1] = S[:,1] + n
SW[:,0] = SW[:,0] + n
SW[:,1] = SW[:,1] + n
W[:,0] = W[:,0] + n
NW[:,0] = NW[:,0] + n
NW[:,1] = NW[:,1] - n
from sklearn.neighbors import BallTree

def distBP(I, J):
    tree = BallTree(np.concatenate([J, N, E, S, W, NE, SE, SW, NW]), leaf_size=15, metric='euclidean')
    dist = tree.query(I, k=1, return_distance=True)
    minimum_distance = dist[0].reshape(n, n)
    return minimum_distance
print(np.array_equal(distBP(I,J), np.sqrt(dilateWrap(D, SQ, n))))
Out:
True
Now for a time check, at n=501.
from timeit import timeit
nl=1
print("ball tree:",timeit(lambda: distBP(I,J),number=nl))
print("dilation:",timeit(lambda: dilateWrap(D, SQ, n),number=nl))
Out:
ball tree: 1.1706031339999754
dilation: 1.086665302000256
I would say they are roughly equal, although dilation has a very minute edge. In fact, dilation is still missing a square root operation, let's add that.
from timeit import timeit
nl=1
print("ball tree:",timeit(lambda: distBP(I,J),number=nl))
print("dilation:",timeit(lambda: np.sqrt(dilateWrap(D, SQ, n)),number=nl))
Out:
ball tree: 1.1712950239998463
dilation: 1.092416919000243
Square root basically has negligible effect on the time.
Now, I said earlier that dilation becomes faster when there are actually more points in j. So let's increase the number of points in j.
n=501 # size of 2D box (n X n points)
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
I=np.argwhere(a>-1) # all points, for each loc we want distance to nearest J
J=np.argwhere(a>0.4) # previously a>0.85
Checking the time now:
from timeit import timeit
nl=1
print("ball tree:",timeit(lambda: distBP(I,J),number=nl))
print("dilation:",timeit(lambda: np.sqrt(dilateWrap(D, SQ, n)),number=nl))
Out:
ball tree: 3.3354218500007846
dilation: 0.2178608220001479
Ball tree has actually gotten slower while dilation got faster! This is because if there are many j points, we can quickly find all distances with a few repeats of dilation. I find this effect rather interesting: normally you would expect runtimes to get worse as the number of points increases, but the opposite happens here.
Conversely, if we reduce j, we'll see dilation get slower:
#Setting a>0.9
print("ball tree:",timeit(lambda: distBP(I,J),number=nl))
print("dilation:",timeit(lambda: np.sqrt(dilateWrap(D, SQ, n)),number=nl))
Out:
ball tree: 1.010353464000218
dilation: 1.4776274510004441
I think we can safely conclude that convolutional or kernel-based approaches would offer much better gains in this particular problem than pairs-of-points or tree-based approaches.
Lastly, I've mentioned it at the beginning and I'll mention it again: this entire implementation only accounts for an odd value of n; I didn't have the patience to calculate the proper padding for an even n. (If you're familiar with image processing, you've probably faced this before: everything's easier with odd sizes.)
This may also be further optimized, since I'm only an occasional dabbler in numba.
[EDIT] - I found a mistake in the way the code keeps track of the points where the job is done, fixed it with the mask_kernel. The pure python version of the newer code is ~1.5 times slower, but the numba version is slightly faster (due to some other optimisations).
[current best: ~100x to 120x the original speed]
First of all, thank you for submitting this problem, I had a lot of fun optimizing it!
My current best solution relies on the assumption that the grid is regular and that the "source" points (the ones from which we need to compute the distance) are roughly evenly distributed.
The idea here is that all of the distances are going to be either 1, sqrt(2), sqrt(3), ... so we can do the numerical calculation beforehand. Then we simply put these values in a matrix and copy that matrix around each source point (and making sure to keep the minimum value found at each point). This covers the vast majority of the points (>99%). Then we apply another more "classical" method for the remaining 1%.
Here's the code:
import numpy as np
def sq_distance(x1, y1, x2, y2, n):
    # computes the pairwise squared distance between 2 sets of points (with periodicity)
    # x1, y1 : coordinates of the first set of points (source)
    # x2, y2 : same
    dx = np.abs((np.subtract.outer(x1, x2) + n//2)%(n) - n//2)
    dy = np.abs((np.subtract.outer(y1, y2) + n//2)%(n) - n//2)
    d = (dx*dx + dy*dy)
    return d

def apply_kernel(sources, sqdist, kern_size, n, mask):
    ker_i, ker_j = np.meshgrid(np.arange(-kern_size, kern_size+1), np.arange(-kern_size, kern_size+1), indexing="ij")
    kernel = np.add.outer(np.arange(-kern_size, kern_size+1)**2, np.arange(-kern_size, kern_size+1)**2)
    mask_kernel = kernel > kern_size**2
    for pi, pj in sources:
        ind_i = (pi+ker_i)%n
        ind_j = (pj+ker_j)%n
        sqdist[ind_i, ind_j] = np.minimum(kernel, sqdist[ind_i, ind_j])
        mask[ind_i, ind_j] *= mask_kernel

def dist_vf(sources, n, kernel_size):
    sources = np.asfortranarray(sources) #for memory contiguity
    kernel_size = min(kernel_size, n//2)
    kernel_size = max(kernel_size, 1)
    sqdist = np.full((n,n), 10*n**2, dtype=np.int32) #preallocate with a huge distance (>max**2)
    mask = np.ones((n,n), dtype=bool) #which points have not been reached?
    #main code
    apply_kernel(sources, sqdist, kernel_size, n, mask)
    #remaining points
    rem_i, rem_j = np.nonzero(mask)
    if len(rem_i) > 0:
        sq_d = sq_distance(sources[:,0], sources[:,1], rem_i, rem_j, n).min(axis=0)
        sqdist[rem_i, rem_j] = sq_d
    #eff = 1-rem_i.size/n**2
    #print("covered by kernel :", 100*eff, "%")
    #print("overlap :", sources.shape[0]*(1+2*kernel_size)**2/n**2)
    #print()
    return np.sqrt(sqdist)
Testing this version with
n=500 # size of 2D box (n X n points)
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
all_points=np.argwhere(a>-1) # all points, for each loc we want distance to nearest J
source_points=np.argwhere(a>1-70/n**2) # set of J locations to find distance to.
#
# code for dist_v1 and dist_vf
#
overlap=5.2
kernel_size = int(np.sqrt(overlap*n**2/source_points.shape[0])/2)
print("cdist v1 :", timeit(lambda: dist_v1(all_points,source_points), number=1)*1000, "ms")
print("kernel version:", timeit(lambda: dist_vf(source_points, n, kernel_size), number=10)*100, "ms")
gives
cdist v1 : 1148.6694 ms
kernel version: 69.21876999999998 ms
which is a already a ~17x speedup! I also implemented a numba version of sq_distance and apply_kernel: [this is the new correct version]
from numba import njit

@njit(cache=True)
def sq_distance(x1, y1, x2, y2, n):
    m1 = x1.size
    m2 = x2.size
    n2 = n//2
    d = np.empty((m1, m2), dtype=np.int32)
    for i in range(m1):
        for j in range(m2):
            dx = np.abs(x1[i] - x2[j] + n2)%n - n2
            dy = np.abs(y1[i] - y2[j] + n2)%n - n2
            d[i,j] = (dx*dx + dy*dy)
    return d

@njit(cache=True)
def apply_kernel(sources, sqdist, kern_size, n, mask):
    # creating the kernel
    kernel = np.empty((2*kern_size+1, 2*kern_size+1))
    vals = np.arange(-kern_size, kern_size+1)**2
    for i in range(2*kern_size+1):
        for j in range(2*kern_size+1):
            kernel[i,j] = vals[i] + vals[j]
    mask_kernel = kernel > kern_size**2
    I = sources[:,0]
    J = sources[:,1]
    # applying the kernel for each point
    for l in range(sources.shape[0]):
        pi = I[l]
        pj = J[l]
        if pj - kern_size >= 0 and pj + kern_size < n: #if we are in the middle, no need to do the modulo for j
            for i in range(2*kern_size+1):
                ind_i = np.mod((pi+i-kern_size), n)
                for j in range(2*kern_size+1):
                    ind_j = (pj+j-kern_size)
                    sqdist[ind_i,ind_j] = np.minimum(kernel[i,j], sqdist[ind_i,ind_j])
                    mask[ind_i,ind_j] = mask_kernel[i,j] and mask[ind_i,ind_j]
        else:
            for i in range(2*kern_size+1):
                ind_i = np.mod((pi+i-kern_size), n)
                for j in range(2*kern_size+1):
                    ind_j = np.mod((pj+j-kern_size), n)
                    sqdist[ind_i,ind_j] = np.minimum(kernel[i,j], sqdist[ind_i,ind_j])
                    mask[ind_i,ind_j] = mask_kernel[i,j] and mask[ind_i,ind_j]
    return
and testing with
overlap=5.2
kernel_size = int(np.sqrt(overlap*n**2/source_points.shape[0])/2)
print("cdist v1 :", timeit(lambda: dist_v1(all_points,source_points), number=1)*1000, "ms")
print("kernel numba (first run):", timeit(lambda: dist_vf(source_points, n, kernel_size), number=1)*1000, "ms") #first run = cimpilation = long
print("kernel numba :", timeit(lambda: dist_vf(source_points, n, kernel_size), number=10)*100, "ms")
which gave the following results
cdist v1 : 1163.0742 ms
kernel numba (first run): 2060.0802 ms
kernel numba : 8.80377000000001 ms
Due to the JIT compilation, the first run is pretty slow but otherwise, it's a 120x improvement!
It may be possible to get a little bit more out of this algorithm by tweaking the kernel_size parameter (or the overlap). The current choice of kernel_size is only effective for a small number of source points. For example, this choice fails miserably with source_points=np.argwhere(a>0.85) (13s) while manually setting kernel_size=5 gives the answer in 22ms.
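To see why, here is a quick back-of-the-envelope evaluation of that heuristic (a sketch; it assumes n=500, overlap=5.2, and that a>0.85 keeps roughly 15% of the grid points). A small kernel leaves more uncovered cells for the brute-force fallback, which is presumably where the 13 s go:
# How the kernel_size heuristic shrinks as the number of source points grows
# (assumed counts: ~70 sparse sources vs. ~0.15*n**2 sources for a>0.85).
import numpy as np

n, overlap = 500, 5.2
for npts in (70, int(0.15 * n**2)):
    kernel_size = int(np.sqrt(overlap * n**2 / npts) / 2)
    print(npts, "source points -> kernel_size =", kernel_size)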
I hope my post isn't (unnecessarily) too complicated, I don't really know how to organise it better.
[EDIT 2]:
I gave a little more attention to the non-numba part of the code and managed to get a pretty significant speedup, getting very close to what numba could achieve. Here is the new version of the function apply_kernel:
def apply_kernel(sources, sqdist, kern_size, n, mask):
    ker_i = np.arange(-kern_size, kern_size+1).reshape((2*kern_size+1, 1))
    ker_j = np.arange(-kern_size, kern_size+1).reshape((1, 2*kern_size+1))
    kernel = np.add.outer(np.arange(-kern_size, kern_size+1)**2, np.arange(-kern_size, kern_size+1)**2)
    mask_kernel = kernel > kern_size**2
    for pi, pj in sources:
        imin = pi-kern_size
        jmin = pj-kern_size
        imax = pi+kern_size+1
        jmax = pj+kern_size+1
        if imax < n and jmax < n and imin >= 0 and jmin >= 0: # we are inside
            sqdist[imin:imax, jmin:jmax] = np.minimum(kernel, sqdist[imin:imax, jmin:jmax])
            mask[imin:imax, jmin:jmax] *= mask_kernel
        elif imax < n and imin >= 0:
            ind_j = (pj+ker_j.ravel())%n
            sqdist[imin:imax, ind_j] = np.minimum(kernel, sqdist[imin:imax, ind_j])
            mask[imin:imax, ind_j] *= mask_kernel
        elif jmax < n and jmin >= 0:
            ind_i = (pi+ker_i.ravel())%n
            sqdist[ind_i, jmin:jmax] = np.minimum(kernel, sqdist[ind_i, jmin:jmax])
            mask[ind_i, jmin:jmax] *= mask_kernel
        else:
            ind_i = (pi+ker_i)%n
            ind_j = (pj+ker_j)%n
            sqdist[ind_i, ind_j] = np.minimum(kernel, sqdist[ind_i, ind_j])
            mask[ind_i, ind_j] *= mask_kernel
The main optimisations are (a small illustration follows the list):
Indexing with slices (rather than a dense array)
Use of sparse indexes (how did I not think about that earlier)
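A minimal illustration of the first point (basic slicing returns a view of the array, while fancy indexing with index arrays allocates a copy), which is part of why the slice branches above are cheaper:
# Basic slicing gives a view (shares memory, no copy); integer-array
# ("fancy") indexing gives a copy.
import numpy as np

a = np.arange(100).reshape(10, 10)
sl = a[2:5, 2:5]                              # view
fa = a[np.arange(2, 5)][:, np.arange(2, 5)]   # copy
print(np.shares_memory(a, sl), np.shares_memory(a, fa))   # True False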
Testing with
overlap=5.4
kernel_size = int(np.sqrt(overlap*n**2/source_points.shape[0])/2)
print("cdist v1 :", timeit(lambda: dist_v1(all_points,source_points), number=1)*1000, "ms")
print("kernel v2 :", timeit(lambda: dist_vf(source_points, n, kernel_size), number=10)*100, "ms")
gives
cdist v1 : 1209.8163000000002 ms
kernel v2 : 11.319049999999997 ms
which is a nice 100x improvement over cdist, a ~5.5x improvement over the previous numpy-only version and just ~25% slower than what I could achieve with numba.
Here is a fixed version of your code and a different method that is a bit faster. They give the same results, so I'm reasonably confident they are correct:
import numpy as np
from scipy.spatial.distance import squareform, pdist, cdist
from numpy.linalg import norm
def pb_OP(A, p=1.0):
    distl = []
    for *offs, ct in [(0, 0, 0), (0, p, 1), (p, 0, 1), (p, p, 1), (-p, p, 1)]:
        B = A - offs
        distl.append(cdist(B, A, metric='euclidean'))
        if ct:
            distl.append(distl[-1].T)
    return np.amin(np.dstack(distl), 2)

def pb_pp(A, p=1.0):
    out = np.empty((2, A.shape[0]*(A.shape[0]-1)//2))
    for o, i in zip(out, A.T):
        pdist(i[:, None], 'cityblock', out=o)
    out[out > p/2] -= p
    return squareform(norm(out, axis=0))
test = np.random.random((1000, 2))
assert np.allclose(pb_OP(test), pb_pp(test))
from timeit import timeit
t_OP = timeit(lambda: pb_OP(test), number=10)*100
t_pp = timeit(lambda: pb_pp(test), number=10)*100
print('OP', t_OP)
print('pp', t_pp)
Sample run. 1000 points:
OP 210.11001259903423
pp 22.288734700123314
We see that my method is ~9x faster, which by a neat coincidence is the number of offset configurations OP's version has to check. It uses pdist on the individual coordinates to get absolute differences. Where these are larger than half the grid spacing we subtract one period. It remains to take the Euclidean norm and to unpack storage.
For calculating multiple distances I think it is hard to beat a simple BallTree (or similar).
I didn't quite understand the cyclic boundary, or at least why you need to loop 3x3 times; as I see it, it behaves like a torus and it is enough to make 5 copies.
Update: Indeed you need 3x3 for the edges. I updated the code.
To make sure my minimum_distance is correct, for n = 200 the test np.all( minimum_distance == dist_v1(i,j) ) gave True.
For n = 500 generated with the provided code, the %%time for a cold start gave
CPU times: user 1.12 s, sys: 0 ns, total: 1.12 s
Wall time: 1.11 s
So I generate 500 data points like in the post
import numpy as np
n=500 # size of 2D box (n X n points)
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
i=np.argwhere(a>-1) # all points, for each loc we want distance to nearest J
j=np.argwhere(a>0.85) # set of J locations to find distance to.
And use the BallTree
import numpy as np
from sklearn.neighbors import BallTree
N = j.copy()
NE = j.copy()
E = j.copy()
SE = j.copy()
S = j.copy()
SW = j.copy()
W = j.copy()
NW = j.copy()
N[:,1] = N[:,1] - n
NE[:,0] = NE[:,0] - n
NE[:,1] = NE[:,1] - n
E[:,0] = E[:,0] - n
SE[:,0] = SE[:,0] - n
SE[:,1] = SE[:,1] + n
S[:,1] = S[:,1] + n
SW[:,0] = SW[:,0] + n
SW[:,1] = SW[:,1] + n
W[:,0] = W[:,0] + n
NW[:,0] = NW[:,0] + n
NW[:,1] = NW[:,1] - n
tree = BallTree(np.concatenate([j,N,E,S,W,NE,SE,SW,NW]), leaf_size=15, metric='euclidean')
dist = tree.query(i, k=1, return_distance=True)
minimum_distance = dist[0].reshape(n,n)
Update:
Note here I copied the data to the N, E, S, W, NE, SE, SW, NW copies of the box to handle the boundary conditions. Again, for n = 200 this gave the same results. You could tweak the leaf_size, but I feel this setting is all right.
The performance is sensitive to the number of points in j.
These are 8 different solutions I've timed, some of my own and some posted in response to my question, that use 4 broad approaches:
spatial cdist
spatial KDtree
Sklearn BallTree
Kernel approach
This is the code with the 8 test routines:
import numpy as np
from scipy import spatial
from sklearn.neighbors import BallTree
n=500 # size of 2D box
f=200./(n*n) # first number is rough number of target cells...
np.random.seed(1) # to make reproducible
a=np.random.uniform(size=(n,n))
i=np.argwhere(a>-1) # all points, we want to know distance to nearest point
j=np.argwhere(a>1.0-f) # set of locations to find distance to.
# long array of 3x3 j points:
for xoff in [0,n,-n]:
for yoff in [0,-n,n]:
if xoff==0 and yoff==0:
j9=j.copy()
else:
jo=j.copy()
jo[:,0]+=xoff
jo[:,1]+=yoff
j9=np.vstack((j9,jo))
global maxdist
maxdist=10
overlap=5.2
kernel_size=int(np.sqrt(overlap*n**2/j.shape[0])/2)
print("no points",len(j))
# repeat cdist over each member of 3x3 block
def dist_v1(i, j):
    dist = []
    # 3x3 search required for periodic boundaries.
    for xoff in [-n, 0, n]:
        for yoff in [-n, 0, n]:
            jo = j.copy()
            jo[:, 0] += xoff
            jo[:, 1] += yoff
            dist.append(np.amin(spatial.distance.cdist(i, jo, metric='euclidean'), 1))
    dist = np.amin(np.stack(dist), 0).reshape([n, n])
    #dmask=np.where(dist<=maxdist,1,0)
    return dist

# same as v1, but taking one amin function at the end
def dist_v2(i, j):
    dist = []
    # 3x3 search required for periodic boundaries.
    for xoff in [-n, 0, n]:
        for yoff in [-n, 0, n]:
            jo = j.copy()
            jo[:, 0] += xoff
            jo[:, 1] += yoff
            dist.append(spatial.distance.cdist(i, jo, metric='euclidean'))
    dist = np.amin(np.dstack(dist), (1, 2)).reshape([n, n])
    #dmask=np.where(dist<=maxdist,1,0)
    return dist

# using a KDTree query ball points, looping over j9 points as in online example
def dist_v3(n, j):
    x, y = np.mgrid[0:n, 0:n]
    points = np.c_[x.ravel(), y.ravel()]
    tree = spatial.KDTree(points)
    mask = np.zeros([n, n])
    for results in tree.query_ball_point((j), 2.1):
        mask[points[results][:, 0], points[results][:, 1]] = 1
    return mask

# using cKDTree query on the j9 long array
def dist_v4(i, j):
    tree = spatial.cKDTree(j)
    dist, minid = tree.query(i)
    return dist.reshape([n, n])

# back to using cdist, but on the long j9 3x3 array, rather than on each element separately
def dist_v5(i, j):
    # 3x3 search required for periodic boundaries.
    dist = np.amin(spatial.distance.cdist(i, j, metric='euclidean'), 1)
    #dmask=np.where(dist<=maxdist,1,0)
    return dist

def dist_v6(i, j):
    tree = BallTree(j, leaf_size=5, metric='euclidean')
    dist = tree.query(i, k=1, return_distance=True)
    mindist = dist[0].reshape(n, n)
    return mindist

def sq_distance(x1, y1, x2, y2, n):
    # computes the pairwise squared distance between 2 sets of points (with periodicity)
    # x1, y1 : coordinates of the first set of points (source)
    # x2, y2 : same
    dx = np.abs((np.subtract.outer(x1, x2) + n//2)%(n) - n//2)
    dy = np.abs((np.subtract.outer(y1, y2) + n//2)%(n) - n//2)
    d = (dx*dx + dy*dy)
    return d

def apply_kernel1(sources, sqdist, kern_size, n, mask):
    ker_i, ker_j = np.meshgrid(np.arange(-kern_size, kern_size+1), np.arange(-kern_size, kern_size+1), indexing="ij")
    kernel = np.add.outer(np.arange(-kern_size, kern_size+1)**2, np.arange(-kern_size, kern_size+1)**2)
    mask_kernel = kernel > kern_size**2
    for pi, pj in sources:
        ind_i = (pi+ker_i)%n
        ind_j = (pj+ker_j)%n
        sqdist[ind_i, ind_j] = np.minimum(kernel, sqdist[ind_i, ind_j])
        mask[ind_i, ind_j] *= mask_kernel

def apply_kernel2(sources, sqdist, kern_size, n, mask):
    ker_i = np.arange(-kern_size, kern_size+1).reshape((2*kern_size+1, 1))
    ker_j = np.arange(-kern_size, kern_size+1).reshape((1, 2*kern_size+1))
    kernel = np.add.outer(np.arange(-kern_size, kern_size+1)**2, np.arange(-kern_size, kern_size+1)**2)
    mask_kernel = kernel > kern_size**2
    for pi, pj in sources:
        imin = pi-kern_size
        jmin = pj-kern_size
        imax = pi+kern_size+1
        jmax = pj+kern_size+1
        if imax < n and jmax < n and imin >= 0 and jmin >= 0: # we are inside
            sqdist[imin:imax, jmin:jmax] = np.minimum(kernel, sqdist[imin:imax, jmin:jmax])
            mask[imin:imax, jmin:jmax] *= mask_kernel
        elif imax < n and imin >= 0:
            ind_j = (pj+ker_j.ravel())%n
            sqdist[imin:imax, ind_j] = np.minimum(kernel, sqdist[imin:imax, ind_j])
            mask[imin:imax, ind_j] *= mask_kernel
        elif jmax < n and jmin >= 0:
            ind_i = (pi+ker_i.ravel())%n
            sqdist[ind_i, jmin:jmax] = np.minimum(kernel, sqdist[ind_i, jmin:jmax])
            mask[ind_i, jmin:jmax] *= mask_kernel
        else:
            ind_i = (pi+ker_i)%n
            ind_j = (pj+ker_j)%n
            sqdist[ind_i, ind_j] = np.minimum(kernel, sqdist[ind_i, ind_j])
            mask[ind_i, ind_j] *= mask_kernel

def dist_v7(sources, n, kernel_size, method):
    sources = np.asfortranarray(sources) #for memory contiguity
    kernel_size = min(kernel_size, n//2)
    kernel_size = max(kernel_size, 1)
    sqdist = np.full((n,n), 10*n**2, dtype=np.int32) #preallocate with a huge distance (>max**2)
    mask = np.ones((n,n), dtype=bool) #which points have not been reached?
    #main code
    if (method == 1):
        apply_kernel1(sources, sqdist, kernel_size, n, mask)
    else:
        apply_kernel2(sources, sqdist, kernel_size, n, mask)
    #remaining points
    rem_i, rem_j = np.nonzero(mask)
    if len(rem_i) > 0:
        sq_d = sq_distance(sources[:,0], sources[:,1], rem_i, rem_j, n).min(axis=0)
        sqdist[rem_i, rem_j] = sq_d
    return np.sqrt(sqdist)
from timeit import timeit
nl=10
print ("-----------------------")
print ("Timings for ",nl,"loops")
print ("-----------------------")
print("1. cdist looped amin:",timeit(lambda: dist_v1(i,j),number=nl))
print("2. cdist single amin:",timeit(lambda: dist_v2(i,j),number=nl))
print("3. KDtree ball pt:", timeit(lambda: dist_v3(n,j9),number=nl))
print("4. KDtree query:",timeit(lambda: dist_v4(i,j9),number=nl))
print("5. cdist long list:",timeit(lambda: dist_v5(i,j9),number=nl))
print("6. ball tree:",timeit(lambda: dist_v6(i,j9),number=nl))
print("7. kernel orig:", timeit(lambda: dist_v7(j, n, kernel_size,1), number=nl))
print("8. kernel optimised:", timeit(lambda: dist_v7(j, n, kernel_size,2), number=nl))
The output (timing in seconds) on my linux 12 core desktop (with 48GB RAM) for n=350 and 63 points:
no points 63
-----------------------
Timings for 10 loops
-----------------------
1. cdist looped amin: 3.2488364999881014
2. cdist single amin: 6.494611179979984
3. KDtree ball pt: 5.180531410995172
4. KDtree query: 0.9377906009904109
5. cdist long list: 3.906166430999292
6. ball tree: 3.3540162370190956
7. kernel orig: 0.7813036740117241
8. kernel optimised: 0.17046571199898608
and for n=500 and npts=176:
no points 176
-----------------------
Timings for 10 loops
-----------------------
1. cdist looped amin: 16.787221198988846
2. cdist single amin: 40.97849371898337
3. KDtree ball pt: 9.926229109987617
4. KDtree query: 0.8417396580043714
5. cdist long list: 14.345821461000014
6. ball tree: 1.8792325239919592
7. kernel orig: 1.0807358759921044
8. kernel optimised: 0.5650744160229806
So in summary I reached the following conclusions:
avoid cdist if you have quite a large problem
If your problem is not too computational-time constrained, I would recommend the "KDtree query" approach, as it is just 2 lines without the periodic boundaries and a few more with periodic boundaries to set up the j9 array
For maximum performance (e.g. a long integration of a model where this is required at each time step, as in my case), the kernel solution is now by far the fastest.

Numpy - create an almost zero matrix with row from other matrix

I have a square matrix A and I want to create a matrix Z whose elements are zero everywhere except for the i'th row, and that i'th row equals the j'th row of matrix A.
I am aware of two ways to accomplish this. The first one is fairly straightforward and seems to be the most effective performance-wise:
def do_this(mx: np.array, i: int, j: int):
    Z = np.zeros_like(mx)
    Z[i, :] = mx[j, :]
    return Z
The other, less straightforward and seemingly much less efficient way, is to prepare a reference matrix ref_mx beforehand, which is a zero matrix of the same shape as A but has 1 in its (i, j) position, and then to calculate Z as ref_mx @ A.
def do_this_other_way(mx: np.array, ref_mx: np.array):
    return ref_mx @ mx
I decided to benchmark both approaches:
from time import time
import numpy as np
n = 20
num_iters = 5000
A = np.random.rand(n, n)
i, j = 5, 10
t = time()
for _ in range(num_iters):
    Z = do_this(A, i, j)
print((time() - t) / num_iters)

ref_mx = np.zeros_like(A)
ref_mx[i, j] = 1
t = time()
for _ in range(num_iters):
    Z = do_this_other_way(A, ref_mx)
print((time() - t) / num_iters)
However, when A is relatively small (on my laptop that means A's size is less than about 40), do_this_other_way wins, and when A has a size around 20, it wins by an order of magnitude.
That's it: I have doubts that I am doing this in the most effective way possible in numpy. Is it possible to do better without resorting to writing my own low-level implementation of do_this?
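For reference, here is the same comparison written with timeit (a sketch using the same shapes as above), which avoids some of the hand-rolled timing-loop overhead:
# Same benchmark, but letting timeit drive the loop.
import numpy as np
from timeit import timeit

n = 20
num_iters = 5000
A = np.random.rand(n, n)
i, j = 5, 10

def do_this(mx, i, j):
    Z = np.zeros_like(mx)
    Z[i, :] = mx[j, :]
    return Z

ref_mx = np.zeros_like(A)
ref_mx[i, j] = 1

print(timeit(lambda: do_this(A, i, j), number=num_iters) / num_iters)
print(timeit(lambda: ref_mx @ A, number=num_iters) / num_iters)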
