I have an upper-triangular matrix of np.float64 values, like this:
array([[ 1.,  2.,  3.,  4.],
       [ 0.,  5.,  6.,  7.],
       [ 0.,  0.,  8.,  9.],
       [ 0.,  0.,  0., 10.]])
I would like to convert this into the corresponding symmetric matrix, like this:
array([[ 1.,  2.,  3.,  4.],
       [ 2.,  5.,  6.,  7.],
       [ 3.,  6.,  8.,  9.],
       [ 4.,  7.,  9., 10.]])
The conversion can be done in place, or as a new matrix. I would like it to be as fast as possible. How can I do this quickly?
np.where seems quite fast in the out-of-place, no-cache scenario:
np.where(ut,ut,ut.T)
On my laptop:
timeit(lambda:np.where(ut,ut,ut.T))
# 1.909718865994364
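For context, here is a minimal, self-contained version of that measurement (an assumption on my part: ut is the 4x4 example from the question, and timeit's default number=1000000 makes the figure above roughly 1.9 µs per call):
import numpy as np
from timeit import timeit

ut = np.array([[ 1.,  2.,  3.,  4.],
               [ 0.,  5.,  6.,  7.],
               [ 0.,  0.,  8.,  9.],
               [ 0.,  0.,  0., 10.]])

print(np.where(ut, ut, ut.T))                   # the symmetric matrix from the question
print(timeit(lambda: np.where(ut, ut, ut.T)))   # total seconds for 1,000,000 calls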
If you have pythran installed you can speed this up about 3x with near zero effort. But note that, as far as I know, pythran (currently) only understands contiguous arrays.
File upp2sym.py, compile with pythran -O3 upp2sym.py:
import numpy as np
#pythran export upp2sym(float[:,:])
def upp2sym(a):
    return np.where(a, a, a.T)
Timing:
from upp2sym import *
timeit(lambda:upp2sym(ut))
# 0.5760842661838979
This is almost as fast as looping:
#pythran export upp2sym_loop(float[:,:])
def upp2sym_loop(a):
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i,i] = a[i,i]
        for j in range(i):
            out[i,j] = out[j,i] = a[j,i]
    return out
Timing:
timeit(lambda:upp2sym_loop(ut))
# 0.4794591029640287
We can also do it in place:
#pythran export upp2sym_inplace(float[:,:])
def upp2sym_inplace(a):
    for i in range(len(a)):
        for j in range(i):
            a[i,j] = a[j,i]
Timing:
timeit(lambda:upp2sym_inplace(ut))
# 0.28711927914991975
This is the fastest routine I've found so far that doesn't use Cython or a JIT like Numba. It takes about 1.6 μs on my machine to process a 4x4 array (average time over a list of 100K 4x4 arrays):
inds_cache = {}
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    try:
        inds = inds_cache[n]
    except KeyError:
        inds = np.tri(n, k=-1, dtype=bool)  # np.bool is deprecated/removed in newer NumPy
        inds_cache[n] = inds
    ut[inds] = ut.T[inds]
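A minimal usage sketch (my addition, assuming the function and inds_cache above are in scope, and using the 4x4 example from the question); note the conversion happens in place and a second call of the same size reuses the cached index mask:
import numpy as np

ut = np.array([[ 1.,  2.,  3.,  4.],
               [ 0.,  5.,  6.,  7.],
               [ 0.,  0.,  8.,  9.],
               [ 0.,  0.,  0., 10.]])

upper_triangular_to_symmetric(ut)   # modifies ut in place
assert np.allclose(ut, ut.T)        # ut is now symmetric
# A second 4x4 call hits inds_cache[4] instead of rebuilding the mask.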
Here are some other things I've tried that are not as fast:
The above code, but without the cache. Takes about 8.3 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    inds = np.tri(n, k=-1, dtype=bool)  # np.bool is deprecated/removed in newer NumPy
    ut[inds] = ut.T[inds]
A plain Python nested loop. Takes about 2.5 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
Floating point addition using np.triu. Takes about 11.9 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    ut += np.triu(ut, k=1).T
Numba version of the Python nested loop. This was the fastest thing I found (about 0.4 μs per 4x4 array), and it was what I ended up using in production, at least until I started running into issues with Numba and had to revert to a pure Python version:
import numba

@numba.njit()
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
Cython version of the Python nested loop. I'm new to Cython, so this may not be fully optimized. Since Cython adds operational overhead, I'm interested in hearing both Cython and pure-NumPy answers. Takes about 0.6 μs per 4x4 array:
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def upper_triangular_to_symmetric(np.ndarray[np.float64_t, ndim=2] ut):
    cdef int n, r, c
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
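Since Cython needs a build step, here is a hedged sketch of one way to compile it (the file name upper_tri.pyx and the setup.py layout are my assumptions, not part of the original answer):
# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("upper_tri.pyx", language_level=3),
    include_dirs=[np.get_include()],   # required because the .pyx does `cimport numpy`
)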
You are mainly measuring function call overhead on such tiny problems.
Another way to do this would be to use Numba. Let's start with an implementation for only one (4x4) array.
Only one 4x4 array
import numpy as np
import numba as nb

@nb.njit()
def sym(A):
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[j,i] = A[i,j]
    return A
A = np.array([[ 1.,  2.,  3.,  4.],
              [ 0.,  5.,  6.,  7.],
              [ 0.,  0.,  8.,  9.],
              [ 0.,  0.,  0., 10.]])
%timeit sym(A)
#277 ns ± 5.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Larger example
@nb.njit(parallel=False)
def sym_3d(A):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            for k in range(A.shape[2]):
                A[i,k,j] = A[i,j,k]
    return A
A=np.random.rand(1_000_000,4,4)
%timeit sym_3d(A)
#13.8 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#13.8 ns per 4x4 submatrix
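As a quick sanity check (my addition, reusing the np.where formulation from the first answer), the jitted loops do produce the expected symmetric result:
ut = np.array([[ 1.,  2.,  3.,  4.],
               [ 0.,  5.,  6.,  7.],
               [ 0.,  0.,  8.,  9.],
               [ 0.,  0.,  0., 10.]])
expected = np.where(ut, ut, ut.T)

assert np.allclose(sym(ut.copy()), expected)    # single 4x4 array
batch = np.repeat(ut[None, :, :], 10, axis=0)
assert np.allclose(sym_3d(batch), np.repeat(expected[None, :, :], 10, axis=0))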
Say I have a 2D pytorch tensor and a 2D numpy boolean as follows,
a = torch.tensor([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.],
[ 9., 10., 11.],
[12., 13., 14.]])
m = numpy.array([[ False, True, False],
[ True, False, True],
[ False, True, True],
[ False, False, False],
[ True, False, False]])
They have the same dimension and the number of True's in each column of m is the same.
I need to get the 2x3 tensor that is
a.transpose(0,1).masked_select(torch.from_numpy(m.transpose())).reshape(a.shape[1],-1).transpose(0,1)
which is
tensor([[ 3., 1., 5.],
[12., 7., 8.]])
The actual tensor is very large, and the operation needs to be performed many times. So I want to ask: what is an efficient (ideally the most efficient) way of doing this?
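For reference, the expression above as a self-contained, runnable snippet on the small example (just the question's own code gathered in one place):
import numpy
import torch

a = torch.tensor([[ 0.,  1.,  2.],
                  [ 3.,  4.,  5.],
                  [ 6.,  7.,  8.],
                  [ 9., 10., 11.],
                  [12., 13., 14.]])
m = numpy.array([[False, True,  False],
                 [True,  False, True ],
                 [False, True,  True ],
                 [False, False, False],
                 [True,  False, False]])

out = (a.transpose(0, 1)
        .masked_select(torch.from_numpy(m.transpose()))
        .reshape(a.shape[1], -1)
        .transpose(0, 1))
print(out)   # tensor([[ 3.,  1.,  5.], [12.,  7.,  8.]])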
In my benchmarks, a jitted Numba solution is the fastest I could find.
My benchmarks for a, m with shape (10000, 200) (all approaches return equal result tensors):
1. @numba.jit:         13.2 ms (3.46x)
2. list comprehension: 31.3 ms (1.46x)
3. baseline:           45.7 ms (1.00x)
Generation of sufficiently large sample data for benchmarking
import torch
import numpy as np
def generate_data(rows=500, columns=100):
    a = torch.from_numpy(np.random.uniform(1, 10, (rows, columns)).astype(np.float32))

    # argsort trick by @Divakar https://stackoverflow.com/a/55317373/14277722
    def shuffle_along_axis(a, axis):
        idx = np.random.rand(*a.shape).argsort(axis=axis)
        return np.take_along_axis(a, idx, axis=axis)

    m = shuffle_along_axis(np.full((columns, rows), np.random.randint(2, size=rows)), 1).astype('bool').T
    return a, np.ascontiguousarray(m)
a, m = generate_data(10000,200)
A jitted numba implementation
import numba as nb

@nb.njit
def gather2d(arr1, arr2):
    res = np.zeros((np.count_nonzero(arr2[:,0]), arr1.shape[1]), np.float32)
    counter = np.zeros(arr1.shape[1], dtype=np.intp)
    for i in range(arr1.shape[0]):
        for j in range(arr1.shape[1]):
            if arr2[i,j]:
                res[counter[j], j] = arr1[i,j]
                counter[j] += 1
    return res
torch.from_numpy(gather2d(a.numpy(),m))
Output
# %timeit 10 loops, best of 5: 13.2 ms per loop
tensor([[2.1846, 7.8890, 8.8218, ..., 4.8309, 9.2853, 6.4404],
[5.8842, 3.7332, 6.7436, ..., 1.2914, 3.2983, 3.5627],
[9.5128, 2.4283, 2.2152, ..., 4.9512, 9.7335, 9.6252],
...,
[7.3193, 7.8524, 9.6654, ..., 3.3665, 8.8926, 4.7660],
[1.3829, 1.3347, 6.6436, ..., 7.1956, 4.0446, 6.4633],
[6.4264, 3.6283, 3.6385, ..., 8.4152, 5.8498, 5.0281]])
Against a vectorized baseline solution
# %timeit 10 loops, best of 5: 45.7 ms per loop
a.gather(0, torch.from_numpy(np.nonzero(m.T)[1].reshape(-1, m.shape[1], order='F')))
A Python list comprehension turns out to be surprisingly fast
def g(arr1, arr2):
    return np.array([i[j] for i, j in zip(arr1.T, arr2.T)]).T
# %timeit 10 loops, best of 5: 31.3 ms per loop
torch.from_numpy(g(a.numpy(), m))
You can do it this way using only NumPy and PyTorch:
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
a[b,c].reshape(2,3)
#output
tensor([[ 1., 3., 5.],
[ 7., 8., 12.]]) # True values are taken on axis=1
I used the same example provided by @Michael Szczesny to measure the time:
import numpy as np
import timeit
import torch
rows, columns = (10000,200)
a = torch.from_numpy(np.random.uniform(1,10, (rows,columns)).astype(np.float32))
m = np.random.choice([False, True], size=(rows, columns))
starttime = timeit.default_timer()
b,c = m.nonzero()
b = torch.tensor(b)
c = torch.tensor(c)
a[b,c]
print(f"The time difference is :{(timeit.default_timer() - starttime)*1000} ms")
#output
The time difference is : 26.4 ms
It is better than the second and third approaches of @Michael Szczesny.
I'm writing a numerical code, where I'm using scipy.sparse.dia_matrix. My matrices are quite large (up to about 1000000 x 1000000), but very sparse. Sometimes tridiagonal, sometimes with a few more diagonals.
For various reasons, it is extremely convenient and clear from a coding point of view to just add together several of these matrices (of the same size, of course). However, I have found that adding these sparse matrices is very slow. The following example illustrates what I mean:
import numpy as np
from scipy.sparse import diags, dia_matrix
N = 100000
M1 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M2 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M3 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
def simple_add():
    M = M1 + M2 + M3

def complicated_add():
    M_ = dia_matrix((N, N))
    for d in [-1, 0, 1]:
        M_.setdiag(M1.diagonal(d) + M2.diagonal(d) + M3.diagonal(d), d)
%timeit simple_add()
%timeit complicated_add()
The output of the timing is:
16.9 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
959 µs ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I don't understand why adding the matrices together with the + operator is 17 times slower than creating an empty diagonal matrix and explicitly setting the diagonals. Is there anything I can do to speed this up? I would much prefer to keep the simpler expression with the + operator, as it's far more readable, but not at the expense of an order of magnitude increase in computational time.
Update:
I proposed a change in Scipy that would make addition of two instances of dia_matrix faster, and after a bit of discussion I submitted a pull request to Scipy, which has now been merged. So in the future, adding two instances of dia_matrix will no longer convert to csr_matrix.
https://github.com/scipy/scipy/pull/14004
diags makes a dia_matrix from the list of inputs:
In [84]: M=sparse.diags([np.arange(1,4),np.arange(1,5),np.arange(1,4)], offsets=[-1,0,1])
In [85]: M
Out[85]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements (3 diagonals) in DIAgonal format>
In [86]: M.offsets
Out[86]: array([-1, 0, 1], dtype=int32)
In [87]: M.data
Out[87]:
array([[1., 2., 3., 0.],
[1., 2., 3., 4.],
[0., 1., 2., 3.]])
The list of diagonals (of different lengths) has been transformed into a 2D array, with offsets. This is intended primarily as an input format. Most, if not all, math is implemented in the CSR format. And even there, matrix multiplication is the relative strong point. Element-wise math is distinctly inferior to the numpy array equivalents.
In [89]: Mr=M.tocsr()
In [90]: Mr
Out[90]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [91]: Mr.data
Out[91]: array([1., 1., 1., 2., 2., 2., 3., 3., 3., 4.])
In [92]: Mr.indices
Out[92]: array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3], dtype=int32)
In [93]: Mr.indptr
Out[93]: array([ 0, 2, 5, 8, 10], dtype=int32)
The dia format suggests a faster addition, if the offsets and shape are all the same.
In [94]: M.data += M.data + M.data
In [95]: M.data
Out[95]:
array([[ 3., 6., 9., 0.],
[ 3., 6., 9., 12.],
[ 0., 3., 6., 9.]])
In [96]: M.A
Out[96]:
array([[ 3., 3., 0., 0.],
[ 3., 6., 6., 0.],
[ 0., 6., 9., 9.],
[ 0., 0., 9., 12.]])
With any of the sparse formats, if the sparsity is the same for all arguments and output, you can often do math directly on the data attribute, leaving the implied 0's unchanged.
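As a concrete sketch of that idea (my addition, shown with a small N so the dense comparison stays cheap): the question's tridiagonal matrices all share the same offsets and shape, so the addition can be done directly on the data attribute:
import numpy as np
from scipy.sparse import diags

def tridiag(n):
    return diags([np.random.random(n - 1), np.random.random(n), np.random.random(n - 1)],
                 offsets=[-1, 0, 1])

M1, M2, M3 = tridiag(6), tridiag(6), tridiag(6)

# Same offsets and shape by construction, so the implied zero structure is
# identical and we can add the stored diagonals directly.
M = M1.copy()
M.data = M1.data + M2.data + M3.data

assert np.allclose(M.toarray(), (M1 + M2 + M3).toarray())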
The implementation of _add_sparse(self, other) is return self.tocsr()._add_sparse(other). The extra time goes into converting to a CSR matrix (which has a C extension for the addition).
Could you create a sparse matrix that does what you want? Probably.
from scipy.sparse import dia_matrix, isspmatrix_dia
class dia_matrix_adder(dia_matrix):
    def __add__(self, other):
        if not isspmatrix_dia(other):
            return super(dia_matrix_adder, self).__add__(other)
        M_ = dia_matrix((self.shape[0], self.shape[1]))
        for d in [-1, 0, 1]:
            M_.setdiag(self.diagonal(d) + other.diagonal(d), d)
        return M_
I would probably not do that and just write yourself a function:
def add_dia_matrix(*mats):
    if len(mats) == 1:
        return mats[0]
    M_ = dia_matrix((mats[0].shape[0], mats[0].shape[1]))
    for d in [-1, 0, 1]:
        M_diag = mats[0].diagonal(d).copy()
        for i in range(1, len(mats)):
            M_diag += mats[i].diagonal(d)
        M_.setdiag(M_diag, d)
    return M_
This should be as readable as a bunch of + without having to deal with a new class.
%timeit simple_add()
30.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit complicated_add()
1.28 ms ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit add_dia_matrix(M1, M2, M3)
1.22 ms ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
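A quick check (my addition) that the helper returns the same matrix as the + operator on the question's M1, M2, M3:
M_simple = M1 + M2 + M3                # the + operator (goes through CSR internally)
M_fast = add_dia_matrix(M1, M2, M3)    # stays in DIA format

assert abs(M_simple - M_fast).max() < 1e-12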
I need to fill N rectangle-shaped regions of a 0-filled two-dimensional array with ones. The regions to fill are stored in an Nx4 numpy array, where each row contains the rectangle bounds (x_low, x_high, y_low, y_high). This is currently the slowest part of what I'm working on, and I'm wondering if it can be done any faster.
Currently this is done by simply iterating over the region array, and the target array is filled with ones using slices:
import numpy as np
def fill_array_with_ones(coordinates_array, target_array):
    for row in coordinates_array:
        target_array[row[0]:row[1], row[2]:row[3]] = 1
coords = np.array([[1,3,1,3], [3,5,3,5]])
target = np.zeros((5,5))
fill_array_with_ones(coords, target)
print(target)
Output:
array([[0., 0., 0., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 0., 0., 1., 1.],
[0., 0., 0., 1., 1.]])
I was expecting that there is some numpy magic that would allow me to do it in a vectorized manner, which would get rid of iterating over rows and, hopefully, lead to faster execution:
target[bounds_to_slices(coords)] = 1
I did some tests comparing the method mentioned in the comments and the for-loop method. I'm wondering: is this really your bottleneck?
# prepare data
import numpy as np
a = np.zeros((1024, 1024), '?')
bound = np.random.randint(0, 1024, (9999,4), 'H')
# build indices, as indices can be pre-computed, don't time it
x = np.arange(1024, dtype='H')[:,None]
y = x[:,None]
# the force vectorized method
%%timeit
ymask = y >= bound[:,0]
ymask &= y < bound[:,1]
xmask = x >= bound[:,2]
xmask &= x < bound[:,3]
a[(ymask & xmask).any(2)] = True
# outputs 3.06 s ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# the normal method
%%timeit
for i,j,k,l in bound:
    a[i:k, j:l] = True
# outputs 22.8 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not only is the "vectorized" method always slower no matter how many bounding boxes there are, it also generates a 10 GB temporary array here. The normal method, on the other hand, is reasonably fast.
Consider the function below:
import numpy as np
a = np.ones(16).reshape(4,4)
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
print(fn(a))
That is, for a general function that computes step t+1 from step t along an array, can I make this faster? I'm aware of np.vectorize, but it doesn't seem appropriate for this case.
You can use cumsum; I think that would be helpful.
import numpy as np
import pandas as pd
a = np.ones(16).reshape(4,4)
df =pd.DataFrame(a)
df.cumsum(axis=1)
Or you can use np.cumsum():
np.cumsum(a,axis=1)
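A quick check (my addition) that this matches the fn defined in the question:
import numpy as np

a = np.ones(16).reshape(4, 4)
assert np.allclose(np.cumsum(a, axis=1), fn(a))   # fn as defined in the question
print(np.cumsum(a, axis=1))
# [[1. 2. 3. 4.]
#  [1. 2. 3. 4.]
#  [1. 2. 3. 4.]
#  [1. 2. 3. 4.]]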
It is possible to reduce the two for loops to one for loop, with a little bit of additional copying overhead.
In [86]: a
Out[86]:
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
In [87]: b = a.copy()
In [88]: for col in range(b.shape[1]-1):
    ...:     b[:, col+1] = np.sum(a[:, :col+2], axis=1)
In [89]: b
Out[89]:
array([[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.]])
To make this work for a generic function, you can look for an equivalent function in numpy or implement one using numpy operations (a vectorized one). For the example you provided, I just used numpy.sum(), which does the job for us.
In terms of performance, this approach would be much better than operating with two for loops at the index level, particularly for larger arrays. In the approach used above, we work with slices of columns.
Here are the timings, which suggest more than a 3x speedup over the native Python implementation.
Native Python:
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
Slightly vectorized:
In [104]: def slightly_vectorized(b):
     ...:     for col in range(b.shape[1]-1):
     ...:         b[:, col+1] = np.sum(a[:, :col+2], axis=1)
     ...:     return b
In [100]: a = np.ones(625).reshape(25, 25)
In [101]: %timeit fn(a)
303 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [102]: b = a.copy()
In [103]: %timeit slightly_vectorized(b)
99.8 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
What you're looking for is called accumulate; here's an example:
import numpy as np
from itertools import accumulate
def fn(a):
    acc = accumulate(a, lambda prev, row: prev + row)
    return np.array(list(acc))
a = np.arange(16).reshape(4, 4)
print(fn(a))
# [[ 0 1 2 3]
# [ 4 6 8 10]
# [12 15 18 21]
# [24 28 32 36]]
There is no optimized accumulate function in numpy because it's not really possible to write accumulate in a way that's both performant and general. The Python implementation is general, but will perform much like a hand-coded loop.
To get optimal performance you'll probably need to find or write a low-level implementation of the specific accumulate function that you need. You've already mentioned Numba, and you could also look into Cython.
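For instance, here is a hedged Numba sketch of exactly the loop from the question (my code, not part of the answer); the update rule is hard-coded, since a jitted version can't take an arbitrary Python callable for free:
import numba
import numpy as np

@numba.njit
def fn_jit(a):
    b = a.copy()
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i, j + 1] += b[i, j]   # replace with whatever per-step update you need
    return b

a = np.ones(16).reshape(4, 4)
print(fn_jit(a))   # same result as the pure-Python fn / np.cumsum(a, axis=1)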
I have two 1D numpy arrays A and B of size (n, ) and (m, ) respectively which correspond to the x positions of points on a line. I want to calculate the distance between every point in A to every point in B. I then need to use these distances at a set y distance, d, to work out the potential at each point in A.
I'm currently using the following:
V = numpy.zeros(n)
for i in range(n):
    xdist = A[i] - B
    r = numpy.sqrt(xdist**2 + d**2)
    dV = 1/r
    V[i] = numpy.sum(dV)
This works, but for large data sets it can take a while. I would like to use a function similar to scipy.spatial.distance.cdist, but that doesn't work for 1D arrays, and I don't want to add another dimension to the arrays as they would become too large.
Vectorized approach
One vectorized approach after extending A to 2D with the introduction of a new axis using np.newaxis/None and thus making use of broadcasting would be -
(1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
Hybrid approach for large arrays
Now, for large arrays, we might have to divide the data into chunks.
Thus, with BSZ as the block size, we would have a hybrid approach, like so -
dsq = d**2
V = np.zeros((n//BSZ,BSZ))
for i in range(n//BSZ):
    V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
Runtime test
Approaches -
def original_app(A,B,d):
    V = np.zeros(n)
    for i in range(n):
        xdist = A[i] - B
        r = np.sqrt(xdist**2 + d**2)
        dV = 1/r
        V[i] = np.sum(dV)
    return V

def vectorized_app1(A,B,d):
    return (1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)

def vectorized_app2(A,B,d, BSZ = 100):
    dsq = d**2
    V = np.zeros((n//BSZ,BSZ))
    for i in range(n//BSZ):
        V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
    return V.ravel()
Timings and verification -
In [203]: # Setup inputs
...: n,m = 10000,2000
...: A = np.random.rand(n)
...: B = np.random.rand(m)
...: d = 10
...:
In [204]: out1 = original_app(A,B,d)
...: out2 = vectorized_app1(A,B,d)
...: out3 = vectorized_app2(A,B,d, BSZ = 100)
...:
...: print np.allclose(out1, out2)
...: print np.allclose(out1, out3)
...:
True
True
In [205]: %timeit original_app(A,B,d)
10 loops, best of 3: 133 ms per loop
In [206]: %timeit vectorized_app1(A,B,d)
10 loops, best of 3: 138 ms per loop
In [207]: %timeit vectorized_app2(A,B,d, BSZ = 100)
10 loops, best of 3: 65.2 ms per loop
We can play around with the parameter block size BSZ -
In [208]: %timeit vectorized_app2(A,B,d, BSZ = 200)
10 loops, best of 3: 74.5 ms per loop
In [209]: %timeit vectorized_app2(A,B,d, BSZ = 50)
10 loops, best of 3: 67.4 ms per loop
Thus, the best one seems to be giving a 2x speedup with a block size of 100 at my end.
EDIT: My answer turned out to be nearly identical to Divakar's after a closer look. However, you can save some memory by doing the operations in place. Taking the sum along the second axis is more efficient than along the first.
import numpy

a = numpy.random.randint(0, 10, 10) * 1.
b = numpy.random.randint(0, 10, 10) * 1.
d = 10.  # the fixed y-distance from the question

xdist = a[:,None] - b
xdist **= 2
xdist += d**2
xdist **= -0.5   # 1/sqrt(xdist**2 + d**2), computed in place
V = numpy.sum(xdist, axis=1)
which gives the same solution as your code.
I would like to use a function similar to scipy.spatial.distance.cdist, but that doesn't work for 1D arrays, and I don't want to add another dimension to the arrays as they would become too large.
cdist works fine, you just have to reshape the arrays to have shape (n, 1) instead of (n,). You can add another dimension to a one-dimensional array A without copying the underlying data by using A[:, None] or A.reshape(-1, 1).
For example,
In [56]: from scipy.spatial.distance import cdist
In [57]: A
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [58]: B
Out[58]: array([0, 2, 4, 6, 8])
In [59]: A[:, None]
Out[59]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
In [60]: cdist(A[:, None], B[:, None])
Out[60]:
array([[ 0., 2., 4., 6., 8.],
[ 1., 1., 3., 5., 7.],
[ 2., 0., 2., 4., 6.],
[ 3., 1., 1., 3., 5.],
[ 4., 2., 0., 2., 4.],
[ 5., 3., 1., 1., 3.],
[ 6., 4., 2., 0., 2.],
[ 7., 5., 3., 1., 1.],
[ 8., 6., 4., 2., 0.],
[ 9., 7., 5., 3., 1.]])
To compute V as shown in your code, you can use cdist with metric='sqeuclidean', as follows:
In [72]: d = 3.
In [73]: r = np.sqrt(cdist(A[:,None], B[:,None], metric='sqeuclidean') + d**2)
In [74]: V = (1/r).sum(axis=1)
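A short check (my addition) that this matches the loop from the question:
import numpy as np
from scipy.spatial.distance import cdist

n, m, d = 1000, 500, 3.
A = np.random.rand(n)
B = np.random.rand(m)

# original loop from the question
V_loop = np.zeros(n)
for i in range(n):
    V_loop[i] = np.sum(1 / np.sqrt((A[i] - B)**2 + d**2))

# cdist-based version from this answer
r = np.sqrt(cdist(A[:, None], B[:, None], metric='sqeuclidean') + d**2)
V_cdist = (1 / r).sum(axis=1)

assert np.allclose(V_loop, V_cdist)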