Efficient sum of Gaussians in 3D with NumPy using large arrays

Efficient sum of Gaussians in 3D with NumPy using large arrays - python

I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial
#create meshgrid
side = 100.
n = 64 #could be 256 or larger
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
#convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
#create some coordinates
coords = np.random.random(size=(1000,3))*side - side/2
def sumofgauss(coords,xyz,sigma):
"""Simple isotropic gaussian sum at coordinate locations."""
n = int(round(xyz.shape[0]**(1/3.))) #get n samples for reshaping to 3D later
#this version overloads memory
#dist = spatial.distance.cdist(coords, xyz)
#dist *= dist
#values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
#values = np.sum(values,axis=0)
#run cdist in a loop over coords to avoid overloading memory
values = np.zeros((xyz.shape[0]))
for i in range(coords.shape[0]):
dist = spatial.distance.cdist(coords[None,i], xyz)
dist *= dist
values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
return values.reshape(n,n,n)
image = sumofgauss(coords,xyz,1.0)
import matplotlib.pyplot as plt
plt.imshow(image[n/2]) #show a slice
plt.show()
M = 1000, N = 64 (~5 seconds):
M = 1000, N = 256 (~10 minutes):

Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances which are greater than a threshhold is usually faster with KDTree:
import numpy as np
from scipy.spatial import cKDTree # so we can get a `coo_matrix` output
def gaussgrid(coords, sigma = 1, n = 64, side = 100, eps = None):
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
if eps is None:
eps = np.finfo('float64').eps
thr = -np.log(eps) * 2 * sigma**2
data_tree = cKDTree(coords)
discr = 1000 # you can tweak this to get best results on your system
values = np.empty(n**3)
for i in range(n**3//discr + 1):
slc = slice(i * discr, i * discr + discr)
grid_tree = cKDTree(xyz[slc])
dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type = 'coo_matrix')
dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data/(2*sigma**2))
values[slc] = dists.sum(1).squeeze()
return values.reshape(n,n,n)
Now, even if you keep eps = None it'll be a bit faster as you're still returning about 10% your distances, but with eps = 1e-6 or so, you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop

Related

optimize this numpy operation

I have inherited some code and there is one particular operation that takes an inordinate amount of time.
The operation is defined as:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is expensive...
N_list = np.array(list(map(weightfun, X_flat)))
This takes hours to compute on my machine. I am wondering if there is a way to optimize this. The code is computing normalized hamming distances between vector sequences.

weightfun performs two dot product operations for every row of X_flat. The worst one is np.dot(X_flat, x), where the dot product is performed against the whole X_flat matrix. But there's a trick to speed things up. The iterative part in the first dot product can be computed only once with:
X_matmut = X_flat # X_flat.T
Also, I noticed that the second dot product is nothing more than the diagonal of the result of the first one.
The rewritten code looks like this:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
X1 = X_flat # X_flat.T
X2 = X1.diagonal()
N_list = 1.0 / (X1/X2 > 1 - cutoff).sum(axis=0)
Edit
For such a large input, when performing the operation above the memory becomes the new bottleneck as the new matrix won't fit into RAM. So there's also the option of breaking the computation into chunks, as the code below shows.
The code gets a little messy, but at least it didn't try to destroy my PC :-P
import numpy as np
import time
# Sample data
X = np.random.random([76187, 247, 20])
start = time.time()
cutoff = 0.2
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# Divide data into 20 chuncks
X_parts = np.array_split(X_flat, 20)
# Diagonal will be saved incrementally
diagonal = []
for i in range(len(X_parts)):
part = X_parts[i]
X_parts[i] = part # X_flat.T
diagonal.extend(X_parts[i][range(len(X_parts[i])), range(len(diagonal), len(diagonal)+len(X_parts[i]))])
# Performs the second part of the calculation
diagonal = np.array(diagonal)
X_list = np.zeros(len(diagonal))
for x in X_parts:
X_list += (x/diagonal > 1 - cutoff).sum(axis=0)
X_list = 1.0 / X_list
print('Time to solve: %.2f secs' % (time.time() - start))
I would love to be able to perform all the computation on a single loop and discard the used chunks, but it is obligatory to run over the whole matrix once to retrieve the diagonal. Don't believe it's worth to compute everything twice to save memory.
While I use a decent setup (16 GB of RAM in a i7 intel and SSD drive for storage), the whole processing took me around 15 minutes.

Find location where smaller array matches larger array the most

I need to find where a smaller 2d array, array1 matches the closest inside another 2d array, array2.
array1 with have the size of grid_size 46x46 to 96x96.
array2 will be larger (184x184).
I only have access to numpy.
I am currently trying to use the Tversky formula but am not tied to it.
Efficiency is the most important part as this will run many times. My current solution shown below is very slow.
for i in range(array2.shape[0] - grid_size):
for j in range(array2.shape[1] - grid_size):
r[i, j] = np.sum(array2[i:i+grid_size, j:j+grid_size] == array1 ) / (np.sum(array2[i:i+grid_size, j:j+grid_size] != array1 ) + np.sum(Si[i:i+grid_size, j:j+grid_size] == array1 ))
Edit:
The goal is to find the location where a smaller image matches another image.

Here is an FFT/convolution based approach that minimizes Euclidean distance:
import numpy as np
from numpy import fft
N = 184
n = 46
pad = 192
def best_offs(A,a):
A,a = A.astype(float),a.astype(float)
Ap,ap = (np.zeros((pad,pad)) for _ in "Aa")
Ap[:N,:N] = A
ap[:n,:n] = a
sim = fft.irfft2(fft.rfft2(ap).conj()*fft.rfft2(Ap))[:N-n+1,:N-n+1]
Ap[:N,:N] = A*A
ap[:n,:n] = 1
ref = fft.irfft2(fft.rfft2(ap).conj()*fft.rfft2(Ap))[:N-n+1,:N-n+1]
return np.unravel_index((ref-2*sim).argmin(),sim.shape)
# example
# random picture
A = np.random.randint(0,256,(N,N),dtype=np.uint8)
# random offset
offy,offx = np.random.randint(0,N-n+1,2)
# sub pic at random offset
# randomly flip half of the least significant 75% of all bits
a = A[offy:offy+n,offx:offx+n] ^ np.random.randint(0,64,(n,n))
# reconstruct offset
oyrec,oxrec = best_offs(A,a)
assert offy==oyrec and offx==oxrec
# speed?
from timeit import timeit
print(timeit(lambda:best_offs(A,a),number=100)*10,"ms")
# example with zero a
a[...] = 0
# make A smaller in a matching subsquare
A[offy:offy+n,offx:offx+n]>>=1
# reconstruct offset
oyrec,oxrec = best_offs(A,a)
assert offy==oyrec and offx==oxrec
Sample run:
3.458537160186097 ms

Calculate the Euclidean distance for 2 different size arrays [duplicate]

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:
xy1=numpy.array(
[[ 243, 3173],
[ 525, 2997]])
xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])
My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.
mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))
for i,xy in enumerate(xy1):
dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
mindist[i],minid[i]=dists.min(),dists.argmin()
Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.
Another way to look at the problem. Say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate a n x n distance matrix from those coordinates from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?

(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed
here .
# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist
#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1
# change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )
title = "%s dim %d nx %d ny %d metric %s" % (
__file__, dim, nx, ny, metric )
print "\n", title
#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric ) # -> (nx, ny) distances
#...............................................................................
print "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
X.shape, Y.shape, dist.shape )
print "dist average %.3g +- %.2g" % (dist.mean(), dist.std())
print "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
dist[0,3], cdist( [X[0]], [Y[3]] ))
# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2 ~ .4 +- .2/sqrt dim

To compute the m by p matrix of distances, this should work:
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
the .outer calls make two such matrices (of scalar differences along the two axes), the .hypot calls turns those into a same-shape matrix (of scalar euclidean distances).

The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.
Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.
All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.
This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows
import numpy as np
from scipy import spatial
xy1 = np.array(
[[243, 3173],
[525, 2997]])
xy2 = np.array(
[[682, 2644],
[277, 2651],
[396, 2640]])
# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)
# This solution by #denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)
where mindist is the minimum distance between each point in xy1 and the set of points in xy2

For what you're trying to do:
dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)
Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:
dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])

import numpy as np
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)

I think the following function also works.
import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
Y = X if Y is None else Y
xx = (X ** 2).sum(axis = 1)[:, None]
yy = (Y ** 2).sum(axis = 1)[:, None]
return xx + yy.T - 2 * (X # Y.T)
Explanation
Suppose each row of X and Y are coordinates of the two sets of points.
Let their sizes be m X p and p X n respectively.
The result will produce a numpy array of size m X n with the (i, j)-th entry being the distance between the i-th row and the j-th row of X and Y respectively.

I highly recommend using some inbuilt method for calculating squares, and roots for they are customized for optimized way to calculate and very safe against overflows.
#alex answer below is the most safest in terms of overflow and should also be very fast. Also for single points you can use math.hypot which now supports more than 2 dimensions.
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
Safety concerns
i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# np.hypot for 2d points
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
overflow/underflow/speeds

I think that the most straightforward and efficient solution is to do it like this:
distances = np.linalg.norm(xy1, xy2) # calculate the euclidean distances between the test point and the training features.
min_dist = numpy.min(dists, axis=1) # get the minimum distance
min_id = np.argmi(distances) # get the index of the class with the minimum distance, i.e., the minimum difference.

Although many answers here are great, there is another way which has not been mentioned here, using numpy's vectorization / broadcasting properties to compute the distance between each points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elengantly while remaining very efficient.
Assuming you have two arrays like so:
# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)
You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.
deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the distance between each points
distance_matrix = (deltas ** 2).sum(axis=2)
The distance_matrix has a shape (6,4): for each point in a, the distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do :
distance_matrix.argmin(axis=1)
This returns the index of the point in b that is closest to each point of a.

Converting a nested loop calculation to Numpy for speedup

Part of my Python program contains the follow piece of code, where a new grid
is calculated based on data found in the old grid.
The grid i a two-dimensional list of floats. The code uses three for-loops:
for t in xrange(0, t, step):
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
gr = new_gr
return gr
The code is extremly slow for a large grid and a large time t.
I've tried to use Numpy to speed up this code, by substituting the inner loop
with:
J = np.arange(1, width-1)
new_gr[h][J] = gr[h][J] + gr[h][J-1] ...
But the results produced (the floats in the array) are about 10% smaller than
their list-calculation counterparts.
What loss of accuracy is to be expected when converting lists of floats to Numpy array of floats using np.array(pylist) and then doing a calculation?
How should I go about converting a triple for-loop to pretty and fast Numpy code? (or are there other suggestions for speeding up the code significantly?)

If gr is a list of floats, the first step if you are looking to vectorize with NumPy would be to convert gr to a NumPy array with np.array().
Next up, I am assuming that you have new_gr initialized with zeros of shape (height,width). The calculations being performed in the two innermost loops basically represent 2D convolution. So, you can use signal.convolve2d with an appropriate kernel. To decide on the kernel, we need to look at the scaling factors and make a 3 x 3 kernel out of them and negate them to simulate the calculations we are doing with each iteration. Thus, you would have a vectorized solution with the two innermost loops being removed for better performance, like so -
import numpy as np
from scipy import signal
# Get the scaling factors and negate them to get kernel
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
# Initialize output array and run 2D convolution and set values into it
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
Verify output and runtime tests
Define functions :
def org_app(gr,t):
new_gr = np.zeros((height,width))
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
return new_gr
def proposed_app(gr,t):
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
return out
Verify -
In [244]: # Inputs
...: gr = np.random.rand(40,50)
...: height,width = gr.shape
...: t = 1
...:
In [245]: np.allclose(org_app(gr,t),proposed_app(gr,t))
Out[245]: True
Timings -
In [246]: # Inputs
...: gr = np.random.rand(400,500)
...: height,width = gr.shape
...: t = 1
...:
In [247]: %timeit org_app(gr,t)
1 loops, best of 3: 2.13 s per loop
In [248]: %timeit proposed_app(gr,t)
10 loops, best of 3: 19.4 ms per loop

#Divakar, I tried a couple of variations on your org_app. The fully vectorized version is:
def org_app4(gr,t):
new_gr = np.zeros((height,width))
I = np.arange(1,height-1)[:,None]
J = np.arange(1,width-1)
new_gr[I,J] = gr[I,J] + gr[I,J-1] + gr[I-1,J] + t * gr[I+1,J-1]-2 * (gr[I,J-1] + t * gr[I-1,J])
return new_gr
While half the speed of your proposed_app, it is closer in style to the original. And thus may help with understanding how nested loops can be vectorized.
An important step is the conversion of I into a column array, so that together I,J index a block of values.

Improving performance of iterative 2D Numpy array with multivariate random generator

In a UxU periodic domain, I simulate the dynamics of a 2D array, with entries denoting x-y coordinates. At each time step, the "parent" entries are replaced by new coordinates selected from their normally distributed "offsprings", keeping the array size the same. To illustrate:
import numpy as np
import random
np.random.seed(13)
def main(time_step=10):
def dispersal(self, litter_size_):
return np.random.multivariate_normal([self[0], self[1]], [[sigma**2*1, 0], [0, 1*sigma**2]], litter_size_) % U
U = 10
sigma = 2.
parent = np.random.random(size=(4,2))*U
for t in range(time_step):
offspring = []
for parent_id in range(len(parent)):
litter_size = np.random.randint(1,4) # 1-3 offsprings reproduced per parent
offspring.append(dispersal(parent[parent_id], litter_size))
offspring = np.vstack(offspring)
indices = np.arange(len(offspring))
parent = offspring[np.random.choice(indices, 4, replace=False)] # only 4 survives to parenthood
return parent
However, the function can be inefficient to run, indicated by:
from timeit import timeit
timeit(main, number=10000)
that returns 40.13353896141052 secs.
A quick check with cProfile seems to identify Numpy's multivariate_normal function as a bottleneck.
Is there a more efficient way to implement this operation?

Yeah many functions in Numpy are relatively expensive if you use them on single numbers, as multivariate_normal shows in this case. Because the number of offspring is within the narrow range of [1, 3] it's worthwhile to pre-compute random samples. We can take samples around mean=(0,0) and during the iteration add the actual coordinates of the parents.
Also the inner loop can be vectorized. Resulting in:
def main_2(time_step=10, n_parent=4, max_offspring=3):
U = 10
sigma = 2.
cov = [[sigma**2, 0], [0, sigma**2]]
size = n_parent * max_offspring * time_step
samples = np.random.multivariate_normal(np.zeros(2), cov, size)
parents = np.random.rand(n_parent, 2) * U
for _ in range(time_step):
litter_size = np.random.randint(1, max_offspring+1, n_parent)
n_offspring = litter_size.sum()
parents = np.repeat(parents, litter_size, axis=0)
offspring = (parents + samples[:n_offspring]) % U
samples = samples[n_offspring:]
parents = np.random.permutation(offspring)[:n_parent]
return parents
The timings I get are:
In [153]: timeit(main, number=1000)
Out[153]: 9.255848071099535
In [154]: timeit(main_2, number=1000)
Out[154]: 0.870663221881841

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficient sum of Gaussians in 3D with NumPy using large arrays - python

Related

optimize this numpy operation

Find location where smaller array matches larger array the most

Calculate the Euclidean distance for 2 different size arrays [duplicate]

Converting a nested loop calculation to Numpy for speedup

Improving performance of iterative 2D Numpy array with multivariate random generator

Categories

Resources