optimize this numpy operation

I have inherited some code and there is one particular operation that takes an inordinate amount of time.
The operation is defined as:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
# This is expensive...
N_list = np.array(list(map(weightfun, X_flat)))
This takes hours to compute on my machine. I am wondering if there is a way to optimize this. The code is computing normalized hamming distances between vector sequences.

weightfun performs two dot products for every row of X_flat. The expensive one is np.dot(X_flat, x), which multiplies the whole X_flat matrix by that row. But there's a trick to speed things up: the repeated part of the first dot product can be computed once for all rows with:
X_matmul = X_flat @ X_flat.T
Also, the second dot product is nothing more than the diagonal of that result.
The rewritten code looks like this:
cutoff = 0.2
# X has shape (76187, 247, 20)
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
X1 = X_flat @ X_flat.T
X2 = X1.diagonal()
N_list = 1.0 / (X1/X2 > 1 - cutoff).sum(axis=0)
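A quick way to convince yourself that the rewrite matches the original is to run both on a small array. This is only a sketch: the shape (50, 6, 4) and the random 0/1 data are assumptions chosen so the check finishes instantly, not the real input.

```python
import numpy as np

rng = np.random.default_rng(0)
# Small 0/1 stand-in for the real (76187, 247, 20) array (assumed data).
X = (rng.random((50, 6, 4)) > 0.5).astype(float)
cutoff = 0.2
X_flat = X.reshape(X.shape[0], -1)

# Original formulation: one pass over X_flat for every row.
weightfun = lambda x: 1.0 / np.sum(np.dot(X_flat, x) / np.dot(x, x) > 1 - cutoff)
N_loop = np.array(list(map(weightfun, X_flat)))

# Vectorized formulation: one matrix product, then column j is divided
# by its own diagonal entry through broadcasting.
X1 = X_flat @ X_flat.T
X2 = X1.diagonal()
N_vec = 1.0 / (X1 / X2 > 1 - cutoff).sum(axis=0)

print(np.allclose(N_loop, N_vec))
```

Summing over axis=0 matters here: column j of X1 holds the dot products of every row with row j, which is exactly what weightfun computes for row j.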
Edit
For an input this large, memory becomes the new bottleneck: the 76187 x 76187 product matrix takes roughly 46 GB as float64 and won't fit into RAM. So there's also the option of breaking the computation into chunks, as the code below shows.
The code gets a little messy, but at least it didn't try to destroy my PC :-P
import numpy as np
import time
# Sample data
X = np.random.random([76187, 247, 20])
start = time.time()
cutoff = 0.2
X_flat = X.reshape((X.shape[0], X.shape[1] * X.shape[2]))
# Divide data into 20 chunks
X_parts = np.array_split(X_flat, 20)
# Diagonal will be saved incrementally
diagonal = []
for i in range(len(X_parts)):
    part = X_parts[i]
    X_parts[i] = part @ X_flat.T
    # Row r of this chunk is global row len(diagonal) + r, so its diagonal entry sits in that column
    diagonal.extend(X_parts[i][range(len(X_parts[i])), range(len(diagonal), len(diagonal)+len(X_parts[i]))])
# Performs the second part of the calculation
diagonal = np.array(diagonal)
X_list = np.zeros(len(diagonal))
for x in X_parts:
    X_list += (x/diagonal > 1 - cutoff).sum(axis=0)
X_list = 1.0 / X_list
print('Time to solve: %.2f secs' % (time.time() - start))
I would love to perform the whole computation in a single loop and discard each chunk once it has been used, but the whole matrix has to be traversed once to retrieve the diagonal. I don't believe it's worth computing everything twice to save memory.
On a decent setup (Intel i7, 16 GB of RAM and an SSD for storage), the whole run took me around 15 minutes.
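One possible way around the two-pass limitation, offered as a sketch rather than as part of the answer above: the diagonal of X_flat @ X_flat.T is just the squared norm of each row, so it can be computed directly without forming the product. Each chunk can then be processed and discarded in a single loop. The helper name chunked_weights and the small test array are assumptions for illustration.

```python
import numpy as np

def chunked_weights(X_flat, cutoff=0.2, n_chunks=20):
    """Hypothetical helper: same result as the two-pass code, one chunk in memory at a time."""
    # diag(X_flat @ X_flat.T)[j] equals dot(X_flat[j], X_flat[j]).
    diagonal = np.einsum('ij,ij->i', X_flat, X_flat)
    counts = np.zeros(X_flat.shape[0])
    for part in np.array_split(X_flat, n_chunks):
        # block has shape (len(part), N) and is released after this iteration.
        block = part @ X_flat.T
        counts += (block / diagonal > 1 - cutoff).sum(axis=0)
    return 1.0 / counts

# Small 0/1 stand-in for the real data (assumed shape).
rng = np.random.default_rng(0)
X = (rng.random((200, 5, 4)) > 0.5).astype(float)
X_flat = X.reshape(X.shape[0], -1)
N_list = chunked_weights(X_flat, n_chunks=7)
```

With 76187 rows and 20 chunks, each block would be a little over 2 GB as float64 (plus a temporary of the same size for the division), so n_chunks is the knob to turn if memory is tight.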

Related

(Theoretically) fast-performing routine for tensor manipulation runs slower than the non-optimized version

The goal of this implementation is to manipulate a 4-dimensional tensor in order to contract it with itself and then truncate the inflated indices back to their original length.
I have two routines that differ in the following way:
Routine 1: memory scales with the dimension of the tensor to the power of 5, since tensorB2 and tensorC, which have shape (dimension,dimension,dimension,dimension,dimension), have to be computed.
For dimension = 60 this takes about 66 s.
Routine 2 computes exactly the same thing, but the computations of tensorB2 and tensorC have been split so that no tensor bigger than (dimension,dimension,dimension,dimension) is ever computed. Memory thus scales with dimension^4, but the routine takes about 30 min.
Again, I want to underline that these routines compute exactly the same result and that, in theory, the computational complexity of Routine 2 should be even better than that of Routine 1.
I want to optimize Routine 2.
My guess is that I am using numpy's ndarrays wrong, and am therefore repeating useless operations such as copying tensors in memory when not needed.
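One way to test the guess about needless copies is to ask NumPy directly. The sketch below uses a small stand-in tensor (dimension = 6 is an assumption to keep it quick) and checks whether the kind of slice taken in Routine 2 is a view or a copy, and whether it is contiguous. Whether ncon copies such views internally before contracting is not something this check can confirm.

```python
import numpy as np

dimension = 6  # small stand-in for 60 (assumed)
myTensor = np.random.rand(dimension, dimension, dimension, dimension)

# Basic slicing, as in Routine 2: fixes the third index.
tensorSlice = myTensor[:, :, 0, :]
is_view = np.shares_memory(tensorSlice, myTensor)
is_contiguous = tensorSlice.flags['C_CONTIGUOUS']

# An explicit contiguous copy, for comparison.
contiguous = np.ascontiguousarray(tensorSlice)
copy_shares = np.shares_memory(contiguous, myTensor)

print(is_view, is_contiguous, copy_shares)
```

Separately from copying, the three nested loops of Routine 2 run 60 * 60 * 60 = 216,000 innermost iterations, each issuing two small ncon calls, so per-call Python overhead is another plausible cost worth profiling.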
Imports and functions
import numpy as np
from ncon import ncon
def truncate_matrix(matrix,index,dim):
    shape=list(matrix.shape)
    shape[index]=dim
    newMatrix= matrix[0:shape[0],0:shape[1]]
    return newMatrix
Routine 1 (memory scaling dimension^5)
dimension = 60
myTensor = np.random.rand(dimension,dimension,dimension,dimension)
tensorA = ncon([myTensor,myTensor],[[-1,1,2,-3],[-2,1,2,-4]])
tensorB = ncon([myTensor,myTensor],[[-1,1,-3,2],[-2,1,-4,2]])
tensorQ = ncon([tensorA,tensorB],[[-1,-3,1,2],[-2,-4,1,2]])
del tensorA,tensorB
#Finding Utr
Qshape = tensorQ.shape
tensorQ = tensorQ.reshape((Qshape[0]*Qshape[1],Qshape[2]*Qshape[3]))
Ul, deltaL, _ = np.linalg.svd(tensorQ)
Utr = np.array(truncate_matrix(Ul, 1, dimension))
Utr = Utr.reshape((Qshape[0],Qshape[1],dimension))
del tensorQ,Ul,Qshape
#Reconstructing T
tensorB2 = ncon([Utr,myTensor],[[1,-5,-1],[1,-4,-2,-3]])
tensorC = ncon([tensorB2,myTensor],[[-1,-2,1,-4,2],[2,-5,1,-3]])
del tensorB2
newTensor1 = ncon([Utr,tensorC],[[1,2,-2],[-1,-3,-4,1,2]])
del tensorC
Routine 2 (memory scaling dimension^4)
dimension = 60
myTensor = np.random.rand(dimension,dimension,dimension,dimension)
tensorA = ncon([myTensor,myTensor],[[-1,1,2,-3],[-2,1,2,-4]])
tensorB = ncon([myTensor,myTensor],[[-1,1,-3,2],[-2,1,-4,2]])
tensorQ = ncon([tensorA,tensorB],[[-1,-3,1,2],[-2,-4,1,2]])
#Finding Utr
Qshape = tensorQ.shape
tensorQ = tensorQ.reshape((Qshape[0]*Qshape[1],Qshape[2]*Qshape[3]))
Ul, deltaL, _ = np.linalg.svd(tensorQ)
Utr = np.array(truncate_matrix(Ul, 1, dimension))
Utr = Utr.reshape((Qshape[0],Qshape[1],dimension))
courrentShape = myTensor.shape
newTensor2 = np.zeros((dimension,dimension,courrentShape[2],courrentShape[3]))
#Reconstructing T
for xf in range(0,newTensor2.shape[0]):
    for yu in range(0,dimension):
        #Slice current tensor
        tensorSlice = myTensor[:,:,yu,:] # T_iy1y'1
        # Slice truncation matrix
        matrixSlice = Utr[:,:,xf] # Uy1y2
        tensorB2 = ncon([matrixSlice,tensorSlice],[[1,-3],[1,-2,-1]])
        for yb in range(0,newTensor2.shape[3]):
            tensorSlice = myTensor[:,:,:,yb]
            tensorC = ncon([tensorB2,tensorSlice],[[1,-1,2],[2,-2,1]])
            newTensor2[xf,:,yu,yb] = ncon([Utr,tensorC],[[1,2,-1],[1,2]])

Interpolate values and replace with NaNs within a long gap?

I am trying to interpolate data with gaps. Sometimes the gap can be very large, and I do not want the interpolation to "succeed" within the gap; the result should be NaNs inside a large gap. For example, consider this example data set:
orig_x = [26219, 26225, 26232, 28521, 28538]
orig_y = [39, 40, 41, 72, 71]
which has a clear gap between x-values 26232 and 28521. Now, I would like to have orig_y interpolated to x-values like this:
import numpy as np
x_target = np.array(range(min(orig_x) // 10 * 10 + 10, max(orig_x) // 10 * 10 + 10, 10))
# array([26220, 26230, 26240, 26250, 26260, 26270, 26280, 26290,
# ...
# 28460, 28470, 28480, 28490, 28500, 28510, 28520, 28530])
and the output y_target should be np.nan everywhere except at 26220, 26230 and 28530. Let's say the condition is that if there is a gap larger than 40 in the data, the interpolation should result in np.nan inside this data gap.
Goal (shown as two pictures, "instead of this" and "get something like this"): the "gap" in the data should result in np.nan instead of garbage data.
Question
What would be the best way (fastest interpolation) to achieve this kind of interpolation? The interpolation can be linear or more sophisticated (e.g. cubic spline). One possibility I have in mind would be to use the scipy.interpolate.interp1d as starting point like this
from scipy.interpolate import interp1d
f = interp1d(orig_x, orig_y, bounds_error=False)
y_target = f(x_target)
and then search for gaps in the data and replace the interpolated data with np.nan inside the gaps. Since I will be using this on a fairly large dataset (~10M rows, a few columns, handled in parts), performance is key.
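The "interpolate first, then blank out the gaps" idea can be sketched with plain NumPy. This is a minimal sketch under assumptions: the helper name interp_with_gap_mask is hypothetical, np.interp stands in for interp1d (linear only), and orig_x is assumed sorted with at least two points.

```python
import numpy as np

def interp_with_gap_mask(orig_x, orig_y, x_target, max_gap):
    """Hypothetical helper: linear interpolation, NaN inside gaps wider than max_gap."""
    orig_x = np.asarray(orig_x, dtype=float)
    orig_y = np.asarray(orig_y, dtype=float)
    x_target = np.asarray(x_target, dtype=float)
    # Linear interpolation; NaN instead of extrapolation outside the data range.
    y = np.interp(x_target, orig_x, orig_y, left=np.nan, right=np.nan)
    # Index of the left neighbour of every target point.
    left = np.searchsorted(orig_x, x_target, side='right') - 1
    left = np.clip(left, 0, len(orig_x) - 2)
    in_wide_gap = np.diff(orig_x)[left] > max_gap
    # Targets that coincide with a sample keep their value even at a gap edge.
    on_sample = np.isin(x_target, orig_x)
    y[in_wide_gap & ~on_sample] = np.nan
    return y

orig_x = [26219, 26225, 26232, 28521, 28538]
orig_y = [39, 40, 41, 72, 71]
x_target = np.arange(26220, 28540, 10)
y_target = interp_with_gap_mask(orig_x, orig_y, x_target, max_gap=40)
print(x_target[~np.isnan(y_target)])
```

Everything here is vectorized, so the cost is dominated by np.interp and np.searchsorted, both of which are binary-search based on sorted input.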

After some trial and error, I think I got a "fast enough" implementation using basic linear interpolation and numba for speedups. Forgive me for writing everything in the same loop and the same function, but that seems to be the numba way of making your code fast. (numba loves loops, and does not seem to accept nested functions.)
Test data used
I added some more data to x_target to test the algorithm performance.
orig_x = np.array([26219, 26225, 26232, 28521, 28538])
orig_y = np.array([39, 40, 41, 72, 71])
x_target = np.array(
np.arange(min(orig_x) // 10 * 10,
max(orig_x) // 10 * 10 + 10, 0.1))
Test code
from matplotlib import pyplot as plt
y_target = interpolate_with_max_gap(orig_x, orig_y, x_target, max_gap=40)
plt.scatter(x_target, y_target, label='interpolated', s=10)
plt.scatter(orig_x, orig_y, label='orig', s=10)
plt.legend()
plt.show()
Test results
The data is interpolated in regions with gap less than max_gap (40):
closeup:
Speed:
I first tried a pure python + numpy implementation, which took 49.6 ms with the same test data (using timeit). This implementation with numba takes 480µs (100x speedup!). When using target_x_is_sorted=True, the speed is 80.1µs!
Setting orig_x_is_sorted=True did not give a speedup, probably because orig_x is so short that sorting it makes no difference to the timing in this example.
Implementation
import numba
import numpy as np
@numba.njit()
def interpolate_with_max_gap(orig_x,
                             orig_y,
                             target_x,
                             max_gap=np.inf,
                             orig_x_is_sorted=False,
                             target_x_is_sorted=False):
    """
    Interpolate data linearly with maximum gap. If there is
    larger gap in data than `max_gap`, the gap will be filled
    with np.nan.
    The input values should not contain NaNs.
    Parameters
    ---------
    orig_x: np.array
        The input x-data
    orig_y: np.array
        The input y-data
    target_x: np.array
        The output x-data; the data points in x-axis that
        you want the interpolation results from.
    max_gap: float
        The maximum allowable gap in `orig_x` inside which
        interpolation is still performed. Gaps larger than
        this will be filled with np.nan in the output `target_y`.
    orig_x_is_sorted: boolean, default: False
        If True, the input data `orig_x` is assumed to be monotonically
        increasing. Some performance gain if you supply sorted input data.
    target_x_is_sorted: boolean, default: False
        If True, the input data `target_x` is assumed to be
        monotonically increasing. Some performance gain if you supply
        sorted input data.
    Returns
    ------
    target_y: np.array
        The interpolation results.
    """
    if not orig_x_is_sorted:
        # Sort to be monotonous wrt. input x-variable.
        idx = orig_x.argsort()
        orig_x = orig_x[idx]
        orig_y = orig_y[idx]
    if not target_x_is_sorted:
        target_idx = target_x.argsort()
        # Needed for sorting back the data.
        target_idx_for_reverse = target_idx.argsort()
        target_x = target_x[target_idx]
    target_y = np.empty(target_x.size)
    idx_orig = 0
    orig_gone_through = False
    for idx_target, x_new in enumerate(target_x):
        # Grow idx_orig if needed.
        while not orig_gone_through:
            if idx_orig + 1 >= len(orig_x):
                # Already consumed the orig_x; no more data
                # so we would need to extrapolate
                orig_gone_through = True
            elif x_new > orig_x[idx_orig + 1]:
                idx_orig += 1
            else:
                # x_new <= x2
                break
        if orig_gone_through:
            target_y[idx_target] = np.nan
            continue
        x1 = orig_x[idx_orig]
        y1 = orig_y[idx_orig]
        x2 = orig_x[idx_orig + 1]
        y2 = orig_y[idx_orig + 1]
        if x_new < x1:
            # would need to extrapolate to left
            target_y[idx_target] = np.nan
            continue
        delta_x = x2 - x1
        if delta_x > max_gap:
            target_y[idx_target] = np.nan
            continue
        delta_y = y2 - y1
        if delta_x == 0:
            target_y[idx_target] = np.nan
            continue
        k = delta_y / delta_x
        delta_x_new = x_new - x1
        delta_y_new = k * delta_x_new
        y_new = y1 + delta_y_new
        target_y[idx_target] = y_new
    if not target_x_is_sorted:
        return target_y[target_idx_for_reverse]
    return target_y

Is it possible to convert this numpy function to tensorflow?

I have a function that takes a [32, 32, 3] tensor, and outputs a [256,256,3] tensor.
Specifically, the function interprets the smaller array as if it was a .svg file, and 'renders' it to a 256x256 array as a canvas using this algorithm
For an explanation of WHY I would want to do this, see This question
The function behaves exactly as intended, until I try to include it in the training loop of a GAN. The current error I'm seeing is:
NotImplementedError: Cannot convert a symbolic Tensor (mul:0) to a numpy array.
A lot of other answers to similar errors seem to boil down to "You need to re-write the function using tensorflow, not numpy"
Here's the working code using numpy - is it possible to re-write it to exclusively use tensorflow functions?
def convert_to_bitmap(input_tensor, target, j):
    #implied conversion to nparray - the tensorflow docs seem to indicate this is okay, but the error is thrown here when training
    array = input_tensor
    outputArray = target
    output = target
    for i in range(32):
        col = float(array[i,0,j])
        if ((float(array[i,0,0]))+(float(array[i,0,1]))+(float(array[i,0,2]))/3)< 0:
            continue
        #slice only the red channel from the i line, multiply by 255
        red_array = array[i,:,0]*255
        #slice only the green channel, multiply by 255
        green_array = array[i,:,1]*255
        #combine and flatten them
        combined_array = np.dstack((red_array, green_array)).flatten()
        #remove the first two and last two indices of the combined array
        index = [0,1,62,63]
        clipped_array = np.delete(combined_array,index)
        #filter array to remove values less than 0
        filtered = clipped_array > 0
        filtered_array = clipped_array[filtered]
        #check array has an even number of values, delete the last index if it doesn't
        if len(filtered_array) % 2 == 0:
            pass
        else:
            filtered_array = np.delete(filtered_array,-1)
        #convert into a set of tuples
        l = filtered_array.tolist()
        t = list(zip(l, l[1:] + l[:1]))
        if not t:
            continue
        output = fill_polygon(t, outputArray, col)
    return(output)
The 'fill polygon' function is copied from the 'mahotas' library:
def fill_polygon(polygon, canvas, color):
    if not len(polygon):
        return
    min_y = min(int(y) for y,x in polygon)
    max_y = max(int(y) for y,x in polygon)
    polygon = [(float(y),float(x)) for y,x in polygon]
    if max_y < canvas.shape[0]:
        max_y += 1
    for y in range(min_y, max_y):
        nodes = []
        j = -1
        for i,p in enumerate(polygon):
            pj = polygon[j]
            if p[0] < y and pj[0] >= y or pj[0] < y and p[0] >= y:
                dy = pj[0] - p[0]
                if dy:
                    nodes.append( (p[1] + (y-p[0])/(pj[0]-p[0])*(pj[1]-p[1])) )
                elif p[0] == y:
                    nodes.append(p[1])
            j = i
        nodes.sort()
        for n,nn in zip(nodes[::2],nodes[1::2]):
            nn += 1
            canvas[y, int(n):int(nn)] = color
    return(canvas)
NOTE: I'm not trying to get someone to convert the whole thing for me! There are some functions that are pretty obvious (tf.stack instead of np.dstack), but others that I don't even know how to start, like the last few lines of the fill_polygon function above.

Yes, you can actually do this: you can wrap a Python function with tf.py_function (tf.py_func in TensorFlow 1.x). It is a Python wrapper, though, and it is extremely slow compared with plain TensorFlow. TensorFlow and CUDA are so fast because they rely on things like vectorization, meaning you can rewrite many of the loops as mathematical tensor operations, which are very fast.
In general:
If you want to use custom code as a custom layer, I would recommend rethinking the algebra behind those loops and trying to express them differently. If it is just preprocessing before training starts, you can use TensorFlow, but doing the same with NumPy and other libraries is easier.
To your main question: yes, it is possible, but it is better not to use loops. TensorFlow has a built-in loop optimizer, but then you have to use tf.while_loop, and that is annoying (maybe just for me). I only skimmed your code, but it looks like you should be able to vectorize it quite well using the standard TensorFlow vocabulary. If you want it really fast with GPU support, write everything in TensorFlow rather than a 50/50 mix with tf.convert_to_tensor(), because then it is going to be slow again: you keep switching between the GPU, the CPU, the plain Python interpreter and the TensorFlow low-level API. Hope I could help you at least a bit.
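To make "rethink the algebra behind those loops" concrete, here is a minimal NumPy sketch of a loop-free polygon fill using the even-odd rule, evaluated for every pixel at once. It is not a drop-in replacement for fill_polygon (the boundary handling differs from the scanline version), and the helper name polygon_mask is hypothetical. Every step is an elementwise or reduction operation, which is the kind of code that generally maps onto tensor ops such as tf.roll, comparisons and tf.math.floormod.

```python
import numpy as np

def polygon_mask(vertices, height, width):
    """Hypothetical helper: boolean mask of pixels inside a polygon (even-odd rule)."""
    vy, vx = np.asarray(vertices, dtype=float).T      # vertices given as (y, x) pairs
    vy2, vx2 = np.roll(vy, -1), np.roll(vx, -1)       # edge i runs from vertex i to vertex i + 1
    py, px = np.mgrid[0:height, 0:width]
    py = py[..., None]                                # shape (H, W, 1), broadcasts against edges
    px = px[..., None]
    # Edge crosses the horizontal line through the pixel.
    crosses = (vy <= py) != (vy2 <= py)
    with np.errstate(divide='ignore', invalid='ignore'):
        # x-coordinate where the edge meets that line (unused for horizontal edges).
        x_at_y = vx + (py - vy) * (vx2 - vx) / (vy2 - vy)
        hits = crosses & (px < x_at_y)
    # An odd number of crossings to the right means the pixel is inside.
    return hits.sum(axis=-1) % 2 == 1

square = [(2, 2), (2, 6), (6, 6), (6, 2)]
mask = polygon_mask(square, 8, 8)
canvas = np.zeros((8, 8))
canvas[mask] = 1.0
```

The intermediate arrays have shape (height, width, number_of_edges), so memory grows with the polygon size; that trade of memory for the absence of Python loops is the usual price of this style.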

This code 'works', in that it only uses tensorflow functions, and does allow the model to train when used in a training loop:
def convert_image (x):
    #split off the first column of the generator output, and store it for later (remove the 'colours' column)
    colours_column = tf.slice(img_to_convert, tf.constant([0,0,0], dtype=tf.int32), tf.constant([32,1,3], dtype=tf.int32))
    #split off the rest of the data, only keeping R + G, and discarding B
    image_data_red = tf.slice(img_to_convert, tf.constant([0,1,0], dtype=tf.int32), tf.constant([32,31,1], dtype=tf.int32))
    image_data_green = tf.slice(img_to_convert, tf.constant([0,1,1], dtype=tf.int32), tf.constant([32, 31,1], dtype=tf.int32))
    #roll each row by 1 position, and make two more 2D tensors
    rolled_red = tf.roll(image_data_red, shift=-1, axis=0)
    rolled_green = tf.roll(image_data_green, shift=-1, axis=0)
    #remove all values where either the red OR green channels are 0
    zeroes = tf.constant(0, dtype=tf.float32)
    #this is for the 'count_nonzero' command
    boolean_red_data = tf.not_equal(image_data_red, zeroes)
    boolean_green_data = tf.not_equal(image_data_green, zeroes)
    initial_data_mask = tf.logical_and(boolean_red_data, boolean_green_data)
    #count non-zero values per row and flatten it
    count = tf.math.count_nonzero(initial_data_mask, 1)
    count_flat = tf.reshape(count, [-1])
    flat_red = tf.reshape(image_data_red, [-1])
    flat_green = tf.reshape(image_data_green, [-1])
    boolean_red = tf.math.logical_not(tf.equal(flat_red, tf.zeros_like(flat_red)))
    boolean_green = tf.math.logical_not(tf.equal(flat_green, tf.zeros_like(flat_red)))
    mask = tf.logical_and(boolean_red, boolean_green)
    flat_red_without_zero = tf.boolean_mask(flat_red, mask)
    flat_green_without_zero = tf.boolean_mask(flat_green, mask)
    # create a ragged tensor
    X0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_red_without_zero, row_lengths=count_flat)
    Y0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_green_without_zero, row_lengths=count_flat)
    #do the same for the rolled version
    rolled_data_mask = tf.roll(initial_data_mask, shift=-1, axis=1)
    flat_rolled_red = tf.reshape(rolled_red, [-1])
    flat_rolled_green = tf.reshape(rolled_green, [-1])
    #from SO "shift zeros to the end"
    boolean_rolled_red = tf.math.logical_not(tf.equal(flat_rolled_red, tf.zeros_like(flat_rolled_red)))
    boolean_rolled_green = tf.math.logical_not(tf.equal(flat_rolled_green, tf.zeros_like(flat_rolled_red)))
    rolled_mask = tf.logical_and(boolean_rolled_red, boolean_rolled_green)
    flat_rolled_red_without_zero = tf.boolean_mask(flat_rolled_red, rolled_mask)
    flat_rolled_green_without_zero = tf.boolean_mask(flat_rolled_green, rolled_mask)
    # create a ragged tensor
    X1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_red_without_zero, row_lengths=count_flat)
    Y1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_green_without_zero, row_lengths=count_flat)
    #available outputs for future use are:
    X0 = X0_ragged.to_tensor(default_value=0.)
    Y0 = Y0_ragged.to_tensor(default_value=0.)
    X1 = X1_ragged.to_tensor(default_value=0.)
    Y1 = Y1_ragged.to_tensor(default_value=0.)
    #Example tensor cel (replace with (x))
    P = tf.cast(x, dtype=tf.float32)
    #split out P.x and P.y, and fill a ragged tensor to the same shape as Rx
    Px_value = tf.cast(x, dtype=tf.float32) - tf.cast((tf.math.floor(x/255)*255), dtype=tf.float32)
    Py_value = tf.cast(tf.math.floor(x/255), dtype=tf.float32)
    Px = tf.squeeze(tf.ones_like(X0)*Px_value)
    Py = tf.squeeze(tf.ones_like(Y0)*Py_value)
    #for each pair of values (Y0, Y1), make a vector, and check to see if it crosses the y-value (Py) either up or down
    a = tf.math.less(Y0, Py)
    b = tf.math.greater_equal(Y1, Py)
    c = tf.logical_and(a, b)
    d = tf.math.greater_equal(Y0, Py)
    e = tf.math.less(Y1, Py)
    f = tf.logical_and(d, e)
    g = tf.logical_or(c, f)
    #Makes boolean bitwise mask
    #calculate the intersection of the line with the y-value, assuming it intersects
    #P.x <= (G.x - R.x) * (P.y - R.y) / (G.y - R.y + R.x) - use tf.divide_no_nan for safe divide
    h = tf.math.less(Px,(tf.math.divide_no_nan(((X1-X0)*(Py-Y0)),(Y1-Y0+X0))))
    #combine using AND with the mask above
    i = tf.logical_and(g,h)
    #tf.count_nonzero
    #reshape to make a column tensor with the same dimensions as the colours
    #divide by 2 using tf.floor_mod (returns remainder of division - any remainder means the value is odd, and hence the point is IN the polygon)
    final_count = tf.cast((tf.math.count_nonzero(i, 1)), dtype=tf.int32)
    twos = tf.ones_like(final_count, dtype=tf.int32)*tf.constant([2], dtype=tf.int32)
    divide = tf.cast(tf.math.floormod(final_count, twos), dtype=tf.int32)
    index = tf.cast(tf.range(0,32, delta=1), dtype=tf.int32)
    clipped_index = divide*index
    sort = tf.sort(clipped_index)
    reverse = tf.reverse(sort, [-1])
    value = tf.slice(reverse, [0], [1])
    pair = tf.constant([0], dtype=tf.int32)
    slice_tensor = tf.reshape(tf.stack([value, pair, pair], axis=0),[-1])
    output_colour = tf.slice(colours_column, slice_tensor, [1,1,3])
    return output_colour
This is where the 'convert image' function is applied using tf.vectorized_map:
def convert_images(image_to_convert):
    global img_to_convert
    img_to_convert = image_to_convert
    process_list = tf.reshape((tf.range(0,65536, delta=1, dtype=tf.int32)), [65536, 1])
    output_line = tf.vectorized_map(convert_image, process_list)
    output_line_squeezed = tf.squeeze(output_line)
    output_reshape = (tf.reshape(output_line_squeezed, [256,256,3])/127.5)-1
    output = tf.expand_dims(output_reshape, axis=0)
    return output
It is PAINFULLY slow, though. It does not appear to be using the GPU, and looks to be single-threaded as well.
I'm adding it as an answer to my own question because it clearly IS possible to do this numpy function entirely in tensorflow; it just probably shouldn't be done like this.

Efficient sum of Gaussians in 3D with NumPy using large arrays

I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial
#create meshgrid
side = 100.
n = 64 #could be 256 or larger
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
#convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
#create some coordinates
coords = np.random.random(size=(1000,3))*side - side/2
def sumofgauss(coords,xyz,sigma):
    """Simple isotropic gaussian sum at coordinate locations."""
    n = int(round(xyz.shape[0]**(1/3.))) #get n samples for reshaping to 3D later
    #this version overloads memory
    #dist = spatial.distance.cdist(coords, xyz)
    #dist *= dist
    #values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
    #values = np.sum(values,axis=0)
    #run cdist in a loop over coords to avoid overloading memory
    values = np.zeros((xyz.shape[0]))
    for i in range(coords.shape[0]):
        dist = spatial.distance.cdist(coords[None,i], xyz)
        dist *= dist
        values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
    return values.reshape(n,n,n)
image = sumofgauss(coords,xyz,1.0)
import matplotlib.pyplot as plt
plt.imshow(image[n//2]) #show a slice (integer index)
plt.show()
M = 1000, N = 64: ~5 seconds
M = 1000, N = 256: ~10 minutes

Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances greater than a threshold is usually faster with a KDTree:
import numpy as np
from scipy.spatial import cKDTree # so we can get a `coo_matrix` output
def gaussgrid(coords, sigma = 1, n = 64, side = 100, eps = None):
    x_ = np.linspace(-side/2,side/2,n)
    x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
    xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
    if eps is None:
        eps = np.finfo('float64').eps
    thr = -np.log(eps) * 2 * sigma**2
    data_tree = cKDTree(coords)
    discr = 1000 # you can tweak this to get best results on your system
    values = np.empty(n**3)
    for i in range(n**3//discr + 1):
        slc = slice(i * discr, i * discr + discr)
        grid_tree = cKDTree(xyz[slc])
        dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type = 'coo_matrix')
        dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data/(2*sigma**2))
        values[slc] = dists.sum(1).squeeze()
    return values.reshape(n,n,n)
Now, even if you keep eps = None it'll be a bit faster, as you're still returning only about 10% of your distances, but with eps = 1e-6 or so, you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop
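One detail worth checking when adapting this: sparse_distance_matrix takes a plain Euclidean max_distance and returns plain Euclidean distances, whereas sumofgauss in the question squares the distance before the exponential. A variant that is meant to reproduce sumofgauss therefore needs a radius cutoff and squared distances. The sketch below is one way to write that; the name gaussgrid_sq and its default arguments are assumptions, and the timings quoted above do not apply to it.

```python
import numpy as np
from scipy.spatial import cKDTree

def gaussgrid_sq(coords, sigma=1.0, n=16, side=100.0, eps=1e-6, discr=1000):
    """Hypothetical variant: truncated sum of Gaussians using squared distances."""
    x_ = np.linspace(-side / 2, side / 2, n)
    x, y, z = np.meshgrid(x_, x_, x_, indexing='ij')
    xyz = np.column_stack((x.ravel(), y.ravel(), z.ravel()))
    # exp(-d**2 / (2 sigma**2)) drops below eps once d exceeds this radius.
    radius = np.sqrt(-2.0 * sigma**2 * np.log(eps))
    norm = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    data_tree = cKDTree(coords)
    values = np.zeros(n**3)
    for start in range(0, n**3, discr):
        slc = slice(start, start + discr)
        grid_tree = cKDTree(xyz[slc])
        dists = grid_tree.sparse_distance_matrix(data_tree, radius, output_type='coo_matrix')
        # Square the Euclidean distances before applying the Gaussian.
        dists.data = norm * np.exp(-dists.data**2 / (2 * sigma**2))
        values[slc] = np.asarray(dists.sum(axis=1)).ravel()
    return values.reshape(n, n, n)
```

Each dropped term is smaller than norm * eps, so the truncation error at any grid point is bounded by M * norm * eps for M coordinates, which gives a direct way to pick eps for a target accuracy.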

Python improving function speed

I am writing my own script to calculate the relation between two signals, using the mlab.csd and mlab.psd functions to compute the CSD and PSD of the signals.
My array x has the shape (120,68,68,815). My script runs for several minutes, and this function is the hotspot that accounts for most of that time.
Does anyone have an idea what I should do? I am not that familiar with improving script performance. Thanks!
import pickle
import numpy as np
from matplotlib import mlab
# to read the list of stcs for all the epochs
with open('/home/daniel/Dropbox/F[...]', 'rb') as f:
    label_ts = pickle.load(f)
x = np.asarray(label_ts)
nfft = 512
n_freqs = nfft // 2 + 1  # integer division so it can be used as an array dimension
n_epochs = len(x) # in this case there are 120 epochs
channels = 68
sfreq = 1017.25
def compute_mean_psd_csd(x, n_epochs, nfft, sfreq):
    '''Computes mean of PSD and CSD for signals.'''
    Rxy = np.zeros((n_epochs, channels, channels, n_freqs), dtype=complex)
    Rxx = np.zeros((n_epochs, channels, channels, n_freqs))
    Ryy = np.zeros((n_epochs, channels, channels, n_freqs))
    for i in range(0, n_epochs):
        print('computing connectivity for epoch %s'%(i+1))
        for j in range(0, channels):
            for k in range(0, channels):
                Rxy[i,j,k], freqs = mlab.csd(x[j], x[k], NFFT=nfft, Fs=sfreq)
                Rxx[i,j,k], _____ = mlab.psd(x[j], NFFT=nfft, Fs=sfreq)
                Ryy[i,j,k], _____ = mlab.psd(x[k], NFFT=nfft, Fs=sfreq)
    Rxy_mean = np.mean(Rxy, axis=0, dtype=np.float32)
    Rxx_mean = np.mean(Rxx, axis=0, dtype=np.float32)
    Ryy_mean = np.mean(Ryy, axis=0, dtype=np.float32)
    return freqs, Rxy, Rxy_mean, np.real(Rxx_mean), np.real(Ryy_mean)

Something that could help if the csd and psd methods are computationally intensive: you could simply cache the results of previous calls and reuse them instead of calculating them multiple times.
As it stands, you will have 120 * 68 * 68 = 554880 cycles.
For the psd calculation, it should be possible to cache the values without any problem, as the method only depends on one parameter.
Store the value in a dict keyed by the index (j or k); NumPy arrays themselves are not hashable, so they cannot be used as dict keys. If the value doesn't exist yet, compute it and store it. If the value exists, skip the computation and reuse the stored value.
if j not in cache_psd:
    cache_psd[j], ____ = mlab.psd(x[j], NFFT=nfft, Fs=sfreq)
Rxx[i,j,k] = cache_psd[j]
if k not in cache_psd:
    cache_psd[k], ____ = mlab.psd(x[k], NFFT=nfft, Fs=sfreq)
Ryy[i,j,k] = cache_psd[k]
You can do the same with the csd method. I don't know enough about it to say more. If the order of the parameters doesn't matter, you can store the two parameters in sorted order to prevent duplicates such as 2, 1 and 1, 2.
The cache will make the code faster only if the memory access time is lower than the computation time plus the storing time. This fix could easily be added with a module that does memoization.
Here's an article about memoization for further reading:
http://www.python-course.eu/python3_memoization.php
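To see how much work the cache removes, here is a small self-contained sketch. The function expensive_spectrum is a hypothetical stand-in for mlab.psd (it only counts its own calls), and the sizes are made up so the example runs instantly.

```python
import numpy as np

calls = 0

def expensive_spectrum(signal):
    """Hypothetical stand-in for mlab.psd that counts how often it runs."""
    global calls
    calls += 1
    return np.abs(np.fft.rfft(signal)) ** 2

channels, n_samples = 4, 64          # assumed small sizes
x = np.random.default_rng(0).standard_normal((channels, n_samples))

cache = {}

def cached_spectrum(j):
    # Key by channel index, because arrays cannot be dict keys.
    if j not in cache:
        cache[j] = expensive_spectrum(x[j])
    return cache[j]

n_freqs = n_samples // 2 + 1
Rxx = np.zeros((channels, channels, n_freqs))
Ryy = np.zeros((channels, channels, n_freqs))
for j in range(channels):
    for k in range(channels):
        Rxx[j, k] = cached_spectrum(j)
        Ryy[j, k] = cached_spectrum(k)

print(calls)  # 4: one evaluation per channel instead of 2 * channels**2 = 32
```

Since the spectrum of a channel does not depend on the other loop index, an equivalent and simpler design is to precompute one spectrum per channel before the loops and index into that array.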
