Improving moving-window computation in memory consumption and speed - python

Is it possible to obtain better performance (both in memory consumption and speed) in this moving-window computation? I have a 1000x1000 numpy array and I take 16x16 windows through the whole array and finally apply some function to each window (in this case, a discrete cosine transform.)
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size,window_size))
dcts = np.zeros(windows.reshape(-1,window_size, window_size).shape, dtype=np.float32)
for idx, window in enumerate(windows.reshape(-1,window_size, window_size)):
    dcts[idx, :, :] = dct(window)
dcts = dcts.reshape(windows.shape)
This code takes too much memory (in the example above the memory consumption is still tolerable - windows uses 1 GB and dcts needs another 1 GB) and takes 25 seconds to complete. I'm a bit unsure what I'm doing wrong, because this should be a straightforward calculation (e.g. filtering an image). Is there a better way to accomplish this?
UPDATE:
I was initially worried that the arrays produced by Kington's solution and my initial approach were very different, but the difference is restricted to the boundaries, so it is unlikely to cause serious issues for most applications. The only remaining problem is that both solutions are very slow. Currently, the first solution takes 1min 10s and the second solution 59 seconds.
UPDATE 2:
I noticed the biggest culprits by far are dct and np.mean. Even generic_filter performs decently (8.6 seconds) using a "cythonized" version of mean with bottleneck:
import bottleneck as bp
import scipy.ndimage

def func(window, shape):
    window = window.reshape(shape)
    #return np.abs(dct(dct(window, axis=1), axis=0)).mean()
    return bp.nanmean(dct(window))

result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
I'm currently reading how to wrap C code using numpy in order to replace scipy.fftpack.dct. If anyone knows how to do it, I would appreciate the help.

Since scipy.fftpack.dct calculates separate transforms along the last axis of the input array, you can replace your loop with:
windows = view_as_windows(X, (window_size,window_size))
dcts = dct(windows)
result1 = dcts.mean(axis=(2,3))
Now only the dcts array requires a lot of memory, and windows remains merely a view into X. Because the DCTs are calculated with a single function call, it's also much faster. However, because the windows overlap, there are lots of repeated calculations. This can be overcome by calculating the DCT for each sub-row only once, followed by a windowed mean:
ws = window_size
row_dcts = dct(view_as_windows(X, (1, ws)))
cs = row_dcts.squeeze().sum(axis=-1).cumsum(axis=0)
result2 = np.vstack((cs[ws-1], cs[ws:]-cs[:-ws])) / ws**2
Though it seems what is gained in efficiency is lost in code clarity... But basically the approach here is to first calculate the DCTs and then take the window average by summing over the 2D window and then dividing by the number of elements in the window. The DCTs are already calculated over row-wise moving windows, so we take a regular sum over those windows. However, we need to take a moving window sum over the columns to arrive at the proper 2D window sums. To do this efficiently we use a cumsum trick, where:
sum(A[p:q]) # q-p == window_size
Is equivalent to:
cs = cumsum(A)
cs[q-1] - cs[p-1]
This avoids having to sum the exact same numbers over and over. Unfortunately it doesn't work for the first window (when p == 0), so for that we have to take only cs[q-1] and stack it together with the other window sums. Finally we divide by the number of elements to arrive at the 2D window average.
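A tiny sanity check of that identity (not from the original answer, just a sketch on a toy 1-D array):
A = np.arange(10, dtype=float)
p, q = 3, 7                      # q - p == window size
cs = np.cumsum(A)
print(A[p:q].sum())              # 3 + 4 + 5 + 6 = 18.0
print(cs[q - 1] - cs[p - 1])     # also 18.0, without re-summing the overlap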
If you want to do a 2D DCT, then this second approach becomes less interesting, because you'll eventually need the full 985 x 985 x 16 x 16 array before you can take the mean.
Both approaches above should be equivalent, but it may be a good idea to perform the arithmetic with 64-bit floats:
np.allclose(result1, result2, atol=1e-6)
# False
np.allclose(result1, result2, atol=1e-5)
# True
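For completeness, here is a sketch of what the 64-bit variant could look like (run on a small slice of X to keep the memory footprint down; this snippet is an addition, not part of the original answer):
Xs = X[:256, :256].astype(np.float64)
ws = window_size

dcts64 = dct(view_as_windows(Xs, (ws, ws)))
result1_64 = dcts64.mean(axis=(2, 3))

cs64 = dct(view_as_windows(Xs, (1, ws))).squeeze().sum(axis=-1).cumsum(axis=0)
result2_64 = np.vstack((cs64[ws-1], cs64[ws:] - cs64[:-ws])) / ws**2

print(np.allclose(result1_64, result2_64))  # now True even at the default tolerance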

skimage.util.view_as_windows is using striding tricks to make an array of overlapping "windows" that doesn't use any additional memory.
However, when you make a new array of that shape, it will require roughly 16 x 16 (~256) times the memory that your original X array (or the windows view) uses.
Based on your comment, your end result is doing dcts.reshape(windows.shape).mean(axis=2).mean(axis=2) - taking the mean of the dct of each window.
Therefore, it would be more memory-efficient (though similar performance-wise) to take the mean inside the loop rather than storing the huge intermediate array of per-window DCTs:
import numpy as np
from scipy.fftpack import dct
from skimage.util import view_as_windows
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
window_size = 16
windows = view_as_windows(X, (window_size, window_size))
dcts = np.zeros(windows.shape[:2], dtype=np.float32).ravel()
for idx, window in enumerate(windows.reshape(-1, window_size, window_size)):
    dcts[idx] = dct(window).mean()
dcts = dcts.reshape(windows.shape[:2])
Another option is scipy.ndimage.generic_filter. It won't increase performance much (the bottleneck is the Python function call in the inner loop), but you'll have a lot more boundary condition options, and it will be fairly memory efficient:
import numpy as np
from scipy.fftpack import dct
import scipy.ndimage
X = np.arange(1000*1000, dtype=np.float32).reshape(1000,1000)
def func(window, shape):
    window = window.reshape(shape)
    return dct(window).mean()

result = scipy.ndimage.generic_filter(X, func, (16, 16),
                                      extra_arguments=([16, 16],))
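For example, the boundary handling can be selected explicitly through generic_filter's mode argument; the choice of 'nearest' below is just an illustration, not something from the original post:
# same filter, but with explicit boundary handling; other modes include
# 'reflect' (the default), 'constant', 'mirror' and 'wrap'
result_nearest = scipy.ndimage.generic_filter(X, func, (16, 16),
                                              mode='nearest',
                                              extra_arguments=([16, 16],))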

Related

Finite difference using xarray rolling

My goal is to compute a derivative of a moving window of a multidimensional dataset along a given dimension, where the dataset is stored as Xarray DataArray or DataSet.
In the simplest case, given a 2D array I would like to compute a moving difference across multiple entries in one dimension, e.g.:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6) ).reshape(10,6)
T=3
reducedArray = np.zeros_like(data)
for i in range(data.shape[1]):
    if i < T:
        reducedArray[:,i] = data[:,i] - data[:,0]
    else:
        reducedArray[:,i] = data[:,i] - data[:,i-T]
where the if i < T condition ensures that input and output contain proper values (i.e., no NaNs) and are of identical shape.
Xarray's diff aims to perform a finite-difference approximation of a given derivative order using nearest-neighbours, so it is not suitable here, hence the question:
Is it possible to perform this operation using Xarray functions only?
The rolling weighted average example appears to be something similar, but still too distinct due to the usage of NumPy routines. I've been thinking that something along the lines of the following should work:
import xarray as xr

xr2DDataArray = xr.DataArray(
    data,
    dims=('x', 'y'),
    coords={'x': np.linspace(0, 1, 10), 'y': np.linspace(1, 4, 6)}
)
r = xr2DDataArray.rolling(x=T, min_periods=2)
r.reduce(redFn)
I am struggling with the definition of redFn here, though.
Caveat: the actual dataset to which the operation is to be applied will have a size of ~10 GiB, so a solution that does not blow up the memory requirements will be highly appreciated!
Update/Solution
Using Xarray rolling
After sleeping on it and a bit more fiddling, the post linked above actually contains a solution. To obtain a finite difference we just have to define the weights to be $\pm 1$ at the ends and $0$ elsewhere:
def fdMovingWindow(data, **kwargs):
    T = kwargs['T']
    del kwargs['T']
    weights = np.zeros(T)
    weights[0] = -1
    weights[-1] = 1
    axis = kwargs['axis']
    if data.shape[axis] == T:
        return np.sum(data * weights, **kwargs)
    else:
        return 0
r.reduce(fdMovingWindow, T=4)
Alternatively, using construct and a dot product:
weights = np.zeros(T)
weights[0] = -1
weights[-1] = 1
xrWeights = xr.DataArray(weights, dims=['window'])
xr2DDataArray.rolling(y=T,min_periods=1).construct('window').dot(xrWeights)
This carries a massive caveat: the procedure essentially creates a list of arrays representing the moving window. This is fine for a modest 2D/3D array, but for a 4D array that takes up ~10 GiB in memory this will lead to an OOM death!
Simplistic - memory efficient
A less memory-intensive way is to copy the array and work on it much as one would with plain NumPy arrays:
xrDiffArray = xr2DDataArray.copy()
dy = xr2DDataArray.y.values[1] - xr2DDataArray.y.values[0] #equidistant sampling
for src in xr2DDataArray:
    if src.y.values < xr2DDataArray.y.values[0] + T*dy:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.values[0]
    else:
        xrDiffArray.loc[dict(y = src.y.values)] = src.values - xr2DDataArray.sel(y = src.y.values - dy*T).values
This will produce the intended result without dimensional errors, but it requires a copy of the dataset.
I was hoping to utilise Xarray to prevent a copy and instead just chain operations that are then evaluated if and when values are actually requested.
A suggestion as to how to accomplish this will still be welcomed!
I have never used xarray, so maybe I am mistaken, but I think you can get the result you want without using loops and conditionals. This is at least twice as fast as your example for NumPy arrays:
data = np.kron(np.linspace(0,1,10), np.linspace(1,4,6)).reshape(10,6)
reducedArray = np.empty_like(data)
reducedArray[:, T:] = data[:, T:] - data[:, :-T]
reducedArray[:, :T] = data[:, :T] - data[:, 0, np.newaxis]
I imagine the improvement will be higher when using DataArrays.
It does not use xarray functions, only plain NumPy slicing. I am confident that translating this to xarray would be straightforward: I know it works if there are no coords, but once you include them you get an error because of the coords mismatch (the coords of data[:, T:] and of data[:, :-T] are different). Sadly, I can't do better right now.
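One hedged way around that coordinate mismatch (a sketch I have not tested on the 10 GiB dataset) is to do the slicing on the underlying values, so that no alignment happens, and wrap the result back into a DataArray with the original dims and coords:
import numpy as np
import xarray as xr

# assumes the 2D DataArray from the question (dims ('x', 'y')) and differencing along 'y'
vals = xr2DDataArray.values
out = np.empty_like(vals)
out[:, T:] = vals[:, T:] - vals[:, :-T]
out[:, :T] = vals[:, :T] - vals[:, :1]   # boundary rule from the question: subtract column 0
xrDiff = xr.DataArray(out, dims=xr2DDataArray.dims, coords=xr2DDataArray.coords)
Like the "simplistic" version above it still makes one full copy, so it does not satisfy the lazy-evaluation wish, but it avoids both the loop and the coords error.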

Sum values from numpy array if condition on value in another array is met

I'm facing a problem with vectorizing a function so that it applies efficiently on a numpy array.
My program entries :
A pos_part 2D array with Nb_particles lines and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundreds of thousands.
A prop_part 1D array with Nb_particles values. This part I have covered; creation is done with some nice numpy functions, and I just put a basic distribution here that resembles real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly with numpy array operations only. What I want to do is:
For each distance z_i in z_distances, sum all values from prop_part whose corresponding particle coordinate satisfies z_particle < z_i. This would return a 1D array of the same length as z_distances.
My ideas so far :
Version 0: a for loop, enumerate and np.where to retrieve the indices of the values I need to sum. Obviously quite slow.
Version 1: using a mask on a new array (a combination of z coordinates and particle properties), and summing over the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, but I understand it's not efficient since vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask with a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There is maybe something to do with a sort and a cumulative sum, but I don't know how to do this, so any help would be greatly appreciated. Please find the code below to make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt
# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0
#usefull property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)
z_distances = np.arange(0,10,0.1)
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2]<val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :",time.time()-t0)
#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' usefull property for particles with z-pos<d")
plt.xlabel("d")
#version 1 ??
t1=time.time()
combi = np.column_stack((pos_part[:,2],prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0]<val_dist)
    result2[index_dist]=sum(combi[:,1][mask])
print("v1 :",time.time()-t1)
plt.plot(z_distances,result2, c="blue")
#version 2
t2=time.time()
def themask(a):
    mask = (combi[:,0]<a)
    return sum(combi[:,1][mask])
thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :",time.time()-t2)
plt.plot(z_distances,result3, c="green")
### This does not work so far
# version 3
# =============================
# t3=time.time()
# def thesum(a):
#     mask = combi[combi[:,0]<a]
#     return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :",time.time()-t3)
# =============================
You can get a lot more performance by writing your first version completely in numpy. Replace Python's sum with np.sum. Instead of the for i in positions generator expression, simply index with the positions mask you are creating anyway.
Indeed, the np.where is not necessary and my best version looks like:
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :",time.time()-t0)
# out: v0 : 0.06322097778320312
You can get a bit faster if z_distances is very long by using numba.
Running calc for the first time incurs JIT-compilation overhead, which we can get rid of by first running the function on some small subset of z_distances.
The below code achieves roughly a factor of two speedup over pure numpy on my laptop.
import numba as nb

@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result
result4 = np.zeros_like(result)
# warm-up call on a small slice so the JIT compilation is not included in the timing below
calc(result4, z_distances[:10])
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
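For reference, the sort-plus-cumulative-sum idea mentioned in the question can also be written directly in NumPy; this is a sketch using the variables defined above, and it removes the loop over z_distances entirely:
# sort the particles by z once, then answer every threshold with a binary search
order = np.argsort(pos_part[:, 2])
z_sorted = pos_part[order, 2]
prop_cumsum = np.cumsum(prop_part[order])

counts = np.searchsorted(z_sorted, z_distances, side='left')   # particles with z < z_i
result5 = np.where(counts > 0, prop_cumsum[counts - 1], 0.0)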

How can I speed up closest point comparison using cdist or tensorflow?

I have two sets of points, one is a map consisting of x,y coordinates, and the second is a path of x,y coordinates. I'm trying to find the closest map points to my path points, pretty simple. Except my map is 380000 points and my paths (of which I have several) each consist of ~ 350000 points themselves.
Other than sampling my data to get smaller datasets, I'm trying to find a faster way to accomplish this task.
base algorithm:
import pandas as pd
from scipy.spatial.distance import cdist
...
def closest_point(point, points):
    return points[cdist([point], points).argmin()]
# log['point'].shape; 333000
# map_data['point'].shape; 380000
closest = [closest_point(log_p, list(map_data['point'])) for log_p in log['point']]
as per this example: Find closest point in Pandas DataFrames
After converting this to a tqdm progress bar to see how long it would take (as it was taking a while, obviously), I noticed it would take about 10hrs to complete.
tqdm loop:
for i in trange(len(log), desc='finding closest points'):
    closest.append(closest_point(log['point'].loc[i], list(map_data['point'])))
>> finding closest points: 5%| | 16432/333456 [32:11<10:13:52], 8.60it/s
While 10 hours is not impossible, I wonder if there is a way to speed this up? I have a solid gpu/cpu/ram at my disposal so I feel this should be doable. I'm also learning tensorflow (but honestly my math is atrocious so I'm very in the dark with it)
Any ideas on how to speed this up with either multi-threading, gpu computation, tensorflow or some other sort of wizardry?
inb4 python is slow ;)
*edit: the image shows what I'm trying to do. Green is the path, blue is the map, orange is what I'm trying to find.
The following is a mini example of what you're trying to do. Consider the variable coords1 as your log['point'] and coords2 as your map_data['point']. The end result is the index of the coords2 point closest to each coords1 point.
from scipy.spatial import distance
import numpy as np
coords1 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (36.1667, -86.7833)]
coords2 = [(35.0456, -85.2672),
           (35.1174, -89.9711),
           (35.9728, -83.9422),
           (34.9728, -83.9422),
           (36.1667, -86.7833)]
tmp = distance.cdist(coords1, coords2, "sqeuclidean") # sqeuclidean based on Mark Setchell comment to improve speed further
result = np.argmin(tmp,1)
# result: array([0, 1, 2, 4])
This should be way faster, because it does everything in a single vectorized call.
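Note that on the sizes in the question a single cdist call would need a ~333000 x 380000 distance matrix, which will not fit in memory. A hedged workaround (a sketch, assuming the points can be stacked into plain (N, 2) float arrays) is to process the path in chunks and take the argmin per block:
import numpy as np
from scipy.spatial import distance

def closest_indices_chunked(path_pts, map_pts, chunk=200):
    """Index of the nearest map point for each path point, computed block-wise.

    chunk controls the peak memory: each block needs chunk * len(map_pts) floats.
    """
    out = np.empty(len(path_pts), dtype=np.intp)
    for start in range(0, len(path_pts), chunk):
        block = path_pts[start:start + chunk]
        d = distance.cdist(block, map_pts, "sqeuclidean")
        out[start:start + chunk] = d.argmin(axis=1)
    return out

# e.g. closest = closest_indices_chunked(np.asarray(coords1), np.asarray(coords2))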
After 3 years, but in case anyone is still looking at this issue... You may want to try Numba. I get almost a 9x speedup over scipy's distance.cdist on a 1.5 million point set against a 1.5 K set of path points. Also, as Mark Setchell said, removing the np.sqrt can save considerable time on a big enough set of points.
Results
size: (1459383, 2)
numba: 0.06402060508728027
cdist: 0.5371212959289551
Code
# EUCLIDEAN DISTANCE
import numpy as np
import numba

@numba.njit('(float64[:,::1], float64[::1], float64[::1])', parallel=True, fastmath=True)
def pz_dist(p_array, x_flat, y_flat):
    m = p_array.shape[0]
    n = x_flat.shape[0]
    d = np.empty(shape=(m, n), dtype=np.float64)
    for i in numba.prange(m):
        p1 = p_array[i, :]
        for j in range(n):
            _x = x_flat[j] - p1[0]
            _y = y_flat[j] - p1[1]
            _d = np.sqrt(_x**2 + _y**2)
            d[i, j] = _d
    return d
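A hypothetical call for the function above (the variable names here are my own, not from the original post); note the signature requires contiguous float64 arrays, hence the .copy() on the column slices:
map_xy = np.random.rand(10_000, 2)     # stand-in for the large map point set
path_xy = np.random.rand(500, 2)       # stand-in for the path points
d = pz_dist(map_xy, path_xy[:, 0].copy(), path_xy[:, 1].copy())
nearest_map_idx = d.argmin(axis=0)     # for each path point, index of the closest map point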

fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

As part of my data processing I produce huge non-sparse matrices on the order of 100000*100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10*10 pixels, to reduce the size of my matrix from 100000*100000 to 10000*10000.
What is the fastest way to do so using python? It does not matter for me if I need to save my original data to a new dataformat, because I have to do the downsampling of the same dataset multiple times.
Currently I am using numpy.memmap:
import numpy as np
data_1 = 'data_1.dat'
data_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10
new_lines = lines // window   # integer division so the reshape below gets ints
new_pixels = pixels // window
dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))
dat_in = dat_1 * dat_2
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)
But with large files this method becomes very slow. Likely this has something to do with the binary data of these files, which are ordered by line. Therefore, I think that a data format which stores my data in blocks instead of lines would be faster, but I am not sure what the performance gain would be and whether there are python packages that support this.
I have also thought about downsampling of the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.
Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.
def downSample(a, window):
    i, j = a.shape
    ir = np.arange(0, i, window)
    jr = np.arange(0, j, window)
    n = 1./(window**2)
    return n * np.add.reduceat(np.add.reduceat(a, ir), jr, axis=1)
Hard to test speed without your dataset.
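As a stand-in, here is a quick check on a small random block (my addition, not from the original answer) that the reduceat version matches the straightforward reshape-and-mean:
import numpy as np

a = np.random.rand(100, 120)
w = 10
expected = a.reshape(100 // w, w, 120 // w, w).mean(axis=(1, 3))
print(np.allclose(downSample(a, w), expected))   # True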
This avoids an intermediate copy, as the reshape keeps the dimensions contiguous:
dat_in.reshape((lines//window, window, pixels//window, window)).mean(axis=(1,3))

Seeking a more efficient python numpy ravel+reshape

I'm curious if there is a better way of doing a numpy ravel+reshape.
I load up a large stack of large images and get an array of shape (num-rasters, h, w), where num-rasters is the number of images and h/w are the height/width of an image (which are all the same size).
I wish to convert the array into a shape (h*w, num-rasters)
here is the way I do it now:
res = my_function(some_variable) #(num-rasters, h, w)
res = res.ravel(order='F').reshape((res.shape[1] * res.shape[2], res.shape[0])) #(h*w, num-rasters)
It works fine, but my 'res' variable (the stack of images) is several GB in size, and even with a ton of RAM (32 GB) the operation takes it all.
I'm curious if any pythonistas or numpy pros have any suggestions.
thanks!
############### post question edit/follow-up
First, the in-place reshaping ended up being waaaaay faster than a .reshape() call... which would presumably return a copy with all the associated memory overhead. I should have known better.
Shortly after I posted I discovered "swapaxes" (http://docs.scipy.org/doc/numpy/reference/generated/numpy.swapaxes.html), so I made a version with that too:
res2 = res.swapaxes(0, 2).reshape((res.shape[1] * res.shape[2], res.shape[0]))
took 9.2 seconds
It was only a wee bit faster than my original (9.3 s), but with only one discernible memory peak in my process... still a big and slow peak.
As magic suggested:
res.shape = (res.shape[0], res.shape[1]*res.shape[2])
res_T = res.T
This took basically no time (2.4e-5 seconds), with no memory spike.
And with a copy thrown in:
res.shape = (res.shape[0], res.shape[1]*res.shape[2])
res_T = res.T.copy()
makes the operation take 0.85 seconds with a similar (but brief) memory spike (for the copy).
The take-home for me is that 'swapaxes' does the same thing as a transpose, but you can swap any two axes you want, whereas .T has its fixed way of reversing them. It's also nice to see how a transpose behaves in 3-D... that is the main point for me... not needing to ravel. Also, the transpose is a view, not a copy.
You can change the array's shape attribute, leading to an in-place shape change. It's a bit tricky to tell which dimensions go where, but something along these lines should work:
res.shape = (res.shape[0], res.shape[1]*res.shape[2]) ## converts to num_rasters, h*w
Transposing this would give you a view (so would sort-of be in place), so then you can do
res_T = res.T
and this should lead to no memory copying, to my knowledge.
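A small check with toy sizes (my addition) that the in-place shape change plus transpose really is view-only:
import numpy as np

res = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)   # (num_rasters, h, w)
res.shape = (res.shape[0], res.shape[1] * res.shape[2])         # (num_rasters, h*w), no copy
res_T = res.T                                                   # (h*w, num_rasters) view
print(np.shares_memory(res, res_T))                             # True -> no data was copied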
