Multi-Dimensional Batch-Image Convolution using Numpy - python

In image processing and classification networks, a common task is the convolution or cross-correlation of input images with some fixed filters. For example, in convolutional neural nets (CNNs), this is an extremely common operation. I have reduced the general version task to this:
Given: a batch of N images [N,H,W,D,...] and a set of K filters [K,H,W,D,...]
Return: a ndarray that represents the m-dimensional cross-correlation (xcorr) of image N_i with filter K_j for every N_i in N and K_j in K
Currently, I am using scipy.spatial.cdist on a custom function that represents the max of the xcorr of two m-dim images, namely scipy.signal.correlate. The code looks something like this:
from scipy.spatial.distance import cdist
from scipy.signal import correlate
def xcorr(u,v):
'''unfortunately, cdist only takes 2D arrays, so need to do this'''
u = np.reshape(u, [96,96,3])
v = np.reshape(v, [96,96,3])
return np.max(correlate(u,v,mode='same',method='fft'))
batch_images = np.random.random([500,96,96,3])
my_filters = np.random.random([1000,96,96,3])
# unfortunately, cdist only takes 2D arrays, so need to do this
batch_vec = np.reshape(batch_images, [-1,np.prod(batch_images.shape[1:])])
filt_vec = np.reshape(my_filters, [-1,np.prod(my_filters.shape[1:])])
answer = cdist(batch_vec, filt_vec, xcorr)
The method works, and its nice that cdist is automatically parallelized across threads, but it is actually quite slow. I am guessing this is due to a number of reasons, including non-optimal use of the cache between threads (e.g. keep one filter fixed in cache while you filter all the images, or vice versa), the reshape operation inside xcorr, etc.
Does the community have any ideas how to speed this up? I realize in my example xcorr takes the maximum over the cross-correlation between both images, but this was just an example that was fit to work with cdist. Ideally, you could perform this batch operation and use some other function (or none) to get the output you wanted. Ideal solutions could handle (R,G,B,D,...) data.
Any/all help appreciated, including but not limited to wrapping C, although Python/numpy solutions are preferred. I saw some posts related to einsum notation, but I am not super familiar with that, so any help would be appreciated. I welcome tensorflow solutions IF they are able to get the same answer (within reasonable precision) as the corresponding slow numpy version.

Related

Total variation implementation in numpy for a piecewise linear function

I would like to use total variation in Python, but I wasn't able to find an existing implementation.
Assuming that I have an array with a finite number of elements, is the implementation with NumPy simply as:
import numpy as np
a = np.array([...], dtype=float)
tv = np.sum(np.abs(np.diff(a)))
My main doubt is how to compute the supremum of tv across all partitions, and if just the sum of the absolute difference might suffice for a finite array of floats.
Edit: My input array represents a piecewise linear function, therefore the supremum over the full set of partitions is indeed the sum of absolute differences between contiguous points.
Yes, that is correct.
I imagine you're confused by the mathy definition on the Wikipedia page for total variation. Have a look at the more practical definition on the Wikipedia page for total variation denoising instead.
For an actual code (even Python) implementation, see e.g. Tensorflow's total_variation(), though this is for one or more (2D, color) images, so the TV is computed for both rows and columns, and then added together.

How to np.convolve over two 2d arrays

I want to carry out np.convolve for two 2d arrays in a vectorized manner. Here is the thing:
The function np.convolve takes two 1d arrays, a and v, and computes the convolution. The result reads:
output[n] = \sum_m a[m] v[n - m] .
What I want to do is, for 2d arrays a and v, to repeat "convolution along axis=0" over axis=1. That is,
output[n][k] = \sum_m a[m][k] v[n - m][k] .
Though this is attained by a for statement with respect to the axis=1, I'm keen on finding a way to "vectorize" this to improve performance.
I looked up for a while but no clue. I'm glad if someone could find a way out. Thank you so much.
Depending on the size of your arrays it may be quicker to do a multiplication of the fft arrays, which is mathematically equivalent to the convolution operation. Based on the answer here, it is quicker, however it appears limited when you get to exceptionally long sequences, as proven here. An example of what this could look like:
from scipy.signal import fftconvolve
output = fftconvolve(a, v, mode='same') #mode='full', too
If they help you don't forget to give them an upvote, too :).

About custom operations in Tensorflow and PyTorch

I have to implement an energy function, termed Rigidity Energy, as in Eq 7 of this paper here.
The energy function takes as input two 3D object meshes, and returns the energy between them. The first mesh is the source mesh, and the second mesh is the deformed version of the source mesh. In rough psuedo-code, the computation would go like this:
Iterate over all the vertices in the source mesh.
For every vertex, compute its covariance matrix with its neighboring vertices.
Perform SVD on the computed covariance matrix and find the rotation matrix of the vertex.
Use the computed rotation matrix, the point coordinates in the original mesh and the corresponding coordinates in the deformed mesh, to compute the energy deviation of the vertex.
Thus this energy function requires me to iterate over each point in the mesh, and the mesh could have more than 2k such points. In Tensorflow, there are two ways to do this. I can have 2 tensors of shape (N,3), one representing the points of source and the other of the deformed mesh.
Do it purely using Tensorflow tensors. That is, iterate over elements of the above tensors using tf.gather and perform the computation on each point using only existing TF operations. This method, would be extremely slow. I've tried to define loss functions that iterate over 1000s of points before, and the graph construction itself takes too much time to be practical.
Add a new TF OP as explained in the TF documentation here . This involves writing the function in CPP (and Cuda, for GPU support), and registering the new OP with TF.
The first method is easy to write, but impractically slow. The second method is a pain to write.
I've used TF for 3 years, and have never used PyTorch before, but at this point I'm considering switching to it, if it offers a better alternative for such cases.
Does PyTorch have a way of implementing such loss functions both easily and performs as fast as it would on GPU. i.e, A pythonic way of writing my own loss functions that runs on GPU, without any C or Cuda code on my part?
As far as I understand, you are essentially asking if this operation can be vectorized. The answer is no, at least not fully, because svd implementation in PyTorch is not vectorized.
If you showed the tensorflow implementation, it would help in understanding your starting point. I don't know what you mean by finding the rotation matrix of the vertex, but I would guess this can be vectorized. This would mean that svd is the only non-vectorized operation and you could perhaps get away with writing just a single custom OP, that is the vectorized svd - which is likely quite easy, because it would amount to calling some library routines in a loop in C++.
Two possible sources of problems I see are
if the neighborhoods of N(i) in equation 7 can be of significantly different sizes (which would mean that the covariance matrices are of different sizes and vectorization would require some dirty tricks)
the general problem of dealing with meshes and neighborhoods could be difficult. This is an innate property of irregular meshes, but PyTorch has support for sparse matrices and a dedicated package torch_geometry, which at least helps.

Apodization Mask for Fast Fourier Transforms in Python

I need to do a Fourier transform of a map in Python. Fast Fourier Transforms expect periodic boundary conditions, but the input map is not periodic. So I need to apply an input filter/weight slowly tapering the map toward zero at the edges. Are there libraries for doing this in python?
My favorite function to apodize a map is the generalized Gaussian (also called 'Super-Gaussian' which is a Gaussian whose exponent is raised to a power P. By setting P to, say, 4 or 6 you get a flat-top pulse which falls off smoothly, which is good for FFT applications where sharp edges always create ripples in conjugate space.
The generalized Gaussian is available on Scipy. Here is a minimal code (Python 3) to apodize a 2D array with a generalized Gaussian. As noted in previous comments, there are dozens of functions which would work just as well.
import numpy as np
from scipy.signal import general_gaussian
# A 128x128 array
array = np.random.rand(128,128)
# Define a general Gaussian in 2D as outer product of the function with itself
window = np.outer(general_gaussian(128,6,50),general_gaussian(128,6,50))
# Multiply
ap_array = window*array
Such tapering is often referred to as a "window".
Scipy has many window functions.
You can use numpy.expand_dims to create the 2D window you want.
Regarding Stefan's comment, apparently the numpy team thinks that including more than arrays was a mistake. I would stick to using scipy for signal processing. Watch out, as they moved quite a bit of functions around in their 1.0 release so older documentation is, well, quite old.
As a final note: a "filter" is typically reserved for multiplications you apply in the Frequency domain, not spatial domain.

3D volume processing using dask

I’m exploring 3D interactive volume convolution with some simple stencils using dask right now.
Let me explain what I mean:
Assume that you have a 3D data which you would like to process through Sobel Transform (for example to get L1 or L2 gradient).
Then you divide your input 3D image into subvolumes (with some overlapping boundaries – for 3x3x3 stencil Sobel it will demand +2 samples overlap/padding)
Now let’s assume that you create a delayed computation of the Sobel 3D transform on entire 3D volume – but not executing it yet.
And now the most important part:
I want to write a function which will extract some particular 2D section from the virtually transformed data.
And then finally let dask everything to compute:
But what I need dask to do is not to compute the entire transform for me and then provide a section.
I need it to execute only those tasks which are needed to compute that particular 2D transformed image slice.
Do you think – it’s possible?
In order to explain it with image – please consider this to be a 3D domain decomposition (this is from DWT – but good for illustration from here):
illistration of domain decomposition
And assume that there is a function which computes 3D transform of the entire volume using dask.
But what I would like to get – for example – is 2D image of the transformed 3D data which consists from LLL1,LLH1,HLH1,HLL1 planes (essentially a single slice).
The important part is not to compute the whole subcubes – but let dask somehow automatically track the dependencies in the compute graph and evaluate only those.
Please don’t worry about compute v.s. copy time.
Assume that it has perfect ratio.
Let me know if more clarification is needed!
Thanks for your help!
I'm hearing a few questions. I'll answer each individually
Can Dask track which tasks are required for a subset of outputs and only compute those?
Yes. Lazy Dask operations produce a dependency graph. In the case of dask.arrays this graph is per-chunk. If your output only depends on a subset of the graph then Dask will remove tasks that are not necessary. The in-depth docs for this are here and the cull optimization in particular.
As an example consider this 100,000 by 100,000 array
>>> x = da.random.random((100000, 100000), chunks=(1000, 1000))
And lets say that I add a couple of 1d slices from it
>>> y = x[5000, :] + x[:, 5000].T
The resulting optimized graph is only large enough to compute the output
>>> graph = y._optimize(y.dask, y._keys()) # you don't need to do this
>>> len(graph) # it happens automatically
301
And we can compute the result quite quickly:
In [8]: %time y.compute()
CPU times: user 3.18 s, sys: 120 ms, total: 3.3 s
Wall time: 936 ms
Out[8]:
array([ 1.59069994, 0.84731881, 1.86923216, ..., 0.45040813,
0.86290539, 0.91143427])
Now, this wasn't perfect. It did have to produce all of the 1000x1000 chunks that our two slices touched. But you can control the granularity there.
Short answer: Dask will automatically inspect the graph and only run those tasks that are necessary to evaluate the output. You don't need to do anything special to do this.
Is it a good idea to do overlapping array computations with dask.array?
Maybe. The relevant doc page is here on Overlapping Blocks with Ghost Cells. Dask.array has convenience functions to make this easy to write down. However it will create in-memory copies. Many people in your position find memcopy too slow. Dask generally doesn't support in-place computation so we can't be as efficient as proper MPI code. I'll leave the performance question here to you though.
Not to detract from the nicely laid out answer from #MRocklin, but more to add to it.
I also regularly find myself needing to do things like edge detection and other image processing techniques on large scale array data. As Dask is a very nice library for constructing and exploring such computational workflows on large array data, have put together some utility libraries for some common image processing techniques in a GitHub organization called dask-image. They have largely been designed to mimic SciPy's ndimage API.
As to using a Sobel operator with Dask, one can use this sobel function from dask-ndfilters (permissively licensed) to perform this operation on a Dask Array. It handles proper haloing on the blocks underneath the hood returning a new Dask Array.
As SciPy's sobel function (and dask-ndfilters' sobel as well) operate on one dimension, one will need to map over each axis and stack to get the full Sobel operator result. That said, this is quite straightforward to do. Below is a brief snippet showing how to do this on a random Dask Array. Also included is taking a slice along the XZ-plane. Though one could just as easily take any other slice or perform additional operations on the data.
Hope this helps. :)
import dask.array as da
import dask_ndfilters as da_ndfilt
d = da.random.random((100, 120, 140), chunks=(25, 30, 35))
ds = da.stack([da_ndfilt.sobel(d, axis=i) for i in range(d.ndim)])
dsp = ds[:, :, 0, :]
asp = dsp.compute()

Categories

Resources