Efficient calculation of Sobel gradient magnitude

I'm currently using this code to calculate the magnitude of the Sobel gradient:
sobel_x = cv.CreateImage(cv.GetSize(im), cv.IPL_DEPTH_16S, 1)
sobel_y = cv.CreateImage(cv.GetSize(im), cv.IPL_DEPTH_16S, 1)
cv.Sobel(im, sobel_x, 1, 0, 3)
cv.Sobel(im, sobel_y, 0, 1, 3)
width, height = cv.GetSize(im)
for i in range(width*height):
    x, _, _, _ = cv.Get1D(sobel_x, i)
    y, _, _, _ = cv.Get1D(sobel_y, i)
    px = int(math.sqrt(x*x + y*y))
    cv.Set1D(sobel, i, px)
It's simple enough, but it's not very efficient, because I'm accessing each pixel one by one. I was hoping for a better way to do this in OpenCV:
sobel_x2 = cv.CreateImage(cv.GetSize(im), cv.IPL_DEPTH_32S, 1)
sobel_y2 = cv.CreateImage(cv.GetSize(im), cv.IPL_DEPTH_32S, 1)
sobel_2 = cv.CreateImage(cv.GetSize(im), cv.IPL_DEPTH_32S, 1)
cv.Mul(sobel_x, sobel_x, sobel_x2)
cv.Mul(sobel_y, sobel_y, sobel_y2)
cv.Add(sobel_x2, sobel_y2, sobel_2)
Here I'm just squaring the images and adding them. It uses more memory but should be faster because now some operations will be done in parallel. What I'm stuck on is there's no element-wise square root function (cv.Sqrt seems to only work with scalars).
Any ideas?

As you've already noted, cv.Sqrt() only accepts a scalar in the Python bindings. Since there is an equivalent function, cv::sqrt(), that performs an element-wise square-root, it should also be in the mostly auto-generated Python bindings. Perhaps this is a bug in the version of OpenCV that you are using.
Regardless, you should be able to use cv.Pow() to get the same result:
cv.Pow(src, dst, 0.5)
This is likely not as fast as cv.Sqrt() would be, but should still dramatically outperform an element-wise computation.
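For readers whose gradient images are already NumPy arrays (as the newer cv2 bindings return them), the whole-image square root can be sketched as follows. This is a minimal illustration, not the legacy cv API used in the question: the helper name gradient_magnitude and the sample arrays gx and gy are hypothetical.

```python
import numpy as np

def gradient_magnitude(gx, gy):
    """Element-wise sqrt(gx**2 + gy**2), computed in floating point."""
    # Cast first so that squaring 16-bit Sobel responses cannot overflow.
    gx = np.asarray(gx, dtype=np.float64)
    gy = np.asarray(gy, dtype=np.float64)
    return np.hypot(gx, gy)

# Hypothetical 16-bit signed Sobel responses for a 2x2 image.
gx = np.array([[3, 0], [-5, 8]], dtype=np.int16)
gy = np.array([[4, 0], [12, -6]], dtype=np.int16)
mag = gradient_magnitude(gx, gy)
```

The point is the same as with cv.Pow: one call processes every pixel, so no Python-level loop is needed.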

Related

How to parallelize a for loop with a shared array return?

I have a numpy array with an image, and a binary segmentation mask containing separate binary "blobs".
I wish to extract image statistics from the pixels under each of the binary blobs, separately. These values are stored inside a new numpy array, named cnr_map.
My current implementation uses a for loop. However, when the number of binary blobs increases, it is really slow, and I'm wondering if it is possible to parallelize it.
import numpy as np
from scipy.ndimage import label
labeled_array, num_features = label(mask)
cnr_map = np.copy(mask)
# label() numbers the blobs 1..num_features; 0 is the background
for k in range(1, num_features + 1):
    foreground_mask = labeled_array == k
    background_mask = 1.0 - foreground_mask
    a = np.mean(image[foreground_mask == 1])
    b = np.mean(image[background_mask == 1])
    c = np.std(image[background_mask == 1])
    cnr = np.abs(a - b) / (c + 1e-12)
    cnr_map[foreground_mask] = cnr
How can I parallelize the work so that the for loop runs faster?
I have seen this question, but my case is a bit different as I want to return a numpy array with the cumulative modifications of the multiple processes (i.e. cnr_map), and I don't understand how to do it.
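Before reaching for multiple processes, it may be worth checking whether the per-blob statistics can be computed for all blobs at once. The sketch below is one possible approach, not something proposed in the thread: it assumes labels follow the scipy.ndimage.label convention (0 is background, 1..n are contiguous blob ids), and the function name cnr_per_label is hypothetical. It derives each blob's background statistics from global sums minus the blob's own sums, which is fast but numerically less robust than np.std for very large values.

```python
import numpy as np

def cnr_per_label(image, labels):
    """Contrast-to-noise ratio of every blob without a Python loop over blobs."""
    img = image.astype(np.float64).ravel()
    lab = labels.ravel()
    n = int(lab.max())

    # Per-label pixel count, sum and sum of squares.
    count = np.bincount(lab, minlength=n + 1).astype(np.float64)
    s1 = np.bincount(lab, weights=img, minlength=n + 1)
    s2 = np.bincount(lab, weights=img * img, minlength=n + 1)

    # The "background" of blob k is every pixel that is not in blob k.
    bg_count = img.size - count
    bg_mean = (img.sum() - s1) / bg_count
    bg_var = ((img * img).sum() - s2) / bg_count - bg_mean ** 2
    bg_std = np.sqrt(np.maximum(bg_var, 0.0))

    cnr = np.abs(s1 / count - bg_mean) / (bg_std + 1e-12)
    cnr[0] = 0.0          # label 0 is not a blob
    return cnr[labels]    # paint every pixel with the CNR of its blob

# Tiny usage example with two hand-made blobs.
rng = np.random.default_rng(0)
image = rng.random((6, 6))
labels = np.zeros((6, 6), dtype=np.int64)
labels[0:2, 0:2] = 1
labels[4:6, 3:6] = 2
cnr_map = cnr_per_label(image, labels)
```

Because the result is built by indexing a small per-label table, there is no shared array to coordinate between workers.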

Best method to add noise on tf.dataset images "on the fly"

After training a model (image classification) I would like to see how it performs differently when I evaluate a proper image and various noised versions of it.
The type of noise I have in mind is a random change in pixel values. I tried this approach:
# --Inside the generator function that I provide to model.predict_generator--
# dataset is a numpy array of paths to the noise-free images
dt = tf.data.Dataset.from_generator(lambda: image_generator(dataset), output_types=(tf.float32))
def image_generator(image_paths):
    for path in image_paths:
        # im is keras.preprocessing image
        img = im.load_img(path,
                          color_mode='rgb',
                          target_size=(224,224))
        img_to_numpy = np.array(img)
        for _ in range (0, 5):
            tmp_numpy_image = img_to_numpy.copy()
            for i in range(tmp_numpy_image.shape[0]):
                for j in range(tmp_numpy_image.shape[1]):
                    # add noise
                    tmp_numpy_image[i][j] = ...
            yield tmp_numpy_image
This process works fine but it is very slow. I also use dataset.batch and dataset.prefetch on dt, and I didn't find a combination of their values that reduces the running time.
Is there a smarter way to do it? I tried yielding the clean images and adding the noise later inside dataset.map. The problem is that inside map I have to manipulate tensors, and I didn't find a way to change each pixel value.
SOLUTION
I used @Marat's approach and it worked like a charm: the whole process went from 20-30 hours to minutes. My noise was a simple ±1, but I didn't want to overflow (255 + 1 = 0 in uint8), so I only had to use numpy masks:
...
tmp_numpy_image = img_to_numpy.copy()
# randint excludes the upper bound, so use 2 to draw from {-1, 0, 1}
noise = np.random.randint(-1, 2, img_to_numpy.shape)
# tmp_numpy_image is promoted to a wider signed integer type here, so it cannot wrap around
tmp_numpy_image = tmp_numpy_image + noise
np.putmask(tmp_numpy_image, tmp_numpy_image < 0, 0)
np.putmask(tmp_numpy_image, tmp_numpy_image > 255, 255)
tmp_numpy_image = tmp_numpy_image.astype('uint8')
yield tmp_numpy_image

The biggest overhead here is the per-pixel operations (the double for loop). Vectorizing them should give a substantial speedup:
noise_magnitude = 10
...
img_max_value = img_to_numpy.max() * np.ones(img_to_numpy.shape)
for _ in range (0, 5):
    # depending on range of values, you might want to adjust noise magnitude
    noise = np.random.randint(0, noise_magnitude, img_to_numpy.shape)
    # after adding noise, clip values exceeding the max value (np.minimum keeps the smaller of the two)
    yield np.minimum(img_to_numpy + noise, img_max_value)
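The vectorized idea can be put together as a small self-contained generator. This is a sketch under stated assumptions: the input is a uint8 image, the noise is a symmetric integer offset, and the names noisy_copies, n_copies and magnitude are hypothetical, not taken from the thread.

```python
import numpy as np

def noisy_copies(img, n_copies=5, magnitude=1, seed=None):
    """Yield n_copies noisy versions of a uint8 image, clipped to [0, 255]."""
    rng = np.random.default_rng(seed)
    # Work in a wider signed type so that 255 + 1 does not wrap around to 0.
    base = img.astype(np.int16)
    for _ in range(n_copies):
        # endpoint=True makes the upper bound inclusive: values in [-magnitude, magnitude].
        noise = rng.integers(-magnitude, magnitude, size=img.shape, endpoint=True)
        yield np.clip(base + noise, 0, 255).astype(np.uint8)

img = np.array([[0, 128, 255]], dtype=np.uint8)
copies = list(noisy_copies(img, n_copies=3, magnitude=1, seed=0))
```

np.clip does the work of the two np.putmask calls in one step, and each copy costs a handful of array operations instead of one Python iteration per pixel.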

Faster implementation to quantize an image with an existing palette?

I am using Python 3.6 to perform basic image manipulation through Pillow. Currently, I am attempting to take 32-bit PNG images (RGBA) of arbitrary color compositions and sizes and quantize them to a known palette of 16 colors. Optimally, this quantization method should be able to leave fully transparent (A = 0) pixels alone, while forcing all semi-transparent pixels to be fully opaque (A = 255). I have already devised working code that performs this, but I wonder if it may be inefficient:
import math
from PIL import Image
# a list of 16 RGBA tuples
palette = [
    (0, 0, 0, 255),
    # ...
]
with Image.open('some_image.png').convert('RGBA') as img:
    for py in range(img.height):
        for px in range(img.width):
            pix = img.getpixel((px, py))
            if pix[3] == 0: # Ignore fully transparent pixels
                continue
            # Perform exhaustive search for closest Euclidean distance
            dist = 450
            best_fit = (0, 0, 0)
            for c in palette:
                if pix[:3] == c[:3]: # If the RGB part matches exactly, break
                    best_fit = c[:3]
                    break
                tmp = math.sqrt(pow(pix[0]-c[0], 2) + pow(pix[1]-c[1], 2) + pow(pix[2]-c[2], 2))
                if tmp < dist:
                    dist = tmp
                    best_fit = c[:3]
            # Force the quantized pixel to be fully opaque
            img.putpixel((px, py), best_fit + (255,))
    img.save('quantized.png')
I can think of two main inefficiencies in this code:
Image.putpixel() is a slow operation
Calculating the distance function multiple times per pixel is computationally wasteful
Is there a faster method to do this?
I've noted that Pillow has a native function Image.quantize() that seems to do exactly what I want. But as it is coded, it forces dithering in the result, which I do not want. This has been brought up in another StackOverflow question. The answer to that question was simply to extract the internal Pillow code and tweak the control variable for dithering, which I tested, but I find that Pillow corrupts the palette I give it and consistently yields an image where the quantized colors are considerably darker than they should be.
Image.point() is a tantalizing method, but it only works on each color channel individually, where color quantization requires working with all channels as a set. It'd be nice to be able to force all of the channels into a single channel of 32-bit integer values, which seems to be what the ill-documented mode "I" would do, but if I run img.convert('I'), I get a completely greyscale result, destroying all color.
An alternative method seems to be using NumPy and altering the image directly. I've attempted to create a lookup table of RGB values, but the three-dimensional indexing of NumPy's syntax is driving me insane. Ideally I'd like some kind of code that works like this:
img_arr = numpy.array(img)
# Find all unique colors
unique_colors = numpy.unique(img_arr, axis=0)
# Generate lookup table
colormap = numpy.empty(unique_colors.shape)
for i, c in enumerate(unique_colors):
    dist = 450
    best_fit = None
    for pc in palette:
        tmp = math.sqrt(pow(c[0] - pc[0], 2) + pow(c[1] - pc[1], 2) + pow(c[2] - pc[2], 2))
        if tmp < dist:
            dist = tmp
            best_fit = pc
    colormap[i] = best_fit
# Hypothetical pseudocode I can't seem to write out
for iy in range(img_arr.size):
    for ix in range(img_arr[0].size):
        if img_arr[iy, ix, 3] == 0: # Skip transparent
            continue
        index = # Find index of matching color in unique_colors, somehow
        img_arr[iy, ix] = colormap[index]
I note with this hypothetical example that numpy.unique() is another slow operation, since it sorts the output. Since I cannot seem to finish the code the way I want, I haven't been able to test if this method is faster anyway.
I've also considered flattening the RGBA axis by converting the values to a 32-bit integer, in order to build a one-dimensional lookup table with a simpler index:
def shift(a):
    return a[0] << 24 | a[1] << 16 | a[2] << 8 | a[3]
img_arr = numpy.apply_along_axis(shift, 1, img_arr)
But this operation seemed noticeably slow on its own.
I would prefer answers that involve only Pillow and/or NumPy, please. Unless using another library demonstrates a dramatic computational speed increase over any PIL- or NumPy-native solution, I don't want to import extraneous libraries to do something these two libraries should be reasonably capable of on their own.

Python-level for loops should be avoided for speed.
I think you should make a tensor like:
d2[x,y,color_index,rgb] = distance_squared
where rgb = 0..2 (0 = r, 1 = g, 2 = b).
Then compute the distance:
d[x,y,color_index] = sqrt(sum(rgb,d2))
Then select the color_index with the minimal distance:
c[x,y] = min_index(color_index, d)
Finally replace alpha as needed:
alpha = ceil(orig_image.alpha)
img = c,alpha
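The tensor recipe above can be sketched in NumPy with broadcasting. This is an illustrative reading of the pseudocode, not the answerer's own code: the function name quantize_to_palette is hypothetical, the palette is assumed to be given as RGB triples, and the square root is skipped because the nearest colour is the same for squared distances.

```python
import numpy as np

def quantize_to_palette(rgba, palette_rgb):
    """Map each non-transparent pixel of an (H, W, 4) uint8 array to its nearest palette colour."""
    rgba = np.asarray(rgba, dtype=np.uint8)
    palette_rgb = np.asarray(palette_rgb, dtype=np.uint8)

    # (H, W, 1, 3) - (1, 1, P, 3) broadcasts to (H, W, P, 3).
    diff = (rgba[:, :, None, :3].astype(np.int32)
            - palette_rgb[None, None, :, :].astype(np.int32))
    dist2 = (diff ** 2).sum(axis=-1)   # squared distances, shape (H, W, P)
    nearest = dist2.argmin(axis=-1)    # palette index per pixel, shape (H, W)

    out = rgba.copy()
    visible = rgba[..., 3] > 0                        # fully transparent pixels stay untouched
    out[visible, :3] = palette_rgb[nearest[visible]]
    out[visible, 3] = 255                             # semi-transparent becomes opaque
    return out

palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
pixels = np.array([[(10, 10, 10, 255), (250, 240, 245, 128)],
                   [(200, 30, 20, 1), (90, 90, 90, 0)]], dtype=np.uint8)
quantized = quantize_to_palette(pixels, palette)
```

With Pillow, the input would come from numpy.array(img) on an RGBA image and the result could go back through Image.fromarray. The intermediate array holds H x W x P x 3 integers, so for very large images it may be worth processing the image in strips.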

A long term puzzle, how to optimize multi-level loops in python?

I have written a function in Python to calculate the delta function with Gaussian broadening, which involves four nested loops. However, the efficiency is very low, about 10 times slower than using Fortran in a similar way.
def Delta_Gaussf(Nw, N_bd, N_kp, hw, eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1][k1][i1][j1] = np.exp(pow((eigv[k1][j1]-eigv[k1][i1]-hw[w1])/width,2))
    return Delta_Gauss
I have removed some constants to make it look simpler.
Could anyone help me optimize this script to increase its efficiency?

Simply compile it
To get the best performance I recommend Numba (easy usage, good performance). Alternatively Cython may be a good idea, but it requires somewhat more changes to your code.
You actually got everything right and implemented an easy-to-understand solution (for a human and, most importantly, for a compiler).
There are basically two ways to gain performance:
Vectorize the code, as @scnerd showed. This is usually a bit slower and more complex than simply compiling straightforward code that only uses some for loops. Don't vectorize your code and then use a compiler: starting from a simple looping approach, vectorizing takes some work and leads to a slower and more complex result. The advantage of vectorizing is that you only need numpy, which is a standard dependency in nearly every Python project that deals with numerical calculations.
Compile the code. If you already have a solution with a few loops and no, or only a few, non-numpy functions involved, this is often the simplest and fastest solution.
A solution using Numba
You do not have to change much. I changed the pow function to np.power and made some slight changes to the way arrays are accessed in numpy (this isn't really necessary).
import numba as nb
import numpy as np
#performance-debug info
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
@nb.njit(fastmath=True)
def Delta_Gaussf_nb(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=float)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    if ( j1 >= i1 ):
                        Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss
Due to the 'if' the SIMD-vectorization fails. In the next step we can remove it (maybe a call outside the njited function to np.triu(Delta_Gauss) will be necessary). I also parallelized the function.
@nb.njit(fastmath=True,parallel=True)
def Delta_Gaussf_1(Nw, N_bd, N_kp, hw, width,eigv):
    Delta_Gauss = np.zeros((Nw,N_kp,N_bd,N_bd),dtype=np.float64)
    for w1 in nb.prange(Nw):
        for k1 in range(N_kp):
            for i1 in range(N_bd):
                for j1 in range(N_bd):
                    Delta_Gauss[w1,k1,i1,j1] = np.exp(np.power((eigv[k1,j1]-eigv[k1,i1]-hw[w1])/width,2))
    return Delta_Gauss
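The remark about calling np.triu outside the compiled function can be checked without Numba. The following plain-NumPy sketch (the function names delta_with_if and delta_branch_free are hypothetical) compares the loop that keeps the if with the branch-free loop followed by np.triu, which acts on the last two axes of a 4-D array.

```python
import numpy as np

def delta_with_if(hw, width, eigv):
    """Reference loops: only the upper triangle (j1 >= i1) is filled."""
    n_kp, n_bd = eigv.shape
    out = np.zeros((len(hw), n_kp, n_bd, n_bd))
    for w1 in range(len(hw)):
        for k1 in range(n_kp):
            for i1 in range(n_bd):
                for j1 in range(n_bd):
                    if j1 >= i1:
                        out[w1, k1, i1, j1] = np.exp(
                            ((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width) ** 2)
    return out

def delta_branch_free(hw, width, eigv):
    """Same loops without the branch: every (i1, j1) pair is computed."""
    n_kp, n_bd = eigv.shape
    out = np.zeros((len(hw), n_kp, n_bd, n_bd))
    for w1 in range(len(hw)):
        for k1 in range(n_kp):
            for i1 in range(n_bd):
                for j1 in range(n_bd):
                    out[w1, k1, i1, j1] = np.exp(
                        ((eigv[k1, j1] - eigv[k1, i1] - hw[w1]) / width) ** 2)
    return out

hw = np.linspace(0.0, 1.0, 3)
eigv = np.arange(8, dtype=np.float64).reshape(2, 4) / 10.0
width = 20.0
# np.triu zeroes everything below the diagonal of the last two axes.
same = np.allclose(np.triu(delta_branch_free(hw, width, eigv)),
                   delta_with_if(hw, width, eigv))
```

The branch-free version does roughly twice the arithmetic, which is the trade-off the answer accepts in exchange for a loop body the compiler can vectorize.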
Performance
Nw = 20
N_bd = 20
N_kp = 20
width=20
hw = np.linspace(0., 1.0, Nw)
eigv = np.zeros((N_kp, N_bd),dtype=np.float64)
Your version: 0.5s
first_compiled version: 1.37ms
parallel version: 0.55ms
These easy optimizations lead to about 1000x speedup.

BLUF: Using Numpy's full functionality, plus another neat module, you can make the Python code over 100x faster than the raw for-loop code. Using @max9111's answer, however, you can get even faster with much cleaner code and less work.
The resulting code looks nothing like the original, so I'll do the optimization one step at a time so that the process and final code make sense. Essentially, we're going to use a lot of broadcasting in order to get Numpy to perform the looping under the hood (which is always faster than looping in Python). The result computes the full square of results, which means we're necessarily duplicating some work since the result is symmetrical, but it's easier, and honestly probably faster, to do this work in high-performance ways than to have an if at the deepest level of looping in order to avoid the computation. This might be avoidable in Fortran, but probably not in Python. If you want the result to be identical to your provided source, we'll need to take the upper triangle of the result of my code below (which I do in the sample code below... feel free to remove the triu call in actual production, it's not necessary).
First, we'll notice a few things. The main equation has a denominator that performs np.sqrt, but the content of that computation doesn't change at any iteration of the loop, so we'll compute it once and re-use the result. This turns out to be minor, but we'll do it anyway. Next, the main function of the inner two loops is to perform eigv[k1][j1] - eigv[k1][i1], which is quite easy to vectorize. If eigv is a matrix, then eigv[k1] - eigv[k1].T produces a matrix where result[i1, j1] = eigv[k1][j1] - eigv[k1][i1]. This allows us to entirely remove the innermost two loops:
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    eigv = np.matrix(eigv)
    for w1 in range(Nw):
        for k1 in range(N_kp):
            this_eigv = (eigv[k1] - eigv[k1].T - hw[w1])
            v = np.power(this_eigv / width, 2)
            Delta_Gauss[w1, k1, :, :] = np.exp(-0.5 * v) / denom
    # Take the upper triangle to make the result exactly equal to the original code
    return np.triu(Delta_Gauss)
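The row-minus-column trick used for the inner two loops can be seen on a tiny array. This sketch uses plain ndarrays with np.newaxis instead of np.matrix; the sample values are made up for illustration.

```python
import numpy as np

row = np.array([1.0, 4.0, 9.0])   # one row of eigenvalues, shape (B,)

# (1, B) - (B, 1) broadcasts to (B, B): diff[i, j] = row[j] - row[i]
diff = row[np.newaxis, :] - row[:, np.newaxis]
```

Every pairwise difference is produced by a single subtraction, which is what replaces the i1 and j1 loops.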
Well, now that we're on the broadcasting bandwagon, it really seems like the remaining two loops should be possible to remove in the same way. As it happens, it is easy! The only thing we need k1 for is to get the row out of eigv that we're trying to pairwise-subtract... so why not do this to all rows at the same time? We're currently basically subtracting matrices of shapes (1, B) - (B, 1) for each of N rows in eigv (where B is N_bd). We can abuse broadcasting to do this for all rows of eigv simultaneously by subtracting matrices of shapes (N, 1, B) - (N, B, 1) (where N is N_kp):
def mine_Delta_Gaussf(Nw, N_bd, N_kp, hw, width, eigv):
    Delta_Gauss = np.zeros((Nw, N_kp, N_bd, N_bd), dtype=float)
    denom = np.sqrt(2.0 * np.pi) * width
    for w1 in range(Nw):
        this_eigv = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2) - hw[w1]
        v = np.power(this_eigv / width, 2)
        Delta_Gauss[w1, :, :, :] = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The next step should be clear now. We're only using w1 to index hw, so let's do some more broadcasting to make numpy do the looping instead. We're currently subtracting a scalar value from a matrix of shape (N, B, B), so to get the resulting matrix for each of the W values in hw, we need to perform subtraction on matrices of the shapes (1, N, B, B) - (W, 1, 1, 1) and numpy will broadcast everything to produce a matrix of the shape (W, N, B, B):
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    # -1 lets numpy infer the length W, giving shape (W, 1, 1, 1)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
On my example data, this code is ~100x faster (~900ms to ~10ms). Your mileage might vary.
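To make the shape bookkeeping concrete, here is a self-contained sketch of the fully broadcast version with the intermediate shapes written out. It follows the formula used in this answer; the function name delta_gauss_broadcast and the random test data are hypothetical.

```python
import numpy as np

def delta_gauss_broadcast(hw, width, eigv):
    """Loop-free Gaussian delta: result[w, k, i, j] for j >= i, zero elsewhere."""
    eigv = np.asarray(eigv, dtype=np.float64)   # shape (N, B)
    hw = np.asarray(hw, dtype=np.float64)       # shape (W,)

    # (N, 1, B) - (N, B, 1) -> (N, B, B): eigv[k, j] - eigv[k, i]
    eigv_sub = eigv[:, np.newaxis, :] - eigv[:, :, np.newaxis]
    # (1, N, B, B) - (W, 1, 1, 1) -> (W, N, B, B)
    w_sub = eigv_sub[np.newaxis] - hw.reshape(-1, 1, 1, 1)

    v = (w_sub / width) ** 2
    denom = np.sqrt(2.0 * np.pi) * width
    return np.triu(np.exp(-0.5 * v) / denom)

rng = np.random.default_rng(0)
hw = np.linspace(0.0, 1.0, 4)
eigv = rng.random((3, 5))
out = delta_gauss_broadcast(hw, 0.5, eigv)   # shape (4, 3, 5, 5)
```

Spot-checking a single entry against the scalar formula is a cheap way to confirm that the axes ended up in the intended order.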
But wait! There's more! Since our code is all numeric/numpy/python, we can use another handy module called numba to compile this function into an equivalent one with higher performance. Under the hood, it's basically reading what functions we're calling and converting the function into C-types and C-calls to remove the Python function call overhead. It's doing more than that, but that gives the gist of where we're going to gain benefit. To gain this benefit is trivial in this case:
import numba
@numba.jit
def Delta_Gaussf(hw, width, eigv):
    eigv_sub = np.expand_dims(eigv, 1) - np.expand_dims(eigv, 2)
    # -1 lets numpy infer the length W, giving shape (W, 1, 1, 1)
    w_sub = np.expand_dims(eigv_sub, 0) - np.reshape(hw, (-1, 1, 1, 1))
    v = np.power(w_sub / width, 2)
    denom = np.sqrt(2.0 * np.pi) * width
    Delta_Gauss = np.exp(-0.5 * v) / denom
    return np.triu(Delta_Gauss)
The resulting function is down to about ~7ms on my sample data, down from ~10ms, just by adding that decorator. Pretty nice for no effort.
EDIT: @max9111 gave a better answer that points out that numba works much better with the loop syntax than with numpy broadcasting code. With almost no work besides the removal of the inner if statement, he shows that numba.jit can make the almost-original code even faster. The result is much cleaner, in that you still have just the single innermost equation that shows what each value is, and you don't have to follow the magical broadcasting used above. I highly recommend using his answer.
Conclusion
For my given sample data (Nw = 20, N_bd = 20, N_kp = 20), my final runtimes are the following (I've included timings on the same computer for @max9111's solution, first without using parallel execution and then with it on my 2-core VM):
Original code: ~900 ms
Fortran estimate: ~90 ms (based on OP saying it was ~10x faster)
Final numpy code: ~10 ms
Final code with numba.jit: ~7 ms
max9111's solution (serial): ~4ms
max9111 (parallel 2-core): ~3ms
Overall vectorized speedup: ~130x
max9111's numba speedup: ~300x (potentially more with more cores)
I don't know exactly how fast your Fortran code is, but it looks like proper usage of numpy allows you to easily beat it by an order of magnitude, and @max9111's numba solution gives you potentially another order of magnitude.

Better image normalization with numpy

I already achieved the goal described in the title but I was wondering if there was a more efficient (or generally better) way to do it. First of all let me introduce the problem.
I have a set of images of different sizes, each with a width/height ratio less than or equal to 2 (it could be anything, but let's say 2 for now). I want to normalize them, meaning I want all of them to have the same size. Specifically, I am going to do it like this:
Find the maximum height over all images
Zoom each image so that it reaches the max height while keeping its ratio
Add padding to the right with just white pixels until the image has a width/height ratio of 2
Keep in mind the images are represented as numpy matrices of grey scale values [0,255].
This is how I'm doing it now in Python:
max_height = numpy.max([len(obs) for obs in data if len(obs[0])/len(obs) <= 2])
for obs in data:
    if len(obs[0])/len(obs) <= 2:
        new_img = ndimage.zoom(obs, round(max_height/len(obs), 2), order=3)
        missing_cols = max_height * 2 - len(new_img[0])
        norm_img = []
        for row in new_img:
            norm_img.append(np.pad(row, (0, missing_cols), mode='constant', constant_values=255))
        norm_img = np.resize(norm_img, (max_height, max_height*2))
There's a note about this code:
I'm rounding the zoom ratio because it makes the final height equal to max_height. I'm sure this is not the best approach, but it's working (any suggestion is appreciated here). What I'd like to do is expand the image, keeping the ratio, until it reaches a height equal to max_height. This is the only solution I have found so far; it worked right away, and the interpolation works pretty well.
So my final questions are:
Is there a better approach to achieve what is explained above (image normalization)? Do you think I could have done this differently? Is there a common good practice I'm not following?
Thanks in advance for your time.

Instead of ndimage.zoom you could use scipy.misc.imresize. This function allows you to specify the target size as a tuple, instead of by zoom factor. Thus you won't have to call np.resize later to get the size exactly as desired.
Note that scipy.misc.imresize calls PIL.Image.resize under the hood, so PIL (or Pillow) is a dependency.
Instead of using np.pad in a for-loop, you could allocate space for the desired array, norm_arr, first:
norm_arr = np.full((max_height, max_width), fill_value=255)
and then copy the resized image, new_arr into norm_arr:
nh, nw = new_arr.shape
norm_arr[:nh, :nw] = new_arr
For example,
from __future__ import division
import numpy as np
from scipy import misc
data = [np.linspace(255, 0, i*10).reshape(i,10)
        for i in range(5, 100, 11)]
max_height = np.max([len(obs) for obs in data if len(obs[0])/len(obs) <= 2])
max_width = 2*max_height
result = []
for obs in data:
    norm_arr = obs
    h, w = obs.shape
    if float(w)/h <= 2:
        scale_factor = max_height/float(h)
        target_size = (max_height, int(round(w*scale_factor)))
        new_arr = misc.imresize(obs, target_size, interp='bicubic')
        norm_arr = np.full((max_height, max_width), fill_value=255)
        # check the shapes
        # print(obs.shape, new_arr.shape, norm_arr.shape)
        nh, nw = new_arr.shape
        norm_arr[:nh, :nw] = new_arr
    result.append(norm_arr)
    # visually check the result
    # misc.toimage(norm_arr).show()
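The two NumPy-only pieces of this approach, computing the target size and padding onto a white canvas, can be isolated as below. This is a sketch with hypothetical helper names (target_size, pad_to_canvas); it deliberately leaves the resize step out, since scipy.misc.imresize is no longer available in recent SciPy releases and the resize call would depend on the imaging library at hand.

```python
import numpy as np

def target_size(h, w, max_height):
    """(rows, cols) that scale an h-by-w image to max_height while keeping its ratio."""
    return max_height, int(round(w * max_height / h))

def pad_to_canvas(img, height, width, fill=255):
    """Place img in the top-left corner of a (height, width) canvas filled with `fill`."""
    h, w = img.shape
    if h > height or w > width:
        raise ValueError("image does not fit on the canvas")
    # Allocate once and copy, instead of padding row by row.
    canvas = np.full((height, width), fill, dtype=img.dtype)
    canvas[:h, :w] = img
    return canvas

small = np.zeros((2, 3), dtype=np.uint8)
padded = pad_to_canvas(small, 2, 4)
size = target_size(5, 10, 20)
```

Passing dtype=img.dtype keeps the canvas in the image's own type; without it, np.full infers an integer type from the fill value.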
