As the title suggests, I have an image whose pixel coordinates I want to change using a mathematical function. So far I have the following code, which works but is very time consuming because of the nested loop. Do you have any suggestions to make it faster? To be quantitative, it takes about 2-2.5 minutes to complete the process on a 12-megapixel image.
imgcor = np.zeros(img.shape, dtype=img.dtype)
for f in range(rowc):
    for k in range(colc):
        offX = k + (f*b*c*(math.sin(math.radians(a))))
        offY = f + (f*b*d*(math.cos(math.radians(a))))
        imgcor[f, k] = img[int(offY)%rowc, int(offX)%colc]
P.S. I am using opencv 2.4.13 and python 2.7
There may be a way to get numpy to do some vectorized work for you, but one easy speedup is to avoid re-calculating some of the values on every iteration (I'm assuming a, b, c, and d do not change inside the loop). I'm curious what the speedup would be; can you report back?
imgcor = np.zeros(img.shape, dtype=img.dtype)
offX_precalc = b*c*(math.sin(math.radians(a)))
offY_precalc = b*d*(math.cos(math.radians(a)))
for f in range(rowc):
    for k in range(colc):
        offX = k + (f*offX_precalc)
        offY = f + (f*offY_precalc)
        imgcor[f, k] = img[int(offY)%rowc, int(offX)%colc]
OK, since the above was still too slow, I added a bit of vectorization and I'm curious whether it's faster:
imgcor = np.zeros(img.shape, dtype=img.dtype)
offX_precalc = b*c*(math.sin(math.radians(a)))
offY_precalc = b*d*(math.cos(math.radians(a))) + 1
for f in range(rowc):
    offY = int(f*offY_precalc)%rowc
    offXs = [int(k + (f*offX_precalc))%colc for k in range(colc)]
    imgcor[f,:] = img[offY, offXs]
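For completeness, here is an untested sketch of a fully vectorized version (assuming a, b, c, d, rowc, colc and img are defined as in the question) that builds the source coordinates for all pixels at once and does a single fancy-indexing lookup instead of any Python loop:

import math
import numpy as np

sin_term = b * c * math.sin(math.radians(a))
cos_term = b * d * math.cos(math.radians(a))

f_idx, k_idx = np.indices((rowc, colc))              # row and column index grids
offX = (k_idx + f_idx * sin_term).astype(int) % colc
offY = (f_idx + f_idx * cos_term).astype(int) % rowc

imgcor = img[offY, offX]                             # one fancy-indexing lookup for the whole image

If you can accept interpolation, cv2.remap with float32 coordinate maps does essentially the same lookup in optimized C.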
I have written some code in Python and wanted to improve it using Numba's function decorators. Using the just-in-time compiler works fine (`@jit`). However, when I tried to parallelize my code to speed it up even more, the program strangely runs slower than the non-parallelized version.
Here is my code:
@numba.njit(parallel=True)
def get_output(input, depth, input_size, kernel_size, input_depth, kernels, bias):
    output = np.zeros((depth, input_size[0] - kernel_size + 1, input_size[0] - kernel_size + 1))
    for k in numba.prange(depth):
        for i in numba.prange(input_depth):
            output[k] += valid_correlate(input[i], kernels[k][i])
        output[k] += bias[k]
    return output
@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    i_result = np.multiply(area, s_filter)
    result = np.sum(i_result)
    return result
@numba.njit(nogil=True)
def valid_correlate(mat, filter):
    f_mat = np.zeros((mat.shape[0] - filter.shape[0] + 1, mat.shape[1] - filter.shape[0] + 1))
    for x in range(f_mat.shape[0]):
        for y in range(f_mat.shape[1]):
            f_mat[x, y] = apply_filter(mat, filter, (x, y))
    return f_mat
With parallel=True it takes about 0.07 seconds, while it's about 0.056 seconds without it.
I can't seem to figure out the problem here and would be glad for any help!
Regards
-Eirik
As pointed out in the comments, creating temporary Numpy arrays is expensive, both because allocations do not scale in parallel and because memory-bound code does not scale either. Here is a modified (untested) version of apply_filter:
@numba.njit(fastmath=True)
def apply_filter(mat, filter, point):
    end_point = (min(mat.shape[0], point[0] + filter.shape[0]),
                 min(mat.shape[1], point[1] + filter.shape[1]))
    point = (max(0, point[0]), max(0, point[1]))
    area = mat[point[0]:end_point[0], point[1]:end_point[1]]
    if filter.shape != area.shape:
        s_filter = filter[0:area.shape[0], 0:area.shape[1]]
    else:
        s_filter = filter
    res = 0.0
    # Replace the slow `np.multiply` call creating a temporary array
    for i in range(area.shape[0]):
        for j in range(area.shape[1]):
            res += area[i, j] * s_filter[i, j]
    return res
I assumed the shapes of area and s_filter are the same. If this is not the case, then please provide a complete reproducible example.
Also note that the first call is slow because of compilation time. One way to fix that is to provide the function signature so the function is compiled eagerly. See the Numba documentation for more information about this.
By the way, the operation looks like a custom convolution. If so, note that FFTs are known to speed this up a lot when the filter is relatively big. For small filters, the code can be faster if the compiler knows the size of the array at compile time. If the filter is relatively big and separable, then the whole computation can be split into much faster steps.
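As an illustration of eager compilation (my own sketch, not from the question), here is what an explicit signature could look like, assuming 2-D float64 inputs; the signature string tells Numba the types up front so compilation happens at definition time instead of on the first call:

import numba
import numpy as np

# Giving Numba the signature compiles the function immediately,
# so the first real call does not pay the JIT cost.
@numba.njit("float64[:,:](float64[:,:], float64[:,:])", nogil=True, fastmath=True)
def correlate_eager(mat, filt):
    out = np.zeros((mat.shape[0] - filt.shape[0] + 1,
                    mat.shape[1] - filt.shape[1] + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            acc = 0.0
            for i in range(filt.shape[0]):
                for j in range(filt.shape[1]):
                    acc += mat[x + i, y + j] * filt[i, j]
            out[x, y] = acc
    return out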
I'm trying to write an image-processing package that uses some numpy operations. I've observed that the operations inside the nested loop are costly and want to speed them up.
The input is a 512-by-1024 image, which is preprocessed into an edge set: a list of (Ni, 2) ndarrays, one per contour i. The nested for loop below then takes the edge set and does some math on it.
### preprocessing: img ===> contour set
img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
high_thresh, _ = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
lowThresh = 0.5*high_thresh
b = cv2.Canny(img, lowThresh, high_thresh)
edgeset, _ = cv2.findContours(b, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)

imgH = img.shape[0]  ## 512
imgW = img.shape[1]  ## 1024
num_edges = len(edgeset)  ## ~900
min_length_segment_vp = imgH/6  ## ~100
### nested for loop
for i in range(num_edges):
    if(edgeset[i].shape[0] > min_length_segment_vp):
        # points: (N, 1, 2) ==> uv: (N, 2)
        uv = edgeset[i].reshape(edgeset[i].shape[0], edgeset[i].shape[2])
        uv = np.unique(uv, axis=0)
        theta = -(uv[:, 1]-imgH/2)*np.pi/imgH
        phi = (uv[:, 0]-imgW/2)*2*np.pi/imgW
        xyz = np.zeros((uv.shape[0], 3))
        xyz[:, 0] = np.sin(phi) * np.cos(theta)
        xyz[:, 1] = np.cos(theta) * np.cos(phi)
        xyz[:, 2] = np.sin(theta)
        ## xyz: (N, 3)
        N = xyz.shape[0]
        for _ in range(10):
            if(xyz.shape[0] > N * 0.1):
                bestInliers = np.array([])
                bestOutliers = np.array([])
                ####
                #### watch this out!
                ####
                for _ in range(1000):
                    id0 = random.randint(0, xyz.shape[0]-1)
                    id1 = random.randint(0, xyz.shape[0]-1)
                    if(id0 == id1):
                        continue
                    n = np.cross(xyz[id0, :], xyz[id1, :])
                    n = n / np.linalg.norm(n)
                    cosTetha = n @ xyz.T
                    inliers = np.abs(cosTetha) < threshold
                    outliers = np.where(np.invert(inliers))[0]
                    inliers = np.where(inliers)[0]
                    if inliers.shape[0] > bestInliers.shape[0]:
                        bestInliers = inliers
                        bestOutliers = outliers
What I have tried:
I changed np.cross and np.linalg.norm into my own custom cross and norm functions that only work on shape (3,) ndarrays. This took the runtime from ~0.9 s down to ~0.3 s on my i5-4460 CPU.
I profiled my code and found that the code inside the innermost loop still accounts for about 2/3 of the time.
What I think I can try next:
Compile the code with Cython and add some cdef annotations.
Translate the whole file into C++.
Use a faster library for the calculations, such as numexpr.
Vectorize the loop (but I don't know how).
Can I make it even faster? Please give me some suggestions! Thanks!
The question is quite broad, so I'll only give a few non-obvious tips based on my own experience.
If you use Cython, you might want to change the for loops into while loops; I've managed to get quite big (x5) speed-ups just from this, although it may not help in all cases (see the sketch after this list);
Sometimes code that would be considered inefficient in regular Python, such as a nested while (or for) loop that applies a function to an array one element at a time, can be optimized by Cython to be faster than the equivalent vectorized Numpy approach;
Find out which Numpy functions cost the most time, and rewrite them yourself in a way that Cython can most easily optimise (see the previous point).
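To illustrate the first and third points, here is a minimal sketch of my own (not from the question) written in Cython's pure-Python mode, so it runs unchanged under CPython and compiles to a typed C loop under Cython; the function name and the idea of counting inliers for one candidate plane are purely illustrative:

import cython
import numpy as np

def count_inliers(cos_theta: np.ndarray, threshold: float) -> int:
    # Typed index and counter let Cython emit a plain C while loop;
    # under regular CPython the annotations are simply ignored.
    i: cython.Py_ssize_t = 0
    n: cython.Py_ssize_t = cos_theta.shape[0]
    count: cython.int = 0
    while i < n:
        if -threshold < cos_theta[i] < threshold:
            count += 1
        i += 1
    return count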
I'm currently working on a project aimed at finding blurred regions using the Walsh-Hadamard transform. The basic idea is to extract a local patch around each pixel and apply the Walsh-Hadamard transform to that patch. To do the transform, I generate the Hadamard matrix H beforehand and compute H × local_patch × H_transpose. This operation costs 5 ms per pixel, which is time consuming. I'm wondering whether there is some technique to speed up the matrix multiplication in numpy, or some fast Walsh-Hadamard transform technique that could replace H × T × H'. Any help would be appreciated.
for i in range(h):
    for j in range(w):
        local_patch_gray = gray_pad[i:i+patch_size, j:j+patch_size]
        local_patch_gray = local_patch_gray[1:, 1:]  # extract 2^n×2^n part
        local_patch_blur = blur_pad[i:i + patch_size, j:j + patch_size]
        local_patch_blur = local_patch_blur[1:, 1:]
        patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
        blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
        num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
        denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
        if denomi == 0:
            blur_map[i, j] = 0
            continue
        blur_map[i, j] = num / denomi
It sounds like this is a job for Numba, check out their 5-minute starting guide.
In short, Numba compiles the first call of a function into a fast-callable format, so that every subsequent call of the same function is at light speed. Numba also has options which can make function calls at ludicrous speed. The options that will pertain to your example are likely fastmath and parallel.
As a starting point, here's what your new numba function might look like:
from numba import njit, prange
import numpy as np

@njit(fastmath=True)
def lightning_fast_numba_function(i, j, gray_pad, blur_pad, H, p, patch_size):
    local_patch_gray = gray_pad[i:i+patch_size, j:j+patch_size]
    local_patch_gray = np.ascontiguousarray(local_patch_gray[1:, 1:])  # extract 2^n×2^n part (copy so np.dot gets a contiguous array)
    local_patch_blur = blur_pad[i:i + patch_size, j:j + patch_size]
    local_patch_blur = np.ascontiguousarray(local_patch_blur[1:, 1:])
    patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
    blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
    num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
    denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
    if denomi == 0:
        return 0.0
    return num / denomi

@njit(parallel=True)
def build_blur_map(blur_map, gray_pad, blur_pad, H, p, patch_size, h, w):
    for i in prange(h):
        for j in range(w):
            blur_map[i, j] = lightning_fast_numba_function(i, j, gray_pad, blur_pad, H, p, patch_size)
Other options you may consider include using np.nditer instead of range, but don't hesitate to cross-check options against Numpy's iteration docs.
Lastly, I noticed that the Wikipedia article for your algorithm has a section on the fast transform, with Python code. You might find it useful.
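For reference, here is a minimal sketch of a fast Walsh-Hadamard transform (the standard butterfly scheme, not taken from your code). Applying it along rows and then columns replaces the two matrix multiplications with H, assuming the patch side is a power of two and H is the unnormalized Sylvester-ordered Hadamard matrix; ordering conventions may differ, but for your |WHT|^p sums the ordering does not matter:

import numpy as np

def fwht_last_axis(a):
    # Butterfly FWHT along the last axis; the length must be a power of two.
    a = np.array(a, dtype=np.float64)
    n = a.shape[-1]
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            x = a[..., start:start + h].copy()
            y = a[..., start + h:start + 2 * h].copy()
            a[..., start:start + h] = x + y
            a[..., start + h:start + 2 * h] = x - y
        h *= 2
    return a

def wht2(patch):
    # Equivalent (up to ordering/scaling conventions) to H @ patch @ H,
    # but in O(n^2 log n) instead of O(n^3).
    return fwht_last_axis(fwht_last_axis(patch).T).T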
I want to run through a large tif stack of 1500+ frames and extract the coordinates of the local maxima for each frame. The code below does the job, but is extremely slow for large files. When running on smaller bits (e.g. 20 frames) each frame is done almost instantly; when running on the whole dataset, each frame takes seconds.
Any suggestions for a faster approach? I figure it is due to the loading of the large tiff file, but that should only be necessary once, initially?
I have the following code:
import numpy as np
import pandas as pd
from pims import ImageSequence
from skimage.feature import peak_local_max

def cmask(index, array):
    radius = 3
    a, b = index
    nx, ny = array.shape
    y, x = np.ogrid[-a:nx-a, -b:ny-b]
    mask = x*x + y*y <= radius*radius
    return(sum(array[mask]))  # sum of pixel values inside the circular mask

images = ImageSequence('tryhard_red_small.tif')

frame_list = []
x = []
y = []
int_liposome = []
BG_liposome = []

for i in range(len(images[0])):
    tmp_frame = images[0][i]
    xy = pd.DataFrame(peak_local_max(tmp_frame, min_distance=8, threshold_abs=3000))
    x.extend(xy[0].tolist())
    y.extend(xy[1].tolist())
    for j in range(len(xy)):
        index = x[j], y[j]
        int_liposome.append(cmask(index, tmp_frame))
    frame_list.extend([i]*len(xy))
    print "Frame: ", i, "of ", len(images[0])

features = pd.DataFrame(
    {'lip_int': int_liposome,
     'y': y,
     'x': x,
     'frame': frame_list})
Have you tried profiling the code, say with %prun or %lprun in ipython? That'll tell you exactly where your slowdowns are occurring.
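For example, in an IPython session (assuming the line_profiler extension is installed, and supposing you wrap the loop in a function, here called process_stack purely for illustration):

%load_ext line_profiler
%lprun -f cmask -f process_stack process_stack()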
I can't make my own version of this without the tif stack, but I suspect the problem is that you're using lists to store everything. Every time you do an append or an extend, Python may have to allocate more memory. You could try getting the total count of maxima first, then allocating your output arrays, then rerunning to fill the arrays. Something like below:
# run through once to get the count of local maxima
npeaks = (len(peak_local_max(f, min_distance=8, threshold_abs=3000))
          for f in images[0])
total_peaks = sum(npeaks)

# allocate storage arrays and rerun
x = np.zeros(total_peaks, np.float)
y = np.zeros_like(x)
int_liposome = np.zeros_like(x)
BG_liposome = np.zeros_like(x)
frame_list = np.zeros(total_peaks, np.int)

index_0 = 0
for frame_ind, tmp_frame in enumerate(images[0]):
    peaks = pd.DataFrame(peak_local_max(tmp_frame, min_distance=8, threshold_abs=3000))
    index_1 = index_0 + len(peaks)
    # copy the data from the DataFrame's underlying numpy array
    x[index_0:index_1] = peaks[0].values
    y[index_0:index_1] = peaks[1].values
    for i, peak in enumerate(peaks.values, index_0):
        int_liposome[i] = cmask(peak, tmp_frame)
    frame_list[index_0:index_1] = frame_ind
    # update the starting index
    index_0 = index_1
    print "Frame: ", frame_ind, "of ", len(images[0])
Does anyone know a fast algorithm to detect main colors in an image?
I'm currently using k-means to find the colors together with Python's PIL but it's very slow. One 200x200 image takes 10 seconds to process. I've several hundred thousand images.
One fast method would be to simply divide up the color space into bins and then construct a histogram. It's fast because you only need a small number of decisions per pixel, and you only need one pass over the image (and one pass over the histogram to find the maxima).
Update: here's a rough diagram to help explain what I mean.
On the x-axis is the color divided into discrete bins. The y-axis shows the value of each bin, which is the number of pixels matching the color range of that bin. There are two main colors in this image, shown by the two peaks.
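Here is a minimal sketch of that binning idea using NumPy and PIL; the choice of 8 bins per channel and the helper name are mine, not something prescribed by the approach:

import numpy as np
from PIL import Image

def dominant_color(path, bins_per_channel=8):
    # Quantise each channel into coarse bins and count pixels per (r, g, b) bin.
    img = np.asarray(Image.open(path).convert("RGB"))
    scale = 256 // bins_per_channel
    q = (img // scale).astype(np.int64)                  # bin indices 0..bins-1
    flat = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    counts = np.bincount(flat.ravel(), minlength=bins_per_channel ** 3)
    best = counts.argmax()
    # Decode the winning bin back to an approximate RGB value (bin centre).
    r, rem = divmod(best, bins_per_channel ** 2)
    g, b = divmod(rem, bins_per_channel)
    return (r * scale + scale // 2, g * scale + scale // 2, b * scale + scale // 2)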
With a bit of tinkering, this code (which I suspect you might have already seen!) can be sped up to just under a second
If you increase the kmeans(min_diff=...) value to about 10, it produces very similar results, but runs in 900ms (compared to about 5000-6000ms with min_diff=1)
Further decreasing the size of the thumbnails to 100x100 doesn't seem to impact the results much either, and takes the runtime to about 250ms
Here's a slightly tweaked version of the code, which just parameterises the min_diff value, and includes some terrible code to generate an HTML file with the results/timing
from collections import namedtuple
from math import sqrt
import random
try:
    import Image
except ImportError:
    from PIL import Image
Point = namedtuple('Point', ('coords', 'n', 'ct'))
Cluster = namedtuple('Cluster', ('points', 'center', 'n'))
def get_points(img):
    points = []
    w, h = img.size
    for count, color in img.getcolors(w * h):
        points.append(Point(color, 3, count))
    return points
rtoh = lambda rgb: '#%s' % ''.join(('%02x' % p for p in rgb))
def colorz(filename, n=3, mindiff=1):
    img = Image.open(filename)
    img.thumbnail((200, 200))
    w, h = img.size

    points = get_points(img)
    clusters = kmeans(points, n, mindiff)
    rgbs = [map(int, c.center.coords) for c in clusters]
    return map(rtoh, rgbs)
def euclidean(p1, p2):
    return sqrt(sum([
        (p1.coords[i] - p2.coords[i]) ** 2 for i in range(p1.n)
    ]))
def calculate_center(points, n):
    vals = [0.0 for i in range(n)]
    plen = 0
    for p in points:
        plen += p.ct
        for i in range(n):
            vals[i] += (p.coords[i] * p.ct)
    return Point([(v / plen) for v in vals], n, 1)
def kmeans(points, k, min_diff):
    clusters = [Cluster([p], p, p.n) for p in random.sample(points, k)]

    while 1:
        plists = [[] for i in range(k)]

        for p in points:
            smallest_distance = float('Inf')
            for i in range(k):
                distance = euclidean(p, clusters[i].center)
                if distance < smallest_distance:
                    smallest_distance = distance
                    idx = i
            plists[idx].append(p)

        diff = 0
        for i in range(k):
            old = clusters[i]
            center = calculate_center(plists[i], old.n)
            new = Cluster(plists[i], center, old.n)
            clusters[i] = new
            diff = max(diff, euclidean(old.center, new.center))

        if diff < min_diff:
            break

    return clusters
if __name__ == '__main__':
    import sys
    import time
    for x in range(1, 11):
        sys.stderr.write("mindiff %s\n" % (x))
        start = time.time()
        fname = "akira_940x700.png"
        col = colorz(fname, 3, x)
        print "<h1>%s</h1>" % x
        print "<img src='%s'>" % (fname)
        print "<br>"
        for a in col:
            print "<div style='background-color: %s; width:20px; height:20px'> </div>" % (a)
        print "<br>Took %.02fms<br>" % ((time.time()-start)*1000)
K-means is a good choice for this task because you know the number of main colors beforehand, but you need to optimize it. I think you can reduce your image size: just scale it down to 100x100 pixels or so, and find the size at which your algorithm still works with acceptable speed. Another option is to use dimensionality reduction before k-means clustering.
Also try to find a fast k-means implementation; writing such things in pure Python is a misuse of Python, which is not meant to be used like this. A sketch of one option follows below.
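For example, a minimal sketch combining both suggestions, using scikit-learn's MiniBatchKMeans as the fast implementation (scikit-learn and the helper name are my own choices, not part of the answer):

import numpy as np
from PIL import Image
from sklearn.cluster import MiniBatchKMeans

def main_colors(path, k=3):
    # Downscale first, then cluster the remaining pixels with a compiled k-means.
    img = Image.open(path).convert("RGB")
    img.thumbnail((100, 100))
    pixels = np.asarray(img, dtype=np.float64).reshape(-1, 3)
    centers = MiniBatchKMeans(n_clusters=k, n_init=3).fit(pixels).cluster_centers_
    return [tuple(int(c) for c in center) for center in centers]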