I'm trying to write a package for image processing with some NumPy operations. I've observed that the operations inside the nested loop are costly, and I want to speed them up.
The input is a 512 by 1024 image, which is preprocessed into an
edge set: a list of (Ni, 2) ndarrays, one for each edge i.
Next, the nested for loop goes over the edge set and does some math.
import random
import cv2
import numpy as np
### preprocessing: img ===> contour set
img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
high_thresh, _ = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
lowThresh = 0.5 * high_thresh
b = cv2.Canny(img, lowThresh, high_thresh)
# OpenCV 4.x returns (contours, hierarchy); OpenCV 3.x returns three values
edgeset, _ = cv2.findContours(b, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
imgH = img.shape[0]  ## 512
imgW = img.shape[1]  ## 1024
num_edges = len(edgeset)  ## ~900
min_length_segment_vp = imgH / 6  ## ~100
### nested for loop
for i in range(num_edges):
    if edgeset[i].shape[0] > min_length_segment_vp:
        # points: (N, 1, 2) ==> uv: (N, 2)
        uv = edgeset[i].reshape(edgeset[i].shape[0], edgeset[i].shape[2])
        uv = np.unique(uv, axis=0)
        theta = -(uv[:, 1] - imgH/2) * np.pi / imgH
        phi = (uv[:, 0] - imgW/2) * 2 * np.pi / imgW
        xyz = np.zeros((uv.shape[0], 3))
        xyz[:, 0] = np.sin(phi) * np.cos(theta)
        xyz[:, 1] = np.cos(theta) * np.cos(phi)
        xyz[:, 2] = np.sin(theta)
        ## xyz: (N, 3)
        N = xyz.shape[0]
        for _ in range(10):
            if xyz.shape[0] > N * 0.1:
                bestInliers = np.array([])
                bestOutliers = np.array([])
                ####
                #### watch this out!
                ####
                for _ in range(1000):
                    id0 = random.randint(0, xyz.shape[0] - 1)
                    id1 = random.randint(0, xyz.shape[0] - 1)
                    if id0 == id1:
                        continue
                    n = np.cross(xyz[id0, :], xyz[id1, :])
                    n = n / np.linalg.norm(n)
                    cosTetha = n @ xyz.T
                    inliers = np.abs(cosTetha) < threshold
                    outliers = np.where(np.invert(inliers))[0]
                    inliers = np.where(inliers)[0]
                    if inliers.shape[0] > bestInliers.shape[0]:
                        bestInliers = inliers
                        bestOutliers = outliers
What I have tried:
I replaced np.cross and np.linalg.norm with my own cross and norm functions, which only work for shape-(3,) ndarrays. This brought the run time down from ~0.9 s to ~0.3 s on my i5-4460 CPU.
I profiled my code and found that the code inside the innermost loop still takes 2/3 of the time.
What I think I can try next:
Compile the code with Cython and add some cdef declarations.
Translate the whole file into C++.
Use a faster library for the calculations, like numexpr.
Vectorize the loop (but I don't know how).
Can I make it faster? Please give me some suggestions! Thanks!
The question is quite broad so I'll only give a few non-obvious tips based on my own experience.
If you use Cython, you might want to change the for loops into while loops. I've managed to get quite big (x5) speed-ups just from this, although it may not help for all possible cases;
Sometimes code that would be considered inefficient in regular Python, such as a nested while (or for) loop to apply a function to an array one element at a time, can be optimized by Cython to be faster than the equivalent vectorized Numpy approach;
Find out which Numpy functions cost the most time, and write your own in a way that Cython can most easily optimise them (see above point).
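The question's last idea, vectorizing the loop, does apply to the innermost 1000-iteration loop: every iteration is independent, so all index pairs can be drawn at once, all plane normals computed with a single np.cross call, and every candidate scored against every point with one matrix product. A minimal sketch, assuming the same (N, 3) xyz array and threshold as in the question; the function name best_plane_vectorized is hypothetical, and it uses NumPy's Generator instead of the random module, so the random stream differs from the original loop.

```python
import numpy as np


def best_plane_vectorized(xyz, threshold, n_iter=1000, rng=None):
    """Vectorized replacement for the innermost loop (sketch).

    xyz: (N, 3) array of unit vectors; threshold: max |cos| for an inlier.
    Returns (best_inliers, best_outliers) as index arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_pts = xyz.shape[0]
    empty = np.array([], dtype=np.intp)
    # draw all candidate pairs at once (integers() excludes the upper bound,
    # so this matches random.randint(0, n_pts - 1))
    id0 = rng.integers(0, n_pts, size=n_iter)
    id1 = rng.integers(0, n_pts, size=n_iter)
    keep = id0 != id1
    # one plane normal per candidate pair: (M, 3)
    normals = np.cross(xyz[id0[keep]], xyz[id1[keep]])
    norms = np.linalg.norm(normals, axis=1)
    # drop degenerate pairs (identical or antipodal points give a zero normal)
    ok = norms > 0
    if not ok.any():
        return empty, empty
    normals = normals[ok] / norms[ok, None]
    # |cos| between every candidate normal and every point: (M, N)
    inlier_mask = np.abs(normals @ xyz.T) < threshold
    counts = inlier_mask.sum(axis=1)
    # argmax returns the first maximum, like the strict '>' in the loop
    best = int(np.argmax(counts))
    if counts[best] == 0:  # the loop would never have updated bestInliers
        return empty, empty
    return np.flatnonzero(inlier_mask[best]), np.flatnonzero(~inlier_mask[best])
```

The (M, N) boolean matrix trades memory for speed: with M ≈ 1000 candidates and a few thousand points per edge it is only a few megabytes, but for very long edges you may want to score the candidates in chunks.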
I've been looking for hours for a similar question, but nothing has satisfied me.
My problem is: I have a PIL image (representing a canal) already converted into a NumPy array (using PIL's "L" mode), and I'd like to retrieve the white pixels whose neighbours are black (their indices, in fact), without using for loops (the image is really huge).
I thought of np.where, but I don't know how to use it to solve my problem, and I also don't know if it would be faster than using for loops (my aim is to reach this goal with the fastest solution).
I hope I'm clear enough, and I thank you in advance for your response!
EDIT: for example, with this image (a simple canal; it is already a black-and-white image, so image.convert('L') isn't really needed here, but the code should be generic if possible), I'd do something like this:
import numpy as np
from PIL import Image
image = Image.open(canal)
image = image.convert("L")
array = np.asarray(image)
l = []
for i in range(1, len(array) - 1):
    for j in range(1, len(array[0]) - 1):
        if array[i][j] == 255 and (array[i+1][j] == 0 or array[i-1][j] == 0 or array[i][j+1] == 0 or array[i][j-1] == 0):
            l.append((i, j))
and I'd hope to obtain l as fast as possible :)
I've colored the pixels I need in red in the next image.
EDIT2: thank you all for the help, it worked!
You could use the Numba just-in-time compiler to speed up your loop.
from numba import njit
@njit
def find_highlow_pixels(img):
    pixels = []
    for j in range(1, img.shape[0]-1):
        for i in range(1, img.shape[1]-1):
            if (
                img[j, i] == 255 and (
                    img[j-1, i] == 0 or img[j+1, i] == 0 or
                    img[j, i-1] == 0 or img[j, i+1] == 0
                )
            ):
                pixels.append((j, i))
    return pixels
Another possibility that came to my mind is a minimum filter. I would expect it to be slower than the Numba solution above, but it could be a useful building block for more processing on top of it.
import numpy as np
from scipy.ndimage import minimum_filter
# create a footprint that only takes the neighbours into account
neighbours = (np.arange(9) % 2 == 1).reshape(3,3)
# create a mask of relevant pixels, img should be your image as array
mask = np.logical_and(
    img == 255,
    minimum_filter(img, footprint=neighbours) == 0
)
# get indexes
indexes = np.where(mask)
# as list
list(zip(*indexes))
If memory is not a concern, I prefer manipulating masks like the following.
import numpy as np
# Step 1: Generate two masks of white and black.
mask_white = img == 255
mask_black = img == 0
# Step 2: Apply 8-neighborhood dilation on the black mask.
# Note: the question's loop checks only the 4-neighbourhood (up, down, left, right);
# 8-neighbourhood dilation also accepts diagonal black neighbours.
# If you want to use NumPy only, you need to implement dilation yourself.
# define function of 8-neighborhood dilation
def dilate_8nb(m):
    index_row, index_col = np.where(m)
    ext_index_row = np.repeat(index_row, 9)
    ext_index_col = np.repeat(index_col, 9)
    # offset the 9 copies of each index so they cover the 3x3 neighbourhood
    ext_index_row.reshape(-1, 9)[:, :3] += 1
    ext_index_row.reshape(-1, 9)[:, -3:] -= 1
    ext_index_col.reshape(-1, 9)[:, ::3] += 1
    ext_index_col.reshape(-1, 9)[:, 2::3] -= 1
    # keep the shifted indices inside the image
    ext_index_row = np.clip(ext_index_row, 0, m.shape[0] - 1)
    ext_index_col = np.clip(ext_index_col, 0, m.shape[1] - 1)
    ret = m.copy()
    ret[ext_index_row, ext_index_col] = True
    return ret
ext_mask_black = dilate_8nb(mask_black)
# or just use dilation in SciPy
# from scipy import ndimage
# ext_mask_black = ndimage.binary_dilation(mask_black, structure=ndimage.generate_binary_structure(2, 2))
# (generate_binary_structure(2, 1) gives the 4-neighbourhood used in the question)
# Step 3: take the intersection of mask_white and ext_mask_black
mask_target = mask_white & ext_mask_black
# Step 4: take the indices using np.where
l = np.where(mask_target)
# convert to a list of (row, col) tuples to match the format of your result
l = list(zip(*l))
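The question's exact condition (an interior white pixel with a black pixel directly above, below, left or right) can also be reproduced with plain NumPy by comparing shifted slices of the array, with no loops, compiler or SciPy dependency. A minimal sketch; the function name find_edge_pixels is hypothetical, and like the question's loop it skips the outermost rows and columns.

```python
import numpy as np


def find_edge_pixels(array):
    """(i, j) of interior pixels equal to 255 with at least one 4-neighbour equal to 0."""
    array = np.asarray(array)
    center = array[1:-1, 1:-1]
    # each slice is the interior shifted by one pixel in one direction
    has_black_neighbour = (
        (array[2:, 1:-1] == 0)      # neighbour below (i+1, j)
        | (array[:-2, 1:-1] == 0)   # neighbour above (i-1, j)
        | (array[1:-1, 2:] == 0)    # neighbour right (i, j+1)
        | (array[1:-1, :-2] == 0)   # neighbour left  (i, j-1)
    )
    mask = (center == 255) & has_black_neighbour
    rows, cols = np.nonzero(mask)
    # add 1 back because the slices dropped the first row and column
    return list(zip((rows + 1).tolist(), (cols + 1).tolist()))
```

np.nonzero returns indices in row-major order, so the list comes out in the same order as the loop's l. For a really huge image it is cheaper to keep rows + 1 and cols + 1 as arrays than to build a list of tuples.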
I'm currently working on a project aimed at finding blurred regions using the Walsh-Hadamard transform. The basic idea is to extract a local patch around each pixel and apply the Walsh-Hadamard transform to that patch. To do the transform, I generate the Hadamard matrix H in advance and compute H×T(local_patch)×H_transpose. This costs 5 ms per pixel, which is time-consuming. Is there a technique to speed up the matrix multiplication in NumPy, or a fast Walsh-Hadamard transform I could use to replace H×T×H'? Any help would be appreciated.
for i in range(h):
    for j in range(w):
        local_patch_gray = gray_pad[i:i+patch_size, j:j+patch_size]
        local_patch_gray = local_patch_gray[1:, 1:] # extract 2^n×2^n part
        local_patch_blur = blur_pad[i:i + patch_size, j:j + patch_size]
        local_patch_blur = local_patch_blur[1:, 1:]
        patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
        blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
        num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
        denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
        if denomi == 0:
            blur_map[i, j] = 0
            continue
        blur_map[i, j] = num / denomi
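Because every pixel's patch is processed independently, the double loop can also be batched in plain NumPy: sliding_window_view exposes all patches as a strided view, and matmul broadcasts H @ patch @ H over the leading axes. A sketch under these assumptions: gray_pad and blur_pad are padded so that every gray_pad[i:i+patch_size, j:j+patch_size] in the loop is a full window, NumPy is at least 1.20 (for sliding_window_view), and the function name blur_map_batched and its rows_per_chunk parameter are hypothetical. Rows are processed in chunks because materializing every patch at once needs h·w·(patch_size−1)² floats per temporary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def blur_map_batched(gray_pad, blur_pad, H, p, patch_size, h, w, rows_per_chunk=32):
    """Batched version of the per-pixel loop (sketch)."""
    H = np.asarray(H, dtype=np.float64)
    win = (patch_size, patch_size)
    # read-only views of every window: (rows, cols, patch_size, patch_size)
    gray_all = sliding_window_view(gray_pad, win)
    blur_all = sliding_window_view(blur_pad, win)
    blur_map = np.zeros((h, w))
    # handle a block of rows at a time to bound the memory of the temporaries
    for i0 in range(0, h, rows_per_chunk):
        i1 = min(i0 + rows_per_chunk, h)
        # drop the first row/column of each window, like [1:, 1:] in the loop
        g = gray_all[i0:i1, :w, 1:, 1:].astype(np.float64)
        b = blur_all[i0:i1, :w, 1:, 1:].astype(np.float64)
        # H @ patch @ H for all windows at once (matmul broadcasts)
        patch_wht = H @ g @ H
        blur_wht = H @ b @ H
        num = np.sum(np.abs(blur_wht) ** p, axis=(-2, -1)) ** (1 / p)
        den = np.sum(np.abs(patch_wht) ** p, axis=(-2, -1)) ** (1 / p)
        # num / den, leaving 0 where den == 0 (same rule as the loop)
        np.divide(num, den, out=blur_map[i0:i1], where=den != 0)
    return blur_map
```

This still does the O(n³) dense products per patch; it only removes the Python-level overhead of calling np.dot h·w times.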
It sounds like this is a job for Numba, check out their 5-minute starting guide.
In short, Numba compiles a function to machine code on its first call, so every subsequent call with the same argument types skips the Python interpreter overhead and runs much faster. Numba also has options that can speed things up further; the ones likely to pertain to your example are fastmath and parallel.
As a starting point, here's what your new numba function might look like:
import numpy as np
from numba import njit, prange
@njit(fastmath=True, parallel=True)
def lightning_fast_numba_function(gray_pad, blur_pad, H, p, patch_size, h, w):
    # gray_pad, blur_pad and H should be float64 arrays (np.dot in Numba also needs SciPy installed)
    blur_map = np.zeros((h, w))
    for i in prange(h):  # rows are distributed over threads
        for j in range(w):
            # extract the 2^n×2^n part and make it contiguous for np.dot
            local_patch_gray = np.ascontiguousarray(gray_pad[i:i+patch_size, j:j+patch_size][1:, 1:])
            local_patch_blur = np.ascontiguousarray(blur_pad[i:i+patch_size, j:j+patch_size][1:, 1:])
            patch_WHT = np.dot(np.dot(H, local_patch_gray), H)
            blur_WHT = np.dot(np.dot(H, local_patch_blur), H)
            num = np.power(np.sum(np.power(np.abs(blur_WHT), p)), 1/p)
            denomi = np.power(np.sum(np.power(np.abs(patch_WHT), p)), 1/p)
            if denomi == 0:
                blur_map[i, j] = 0
                continue
            blur_map[i, j] = num / denomi
    return blur_map
blur_map = lightning_fast_numba_function(gray_pad, blur_pad, H, p, patch_size, h, w)
Another option you may consider is np.nditer instead of range, but don't hesitate to cross-check the options in NumPy's iteration docs.
Lastly, I noticed that the Wikipedia article for your algorithm has a section on the fast version, with Python code. You might find it useful.
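For reference, the fast Walsh-Hadamard transform replaces each multiplication by H with log2(n) butterfly passes, O(n log n) instead of O(n²) per vector, and it vectorizes well in NumPy when applied along one axis of an array of patches. A sketch, assuming H is the Sylvester-ordered Hadamard matrix (the construction scipy.linalg.hadamard uses) and a power-of-two patch side; the function name fwht is hypothetical. Since that H is symmetric, H @ P @ H equals applying the transform along both axes of P.

```python
import numpy as np


def fwht(a, axis=-1):
    """Unnormalized fast Walsh-Hadamard transform along `axis`.

    Equivalent to multiplying by the Sylvester-ordered Hadamard matrix;
    the length along `axis` must be a power of two.
    """
    a = np.moveaxis(np.array(a, dtype=np.float64), axis, -1)
    n = a.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("transform length must be a power of two")
    lead = a.shape[:-1]
    h = 1
    while h < n:
        # split the last axis into blocks of 2*h: (..., blocks, 2, h)
        blocks = a.reshape(*lead, n // (2 * h), 2, h)
        x, y = blocks[..., 0, :], blocks[..., 1, :]
        # butterfly (x, y) -> (x + y, x - y), then flatten back to length n
        a = np.stack((x + y, x - y), axis=-2).reshape(*lead, n)
        h *= 2
    return np.moveaxis(a, -1, axis)


# H @ P @ H for a single patch, or for a (..., n, n) stack of patches:
# patch_WHT = fwht(fwht(local_patch_gray, axis=-2), axis=-1)
```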
I have a function that takes a [32, 32, 3] tensor, and outputs a [256,256,3] tensor.
Specifically, the function interprets the smaller array as if it was a .svg file, and 'renders' it to a 256x256 array as a canvas using this algorithm
For an explanation of WHY I would want to do this, see this question.
The function behaves exactly as intended, until I try to include it in the training loop of a GAN. The current error I'm seeing is:
NotImplementedError: Cannot convert a symbolic Tensor (mul:0) to a numpy array.
A lot of other answers to similar errors seem to boil down to "You need to re-write the function using tensorflow, not numpy"
Here's the working code using numpy - is it possible to re-write it to exclusively use tensorflow functions?
def convert_to_bitmap(input_tensor, target, j):
    #implied conversion to nparray - the tensorflow docs seem to indicate this is okay, but the error is thrown here when training
    array = input_tensor
    outputArray = target
    output = target
    for i in range(32):
        col = float(array[i,0,j])
        if ((float(array[i,0,0]))+(float(array[i,0,1]))+(float(array[i,0,2]))/3)< 0:
            continue
        #slice only the red channel from the i line, multiply by 255
        red_array = array[i,:,0]*255
        #slice only the green channel, multiply by 255
        green_array = array[i,:,1]*255
        #combine and flatten them
        combined_array = np.dstack((red_array, green_array)).flatten()
        #remove the first two and last two indices of the combined array
        index = [0,1,62,63]
        clipped_array = np.delete(combined_array,index)
        #filter array to remove values less than 0
        filtered = clipped_array > 0
        filtered_array = clipped_array[filtered]
        #check array has an even number of values, delete the last index if it doesn't
        if len(filtered_array) % 2 == 0:
            pass
        else:
            filtered_array = np.delete(filtered_array,-1)
        #convert into a set of tuples
        l = filtered_array.tolist()
        t = list(zip(l, l[1:] + l[:1]))
        if not t:
            continue
        output = fill_polygon(t, outputArray, col)
    return(output)
The 'fill polygon' function is copied from the 'mahotas' library:
def fill_polygon(polygon, canvas, color):
    if not len(polygon):
        return
    min_y = min(int(y) for y,x in polygon)
    max_y = max(int(y) for y,x in polygon)
    polygon = [(float(y),float(x)) for y,x in polygon]
    if max_y < canvas.shape[0]:
        max_y += 1
    for y in range(min_y, max_y):
        nodes = []
        j = -1
        for i,p in enumerate(polygon):
            pj = polygon[j]
            if p[0] < y and pj[0] >= y or pj[0] < y and p[0] >= y:
                dy = pj[0] - p[0]
                if dy:
                    nodes.append( (p[1] + (y-p[0])/(pj[0]-p[0])*(pj[1]-p[1])) )
                elif p[0] == y:
                    nodes.append(p[1])
            j = i
        nodes.sort()
        for n,nn in zip(nodes[::2],nodes[1::2]):
            nn += 1
            canvas[y, int(n):int(nn)] = color
    return(canvas)
NOTE: I'm not trying to get someone to convert the whole thing for me! There are some functions that are pretty obvious (tf.stack instead of np.dstack), but others that I don't even know how to start, like the last few lines of the fill_polygon function above.
Yes, you can actually do this: you can wrap a Python function with tf.py_function. It's a Python wrapper, but it's extremely slow compared to plain TensorFlow. TensorFlow and CUDA, for example, are so fast because they rely on vectorization, which means you can rewrite many (really many) loops as mathematical tensor operations, which are very fast.
In general:
If you want to use custom code as a custom layer, I would recommend rethinking the algebra behind those loops and trying to express them differently. If it's just preprocessing before training starts, you can use TensorFlow, but doing the same with NumPy and other libraries is easier.
To your main question: yes, it's possible, but better not to use loops. TensorFlow has a built-in loop construct, but then you have to use tf.while_loop, and that's annoying (maybe just for me). I only skimmed your code, but it looks like you should be able to vectorize it quite well using standard TensorFlow operations. If you want it fast, I mean really fast with GPU support, write everything in TensorFlow, not a 50/50 mix with tf.convert_to_tensor(), because then it will be slow again: you keep switching between GPU and CPU, and between the plain Python interpreter and TensorFlow's low-level API. I hope I could help you at least a bit.
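To make the vectorization point concrete, here is a sketch of the even-odd (crossing-number) rule that fill_polygon implements with Python loops, written as whole-array NumPy operations: every pixel is tested against every polygon edge at once through broadcasting. It is an illustration under assumptions, not a drop-in replacement: the function name fill_polygon_vectorized is hypothetical, pixels are tested at integer (y, x) coordinates, so boundary pixels can differ from mahotas's span filling, and the TensorFlow port is not shown, although each operation used here (comparisons, broadcasting, count_nonzero, boolean masking) has a TensorFlow counterpart.

```python
import numpy as np


def fill_polygon_vectorized(polygon, canvas, color):
    """Even-odd fill of `polygon` (list of (y, x) vertices) into `canvas`, no per-pixel loop."""
    if len(polygon) == 0:
        return canvas
    poly = np.asarray(polygon, dtype=np.float64)   # (V, 2) as (y, x)
    y0, x0 = poly[:, 0], poly[:, 1]                # edge start points
    y1, x1 = np.roll(y0, -1), np.roll(x0, -1)      # edge end points (last edge wraps around)
    ys = np.arange(canvas.shape[0], dtype=np.float64)[:, None, None]   # (rows, 1, 1)
    xs = np.arange(canvas.shape[1], dtype=np.float64)[None, :, None]   # (1, cols, 1)
    # does the horizontal line at row y cross edge k?  shape (rows, 1, V)
    crosses = (y0 <= ys) != (y1 <= ys)
    with np.errstate(divide="ignore", invalid="ignore"):
        # x where edge k meets row y (only used where `crosses` is True)
        x_int = x0 + (ys - y0) * (x1 - x0) / (y1 - y0)
        # count edges crossed to the right of each pixel: (rows, cols)
        n_right = np.count_nonzero(crosses & (xs < x_int), axis=-1)
    inside = n_right % 2 == 1   # odd number of crossings => inside
    canvas[inside] = color
    return canvas
```

The intermediate boolean array has shape (rows, cols, V), which is small for a 256x256 canvas and a few dozen vertices.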
This code 'works', in that it only uses tensorflow functions, and does allow the model to train when used in a training loop:
import tensorflow as tf
def convert_image(x):
    #split off the first column of the generator output, and store it for later (remove the 'colours' column)
    colours_column = tf.slice(img_to_convert, tf.constant([0,0,0], dtype=tf.int32), tf.constant([32,1,3], dtype=tf.int32))
    #split off the rest of the data, only keeping R + G, and discarding B
    image_data_red = tf.slice(img_to_convert, tf.constant([0,1,0], dtype=tf.int32), tf.constant([32,31,1], dtype=tf.int32))
    image_data_green = tf.slice(img_to_convert, tf.constant([0,1,1], dtype=tf.int32), tf.constant([32, 31,1], dtype=tf.int32))
    #roll each row by 1 position, and make two more 2D tensors
    rolled_red = tf.roll(image_data_red, shift=-1, axis=0)
    rolled_green = tf.roll(image_data_green, shift=-1, axis=0)
    #remove all values where either the red OR green channels are 0
    zeroes = tf.constant(0, dtype=tf.float32)
    #this is for the 'count_nonzero' command
    boolean_red_data = tf.not_equal(image_data_red, zeroes)
    boolean_green_data = tf.not_equal(image_data_green, zeroes)
    initial_data_mask = tf.logical_and(boolean_red_data, boolean_green_data)
    #count non-zero values per row and flatten it
    count = tf.math.count_nonzero(initial_data_mask, 1)
    count_flat = tf.reshape(count, [-1])
    flat_red = tf.reshape(image_data_red, [-1])
    flat_green = tf.reshape(image_data_green, [-1])
    boolean_red = tf.math.logical_not(tf.equal(flat_red, tf.zeros_like(flat_red)))
    boolean_green = tf.math.logical_not(tf.equal(flat_green, tf.zeros_like(flat_red)))
    mask = tf.logical_and(boolean_red, boolean_green)
    flat_red_without_zero = tf.boolean_mask(flat_red, mask)
    flat_green_without_zero = tf.boolean_mask(flat_green, mask)
    # create a ragged tensor
    X0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_red_without_zero, row_lengths=count_flat)
    Y0_ragged = tf.RaggedTensor.from_row_lengths(values=flat_green_without_zero, row_lengths=count_flat)
    #do the same for the rolled version
    rolled_data_mask = tf.roll(initial_data_mask, shift=-1, axis=1)
    flat_rolled_red = tf.reshape(rolled_red, [-1])
    flat_rolled_green = tf.reshape(rolled_green, [-1])
    #from SO "shift zeros to the end"
    boolean_rolled_red = tf.math.logical_not(tf.equal(flat_rolled_red, tf.zeros_like(flat_rolled_red)))
    boolean_rolled_green = tf.math.logical_not(tf.equal(flat_rolled_green, tf.zeros_like(flat_rolled_red)))
    rolled_mask = tf.logical_and(boolean_rolled_red, boolean_rolled_green)
    flat_rolled_red_without_zero = tf.boolean_mask(flat_rolled_red, rolled_mask)
    flat_rolled_green_without_zero = tf.boolean_mask(flat_rolled_green, rolled_mask)
    # create a ragged tensor
    X1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_red_without_zero, row_lengths=count_flat)
    Y1_ragged = tf.RaggedTensor.from_row_lengths(values=flat_rolled_green_without_zero, row_lengths=count_flat)
    #available outputs for future use are:
    X0 = X0_ragged.to_tensor(default_value=0.)
    Y0 = Y0_ragged.to_tensor(default_value=0.)
    X1 = X1_ragged.to_tensor(default_value=0.)
    Y1 = Y1_ragged.to_tensor(default_value=0.)
    #Example tensor cel (replace with (x))
    P = tf.cast(x, dtype=tf.float32)
    #split out P.x and P.y, and fill a ragged tensor to the same shape as Rx
    Px_value = tf.cast(x, dtype=tf.float32) - tf.cast((tf.math.floor(x/255)*255), dtype=tf.float32)
    Py_value = tf.cast(tf.math.floor(x/255), dtype=tf.float32)
    Px = tf.squeeze(tf.ones_like(X0)*Px_value)
    Py = tf.squeeze(tf.ones_like(Y0)*Py_value)
    #for each pair of values (Y0, Y1), make a vector, and check to see if it crosses the y-value (Py) either up or down
    a = tf.math.less(Y0, Py)
    b = tf.math.greater_equal(Y1, Py)
    c = tf.logical_and(a, b)
    d = tf.math.greater_equal(Y0, Py)
    e = tf.math.less(Y1, Py)
    f = tf.logical_and(d, e)
    g = tf.logical_or(c, f)
    #Makes boolean bitwise mask
    #calculate the intersection of the line with the y-value, assuming it intersects
    #P.x < (G.x - R.x) * (P.y - R.y) / (G.y - R.y) + R.x - use tf.divide_no_nan for safe divide
    h = tf.math.less(Px, tf.math.divide_no_nan((X1-X0)*(Py-Y0), (Y1-Y0)) + X0)
    #combine using AND with the mask above
    i = tf.logical_and(g,h)
    #tf.count_nonzero
    #reshape to make a column tensor with the same dimensions as the colours
    #divide by 2 using tf.floor_mod (returns remainder of division - any remainder means the value is odd, and hence the point is IN the polygon)
    final_count = tf.cast((tf.math.count_nonzero(i, 1)), dtype=tf.int32)
    twos = tf.ones_like(final_count, dtype=tf.int32)*tf.constant([2], dtype=tf.int32)
    divide = tf.cast(tf.math.floormod(final_count, twos), dtype=tf.int32)
    index = tf.cast(tf.range(0,32, delta=1), dtype=tf.int32)
    clipped_index = divide*index
    sort = tf.sort(clipped_index)
    reverse = tf.reverse(sort, [-1])
    value = tf.slice(reverse, [0], [1])
    pair = tf.constant([0], dtype=tf.int32)
    slice_tensor = tf.reshape(tf.stack([value, pair, pair], axis=0),[-1])
    output_colour = tf.slice(colours_column, slice_tensor, [1,1,3])
    return output_colour
This is where the 'convert image' function is applied using tf.vectorized_map:
def convert_images(image_to_convert):
    global img_to_convert
    img_to_convert = image_to_convert
    process_list = tf.reshape((tf.range(0,65536, delta=1, dtype=tf.int32)), [65536, 1])
    output_line = tf.vectorized_map(convert_image, process_list)
    output_line_squeezed = tf.squeeze(output_line)
    output_reshape = (tf.reshape(output_line_squeezed, [256,256,3])/127.5)-1
    output = tf.expand_dims(output_reshape, axis=0)
    return output
It is PAINFULLY slow, though: it does not appear to be using the GPU, and looks to be single-threaded as well.
I'm adding it as an answer to my own question because it clearly IS possible to do this NumPy function entirely in TensorFlow; it just probably shouldn't be done like this.
I have an image that I want to perform some calculations on. The image pixels will be represented as f(x, y) where x is the column number and y is the row number of each pixel. I want to perform a calculation using the following formula:
Here is the code that does the calculation:
import matplotlib.pyplot as plt
import numpy as np
import os.path
from PIL import Image
global image_width, image_height
# A. Blur Measurement
def measure_blur(f):
    D_sub_h = [[0 for y in range(image_height)] for x in range(image_width)]
    for x in range(image_width):
        for y in range(image_height):
            if(y == 0):
                f_x_yp1 = f[x][y+1]
                f_x_ym1 = 0
            elif(y == (image_height -1)):
                f_x_yp1 = 0
                f_x_ym1 = f[x][y -1]
            else:
                f_x_yp1 = f[x][y+1]
                f_x_ym1 = f[x][y -1]
            D_sub_h[x][y] = abs(f_x_yp1 - f_x_ym1)
    return D_sub_h
if __name__ == '__main__':
    image_counter = 1
    while True:
        if not os.path.isfile(str (image_counter) + '.jpg'):
            break
        image_path = str(image_counter) + '.jpg'
        image = Image.open(image_path )
        image_height, image_width = image.size
        print("Image Width : " + str(image_width))
        print("Image Height : " + str(image_height))
        f = np.array(image)
        D_sub_h = measure_blur(f)
        image_counter = image_counter + 1
The problem with this code is that when the image gets large, such as 5000 x 5000, it takes a very long time to complete. Is there a function or technique I can use to make it faster, instead of computing every pixel one by one?
Since you specifically convert the input f to a numpy array, I am assuming you want to use numpy. In that case, the allocation of D_sub_h needs to change from a list to an array:
D_sub_h = np.empty_like(f)
If we assume that everything outside your array is zeros, then the first row and last row can be computed as the second and negative second-to-last rows, respectively:
D_sub_h[0, :] = f[1, :]
D_sub_h[-1, :] = -f[-2, :]
The remainder of the data is just the difference between the next and previous index at each location, which is idiomatically computed by shifting views: f[2:, :] - f[:-2, :]. This formulation creates a temporary array. You can avoid doing that by using np.subtract explicitly:
np.subtract(f[2:, :], f[:-2, :], out=D_sub_h[1:-1, :])
The entire thing takes four lines in this formulation, and is fully vectorized, which means that loops run quickly under the hood, without most of Python's overhead:
def measure_blur(f):
    D_sub_h = np.empty_like(f)
    D_sub_h[0, :] = f[1, :]
    D_sub_h[-1, :] = -f[-2, :]
    np.subtract(f[2:, :], f[:-2, :], out=D_sub_h[1:-1, :])
    return D_sub_h
Notice that I return the value instead of printing it. When you write functions, get in the habit of returning a value. Printing can be done later, and effectively discards the computation if it replaces a proper return.
The way shown above is fairly efficient with regards to time and space. If you want to write a one liner that uses a lot of temporary arrays, you can also do:
D_sub_h = np.concatenate((f[1, None], f[2:, :] - f[:-2, :], -f[-2, None]), axis=0)
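Two details are worth checking before relying on the snippets above. The question's loop takes abs() of the difference, while the snippets keep the signed value (so the last row comes out as -f[-2, :]); and np.array(image) on an 8-bit PIL image is uint8, which np.empty_like inherits, so negative differences wrap around modulo 256 instead of going negative. A sketch that handles both, assuming an integer grayscale input (such as uint8 or uint16) and the same axis-0 convention as the answer; the function name measure_blur_abs is hypothetical. Zero-padding one row above and below reproduces the "zeros outside the image" assumption without special-casing the first and last rows.

```python
import numpy as np


def measure_blur_abs(f):
    """|f[r+1] - f[r-1]| along axis 0, treating rows outside the image as zeros."""
    # assumption: integer input; a signed type avoids uint8 wrap-around
    f = np.asarray(f).astype(np.int32)
    # one zero row above and one below (np.pad defaults to mode='constant', value 0)
    padded = np.pad(f, ((1, 1), (0, 0)))
    # padded[r + 2] is f[r + 1] and padded[r] is f[r - 1]
    return np.abs(padded[2:] - padded[:-2])
```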
I'm trying to process many images represented as NumPy arrays, but it takes too long. This is what I'm trying to do:
# image is a list with images
max = np.amax(image[k])  # k is the current image index in the loop
# here I try to normalize SHORT color to BYTE color and make it fill the whole range from 0 to 255
# in the images the max color value is around 30000, the min is usually 0
i = 0
while i < len(image[k]):
    j = 0
    while j < len(image[k][i]):
        image[k][i][j] = float(image[k][i][j]) / (max) * 255
        j += 1
    i += 1
If I only read the images (170 in total, each 512x512) without this, it takes about 7 seconds; with this normalization it takes 20 minutes. And loops like this are all over my code. Here I try to make my image colored:
maskLoot1=np.zeros([len(mask1), 3*len(mask1[0])])
for i in range(len(mask1)):
    for j in range(len(mask1[0])):
        maskLoot1[i][j*3]=mask1[i][j]
        maskLoot1[i][j*3+1]=mask1[i][j]
        maskLoot1[i][j*3+2]=mask1[i][j]
Next I try to replace the pixels of a selected region with colored ones, for example 120 (grey) -> (255, 40, 0) in the RGB model.
for i in range(len(mask1)):
    for j in range(len(mask1[0])):
        #mask is a NumPy array with the selected pixels painted in white (255)
        if (mask[i][j] > 250):
            maskLoot1[i][j * 3] = lootScheme[mask1[i][j]][1] #red channel
            maskLoot1[i][j * 3+1] = lootScheme[mask1[i][j]][2] #green channel
            maskLoot1[i][j * 3+2] = lootScheme[mask1[i][j]][3] #blue channel
And it also takes a lot of time: not 20 minutes, but long enough to make my script lag. Keep in mind these are just 2 of my many operations on arrays, and even if a built-in function exists for the second case, that is very unlikely for the others. So is there a way to speed up my code?
For your mask-making code try this replacement to loops:
maskLoot1 = np.dstack(3*[mask1]).reshape((mask1.shape[0],3*mask1.shape[1]))
There are many other ways/variations of achieving the above, e.g.,
maskLoot1 = np.tile(mask1[:,:,None], 3).reshape((mask1.shape[0],3*mask1.shape[1]))
As for the first part of your question, the best answer is in the first comment to your question, by @furas.
First, consider moving to Python 3. NumPy is dropping support for Python 2.7 from 2020.
As for your code questions: you are missing the point of using NumPy. NumPy is compiled from lower-level libraries and runs very fast, so you should not loop over indices in Python; you should hand whole arrays to NumPy.
Question 1
Normalization is very fast using a listcomp and an np.array
import numpy as np
import time
# create dummy image structure (k, i, j, c) or (k, i, j)
# k is image index, i is row, j is columns, c is channel RGB
images = np.random.uniform(0, 30000, size=(170, 512, 512))
t_start = time.time()
norm_images = np.array([(255*images[k, :, :]/images[k, :, :].max()).astype(int) for k in range(170)])
t_end = time.time()
print("Processing time = {} seconds".format(t_end-t_start))
print("Input shape = {}".format(images.shape))
print("Output shape = {}".format(norm_images.shape))
print("Maximum input value = {}".format(images.max()))
print("Maximum output value = {}".format(norm_images.max()))
That creates the following output
Processing time = 0.2568979263305664 seconds
Input shape = (170, 512, 512)
Output shape = (170, 512, 512)
Maximum input value = 29999.999956185838
Maximum output value = 255
It takes 0.25 seconds!
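The list comprehension can be dropped entirely: computing the per-image maximum with axis=(1, 2) and keepdims=True gives a (K, 1, 1) array that broadcasts over rows and columns, so the whole stack is normalized in one expression. A sketch, assuming the images are stacked into a (K, H, W) array and that all-black images should stay black rather than divide by zero; the function name normalize_stack is hypothetical.

```python
import numpy as np


def normalize_stack(images):
    """Scale each (H, W) image in a (K, H, W) stack so its maximum maps to 255."""
    images = np.asarray(images, dtype=np.float64)
    # per-image maximum, shape (K, 1, 1) so it broadcasts over each image
    peak = images.max(axis=(1, 2), keepdims=True)
    # all-black images would divide by zero; leave them at 0 instead
    peak[peak == 0] = 1
    # astype truncates toward zero, like the int conversion in the loop
    return (images / peak * 255).astype(np.uint8)
```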
Question 2
Not sure what you meant here, but if you want to clone the values of a monochromatic image into RGB values, you can do it like this:
# coloring (by copying value and keeping your structure)
color_img = np.array([np.tile(images[k], 3) for k in range(170)])
print("Output shape = {}".format(color_img.shape))
Which produces
Output shape = (170, 512, 1536)
If you would instead like a (k, i, j, c) structure:
color_img = np.array([[images[k]]*3 for k in range(170)]) # that creates (170, 3, 512, 512)
color_img = np.swapaxes(np.swapaxes(color_img, 1,2), 2, 3) # that creates (170, 512, 512, 3)
All this takes 0.26 seconds!
Question 3
Coloring certain regions, I would use a function again and a listcomp. Since this is an example I have used a default colouring of (255, 40, 0) but you can use anything, including a LUT.
# create a dummy mask with values from 0 to 255
mask = np.floor(np.random.uniform(0,256, size=(512,512)))
default_scheme = (255, 40, 0)
def substitute(cimg, mask, scheme):
ind = mask > 250
cimg[ind, :] = scheme
return cimg
new_cimg = np.array([substitute(color_img[k], mask, default_scheme) for k in range(170)])
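The question's per-pixel colour comes from lootScheme indexed by the grey value, which is exactly a lookup table (LUT); fancy indexing applies it to all selected pixels at once and also replaces the copy-to-three-channels loop. A sketch under stated assumptions: lootScheme can be indexed by every grey level 0..255 and entries [1], [2], [3] hold R, G, B (as in the question's loop), mask1 holds integer grey levels, and the function name colorize_selected is hypothetical. The result is reshaped to the same interleaved (H, 3*W) layout as maskLoot1.

```python
import numpy as np


def colorize_selected(mask1, mask, loot_scheme, threshold=250):
    """Vectorized version of the two loops: grey -> RGB, then recolour selected pixels."""
    mask1 = np.asarray(mask1)
    # (H, W) grey -> (H, W, 3) with the grey value copied into every channel
    rgb = np.repeat(mask1[:, :, None], 3, axis=2).astype(np.float64)
    # assumption: loot_scheme[v][1:4] is the (R, G, B) colour for grey level v
    lut = np.array([list(loot_scheme[v][1:4]) for v in range(256)], dtype=np.float64)
    selected = np.asarray(mask) > threshold
    # look up the colour of every selected pixel in one indexing operation
    rgb[selected] = lut[mask1[selected]]
    # same interleaved layout as maskLoot1: column 3*j + c holds channel c of pixel j
    return rgb.reshape(mask1.shape[0], 3 * mask1.shape[1])
```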
In general for-loops are significantly faster than while-loops. Also using a function for
maskLoot1[i][j*3]=mask1[i][j]
maskLoot1[i][j*3+1]=mask1[i][j]
maskLoot1[i][j*3+2]=mask1[i][j]
and calling the function in the loop should speed up the process significantly.