I often need to stack 2D numpy arrays (TIFF images). For that, I first append them to a list and use np.dstack. This seems to be the fastest way to get a 3D array of stacked images. But is there a faster or more memory-efficient way?
from time import time
import numpy as np

# Create 100 images of the same dimension 256x512 (8-bit).
# In reality, each image comes from a different file.
img = np.random.randint(0, 255, (256, 512, 100))

t0 = time()
temp = []
for n in range(100):
    temp.append(img[:, :, n])
stacked = np.dstack(temp)
#stacked = np.array(temp)  # much slower: 3.5 s for 100
print(time() - t0)  # 0.58 s for 100 frames
print(stacked.shape)
# Calling dstack in every loop iteration is slower.
t0 = time()
temp = img[:, :, 0]
for n in range(1, 100):
    temp = np.dstack((temp, img[:, :, n]))
print(time() - t0)  # 3.13 s for 100 frames
print(temp.shape)
# Counter-intuitively, preallocation is slightly slower here.
stacked = np.empty((256, 512, 100))
t0 = time()
for n in range(100):
    stacked[:, :, n] = img[:, :, n]
print(time() - t0)  # 0.651 s for 100 frames
print(stacked.shape)
# (Edit) As in the accepted answer, rearranging the axes so that the first
# axis is the one used to access the data improved the speed significantly.
img = np.random.randint(0, 255, (100, 256, 512))
stacked = np.empty((100, 256, 512))
t0 = time()
for n in range(100):
    stacked[n, :, :] = img[n, :, :]
print(time() - t0)  # 0.08 s for 100 frames
print(stacked.shape)
After some joint effort with otterb, we concluded that preallocating the array is the way to go. Apparently the performance-killing bottleneck was the array layout, with the image number (n) being the fastest-changing index. If we make n the first index of the array (with the default "C" ordering, the first index changes slowest and the last index changes fastest), we get the best performance:
from time import time
import numpy as np

# Create 100 images of the same dimension 256x512 (8-bit).
# In reality, each image comes from a different file.
img = np.random.randint(0, 255, (100, 256, 512))

# Preallocate with the image index as the first (slowest-changing) axis.
stacked = np.empty((100, 256, 512))
t0 = time()
for n in range(100):
    stacked[n] = img[n]
print(time() - t0)
print(stacked.shape)
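Since the real images come from separate files, here is a minimal sketch of filling the preallocated array while reading, assuming the tifffile package and a hypothetical list of file paths:

import numpy as np
import tifffile  # assumed reader; any imread-style TIFF reader would do

# paths is a hypothetical list of 100 file names, one per frame
paths = ['frame_{:03d}.tif'.format(n) for n in range(100)]

# Read the first frame to learn its shape and dtype, then preallocate
# with the frame index as the first (slowest-changing) axis.
first = tifffile.imread(paths[0])
stacked = np.empty((len(paths),) + first.shape, dtype=first.dtype)
stacked[0] = first
for n, path in enumerate(paths[1:], start=1):
    stacked[n] = tifffile.imread(path)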
Related
In the following example, I initialize a 500x2000x2000 three-dimensional NumPy array named a. At each iteration, a random two-dimensional array r is inserted into array a. This example represents a larger piece of code where the r array would be created from various calculations during each iteration of the for-loop. Consequently, each slice in the z dimension of the a array is calculated at each iteration.
# ex1_basic.py
import numpy as np
import time

def main():
    tic = time.perf_counter()
    z = 500   # depth
    x = 2000  # rows
    y = 2000  # columns
    a = np.zeros((z, x, y))
    for i in range(z):
        r = np.random.rand(x, y)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')

if __name__ == '__main__':
    main()
This example's memory is profiled using the memory-profiler package. The steps for running the memory-profiler for this example are:
# Run the memory profiler
$ mprof run ex1_basic.py
# Plot the memory profiler results
$ mprof plot
The memory usage is plotted below. The memory usage increases over time as the values are added to the array.
I profiled another example where the data type for the array is defined as np.float32. See below for the example code and memory usage plot. This decreased the overall memory use but the memory still increases with each iteration.
import numpy as np
import time

def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500   # depth
    x = 2000  # rows
    y = 2000  # columns
    a = np.zeros((z, x, y), dtype=np.float32)
    for i in range(z):
        r = rng.standard_normal((x, y), dtype=np.float32)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')

if __name__ == '__main__':
    main()
Since I initialized array a using np.zeros, I would expect the memory usage to remain constant based on the block of memory initialized for that array. But it appears that memory usage increases as values are inserted into array a.
So I have two questions related to these examples:
Why does the memory usage increase with time?
How do I create and store array a on disk and only have slice a[i] and the r array in memory at each iteration? Basically, how would I run these examples if the a array did not fit in memory (RAM)?
Update
I ran an example using numpy.memmap but there is no improvement in memory usage. It seems like memmap is still keeping the entire array in memory.
import numpy as np
import time

def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500
    x = 2000
    y = 2000
    a = np.memmap('file.dat', dtype=np.float32, mode='w+', shape=(z, x, y))
    for i in range(z):
        r = rng.standard_normal((x, y), dtype=np.float32)
        a[i] = r
    toc = time.perf_counter()
    print('elapsed =', round(toc - tic, 2), 'sec')

if __name__ == '__main__':
    main()
Using the h5py package, I can create an hdf5 file that contains a dataset that represents the a array. The dset variable is similar to the a variable discussed in the question. This allows the array to reside on disk, not in memory. The generated hdf5 file is 8 GB on disk which is the size of the array containing np.float32 values. The elapsed time for this approach is similar to the examples discussed in the question; therefore, writing to the hdf5 file seems to have a negligible performance impact.
import numpy as np
import h5py
import time

def main():
    rng = np.random.default_rng()
    tic = time.perf_counter()
    z = 500   # depth
    x = 2000  # rows
    y = 2000  # columns
    f = h5py.File('file.hdf5', 'w')
    dset = f.create_dataset('data', shape=(z, x, y), dtype=np.float32)
    for i in range(z):
        r = rng.standard_normal((x, y), dtype=np.float32)
        dset[i, :, :] = r
    toc = time.perf_counter()
    print('elapsed time =', round(toc - tic, 2), 'sec')
    s = np.float32().nbytes * (z * x * y) / 1e9  # where 1 GB = 1000 MB
    print('calculated storage =', s, 'GB')

if __name__ == '__main__':
    main()
Output from running this example on a MacBook Pro with 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM:
elapsed time = 22.97 sec
calculated storage = 8.0 GB
Running the memory profiler for this example gives the plot shown below. The peak memory usage is about 100 MiB which is drastically lower than the examples demonstrated in the question.
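For completeness, here is a minimal sketch of reading a single slice back from the file created above (the 'data' dataset in 'file.hdf5'); only the requested slice is loaded into memory:

import h5py

with h5py.File('file.hdf5', 'r') as f:
    dset = f['data']          # on-disk dataset, shape (500, 2000, 2000)
    frame = dset[10, :, :]    # a single 2000x2000 float32 slice in memory
    print(frame.shape, frame.dtype)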
I have a large array of values packed in a 4D numpy array (thousands of values in x, y, z for thousands of times). For each of these values I need a 'color vector' (RGBA) from a matplotlib.cm.ScalarMappable object.
I've discovered that looping through such an array becomes rather slow, and I'm wondering if there's a way of significantly speeding it up by taking a different approach. For example, can an entire numpy array (greater than 2D) be passed to a ScalarMappable in order for this operation to take place in a more numpythonic or vectorized way?
My example code for a 3D case:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import timeit

def get_colors_1(data, x, y, z):
    colors = np.zeros((x, y, z, 4), dtype=np.float16)
    for i in range(x):
        for j in range(y):
            for k in range(z):
                colors[i, j, k, :] = m.to_rgba(data[i, j, k])
    return colors

def get_colors_2(data, x, y, z):
    colors = np.array([[[m.to_rgba(data[i, j, k]) for k in range(z)]
                        for j in range(y)] for i in range(x)], dtype=np.float16)
    return colors

def get_colors_3(data, x, y, z):
    colors = np.zeros((x, y, z, 4), dtype=np.float16)
    for i in range(x):
        colors[i, :, :, :] = m.to_rgba(data[i, :, :])
    return colors

x, y, z = 30, 20, 10
data = np.random.rand(x, y, z)

cmap = matplotlib.cm.get_cmap('jet')
norm = matplotlib.colors.PowerNorm(vmin=0.0, vmax=1.0, gamma=2.5)
m = matplotlib.cm.ScalarMappable(norm=norm, cmap=cmap)

start_time = timeit.default_timer()
colors = get_colors_1(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))

start_time = timeit.default_timer()
colors = get_colors_2(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))

start_time = timeit.default_timer()
colors = get_colors_3(data, x, y, z)
elapsed = timeit.default_timer() - start_time
print('time elapsed: ' + str(elapsed))
The third method (passing 2D arrays one at a time) shows a big performance bump, but I'm wondering if this can be pushed significantly further.
time elapsed: 0.5877857000014046
time elapsed: 0.5911024999986694
time elapsed: 0.004590500000631437
Looking at help(ScalarMappable.to_rgba):
def to_rgba(self, x, alpha=None, bytes=False, norm=True):
In the normal case, x is a 1D or 2D sequence of scalars, and
the corresponding ndarray of rgba values will be returned,
based on the norm and colormap set for this ScalarMappable.
There is one special case, for handling images that are already
rgb or rgba, such as might have been read from an image file.
If x is an ndarray with 3 dimensions ...
Is
sm = ScalarMappable(norm=None, cmap=cmap)
flat = data.reshape(-1)  # a view
rgba = sm.to_rgba(flat, bytes=True, norm=False).reshape(data.shape + (4,))
much faster?
(Source: ScalarMappable.to_rgba.)
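A minimal sketch of that reshape idea applied to the question's 3-D data (reusing the m and data defined above, and leaving norm=True so the PowerNorm is still applied):

# Flatten to 1D, map all values in one to_rgba call, then restore the
# original 3D shape plus a trailing RGBA axis.
flat = data.reshape(-1)                            # a view, no copy
rgba = m.to_rgba(flat).reshape(data.shape + (4,))
print(rgba.shape)                                  # (30, 20, 10, 4)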
I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial

# create meshgrid
side = 100.
n = 64  # could be 256 or larger
x_ = np.linspace(-side/2, side/2, n)
x, y, z = np.meshgrid(x_, x_, x_, indexing='ij')

# convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(), y.ravel(), z.ravel()))

# create some coordinates
coords = np.random.random(size=(1000, 3))*side - side/2

def sumofgauss(coords, xyz, sigma):
    """Simple isotropic gaussian sum at coordinate locations."""
    n = int(round(xyz.shape[0]**(1/3.)))  # get n samples for reshaping to 3D later
    # this version overloads memory
    #dist = spatial.distance.cdist(coords, xyz)
    #dist *= dist
    #values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
    #values = np.sum(values, axis=0)
    # run cdist in a loop over coords to avoid overloading memory
    values = np.zeros((xyz.shape[0]))
    for i in range(coords.shape[0]):
        dist = spatial.distance.cdist(coords[None, i], xyz)
        dist *= dist
        values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
    return values.reshape(n, n, n)

image = sumofgauss(coords, xyz, 1.0)

import matplotlib.pyplot as plt
plt.imshow(image[n//2])  # show a slice (integer index needed in Python 3)
plt.show()
Example timings: M = 1000, N = 64 takes ~5 seconds; M = 1000, N = 256 takes ~10 minutes.
Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances greater than a threshold is usually faster with a KDTree:
import numpy as np
from scipy.spatial import cKDTree  # so we can get a `coo_matrix` output

def gaussgrid(coords, sigma=1, n=64, side=100, eps=None):
    x_ = np.linspace(-side/2, side/2, n)
    x, y, z = np.meshgrid(x_, x_, x_, indexing='ij')
    xyz = np.column_stack((x.ravel(), y.ravel(), z.ravel()))
    if eps is None:
        eps = np.finfo('float64').eps
    thr = -np.log(eps) * 2 * sigma**2
    data_tree = cKDTree(coords)
    discr = 1000  # you can tweak this to get best results on your system
    values = np.empty(n**3)
    for i in range(n**3 // discr + 1):
        slc = slice(i * discr, i * discr + discr)
        grid_tree = cKDTree(xyz[slc])
        dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type='coo_matrix')
        dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data/(2*sigma**2))
        values[slc] = dists.sum(1).squeeze()
    return values.reshape(n, n, n)
Now, even if you keep eps = None it will be a bit faster, since you are still returning only about 10% of your distances, but with eps = 1e-6 or so you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop
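For reference, a minimal usage sketch with the coords array from the question (n and side match its grid), plotting a central slice of the result:

import matplotlib.pyplot as plt

out = gaussgrid(coords, sigma=1.0, n=64, side=100., eps=1e-6)
plt.imshow(out[32])  # central slice of the 64x64x64 grid
plt.show()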
I am trying to process many images represented as NumPy arrays, but it takes too long. Here is what I am trying to do:
# image is a list of images
max_val = np.amax(image[k])  # k is the current image index in the loop
# Here I try to normalize the SHORT color range to BYTE color so it fills
# the whole range from 0 to 255.
# In these images the max color value is around 30000 and the min is usually 0.
i = 0
while i < len(image[k]):
    j = 0
    while j < len(image[k][i]):
        image[k][i][j] = float(image[k][i][j]) / max_val * 255
        j += 1
    i += 1
If I only read the images (170 in total, each 512x512) without normalizing, it takes about 7 seconds; with this normalization it takes 20 minutes. And this kind of loop is all over my code. Here I try to make my image colored:
maskLoot1 = np.zeros([len(mask1), 3*len(mask1[0])])
for i in range(len(mask1)):
    for j in range(len(mask1[0])):
        maskLoot1[i][j*3] = mask1[i][j]
        maskLoot1[i][j*3+1] = mask1[i][j]
        maskLoot1[i][j*3+2] = mask1[i][j]
Next I try to replace the pixels of a selected region with colored ones, for example 120 (grey) -> (255, 40, 0) in the RGB model:
for i in range(len(mask1)):
    for j in range(len(mask1[0])):
        # mask is a NumPy array with the selected pixels painted white (255)
        if (mask[i][j] > 250):
            maskLoot1[i][j*3] = lootScheme[mask1[i][j]][1]    # red channel
            maskLoot1[i][j*3+1] = lootScheme[mask1[i][j]][2]  # green channel
            maskLoot1[i][j*3+2] = lootScheme[mask1[i][j]][3]  # blue channel
This also takes a lot of time, not 20 minutes, but long enough to make my script lag. Consider that these are just 2 of my many operations on arrays, and even if a built-in function exists for the second case, it is very unlikely that one exists for the others. So is there a way to speed up my code?
For your mask-making code, try this replacement for the loops:
maskLoot1 = np.dstack(3*[mask1]).reshape((mask1.shape[0],3*mask1.shape[1]))
There are many other ways/variations of achieving the above, e.g.,
maskLoot1 = np.tile(mask1[:,:,None], 3).reshape((mask1.shape[0],3*mask1.shape[1]))
As for the first part of your question, the best answer is in the first comment to your question, by furas.
First thing, consider moving to Python 3.x. NumPy is dropping support for Python 2.7 in 2020.
As for your code questions: you are missing the point of using NumPy. NumPy is compiled from lower-level libraries and runs very fast; you should not loop over indices in Python, you should hand whole arrays to NumPy.
Question 1
Normalization is very fast using a list comprehension and np.array:
import numpy as np
import time
# create dummy image structure (k, i, j, c) or (k, i, j)
# k is image index, i is row, j is columns, c is channel RGB
images = np.random.uniform(0, 30000, size=(170, 512, 512))
t_start = time.time()
norm_images = np.array([(255*images[k, :, :]/images[k, :, :].max()).astype(int) for k in range(170)])
t_end = time.time()
print("Processing time = {} seconds".format(t_end-t_start))
print("Input shape = {}".format(images.shape))
print("Output shape = {}".format(norm_images.shape))
print("Maximum input value = {}".format(images.max()))
print("Maximum output value = {}".format(norm_images.max()))
That creates the following output
Processing time = 0.2568979263305664 seconds
Input shape = (170, 512, 512)
Output shape = (170, 512, 512)
Maximum input value = 29999.999956185838
Maximum output value = 255
It takes 0.25 seconds!
Question 2
Not sure what you meant here, but if you want to clone the values of a monochromatic image into RGB values, you can do it like this:
# coloring (by copying value and keeping your structure)
color_img = np.array([np.tile(images[k], 3) for k in range(170)])
print("Output shape = {}".format(color_img.shape))
Which produces
Output shape = (170, 512, 1536)
If you would instead like to keep a (k, i, j, c) structure:
color_img = np.array([[images[k]]*3 for k in range(170)]) # that creates (170, 3, 512, 512)
color_img = np.swapaxes(np.swapaxes(color_img, 1,2), 2, 3) # that creates (170, 512, 512, 3)
All this takes 0.26 seconds!
Question 3
For coloring certain regions, I would again use a function and a list comprehension. Since this is an example, I have used a default coloring of (255, 40, 0), but you can use anything, including a LUT.
# create a mask with random values in [0, 256)
mask = np.floor(np.random.uniform(0, 256, size=(512, 512)))
default_scheme = (255, 40, 0)

def substitute(cimg, mask, scheme):
    ind = mask > 250
    cimg[ind, :] = scheme
    return cimg

new_cimg = np.array([substitute(color_img[k], mask, default_scheme) for k in range(170)])
In general for-loops are significantly faster than while-loops. Also using a function for
maskLoot1[i][j*3]=mask1[i][j]
maskLoot1[i][j*3+1]=mask1[i][j]
maskLoot1[i][j*3+2]=mask1[i][j]
and calling the function in the loop should speed up the process significantly.
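As a rough sketch of that suggestion (using the mask1 and maskLoot1 arrays from the question), the three assignments can be factored into a helper that the loop calls:

def copy_to_rgb(maskLoot1, mask1, i, j):
    # write the grey value of mask1[i][j] into the three interleaved
    # channel slots of row i in maskLoot1
    v = mask1[i][j]
    maskLoot1[i][j*3] = v
    maskLoot1[i][j*3+1] = v
    maskLoot1[i][j*3+2] = v

for i in range(len(mask1)):
    for j in range(len(mask1[0])):
        copy_to_rgb(maskLoot1, mask1, i, j)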
I am trying to perform image correlation to find which frame out of a set of 20 frames (the set is stored in a 3D array, x) matches best with a given frame (stored as a 2D array, y). This step has to be performed 1000 times.
I tried to vectorize the code to make it run faster. But somehow the vectorization is making the code take twice as long. I am probably doing something wrong in the vectorization process which is making it slower.
Here is the code
import numpy as np
import time

def corr2(a, b):
    # Getting shapes and preallocating the auxiliary variables
    k = np.shape(a)
    # Calculating mean values
    AM = np.mean(a)
    BM = np.mean(b)
    # Calculate vectors
    c_vect = (a - AM)*(b - BM)
    d_vect = (a - AM)**2
    e_vect = (b - BM)**2
    # Formula itself
    r_out = np.sum(c_vect)/float(np.sqrt(np.sum(d_vect)*np.sum(e_vect)))
    return r_out
def ZZ_1X_v1_MCC(R, RefImage):
    # corr2 is defined above, so no separate import is needed here
    Cor = np.zeros(R.shape[2])
    for t in range(R.shape[2]):
        Cor[t] = corr2(RefImage, R[:, :, t])  # correlation
    # report
    max_correlationvalue_intermediate = np.amax(Cor)
    max_correlatedframe_intermediate = np.argmax(Cor)
    max_correlatedframeandvalue = [max_correlatedframe_intermediate, max_correlationvalue_intermediate]
    return max_correlatedframeandvalue
def ZZ_1X_v1_MCC_vectorized(R, RefImage):
    R_shape = np.asarray(np.shape(R))
    R_flattened = R.swapaxes(0, 2).reshape(R_shape[2], R_shape[0]*R_shape[1])
    AA = np.transpose(R_flattened)
    RefImageflattened = RefImage.transpose().ravel()
    # Calculating mean-subtracted values
    AAM = AA - np.mean(AA, axis=0)
    BM = RefImageflattened - np.mean(RefImageflattened)
    # Calculate vectors
    DD_vect = AAM**2
    E_vect = BM**2
    EE_vect = np.transpose(np.tile(np.transpose(E_vect), (R_shape[2], 1)))
    CC_vect = AAM*np.transpose(np.tile(BM, (R_shape[2], 1)))
    # Formula itself
    Cor = np.sum(CC_vect, axis=0)/np.sqrt((np.sum(DD_vect, axis=0)*np.sum(EE_vect, axis=0)).astype(float))
    # report
    max_correlationvalue_intermediate = np.amax(Cor)
    max_correlatedframe_intermediate = np.argmax(Cor)
    max_correlatedframeandvalue = [max_correlatedframe_intermediate, max_correlationvalue_intermediate]
    return max_correlatedframeandvalue
x = np.arange(400000).reshape((20, 200, 100)).swapaxes(0, 2)  # 3D array with 20 frames
y = np.transpose(np.arange(20000).reshape((200, 100)))        # 2D array with 1 frame

# using a for loop
tic = time.time()
for i in range(500):
    [a, b] = ZZ_1X_v1_MCC(x, y)
print(time.time() - tic)

# using vectorization
tic = time.time()
for i in range(500):
    [a, b] = ZZ_1X_v1_MCC_vectorized(x, y)
print(time.time() - tic)
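A sketch of one alternative, assuming the np.tile temporaries are the main overhead: plain broadcasting and np.einsum compute the same per-frame correlation without building the tiled arrays.

def corr2_all_frames(R, RefImage):
    # R: (rows, cols, frames), RefImage: (rows, cols)
    A = RefImage - RefImage.mean()                  # mean-subtracted reference
    B = R - R.mean(axis=(0, 1), keepdims=True)      # mean-subtracted frames
    num = np.einsum('ij,ijt->t', A, B)              # per-frame sum of A * B[:, :, t]
    den = np.sqrt((A**2).sum() * (B**2).sum(axis=(0, 1)))
    return num / den                                # correlation for every frame

Cor = corr2_all_frames(x, y)
print(np.argmax(Cor), np.amax(Cor))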