How to efficiently operate on a large numpy array - python

I have a segment of code that builds a large numpy array and then uses it to fill another array. Because the array is very large, could you please let me know whether there is an efficient way to achieve this? (I think the efficient way would be to operate on the array directly rather than through a for loop.)
Thanks in advance. Please find my code below:
import numpy as np

N = 1000000000
rand = np.random.rand(N)
beta = np.zeros(N)
for i in range(0, N):
    if rand[i] < 0.5:
        beta[i] = 2.0*rand[i]
    else:
        beta[i] = 1.0/(2.0*(1.0-rand[i]))

You are basically losing the efficiency of numpy here by performing the processing in Python. The idea of numpy is to process the items in bulk, since it has efficient algorithms written in C behind the curtains that do the actual processing. You can see the Python end of numpy more as an "interface".
Now to answer your question: we can first construct an array of random numbers between 0 and 2 by multiplying by 2 up front:
rand = 2.0 * np.random.rand(N)
Next we can use np.where(..) [numpy-doc], which acts as a conditional selector: we pass it three "arrays": the first is an array of booleans that encodes the condition, the second is the array of values to fill in where the condition is true, and the third is the array of values to plug in where the condition is false. We can then write it like:
N = 1000000000
rand = 2 * np.random.rand(N)
beta = np.where(rand < 1.0, rand, 1.0 / (2.0 - rand))

N = 1000000000 caused a MemoryError for me. Reducing to 100 for a minimal example.
You can use the np.where routine.
In both cases, you are fundamentally iterating over your array and applying a function. However, np.where uses a much faster loop (it is basically compiled code), while your "python" loop is interpreted and thus really slow for a big N.
Here's an example of implementation.
N = 100
rand = np.random.rand(N)
beta = np.where(rand < 0.5, 2.0 * rand, 1.0/(2.0*(1.0-rand)))

As other answers have pointed out, iterating over the elements of a numpy array in a Python loop should (and can) almost always be avoided. In most cases going from a Python loop to an array operation gives a speedup of ~100x.
However, if performance is absolutely critical, you can often squeeze out another factor of between 2x and 10x (in my experience) by using Cython.
Here's an example:
%%cython
cimport numpy as np
import numpy as np
cimport cython
from cython cimport floating

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef np.ndarray[floating, ndim=1] beta(np.ndarray[floating, ndim=1] arr):
    cdef:
        Py_ssize_t i
        Py_ssize_t N = arr.shape[0]
        # allocate the result with the same dtype as the input so it matches
        # the fused floating type
        np.ndarray[floating, ndim=1] result = np.zeros(N, dtype=arr.dtype)
    for i in range(N):
        if arr[i] < 0.5:
            result[i] = 2.0*arr[i]
        else:
            result[i] = 1.0/(2.0*(1.0-arr[i]))
    return result
You would then call it as beta(rand).
As you can see, this allows you to use your original loop structure, but now using efficient typed native code. I get a speedup of ~2.5x compared to np.where.
It should be noted that in many cases this is not worth the extra effort compared to the one-liner in numpy -- but it may well be worth it where performance is critical.
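For completeness, a rough sketch of how that comparison might be run in a Jupyter/IPython session (the array size below is an illustrative assumption, not a measurement from this answer):
# assumes the %%cython cell above has already been executed
import numpy as np
rand = np.random.rand(10_000_000)
# numpy one-liner
%timeit np.where(rand < 0.5, 2.0 * rand, 1.0 / (2.0 * (1.0 - rand)))
# typed Cython loop defined above
%timeit beta(rand)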

Most efficient way to map a function over a numpy array

What is the most efficient way to map a function over a numpy array? I am currently doing:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
# Obtain array of square of each element in x
squarer = lambda t: t ** 2
squares = np.array([squarer(xi) for xi in x])
However, this is probably very inefficient, since I am using a list comprehension to construct the new array as a Python list before converting it back to a numpy array. Can we do better?
I've tested all suggested methods plus np.array(list(map(f, x))) with perfplot (a small project of mine).
Message #1: If you can use numpy's native functions, do that.
If the function you're trying to vectorize is already vectorized (like the x**2 example in the original post), using that is much faster than anything else (note the log scale in the benchmark plots):
If you actually need vectorization, it doesn't really matter much which variant you use.
Code to reproduce the plots:
import numpy as np
import perfplot
import math

def f(x):
    # return math.sqrt(x)
    return np.sqrt(x)

vf = np.vectorize(f)

def array_for(x):
    return np.array([f(xi) for xi in x])

def array_map(x):
    return np.array(list(map(f, x)))

def fromiter(x):
    return np.fromiter((f(xi) for xi in x), x.dtype)

def vectorize(x):
    return np.vectorize(f)(x)

def vectorize_without_init(x):
    return vf(x)

b = perfplot.bench(
    setup=np.random.rand,
    n_range=[2 ** k for k in range(20)],
    kernels=[
        f,
        array_for,
        array_map,
        fromiter,
        vectorize,
        vectorize_without_init,
    ],
    xlabel="len(x)",
)
b.save("out1.svg")
b.show()
How about using numpy.vectorize?
import numpy as np
x = np.array([1, 2, 3, 4, 5])
squarer = lambda t: t ** 2
vfunc = np.vectorize(squarer)
vfunc(x)
# Output : array([ 1, 4, 9, 16, 25])
TL;DR
As noted by @user2357112, a "direct" method of applying the function is always the fastest and simplest way to map a function over Numpy arrays:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
f = lambda x: x ** 2
squares = f(x)
Generally avoid np.vectorize, as it does not perform well, and has (or had) a number of issues. If you are handling other data types, you may want to investigate the other methods shown below.
Comparison of methods
Here are some simple tests comparing a few methods of mapping a function, this example using Python 3.6 and NumPy 1.15.4. First, the set-up functions for testing:
import timeit
import numpy as np

f = lambda x: x ** 2
vf = np.vectorize(f)

def test_array(x, n):
    t = timeit.timeit(
        'np.array([f(xi) for xi in x])',
        'from __main__ import np, x, f', number=n)
    print('array: {0:.3f}'.format(t))

def test_fromiter(x, n):
    t = timeit.timeit(
        'np.fromiter((f(xi) for xi in x), x.dtype, count=len(x))',
        'from __main__ import np, x, f', number=n)
    print('fromiter: {0:.3f}'.format(t))

def test_direct(x, n):
    t = timeit.timeit(
        'f(x)',
        'from __main__ import x, f', number=n)
    print('direct: {0:.3f}'.format(t))

def test_vectorized(x, n):
    t = timeit.timeit(
        'vf(x)',
        'from __main__ import x, vf', number=n)
    print('vectorized: {0:.3f}'.format(t))
Testing with five elements (sorted from fastest to slowest):
x = np.array([1, 2, 3, 4, 5])
n = 100000
test_direct(x, n) # 0.265
test_fromiter(x, n) # 0.479
test_array(x, n) # 0.865
test_vectorized(x, n) # 2.906
With 100s of elements:
x = np.arange(100)
n = 10000
test_direct(x, n) # 0.030
test_array(x, n) # 0.501
test_vectorized(x, n) # 0.670
test_fromiter(x, n) # 0.883
And with 1000s of array elements or more:
x = np.arange(1000)
n = 1000
test_direct(x, n) # 0.007
test_fromiter(x, n) # 0.479
test_array(x, n) # 0.516
test_vectorized(x, n) # 0.945
Different versions of Python/NumPy and compiler optimization will have different results, so do a similar test for your environment.
There are numexpr, numba and cython around; the goal of this answer is to take these possibilities into consideration.
But first let's state the obvious: no matter how you map a Python function onto a numpy array, it stays a Python function, which means that for every evaluation:
the numpy array element must be converted to a Python object (e.g. a float), and
all calculations are done with Python objects, which brings the overhead of the interpreter, dynamic dispatch and immutable objects.
So the machinery used to actually loop through the array doesn't play a big role because of the overhead mentioned above - it stays much slower than using numpy's built-in functionality.
Let's take a look at the following example:
# numpy-functionality
def f(x):
    return x+2*x*x+4*x*x*x

# python-function as ufunc
import numpy as np
vf=np.vectorize(f)
vf.__name__="vf"
np.vectorize is picked as a representative of the pure-python function class of approaches. Using perfplot (see code in the appendix of this answer) we get the following running times:
We can see that the numpy approach is 10x-100x faster than the pure Python version. The decrease in performance for bigger array sizes is probably because the data no longer fits the cache.
It is also worth mentioning that vectorize uses a lot of memory, so memory usage is often the bottleneck (see the related SO question). Also note that numpy's documentation on np.vectorize states that it is "provided primarily for convenience, not for performance".
Other tools should be used when performance is desired. Besides writing a C extension from scratch, there are the following possibilities:
One often hears that numpy's performance is as good as it gets, because it is pure C under the hood. Yet there is a lot of room for improvement!
The vectorized numpy version uses a lot of additional memory and memory accesses. The numexpr library tries to tile the numpy arrays and thus achieve better cache utilization:
# less cache misses than numpy-functionality
import numexpr as ne

def ne_f(x):
    return ne.evaluate("x+2*x*x+4*x*x*x")
Leads to the following comparison:
I cannot explain everything in the plot above: we can see a bigger overhead for the numexpr library at the beginning, but because it utilizes the cache better, it is about 10 times faster for bigger arrays!
Another approach is to jit-compile the function and thus get a real pure-C ufunc. This is numba's approach:
# runtime generated C-function as ufunc
import numba as nb

@nb.vectorize(target="cpu")
def nb_vf(x):
    return x+2*x*x+4*x*x*x
It is 10 times faster than the original numpy-approach:
However, the task is embarrassingly parallel, so we could also use prange to calculate the loop in parallel:
@nb.njit(parallel=True)
def nb_par_jitf(x):
    y=np.empty(x.shape)
    for i in nb.prange(len(x)):
        y[i]=x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y
As expected, the parallel function is slower for smaller inputs, but faster (almost factor 2) for larger sizes:
While numba specializes in optimizing operations on numpy arrays, Cython is a more general tool. It is more complicated to extract the same performance as with numba - often it comes down to llvm (numba) vs the local compiler (gcc/MSVC):
%%cython -c=/openmp -a
import numpy as np
import cython

# single core:
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef Py_ssize_t i
    cdef double[::1] y=y_out
    for i in range(len(x)):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out

# parallel:
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_par_f(double[::1] x):
    y_out=np.empty(len(x))
    cdef double[::1] y=y_out
    cdef Py_ssize_t i
    cdef Py_ssize_t n = len(x)
    for i in prange(n, nogil=True):
        y[i] = x[i]+2*x[i]*x[i]+4*x[i]*x[i]*x[i]
    return y_out
Cython results in somewhat slower functions:
Conclusion
Obviously, testing only one function doesn't prove anything. One should also keep in mind that for the chosen example function, memory bandwidth was the bottleneck for sizes larger than 10^5 elements - thus numba, numexpr and cython had the same performance in this region.
In the end, the ultimate answer depends on the type of function, hardware, Python distribution and other factors. For example, the Anaconda distribution uses Intel's VML for numpy's functions and thus easily outperforms numba (unless it uses SVML, see this SO post) for transcendental functions like exp, sin, cos and similar - see e.g. the following SO post.
Yet from this investigation and from my experience so far, I would say that numba seems to be the easiest tool with the best performance, as long as no transcendental functions are involved.
Plotting running times with perfplot-package:
import perfplot
perfplot.show(
    setup=lambda n: np.random.rand(n),
    n_range=[2**k for k in range(0,24)],
    kernels=[
        f,
        vf,
        ne_f,
        nb_vf, nb_par_jitf,
        cy_f, cy_par_f,
    ],
    logx=True,
    logy=True,
    xlabel='len(x)'
)
squares = squarer(x)
Arithmetic operations on arrays are automatically applied elementwise, with efficient C-level loops that avoid all the interpreter overhead that would apply to a Python-level loop or comprehension.
Most of the functions you'd want to apply to a NumPy array elementwise will just work, though some may need changes. For example, if statements don't work elementwise. You'd want to convert those to use constructs like numpy.where:
def using_if(x):
    if x < 5:
        return x
    else:
        return x**2
becomes
def using_where(x):
    return numpy.where(x < 5, x, x**2)
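As a quick check (a small illustrative example of my own, not from the original answer), the rewritten version works directly on whole arrays:
import numpy
x = numpy.arange(10)
using_where(x)
# array([ 0,  1,  2,  3,  4, 25, 36, 49, 64, 81])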
It seems that no one has mentioned a built-in factory method for producing a ufunc in the numpy package: np.frompyfunc. I have tested it against np.vectorize, and it outperformed it by about 20~30%. Of course it will not perform as well as prescribed C code or even numba (which I have not tested), but it can be a better alternative than np.vectorize:
f = lambda x, y: x * y
f_arr = np.frompyfunc(f, 2, 1)
vf = np.vectorize(f)
arr = np.linspace(0, 1, 10000)
%timeit f_arr(arr, arr) # 307ms
%timeit vf(arr, arr) # 450ms
I have also tested larger samples, and the improvement is proportional. See also the documentation here.
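One caveat worth adding (my note, not part of the original answer): np.frompyfunc always returns an object-dtype array, so you may want to cast the result back to a numeric dtype:
out = f_arr(arr, arr)
out.dtype                # dtype('O') - an object array of Python floats
out = out.astype(float)  # cast back to a numeric dtype if needed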
Edit: the original answer was misleading, np.sqrt was applied directly to the array, just with a small overhead.
In multidimensional cases where you want to apply a builtin function that operates on a 1d array, numpy.apply_along_axis is a good choice, also for more complex function compositions from numpy and scipy.
Previous misleading statement:
Adding the method:
def along_axis(x):
    return np.apply_along_axis(f, 0, x)
to the perfplot code gives performance results close to np.sqrt.
I believe that in newer versions of numpy (I use 1.13) you can simply call the function by passing the numpy array to the function that you wrote for the scalar type; it will automatically apply the function call to each element of the numpy array and return another numpy array:
>>> import numpy as np
>>> squarer = lambda t: t ** 2
>>> x = np.array([1, 2, 3, 4, 5])
>>> squarer(x)
array([ 1, 4, 9, 16, 25])
As mentioned in this post, just use generator expressions like so:
numpy.fromiter((<some_func>(x) for x in <something>),<dtype>,<size of something>)
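For the squaring example from the question, that template would look something like this (a small concrete sketch):
import numpy as np
x = np.array([1, 2, 3, 4, 5])
squares = np.fromiter((xi ** 2 for xi in x), x.dtype, count=len(x))
# array([ 1,  4,  9, 16, 25])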
All the above answers compare well, but what if you need to use a custom function for mapping, you have a numpy.ndarray, and you need to retain the shape of the array?
I have compared just two approaches, but both retain the shape of the ndarray. I have used an array with 1 million entries for comparison. Here I use the square function, which is also built into numpy and gives a great performance boost; since there was a need for something, you can use a function of your choice.
import numpy, time

def timeit():
    y = numpy.arange(1000000)
    now = time.time()
    numpy.array([x * x for x in y.reshape(-1)]).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.fromiter((x * x for x in y.reshape(-1)), y.dtype).reshape(y.shape)
    print(time.time() - now)
    now = time.time()
    numpy.square(y)
    print(time.time() - now)
Output
>>> timeit()
1.162431240081787 # list comprehension and then building numpy array
1.0775556564331055 # from numpy.fromiter
0.002948284149169922 # using inbuilt function
Here you can clearly see that numpy.fromiter works well compared to the simple approach, and if a built-in function is available, please use that.
Use numpy.fromfunction(function, shape, **kwargs)
See "https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfunction.html"

Printing and reading Numpy arrays efficiently

I would like to print a Numpy array and then read it back. This is what I have done so far:
# printer
import numpy as np
N = 100
x = np.arange(N)
for xi in x:
    print(xi)

# reader
import numpy as np
N = 100
x = np.empty(N)
for i in range(N):
    x[i] = float(input())
This gets the job done, but I think it may not be the most efficient way due to the multiple uses of input(). An alternative I considered is printing only once, reading only once, and processing what I read. This approach has some similarities with this question. In contrast to that question, I have some extra info that could possibly be used to improve performance:
N is known in advance (to both programs)
Arrays are only 1D or 2D (of sizes N and NxN respectively)
Data are float
Data are fully trusted
Thanks in advance.
Edit: I have to add that the value of N will not be that large, even N=1000 will be huge for my problem.
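As a rough sketch of the "print once, read once" idea mentioned above (my illustration, assuming whitespace-separated floats passed over standard output/input):
# printer: write the whole array in one call
import numpy as np
N = 100
x = np.arange(N, dtype=float)
print(" ".join(map(str, x)))

# reader: read a single line and parse it in one call
import sys
import numpy as np
N = 100
x = np.array(sys.stdin.readline().split(), dtype=float)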

I Need a fast way to loop through pixels of an Image/Stack in Python

I have created a 3D median filter which does work and is the following:
def Median_Filter_3D(image, kernel):
    window = np.zeros(shape=(kernel, kernel, kernel), dtype=np.uint8)
    n = (kernel-1)//2  # deals with the image border
    imgout = np.empty_like(image)
    w, h, l = image.shape
    # Start loop over each pixel
    for y in np.arange(0, (w-n*2), 1):
        for x in np.arange(0, (h-n*2), 1):
            for z in np.arange(0, (l-n*2), 1):
                window[:, :, :] = image[x:x+kernel, y:y+kernel, z:z+kernel]
                med = np.median(window)
                imgout[x+n, y+n, z+n] = med
    return imgout
So at every pixel, it creates a window of size kernel x kernel x kernel, finds the median value of the pixels in the window, and replaces the value of that pixel with the new median value.
My problem is that it's very slow, and I have thousands of big images to process. There must be a faster way to iterate through all these pixels and still get the same result.
Thanks in advance!!
First, looping over a 3D matrix in Python is a very, very bad idea. In order to loop over a large 3D matrix you are better off going down to Cython or C/C++/Fortran and creating a Python extension. However, for this particular case, scipy already contains an implementation of the median filter for n-dimensional arrays:
>>> from scipy.ndimage import median_filter
>>> median_filter(my_large_3d_array, size=kernel)
In short, there is no faster way of iterating through voxels in Python (numpy iterators might help a bit, but won't increase the performance considerably). If you need to perform more complicated 3D operations in Python, you should consider programming the loopy part in Cython or, alternatively, using a chunking library such as Dask, which implements parallel operations for chunks of arrays.
The problem with Python is that for loops are extremely slow, especially if they are nested and operate on large arrays. Thus, there is no standard Pythonic method for obtaining efficient iteration over arrays. Usually, the way to get speed-ups is through vectorized operations and numpy tricks, but those are very problem-specific and there is no generic trick; you will learn a lot of numpy tricks here on SO.
As a generic approach, if you really need to iterate over arrays, you can write your code in Cython. Cython is a C-like extension for Python. You write code in Python syntax, but specify variable types (like in C, with int or double). That code is then automatically compiled to C and can be called from Python. A quick example:
Example Python loopy function:
import numpy as np

def iter_A(A):
    B = np.empty(A.shape, dtype=np.float64)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            B[i, j] = A[i, j] * 2
    return B
I know that the above code is kinda redundant and could be written as B = A * 2, but its purpose is just to illustrate that python loops are extremely slow.
Cython version of the function:
import numpy as np
cimport numpy as np
def iter_A_cy(double[:, ::1] A):
    cdef Py_ssize_t H = A.shape[0], W = A.shape[1]
    cdef double[:, ::1] B = np.empty((H, W), dtype=np.float64)
    cdef Py_ssize_t i, j
    for i in range(H):
        for j in range(W):
            B[i, j] = A[i, j] * 2
    return np.asarray(B)
Test speeds of both implementations:
>>> import numpy as np
>>> A = np.random.randn(1000, 1000)
>>> %timeit iter_A(A)
1 loop, best of 3: 399 ms per loop
>>> %timeit iter_A_cy(A)
100 loops, best of 3: 2.11 ms per loop
NOTE: you cannot run the Cython function as it is. You need to put it in a separate file and compile it first (or use the %%cython magic in an IPython Notebook).
This shows that the raw Python version took 400 ms to iterate over the whole array, while the Cython version took only 2 ms (a ~200x speedup).
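For reference, a minimal way to build such a Cython module outside a notebook might look like the sketch below (the filename iter_a_cy.pyx and the use of setuptools/cythonize are my assumptions, not part of the answer above):
# setup.py - build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("iter_a_cy.pyx"),  # hypothetical file holding iter_A_cy
    include_dirs=[np.get_include()],         # needed because the module cimports numpy
)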

Speeding up distance matrix computation with Numpy and Cython

Consider a numpy array A of dimensionality NxM. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way of doing this? This is not exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).
This is the fastest I could come up with so far:
n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))
But is it the fastest way? I'm mainly concerned about the for loop. Can we beat this with, say, Cython?
To avoid looping, I tried to use broadcasting, and do something like this:
D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
But it turned out to be a bad idea, because there's some overhead in constructing an intermediate 3D array of dimensionality NxNxM, so the performance is worse.
I tried Cython, but I am a newbie in Cython, so I don't know how good my attempt is:
def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n,n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A-A[i]).sum(1)).astype(np.float64)
    return dm
The above code was a bit slower than Python's for loop. I don't know much about Cython, but I assume I could achieve at least the same performance as the for loop + numpy. I am wondering whether it is possible to achieve a noticeable performance improvement when done the right way, or whether there's some other way to speed this up (not involving parallel computations)?
The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
There's a very useful tutorial here that covers the process of converting numpy code to Cython and optimizing it.
Here's a quick stab at a more optimized Cython version of your distance function:
import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):
    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)
        double tmpss, diff
    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric
    return D
I saved this in a file called fastdist.pyx. We can use pyximport to simplify the build process:
import pyximport
pyximport.install()
import fastdist
import numpy as np
A = np.random.randn(100, 200)
D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)
print np.allclose(D1, D2)
# True
So it works, at least. Let's do some benchmarking using the %timeit magic:
%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop
%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop
A ~9x speed-up is nice, but not really a game-changer. As you said, though, the big problem with the broadcasting approach is the memory requirement of constructing the intermediate array: for the 1000x2000 input below, the NxNxM float64 intermediate would need 1000*1000*2000*8 bytes, roughly 16 GB.
A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop
I wouldn't recommend trying that using broadcasting...
Another thing we could do is parallelize this over the outermost loop, using the prange function:
from cython.parallel cimport prange
...
for ii in prange(nrow, nogil=True, schedule='guided'):
...
In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this:
$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
-Wall -fno-strict-aliasing -I/usr/include/python2.7 -o fastdist.so fastdist.c
With parallelism, using 8 threads:
%timeit D2 = fastdist.dist_parallel(A2)
1 loops, best of 3: 509 ms per loop

Why is this numpy array operation so slow?

I am a python beginner and I am trying to average two NumPy 2D arrays with shape of (1024,1024). Doing it like this is quite fast:
newImage = (image1 + image2) / 2
But now the images have a "mask" that invalidates certain elements by setting them to zero. That means if one of the elements is zero, the resulting element should also be zero. My trivial solution is:
newImage = numpy.zeros((1024, 1024), dtype=numpy.int16)
for y in xrange(newImage.shape[0]):
    for x in xrange(newImage.shape[1]):
        val1 = image1[y][x]
        val2 = image2[y][x]
        if val1 != 0 and val2 != 0:
            newImage[y][x] = (val1 + val2) / 2
But this is really slow. I did not time it, but it seems to be slower by a factor of 100.
I also tried using a lambda operator and "map", but this does not return a NumPy array.
Try this:
newImage = numpy.where(numpy.logical_and(image1, image2), (image1 + image2) / 2, 0)
Where neither image1 nor image2 equals zero, take their mean; otherwise zero.
Looping with native Python code is generally much slower than using built-in tools that use fast C loops. I'm not familiar with NumPy; can you use map() to do a transformation from your two input arrays to the output? If so, that should be faster.
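For what it's worth, a map()-based version might look roughly like the sketch below (my illustration, with made-up example images); it does return a NumPy array, but it still calls a Python function once per element, so in practice it is far slower than the numpy.where approaches in the other answers:
import numpy as np

# made-up stand-ins for the two masked images from the question
image1 = np.random.randint(0, 100, (1024, 1024)).astype(np.int16)
image2 = np.random.randint(0, 100, (1024, 1024)).astype(np.int16)

# map over the flattened arrays, then rebuild an array of the original shape
avg = np.array(list(map(
    lambda a, b: (a + b) / 2 if a != 0 and b != 0 else 0,
    image1.ravel(), image2.ravel(),
))).reshape(image1.shape)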
Explicit for loops are very inefficient in Python in general, not only for numpy operations. Fortunately, there are faster ways to solve our problem. If memory is not an issue, this solution is quite good:
import numpy as np
new_image = np.zeros((1024, 1024), dtype=np.int16)
valid = (image1!=0) & (image2!=0)
new_image[valid] = (image1+image2)[valid] / 2
Another solution uses masked arrays, which do not create copies of the arrays (they represent views of the original image1/image2):
m1 = np.ma.masked_equal(image1, 0)
m2 = np.ma.masked_equal(image2, 0)
new_image = ((m1+m2)/2).filled(0)
Update: The first solution seems to be 3 times faster than the second for arrays with about 1000 non-zero entries.
Numpy array element access seems slow at best; I can't see any reason for it. You can see this clearly by constructing a simple example:
import numpy
import time

# numpy version
def at(s, n):
    t1 = time.time()
    a = numpy.zeros(s, dtype=numpy.int32)
    for i in range(n):
        a[i % s] = n
    t2 = time.time()
    return t2 - t1

# native version
def an(s, n):
    t1 = time.time()
    a = [(i) for i in range(s)]
    for i in range(n):
        a[i % s] = n
    t2 = time.time()
    return t2 - t1

# test
[at(100000, 1000000), an(100000, 1000000)]
Result: [0.21972250938415527, 0.15950298309326172]
