Numpy calculation too slow compared to Fortran - python

I have mainly used Fortran, but recently I started using Python and NumPy for deep-learning applications.
However, computing a double for-loop in Python was significantly slower than in Fortran. I know Fortran is fast at raw computation, but I would like to know if there is anything wrong with my Python code. This is the code I used:
for it in range(nt):
    if it % 20 == 1:
        print(it, '//', nt)
    itimenum4 = "%.4i" % (it)
    ppsix2[:, :] = 0.; ppsiz2[:, :] = 0.
    apsix2[:, :] = 0.; apsiz2[:, :] = 0.
    ax[:, :] = 0.; az[:, :] = 0.
    p3[:, :] = 0.
    for iz in range(2, nnz + 1):
        for ix in range(2, nnx + 1):
            pdx2 = (p2[ix+1, iz] - p2[ix, iz])*a1 + (p2[ix+2, iz] - p2[ix-1, iz])*a2
            pdz2 = (p2[ix, iz+1] - p2[ix, iz])*a1 + (p2[ix, iz+2] - p2[ix, iz-1])*a2
            dpml0 = math.log(1./R)*3.*vp[ix, iz]/(2.*dx*pml)
            if ix <= pml:
                dpml = dpml0*(float(pml-ix+1)/float(pml))**2
                damp = math.exp(-dpml*dt)
                ppsix2[ix, iz] = damp*ppsix1[ix, iz] + (damp-1.)*pdx2
            if ix > nx + pml:
                dpml = dpml0*(float(ix-(nx+pml))/float(pml))**2
                damp = math.exp(-dpml*dt)
                ppsix2[ix, iz] = damp*ppsix1[ix, iz] + (damp-1.)*pdx2
            if iz > nz:
                dpml = dpml0*(float(iz-nz)/float(pml))**2
                damp = math.exp(-dpml*dt)
                ppsiz2[ix, iz] = damp*ppsiz1[ix, iz] + (damp-1.)*pdz2
            ax[ix, iz] = pdx2 + ppsix2[ix, iz]
            az[ix, iz] = pdz2 + ppsiz2[ix, iz]
    az[:, 1] = az[:, 2]
    az[:, 0] = az[:, 3]
    for iz in range(2, nnz + 1):
        for ix in range(2, nnx + 1):
            adx = a1*(ax[ix, iz]-ax[ix-1, iz]) + a2*(ax[ix+1, iz]-ax[ix-2, iz])
            adz = a1*(az[ix, iz]-az[ix, iz-1]) + a2*(az[ix, iz+1]-az[ix, iz-2])
            dpml0 = math.log(1./R)*3.*vp[ix, iz]/(2.*dx*pml)
            if ix <= pml:
                dpml = dpml0*(float(pml-ix+1)/float(pml))**2
                damp = math.exp(-dpml*dt)
                apsix2[ix, iz] = damp*apsix1[ix, iz] + (damp-1.)*adx
            if ix > nx + pml:
                dpml = dpml0*(float(ix-(nx+pml))/float(pml))**2
                damp = math.exp(-dpml*dt)
                apsix2[ix, iz] = damp*apsix1[ix, iz] + (damp-1.)*adx
            if iz > nz:
                dpml = dpml0*(float(iz-nz)/float(pml))**2
                damp = math.exp(-dpml*dt)
                apsiz2[ix, iz] = damp*apsiz1[ix, iz] + (damp-1.)*adz
            px2 = adx + apsix2[ix, iz]
            pz2 = adz + apsiz2[ix, iz]
            p3[ix, iz] = 2.*p2[ix, iz] - p1[ix, iz] + (dt*vp[ix, iz])**2*((px2+pz2) + srcf[ix, iz]*source[it])
            if iz == recz:
                mod_p[ix, it] = p3[ix, iz]
    p3[:, 0:2] = 0.
    p1[:, :] = p2[:, :]
    p2[:, :] = p3[:, :]
    ppsix1[:, :] = ppsix2[:, :]
    ppsiz1[:, :] = ppsiz2[:, :]
    apsix1[:, :] = apsix2[:, :]
    apsiz1[:, :] = apsiz2[:, :]
This is the conventional first-order acoustic wave equation used in geophysics, implemented with NumPy arrays. Are there any factors that make the calculation slow?
Thanks.

Unfortunately, since Python is an interpreted language (whereas Fortran is compiled), you will not get comparable speed from Python for-loops.
There are a few ways around this. One is to vectorize your loops using NumPy arrays and NumPy math functions. This hands the computationally heavy parts of the code off to the precompiled NumPy library.
Another way to speed up your Python code is to use a library that compiles it for you. One such library is Numba, which provides just-in-time compilation.
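As a minimal sketch of the Numba route (showing only the x-direction PML branches of the first loop; the array names and constants are those from the question), the hot double loop moves into a function that compiles to machine code on first call:

import math
import numba

@numba.njit
def update_ppsix(p2, ppsix1, ppsix2, vp, a1, a2, R, dx, dt, pml, nx, nnx, nnz):
    # Identical loop body to the question's; Numba removes the
    # per-element Python interpreter overhead.
    for iz in range(2, nnz + 1):
        for ix in range(2, nnx + 1):
            pdx2 = (p2[ix+1, iz] - p2[ix, iz])*a1 + (p2[ix+2, iz] - p2[ix-1, iz])*a2
            dpml0 = math.log(1./R)*3.*vp[ix, iz]/(2.*dx*pml)
            if ix <= pml:
                dpml = dpml0*((pml - ix + 1)/pml)**2
                damp = math.exp(-dpml*dt)
                ppsix2[ix, iz] = damp*ppsix1[ix, iz] + (damp - 1.)*pdx2
            if ix > nx + pml:
                dpml = dpml0*((ix - (nx + pml))/pml)**2
                damp = math.exp(-dpml*dt)
                ppsix2[ix, iz] = damp*ppsix1[ix, iz] + (damp - 1.)*pdx2

The rest of the time step can be ported the same way; on loops like this, a speedup of one to two orders of magnitude over interpreted Python is typical.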

Almost none of the work is vectorized -- e.g., the code extracts single elements such as ppsix1[ix, iz] for use in a scalar multiplication.
NumPy doesn't compile your code down into something fast the way Fortran does -- it is fast because it provides fast routines for operating on large chunks of data at once. If you don't explicitly express the vectorizable portions of your algorithm as whole-array operations, NumPy can be slower than even vanilla Python, since scalar indexing into a NumPy array costs more than indexing a plain list.
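For example (a sketch reusing the question's array names and loop bounds), the pdx2 stencil for the entire interior grid can be computed in one whole-array expression instead of element by element:

# Each line touches the whole interior block (ix = 2..nnx, iz = 2..nnz)
# at once; NumPy runs the arithmetic in compiled loops.
pdx2 = (p2[3:nnx+2, 2:nnz+1] - p2[2:nnx+1, 2:nnz+1]) * a1 \
     + (p2[4:nnx+3, 2:nnz+1] - p2[1:nnx,   2:nnz+1]) * a2

The PML branches can be handled the same way, by applying the damping expression to the ix <= pml and ix > nx + pml slabs as slices.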

Related

Does calling a numpy function in a vectorized operation affect performance?

I am new to Python and currently studying the NumPy package. I come from the C/C++ world, so maybe my question is stupid. When using vectorized operations in NumPy, I assumed that they parallelize the execution like OpenMP does.
I came across a piece of code in a Udacity tutorial which calculated a standardized 1D array in the following way:
standardized = (array - array.mean()) / array.std()
where array is a NumPy array. So in my eyes, NumPy would parallelize the following 'single' instructions to get better performance:
standardized[0] = (array[0] - array.mean()) / array.std()
standardized[1] = (array[1] - array.mean()) / array.std()
...
...
standardized[n] = (array[n] - array.mean()) / array.std()
where n is the size of the array. So in every iteration I would call mean() and std(), which would be recalculated each time and therefore cost a lot of time. In a 'C way' I would do something like this to increase performance:
mean = array.mean()
std = array.std()
standardized = (array - mean) / std
I measured times for both calculations and got nearly the same time each run. In fact, which method came out fastest depended on which one I ran first. Additionally, I only used arrays filled with zeros; maybe this has an impact, too.
So my question is: how does Python (or NumPy) 'parallelize' the vectorized execution, and how does it deal with function calls that should always return the same value within one iteration?
I hope my questions are clear and understandable. I could not find any sources that deal with this use case.
standardized = (array - array.mean()) / array.std()
is a Python expression which gets evaluated as:
temp1 = array.mean()
temp2 = array.std()
temp3 = (array - temp1)
temp4 = temp3 / temp2
array.mean is a NumPy 'builtin' method, which means it is written in compiled code. The same is true of std, and of the subtraction and division of two arrays. Note that each method is called exactly once, not once per element.
NumPy provides the building blocks; Python provides the glue to join them together. Generally, the best strategy is to maximize the use of those NumPy methods and avoid loops at the Python level. Sometimes a few loops over a complex operation are better, and sometimes plain Python is better (creating an array from lists takes time).
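Since mean() and std() are each evaluated exactly once in both variants, timing them should (and does) give essentially the same result. A minimal check (a sketch; absolute numbers vary by machine):

import timeit
import numpy as np

array = np.random.rand(1000000)

def one_liner():
    return (array - array.mean()) / array.std()

def precomputed():
    m = array.mean()  # statistics computed once, bound to names
    s = array.std()
    return (array - m) / s

print(timeit.timeit(one_liner, number=100))
print(timeit.timeit(precomputed, number=100))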
There are tools for building custom compiled blocks -- Cython, Numba, etc.
I'm not aware of any OpenMP-style parallelization in NumPy's core. The speed gains come from compiled C/Fortran code and specialized libraries such as LAPACK/BLAS. You can roll your own parallelization using multiprocessing if you can afford the marshaling cost.
There seems to be a way to enable OpenMP if you build it yourself: https://docs.scipy.org/doc/scipy/reference/building/linux.html
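A minimal multiprocessing sketch of that idea (the chunking scheme here is hypothetical, and it is only worthwhile when the array is large enough to amortize the marshaling cost):

import numpy as np
from functools import partial
from multiprocessing import Pool

def scale_chunk(chunk, m, s):
    # The global statistics are passed in, so every chunk is
    # standardized against the whole array, not just itself.
    return (chunk - m) / s

if __name__ == '__main__':
    array = np.random.rand(4000000)
    m, s = array.mean(), array.std()  # cheap global reductions, done once
    with Pool(4) as pool:
        parts = pool.map(partial(scale_chunk, m=m, s=s),
                         np.array_split(array, 4))
    standardized = np.concatenate(parts)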

For loop vs Numpy vectorization computation time

I was randomly comparing the computation times of an explicit for-loop with a vectorized implementation in NumPy. I ran exactly 1 million iterations and found some astounding differences: the for-loop took about 646 ms, while the np.exp() function computed the same result in less than 20 ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
1. What exactly is happening in the second case? The first one uses an explicit for-loop, so its computation time is understandable; what is happening "behind the scenes" in the second case?
2. How can one implement such computations (the second case) without using numpy, in plain Python?
What is happening is that NumPy dispatches the work to precompiled numerical code: np.exp is a ufunc whose inner loop is written in C, and for linear algebra NumPy calls high-quality libraries such as BLAS, which are very good at vector arithmetic.
You could in principle call such compiled routines yourself, but NumPy already packages well-chosen implementations for you.
NumPy is a Python wrapper over libraries and code written in C, and this is a large part of its efficiency. C code compiles directly to instructions executed by your processor. Python code, on the other hand, must be interpreted as it executes. Despite the ever-increasing speed we can get from interpreted languages with advances like just-in-time compilers, for some tasks they will never approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy's functions, by contrast, are built in C, a language capable of emitting SIMD instructions, so NumPy can take advantage of the vectorization hardware in your processor.
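On the second question from the post: a plain-Python equivalent (a sketch) would apply math.exp over the values directly. This is about as fast as interpreted Python gets here, but it is still nowhere near np.exp:

import math
# One interpreted pass; tolist() converts to plain floats first, so
# each math.exp call avoids NumPy's per-element scalar boxing.
x_plain = [math.exp(val) for val in v.ravel().tolist()]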

Why is numpy faster than my c/c++ code for summing an array of float?

I was testing the efficiency of my simple shared C library and comparing it with the NumPy implementation.
Library creation: The following function is defined in sum_function.c:
float sum_vector(float* data, int num_row){
    float value = 0.0;
    for (int i = 0; i < num_row; i++){
        value += data[i];
    }
    return value;
}
Library compilation: the shared library sum.so is created by
clang -c sum_function.c
clang -shared -o sum.so sum_function.o
Measurement: a simple numpy array is created and the sum of its elements is calculated using the above function.
from ctypes import *
import numpy as np
N = int(1e7)
data = np.arange(N, dtype=np.float32)
libc = cdll.LoadLibrary("sum.so")
libc.sum_vector.restype = c_float
libc.sum_vector(data.ctypes.data_as(POINTER(c_float)), c_int(N))
The above function takes 30 ms. However, if I use numpy.sum, the execution time is only 4 ms.
So my question is: what makes NumPy so much faster than my C implementation? I cannot think of any algorithmic improvement for calculating the sum of a vector.
There are many reasons that could be involved, depending even on the compiler you are using. The NumPy backend is in many cases C/C++. In other words, you have to appreciate that languages like C and C++ allow much more efficiency and closeness to the hardware, but also demand a lot of knowledge -- C++ less so than C, as long as you use the STL, as in @PaulMcKenzie's comment. Those routines are optimized for runtime performance.
The next thing is memory allocation. Your vector seems large enough that the allocator inside std::vector will align the memory on the heap. Memory on the stack can end up unaligned, which keeps even std::accumulate slow. Here is an idea of how such an allocator could be written to avoid that: https://github.com/kvahed/codeare/blob/master/src/matrix/Allocator.hpp. This is part of an MRI image-reconstruction library I wrote as a PhD student.
A word on SIMD, from the same library: https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp. Doing state-of-the-art arithmetic is anything but trivial.
Both of the above concepts culminate in https://github.com/kvahed/codeare/blob/master/src/matrix/Matrix.hpp, where you can easily outperform any standardized code on a specific machine.
And last but not least: the compiler and the compiler flags. Once debugged, your runtime code should probably be compiled with -O2 -g or even -O3. If you have good test coverage, you might even get away with -Ofast, which ditches IEEE math precision. Apart from numerical integration, I have never witnessed issues.
You need to enable optimizations.
In addition, you have to check whether the compiler is able to auto-vectorize your loop. If you want to distribute a compiled binary, you may want to add multiple code paths (AVX2, SSE2) to get a runnable and performant version on all platforms.
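For instance, the question's library could be recompiled with optimization, target-specific code generation, and relaxed float semantics (a sketch; these flags are one reasonable choice, not the only one, and -ffast-math matters because strict IEEE float addition is not associative, so the reduction loop cannot be auto-vectorized without it):

clang -O3 -march=native -ffast-math -c sum_function.c
clang -shared -o sum.so sum_function.o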
Below is a small overview of different implementations and their performance. If you can't beat the NumPy sum implementation (the binary version installed via pip) on a recent processor, you have done something wrong -- but also keep the implementation- and compiler-dependent (fastmath) precision differences in mind. I was too lazy to install clang, so I used Numba, which also has an LLVM backend (the same one clang has).
import numba as nb
import numpy as np
import time

# prints information about SIMD vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit(fastmath=True)  # roughly eq. to -O3 -march=native -ffast-math
def sum_nb(ar):
    s1 = 0.  # float64 accumulator
    for i in range(ar.shape[0]):
        s1 += ar[i]
    return s1

N = int(1e7)
ar = np.random.rand(N).astype(np.float32)

# Numba solution: float32 data with float64 accumulator
# (don't measure compilation time)
sum_1 = sum_nb(ar)
t1 = time.time()
for i in range(1000):
    sum_1 = sum_nb(ar)
print(time.time() - t1)

# Numba solution: float64 data with float64 accumulator
# (don't measure compilation time)
arr_64 = ar.astype(np.float64)
sum_2 = sum_nb(arr_64)
t1 = time.time()
for i in range(1000):
    sum_2 = sum_nb(arr_64)
print(time.time() - t1)

# Numpy solution (float32)
t1 = time.time()
for i in range(1000):
    sum_3 = np.sum(ar)
print(time.time() - t1)

# Numpy solution (float32, with float64 accumulator)
t1 = time.time()
for i in range(1000):
    sum_4 = np.sum(ar, dtype=np.float64)
print(time.time() - t1)

# Numpy solution (float64)
t1 = time.time()
for i in range(1000):
    sum_5 = np.sum(arr_64)
print(time.time() - t1)

print(sum_1)
print(sum_2)
print(sum_3)
print(sum_4)
print(sum_5)
Performance:
Numba solution, float32 with float64 accumulator: 2.29 ms
Numba solution, float64 with float64 accumulator: 4.76 ms
Numpy solution (float32): 5.72 ms
Numpy solution (float32, with float64 accumulator): 7.97 ms
Numpy solution (float64): 10.61 ms

My python codes in general are very slow, is this normal?

I recently began teaching myself Python and have been using it for an online course on algorithms. For some reason, much of the code I have written for this course runs very slowly (relative to C/C++/Matlab code I have written in the past), and I'm starting to worry that I am not using Python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
    a = 1 + 1
end
Python
for i in list(range(0, 100000000)):
    a = 1 + 1
The Matlab code takes about 0.3 seconds; the Python code takes about 7 seconds. Is this normal? My Python code for more complex problems is very slow. For example, as a homework assignment I'm running depth-first search on a graph with about 900000 nodes, and it is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don't fret too much about performance -- plan to optimize later when needed.
That is one of the reasons Python integrates with so many high-performance computing backends, such as NumPy, OpenBLAS, and even CUDA, to name just a few.
The best way forward if you want more performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (e.g., by using xrange instead of range in Python 2.7) won't get you dramatic results.
Here is a bit of code that compares the different approaches:
Your original list(range())
The suggested use of xrange()
Leaving the unused i out
Using numpy to do the addition as a vector operation on numpy arrays
Using CUDA to do the vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
    "for i in list(range(1000000)): a = 1+1",
    "for i in xrange(1000000): a = 1+1",
    "for _ in xrange(1000000): a = 1+1",
    "import numpy; one = numpy.ones(1000000); a = one+one",
    "import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;"
    "one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph not reproduced here), run with Python 2.7.13 on OS X:
The reason NumPy performs faster than the CUDA solution here is that the overhead of using CUDA does not beat the efficiency of Python+NumPy at this size. For larger, floating-point calculations, CUDA does even better than NumPy.
Note that the NumPy solution performs more than 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (depth-first search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that xrange generates the values lazily as you use them, whereas range builds the entire list up front.
Unfortunately, Python's amazing flexibility and ease come at the cost of speed. For such large iteration counts, I also suggest the itertools module, which provides fast, memory-efficient iteration helpers.
xrange is a good solution; however, if you want to iterate over dictionaries and such, it is better to use itertools, which can iterate over any type of sequence object.
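For example (a sketch), when the loop index is unused, itertools.repeat avoids manufacturing an index object on every iteration, which is about the fastest pure-Python loop available:

import itertools
# repeat(None, n) yields the same object n times, so the loop does
# no integer boxing at all, unlike range/xrange.
for _ in itertools.repeat(None, 100000000):
    a = 1 + 1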

Why is numpy much slower than matlab on a digitize example?

I am comparing the performance of NumPy vs. Matlab, and in several cases I have observed that NumPy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is quite striking, involving the function digitize (which I plan to use for synchronizing timestamps):
import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
The result is:
Time passed is 55.91 seconds
Let's now try the same example in Matlab, using the equivalent function histc:
scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])
The result is:
Time passed is 0.10237 seconds
That's 560 times faster!
As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):
import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok
The result is:
Time passed is 0.02 seconds
There is a bit of cheating here, as my version of digitize assumes the inputs are all monotonic; this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), therefore making the performance of my function worse (by a factor of approximately 1.6) than the Matlab function histc.
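For reference, the linear-merge idea such a sorted-input digitize relies on looks like this in Python (a sketch; the actual extension is C++ and its internals are not shown here):

def digitize_sorted(scale, y):
    # Two-pointer merge over two ascending sequences:
    # O(len(scale) + len(y)) total, instead of a search per element.
    ind = [0]*len(scale)
    j = 0
    for i, x in enumerate(scale):
        while j < len(y) and y[j] <= x:
            j += 1
        ind[i] = j
    return ind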
So the questions are:
Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?
I am using Fedora 16, and I recently installed the ATLAS and LAPACK libraries (but there has been no change in performance). Should I perhaps rebuild NumPy? I am not sure whether my installation of NumPy uses the appropriate libraries to gain maximum speed; perhaps Matlab is using better ones.
Update
Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if (like me in this case) one does not care about the histogram itself. I need the second output of histc, which is a mapping from input values to the indices of the provided input bins. Such an output is provided by the NumPy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor of 2:
t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind3) #ok
The result is
Time passed is 0.21 seconds
So the questions are now:
What is the sense of having numpy.digitize if there is an equivalent function numpy.searchsorted which is 280 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?
First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):
static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;
    for (i = 0; i < lbins; i++) {
        if (x < bins[i]) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;
    for (i = lbins - 1; i >= 0; i--) {
        if (x < bins[i]) {
            return i + 1;
        }
    }
    return 0;
}
As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.
Second, I think that Matlab is slower than your C++ code because Matlab cannot assume, as your code does, that the input values are already sorted, so it must search for each value's bin rather than sweep through both arrays once.
I can't answer why numpy.digitize() is so slow -- but I can confirm your timings on my machine.
The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.
ind = np.searchsorted(y, scale, "right")
takes about 0.15 seconds on my machine and gives exactly the same result as your code.
Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().
Before the question can be answered, several subquestions need to be addressed:
In order to get more reliable results, you should run several iterations of the tests and average their results. This would eliminate startup effects, which have nothing to do with the algorithm. Also, try to use larger data for the same purpose.
Use the same algorithms across the frameworks. This has already been addressed in other answers here.
Make sure the algorithms are really similar enough. How do they utilize system resources? How is memory iterated over? If (just as an example) a Matlab algorithm uses repmat and the NumPy one does not, the comparison is not fair.
How does the corresponding framework parallelize? This possibly depends on your individual machine/processor configuration. Matlab parallelizes some (but by far not all) builtin functions. I don't know about NumPy/CPython.
Use a memory profiler to find out how both implementations behave from that performance point of view.
Afterwards (this is only a guess) we will probably find that NumPy often behaves slower than Matlab. One explanation could be that Matlab has an easier job optimizing array access, because it does not need to take into account a whole collection of general-purpose objects (as CPython does). The requirements on mathematical arrays are much lower than those on general arrays. NumPy, on the other hand, sits on top of CPython, which must serve the full Python language, not only NumPy. However, according to this comparison test (among many others), Matlab is still pretty slow...
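On the first of those points, timeit is the standard tool for repeated, averaged measurements; a sketch using the arrays from the question:

import timeit
import numpy as np

scale = np.arange(1, 1e6 + 1)
y = np.arange(1, 1e6 + 1, 10)

# repeat=5 runs 5 batches of 10 calls each; taking the minimum
# suppresses startup and scheduling noise.
times = timeit.repeat(lambda: np.searchsorted(y, scale, 'right'),
                      repeat=5, number=10)
print(min(times) / 10)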
I don't think you are comparing the same functions in NumPy and Matlab. The equivalent of histc is np.histogram, as far as I can tell from the documentation. I don't have Matlab to do a comparison, but when I do the following on my machine:
In [7]: import numpy as np
In [8]: scale=np.arange(1,1e+6+1)
In [9]: y=np.arange(1,1e+6+1,10)
In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop
I get a number that is approximately equivalent to what you get for histc.
