Caching jit-compiled functions in numba

Caching jit-compiled functions in numba - python

I want to compile a range oft functions using numba and as I only need to run them on my machine with the same signatures, I want to cache them.
But when attempting to do so, numba tells me that the function cannot be cached because it uses large global arrays. This is the specific warning it displayed.
NumbaWarning: Cannot cache compiled function "sigmoid" as it uses dynamic globals (such as ctypes pointers and large global arrays)
I am aware that global arrays are usually frozen but large ones aren't, but as my function looks like this:
#njit(parallel=True, cache=True)
def sigmoid(x):
return 1./(1. + np.exp(-x))
I cannot see any global arrays, especially large ones.
Where is the problem?

I observed this behavior too (running on: Windows 10, Dell Latitude 7480, Git for Windows), even for very simple tests. It seems parallel=True doesn't allow caching. This is independent from the actual presence of of prange calls. Below a simple example.
def where_numba(arr: np.ndarray) -> np.ndarray:
l0, l1 = np.shape(arr)[0], np.shape(arr)[1]
for i0 in prange(l0):
for i1 in prange(l1):
if arr[i0, i1] > 0.5:
arr[i0, i1] = arr[i0, i1] * 10
return(arr)
where_numba_jit = jit(signature_or_function='float64[:,:](float64[:,:])',
nopython=True, parallel=True, cache=True, fastmath=True, nogil=True)(where_numba)
arr = np.random.random((10000, 10000))
seln = where_numba_jit(arr)
I get the same warning.
I think you may consider your specific codes and see which option (cache or parallel) is better to keep, clearly cache for relative short calculation times and parallel when the compilation time may be negligible compared to the actual calculation time. Please, comment if you have updates.
There is also an open Numba issue on this:
https://github.com/numba/numba/issues/2439

Related

Why does Numba skew the timings of a JIT-compiled function?

I'm trying to benchmark a Python function that does list operations using Numba against the CPython interpreter. To compare end-to-end time I used the Linux time utility.
time python3.10 list.py
As I understand the first invocation will be expensive due to JIT compilation, but it does not explain why the maximum recorded time is longer than the total time taken to run the entire script.
# list.py
import numpy as np
from time import time, perf_counter
from numba import njit
#njit
def listOperations():
list = []
for i in range(1000):
list.append(i)
list.sort(reverse=True)
list.remove(420)
list.reverse()
if __name__ == "__main__":
repetitions = 1000
timings = np.zeros(repetitions)
for rep in range(repetitions):
start = time() # Similar results with perf_counter too.
listOperations()
timings[rep] = time() - start
# Convert to milliseconds
timings *= 10e3
print("Mean {}ms, Median {}ms, Std. Dev {}ms, Min {}ms, Max {}ms".format(
float('%.4f' % np.mean(timings)),
float('%.4f' % np.median(timings)),
float('%.4f' % np.std(timings)),
float('%.4f' % np.min(timings)),
float('%.4f' % np.max(timings)))
)
For Numba it shows maximum of ~66.3s while the time utility reports ~8s. The complete results are below.
'''
Numba --->
Mean 66.8154ms, Median 0.391ms, Std. Dev 2097.7752ms, Min 0.3219ms, Max 66371.1143ms
real 0m7.982s
user 0m8.248s
sys 0m0.100s
CPython3.10 --->
Mean 1.6395ms, Median 1.6284ms, Std. Dev 0.0708ms, Min 1.5759ms, Max 2.3198ms
real. 0m1.115s
user 0m1.468s
sys 0m0.080s
'''

The main issue is that the compilation time is included in the timings. Indeed, Numba compiles the functions lazily. To prevent this, you must specify the prototype or to execute the first function call outside (which is generally a good practice in benchmarks).
You can use #njit('()') instead of #njit. With this fix, the Numba code is about twice faster on my machine.
Note that your function does not return anything nor read anything in parameter so the JIT can optimize the function to a no-op. To avoid biases, you certainly need to add a parameter, to use it and to return the list. This is apparently not the case on my machine but different versions of Numba may do that.
Note also that Numba list are generally not where Numba shine. Lists are generally slow (both with and without Numba). It is better to use array when the size is known.
By the way, list is a built-in function. Overwriting it can cause sneaky bugs in modules using it (frequent) so this is not a good idea. I advise you to use another name.
Furthermore, note that the standard deviation was pretty big in the results, the median time was good and the maximum time was very big indicating that the timings were not stable and that this instability was due to one slow call. Such results generally indicates the benchmark is flawed or the function itself has an unstable behaviour (typically due to a bug or an initialization done once).

Does calling a numpy function in a vectorized operation affect performance?

I am new to python and currently studying the numPy package. I come from the C/C++ world, so maybe my question is stupid. When using vectorized operations in numPy, I assume that they parallelize the execution like openMP does.
I came across a piece of code in an udacity tutorial, which calculated a standardized 1D-array in the following way:
standardized = (array - array.mean()) / array.std()
where array is a numPy array. So in my eyes numPy would parallelize the following 'single' instructions to get a better performance:
standardized[0] = (array[0] - array.mean()) / array.std()
standardized[1] = (array[1] - array.mean()) / array.std()
...
...
standardized[n] = (array[n] - array.mean()) / array.std()
where n is the size of the array. So in every iteration, I would call mean() and std() which gets always calculated and therefore needs a lot of time. In a 'C way' I would do something like this, to increase performance:
mean = array.mean()
std = array.std()
standardized = (array - mean) / std
I measured times for both calculations and nearly got always the same time. In fact, it depends on which method I use first, which is the fastest. Additionally, I only used array filled with zeros, maybe this has an impact, too.
So my question is, how does python (or numPy) 'parallalize' the vectorized execution and how does it deal with function calls, which should always return the same value in one iteration.
I hope my questions are clear and understandable. I could not find any sources which deals with this use-case.

standardized = (array - array.mean()) / array.std()
is a Python expression which gets evaluated as:
temp1 = array.mean()
temp2 = array.std()
temp3 = (array - temp1)
temp4 = temp3 / temp2
array.mean is a numpy 'builtin' method, which means it's written in compiled code. Same for std. And for subtraction and division of two arrays.
numpy provides building blocks, python provides the glue to join them together. Generally the best strategy is to maximize the use of those numpy methods. And avoid loops at the Python level. Sometimes a few loops on a complex operation is better, and sometimes using basic Python is better (creating an array from lists takes time).
There are tools for building custom compiled blocks - cython, numba etc.

I'm not aware of any OpenMP-style parallelization in numpy. Speed-gains come from using C/Fortran/specialised libraries such as LAPack/BLAS etc. You can roll your own parallelization using multiprocessing if you can afford the marshaling cost.
There seems to be a way to enable OpenMP if you build yourself: https://docs.scipy.org/doc/scipy/reference/building/linux.html

For loop vs Numpy vectorization computation time

I was randomly comparing the computation times of an explicit for-loop with vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. For-loop took about 646ms while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using
an explicit for-loop and thus the computation time is justified.
What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?

What is happening is that NumPy is calling high quality numerical libraries (BLAS for instance) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy, however, NumPy would likely know best which to use.

NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.

It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (Single instruction, multiple data) assembly instructions that most modern CPU's and GPU's have. These SIMD instruction allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy on the other hand has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.

Why is numpy much slower than matlab on a digitize example?

I am comparing performance of numpy vs matlab, in several cases I observed that numpy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is somehow striking, involving the function digitize (which I plan to use for synchronizing timestamps):
import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
The result is:
Time passed is 55.91 seconds
Let's now try the same example Matlab using the equivalent function histc
scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])
The result is:
Time passed is 0.10237 seconds
That's 560 times faster!
As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):
import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok
The result is:
Time passed is 0.02 seconds
There is a bit of cheating as my version of digitize assumes inputs are all monotonic, this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), making therefore the performance of my function worse (by a factor of approx 1.6) compared to the Matlab function histc.
So the questions are:
Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?
I am using Fedora 16 and I recently installed ATLAS and LAPACK libraries (but there has been so change in performance). Should I perhaps rebuild numpy? I am not sure if my installation of numpy uses the appropriate libraries to gain maximum speed, perhaps Matlab is using better libraries.
Update
Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if someone (like me in this case) does not care about the histogram. I need the second output of hisc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor 2:
t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind3) #ok
The result is
Time passed is 0.21 seconds
So the questions are now:
What is the sense of having numpy.digitize if there is an equivalent function numpy.searchsorted which is 280 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?

First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):
static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
npy_intp i;
for ( i = 0; i < lbins; i ++ ) {
if ( x < bins [i] ) {
return i;
}
}
return lbins;
}
static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
npy_intp i;
for ( i = lbins - 1; i >= 0; i -- ) {
if (x < bins [i]) {
return i + 1;
}
}
return 0;
}
As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.
Second, I think that Matlab is actually slower than your C++ code because Matlab also assumes that the bins are monotonically nondecreasing.

I can't answer why numpy.digitize() is so slow -- I could confirm your timings on my machine.
The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.
ind = np.searchsorted(y, scale, "right")
takes about 0.15 seconds on my machine and gives exactly the same result as your code.
Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().

Before the question can get answered, several subquestions need to be addressed:
In order to get more reliable results, you should run several
iterations of the tests and average their results. This would
somehow eliminate startup effects, which do not have anything to do
with the algorithm. Also, try to use larger data for the same
purpose.
Use the same algortihms across the frameworks. This has already
been addressed in other answers here.
Make sure, the algorithms are really similar enough. How do they
utilize system ressources? How is iterated over memory ? If (just an
example) a Matlab algorithm uses repmat and the numpy would not, the
comparison is not fair.
How does the corresponding framework parallelize? This possibly
is connected to your individual machine / processor configuration.
Matlab does parallelize some (but by far not all) builtin functions.
I dont know about numpy/CPython.
Use a memory profiler in order to find out, how both implementations
behave from that performance point of view.
Afterwards (this is only a guess) we probably will find out, numpy does often behave slower than Matlab. Many questions here at SO come to the same conclusion. One explanation could be, that Matlab has an easier job to optimize array access, because it does not need to take into account a whole collection of general purpose objects (like CPython). The requirements on mathematical arrays are much lower than those on general arrays. numpy on the other hand does utilize CPython, which must serve the full python library - not only numpy. However, according to this comparison test (among many others) Matlab is still pretty slow ...

I don't think you are comparing the same functions in numpy and matlab. The equivalent to histc is np.histogram as far as I can tell from looking at the documentation. I don't have matlab to do a comparison, but when I do the following on my machine:
In [7]: import numpy as np
In [8]: scale=np.arange(1,1e+6+1)
In [9]: y=np.arange(1,1e+6+1,10)
In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop
I get a number that is approximately equivalent to what you get for histc.

numpy float: 10x slower than builtin in arithmetic operations?

I am getting really weird timings for the following code:
import numpy as np
s = 0
for i in range(10000000):
s += np.float64(1) # replace with np.float32 and built-in float
built-in float: 4.9 s
float64: 10.5 s
float32: 45.0 s
Why is float64 twice slower than float? And why is float32 5 times slower than float64?
Is there any way to avoid the penalty of using np.float64, and have numpy functions return built-in float instead of float64?
I found that using numpy.float64 is much slower than Python's float, and numpy.float32 is even slower (even though I'm on a 32-bit machine).
numpy.float32 on my 32-bit machine. Therefore, every time I use various numpy functions such as numpy.random.uniform, I convert the result to float32 (so that further operations would be performed at 32-bit precision).
Is there any way to set a single variable somewhere in the program or in the command line, and make all numpy functions return float32 instead of float64?
EDIT #1:
numpy.float64 is 10 times slower than float in arithmetic calculations. It's so bad that even converting to float and back before the calculations makes the program run 3 times faster. Why? Is there anything I can do to fix it?
I want to emphasize that my timings are not due to any of the following:
the function calls
the conversion between numpy and python float
the creation of objects
I updated my code to make it clearer where the problem lies. With the new code, it would seem I see a ten-fold performance hit from using numpy data types:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 34.56s
float32: 35.11s
float: 3.53s
Just for the hell of it, I also tried:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
s = np.float64(1)
for i in range(10000000):
s = float(s)
s = (s + 8) * s % 2399232
s = np.float64(s)
print(s)
print('Runtime:', datetime.now() - START_TIME)
The execution time is 13.28 s; it's actually 3 times faster to convert the float64 to float and back than to use it as is. Still, the conversion takes its toll, so overall it's more than 3 times slower compared to the pure-python float.
My machine is:
Intel Core 2 Duo T9300 (2.5GHz)
WinXP Professional (32-bit)
ActiveState Python 3.1.3.5
Numpy 1.5.1
EDIT #2:
Thank you for the answers, they help me understand how to deal with this problem.
But I still would like to know the precise reason (based on the source code perhaps) why the code below runs 10 times slow with float64 than with float.
EDIT #3:
I rerun the code under the Windows 7 x64 (Intel Core i7 930 # 3.8GHz).
Again, the code is:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 16.1s
float32: 16.1s
float: 3.2s
Now both np floats (either 64 or 32) are 5 times slower than the built-in float. Still, a significant difference. I'm trying to figure out where it comes from.
END OF EDITS

CPython floats are allocated in chunks
The key problem with comparing numpy scalar allocations to the float type is that CPython always allocates the memory for float and int objects in blocks of size N.
Internally, CPython maintains a linked list of blocks each large enough to hold N float objects. When you call float(1) CPython checks if there is space available in the current block; if not it allocates a new block. Once it has space in the current block it simply initializes that space and returns a pointer to it.
On my machine each block can hold 41 float objects, so there is some overhead for the first float(1) call but the next 40 run much faster as the memory is allocated and ready.
Slow numpy.float32 vs. numpy.float64
It appears that numpy has 2 paths it can take when creating a scalar type: fast and slow. This depends on whether the scalar type has a Python base class to which it can defer for argument conversion.
For some reason numpy.float32 is hard-coded to take the slower path (defined by the _WORK0 macro), while numpy.float64 gets a chance to take the faster path (defined by the _WORK1 macro). Note that scalartypes.c.src is a template which generates scalartypes.c at build time.
You can visualize this in Cachegrind. I've included screen captures showing how many more calls are made to construct a float32 vs. float64:
float64 takes the fast path
float32 takes the slow path
Updated - Which type takes the slow/fast path may depend on whether the OS is 32-bit vs 64-bit. On my test system, Ubuntu Lucid 64-bit, the float64 type is 10 times faster than float32.

Operating with Python objects in a heavy loop like that, whether they are float, np.float32, is always slow. NumPy is fast for operations on vectors and matrices, because all of the operations are performed on big chunks of data by parts of the library written in C, and not by the Python interpreter. Code run in the interpreter and/or using Python objects is always slow, and using non-native types makes it even slower. That's to be expected.
If your app is slow and you need to optimize it, you should try either converting your code to a vector solution that uses NumPy directly, and is fast, or you could use tools such as Cython to create a fast implementation of the loop in C.

Perhaps, that is why you should use Numpy directly instead of using loops.
s1 = np.ones(10000000, dtype=np.float)
s2 = np.ones(10000000, dtype=np.float32)
s3 = np.ones(10000000, dtype=np.float64)
np.sum(s1) <-- 17.3 ms
np.sum(s2) <-- 15.8 ms
np.sum(s3) <-- 17.3 ms

The answer is quite simple: the memory allocation might be part of it, but the biggest problem is that arithmetic operations for numpy scalars is done using "ufuncs" which are meant to be fast for several hundred values not just 1. There is some overhead in choosing the correct function to call and setting up the loops. Overhead which is un-necessary for scalars.
It was easier to just have the scalars be converted to 0-d arrays and then passed to the corresponding numpy ufunc then write separate calculation methods for each of the many different scalar types that NumPy supports.
The intent was that optimized versions of the scalar math would be added to the type-objects in C. This could still happen, but it never has happened because no-one has been motivated enough to do it. Possibly because the work-around is to convert numpy scalars to Python scalars which do have optimized arithmetic.

Summary
If an arithmetic expression contains both numpy and built-in numbers, Python arithmetics works slower. Avoiding this conversion removes almost all of the performance degradation I reported.
Details
Note that in my original code:
s = np.float64(1)
for i in range(10000000):
s = (s + 8) * s % 2399232
the types float and numpy.float64 are mixed up in one expression. Perhaps Python had to convert them all to one type?
s = np.float64(1)
for i in range(10000000):
s = (s + np.float64(8)) * s % np.float64(2399232)
If the runtime is unchanged (rather than increased), it would suggest that's what Python indeed was doing under the hood, explaining the performance drag.
Actually, the runtime fell by 1.5 times! How is it possible? Isn't the worst thing that Python could possibly have to do was these two conversions?
I don't really know. Perhaps Python had to dynamically check what needs to be converted into what, which takes time, and being told what precise conversions to perform makes it faster. Perhaps, some entirely different mechanism is used for arithmetics (which doesn't involve conversions at all), and it happens to be super-slow on mismatched types. Reading numpy source code might help, but it's beyond my skill.
Anyway, now we can obviously speed things up more by moving the conversions out of the loop:
q = np.float64(8)
r = np.float64(2399232)
for i in range(10000000):
s = (s + q) * s % r
As expected, the runtime is reduced substantially: by another 2.3 times.
To be fair, we now need to change the float version slightly, by moving the literal constants out of the loop. This results in a tiny (10%) slowdown.
Accounting for all these changes, the np.float64 version of the code is now only 30% slower than the equivalent float version; the ridiculous 5-fold performance hit is largely gone.
Why do we still see the 30% delay? numpy.float64 numbers take the same amount of space as float, so that won't be the reason. Perhaps the resolution of the arithmetic operators takes longer for user-defined types. Certainly not a major concern.

If you're after fast scalar arithmetic, you should be looking at libraries like gmpy rather than numpy (as others have noted, the latter is optimised more for vector operations rather than scalar ones).

I can confirm the results also. I tried to see what it would look like using all numpy types, and the difference persists. So then, my tests were:
def testStandard(length=100000):
s = 1.0
addend = 8.0
modulo = 2399232.0
startTime = datetime.now()
for i in xrange(length):
s = (s + addend) * s % modulo
return datetime.now() - startTime
def testNumpy(length=100000):
s = np.float64(1.0)
addend = np.float64(8.0)
modulo = np.float64(2399232.0)
startTime = datetime.now()
for i in xrange(length):
s = (s + addend) * s % modulo
return datetime.now() - startTime
So at this point, the numpy types are all interacting with each other, but the 10x difference persists (2 sec vs 0.2 sec).
If I had to guess, I would say that there are two possible reasons for why the default float types are much faster. The first possibility is that python performs significant optimizations under the hood for dealing with certain numeric operations or looping in general (e.g. loop unrolling). The second possibility is that the numpy types involves an extra layer of abstraction (i.e. having to read from an address). To look into the effects of each, I did a few extra checks.
One difference could be the result of python having to take extra steps to resolve the float64 types. Unlike compiled languages that generate efficient tables, python 2.6 (and maybe 3) has a significant cost for resolving things that you'd generally think of as free. Even a simple X.a resolution has to resolve the dot operator EVERY time it is called. (Which is why if you have a loop that calls instance.function() you're better off having a variable "function = instance.function" declared outside the loop).
From my understanding, when you use python standard operators, these are fairly similar to using the ones from "import operator." If you substitute add, mul, and mod in for your +, *, and %, you see a static performance hit of about 0.5 sec versus the standard operators (to both cases). This means that by wrapping the operators, the standard python float operations get 3x slower. If you do one further, using operator.add and those variants adds on 0.7 sec approximately (over 1m trials, starting with 2 sec and 0.2 sec respectively). That's verging on the 5x slowness. So basically, if each of these issues happens twice, you're basically at the 10x slower point.
So let's assume we're the python interpreter for a moment. Case 1, we do an operation on native types, let's say a+b. Under the hood, we can check the types of a and b and dispatch our addition to python's optimized code. Case 2, we have an operation of two other types (also a+b). Under the hood, we check if they're native types (they're not). We move on to the 'else' case. The else case sends us to something like a.add(b). a.add can then do a dispatch to numpy's optimized code. So at this point we have had additional overhead of an extra branch, one '.' get slots property, and a function call. And we've only got into the addition operation. We then have to use the result to create a new float64 (or alter an existing float64). Meanwhile, the python native code probably cheats by treating its types specially to avoid this sort of overhead.
Based on the above examination of the costliness of python function calls and scoping overhead, it would be pretty easy for numpy to incur a 9x penalty just getting to and from its c math functions. I can entirely imagine this process taking many times longer than a simple math operation call. For each operation, the numpy library will have to wade through layers of python to get to its C implementation.
So in my opinion, the reason for this is probably captured in this effect:
length = 10000000
class A():
X = 10
startTime = datetime.now()
for i in xrange(length):
x = A.X
print "Long Way", datetime.now() - startTime
startTime = datetime.now()
y = A.X
for i in xrange(length):
x = y
print "Short Way", datetime.now() - startTime
This simple case shows a difference of 0.2 sec vs 0.14 sec (short way faster, obviously). I think what you're seeing is mainly just a bunch of those issues adding up.
To avoid this, I can think of a a couple possible solutions that mainly echo what has been said. The first solution is to try to keep your evaluations inside NumPy as much as possible, as Selinap said. A large amount of the losses are probably due to the interfacing. I would look into ways to dispatch your job into numpy or some other numeric library optimized in C (gmpy has been mentioned). The goal should be to push as much into C at the same time as possible, then get the result(s) back. You want to put in big jobs, not lots of small jobs.
The second solution, of course, would be to do more of your intermediate and small operations in python if you can. Clearly, using the native objects are going to be faster. They're going to be the first options on all the branch statements and will always have the shortest path to C code. Unless you have a specific need for fixed precision calculation or other issues with the default operators, I don't see why one wouldn't use the straight python functions for many things.

Really strange...I confirm the results in Ubuntu 11.04 32bit, python 2.7.1, numpy 1.5.1 (official packages):
import numpy as np
def testfloat():
s = 0
for i in range(10000000):
s+= float(1)
def testfloat32():
s = 0
for i in range(10000000):
s+= np.float32(1)
def testfloat64():
s = 0
for i in range(10000000):
s+= np.float64(1)
%time testfloat()
CPU times: user 4.66 s, sys: 0.06 s, total: 4.73 s
Wall time: 4.74 s
%time testfloat64()
CPU times: user 11.43 s, sys: 0.07 s, total: 11.50 s
Wall time: 11.57 s
%time testfloat32()
CPU times: user 47.99 s, sys: 0.09 s, total: 48.08 s
Wall time: 48.23 s
I don't see why float32 should be 5 times slower that float64.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.