I use the following code to load 24-bit binary data into a 16-bit numpy array:
temp = numpy.zeros((len(data) // 3, 4), dtype='b')
temp[:, 1:] = numpy.frombuffer(data, dtype='b').reshape(-1, 3)
temp2 = temp.view('<i4').flatten() >> 16  # >> 16 keeps the top 16 bits (i.e. divides by 2**16), which my (audio) application needs for a 16-bit array
output = temp2.astype('int16')
I imagine it's possible to make this more efficient, but how?
It seems like you are being very roundabout here. Won't this do the same thing?
output = np.frombuffer(data,'b').reshape(-1,3)[:,1:].flatten().view('i2')
This saves some time by not zero-filling a temporary array, skipping the bit shift, and avoiding some unnecessary data moves. I haven't actually benchmarked it yet, though, and I expect the savings to be modest.
Edit: I have now performed the benchmark. For len(data) of 12 million, I get 80 ms for your version and 39 ms for mine, so pretty much exactly a factor 2 speedup. Not a very big improvement, as expected, but then your starting point was already pretty fast.
Edit2: I should mention that I have assumed little endian here. However, the original question's code is also implicitly assuming little endian, so this is not a new assumption on my part.
(For big endian (data and architecture), you would replace 1: by :-1. If the data had a different endianness than the CPU, then you would also need to reverse the order of the bytes (::-1).)
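As a hedged sketch of those variants (the sample bytes below are made up just to check the round trip; the slicing follows the rules above):
import numpy as np
# two little-endian 24-bit samples, values 1 and 0x7fffff (made up for testing)
data = b'\x01\x00\x00' + b'\xff\xff\x7f'
# little-endian data on a little-endian machine: keep the two high bytes
out_le = np.frombuffer(data, 'b').reshape(-1, 3)[:, 1:].flatten().view('<i2')
print(out_le)   # the top 16 bits of each sample: 0 and 32767
# big-endian data on a big-endian machine would keep [:, :-1] instead; if the
# data's byte order differs from the CPU's, also reverse with [:, ::-1] before the view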
Edit3: For even more speed, I think you will have to go outside Python. This Fortran function, which also uses OpenMP, gets me a factor 2+ speedup compared to my version (so 4+ times faster than yours).
subroutine f(a, b)
    implicit none
    integer*1, intent(in)  :: a(:)
    integer*1, intent(out) :: b(size(a)*2/3)
    integer :: i
    !$omp parallel do
    do i = 1, size(a)/3
        ! keep the two high-order bytes of each little-endian 24-bit sample
        b(2*(i-1)+1) = a(3*(i-1)+2)
        b(2*(i-1)+2) = a(3*(i-1)+3)
    end do
    !$omp end parallel do
end subroutine
Compile with FOPT="-fopenmp" f2py -c -m basj{,.f90} -lgomp. You can then import and use it in python:
import basj
def convert(data): return basj.f(np.frombuffer(data, 'b')).view('i2')
You can control the number of cores used via the environment variable OMP_NUM_THREADS; it defaults to using all available cores.
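For completeness, a small usage sketch (assuming the module compiled above is importable as basj and that data is a bytes object whose length is a multiple of 3; the thread-count line is optional and should be set before the OpenMP runtime starts):
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")   # optional: limit the thread count

import numpy as np
import basj

data = b'\x01\x00\x00' * 1000                           # made-up 24-bit samples
samples = basj.f(np.frombuffer(data, 'b')).view('i2')   # -> int16 array of length 1000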
Inspired by @amaurea's answer, here is a Cython version (I already used Cython in my original code, so I'll continue with Cython instead of mixing Cython and Fortran):
import cython
import numpy as np
cimport numpy as np

def binary24_to_int16(char *data):
    cdef int i
    cdef char *b
    res = np.zeros(len(data) // 3, np.int16)
    b = <char *>((<np.ndarray> res).data)
    for i in range(len(data) // 3):
        # copy the two high bytes of each 24-bit sample into the int16 buffer
        b[2*i] = data[3*i+1]
        b[2*i+1] = data[3*i+2]
    return res
There is a factor 4 speed gain :)
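For reference, a minimal build sketch for the extension above (assuming the code is saved as binary24.pyx; the numpy include path is only needed because of the cimport numpy line). Build in place with python setup.py build_ext --inplace.
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name="binary24",
    ext_modules=cythonize("binary24.pyx"),
    include_dirs=[np.get_include()],   # header path for the numpy C API
)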
Related
I was comparing the computation time of an explicit for-loop with a vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences: the for-loop took about 646 ms, while np.exp() computed the same result in less than 20 ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using an explicit for-loop and thus the computation time is justified. What is happening "behind the scenes" in the second case?
How can one implement such computations (the second case) without using numpy, in plain Python?
What is happening is that NumPy dispatches the work to highly optimized compiled numerical routines (and, for linear algebra, to libraries such as BLAS), which are very good at vector arithmetic.
I imagine you could call those libraries directly yourself; however, NumPy likely knows best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a whole vector of data at once at the hardware level.
NumPy, on the other hand, has functions implemented in C, and C code can be compiled to use SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
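As a rough illustration of that point (timings are machine-dependent): even a list comprehension, which hides the explicit index loop, still performs one interpreted math.exp call per element and cannot use SIMD, so it stays far behind the single vectorized np.exp call.
import math
import time
import numpy as np

v = np.random.randn(1000000)

t0 = time.time()
x_py = [math.exp(val) for val in v]   # one interpreted call per element
t1 = time.time()
x_np = np.exp(v)                      # one vectorized call into compiled C
t2 = time.time()

print("list comprehension: %.1f ms" % ((t1 - t0) * 1000))
print("np.exp:             %.1f ms" % ((t2 - t1) * 1000))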
I was testing the efficiency of my simple shared C library and comparing it with the numpy implementation.
Library creation: The following function is defined in sum_function.c:
float sum_vector(float* data, int num_row){
    float value = 0.0;
    for (int i = 0; i < num_row; i++){
        value += data[i];
    }
    return value;
}
Library compilation: the shared library sum.so is created by
clang -c sum_function.c
clang -shared -o sum.so sum_function.o
Measurement: a simple numpy array is created and the sum of its elements is calculated using the above function.
from ctypes import *
import numpy as np
N = int(1e7)
data = np.arange(N, dtype=np.float32)
libc = cdll.LoadLibrary("sum.so")
libc.sum_vector.restype = c_float
libc.sum_vector(data.ctypes.data_as(POINTER(c_float)),
c_int(N))
The above function takes 30 ms. However, if I use numpy.sum, the execution time is only 4 ms.
So my question is: what makes numpy a lot faster than my C implementation? I cannot think about any improvement in terms of algorithm for calculating the sum of a vector.
There are many reasons that could be involved, depending even on the compiler you are using. Your numpy backend is in many cases C/C++. In other words, you have to appreciate that languages like C++ allow for a lot more efficiency and closer contact with the hardware, but also demand a lot of knowledge. C++ less so than C, as long as you use the STL as in @PaulMcKenzie's comment. Those are routines that are optimized for runtime performance.
The next thing is memory allocation. Now, your vector seems large enough that the allocator inside <std::vector> will align the memory on the heap. Memory on the stack can end up unaligned, which keeps even std::accumulate slow. Here's an idea of how such an allocator could be written to avoid that: https://github.com/kvahed/codeare/blob/master/src/matrix/Allocator.hpp. This is part of an MRI image reconstruction library I wrote as a PhD student.
A word on SIMD: same library, another aspect: https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp. Doing state-of-the-art arithmetic is anything but trivial.
Both of the above concepts culminate in https://github.com/kvahed/codeare/blob/master/src/matrix/Matrix.hpp, where you can easily outperform any standardized code on a specific machine.
And last but not least: the compiler and the compiler flags. Once debugged, your runtime code should probably be compiled with -O2 -g or even -O3. If you have good test coverage you might even be able to get away with -Ofast, which ditches IEEE math precision. Apart from numerical integration, I have never witnessed issues.
You need to enable optimizations
In addition to that, you have to check whether the compiler is able to use autovectorization. If you want to distribute a compiled binary, you may want to add multiple code paths (AVX2, SSE2) to get a runnable and performant version on all platforms.
A small overview of different implementations and their performance. If you can't beat the numpy sum implementation (binary version installed via pip) on a recent processor, you have done something wrong, but also keep the varying implementation- and compiler-dependent (fastmath) precision in mind. I was too lazy to install clang, but I used Numba, which also has an LLVM backend (the same as Clang).
import numba as nb
import numpy as np
import time
#prints information about SIMD vectorization
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
@nb.njit(fastmath=True)  # eq. -O3, -march=native, fastmath
def sum_nb(ar):
    s1 = 0.  # double-precision accumulator
    for i in range(ar.shape[0]):
        s1 += ar[i]
    return s1
N = int(1e7)
ar = np.random.rand(N).astype(np.float32)
#Numba solution float32 with float64 accumulator
#don't measure compilation time
sum_1=sum_nb(ar)
t1=time.time()
for i in range(1000):
    sum_1 = sum_nb(ar)
print(time.time()-t1)
#Numba solution float64 with float64 accumulator
#don't measure compilation time
arr_64=ar.astype(np.float64)
sum_2=sum_nb(arr_64)
t1=time.time()
for i in range(1000):
    sum_2 = sum_nb(arr_64)
print(time.time()-t1)
#Numpy solution (float32)
t1=time.time()
for i in range(1000):
    sum_3 = np.sum(ar)
print(time.time()-t1)
#Numpy solution (float32, with float64 accumulator)
t1=time.time()
for i in range(1000):
    sum_4 = np.sum(ar, dtype=np.float64)
print(time.time()-t1)
#Numpy solution (float64)
t1=time.time()
for i in range(1000):
    sum_5 = np.sum(arr_64)
print(time.time()-t1)
print(sum_1)
print(sum_2)
print(sum_3)
print(sum_4)
print(sum_5)
Performance
#Numba solution float32 with float64 accumulator: 2.29ms
#Numba solution float64 with float64 accumulator: 4.76ms
#Numpy solution (float32): 5.72ms
#Numpy solution (float32, with float64 accumulator): 7.97ms
#Numpy solution (float64): 10.61ms
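One remark on the precision caveat above: np.sum is documented to use pairwise (blocked) summation for floating-point reductions, which both vectorizes well and accumulates rounding error more slowly than a naive running sum. A pure-Python sketch of the idea (for exposition only; the real implementation is blocked C code with SIMD):
def pairwise_sum(a, block=128):
    # recursively split the array and add the two halves; compared with a
    # single left-to-right accumulator this changes the accumulation order
    # and keeps the rounding-error growth roughly logarithmic
    n = len(a)
    if n <= block:
        s = 0.0
        for x in a:
            s += x
        return s
    mid = n // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)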
I have a big block of Cython code that parses Touchstone files, and I want it to work with both Python 2 and Python 3. I'm using very C-style parsing techniques for what I thought would be maximum efficiency, including manually malloc-ing and free-ing char* buffers instead of using bytes objects so that I can avoid the GIL. When compiled using
python 3.5.2 0 anaconda
cython 0.24.1 py35_0 anaconda
I see speeds that I'm happy with, a moderate boost on small files (~20% faster) and a huge boost on large files (~2.5x faster). When compiled against
python 2.7.12 0 anaconda
cython 0.24.1 py27_0 anaconda
It runs about 125x slower (~17 ms in Python 3 vs ~2.2 s in Python 2). It's the exact same code compiled in different environments using a pretty simple setuptools script. I'm not currently using NumPy from Cython for any of the parsing or data storage.
import cython
cimport cython
from cython cimport array
import array
from libc.stdlib cimport strtod, malloc, free
from libc.string cimport memcpy

ctypedef long long int64_t  # Really VS2008? Couldn't include this by default?

# Bunch of definitions and utility functions omitted

@cython.boundscheck(False)
cpdef Touchstone parse_touchstone(bytes file_contents, int num_ports):
    cdef:
        char c
        char* buffer = <char*> file_contents
        int64_t length_of_buffer = len(file_contents)
        int64_t i = 0
        # These are some cpdef enums
        FreqUnits freq_units
        Domain domain
        Format fmt
        double z0
        bint option_line_found = 0
        array.array data = array.array('d')
        array.array row = array.array('d', [0 for _ in range(row_size)])

    while i < length_of_buffer:
        c = buffer[i]  # cdef char c
        if is_whitespace(c):
            i += 1
            continue
        if is_comment_char(c):
            # Returns the last index of the comment
            i = parse_comment(buffer, length_of_buffer)
            continue
        if not option_line_found and is_option_leader_char(c):
            # Returns the last index of the option line and
            # assigns values to all references passed in
            i = parse_option_line(
                buffer, length_of_buffer, i,
                &domain, &fmt, &z0, &freq_units)
            if i < 0:
                # Lots of boring code along the lines of
                #     if i == some_int:
                #         raise Exception("message")
                # I did this so that only my top-level parse has to interact
                # with the interpreter; all the lower-level functions have nogil
            option_line_found = 1
        if option_line_found:
            if is_digit(c):
                # Parse a float
                row[row_idx] = strtod(buffer + i, &end_of_value)
                # Jump the cursor to the end of that float
                i = end_of_value - p - 1
                row_idx += 1
                if row_idx == row_size:
                    # append this row onto the main data array
                    data.extend(row)
                    row_idx = 0
        i += 1
    return Touchstone(num_ports, domain, fmt, z0, freq_units, data)
I've ruled out a few things, such as type casts. I also tested a version where the code simply loops over the entire file doing nothing; either Cython optimized that away or it's just really fast, because parse_touchstone doesn't even show up in a cProfile/pstats report for it. I determined that it's not just the comment, whitespace, and option-line parsing (the significantly more complicated keyword-value parsing isn't shown) after I threw a print statement into the last if row_idx == row_size block to print out a status, and discovered that it takes about 0.5-1 second (a guesstimate) to parse a row with 512 floating-point numbers on it. That really should not take so long, especially when using strtod to do the parsing. I also tried parsing just 2 rows' worth of values and then jumping out of the while loop; that told me that parsing the comments, whitespace, and option line took up about 800 ms (1/3 of the overall time), and that was for 6 lines of text totaling less than 150 bytes.
Am I just missing something here? Is there a small trick that would cause Cython code to run 3 orders of magnitude slower in Python 2 than Python 3?
(Note: I haven't shown the full code here because I'm not sure if I'm allowed to for legal reasons and because it's about 450 lines total)
The problem is with strtod, which is not optimized in VS2008. Apparently it internally calculates the length of the input string each time it's called, and if you call it with a long string this slows down your code considerably. To circumvent this you have to write a wrapper around strtod that works on small buffers at a time (see the above link for one example of how to do this), or write your own strtod function.
I am comparing the performance of numpy vs Matlab; in several cases I have observed that numpy is significantly slower (indexing, simple operations on arrays such as absolute value, multiplication, sum, etc.). Let's look at the following example, which is somewhat striking, involving the function digitize (which I plan to use for synchronizing timestamps):
import numpy as np
import time
scale=np.arange(1,1e+6+1)
y=np.arange(1,1e+6+1,10)
t1=time.time()
ind=np.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
The result is:
Time passed is 55.91 seconds
Let's now try the same example in Matlab, using the equivalent function histc:
scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])
The result is:
Time passed is 0.10237 seconds
That's 560 times faster!
As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):
import analysis # my C++ module implementing digitize
t1=time.time()
ind2=analysis.digitize(scale,y)
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind2) #ok
The result is:
Time passed is 0.02 seconds
There is a bit of cheating here, as my version of digitize assumes the inputs are all monotonic, which might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), therefore making the overall performance of my function worse (by a factor of approximately 1.6) than that of the Matlab function histc.
So the questions are:
Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?
I am using Fedora 16 and I recently installed the ATLAS and LAPACK libraries (but there has been no change in performance). Should I perhaps rebuild numpy? I am not sure if my installation of numpy uses the appropriate libraries to gain maximum speed; perhaps Matlab is using better libraries.
Update
Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram if someone (like me in this case) does not care about the histogram. I need the second output of histc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor of 2:
t1=time.time()
ind3=np.searchsorted(y,scale,"right")
t2=time.time()
print 'Time passed is %2.2f seconds' %(t2-t1)
np.all(ind==ind3) #ok
The result is
Time passed is 0.21 seconds
So the questions are now:
What is the sense of having numpy.digitize if there is an equivalent function numpy.searchsorted which is 280 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?
First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):
static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = 0; i < lbins; i ++ ) {
        if ( x < bins [i] ) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = lbins - 1; i >= 0; i -- ) {
        if (x < bins [i]) {
            return i + 1;
        }
    }
    return 0;
}
As you can see, it is doing a linear search. Linear search is much, much slower than binary search so there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.
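For illustration, here is the same contrast in pure Python (linear_slot and binary_slot are hypothetical helper names): the linear scan that digitize was using versus the binary search that searchsorted uses; for nondecreasing bins both return the same slot.
import bisect

def linear_slot(x, bins):
    # mirrors incr_slot_ above: O(len(bins)) comparisons per value
    for i, b in enumerate(bins):
        if x < b:
            return i
    return len(bins)

def binary_slot(x, bins):
    # same result for nondecreasing bins, but O(log(len(bins))) per value
    return bisect.bisect_right(bins, x)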
Second, I think that Matlab is slower than your C++ code because, while Matlab (like your version) assumes that the bins are monotonically nondecreasing, your version additionally assumes that the values being binned are sorted.
I can't answer why numpy.digitize() is so slow -- but I can confirm your timings on my machine.
The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.
ind = np.searchsorted(y, scale, "right")
takes about 0.15 seconds on my machine and gives exactly the same result as your code.
Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().
Before the question can get answered, several subquestions need to be addressed:
In order to get more reliable results, you should run several iterations of the tests and average their results. This would somewhat eliminate startup effects, which do not have anything to do with the algorithm. Also, try to use larger data for the same purpose.
Use the same algorithms across the frameworks. This has already been addressed in other answers here.
Make sure the algorithms are really similar enough. How do they utilize system resources? How is memory iterated over? If (just as an example) a Matlab algorithm uses repmat and the numpy one does not, the comparison is not fair.
How does the corresponding framework parallelize? This is possibly connected to your individual machine / processor configuration. Matlab does parallelize some (but by far not all) builtin functions. I don't know about numpy/CPython.
Use a memory profiler in order to find out how both implementations behave from that performance point of view.
Afterwards (this is only a guess) we will probably find that numpy often does behave slower than Matlab. Many questions here on SO come to the same conclusion. One explanation could be that Matlab has an easier job optimizing array access, because it does not need to take into account a whole collection of general-purpose objects (as CPython does). The requirements on mathematical arrays are much lower than those on general arrays. numpy, on the other hand, is built on CPython, which must serve the full Python language - not only numpy. However, according to this comparison test (among many others), Matlab is still pretty slow ...
I don't think you are comparing the same functions in numpy and matlab. The equivalent to histc is np.histogram as far as I can tell from looking at the documentation. I don't have matlab to do a comparison, but when I do the following on my machine:
In [7]: import numpy as np
In [8]: scale=np.arange(1,1e+6+1)
In [9]: y=np.arange(1,1e+6+1,10)
In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop
I get a number that is approximately equivalent to what you get for histc.
I am getting really weird timings for the following code:
import numpy as np
s = 0
for i in range(10000000):
    s += np.float64(1)  # replace with np.float32 and built-in float
built-in float: 4.9 s
float64: 10.5 s
float32: 45.0 s
Why is float64 twice as slow as float? And why is float32 five times slower than float64?
Is there any way to avoid the penalty of using np.float64, and have numpy functions return built-in float instead of float64?
I found that using numpy.float64 is much slower than Python's float, and numpy.float32 is even slower (even though I'm on a 32-bit machine), yet I want to work with numpy.float32 on my 32-bit machine. Therefore, every time I use various numpy functions such as numpy.random.uniform, I convert the result to float32 (so that further operations are performed at 32-bit precision).
Is there any way to set a single variable somewhere in the program or in the command line, and make all numpy functions return float32 instead of float64?
EDIT #1:
numpy.float64 is 10 times slower than float in arithmetic calculations. It's so bad that even converting to float and back before the calculations makes the program run 3 times faster. Why? Is there anything I can do to fix it?
I want to emphasize that my timings are not due to any of the following:
the function calls
the conversion between numpy and python float
the creation of objects
I updated my code to make it clearer where the problem lies. With the new code, it would seem I see a ten-fold performance hit from using numpy data types:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
    s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 34.56s
float32: 35.11s
float: 3.53s
Just for the hell of it, I also tried:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
s = np.float64(1)
for i in range(10000000):
    s = float(s)
    s = (s + 8) * s % 2399232
    s = np.float64(s)
print(s)
print('Runtime:', datetime.now() - START_TIME)
The execution time is 13.28 s; it's actually 3 times faster to convert the float64 to float and back than to use it as is. Still, the conversion takes its toll, so overall it's more than 3 times slower compared to the pure-python float.
My machine is:
Intel Core 2 Duo T9300 (2.5GHz)
WinXP Professional (32-bit)
ActiveState Python 3.1.3.5
Numpy 1.5.1
EDIT #2:
Thank you for the answers, they help me understand how to deal with this problem.
But I would still like to know the precise reason (based on the source code, perhaps) why the code below runs 10 times slower with float64 than with float.
EDIT #3:
I reran the code under Windows 7 x64 (Intel Core i7 930 @ 3.8 GHz).
Again, the code is:
from datetime import datetime
import numpy as np
START_TIME = datetime.now()
# one of the following lines is uncommented before execution
#s = np.float64(1)
#s = np.float32(1)
#s = 1.0
for i in range(10000000):
    s = (s + 8) * s % 2399232
print(s)
print('Runtime:', datetime.now() - START_TIME)
The timings are:
float64: 16.1s
float32: 16.1s
float: 3.2s
Now both np floats (either 64 or 32) are 5 times slower than the built-in float. Still, a significant difference. I'm trying to figure out where it comes from.
END OF EDITS
CPython floats are allocated in chunks
The key problem with comparing numpy scalar allocations to the float type is that CPython always allocates the memory for float and int objects in blocks of size N.
Internally, CPython maintains a linked list of blocks each large enough to hold N float objects. When you call float(1) CPython checks if there is space available in the current block; if not it allocates a new block. Once it has space in the current block it simply initializes that space and returns a pointer to it.
On my machine each block can hold 41 float objects, so there is some overhead for the first float(1) call but the next 40 run much faster as the memory is allocated and ready.
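A tiny demonstration of that reuse (a CPython implementation detail, so the result is typical rather than guaranteed): a float created right after another one is freed will usually land in the slot that was just returned to the free list.
import random

a = random.random()   # a fresh float object, not a cached constant
addr = id(a)
del a                 # its slot goes back onto the float free list
b = random.random()
print(id(b) == addr)  # usually True on CPython, but not guaranteed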
Slow numpy.float32 vs. numpy.float64
It appears that numpy has 2 paths it can take when creating a scalar type: fast and slow. This depends on whether the scalar type has a Python base class to which it can defer for argument conversion.
For some reason numpy.float32 is hard-coded to take the slower path (defined by the _WORK0 macro), while numpy.float64 gets a chance to take the faster path (defined by the _WORK1 macro). Note that scalartypes.c.src is a template which generates scalartypes.c at build time.
You can visualize this in Cachegrind. I've included screen captures showing how many more calls are made to construct a float32 vs. float64:
float64 takes the fast path
float32 takes the slow path
Updated - Which type takes the slow/fast path may depend on whether the OS is 32-bit vs 64-bit. On my test system, Ubuntu Lucid 64-bit, the float64 type is 10 times faster than float32.
Operating with Python objects in a heavy loop like that, whether they are float, np.float32 or np.float64, is always slow. NumPy is fast for operations on vectors and matrices, because all of the operations are performed on big chunks of data by parts of the library written in C, and not by the Python interpreter. Code run in the interpreter and/or using Python objects is always slow, and using non-native types makes it even slower. That's to be expected.
If your app is slow and you need to optimize it, you should try either converting your code to a vector solution that uses NumPy directly, and is fast, or you could use tools such as Cython to create a fast implementation of the loop in C.
Perhaps that is why you should use NumPy directly instead of using loops.
s1 = np.ones(10000000, dtype=np.float)
s2 = np.ones(10000000, dtype=np.float32)
s3 = np.ones(10000000, dtype=np.float64)
np.sum(s1) <-- 17.3 ms
np.sum(s2) <-- 15.8 ms
np.sum(s3) <-- 17.3 ms
The answer is quite simple: the memory allocation might be part of it, but the biggest problem is that arithmetic operations on numpy scalars are done using "ufuncs", which are meant to be fast for several hundred values, not just one. There is some overhead in choosing the correct function to call and setting up the loops - overhead which is unnecessary for scalars.
It was easier to just have the scalars be converted to 0-d arrays and then passed to the corresponding numpy ufunc than to write separate calculation methods for each of the many different scalar types that NumPy supports.
The intent was that optimized versions of the scalar math would be added to the type-objects in C. This could still happen, but it never has happened because no-one has been motivated enough to do it. Possibly because the work-around is to convert numpy scalars to Python scalars which do have optimized arithmetic.
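A hedged sketch of that work-around: pull the numpy scalar back into a plain Python float (via float() or .item()) before a scalar-heavy loop, and only convert back at the end if something downstream really needs a numpy type.
import numpy as np

s = np.float64(1).item()       # now a built-in float
for _ in range(10000000):
    s = (s + 8) * s % 2399232  # plain Python-float arithmetic inside the loop
s = np.float64(s)              # back to a numpy scalar only if required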
Summary
If an arithmetic expression contains both numpy and built-in numbers, the arithmetic runs slower. Avoiding this mixing removes almost all of the performance degradation I reported.
Details
Note that in my original code:
s = np.float64(1)
for i in range(10000000):
    s = (s + 8) * s % 2399232
the types float and numpy.float64 are mixed up in one expression. Perhaps Python had to convert them all to one type?
s = np.float64(1)
for i in range(10000000):
    s = (s + np.float64(8)) * s % np.float64(2399232)
If the runtime were unchanged (rather than increased), it would suggest that this is indeed what Python was doing under the hood, which would explain the performance drag.
Actually, the runtime fell by a factor of 1.5! How is that possible? Wasn't the worst thing Python could possibly have had to do these two conversions?
I don't really know. Perhaps Python had to dynamically check what needs to be converted into what, which takes time, and being told what precise conversions to perform makes it faster. Perhaps, some entirely different mechanism is used for arithmetics (which doesn't involve conversions at all), and it happens to be super-slow on mismatched types. Reading numpy source code might help, but it's beyond my skill.
Anyway, now we can obviously speed things up more by moving the conversions out of the loop:
q = np.float64(8)
r = np.float64(2399232)
for i in range(10000000):
    s = (s + q) * s % r
As expected, the runtime is reduced substantially: by another 2.3 times.
To be fair, we now need to change the float version slightly, by moving the literal constants out of the loop. This results in a tiny (10%) slowdown.
Accounting for all these changes, the np.float64 version of the code is now only 30% slower than the equivalent float version; the ridiculous 5-fold performance hit is largely gone.
Why do we still see the 30% delay? numpy.float64 numbers take the same amount of space as float, so that won't be the reason. Perhaps the resolution of the arithmetic operators takes longer for user-defined types. Certainly not a major concern.
If you're after fast scalar arithmetic, you should be looking at libraries like gmpy rather than numpy (as others have noted, the latter is optimised more for vector operations rather than scalar ones).
I can confirm the results also. I tried to see what it would look like using all numpy types, and the difference persists. So then, my tests were:
from datetime import datetime
import numpy as np

def testStandard(length=100000):
    s = 1.0
    addend = 8.0
    modulo = 2399232.0
    startTime = datetime.now()
    for i in xrange(length):
        s = (s + addend) * s % modulo
    return datetime.now() - startTime

def testNumpy(length=100000):
    s = np.float64(1.0)
    addend = np.float64(8.0)
    modulo = np.float64(2399232.0)
    startTime = datetime.now()
    for i in xrange(length):
        s = (s + addend) * s % modulo
    return datetime.now() - startTime
So at this point, the numpy types are all interacting with each other, but the 10x difference persists (2 sec vs 0.2 sec).
If I had to guess, I would say that there are two possible reasons why the default float types are much faster. The first possibility is that Python performs significant optimizations under the hood for dealing with certain numeric operations or looping in general (e.g. loop unrolling). The second possibility is that the numpy types involve an extra layer of abstraction (i.e. having to read from an address). To look into the effects of each, I did a few extra checks.
One difference could be the result of Python having to take extra steps to resolve the float64 types. Unlike compiled languages that generate efficient tables, Python 2.6 (and maybe 3) has a significant cost for resolving things that you'd generally think of as free. Even a simple X.a resolution has to resolve the dot operator EVERY time it is called. (Which is why, if you have a loop that calls instance.function(), you're better off declaring a variable "function = instance.function" outside the loop.)
From my understanding, when you use Python's standard operators, these are fairly similar to using the ones from "import operator". If you substitute add, mul, and mod for your +, *, and %, you see a static performance hit of about 0.5 sec versus the standard operators (in both cases). This means that by wrapping the operators, the standard Python float operations get 3x slower. Going one step further, using operator.add and those variants adds on approximately 0.7 sec (over 1M trials, starting from 2 sec and 0.2 sec respectively). That's verging on 5x slowness. So basically, if each of these issues happens twice, you're basically at the 10x slower point.
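A sketch of the substitution described above (the 0.5 s / 0.7 s figures are the author's; loop_infix and loop_operator are just my names for the two forms being compared, and the timings will vary by machine and Python version):
import operator
import time

def loop_infix(n=1000000):
    s = 1.0
    for _ in range(n):
        s = (s + 8.0) * s % 2399232.0   # infix operators
    return s

def loop_operator(n=1000000):
    s = 1.0
    for _ in range(n):
        # same computation expressed through the operator module
        s = operator.mod(operator.mul(operator.add(s, 8.0), s), 2399232.0)
    return s

for f in (loop_infix, loop_operator):
    t0 = time.time()
    f()
    print("%s: %.3f s" % (f.__name__, time.time() - t0))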
So let's assume we're the Python interpreter for a moment. Case 1, we do an operation on native types, let's say a+b. Under the hood, we can check the types of a and b and dispatch our addition to Python's optimized code. Case 2, we have an operation on two other types (also a+b). Under the hood, we check if they're native types (they're not). We move on to the 'else' case. The else case sends us to something like a.__add__(b). a.__add__ can then dispatch to numpy's optimized code. So at this point we have had the additional overhead of an extra branch, one '.' attribute/slot lookup, and a function call. And we've only gotten into the addition operation. We then have to use the result to create a new float64 (or alter an existing float64). Meanwhile, the Python native code probably cheats by treating its types specially to avoid this sort of overhead.
Based on the above examination of the costliness of Python function calls and scoping overhead, it would be pretty easy for numpy to incur a 9x penalty just getting to and from its C math functions. I can entirely imagine this process taking many times longer than a simple math operation call. For each operation, the numpy library has to wade through layers of Python to get to its C implementation.
So in my opinion, the reason for this is probably captured in this effect:
length = 10000000

class A():
    X = 10

startTime = datetime.now()
for i in xrange(length):
    x = A.X
print "Long Way", datetime.now() - startTime

startTime = datetime.now()
y = A.X
for i in xrange(length):
    x = y
print "Short Way", datetime.now() - startTime
This simple case shows a difference of 0.2 sec vs 0.14 sec (short way faster, obviously). I think what you're seeing is mainly just a bunch of those issues adding up.
To avoid this, I can think of a couple of possible solutions that mainly echo what has been said. The first solution is to try to keep your evaluations inside NumPy as much as possible, as Selinap said. A large amount of the losses are probably due to the interfacing. I would look into ways to dispatch your job to numpy or some other numeric library optimized in C (gmpy has been mentioned). The goal should be to push as much into C at a time as possible and then get the result(s) back. You want to put in big jobs, not lots of small jobs.
The second solution, of course, would be to do more of your intermediate and small operations in Python if you can. Clearly, using the native objects is going to be faster. They're going to be the first option on all the branch statements and will always have the shortest path to C code. Unless you have a specific need for fixed-precision calculation or other issues with the default operators, I don't see why one wouldn't use the straight Python functions for many things.
Really strange... I can confirm the results on Ubuntu 11.04 32-bit, Python 2.7.1, numpy 1.5.1 (official packages):
import numpy as np

def testfloat():
    s = 0
    for i in range(10000000):
        s += float(1)

def testfloat32():
    s = 0
    for i in range(10000000):
        s += np.float32(1)

def testfloat64():
    s = 0
    for i in range(10000000):
        s += np.float64(1)
%time testfloat()
CPU times: user 4.66 s, sys: 0.06 s, total: 4.73 s
Wall time: 4.74 s
%time testfloat64()
CPU times: user 11.43 s, sys: 0.07 s, total: 11.50 s
Wall time: 11.57 s
%time testfloat32()
CPU times: user 47.99 s, sys: 0.09 s, total: 48.08 s
Wall time: 48.23 s
I don't see why float32 should be five times slower than float64.