I have the following code:
import numpy as np
A = np.random.random((128,4,46,23)) + np.random.random((128,4,46,23)) * 1j
signal = np.random.random((355,256,4)) + np.random.random((355,256,4)) * 1j
# start timing here
Signal = np.fft.fft(signal, axis=1)[:,:128,:]
B = Signal[..., None, None] * A[None,...]
B = B.sum(1)
b = np.fft.ifft(B, axis=1).real
b_squared = b**2
res = b_squared.sum(1)
I need to run this code for two different values of A each time. The problem is that the code is too slow to be used in a real-time application. I tried using np.einsum, and although it did speed things up a little, it wasn't enough for my application.
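For reference, the einsum form I tried looked roughly like this (it computes the same B as the broadcast-multiply-and-sum above, without the large intermediate array):
# equivalent to (Signal[..., None, None] * A[None, ...]).sum(1)
B = np.einsum('ijk,jklm->iklm', Signal, A)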
So now I'm trying to speed things up using a GPU, but I'm not sure how. I looked into OpenCL, and multiplying two matrices seems fine, but I'm not sure how to do it with complex numbers and matrices with more than two dimensions (I guess using a for loop to send two axes at a time to the GPU). I also don't know how to do something like array.sum(axis). I have worked with a GPU before, using OpenGL.
I would have no problem using C++ to optimize the code if needed, or using something besides OpenCL, as long as it works with more than one GPU manufacturer (so no CUDA).
Edit:
Running cProfile:
import cProfile
def f():
    Signal = np.fft.fft(signal, axis=1)[:,:128,:]
    B = Signal[..., None, None] * A[None,...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b**2
    res = b_squared.sum(1)
cProfile.run("f()", sort="cumtime")
Output:
56 function calls (52 primitive calls) in 1.555 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.555 1.555 {built-in method builtins.exec}
1 0.005 0.005 1.555 1.555 <string>:1(<module>)
1 1.240 1.240 1.550 1.550 <ipython-input-10-d4613cd45f64>:3(f)
2 0.000 0.000 0.263 0.131 {method 'sum' of 'numpy.ndarray' objects}
2 0.000 0.000 0.263 0.131 _methods.py:45(_sum)
2 0.263 0.131 0.263 0.131 {method 'reduce' of 'numpy.ufunc' objects}
6/2 0.000 0.000 0.047 0.024 {built-in method numpy.core._multiarray_umath.implement_array_function}
2 0.000 0.000 0.047 0.024 _pocketfft.py:49(_raw_fft)
2 0.047 0.023 0.047 0.023 {built-in method numpy.fft._pocketfft_internal.execute}
1 0.000 0.000 0.041 0.041 <__array_function__ internals>:2(ifft)
1 0.000 0.000 0.041 0.041 _pocketfft.py:189(ifft)
1 0.000 0.000 0.006 0.006 <__array_function__ internals>:2(fft)
1 0.000 0.000 0.006 0.006 _pocketfft.py:95(fft)
4 0.000 0.000 0.000 0.000 <__array_function__ internals>:2(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:550(swapaxes)
4 0.000 0.000 0.000 0.000 fromnumeric.py:52(_wrapfunc)
4 0.000 0.000 0.000 0.000 {method 'swapaxes' of 'numpy.ndarray' objects}
2 0.000 0.000 0.000 0.000 _asarray.py:14(asarray)
4 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
2 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
4 0.000 0.000 0.000 0.000 fromnumeric.py:546(_swapaxes_dispatcher)
2 0.000 0.000 0.000 0.000 _pocketfft.py:91(_fft_dispatcher)
Most of the number crunching libraries that interface well with the Python ecosystem have tight dependencies on Nvidia's ecosystem, but this is changing slowly. Here are some things you could try:
Profile your code. The built-in profiler (cProfile) is probably a good place to start; I'm also a fan of snakeviz for looking at performance traces. This will tell you whether NumPy's FFT implementation is what's blocking you. Is memory being allocated efficiently? Is there some way you could hand more data off to np.fft.ifft at once? How much time is Python taking to read the signal from its source and convert it into a NumPy array?
Numba is a JIT which takes Python code and further optimizes it, or compiles it to either CUDA or ROCm (AMD). I'm not sure how far off you are from your performance goals, but perhaps this could help; a minimal sketch follows this list.
Here is a list of C++ libraries to try.
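As a rough illustration of the Numba route (a sketch, not a tuned implementation; the function name and loop order are my own, and it assumes the broadcast-multiply-and-sum step is the hot spot):
import numba
import numpy as np

@numba.njit(parallel=True)
def contract(signal_f, a):
    # signal_f: (n_t, n_f, n_c) complex, a: (n_f, n_c, n_x, n_y) complex
    n_t, n_f, n_c = signal_f.shape
    _, _, n_x, n_y = a.shape
    out = np.zeros((n_t, n_c, n_x, n_y), dtype=np.complex128)
    for t in numba.prange(n_t):          # each t writes to its own slice, so prange is safe
        for f in range(n_f):
            for c in range(n_c):
                s = signal_f[t, f, c]
                for x in range(n_x):
                    for y in range(n_y):
                        out[t, c, x, y] += s * a[f, c, x, y]
    return out

# contract(Signal, A) should match (Signal[..., None, None] * A[None, ...]).sum(1)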
Honestly, I'm kind of surprised that the NumPy build distributed on PyPI isn't fast enough for real-time use. I'll post an update where I benchmark this code snippet.
UPDATE: Here's an implementation which uses a multiprocessing.Pool, and the np.einsum trick kindly provided by @hpaulj.
import time
import numpy as np
import multiprocessing

NUM_RUNS = 50

A = np.random.random((128, 4, 46, 23)) + np.random.random((128, 4, 46, 23)) * 1j
signal = np.random.random((355, 256, 4)) + np.random.random((355, 256, 4)) * 1j


def worker(signal_chunk: np.ndarray, a_chunk: np.ndarray) -> np.ndarray:
    return (
        np.fft.ifft(np.einsum("ijk,jklm->iklm", signal_chunk, a_chunk), axis=1).real ** 2
    )


# old code
serial_times = []
for _ in range(NUM_RUNS):
    start = time.monotonic()
    Signal = np.fft.fft(signal, axis=1)[:, :128, :]
    B = Signal[..., None, None] * A[None, ...]
    B = B.sum(1)
    b = np.fft.ifft(B, axis=1).real
    b_squared = b ** 2
    res = b_squared.sum(1)
    serial_times.append(time.monotonic() - start)

parallel_times = []
# My CPU is hyperthreaded, so I'm only spawning workers for the number of physical cores
with multiprocessing.Pool(multiprocessing.cpu_count() // 2) as p:
    for _ in range(NUM_RUNS):
        start = time.monotonic()
        # Get the FFT of the whole sample before splitting
        transformed = np.fft.fft(signal, axis=1)[:, :128, :]
        a_chunks = np.split(A, A.shape[0] // multiprocessing.cpu_count(), axis=0)
        signal_chunks = np.split(
            transformed, transformed.shape[1] // multiprocessing.cpu_count(), axis=1
        )
        res = np.sum(np.hstack(p.starmap(worker, zip(signal_chunks, a_chunks))), axis=1)
        parallel_times.append(time.monotonic() - start)

print(
    f"ORIGINAL AVG TIME: {np.mean(serial_times):.3f}\t POOLED TIME: {np.mean(parallel_times):.3f}"
)
And here are the results I'm getting on a Ryzen 3700X (8 cores, 16 threads):
ORIGINAL AVG TIME: 0.897 POOLED TIME: 0.315
I'd have loved to offer you an FFT library written in OpenCL, but I'm not sure whether you'd have to write the Python bridge yourself (more code) or whether you'd trust the first implementation you came across on GitHub. If you're willing to give in to CUDA's vendor lock-in, there is an "almost drop-in replacement" for NumPy called CuPy, and it has FFT and IFFT kernels. Hope this helps!
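For what it's worth, here is a sketch of what the question's pipeline might look like with CuPy (this assumes a CUDA-capable GPU and that A and signal already exist as NumPy arrays):
import cupy as cp

A_gpu = cp.asarray(A)            # one-time host-to-device copies
signal_gpu = cp.asarray(signal)

Signal_gpu = cp.fft.fft(signal_gpu, axis=1)[:, :128, :]
B_gpu = cp.einsum('ijk,jklm->iklm', Signal_gpu, A_gpu)
b_gpu = cp.fft.ifft(B_gpu, axis=1).real
res = cp.asnumpy((b_gpu ** 2).sum(1))   # copy only the final result back to the host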
With your arrays:
In [42]: Signal.shape
Out[42]: (355, 128, 4)
In [43]: A.shape
Out[43]: (128, 4, 46, 23)
The B calc takes minutes on my modest machine:
In [44]: B = (Signal[..., None, None] * A[None,...]).sum(1)
In [45]: B.shape
Out[45]: (355, 4, 46, 23)
einsum is much faster:
In [46]: B2=np.einsum('ijk,jklm->iklm',Signal,A)
In [47]: np.allclose(B,B2)
Out[47]: True
In [48]: timeit B2=np.einsum('ijk,jklm->iklm',Signal,A)
1.05 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
reworking the dimensions to move the 128 to the classic dot positions:
In [49]: B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2))
In [50]: B21.shape
Out[50]: (355, 4, 1058)
In [51]: timeit B21=np.einsum('ikj,kjn->ikn',Signal.transpose(0,2,1),A.reshape(128, 4, 46*23).transpose(1,0,2))
1.04 s ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With a bit more tweaking I can use matmul/@ and cut the time in half:
In [52]: B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
In [53]: timeit B3=Signal.transpose(0,2,1)[:,:,None,:]@(A.reshape(1,128, 4, 46*23).transpose(0,2,1,3))
497 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [54]: B3.shape
Out[54]: (355, 4, 1, 1058)
In [56]: np.allclose(B, B3[:,:,0,:].reshape(B.shape))
Out[56]: True
Casting the arrays to the matmul format took a fair bit of experimentation. matmul makes optimal use of BLAS-like libraries. You may improve speed by installing better libraries.
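For readability, the same matmul pipeline wrapped in a small function (a sketch with the question's shapes hard-coded: Signal is (355, 128, 4), A is (128, 4, 46, 23)):
def contract_matmul(Signal, A):
    S = Signal.transpose(0, 2, 1)[:, :, None, :]               # (355, 4, 1, 128)
    M = A.reshape(1, 128, 4, 46 * 23).transpose(0, 2, 1, 3)    # (1, 4, 128, 1058)
    B = (S @ M)[:, :, 0, :]                                    # (355, 4, 1058)
    return B.reshape(355, 4, 46, 23)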
Running NumPy version 1.19.2, I get better performance by taking the mean over each individual axis of an array in turn than by calculating the mean over an already flattened array.
shape = (10000,32,32,3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)
14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
mat_reshaped = mat.reshape(-1,3)
# Call this Method B
%%timeit
mat_means = mat_reshaped.mean(axis=0)
135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is odd, since doing the mean multiple times has the same bad access pattern (perhaps even worse) as the one on the reshaped array, and we also do more operations this way. As a sanity check, I converted the array to Fortran order:
mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)
12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This yields the performance improvement I expected.
For Method A, prun gives:
36 function calls in 0.019 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
3 0.018 0.006 0.018 0.006 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.019 0.019 {built-in method builtins.exec}
3 0.000 0.000 0.019 0.006 _methods.py:143(_mean)
3 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.019 0.019 <string>:1(<module>)
3 0.000 0.000 0.019 0.006 {method 'mean' of 'numpy.ndarray' objects}
3 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
3 0.000 0.000 0.000 0.000 {built-in method numpy.array}
3 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
6 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
6 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
While for Method B:
14 function calls in 0.166 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.166 0.166 0.166 0.166 {method 'reduce' of 'numpy.ufunc' objects}
1 0.000 0.000 0.166 0.166 {built-in method builtins.exec}
1 0.000 0.000 0.166 0.166 _methods.py:143(_mean)
1 0.000 0.000 0.000 0.000 _methods.py:59(_count_reduce_items)
1 0.000 0.000 0.166 0.166 <string>:1(<module>)
1 0.000 0.000 0.166 0.166 {method 'mean' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 _asarray.py:86(asanyarray)
1 0.000 0.000 0.000 0.000 {built-in method numpy.array}
1 0.000 0.000 0.000 0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
2 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
2 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Note: np.setbufsize(1e7) doesn't seem to have any effect.
What is the reason for this performance difference?
Let's call your original matrix mat, with mat.shape = (10000, 32, 32, 3). Visually, this is like having a "stack" of 10,000 32x32x3 rectangular prisms (I think of them as LEGOs) of floats.
Now let's think about what you did in terms of floating point operations (flops):
In Method A, you do mat.mean(axis=0).mean(axis=0).mean(axis=0). Let's break this down:
You take the mean of each position (i,j,k) across all 10,000 LEGOs. This gives you back a single LEGO of size 32x32x3 which now contains the first set of means. That means you have performed 10,000 additions and 1 division per mean, of which there are 32*32*3 = 3,072. In total, you've done 30,723,072 flops.
You then take the mean again, this time of each position (j,k), where i is now the number of the layer (vertical position) you are currently on. This gives you a piece of paper with 32x3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 32*3 = 96. In total, you've done 3,168 flops.
Finally, you take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 3. In total, you've done 99 flops.
The grand total of all this is 30,723,072 + 3,168 + 99 = 30,726,339 flops.
In Method B, you do mat_reshaped = mat.reshape(-1,3); mat_means = mat_reshaped.mean(axis=0). Let's break this down:
You reshaped everything, so mat is a long roll of paper of size 10,240,000x3. You take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 10,240,000 additions and 1 division per mean, of which there are 3. In total, you've done 30,720,003 flops.
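The totals above can be checked with a couple of lines:
method_a = (10_000 + 1) * 32 * 32 * 3 + (32 + 1) * 32 * 3 + (32 + 1) * 3
method_b = (10_240_000 + 1) * 3
print(method_a, method_b)   # 30726339 30720003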
So now you're saying to yourself, "What! All of that work, only to show that the slower method actually does less work?!" Here's the problem: although Method B does less work, it does not do a lot less work, so from a flop standpoint alone we would expect the two methods to have similar runtimes.
You also have to consider the size of your reshaped array in Method B: a matrix with 10,240,000 rows is huge! It's really hard/inefficient for the computer to access all of that, and more memory accesses mean longer runtimes. The fact is that in its original 10,000x32x32x3 shape, the matrix was already partitioned into convenient slices that the computer could access more efficiently. This is actually a common technique when handling giant matrices (see Jaime's response to a similar question, or this article): both talk about how breaking up a big matrix into smaller slices helps your program be more memory efficient, and therefore makes it run faster.
Test code:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})
%timeit df.y[(10 < df.z) & (df.z < 50)].mean()
%timeit df.y.values[(10 < df.z.values) & (df.z.values < 50)].mean()
%timeit df.eval('y[(10 < z) & (z < 50)].mean()', engine='numexpr')
The output on my machine (a fairly fast x86-64 Linux desktop with Python 3.6) is:
17.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.44 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
46.4 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I understand why the second line is a bit faster (it ignores the Pandas index). But why is the eval() approach using numexpr so slow? Shouldn't it be faster than at least the first approach? The documentation sure makes it seem like it would be: https://pandas.pydata.org/pandas-docs/stable/enhancingperf.html
From the investigation presented below, it looks like the unspectacular reason for the worse performance is "overhead".
Only a small part of the expression y[(10 < z) & (z < 50)].mean() is done via the numexpr module. numexpr doesn't support indexing, so we can only hope for (10 < z) & (z < 50) to be sped up; everything else is mapped to pandas operations.
However, (10 < z) & (z < 50) is not the bottleneck here, as can easily be seen:
%timeit df.y[(10 < df.z) & (df.z < 50)].mean() # 16.7 ms
mask=(10 < df.z) & (df.z < 50)
%timeit df.y[mask].mean() # 13.7 ms
%timeit df.y[mask] # 13.2 ms
df.y[mask] takes the lion's share of the running time.
We can compare the profiler output for df.y[mask] and df.eval('y[mask]') to see what makes the difference.
When I use the following script:
import numpy as np
import pandas as pd
COUNT = 1000000
df = pd.DataFrame({
    'y': np.random.normal(0, 1, COUNT),
    'z': np.random.gamma(50, 1, COUNT),
})
mask = (10 < df.z) & (df.z < 50)
df['m']=mask
for _ in range(500):
    df.y[df.m]
    # OR
    # df.eval('y[m]', engine='numexpr')
and run it with python -m cProfile -s cumulative run.py (or %prun -s cumulative <...> in IPython), I can see the following profiles.
For direct call of the pandas functionality:
ncalls tottime percall cumtime percall filename:lineno(function)
419/1 0.013 0.000 7.228 7.228 {built-in method builtins.exec}
1 0.006 0.006 7.228 7.228 run.py:1(<module>)
500 0.005 0.000 6.589 0.013 series.py:764(__getitem__)
500 0.003 0.000 6.475 0.013 series.py:812(_get_with)
500 0.003 0.000 6.468 0.013 series.py:875(_get_values)
500 0.009 0.000 6.445 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.246 0.006 range.py:491(__getitem__)
505 3.146 0.006 3.236 0.006 base.py:2067(__getitem__)
500 3.170 0.006 3.170 0.006 internals.py:310(_slice)
635/2 0.003 0.000 0.414 0.207 <frozen importlib._bootstrap>:958(_find_and_load)
We can see that almost 100% of the time is spent in series.__getitem__ without any overhead.
For the call via df.eval(...), the situation is quite different:
ncalls tottime percall cumtime percall filename:lineno(function)
453/1 0.013 0.000 12.702 12.702 {built-in method builtins.exec}
1 0.015 0.015 12.702 12.702 run.py:1(<module>)
500 0.013 0.000 12.090 0.024 frame.py:2861(eval)
1000/500 0.025 0.000 10.319 0.021 eval.py:153(eval)
1000/500 0.007 0.000 9.247 0.018 expr.py:731(__init__)
1000/500 0.004 0.000 9.236 0.018 expr.py:754(parse)
4500/500 0.019 0.000 9.233 0.018 expr.py:307(visit)
1000/500 0.003 0.000 9.105 0.018 expr.py:323(visit_Module)
1000/500 0.002 0.000 9.102 0.018 expr.py:329(visit_Expr)
500 0.011 0.000 9.096 0.018 expr.py:461(visit_Subscript)
500 0.007 0.000 6.874 0.014 series.py:764(__getitem__)
500 0.003 0.000 6.748 0.013 series.py:812(_get_with)
500 0.004 0.000 6.742 0.013 series.py:875(_get_values)
500 0.009 0.000 6.717 0.013 internals.py:4702(get_slice)
500 0.006 0.000 3.404 0.007 range.py:491(__getitem__)
506 3.289 0.007 3.391 0.007 base.py:2067(__getitem__)
500 3.282 0.007 3.282 0.007 internals.py:310(_slice)
500 0.003 0.000 1.730 0.003 generic.py:432(_get_index_resolvers)
1000 0.014 0.000 1.725 0.002 generic.py:402(_get_axis_resolvers)
2000 0.018 0.000 1.685 0.001 base.py:1179(to_series)
1000 0.003 0.000 1.537 0.002 scope.py:21(_ensure_scope)
1000 0.014 0.000 1.534 0.002 scope.py:102(__init__)
500 0.005 0.000 1.476 0.003 scope.py:242(update)
500 0.002 0.000 1.451 0.003 inspect.py:1489(stack)
500 0.021 0.000 1.449 0.003 inspect.py:1461(getouterframes)
11000 0.062 0.000 1.415 0.000 inspect.py:1422(getframeinfo)
2000 0.008 0.000 1.276 0.001 base.py:1253(_to_embed)
2035 1.261 0.001 1.261 0.001 {method 'copy' of 'numpy.ndarray' objects}
1000 0.015 0.000 1.226 0.001 engines.py:61(evaluate)
11000 0.081 0.000 1.081 0.000 inspect.py:757(findsource)
Once again, about 7 seconds are spent in series.__getitem__, but there are also about 6 seconds of overhead - for example, about 2 seconds in frame.py:2861(eval) and about 2 seconds in expr.py:461(visit_Subscript).
I did only a superficial investigation (see more details further below), but this overhead doesn't seem to be constant; it is at least linear in the number of elements in the series. For example, there is method 'copy' of 'numpy.ndarray' objects, which means that data is copied (it is quite unclear why this would be necessary per se).
My take-away from it: using pd.eval has advantages as long as the evaluated expression can be handled by numexpr alone. As soon as this is not the case, there may no longer be gains, but rather losses, due to the quite large overhead.
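In practice that means letting numexpr handle only the part it can accelerate; a sketch using the same df as above:
# evaluate just the mask through numexpr, then index and reduce outside eval
mask = df.eval('(10 < z) & (z < 50)', engine='numexpr')
result = df.y.values[mask.values].mean()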
Using line_profiler (here I use the %lprun magic, after loading it with %load_ext line_profiler, on a function run() which is more or less a copy of the script above), we can easily find where the time is lost in DataFrame.eval:
%lprun -f pd.core.frame.DataFrame.eval
-f pd.core.frame.DataFrame._get_index_resolvers
-f pd.core.frame.DataFrame._get_axis_resolvers
-f pd.core.indexes.base.Index.to_series
-f pd.core.indexes.base.Index._to_embed
run()
Here we can see where the additional 10% are spent:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2861 def eval(self, expr,
....
2951 10 206.0 20.6 0.0 from pandas.core.computation.eval import eval as _eval
2952
2953 10 176.0 17.6 0.0 inplace = validate_bool_kwarg(inplace, 'inplace')
2954 10 30.0 3.0 0.0 resolvers = kwargs.pop('resolvers', None)
2955 10 37.0 3.7 0.0 kwargs['level'] = kwargs.pop('level', 0) + 1
2956 10 17.0 1.7 0.0 if resolvers is None:
2957 10 235850.0 23585.0 9.0 index_resolvers = self._get_index_resolvers()
2958 10 2231.0 223.1 0.1 resolvers = dict(self.iteritems()), index_resolvers
2959 10 29.0 2.9 0.0 if 'target' not in kwargs:
2960 10 19.0 1.9 0.0 kwargs['target'] = self
2961 10 46.0 4.6 0.0 kwargs['resolvers'] = kwargs.get('resolvers', ()) + tuple(resolvers)
2962 10 2392725.0 239272.5 90.9 return _eval(expr, inplace=inplace, **kwargs)
and _get_index_resolvers() can be drilled down to Index._to_embed:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1253 def _to_embed(self, keep_tz=False, dtype=None):
1254 """
1255 *this is an internal non-public method*
1256
1257 return an array repr of this object, potentially casting to object
1258
1259 """
1260 40 73.0 1.8 0.0 if dtype is not None:
1261 return self.astype(dtype)._to_embed(keep_tz=keep_tz)
1262
1263 40 201490.0 5037.2 100.0 return self.values.copy()
Where the O(n)-copying happens.
I tried to reproduce the functionality of IPython's %timeit, but for some strange reason, the results for one function are horrific.
IPython:
In [11]: from random import shuffle
....: import numpy as np
....: def numpy_seq_el_rank(seq, el):
....:     return sum(seq < el)
....:
....: seq = np.array(xrange(10000))
....: shuffle(seq)
....:
In [12]: %timeit numpy_seq_el_rank(seq, 10000//2)
10000 loops, best of 3: 46.1 µs per loop
Python:
from timeit import timeit, repeat
def my_timeit(code, setup, rep, loops):
    result = repeat(code, setup=setup, repeat=rep, number=loops)
    return '%d loops, best of %d: %0.9f sec per loop' % (loops, rep, min(result))
np_setup = '''
from random import shuffle
import numpy as np
def numpy_seq_el_rank(seq, el):
    return sum(seq < el)
seq = np.array(xrange(10000))
shuffle(seq)
'''
np_code = 'numpy_seq_el_rank(seq, 10000//2)'
print 'Numpy seq_el_rank:\n\t%s'%my_timeit(code=np_code, setup=np_setup, rep=3, loops=100)
And its output:
Numpy seq_el_rank:
100 loops, best of 3: 1.655324947 sec per loop
As you can see, in plain Python I ran 100 loops instead of the 10,000 used in IPython (and still got a result roughly 35,000 times slower), because it takes a really long time. Can anybody explain why the result in Python is so slow?
UPD:
Here is cProfile.run('my_timeit(code=np_code, setup=np_setup, rep=3, loops=10000)') output:
30650 function calls in 4.987 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 4.987 4.987 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 <timeit-src>:2(<module>)
3 0.001 0.000 4.985 1.662 <timeit-src>:2(inner)
300 0.006 0.000 4.961 0.017 <timeit-src>:7(numpy_seq_el_rank)
1 0.000 0.000 4.987 4.987 Lab10.py:47(my_timeit)
3 0.019 0.006 0.021 0.007 random.py:277(shuffle)
1 0.000 0.000 0.002 0.002 timeit.py:121(__init__)
3 0.000 0.000 4.985 1.662 timeit.py:185(timeit)
1 0.000 0.000 4.985 4.985 timeit.py:208(repeat)
1 0.000 0.000 4.987 4.987 timeit.py:239(repeat)
2 0.000 0.000 0.000 0.000 timeit.py:90(reindent)
3 0.002 0.001 0.002 0.001 {compile}
3 0.000 0.000 0.000 0.000 {gc.disable}
3 0.000 0.000 0.000 0.000 {gc.enable}
3 0.000 0.000 0.000 0.000 {gc.isenabled}
1 0.000 0.000 0.000 0.000 {globals}
3 0.000 0.000 0.000 0.000 {isinstance}
3 0.000 0.000 0.000 0.000 {len}
3 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
29997 0.001 0.000 0.001 0.000 {method 'random' of '_random.Random' objects}
2 0.000 0.000 0.000 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {min}
3 0.003 0.001 0.003 0.001 {numpy.core.multiarray.array}
1 0.000 0.000 0.000 0.000 {range}
300 4.955 0.017 4.955 0.017 {sum}
6 0.000 0.000 0.000 0.000 {time.clock}
Well, one issue is that you're misreading the results. IPython is telling you how long each of the 10,000 iterations took, for the set of 10,000 iterations with the lowest total time. Your timeit.repeat call is reporting how long the whole round of 100 iterations took (again, for the shortest of three). So the real discrepancy is 46.1 µs per loop (IPython) vs. 16.5 ms per loop (Python): still a factor of ~350x, but not 35,000x.
You didn't show profiling results for IPython. Is it possible that in your IPython session you did either from numpy import sum or from numpy import *? If so, you'd have been timing numpy.sum (which is optimized for NumPy arrays and would run several orders of magnitude faster), while your Python code (which isolated the globals in a way that IPython does not) ran the built-in sum (which has to convert all the values to Python ints and sum them).
If you check your profiling output, virtually all of your work is being done in sum; if that part of your code was sped up by several orders of magnitude, the total time would reduce similarly. That would explain the "real" discrepancy; in the test case linked above, it was a 40x difference, and that was for a smaller array (the smaller the array, the less numpy can "show off") with more complex values (vs. summing 0s and 1s here I believe).
The remainder (if any) is probably an issue of how the code is being evaled slightly differently, or possibly weirdness with the random shuffle (for consistent tests, you'd want to seed random with a consistent seed to make the "randomness" repeatable), but I doubt that's a difference of more than a few percent.
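To see that gap in isolation (illustrative; exact numbers depend on the machine), compare the built-in sum with numpy.sum on the same boolean array:
import numpy as np
seq = np.arange(10000)
np.random.shuffle(seq)

%timeit sum(seq < 5000)      # built-in sum: converts each element to a Python object
%timeit np.sum(seq < 5000)   # NumPy's vectorized reduction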
There could be any number of reasons this code is running slower in one implementation of python than another. One may be optimized differently than another, one may pre-compile certain parts while the other is fully interpreted. The only way to figure out why is to profile your code.
https://docs.python.org/2/library/profile.html
import cProfile
cProfile.run('repeat(code, setup=setup, repeat=rep, number=loops)')
Will give a result similar to
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 <stdin>:1(testing)
1 0.000 0.000 0.000 0.000 <string>:1(<module>)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
Which shows you when function calls were made, how many times they were made and how long they took.
I need to store an array of size n with values of cos(x) and sin(x), let's say
array[[cos(0.9), sin(0.9)],
[cos(0.35),sin(0.35)],
...]
The arguments of each pair of cos and sin are given by random choice. My code, as far as I have been improving it, is like this:
def randvector():
""" Generates random direction for n junctions in the unitary circle """
x = np.empty([n,2])
theta = 2 * np.pi * np.random.random_sample((n))
x[:,0] = np.cos(theta)
x[:,1] = np.sin(theta)
return x
Is there a shorter way or more effective way to achieve this?
Your code is effective enough, and justhalf's answer is not bad, I think.
For something effective and short, how about this code?
def randvector(n):
    theta = 2 * np.pi * np.random.random_sample((n))
    return np.vstack((np.cos(theta), np.sin(theta))).T
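For example, a quick sanity check of the version above (this assumes numpy is imported as np):
pts = randvector(5)
print(pts.shape)                                  # (5, 2)
print(np.allclose((pts ** 2).sum(axis=1), 1.0))   # every row lies on the unit circle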
UPDATE
Appending the cProfile results.
justhalf's
5 function calls in 4.707 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 4.707 4.707 <string>:1(<module>)
1 2.452 2.452 4.706 4.706 test.py:6(randvector1)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.010 0.010 0.010 0.010 {method 'random_sample' of 'mtrand.RandomState' objects}
1 2.244 2.244 2.244 2.244 {numpy.core.multiarray.array}
OP's
5 function calls in 0.088 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.088 0.088 <string>:1(<module>)
1 0.079 0.079 0.088 0.088 test.py:9(randvector2)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.009 0.009 0.009 0.009 {method 'random_sample' of 'mtrand.RandomState' objects}
1 0.000 0.000 0.000 0.000 {numpy.core.multiarray.empty}
mine
21 function calls in 0.087 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.087 0.087 <string>:1(<module>)
2 0.000 0.000 0.000 0.000 numeric.py:322(asanyarray)
1 0.000 0.000 0.002 0.002 shape_base.py:177(vstack)
2 0.000 0.000 0.000 0.000 shape_base.py:58(atleast_2d)
1 0.076 0.076 0.087 0.087 test.py:17(randvector3)
6 0.000 0.000 0.000 0.000 {len}
1 0.000 0.000 0.000 0.000 {map}
2 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.009 0.009 0.009 0.009 {method 'random_sample' of 'mtrand.RandomState' objects}
2 0.000 0.000 0.000 0.000 {numpy.core.multiarray.array}
1 0.002 0.002 0.002 0.002 {numpy.core.multiarray.concatenate}
Your code already looks fine to me, but here are a few more thoughts.
Here's a one-liner.
It is marginally slower than your version.
def randvector2(n):
    return np.exp((2.0j * np.pi) * np.random.rand(n, 1)).view(dtype=np.float64)
I get these timings for n=10000
Yours:
1000 loops, best of 3: 716 µs per loop
my shortened version:
1000 loops, best of 3: 834 µs per loop
Now if speed is a concern, your approach is really very good.
Another answer shows how to use hstack.
That works well.
Here is another version that is just a little different from yours and is marginally faster.
def randvector3(n):
    x = np.empty([n,2])
    theta = (2 * np.pi) * np.random.rand(n)
    np.cos(theta, out=x[:,0])
    np.sin(theta, out=x[:,1])
    return x
This gives me the timing:
1000 loops, best of 3: 698 µs per loop
If you have access to numexpr, the following is faster (at least on my machine).
import numexpr as ne
def randvector3(n):
    sample = np.random.rand(n, 1)
    c = 2.0j * np.pi
    return ne.evaluate('exp(c * sample)').view(dtype=np.float64)
This gives me the timing:
1000 loops, best of 3: 366 µs per loop
Honestly though, if I were writing this for anything that wasn't extremely performance intensive, I'd do pretty much the same thing you did.
It makes your intent pretty clear to the reader.
The version with hstack works well too.
Another quick note:
When I run timings for n=10, my one-line version is fastest.
When I do n=10000000, the fast pure-numpy version is fastest.
You can use list comprehension to make the code a little bit shorter:
def randvector(n):
    return np.array([(np.cos(theta), np.sin(theta)) for theta in 2*np.pi*np.random.random_sample(n)])
But, as IanH mentioned in the comments, this is slower. In fact, in my experiment it is 5x slower, because it doesn't take advantage of NumPy vectorization.
So to answer your question:
Is there a shorter way?
Yes, which is what I give in this answer, although it's only shorter by a few characters (but it saves many lines!)
Is there a more effective (I believe you meant "efficient") way?
I believe the answer to this question, without overly complicating the code, is no, since NumPy already vectorizes the assignment of the cos and sin values to the array.
Timing
Comparing various methods:
OP's randvector: 0.002131 s
My randvector: 0.013218 s
mskimm's randvector: 0.003175 s
So it seems that mskimm's randvector looks good in terms of code length and efficiency =D
I'm using python to do some Bayesian statistics. I've coded it up in python and in Fortran 95. The Fortran code is waaay faster... like a factor of 100. I expected the Fortran to be faster, but I was really hoping that by using numpy I could get the python code to come close, maybe within a factor of 2. I've profiled the python code and it looks like the majority of the time is spent doing the following things:
scipy.stats.rvs: taking a random draw from a distribution. I do this ~19000 times and it takes a total time of 3.552 sec
numpy.slogdet: computing the log of the determinant of a matrix. I do this ~10,000 times and it takes a total of 2.48 s
numpy.solve: solve a linear system: I call this routine ~10,000 times for a total time of 2.557 s
In total my code runs in ~11 sec, whereas my Fortran code takes 0.092 sec. Is this a joke? I'm really not trying to be unrealistic in my expectations of Python, and I certainly don't expect my Python code to be as fast as Fortran, but to be slower by a factor of more than 100... Python's got to be able to do better than that. Just in case you are curious, here is the full output of my profiler (I don't know why it broke the text into several blocks):
1290611 function calls in 11.296 CPU seconds
Ordered by: internal time, function name
ncalls tottime percall cumtime percall filename:lineno(function)
18973 0.864 0.000 3.552 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:484(rvs)
9976 0.819 0.000 2.480 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:1559(slogdet)
9976 0.627 0.000 6.659 0.001 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:77(evaluate_posterior)
9384 0.591 0.000 0.753 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:39(construct_R_matrix)
77852 0.533 0.000 0.533 0.000 :0(array)
37946 0.520 0.000 1.489 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:32(_wrapit)
77851 0.423 0.000 0.956 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:216(asarray)
37946 0.360 0.000 0.360 0.000 :0(all)
9976 0.335 0.000 2.557 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/basic.py:23(solve)
107799 0.322 0.000 0.322 0.000 :0(len)
109740 0.301 0.000 0.301 0.000 :0(issubclass)
28357 0.294 0.000 0.294 0.000 :0(prod)
9976 0.287 0.000 0.957 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:45(find_best_lapack_type)
1 0.282 0.282 11.294 11.294 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:199(get_rho_lambda_draws)
9976 0.269 0.000 1.386 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:60(get_lapack_funcs)
19952 0.263 0.000 0.476 0.000 /usr/lib64/python2.6/site-packages/scipy/linalg/lapack.py:23(cast_to_lapack_prefix)
19952 0.235 0.000 0.669 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/function_base.py:483(asarray_chkfinite)
66833 0.212 0.000 0.212 0.000 :0(log)
18973 0.207 0.000 1.054 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1427(product)
29931 0.205 0.000 0.205 0.000 :0(reduce)
28949 0.187 0.000 0.856 0.000 :0(map)
9976 0.175 0.000 0.175 0.000 :0(dot)
47922 0.163 0.000 0.163 0.000 :0(getattr)
9976 0.157 0.000 0.206 0.000 /usr/lib64/python2.6/site-packages/numpy/lib/twodim_base.py:169(eye)
19952 0.154 0.000 0.271 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:32(loggbeta)
18973 0.151 0.000 0.793 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1548(all)
19953 0.146 0.000 0.146 0.000 :0(any)
9976 0.142 0.000 0.316 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:99(_commonType)
9976 0.133 0.000 0.133 0.000 :0(dgetrf)
18973 0.125 0.000 0.175 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:462(_fix_loc_scale)
39904 0.117 0.000 0.117 0.000 :0(append)
18973 0.105 0.000 0.292 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1461(alltrue)
19952 0.102 0.000 0.102 0.000 :0(zeros)
19952 0.093 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:71(isComplexType)
19952 0.090 0.000 0.090 0.000 :0(split)
9976 0.089 0.000 2.569 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:62(get_log_determinant_of_matrix)
19952 0.087 0.000 0.134 0.000 /bluehome/legoses/bce/bayes_GP_integrated_out/python/ce_funcs.py:35(logggamma)
9976 0.083 0.000 0.154 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:139(_fastCopyAndTranspose)
9976 0.076 0.000 0.125 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:157(_assertSquareness)
9976 0.074 0.000 0.097 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:151(_assertRank2)
9976 0.072 0.000 0.119 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:127(_to_native_byte_order)
18973 0.072 0.000 0.072 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:832(_argcheck)
9976 0.072 0.000 0.228 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:901(diagonal)
9976 0.070 0.000 0.070 0.000 :0(arange)
9976 0.061 0.000 0.061 0.000 :0(diagonal)
9976 0.055 0.000 0.055 0.000 :0(sum)
9976 0.053 0.000 0.075 0.000 /usr/lib64/python2.6/site-packages/numpy/linalg/linalg.py:84(_realType)
11996 0.050 0.000 0.091 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:1412(_rvs)
9384 0.047 0.000 0.162 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1898(prod)
9976 0.045 0.000 0.045 0.000 :0(sort)
11996 0.041 0.000 0.041 0.000 :0(standard_normal)
9976 0.037 0.000 0.037 0.000 :0(_fastCopyAndTranspose)
9976 0.037 0.000 0.037 0.000 :0(hasattr)
9976 0.037 0.000 0.037 0.000 :0(range)
6977 0.034 0.000 0.055 0.000 /usr/lib64/python2.6/site-packages/scipy/stats/distributions.py:3731(_rvs)
9977 0.027 0.000 0.027 0.000 :0(max)
9976 0.023 0.000 0.023 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:498(isfortran)
9977 0.022 0.000 0.022 0.000 :0(min)
9976 0.022 0.000 0.022 0.000 :0(get)
6977 0.021 0.000 0.021 0.000 :0(uniform)
1 0.001 0.001 11.295 11.295 <string>:1(<module>)
1 0.001 0.001 11.296 11.296 profile:0(get_rho_lambda_draws(correlations,energies,rho_priors,lambda_e_prior,lambda_z_prior,candidate_sig2_rhos,candidate_sig2_lambda_e,candidate_sig2_lambda_z,3000))
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:445(__call__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:385(__init__)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:175(_array2string)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:475(_digits)
2 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:309(_extendLine)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:317(_formatArray)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1477(any)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:243(array2string)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:1390(array_str)
1 0.000 0.000 0.000 0.000 :0(compress)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/arrayprint.py:394(fillFormat)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2166(geterr)
12 0.000 0.000 0.000 0.000 :0(geterrobj)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/fromnumeric.py:1043(ravel)
1 0.000 0.000 0.000 0.000 :0(ravel)
8 0.000 0.000 0.000 0.000 :0(rstrip)
6 0.000 0.000 0.000 0.000 /usr/lib64/python2.6/site-packages/numpy/core/numeric.py:2070(seterr)
6 0.000 0.000 0.000 0.000 :0(seterrobj)
1 0.000 0.000 0.000 0.000 :0(setprofile)
EDIT:
Here is a copy of the relevant routines:
def get_rho_lambda_draws(correlations, energies, rho_priors, lam_e_prior, lam_z_prior,
                         candidate_sig2_rhos, candidate_sig2_lambda_e,
                         candidate_sig2_lambda_z, ndraws):
    nBasis = len(correlations[0])
    nStruct = len(correlations)
    rho_draws = [[0.5 for x in xrange(nBasis)] for y in xrange(ndraws)]
    lambda_e_draws = [5 for x in xrange(ndraws)]
    lambda_z_draws = [5 for x in xrange(ndraws)]
    accept_rhos = array([0. for x in xrange(nBasis)])
    accept_lambda_e = 0.
    accept_lambda_z = 0.
    for i in xrange(1,ndraws):
        if i % 100 == 0:
            print i, "REP<---------------------------------------------------------------------------------"
        #do metropolis to get rho
        rho_draws[i] = [x for x in rho_draws[i-1]]
        lambda_e_draws[i] = lambda_e_draws[i-1]
        lambda_z_draws[i] = lambda_z_draws[i-1]
        rho_vec = [x for x in rho_draws[i-1]]
        R_matrix_before = construct_R_matrix(correlations,correlations,rho_vec)
        post_before = evaluate_posterior(R_matrix_before,rho_vec,energies,lambda_e_draws[i-1],lambda_z_draws[i-1],lam_e_prior,lam_z_prior,rho_priors)
        index = 0
        for j in xrange(nBasis):
            cand = norm.rvs(rho_draws[i-1][j],scale=candidate_sig2_rhos[j])
            if 0.0 < cand < 1.0:
                rho_vec[j] = cand
                R_matrix_after = construct_R_matrix(correlations,correlations,rho_vec)
                post_after = evaluate_posterior(R_matrix_after,rho_vec,energies,lambda_e_draws[i-1],lambda_z_draws[i-1],lam_e_prior,lam_z_prior,rho_priors)
                metrop_value = post_after - post_before
                unif = log(uniform.rvs(0,1))
                if metrop_value > unif:
                    rho_draws[i][j] = cand
                    post_before = post_after
                    accept_rhos[j] += 1
                else:
                    rho_vec[j] = rho_draws[i-1][j]
        R_matrix = construct_R_matrix(correlations,correlations,rho_vec)
        cand = norm.rvs(lambda_e_draws[i-1],scale=candidate_sig2_lambda_e)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix,rho_vec,energies,cand,lambda_z_draws[i-1],lam_e_prior,lam_z_prior,rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0,1))
            if metrop_value > unif:
                lambda_e_draws[i] = cand
                post_before = post_after
                accept_lambda_e = accept_lambda_e + 1
        cand = norm.rvs(lambda_z_draws[i-1],scale=candidate_sig2_lambda_z)
        if cand > 0.0:
            post_after = evaluate_posterior(R_matrix,rho_vec,energies,lambda_e_draws[i],cand,lam_e_prior,lam_z_prior,rho_priors)
            metrop_value = post_after - post_before
            unif = log(uniform.rvs(0,1))
            if metrop_value > unif:
                lambda_z_draws[i] = cand
                post_before = post_after
                accept_lambda_z = accept_lambda_z + 1
    print accept_rhos/ndraws
    print accept_lambda_e/ndraws
    print accept_lambda_z/ndraws
    return [rho_draws,lambda_e_draws,lambda_z_draws]
def evaluate_posterior(R_matrix,rho_vec,energies,lambda_e,lambda_z,lam_e_prior,lam_z_prior,rho_prior_params):
    # from scipy.linalg import solve
    # from numpy import allclose
    working_matrix = eye(len(R_matrix))/lambda_e + R_matrix/lambda_z
    logdet = get_log_determinant_of_matrix(working_matrix)
    x = solve(working_matrix,energies,sym_pos=True)
    # if not allclose(dot(working_matrix,x),energies):
    #     exit('solve routine didnt work')
    rho_priors = sum([loggbeta(rho_vec[j],rho_prior_params[j][0],rho_prior_params[j][1]) for j in xrange(len(rho_vec))])
    loggposterior = -.5 * logdet - .5*dot(energies,x) + logggamma(lambda_e,lam_e_prior[0],lam_e_prior[1]) + logggamma(lambda_z,lam_z_prior[0],lam_z_prior[1]) + rho_priors #(a_e-1)*log(lambda_e) - b_e*lambda_e + (a_z-1)*log(lambda_z) - b_z*lambda_z + rho_priors
    return loggposterior
def construct_R_matrix(listone,listtwo,rhos):
    return prod(rhos[:]**(4*(listone[:,newaxis]-listtwo)**2),axis=2)
(Once again, I don't know why it breaks my input up into several blocks when I post. I hope you can decipher it.)
It is hard to tell exactly what's going on with your code, but my suspicion is that you just have some data which is not (or cannot be) very vectorized.
Obviously, calling .rvs() 19,000 times is going to be way slower than a single .rvs(size=19000). See:
In [5]: %timeit x=[scipy.stats.norm().rvs() for i in range(19000)]
1 loops, best of 3: 1.23 s per loop
In [6]: %timeit x=scipy.stats.norm().rvs(size=19000)
1000 loops, best of 3: 1.67 ms per loop
So if you indeed have code or an algorithm that is not very vectorized, it is well expected to be slower than Fortran.
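One way to apply this to the sampler in the question is to draw all the proposal noise and the uniform comparison values up front, outside the Metropolis loop (a hedged sketch; the sizes are illustrative and candidate_sig2_rhos stands in for the question's per-basis proposal scales):
import numpy as np
from scipy.stats import norm, uniform

ndraws, nBasis = 3000, 20                      # illustrative sizes
candidate_sig2_rhos = np.full(nBasis, 0.1)     # stand-in for the question's proposal scales

# one vectorized call each, instead of ndraws * nBasis separate .rvs() calls
rho_noise = norm.rvs(size=(ndraws, nBasis)) * candidate_sig2_rhos
log_unifs = np.log(uniform.rvs(size=(ndraws, nBasis)))
# inside the loop: cand = rho_draws[i-1][j] + rho_noise[i, j], compared against log_unifs[i, j]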
Check out the performance page created by the SciPy/NumPy folks. There are a number of remarkably easy extras that foster very fast code. Among them are (a) using the weave module, especially the inline and blitz options, and (b) using Cython to write some of your functions in C but still be able to call and use them from Python.
I do a lot of large-scale scientific computing work in Python for statistics, finance, and (in grad school) computer vision. The reason why Python is excellent for these kinds of problems is not that my naive, first-hack code would yield the fastest solution, but that in Python I can easily interface with tons of other tasks. I can easily issue Linux commands for other programs, easily read and parse most data files, easily interface with SQL and other database software; I have all of the R statistics library available, use of OpenCV commands (in much, much nicer syntax than the C++ version), and much more.
When the importance of my task was to manipulate a new dataset and get my hands dirty, feeling out the nuances of that data, then Python's ease of programming, along with matplotlib, made it much better. Later on, when I need to scale things up, I can always use PyCUDA, Cython, or just rewrite things in C++ if high-end performance is required. Since most machines have multiprocessors now, the multiprocessing module, as well as mpi4py, allow me to quickly and cheaply turn annoying for-loop style tasks into much shorter tasks, without needing to migrate to C++.
In short, the real utility of Python doesn't come from the language all by itself, but from becoming really proficient with the add-ons and extras that let you cheaply make your little set of common problems execute quickly on the data sets that matter in the day-to-day.
Real-time embedded communications software is going to be using C++ for a long time to come... same for high-frequency trading strategies. But then again, professional solutions to these types of things are not really what Python is meant for. And in some cases, folks prefer unusual solutions for that stuff anyway.
Get rid of the two for loops and two list comprehensions by replacing them with NumPy functions and constructs that use numpy.ndarrays. Also, do not print in between the computation; that is slow too. You can probably get a 10-50x speed increase just by following the above advice.
Also see http://www.scipy.org/PerformancePython/
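As a concrete example of that advice (a sketch; it assumes loggbeta is written with NumPy ufuncs so it accepts whole arrays), the per-element list comprehension in evaluate_posterior can become a single vectorized call:
import numpy as np

rho_arr = np.asarray(rho_vec)             # shape (nBasis,), from the question's code
params = np.asarray(rho_prior_params)     # shape (nBasis, 2), from the question's code
rho_priors = loggbeta(rho_arr, params[:, 0], params[:, 1]).sum()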
You usually shouldn't use NumPy or SciPy to compute scalar values. Use 'ordinary' Python. Extending the example provided by @sega_sai:
In [11]: %timeit x = [normalvariate(0, 1) for i in range(190)]
1000 loops, best of 3: 274 µs per loop
In [12]: %timeit x = [scipy.stats.norm().rvs() for i in range(190)]
10 loops, best of 3: 180 ms per loop
In [13]: %timeit x = scipy.stats.norm().rvs(size=190)
1000 loops, best of 3: 987 µs per loop
It is faster if you make an instance of scipy.stats.norm().rvs
In [14]: rvs = scipy.stats.norm().rvs
In [15]: %timeit x = [rvs() for i in range(190)]
100 loops, best of 3: 3.8 ms per loop
In [16]: %timeit x = rvs(size=190)
10000 loops, best of 3: 44 µs per loop
Also note that PyMC has complained about Scipy's probability distributions:
"Based on informal comparison using version 2.0, the distributions in PyMC tend to be approximately an order of magnitude faster than their counterparts in SciPy (using version 0.7)"
http://www.map.ox.ac.uk/media/PDF/Patil_et_al_2010.pdf
import pymc
s = pymc.Normal('s', 0, 1)
%timeit x = [s.rand() for i in range(190)]
100 loops, best of 3: 3.76 ms per loop
Also note that Scipy without individual instancing at each iteration is faster:
generate = scipy.stats.norm().rvs
%timeit x = [generate() for i in range(190)]
100 loops, best of 3: 7.98 ms per loop
Try doing this:
import psyco
psyco.full()
Or use PyPy; these can sometimes yield significant speed improvements, although PyPy doesn't have full NumPy support yet.
Recently I posted something about the performance of C/C++/Fortran versus Python on Stack Overflow: comparing python with c/fortran. What I concluded from that post was that it is better to combine Python with a low-level programming language than to use Python itself for numeric computations. I am actually using F2PY.