Optimizing a simple CPU-bound function with Python multiprocessing

I am trying to understand how multiprocessing.Pool works, and I have developed a minimal example that illustrates my question. Briefly, I am using pool.map to parallelize a CPU-bound function operating on an array, following the example "Dead simple example of using Multiprocessing Queue, Pool and Locking". When I follow that pattern, I get only a modest speedup with 4 cores, but if I instead manually chunk the array into num_threads pieces and then use pool.map over the chunks, I find speedup factors that vastly exceed 4x, which makes no sense to me. Details follow.
First, the function definitions.
import multiprocessing

import numpy as np

def take_up_time():
    n = 1e3
    while n > 0:
        n -= 1

def count_even_numbers(x):
    take_up_time()
    return np.where(np.mod(x, 2) == 0, 1, 0)
Now define the functions we'll benchmark.
First the function that runs in serial:
def serial(arr):
    return np.sum(map(count_even_numbers, arr))
Now the function that uses Pool.map in the "standard" way:
def parallelization_strategy1(arr):
    num_threads = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_threads)
    result = pool.map(count_even_numbers, arr)
    pool.close()
    return np.sum(result)
Finally, the second strategy, in which I manually chunk the array and then run Pool.map over the chunks (the splitting helper comes from "python numpy split array into unequal subarrays"):
def split_padded(a, n):
    """Simple helper function for strategy 2"""
    padding = (-len(a)) % n
    if padding == 0:
        return np.split(a, n)
    else:
        sub_arrays = np.split(np.concatenate((a, np.zeros(padding))), n)
        sub_arrays[-1] = sub_arrays[-1][:-padding]
        return sub_arrays
def parallelization_strategy2(arr):
    num_threads = multiprocessing.cpu_count()
    sub_arrays = split_padded(arr, num_threads)
    pool = multiprocessing.Pool(num_threads)
    result = pool.map(count_even_numbers, sub_arrays)
    pool.close()
    return np.sum(np.array(result))
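As a quick illustrative check (not part of the original post), split_padded pads and splits an array whose length is not divisible by n, then trims the padding off the last chunk:

arr = np.arange(10)
print(split_padded(arr, 4))
# [array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.]), array([9.])]
# (the values come back as floats because the zero padding is float64)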
Here is my array input:
npts = 1e3
arr = np.arange(npts)
Now I use IPython's %timeit magic to run my timings, and for 1e3 points I get the following:
serial: 10 loops, best of 3: 98.7 ms per loop
parallelization_strategy1: 10 loops, best of 3: 77.7 ms per loop
parallelization_strategy2: 10 loops, best of 3: 22 ms per loop
Since I have 4 cores, strategy 1 gives a disappointingly modest speedup, while strategy 2 is suspiciously faster than the 4x maximum.
When I increase npts to 1e4, the results are even more perplexing:
serial: 1 loops, best of 3: 967 ms per loop
parallelization_strategy1: 1 loops, best of 3: 596 ms per loop
parallelization_strategy2: 10 loops, best of 3: 22.9 ms per loop
So the two sources of confusion are:
Strategy 2 is way faster than the naive theoretical limit
For some reason, %timeit with npts=1e4 only triggers 1 loop for serial and strategy 1, but 10 loops for strategy 2.

Turns out your example fits perfectly in the Pythran model. Compiling the following source code count_even.py:
#pythran export count_even(int [:])
import numpy as np

def count_even_numbers(x):
    return np.where(np.mod(x, 2) == 0, 1, 0)

def count_even(arr):
    s = 0
    #omp parallel for reduction(+:s)
    for elem in arr:
        s += count_even_numbers(elem)
    return s
with the command line (-fopenmp activates the handling of the OpenMP annotations):
pythran count_even.py -fopenmp
And running timeit over this already yields massive speedups thanks to the conversion to native code:
Without Pythran
$ python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
verryyy long, more than several minutes :-/
With Pythran, one core
$ OMP_NUM_THREADS=1 python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
100 loops, best of 3: 10.3 msec per loop
With Pythran, two cores:
$ OMP_NUM_THREADS=2 python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
100 loops, best of 3: 5.5 msec per loop
twice as fast, parallelization is working :-)
Note that OpenMP enables multi-threading, not multi-processing.
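As a side note (a minimal pure-NumPy sketch, not from the original answer), once the artificial take_up_time busy-wait is dropped, the counting step itself vectorizes directly and makes a handy correctness baseline:

import numpy as np

def count_even_baseline(arr):
    # Vectorized count of even entries; equivalent to summing
    # count_even_numbers over the whole array, minus the busy-wait.
    return int(np.sum(np.mod(arr, 2) == 0))

arr = np.arange(1e7, dtype=int)
assert count_even_baseline(arr) == 5000000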

Your strategies aren't doing the same!
In the first strategy, Pool.map iterates over the array, so count_even_numbers is called once for every array element (since the array is one-dimensional).
The second strategy maps over a list of arrays, so count_even_numbers is called for every array in the list.
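A quick way to see the difference (a hypothetical instrumentation sketch, not code from the original posts) is to count how many times the worker function actually runs under each strategy:

import numpy as np

def count_calls(work_items):
    # How many times count_even_numbers would be invoked if pool.map
    # were handed this iterable.
    return sum(1 for _ in work_items)

arr = np.arange(1000)
chunks = np.array_split(arr, 4)     # roughly what split_padded produces

print(count_calls(arr))     # strategy 1: 1000 calls, each paying take_up_time()
print(count_calls(chunks))  # strategy 2: 4 calls, each paying take_up_time() once

Strategy 2 therefore pays the 1e3-iteration busy-wait only num_threads times instead of npts times, while np.where handles each whole chunk at once. Its speedup can exceed the 4x core count because it is doing far less total work, not because it parallelizes the same work better.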

Related

numba guvectorize target='parallel' slower than target='cpu'

I've been attempting to optimize a piece of python code that involves large multi-dimensional array calculations. I am getting counterintuitive results with numba. I am running on an MBP, mid 2015, 2.5 GHz i7 quadcore, OS 10.10.5, python 2.7.11. Consider the following:
import numpy as np
from numba import jit, vectorize, guvectorize
import numexpr as ne
import timeit
def add_two_2ds_naive(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

@jit
def add_two_2ds_jit(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

@guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
             '(n,m),(n,m)->(n,m)', target='cpu')
def add_two_2ds_cpu(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

@guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
             '(n,m),(n,m)->(n,m)', target='parallel')
def add_two_2ds_parallel(A, B, res):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            res[i,j] = A[i,j]+B[i,j]

def add_two_2ds_numexpr(A, B, res):
    res = ne.evaluate('A+B')

if __name__ == "__main__":
    np.random.seed(69)
    A = np.random.rand(10000,100)
    B = np.random.rand(10000,100)
    res = np.zeros((10000,100))
I can now run timeit on the various functions:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop
It seems that 'parallel' is not even using the majority of a single core: its usage in top shows that python hits ~40% CPU for 'parallel', ~100% for 'cpu', and ~300% for numexpr.
There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside your @guvectorize kernel, so there is actually nothing for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are 2D, there is only a single call to the inner function, which explains why you saw no more than 100% CPU usage.
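To illustrate (a hedged sketch along the lines of the answer, not code from the original posts), a gufunc written with a 1D core signature leaves the leading axis as a broadcast dimension that the parallel target can split across threads:

import numpy as np
from numba import guvectorize

# 1D core signature: for 2D inputs the first axis becomes a broadcast
# dimension, so the 'parallel' target can distribute rows over threads.
@guvectorize(['float64[:],float64[:],float64[:]'],
             '(m),(m)->(m)', target='parallel')
def add_rows(a, b, out):
    for j in range(a.shape[0]):
        out[j] = a[j] + b[j]

A = np.random.rand(10000, 100)
B = np.random.rand(10000, 100)
res = add_rows(A, B)   # one inner call per row, distributed over threads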
The best way to write the function you have above is to use a regular ufunc:
@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b
Then on my system, I see these speeds:
%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop
%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop
%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.43 ms per loop
%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop
%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 2.03 ms per loop
(This is a very similar OS X system to yours, but with OS X 10.11.)
Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it doesn't beat the simple single-threaded CPU case. I suspect that the bottleneck is due to memory (or cache) bandwidth, but I haven't done the measurements to check that.
Generally speaking, you will see much more benefit from the parallel ufunc target if you are doing more math operations per memory element (like, say, a cosine).
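For example (a hypothetical sketch in the spirit of that advice, not benchmarked in the original answer), a ufunc that does real per-element math gives the parallel target something to chew on:

import math
import numpy as np
from numba import vectorize

# More arithmetic per element loaded from memory, so the threads are no
# longer limited mainly by memory bandwidth as they are for a bare add.
@vectorize(['float64(float64, float64)'], target='parallel')
def heavy_ufunc(a, b):
    return math.cos(a) * math.exp(-b * b) + math.sin(a * b)

A = np.random.rand(10000, 100)
B = np.random.rand(10000, 100)
out = heavy_ufunc(A, B)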

Cython prange slower for 4 threads than with range

I am currently trying to follow a simple example for parallelizing a loop with cython's prange.
I have installed OpenBlas 0.2.14 with openmp allowed and compiled numpy 1.10.1 and scipy 0.16 from source against openblas. To test the performance of the libraries I am following this example: http://nealhughes.net/parallelcomp2/.
The functions to be timed are copied from the site:
import numpy as np
from math import exp
from libc.math cimport exp as c_exp
from cython.parallel import prange, parallel

def array_f(X):
    Y = np.zeros(X.shape)
    index = X > 0.5
    Y[index] = np.exp(X[index])
    return Y

def c_array_f(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i

    for i in range(N):
        if X[i] > 0.5:
            Y[i] = c_exp(X[i])
        else:
            Y[i] = 0
    return Y

def c_array_f_multi(double[:] X):
    cdef int N = X.shape[0]
    cdef double[:] Y = np.zeros(N)
    cdef int i

    with nogil, parallel():
        for i in prange(N):
            if X[i] > 0.5:
                Y[i] = c_exp(X[i])
            else:
                Y[i] = 0
    return Y
The author of the code reports the following speed-ups for 4 cores:
from thread_demo import *
import numpy as np
X = -1 + 2*np.random.rand(10000000)
%timeit array_f(X)
1 loops, best of 3: 222 ms per loop
%timeit c_array_f(X)
10 loops, best of 3: 87.5 ms per loop
%timeit c_array_f_multi(X)
10 loops, best of 3: 22.4 ms per loop
When I run these examples on my machine (a MacBook Pro with OS X 10.10), I get the following timings for export OMP_NUM_THREADS=1:
In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.7 ms per loop
In [5]: %timeit c_array_f_multi(X)
1 loops, best of 3: 343 ms per loop
and for OMP_NUM_THREADS=4
In [1]: from bla import *
In [2]: import numpy as np
In [3]: X = -1 + 2*np.random.rand(10000000)
In [4]: %timeit c_array_f(X)
10 loops, best of 3: 89.5 ms per loop
In [5]: %timeit c_array_f_multi(X)
10 loops, best of 3: 119 ms per loop
I see the same behavior on an openSUSE machine, hence my question: how can the author get a 4x speed-up while the same code runs slower with 4 threads on two of my systems?
The setup script for generating the *.c and *.so files is also identical to the one used in the blog post.
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

ext_modules = [
    Extension("bla",
              ["bla.pyx"],
              libraries=["m"],
              extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
              extra_link_args=['-fopenmp'],
              include_dirs=[np.get_include()]
              )
]

setup(
    name="bla",
    cmdclass={"build_ext": build_ext},
    ext_modules=ext_modules
)
Would be great if someone could explain to me why this happens.
1) An important feature of prange (like any other parallel for loop) is that it activates out-of-order execution, which means that the loop can execute in any arbitrary order. Out-of-order execution really pays off when you have no data dependency between iterations.
I do not know the internals of Cython, but I reckon that if bounds checking is not turned off, the loop cannot be executed arbitrarily, since the next iteration will depend on whether or not the array goes out of bounds in the current iteration, hence the problem becomes almost serial as threads have to wait for the result. This is one of the issues with your code. In fact Cython gives me the following warning:
warning: bla.pyx:42:16: Use boundscheck(False) for faster access
So add the following
from cython import boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def c_array_f(double[:] X):
    # Rest of your code

@boundscheck(False)
@wraparound(False)
def c_array_f_multi(double[:] X):
    # Rest of your code
Let's now time them with your data X = -1 + 2*np.random.rand(10000000).
With Bounds Checking:
In [2]:%timeit array_f(X)
10 loops, best of 3: 189 ms per loop
In [4]:%timeit c_array_f(X)
10 loops, best of 3: 93.6 ms per loop
In [5]:%timeit c_array_f_multi(X)
10 loops, best of 3: 103 ms per loop
Without Bounds Checking:
In [9]:%timeit c_array_f(X)
10 loops, best of 3: 84.2 ms per loop
In [10]:%timeit c_array_f_multi(X)
10 loops, best of 3: 42.3 ms per loop
These results are with num_threads=4 (I have 4 logical cores) and the speed-up is around 2x. Before getting further we can still shave off a few more ms by declaring our arrays to be contiguous i.e. declaring X and Y with double[::1].
Contiguous Arrays:
In [14]:%timeit c_array_f(X)
10 loops, best of 3: 81.8 ms per loop
In [15]:%timeit c_array_f_multi(X)
10 loops, best of 3: 39.3 ms per loop
2) Even more important is job scheduling, and this is what your benchmark suffers from. By default the chunk sizes are determined at compile time, i.e. schedule=static; however, it is very likely that the environment variables (for instance OMP_SCHEDULE) and the workload of the two machines (yours and the one from the blog post) are different, and that they schedule the jobs at runtime: dynamically, with guided scheduling, and so on. Let's experiment with it by replacing your prange with:
for i in prange(N, schedule='static'):
    # static scheduling...

for i in prange(N, schedule='dynamic'):
    # dynamic scheduling...
Let's time them now (only the multi-threaded code):
Scheduling Effect:
In [23]:%timeit c_array_f_multi(X) # static
10 loops, best of 3: 39.5 ms per loop
In [28]:%timeit c_array_f_multi(X) # dynamic
1 loops, best of 3: 319 ms per loop
You might be able to replicate this depending on the workload on your own machine. As a side note, since you are just trying to measure the performance of parallel vs. serial code in a micro-benchmark rather than in real code, I suggest you get rid of the if-else condition, i.e. only keep Y[i] = c_exp(X[i]) within the for loop. This is because if-else statements also adversely affect branch prediction and out-of-order execution in parallel code. On my machine I get almost a 2.7x speed-up over the serial code with this change.

Python: Vectorizing evaluations of arrays of lambda functions

How would you vectorize the evaluation of arrays of lambda functions?
Here's an example to understand what I'm talking about. (And even though I'm using numpy arrays, I'm not limiting myself to only using numpy.)
Let's say I have the following numpy arrays.
array1 = np.array(["hello", 9])
array2 = np.array([lambda s: s == "hello", lambda num: num < 10])
(You could store these kinds of objects in numpy without throwing an error, believe it or not.) What I want is something akin to the following.
array2 * array1
# Return np.array([True, True]). PS: An explanation of how to `AND` all of
# booleans together quickly would be nice too.
Of course, this seems impractical for arrays of size 2, but for arrays of arbitrary sizes, I'll assume this would yield a performance boost because of all of the low level optimizations.
So, anyone know how to write this weird kind of python code?
The simple answer, of course, is that you can't easily do this with numpy (or with standard Python, for that matter). Numpy doesn't actually vectorize most operations itself, to my knowledge: it uses libraries like BLAS/ATLAS/etc that do for certain situations. Even if it did, it would do so in C for specific situations: it certainly can't vectorize Python function execution.
If you want to involve multiprocessing in this, it is possible, but it depends on your situation. Are your individual function applications time-consuming, making them feasible to send out one-by-one, or do you need a very large number of fast function executions, in which case you'd probably want to send batches of them to each process?
In general, because of what some would argue is poor fundamental design (e.g., the Global Interpreter Lock), it's very difficult with standard Python to get the lightweight parallelization you're hoping for here. There are significantly heavier methods, like the multiprocessing module or IPython.parallel, but these require some work to use.
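A minimal sketch of the batching idea (hypothetical helper names; note that the callables must be picklable top-level functions, since multiprocessing cannot pickle lambdas):

import multiprocessing
import numpy as np

def is_hello(s):
    return s == "hello"

def under_ten(n):
    return n < 10

def apply_pair(pair):
    func, value = pair
    return func(value)

if __name__ == "__main__":
    funcs = [is_hello, under_ten]
    values = ["hello", 9]
    with multiprocessing.Pool() as pool:
        # chunksize batches many (func, value) pairs per task, so workers
        # receive sizeable batches instead of one tiny job at a time.
        results = pool.map(apply_pair, zip(funcs, values), chunksize=256)
    print(np.all(results))   # AND all the booleans together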
Alright guys, I have an answer: numpy's vectorize.
Please read the edited section, though. You'll discover that Python actually optimizes this code for you, which rather defeats the purpose of using numpy arrays in this case. (Using numpy arrays does not decrease the performance, though.)
What the last test really shows is that Python lists are about as efficient as they can be, so this vectorization procedure is unnecessary. This is why I didn't mark this as the best answer.
Setup code:
def factory(i): return lambda num: num==i
array1 = list()
for i in range(10000): array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(10000))
The "unvectorized" version:
def evaluate(array1, array2):
    return [func(val) for func, val in zip(array1, array2)]
%timeit evaluate(array1, array2)
# 100 loops, best of 3: 10 ms per loop
The vectorized version
def evaluate2(func, b): return func(b)
vec_evaluate = np.vectorize(evaluate2)
vec_evaluate(array1, array2)
# 100 loops, best of 3: 2.65 ms per loop
EDIT
Okay, I just wanted to paste more benchmarks that I received using the above tests, except with different test cases.
I made a third edit, showing what happens if you simply use Python lists. Long story short: you actually won't regret it much. This test case is at the very bottom.
Test cases only involving integers
In summary, if n is small, then the unvectorized version is better. Otherwise, vectorized is the way to go.
With n = 30
%timeit evaluate(array1, array2)
# 10000 loops, best of 3: 35.7 µs per loop
%timeit vec_evaluate(array1, array2)
# 10000 loops, best of 3: 27.6 µs per loop
With n = 7
%timeit evaluate(array1, array2)
100000 loops, best of 3: 9.93 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 21.6 µs per loop
Test cases involving strings
Vectorization wins.
Setup code:
def factory(i): return lambda num: str(num)==str(i)

array1 = list()
for i in range(7):
    array1.append(factory(i))
array1 = np.array(array1)
array2 = np.array(xrange(7))
With n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 36.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 6.57 ms per loop
With n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 28.3 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 27.5 µs per loop
Random tests
Just to see how branch prediction played a role. From what I'm seeing, it didn't really change much. Vectorization still usually wins.
Setup code.
from random import random

def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i
When n = 10000
%timeit evaluate(array1, array2)
10 loops, best of 3: 25.7 ms per loop
%timeit vec_evaluate(array1, array2)
100 loops, best of 3: 4.67 ms per loop
When n = 7
%timeit evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
%timeit vec_evaluate(array1, array2)
10000 loops, best of 3: 23.1 µs per loop
Using python lists instead of numpy arrays
I ran this test to see what happened when I chose not to use the "optimized" numpy arrays, and I received some very surprising results.
The setup code is almost the same, except I'm choosing not to use numpy arrays. I'm also doing this test for only the "random" case.
def factory(i):
    if random() < 0.5:
        return lambda num: str(num) == str(i)
    return lambda num: num == i

array1 = list()
for i in range(10000): array1.append(factory(i))
array2 = range(10000)
And the "unvectorized" version:
%timeit evaluate(array1, array2)
100 loops, best of 3: 4.93 ms per loop
As you can see, this is actually pretty surprising, because it is almost the same benchmark I was getting in my random test case with the vectorized evaluate.
%timeit vec_evaluate(array1, array2)
10 loops, best of 3: 19.8 ms per loop
Likewise, if you change these into numpy arrays before using vec_evaluate, you get the same 4.5 ms benchmark.
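For completeness, a one-line sketch of that conversion (dtype=object is assumed so the lambdas survive the trip into a numpy array):

funcs_np = np.array(array1, dtype=object)
vals_np = np.array(array2)
vec_evaluate(funcs_np, vals_np)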

Why is vectorized version slower?

I have a problem where I have to do the following calculation.
I wanted to avoid the loop version, so I vectorized it.
Why is the loop version actually faster than the vectorized version?
Does anybody have an explanation for this?
Thanks
import numpy as np
from numpy.core.umath_tests import inner1d
num_vertices = 40000
num_pca_dims = 1000
num_vert_coords = 3
a = np.arange(num_vert_coords * num_vertices * num_pca_dims).reshape((num_pca_dims, num_vertices*num_vert_coords)).T
#n-by-3
norms = np.arange(num_vertices * num_vert_coords).reshape(num_vertices,-1)
#Loop version
def slowversion(a, norms):
    res_list = []
    for c_idx in range(a.shape[1]):
        curr_col = a[:,c_idx].reshape(-1,3)
        res = inner1d(curr_col, norms)
        res_list.append(res)
    res_list_conc = np.column_stack(res_list)
    return res_list_conc

#Fast version
def fastversion(a, norms):
    a_3 = a.reshape(num_vertices, 3, num_pca_dims)
    fast_res = np.sum(a_3 * norms[:,:,None], axis=1)
    return fast_res
res_list_conc = slowversion(a,norms)
fast_res = fastversion(a,norms)
assert np.all(res_list_conc == fast_res)
Your "slow code" is likely doing better because inner1d is a single optimized C++ function that can* make use of your BLAS implementation. Lets look at comparable timings for this operation:
np.allclose(inner1d(a[:,0].reshape(-1,3), norms),
np.sum(a[:,0].reshape(-1,3)*norms,axis=1))
True
%timeit inner1d(a[:,0].reshape(-1,3), norms)
10000 loops, best of 3: 200 µs per loop
%timeit np.sum(a[:,0].reshape(-1,3)*norms,axis=1)
1000 loops, best of 3: 625 µs per loop
%timeit np.einsum('ij,ij->i',a[:,0].reshape(-1,3), norms)
1000 loops, best of 3: 325 µs per loop
Using inner1d is quite a bit faster than the pure numpy operations. Note that einsum is almost twice as fast as the pure numpy expression, and for good reason. As your loop is not that large and most of the FLOPS are in the inner computation, the savings on the inner operation outweigh the cost of looping.
%timeit slowversion(a,norms)
1 loops, best of 3: 991 ms per loop
%timeit fastversion(a,norms)
1 loops, best of 3: 1.28 s per loop
#Thanks to DSM for writing this out
%timeit np.einsum('ijk,ij->ik',a.reshape(num_vertices, num_vert_coords, num_pca_dims), norms)
1 loops, best of 3: 488 ms per loop
Putting this back together, we can see that the overall advantage of the "slow version" wins out; however, using an einsum implementation, which is fairly optimized for this sort of thing, gives us a further speed increase.
*I don't see it right off in the code, but it is clearly threaded.

How can I speed up transition matrix creation in Numpy?

The following is the most basic way I know of to count transitions in a markov chain and use it to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
I've tried speeding it up in 3 different ways:
1) Using a sparse matrix one-liner based on this Matlab code:
transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))
Which in Numpy/SciPy, looks like this:
from scipy.sparse import coo_matrix

def get_sparse_counts_matrix(markov_chain, number_of_states):
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])), shape=(number_of_states, number_of_states))
And I've tried a couple more Python tweaks, like using zip():
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1
And Queues:
old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1
But none of these 3 methods sped things up. In fact, everything but the zip() solution was at least 10X slower than my original solution.
Are there any other solutions worth looking into?
Modified solution for building a transition matrix from lots of chains
The best answer to the above question specifically was DSM's. However, for anyone who wants to populate a transition matrix based on a list of millions of markov chains, the quickest way is this:
def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]), transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)

def get_fake_transitions(markov_chains):
    fake_transitions = []
    for i in xrange(1, len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1

def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than the original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix
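A quick usage sketch (illustrative data, assuming each chain is a sequence of integer state indices and that numpy and itertools are imported):

import itertools
import numpy

markov_chains = [[0, 1, 1, 2], [2, 0, 1], [1, 2, 2, 0]]
counts = fast_get_transition_counts_matrix(markov_chains, number_of_states=3)
# counts[i, j] is how often state i is immediately followed by state j
# within a chain; the cross-chain "fake" transitions are subtracted out.
print(counts)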
Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that involves just adding a decorator (although I've made a direct call so I could test the jit variants that numba provides here):
import numpy as np
import numba
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)

t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))
And then timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop
The autojit method does some guessing based on runtime inputs, and the jit function has types dictated. You have to be a little careful since numba at these early stages doesn't communicate that there was an error with jit if you pass in the wrong type for an input. It will just spit out an incorrect answer.
That said though, getting a 35x and 485x speed-up without any code change, just by adding a call to numba (which can also be applied as a decorator), is pretty impressive in my book. You could probably get similar results using cython, but it would require a bit more boilerplate and writing a setup.py file.
I also like this solution because the code remains readable and you can write it the way you originally thought about implementing the algorithm.
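For reference, a sketch of the decorator form mentioned above (hedged: this uses the modern @numba.jit spelling rather than the autojit/argtypes API timed above):

import numpy as np
import numba

@numba.jit(nopython=True)
def increment_counts_jit(markov_chain, transition_counts_matrix):
    # Same loop as before; Numba compiles it to native code on first call.
    for i in range(1, len(markov_chain)):
        transition_counts_matrix[markov_chain[i - 1], markov_chain[i]] += 1

t = np.random.randint(0, 50, 500)
m = np.zeros((50, 50))
increment_counts_jit(t, m)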
How about something like this, taking advantage of np.bincount? Not super-robust, but functional. [Thanks to @Warren Weckesser for the setup.]
import numpy as np
from collections import Counter

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = counts.values()

def using_bincount(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)
which gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop
[edit]
Avoiding flat (at the cost of not working in-place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop
Here's a faster method. The idea is to count the number of occurrences of each transition, and use the counts in a vectorized update of the matrix. (I'm assuming that the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the number of occurrences of each transition.
from collections import Counter

def update_matrix(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += counts.values()
Timing example, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop
It's faster, but not orders of magnitude faster. For a real speed up, you might consider implementing this in Cython.
OK, here are a few ideas to tinker with, with some slight improvement (at the cost of human understanding).
Let's start with a random vector of integers between 0 and 9 of length 3000:
L = 3000
N = 10
states = np.random.randint(N, size=L)
transitions = np.zeros((N,N))
Your method, on my machine, has a timeit performance of 11.4 ms.
The first small improvement is to avoid reading the data twice, storing it in a temporary variable instead:
old = states[0]
for i in range(1, len(states)):
    new = states[i]
    transitions[new, old] += 1
    old = new
This gives you a ~10% improvement and drops the time to 10.9 ms.
A more involved approach uses strides:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

state_2 = rolling(states, 2)
for i in range(len(state_2)):
    l, m = state_2[i,0], state_2[i,1]
    transitions[m,l] += 1
The strides let you read the consecutive numbers of the array by tricking the array into thinking that the rows start at a different offset (OK, it's not well described, but if you take some time to read about strides you will get it).
This approach loses performance, going to 12.2 ms, but it is the gateway to tricking the system even more. Flattening both the transition matrix and the strided array to one-dimensional arrays, you can speed up the performance a little more:
transitions = np.zeros(N*N)
state_2 = rolling(states, 2)
state_flat = np.sum(state_2 * np.array([1, 10]), axis=1)

for i in state_flat:
    transitions[i] += 1
transitions = transitions.reshape((N,N))
This goes down to 7.75 ms. It's not an order of magnitude, but it's 30% better anyway :)
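As a hedged aside (not part of the original answer), the final counting loop over state_flat is just a histogram, so it can also be written with np.bincount, which ties back to the bincount-based answers above:

# Equivalent to the loop over state_flat, assuming state_flat holds
# flattened indices in the range [0, N*N).
transitions = np.bincount(state_flat, minlength=N * N).reshape((N, N))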
