Parallelize np.searchsorted - python

Is there a way to parallelize the implementation of np.searchsorted()?
I have a situation where the base array a and value array v are of the same order of size. From what I understand of the search sorted algorithm, it does the operation for each element in v in turn. I'd like to parallelize it so it performs this sorting on multiple elements of v at the same time. (If I have 32 cores, it should be able to sort 32 elements at once, right?) Is there a way to implement this?
I tried to use Numba @jit(nopython=True, nogil=True, parallel=True), but it shows no improvement in speed and no increase in CPU usage.
For reference, a and v are lists of integers with length on the order of 10^7 elements.

A trivially parallelized np.searchsorted works for me: between 3.4x and 3.9x speedup on a 2-core x 2-threads Colab instance with a and b of length 10**7 (e.g. 9.94 s vs. 2.62 s), using numba 0.55.1 with the OMP threading layer.
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def searchsorted_parallel(a, b):
    res = np.empty(len(b), np.intp)
    for i in nb.prange(len(b)):
        res[i] = np.searchsorted(a, b[i])
    return res
Running a micro-benchmark:
import numpy as np
a = np.random.randint(200000, size=10**7)
b = np.random.randint(200000, size=10**7)
a.sort()
r = [0,0]
%timeit -r1 -n1 r[0] = np.searchsorted(a,b)
#1 loop, best of 1: 9.31 s per loop
%timeit -r1 -n1 r[1] = searchsorted_parallel(a,b)
#1 loop, best of 1: 2.36 s per loop
np.testing.assert_array_equal(r[0], r[1])
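If the parallel version shows no speedup and leaves the extra cores idle (as described in the question), it can help to check which threading layer Numba actually picked and how many threads it is using. A minimal sketch for a fresh session, assuming Numba >= 0.49 so that numba.set_num_threads is available; the choice of the OMP layer here is just an example, not a requirement:
import numba as nb

# Request the OpenMP threading layer before the first parallel call;
# Numba falls back to another layer if OpenMP support is not installed.
nb.config.THREADING_LAYER = 'omp'

# Use every thread Numba detected (pass a smaller number to pin it down).
nb.set_num_threads(nb.config.NUMBA_NUM_THREADS)

r_par = searchsorted_parallel(a, b)   # compile and run the parallel version
print(nb.threading_layer())           # reports the layer that was actually used
print(nb.get_num_threads())           # and how many threads it ran with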

Related

Numpy optimization with Numba

I have two sets of points on a sphere, labelled 'obj' and 'ps' in the code example below. I would like to identify all 'obj' points that are closer than a certain angular distance from a 'ps' point.
My take on this is to represent each point by a 3D unit vector, and to compare their dot products to cos(maximum separation). This can be done easily with numpy broadcasting, but in my application I have n_obj ~ 500,000 and n_ps ~ 50,000, so the memory requirements of broadcasting are too large. Below I have pasted my current take using numba. Can this be optimized further?
from numba import jit
import numpy as np
from sklearn.preprocessing import normalize

def gen_points(n):
    """
    generate random 3D unit vectors (not uniform, but irrelevant here)
    """
    vec = 2*np.random.rand(n, 3) - 1.
    vec_norm = normalize(vec)
    return vec_norm

#@jit(nopython=True)
@jit
def angdist_threshold_numba(vec_obj, vec_ps, cos_maxsep):
    """
    finds obj that are closer than maxsep to a ps
    """
    nps = len(vec_ps)
    nobj = len(vec_obj)
    #closeobj_all = []
    closeobj_all = np.empty(0)
    dotprod = np.empty(nobj)
    a = np.arange(nobj)
    for ps in range(nps):
        np.sum(vec_obj*vec_ps[ps], axis=1, out=dotprod)
        #closeobj_all.extend(a[dotprod > cos_maxsep])
        closeobj_all = np.append(closeobj_all, a[dotprod > cos_maxsep])
    return closeobj_all

vec_obj = gen_points(50000)  # in reality ~500,000
vec_ps = gen_points(5000)    # in reality ~50,000
cos_maxsep = np.cos(0.003)
closeobj_all = np.unique(angdist_threshold_numba(vec_obj, vec_ps, cos_maxsep))
This is the performance using the test case given in the code:
%timeit np.unique(angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep))
1 loops, best of 3: 4.53 s per loop
I have tried to speed it up using
@jit(nopython=True)
but this fails with
NotImplementedError: Failed at nopython (nopython frontend)
(<class 'numba.ir.Expr'>, build_list(items=[]))
Edit: After updating numba to 0.26, the creation of the empty list fails even in python mode. This can be fixed by replacing it with np.empty(0), and the .extend() with np.append(); see above. This barely changes the performance.
According to https://github.com/numba/numba/issues/858 np.empty() is now supported in nopython mode, but I still can't run this with @jit(nopython=True):
TypingError: Internal error at <numba.typeinfer.CallConstraint object at 0x7ff3114a9310>
Unlike list.append, you should never call numpy.append in a loop: even appending a single element requires copying the whole array. Because you're only interested in the unique obj, you could use a Boolean array to flag the matches found so far.
As for Numba, it works best if you write out all the loops. So for example:
@jit(nopython=True)
def numba2(vec_obj, vec_ps, cos_maxsep):
    nps = vec_ps.shape[0]
    nobj = vec_obj.shape[0]
    dim = vec_obj.shape[1]
    found = np.zeros(nobj, np.bool_)
    for i in range(nobj):
        for j in range(nps):
            cos = 0.0
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            if cos > cos_maxsep:
                found[i] = True
                break
    return found.nonzero()
The added benefit is that we can break out of the loop over the ps array as soon as we find a match to the current obj.
You can gain some more speed by specializing the function for 3 dimensional spaces. Also, for some reason, passing all arrays and relevant dimensions into a helper function results in another speedup:
def numba3(vec_obj, vec_ps, cos_maxsep):
    nps = len(vec_ps)
    nobj = len(vec_obj)
    out = np.zeros(nobj, bool)
    numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj)
    return np.flatnonzero(out)

@jit(nopython=True)
def numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj):
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                out[i] = True
                break
    return out
Timings I get for 20,000 obj and 2,000 ps:
%timeit angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep)
1 loop, best of 3: 2.99 s per loop
%timeit numba2(vec_obj, vec_ps, cos_maxsep)
1 loop, best of 3: 444 ms per loop
%timeit numba3(vec_obj, vec_ps, cos_maxsep)
10 loops, best of 3: 134 ms per loop
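For reference, a hedged further step that is not part of the original answer: since the iterations over i are independent, the outer loop can also be parallelized with Numba's prange, in the same spirit as the parallel searchsorted at the top of this page. A minimal sketch, assuming a Numba version with working parallel=True support:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def numba2_parallel(vec_obj, vec_ps, cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    found = np.zeros(nobj, np.bool_)
    for i in prange(nobj):              # each obj is handled independently
        for j in range(nps):
            cos = (vec_obj[i, 0] * vec_ps[j, 0] +
                   vec_obj[i, 1] * vec_ps[j, 1] +
                   vec_obj[i, 2] * vec_ps[j, 2])
            if cos > cos_maxsep:
                found[i] = True
                break
    return found

close_idx = np.flatnonzero(numba2_parallel(vec_obj, vec_ps, cos_maxsep))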

Optimizing a simple CPU bound function with python multiprocessing

I am trying to understand how multiprocessing.Pool works, and I have developed a minimal example that illustrates my question. Briefly, I am using pool.map to parallelize a CPU-bound function operating on an array, following the example "Dead simple example of using Multiprocessing Queue, Pool and Locking". When I follow that pattern, I get only a modest speedup with 4 cores, but if I instead manually chunk the array into num_threads sub-arrays and then use pool.map over the chunks, I find speedup factors that vastly exceed 4x, which makes no sense to me. Details follow.
First, the function definitions.
import multiprocessing
import numpy as np

def take_up_time():
    n = 1e3
    while n > 0:
        n -= 1

def count_even_numbers(x):
    take_up_time()
    return np.where(np.mod(x, 2) == 0, 1, 0)
Now define the functions we'll benchmark.
First the function that runs in serial:
def serial(arr):
    return np.sum(list(map(count_even_numbers, arr)))
Now the function that uses Pool.map in the "standard" way:
def parallelization_strategy1(arr):
    num_threads = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_threads)
    result = pool.map(count_even_numbers, arr)
    pool.close()
    return np.sum(result)
Finally, the second strategy in which I manually chunk the array and then run Pool.map over the chunks (Splitting solution due to python numpy split array into unequal subarrays)
def split_padded(a, n):
    """ Simple helper function for strategy 2
    """
    padding = (-len(a)) % n
    if padding == 0:
        return np.split(a, n)
    else:
        sub_arrays = np.split(np.concatenate((a, np.zeros(padding))), n)
        sub_arrays[-1] = sub_arrays[-1][:-padding]
        return sub_arrays
def parallelization_strategy2(arr):
    num_threads = multiprocessing.cpu_count()
    sub_arrays = split_padded(arr, num_threads)
    pool = multiprocessing.Pool(num_threads)
    result = pool.map(count_even_numbers, sub_arrays)
    pool.close()
    return np.sum(np.array(result))
Here is my array input:
npts = 1e3
arr = np.arange(npts)
Now I use the IPython %timeit function to run my timings, and for 1e3 points I get the following:
serial: 10 loops, best of 3: 98.7 ms per loop
parallelization_strategy1: 10 loops, best of 3: 77.7 ms per loop
parallelization_strategy2: 10 loops, best of 3: 22 ms per loop
Since I have 4 cores, Strategy 1 is a disappointingly modest speedup, and strategy 2 is suspiciously larger than the maximum 4x speedup.
When I increase npts to 1e4, the results are even more perplexing:
serial: 1 loops, best of 3: 967 ms per loop
parallelization_strategy1: 1 loops, best of 3: 596 ms per loop
parallelization_strategy2: 10 loops, best of 3: 22.9 ms per loop
So the two sources of confusion are:
Strategy 2 is way faster than the naive theoretical limit
For some reason, %timeit with npts=1e4 only triggers 1 loop for serial and strategy 1, but 10 loops for strategy 2.
Turns out your example fits perfectly in the Pythran model. Compiling the following source code count_even.py:
#pythran export count_even(int [:])
import numpy as np

def count_even_numbers(x):
    return np.where(np.mod(x, 2) == 0, 1, 0)

def count_even(arr):
    s = 0
    #omp parallel for reduction(+:s)
    for elem in arr:
        s += count_even_numbers(elem)
    return s
with the command line (-fopenmp activates the handling of the OpenMP annotations):
pythran count_even.py -fopenmp
And running timeit over this already yields massive speedups thanks to the conversion to native code:
Without Pythran
$ python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
verryyy long, more than several minutes :-/
With Pythran, one core
$ OMP_NUM_THREADS=1 python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
100 loops, best of 3: 10.3 msec per loop
With Pythran, two cores:
$ OMP_NUM_THREADS=2 python -m timeit -s 'import numpy as np; arr = np.arange(1e7, dtype=int); from count_even import count_even' 'count_even(arr)'
100 loops, best of 3: 5.5 msec per loop
twice as fast, parallelization is working :-)
Note that OpenMP enables multi-threading, not multi-processing.
Your strategies aren't doing the same thing!
In the first strategy, the Pool.map iterates over an array, so count_even_numbers is called for every array item (since the shape of the array is one-dimensional).
The second strategy maps over a list of arrays, so count_even_numbers is called for every array in the list.
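A minimal sketch of the difference, using hypothetical sizes just for illustration: iterating over a 1-D array yields scalars, while iterating over a list of chunks yields whole arrays, so np.where runs once per chunk instead of once per element:
import numpy as np

arr = np.arange(8)

# Strategy 1: Pool.map(count_even_numbers, arr) sees one scalar per task.
tasks_strategy1 = list(arr)                 # 8 tasks: 0, 1, 2, ..., 7
# Strategy 2: Pool.map(count_even_numbers, sub_arrays) sees one chunk per task.
tasks_strategy2 = np.array_split(arr, 4)    # 4 tasks: array([0, 1]), array([2, 3]), ...

print(len(tasks_strategy1), len(tasks_strategy2))  # 8 tasks vs. 4 tasks
Each strategy-2 task amortizes the inter-process overhead over a whole chunk, and np.where is vectorized within the chunk, which is why the apparent speedup exceeds the core count.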

How to apply a function to a 2D numpy array with multiprocessing

Suppose I have the following function:
def f(x, y):
    return x * y
How do I apply the function to each element in an NxM 2D numpy array using the multiprocessing module? Using serial iteration, the code might look as follows:
import numpy as np

N = 10
M = 12
results = np.zeros(shape=(N, M))
for x in range(N):
    for y in range(M):
        results[x, y] = f(x, y)
Here's how you might parallelize your example function using multiprocessing. I've also included an almost identical pure Python function that uses non-parallel for loops, and a numpy one-liner that achieves the same result:
import numpy as np
from multiprocessing import Pool

def f(x, y):
    return x * y

# this helper function is needed because map() can only be used for functions
# that take a single argument (see http://stackoverflow.com/q/5442910/1461210)
def splat_f(args):
    return f(*args)

# a pool of 8 worker processes
pool = Pool(8)

def parallel(M, N):
    results = pool.map(splat_f, ((i, j) for i in range(M) for j in range(N)))
    return np.array(results).reshape(M, N)

def nonparallel(M, N):
    out = np.zeros((M, N), int)
    for i in range(M):
        for j in range(N):
            out[i, j] = f(i, j)
    return out
def broadcast(M, N):
    # multiply an (M, 1) column by a (1, N) row via broadcasting
    return np.multiply(*np.ogrid[:M, :N])
Now let's look at the performance:
%timeit parallel(1000, 1000)
# 1 loops, best of 3: 1.67 s per loop
%timeit nonparallel(1000, 1000)
# 1 loops, best of 3: 395 ms per loop
%timeit broadcast(1000, 1000)
# 100 loops, best of 3: 2 ms per loop
The non-parallel pure Python version beats the parallelized version by a factor of about 4, and the version using numpy array broadcasting absolutely crushes the other two.
The problem is that starting and stopping Python subprocesses carries quite a lot of overhead, and your test function is so trivial that each worker process spends only a tiny proportion of its lifetime doing useful work. Multiprocessing only makes sense if each worker has a substantial amount of work to do before it is killed. You might, for example, give each worker a bigger chunk of the output array to compute (try messing around with the chunksize= parameter to pool.map()), but with such a trivial example I doubt you'll see a big improvement.
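For instance, a hedged sketch of handing pool.map larger batches of work via chunksize; the value 50000 below is purely illustrative, not a tuned recommendation:
import numpy as np
from multiprocessing import Pool

def f(x, y):
    return x * y

def splat_f(args):
    return f(*args)

def parallel_chunked(M, N, chunksize=50000):
    # chunksize batches many (i, j) pairs into each inter-process dispatch,
    # so every worker does a substantial amount of work per message.
    with Pool(8) as pool:
        results = pool.map(splat_f,
                           ((i, j) for i in range(M) for j in range(N)),
                           chunksize=chunksize)
    return np.array(results).reshape(M, N)

if __name__ == "__main__":
    out = parallel_chunked(1000, 1000)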
I don't know what your actual code looks like - maybe your function is big and expensive enough to warrant using multiprocessing. However, I would bet that there are much better ways to improve its performance.
Not sure multiprocessing is needed in your case. In the simple example above, you can do
X, Y = numpy.meshgrid(numpy.arange(10), numpy.arange(12))
result = X*Y

Are there function calls that can replace the for loops in this code?

This code fragment is a bottleneck in a project of mine. Are there function calls that could replace the for loops and speed it up?
D = np.zeros((nOcc,nOcc,nVir,nVir))
for i in range(nOcc):
    for j in range(i+1):
        tmp = Ew[i] + Ew[j]
        for a in range(nVir):
            tmp2 = tmp - Ew[a+nOcc]
            for b in range(a+1):
                tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                D[j,i,a,b] = D[i,j,b,a]
                D[j,i,b,a] = D[i,j,a,b]
To start off, let's generate some arbitrary data that obeys a few required principles:
nOcc = 30
nVir = 120
Ew = np.random.rand(nOcc+nVir)
Ew[:nOcc]*=-1
Ia = np.random.rand(nOcc)
Ib = np.random.rand(nVir)
I = np.einsum('a,b,c,d->abcd',Ia,Ib,Ia,Ib)
Let's wrap your base code as an example:
def oldcalc_D(Iiajb, nOcc, nVir, Ew):
    D = np.zeros((nOcc,nOcc,nVir,nVir))
    for i in range(nOcc):
        for j in range(i+1):
            tmp = Ew[i] + Ew[j]
            for a in range(nVir):
                tmp2 = tmp - Ew[a+nOcc]
                for b in range(a+1):
                    tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                    D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                    D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                    D[j,i,a,b] = D[i,j,b,a]
                    D[j,i,b,a] = D[i,j,a,b]
    return D
Taking advantage of integral symmetry is typically a good tactic; however, in numpy alone it is not worth the cost so we are going to ignore this and simply vectorize your code:
def newcalc_D(I, nOcc, nVir, Ew):
    O = Ew[:nOcc]
    V = Ew[nOcc:]
    D = O[:,None,None,None] - V[:,None,None] + O[:,None] - V
    return (I/D).swapaxes(1,2)
Some timings:
np.allclose(oldcalc_D(I,nOcc,nVir,Ew),newcalc_D(I,nOcc,nVir,Ew))
True
%timeit newcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 142 ms per loop
%timeit oldcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 15 s per loop
So only about ~100x faster; as I said, this is a fairly simple pass to give you an idea of what to do. This can be done much better, but it should be a trivial part of the calculation, as the integral transformation scales as O(N^5) while this step is O(N^4). For these operations I use numba's autojit feature:
from numba import autojit
numba_D = autojit(oldcalc_D)
%timeit numba_D(I,nOcc,nVir,Ew)
10 loops, best of 3: 55.1 ms per loop
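A note for newer environments (an assumption about your setup, not part of the original answer): autojit has since been removed from Numba, and lazy compilation is now spelled numba.njit, which infers the argument types on the first call just as autojit did:
from numba import njit

numba_D = njit(oldcalc_D)            # lazy compilation; types inferred on first call
D_fast = numba_D(I, nOcc, nVir, Ew)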
Unless this is Python 3, you may want to start by replacing range with xrange: the former creates the entire list, while the latter is just an iterator, which is all you need in this case.
For large N, the speed difference should be noticeable.
Also, seeing as you're using numpy, there is probably a vectorized way to implement the algorithm. If that's the case, the vectorized implementation should be orders of magnitude faster. But unless you explain the variables and the algorithm, we can't help you in that direction.

How can I speed up transition matrix creation in Numpy?

The following is the most basic way I know of to count transitions in a markov chain and use it to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
I've tried speeding it up in 3 different ways:
1) Using a sparse matrix one-liner based on this Matlab code:
transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))
Which in Numpy/SciPy, looks like this:
from scipy.sparse import coo_matrix

def get_sparse_counts_matrix(markov_chain, number_of_states):
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])),
                      shape=(number_of_states, number_of_states))
And I've tried a couple more Python tweaks, like using zip():
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1
And Queues:
from queue import Queue  # Queue.Queue on Python 2

old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1
But none of these 3 methods sped things up. In fact, everything but the zip() solution was at least 10X slower than my original solution.
Are there any other solutions worth looking into?
Modified solution for building a transition matrix from lots of chains
The best answer to the above question specifically was DSM's. However, for anyone who wants to populate a transition matrix based on a list of millions of markov chains, the quickest way is this:
import itertools
import numpy

def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]),
                                          transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)
def get_fake_transitions(markov_chains):
    fake_transitions = []
    for i in xrange(1, len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1
def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix
Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that involves just adding a decorator (although I've made a direct call so I could test the jit variants that numba provides here):
import numpy as np
import numba

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1], numba.double[:,::1]])(increment_counts_in_matrix_from_chain)

t = np.random.randint(0, 50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))
And then timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop
The autojit method does some guessing based on runtime inputs, and the jit function has types dictated. You have to be a little careful since numba at these early stages doesn't communicate that there was an error with jit if you pass in the wrong type for an input. It will just spit out an incorrect answer.
That said though, getting a 35x and 485x speed-up without any code change, just by adding a call to numba (which can also be applied as a decorator), is pretty impressive in my book. You could probably get similar results using cython, but it would require a bit more boilerplate and writing a setup.py file.
I also like this solution because the code remains readable and you can write it the way you originally thought about implementing the algorithm.
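For readers on a current Numba version (again an assumption, not the setup used above): autojit and the argtypes keyword are gone, eager compilation takes a signature instead, and passing the wrong types now raises an error rather than silently producing a wrong answer. A minimal sketch (on Python 3, first replace xrange with range in the function body):
import numba

# Eagerly compile with an explicit signature: a contiguous 1-D int64 chain
# and a contiguous 2-D float64 counts matrix, returning nothing.
jit_func = numba.njit("void(int64[::1], float64[:,::1])")(
    increment_counts_in_matrix_from_chain)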
How about something like this, taking advantage of np.bincount? Not super-robust, but functional. [Thanks to @Warren Weckesser for the setup.]
import numpy as np
from collections import Counter

def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = list(counts.values())

def using_bincount(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)
which gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop
[edit]
Avoiding flat (at the cost of not working in-place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop
Here's a faster method. The idea is to count the number of occurrences of each transition, and use the counts in a vectorized update of the matrix. (I'm assuming that the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the number of occurrences of each transition.
from collections import Counter

def update_matrix(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += list(counts.values())
Timing example, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop
It's faster, but not orders of magnitude faster. For a real speed up, you might consider implementing this in Cython.
OK, a few ideas to tinker with, with some slight improvement (at the cost of human understanding).
Let's start with a random vector of integers between 0 and 9 of length 3000:
L = 3000
N = 10
states = np.random.randint(N, size=L)
transitions = np.zeros((N, N))
Your method, on my machine, has a timeit performance of 11.4 ms.
The first small improvement is to avoid reading the data twice by storing it in a temporary variable:
old = states[0]
for i in range(1, len(states)):
    new = states[i]
    transitions[new, old] += 1
    old = new
This gives you a ~10% improvement and drops the time to 10.9 ms.
A more involved approach uses strides:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

state_2 = rolling(states, 2)
for i in range(len(state_2)):
    l, m = state_2[i, 0], state_2[i, 1]
    transitions[m, l] += 1
The strides let you read consecutive pairs of numbers from the array by tricking it into thinking each row starts one element later (OK, it's not well described, but if you take some time to read about strides you will get it).
This approach actually loses performance, going to 12.2 ms, but it is the gateway to tricking the system even more. Flattening both the transition matrix and the strided array into one-dimensional arrays speeds things up a little more:
transitions = np.zeros(N*N)
state_2 = rolling(states, 2)
state_flat = np.sum(state_2 * np.array([1, 10]), axis=1)
for i in state_flat:
    transitions[i] += 1
transitions.reshape((N, N))
This goes down to 7.75 ms. It's not an order of magnitude, but it's 30% better anyway :)
