I have two sets of points on a sphere, labelled 'obj' and 'ps' in the code example below. I would like to identify all 'obj' points that are closer than a certain angular distance from a 'ps' point.
My take on this is to represent each point by a 3D unit vector, and to compare their dot products to cos(maximum separation). This can be done easily with numpy broadcasting, but in my application I have n_obj ~ 500,000 and n_ps ~ 50,000, so the memory requirements of broadcasting are too large. Below I have pasted my current take using numba. Can this be optimized further?
from numba import jit
import numpy as np
from sklearn.preprocessing import normalize

def gen_points(n):
    """
    generate random 3D unit vectors (not uniform, but irrelevant here)
    """
    vec = 2*np.random.rand(n,3)-1.
    vec_norm = normalize(vec)
    return vec_norm

#@jit(nopython=True)
@jit
def angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep):
    """
    finds obj that are closer than maxsep to a ps
    """
    nps = len(vec_ps)
    nobj = len(vec_obj)
    #closeobj_all = []
    closeobj_all = np.empty(0)
    dotprod = np.empty(nobj)
    a = np.arange(nobj)
    for ps in range(nps):
        np.sum(vec_obj*vec_ps[ps],axis=1,out=dotprod)
        #closeobj_all.extend(a[dotprod > cos_maxsep])
        closeobj_all = np.append(closeobj_all, a[dotprod > cos_maxsep])
    return closeobj_all

vec_obj = gen_points(50000) #in reality ~500,000
vec_ps = gen_points(5000) #in reality ~50,000
cos_maxsep = np.cos(0.003)
closeobj_all = np.unique(angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep))
This is the performance using the test case given in the code:
%timeit np.unique(angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep))
1 loops, best of 3: 4.53 s per loop
I have tried to speed it up using
@jit(nopython=True)
but this fails with
NotImplementedError: Failed at nopython (nopython frontend)
(<class 'numba.ir.Expr'>, build_list(items=[]))
Edit: After a numba update to 0.26, creating the empty list fails even in Python mode. This can be fixed by replacing it with np.empty(0), and the .extend() with np.append(), see above. This barely changes the performance.
According to https://github.com/numba/numba/issues/858 np.empty() is now supported in nopython mode, but I still can't run this with @jit(nopython = True):
TypingError: Internal error at <numba.typeinfer.CallConstraint object at 0x7ff3114a9310>
Unlike list.append you should never call numpy.append in a loop! This is because even for appending a single element the whole array needs to be copied. Because you're only interested in the unique obj you could use a Boolean array to flag the matches found so far.
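As a point of comparison, the Boolean-flag idea also works with plain NumPy if the broadcast is done in blocks of ps points, which keeps the memory use bounded. This is only a sketch under assumptions: the function name flag_close_objects and the chunk parameter are made up for illustration, the demo sizes are arbitrary, and the chunk size would need tuning for the real n_obj ~ 500,000.

```python
import numpy as np

def flag_close_objects(vec_obj, vec_ps, cos_maxsep, chunk=64):
    """Return indices of obj vectors within the angular threshold of any ps vector."""
    found = np.zeros(len(vec_obj), dtype=bool)
    for start in range(0, len(vec_ps), chunk):
        block = vec_ps[start:start + chunk]
        # (n_obj, chunk) matrix of dot products; only this block is held in memory
        dots = vec_obj @ block.T
        # flag every obj that matches at least one ps in this block
        found |= (dots > cos_maxsep).any(axis=1)
    return np.flatnonzero(found)

# Small demo with random unit vectors (hypothetical sizes)
rng = np.random.default_rng(0)
demo_obj = rng.normal(size=(2000, 3))
demo_obj /= np.linalg.norm(demo_obj, axis=1, keepdims=True)
demo_ps = rng.normal(size=(300, 3))
demo_ps /= np.linalg.norm(demo_ps, axis=1, keepdims=True)
close = flag_close_objects(demo_obj, demo_ps, np.cos(0.1))
```

Unlike the Numba loops, this cannot stop early once an obj has been matched, so it trades some redundant work for vectorised arithmetic.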
As for Numba, it works best if you write out all the loops. So for example:
@jit(nopython=True)
def numba2(vec_obj, vec_ps, cos_maxsep):
    nps = vec_ps.shape[0]
    nobj = vec_obj.shape[0]
    dim = vec_obj.shape[1]
    found = np.zeros(nobj, np.bool_)
    for i in range(nobj):
        for j in range(nps):
            cos = 0.0
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            if cos > cos_maxsep:
                found[i] = True
                break
    return found.nonzero()
The added benefit is that we can break out of the loop over the ps array as soon as we find a match to the current obj.
You can gain some more speed by specializing the function for 3 dimensional spaces. Also, for some reason, passing all arrays and relevant dimensions into a helper function results in another speedup:
def numba3(vec_obj, vec_ps, cos_maxsep):
    nps = len(vec_ps)
    nobj = len(vec_obj)
    out = np.zeros(nobj, bool)
    numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj)
    return np.flatnonzero(out)

@jit(nopython=True)
def numba3_helper(vec_obj, vec_ps, cos_maxsep, out, nps, nobj):
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                out[i] = True
                break
    return out
Timings I get for 20,000 obj and 2,000 ps:
%timeit angdist_threshold_numba(vec_obj,vec_ps,cos_maxsep)
1 loop, best of 3: 2.99 s per loop
%timeit numba2(vec_obj, vec_ps, cos_maxsep)
1 loop, best of 3: 444 ms per loop
%timeit numba3(vec_obj, vec_ps, cos_maxsep)
10 loops, best of 3: 134 ms per loop
I am calculating the most frequent number in a vector of int8s. Numba complains when I set up a counter array of ints:
@jit(nopython=True)
def freq_int8(y):
    """Find most frequent number in array"""
    count = np.zeros(256, dtype=int)
    for val in y:
        count[val] += 1
    return ((np.argmax(count)+128) % 256) - 128
Calling it I get the following error:
TypingError: Invalid usage of Function(<built-in function zeros>) with parameters (int64, Function(<class 'int'>))
If I delete dtype=int it works and I get a decent speedup. I am however puzzled as to why declaring an array of ints isn't working. Is there a known workaround, and would there be any efficiency gain worth having here?
Background: I am trying to shave microseconds off some numpy-heavy code. I am especially being hurt by numpy.median, and have been looking into Numba, but am struggling to improve on median. Finding the most frequent number is an acceptable alternative to median, and here I've been able to gain some performance. The above numba code is also faster than numpy.bincount.
Update: After input in the accepted answer, here's an implementation of median for int8 vectors. It is roughly an order of magnitude faster than numpy.median:
@jit(nopython=True)
def median_int8(y):
    N2 = len(y)//2
    count = np.zeros(256, dtype=np.int32)
    for val in y:
        count[val] += 1
    cs = 0
    for i in range(-128, 128):
        cs += count[i]
        if cs > N2:
            return float(i)
        elif cs == N2:
            j = i+1
            while count[j] == 0:
                j += 1
            return (i + j)/2
Surprisingly, the performance difference is even greater for short vectors, apparently due to overhead in numpy vectors:
>>> a = np.random.randint(-128, 128, 10)
>>> %timeit np.median(a)
The slowest run took 7.03 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.8 µs per loop
>>> %timeit median_int8(a)
The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 593 ns per loop
This overhead is so large, I'm wondering if something is wrong.
Just a quick note: finding the most frequent number is normally called the mode, and it is as similar to the median as it is to the mean, in which case np.mean will be considerably faster. Unless you have some constraints or particularities in your data, there is no guarantee that the mode approximates the median.
If you still want to calculate the mode of an array of integers, np.bincount, as you mention, should be enough (if numba is faster, it shouldn't be by much). Note that np.bincount rejects negative values, so an int8 array has to be reinterpreted as uint8 first:
count = np.bincount(y.view(np.uint8), minlength=256)
result = ((np.argmax(count)+128) % 256) - 128
Note I've added the minlength parameter to np.bincount just so it returns the same 256-length array that you have in your code. It is unnecessary in practice, as you only want the argmax: without minlength, np.bincount returns an array whose length is one more than the largest value in its input.
As for the numba error, replacing dtype=int with dtype=np.int32 should solve the problem. int is a python function, and you are specifying nopython in the numba header. If you remove nopython, then either dtype=int or dtype='i' will also work (having the same effect).
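To make the dtype point concrete without Numba, here is a small pure-NumPy sketch of the mode computation that spells out NumPy dtypes (np.int8, np.uint8) instead of the built-in int. The function name mode_int8 and the sample data are hypothetical; the sketch assumes the input really fits in int8, and it reinterprets the bytes as unsigned because np.bincount does not accept negative values.

```python
import numpy as np

def mode_int8(y):
    """Most frequent value of an int8 array (ties go to the first maximal bin)."""
    y = np.asarray(y, dtype=np.int8)
    # np.bincount rejects negative input, so reinterpret the bytes as 0..255
    count = np.bincount(y.view(np.uint8), minlength=256)
    # map the winning byte value back to the signed range -128..127
    return ((int(np.argmax(count)) + 128) % 256) - 128

sample = np.array([-3, 5, -3, 7, 5, -3], dtype=np.int8)
result = mode_int8(sample)
```

The modular arithmetic on the last line of the function is the same mapping used in the question, applied here to unsigned bin indices.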
I'm completely new to numpy and unable to find a solution.
I have a 2d list of floating point numbers in python like:
list1[0..8][0..2]
Where e.g.:
print(list1[0][0])
> 0.1122233784
Now I want to find min and max values:
b1 = numpy.array(list1)
list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
I need to do this about a million times in a loop.
It works correctly, but it's about 3x slower than my previous native python approach.
(1:15 min[numpy] vs 0:25 min[native])
What am I doing wrong?
I've read that the list conversion could be the problem, but I don't know how to do it better.
EDIT
As requested, here is some non-pseudo code, although in my script the list is created in another way.
import numpy
import random

def moonPositionNow():
    #assume we read like from a file, line by line
    #nextChunk = readNextLine()
    #the file is build like this
    #x-coord
    #y-coord
    #z-coord
    #x-coord
    #...
    #but we don't have that data here, so as a **placeholder** we return a random number
    nextChunk = random.random()
    return nextChunk

for w in range(1000000):
    list1 = [[moonPositionNow() for i in range(3)] for j in range(9)]
    b1 = numpy.array(list1)
    list1MinX, list1MinY, list1MinZ = b1.min(axis=0)
    list1MaxX, list1MaxY, list1MaxZ = b1.max(axis=0)
#Print out results
Although the list creation may be a bottleneck here, I guarantee that in the original code it's not the problem.
EDIT2:
Updated the example code to clarify, I don't need a numpy array of random numbers.
Since your data is available as a Python list it seems reasonable to me that a native implementation (which likely calls some optimized C code) could be faster than converting to numpy first and then calling optimized C code.
You basically loop over your data twice: once for converting the python objects to numpy arrays, and once for computing the maximum or minimum.
The native implementation (I assume it is something like calling min/max on the Python list) only needs to loop over the data once.
Furthermore, it seems that numpy's min/max functions are surprisingly slow: https://stackoverflow.com/a/12200671/3005167
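The question does not show its native implementation, so the following is only a guess at what a pure-Python per-column min/max could look like. The helper name column_min_max and the sample values are hypothetical; the point is that the nested list is used directly, with no conversion to an array.

```python
def column_min_max(rows):
    """Per-column minimum and maximum of a list of equal-length rows."""
    columns = list(zip(*rows))            # transpose without NumPy
    mins = [min(col) for col in columns]  # one min per column
    maxs = [max(col) for col in columns]  # one max per column
    return mins, maxs

# Three points with x, y, z coordinates (made-up values)
list1 = [[0.11, 0.52, 0.93], [0.07, 0.61, 0.88], [0.20, 0.48, 0.95]]
(min_x, min_y, min_z), (max_x, max_y, max_z) = column_min_max(list1)
```

For a 9x3 list like the one in the question, the per-call overhead of creating a NumPy array can easily outweigh the cost of these few built-in calls.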
The problem arises because you are passing a python list to a numpy function. The numpy function is significantly faster if you pass a numpy array as the argument.
import numpy as np
#Create numpy numbers
nptest = np.random.uniform(size=(10000, 10))
#Create a native python list
listtest = list(nptest)
#Compare performance
%timeit np.min(nptest, axis=0)
%timeit np.min(listtest, axis=0)
Output
1000 loops, best of 3: 394 µs per loop
100 loops, best of 3: 20 ms per loop
EDIT: Added example on how to evaluate a cost function over a grid.
The following evaluates a quadratic cost function over a grid and then takes the minimum along the first axis. In particular, np.meshgrid is your friend.
import numpy as np

def cost_function(x, y):
    return x ** 2 + y ** 2

x = np.linspace(-1, 1)
y = np.linspace(-1, 1)

def eval_python(x, y):
    # nested lists, so the result is 2D like the meshgrid version
    matrix = [[cost_function(_x, _y) for _x in x] for _y in y]
    return np.min(matrix, axis=0)

def eval_numpy(x, y):
    xx, yy = np.meshgrid(x, y)
    matrix = cost_function(xx, yy)
    return np.min(matrix, axis=0)

%timeit eval_python(x, y)
%timeit eval_numpy(x, y)
Output
100 loops, best of 3: 13.9 ms per loop
10000 loops, best of 3: 136 µs per loop
Finally, if you cannot cast your problem in this form, you can preallocate the memory and then fill in each element.
matrix = np.empty((num_x, num_y))
for i in range(num_x):
    for j in range(num_y):
        matrix[i, j] = cost_function(i, j)
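Applied to the original question, the same idea would be to collect many 9x3 blocks into one array and take the minimum and maximum once, rather than converting a small list in every iteration. This is a sketch under the assumption that the blocks can be gathered up front; the random batch array below is only a placeholder for the real position data, and n_iter is kept small for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_iter = 1000                        # stands in for the ~1,000,000 iterations
batch = rng.random((n_iter, 9, 3))   # placeholder for the real position data

# One call each instead of n_iter calls; axis=1 runs over the 9 points
mins = batch.min(axis=1)             # shape (n_iter, 3): min x, y, z per iteration
maxs = batch.max(axis=1)             # shape (n_iter, 3): max x, y, z per iteration
```

Row k of mins and maxs then holds what the loop in the question computes in iteration k.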
Input data
Produce n matrices of a given size (here, 3x2). I chose n=25, but I keep n as a variable to emphasise that what we have is a bunch of matrices.
import numpy as np
n = 25
data = np.random.rand(n, 3, 2)
This is just a format example : I can't change it. Or if I do, one must take into account the computational cost of this change.
Current implementation
What I want to achieve atomically is:
output = []
for datum in data: # This outputs on (3x2) matrix after the other
    d0 = datum[0]
    dr = datum[1:]
    output.append(dr-d0)
or, in a faster fashion:
output = [dr-d0 for (dr, d0) in zip(datum[:,0], datum[:,1:])]
Problem
This is too slow and:
output = datum[:,1:] - datum[:,0]
does not work since the behavior of the subtraction operation is not well defined in that case. Plus, this kind of slicing is not very efficient.
Cython/Nuitka/PyPy and the likes are possible solutions, but I'd like to stick with raw Numpy for now, if possible. Maybe some kind of function that can be applied on elements of the outer loop of a numpy array very quickly without the overhead of python stuff...
The np.vectorize function doesn't work on:
def get_diff(mat):
return mat[1:] - mat[0]
So I invoke ye, High Priests of Numpy, servants of Python to enlighten my poor soul!
EDIT:
XY Problem
(I didn't know it had a name)
What I actually want to do is to determine the content (read "volume") of a lot of simplices (read "tetrahedra"). The easiest and most efficient way to do it, AFAIK is to calculate:
np.linalg.det(mat[1:]-mat[0])
Then let me rephrase my question: how can I efficiently compute the content of any ensemble of simplices of dimension k using plain python and numpy?
I suggest data[:,1:] - data[:,0,None]. The None creates a new axis (officially you're supposed to use np.newaxis, which makes it very clear what you're doing), and then the subtraction will behave the way you want it to.
Correcting what I think are errors in your list comprehension:
def loop(data):
    output = []
    for datum in data: # This outputs on (3x2) matrix after the other
        d0 = datum[0]
        dr = datum[1:]
        output.append(dr-d0)
    return output

def listcomp(data):
    output = [dr-d0 for (d0, dr) in zip(data[:,0], data[:,1:])]
    return output

def sub(data):
    output = data[:,1:] - data[:,0,None]
    return output
we have
>>> import numpy as np
>>> n = 25
>>> data = np.random.rand(n, 3, 2)
>>> res_loop = loop(data)
>>> res_listcomp = listcomp(data)
>>> res_sub = sub(data)
>>> np.allclose(res_loop, res_listcomp)
True
>>> np.allclose(res_loop, res_sub)
True
>>>
>>> %timeit loop(data)
10000 loops, best of 3: 184 µs per loop
>>> %timeit listcomp(data)
10000 loops, best of 3: 158 µs per loop
>>> %timeit sub(data)
100000 loops, best of 3: 12.8 µs per loop
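For the rephrased question about simplex content, the broadcast subtraction can be combined with np.linalg.det, which accepts a stack of square matrices. The sketch below assumes data has shape (n, k+1, k), that is, n simplices with k+1 vertices in k dimensions, and uses the standard formula |det(edges)| / k!. The function name simplex_volumes and the two example triangles are hypothetical.

```python
import numpy as np
from math import factorial

def simplex_volumes(data):
    """Content of n simplices given as an array of shape (n, k+1, k)."""
    # edge vectors from the first vertex of each simplex, shape (n, k, k)
    edges = data[:, 1:] - data[:, 0, None]
    k = edges.shape[-1]
    # det works on the last two axes, so all n determinants come from one call
    return np.abs(np.linalg.det(edges)) / factorial(k)

# Two right triangles with legs of length 1 and 2
triangles = np.array([[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
                      [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]])
areas = simplex_volumes(triangles)
```

With the 3x2 matrices from the question this gives triangle areas; tetrahedra would be passed as an (n, 4, 3) array.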
This code fragment is a bottleneck in a project of mine. Are there function calls that could replace the for loops and speed it up?
D = np.zeros((nOcc,nOcc,nVir,nVir))
for i in range(nOcc):
    for j in range(i+1):
        tmp = Ew[i] + Ew[j]
        for a in range(nVir):
            tmp2 = tmp - Ew[a+nOcc]
            for b in range(a+1):
                tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                D[j,i,a,b] = D[i,j,b,a]
                D[j,i,b,a] = D[i,j,a,b]
To start off, let's generate some arbitrary data that obeys a few required principles:
nOcc = 30
nVir = 120
Ew = np.random.rand(nOcc+nVir)
Ew[:nOcc]*=-1
Ia = np.random.rand(nOcc)
Ib = np.random.rand(nVir)
I = np.einsum('a,b,c,d->abcd',Ia,Ib,Ia,Ib)
Let's wrap your base code as an example:
def oldcalc_D(Iiajb,nOcc,nVir,Ew):
    D = np.zeros((nOcc,nOcc,nVir,nVir))
    for i in range(nOcc):
        for j in range(i+1):
            tmp = Ew[i] + Ew[j]
            for a in range(nVir):
                tmp2 = tmp - Ew[a+nOcc]
                for b in range(a+1):
                    tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                    D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                    D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                    D[j,i,a,b] = D[i,j,b,a]
                    D[j,i,b,a] = D[i,j,a,b]
    return D
Taking advantage of integral symmetry is typically a good tactic; however, in numpy alone it is not worth the cost so we are going to ignore this and simply vectorize your code:
def newcalc_D(I,nOcc,nVir,Ew):
    O = Ew[:nOcc]
    V = Ew[nOcc:]
    D = O[:,None,None,None] - V[:,None,None] + O[:,None] - V
    return (I/D).swapaxes(1,2)
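To see how the broadcast denominator in newcalc_D lines up with the loop indices, it can help to build it for a tiny case and compare against explicit loops. This is only an illustrative check with made-up sizes (n_occ = 2, n_vir = 3) and random energies; the names n_occ, n_vir, denom and check are not from the original code.

```python
import numpy as np

n_occ, n_vir = 2, 3
rng = np.random.default_rng(0)
Ew = rng.random(n_occ + n_vir)
O, V = Ew[:n_occ], Ew[n_occ:]

# Axes are ordered (i, a, j, b), matching the layout of Iiajb
denom = O[:, None, None, None] - V[:, None, None] + O[:, None] - V

# Same quantity written with explicit loops
check = np.empty((n_occ, n_vir, n_occ, n_vir))
for i in range(n_occ):
    for a in range(n_vir):
        for j in range(n_occ):
            for b in range(n_vir):
                check[i, a, j, b] = O[i] - V[a] + O[j] - V[b]

same = np.allclose(denom, check)
```

Because the denominator comes out in (i, a, j, b) order, the final swapaxes(1,2) in newcalc_D is what turns the result into the (i, j, a, b) layout of D.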
Some timings:
np.allclose(oldcalc_D(I,nOcc,nVir,Ew),newcalc_D(I,nOcc,nVir,Ew))
True
%timeit newcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 142 ms per loop
%timeit oldcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 15 s per loop
So it is only about 100x faster; as I said, this is a fairly simple pass to give you an idea of what to do. This can be done much better, but it should be a trivial part of the calculation, as the integral transformation is O(N^5) versus O(N^4) for this step. For these operations I use numba's autojit feature:
from numba import autojit
numba_D = autojit(oldcalc_D)
%timeit numba_D(I,nOcc,nVir,Ew)
10 loops, best of 3: 55.1 ms per loop
Unless this is Python 3, you may want to start by replacing range with xrange: the former creates the entire list, while the latter is just an iterator, which is all you need in this case.
For large N, the speed difference should be noticeable.
Also, seeing as you're using numpy, there is probably a vectorized way to implement the algorithm. If that's the case, the vectorized implementation should be orders of magnitude faster. But unless you explain the variables and the algorithm, we can't help you in that direction.
The following is the most basic way I know of to count transitions in a markov chain and use it to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
I've tried speeding it up in 3 different ways:
1) Using a sparse matrix one-liner based on this Matlab code:
transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))
Which in Numpy/SciPy, looks like this:
def get_sparse_counts_matrix(markov_chain, number_of_states):
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])), shape=(number_of_states, number_of_states))
And I've tried a couple more Python tweaks, like using zip():
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1
And Queues:
old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1
But none of these 3 methods sped things up. In fact, everything but the zip() solution was at least 10X slower than my original solution.
Are there any other solutions worth looking into?
Modified solution for building a transition matrix from lots of chains
The best answer to the above question specifically was DSM's. However, for anyone who wants to populate a transition matrix based on a list of millions of markov chains, the quickest way is this:
def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]), transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)

def get_fake_transitions(markov_chains):
    fake_transitions = []
    for i in xrange(1,len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1

def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix
Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that involves just adding a decorator (although I've made a direct call so I could test the jit variants that numba provides here):
import numpy as np
import numba
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)
t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))
And then timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop
The autojit method does some guessing based on runtime inputs, and the jit function has types dictated. You have to be a little careful since numba at these early stages doesn't communicate that there was an error with jit if you pass in the wrong type for an input. It will just spit out an incorrect answer.
That said though, getting a 35x and 485x speed-up without any code change and just adding a call to numba (can be also called as a decorator) is pretty impressive in my book. You could probably get similar results using cython, but it would require a bit more boilerplate and writing a setup.py file.
I also like this solution because the code remains readable and you can write it the way you originally thought about implementing the algorithm.
How about something like this, taking advantage of np.bincount? Not super-robust, but functional. [Thanks to @Warren Weckesser for the setup.]
import numpy as np
from collections import Counter
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = counts.values()

def using_bincount(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)
which gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop
[edit]
Avoiding flat (at the cost of not working in-place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop
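Another in-place option, not used in the answers here, is np.add.at, which performs an unbuffered add so that repeated (old, new) pairs are each counted. This is a sketch with a hypothetical function name and a tiny made-up chain; it is not claimed to be faster than the bincount versions, so it would need timing on real data.

```python
import numpy as np

def increment_counts_add_at(chain, counts_matrix):
    """Add the transition counts of one chain to counts_matrix in place."""
    chain = np.asarray(chain)
    # unbuffered add: a pair that occurs several times is incremented each time
    np.add.at(counts_matrix, (chain[:-1], chain[1:]), 1)

chain = np.array([0, 1, 1, 2, 0, 1])
counts = np.zeros((3, 3))
increment_counts_add_at(chain, counts)
```

A plain counts[chain[:-1], chain[1:]] += 1 would not work here, because fancy-indexed assignment applies only one increment per distinct index pair.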
Here's a faster method. The idea is to count the number of occurrences of each transition, and use the counts in a vectorized update of the matrix. (I'm assuming that the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the number of occurrences of each transition.
from collections import Counter
def update_matrix(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += counts.values()
Timing example, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop
It's faster, but not orders of magnitude faster. For a real speed up, you might consider implementing this in Cython.
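The code in this thread targets Python 2 (xrange, and dict.values() returning a list). On Python 3, counts.values() is a view object that cannot be added to an array slice directly, so a version of the Counter approach might look like the sketch below. The function name update_matrix_py3 is hypothetical, and the empty-chain guard is an addition that the original does not have.

```python
from collections import Counter
import numpy as np

def update_matrix_py3(chain, counts_matrix):
    """Counter-based in-place update of counts_matrix, written for Python 3."""
    counts = Counter(zip(chain[:-1], chain[1:]))
    if not counts:                      # chain shorter than 2 states
        return
    pairs = np.array(list(counts.keys()))   # shape (m, 2): distinct (old, new) pairs
    values = np.fromiter(counts.values(), dtype=counts_matrix.dtype,
                         count=len(counts))
    # pairs are distinct, so a fancy-indexed += is safe here
    counts_matrix[pairs[:, 0], pairs[:, 1]] += values

chain = np.array([0, 1, 1, 2, 0, 1])
m = np.zeros((3, 3))
update_matrix_py3(chain, m)
```

Since the update is additive, calling the function again with another chain accumulates counts in the same matrix.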
OK, a few ideas to tinker with, giving some slight improvement (at the cost of human understanding).
Let's start with a random vector of integers between 0 and 9 of length 3000:
L = 3000
N = 10
states = np.random.randint(N, size=L)
transitions = np.zeros((N,N))
Your method, on my machine, has a timeit performance of 11.4 ms.
The first small improvement is to avoid reading the data twice, by storing it in a temporary variable:
old = states[0]
for i in range(1,len(states)):
    new = states[i]
    transitions[new,old]+=1
    old=new
This gives you a ~10% improvement and drops the time to 10.9 ms.
A more involved approach uses the strides:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

state_2 = rolling(states, 2)
for i in range(len(state_2)):
    l,m = state_2[i,0],state_2[i,1]
    transitions[m,l]+=1
The strides let you read consecutive numbers of the array by tricking it into thinking that each row starts one element after the previous one (OK, it's not well described, but if you take some time to read about strides you will get it).
This approach loses performance, going to 12.2 ms, but it is the gateway to tricking the system even more. By flattening both the transition matrix and the strided array into one-dimensional arrays, you can speed things up a little more:
transitions = np.zeros(N*N)
state_2 = rolling(states, 2)
# flat index = old + N*new, so that the reshape below gives transitions[new, old]
state_flat = np.sum(state_2 * np.array([1,N]),axis=1)
for i in state_flat:
    transitions[i]+=1
transitions = transitions.reshape((N,N))
This goes down to 7.75 ms. It's not an order of magnitude, but it's about 30% better anyway :)
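The strided pair view described above can be inspected on a tiny array. The sketch below uses numpy.lib.stride_tricks.sliding_window_view, which is available in NumPy 1.20 and later, instead of the hand-written rolling helper, and it counts the flattened indices with np.bincount rather than a Python loop. The five-element states array and N = 3 are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

states = np.array([0, 2, 1, 1, 2])
N = 3

# Read-only view of consecutive (old, new) pairs, no data is copied:
# rows are [0 2], [2 1], [1 1], [1 2]
pairs = sliding_window_view(states, 2)

# Flat index new*N + old, so the reshaped matrix is indexed as [new, old]
flat = pairs[:, 1] * N + pairs[:, 0]
transitions = np.bincount(flat, minlength=N * N).reshape(N, N)
```

Note that this keeps the [new, old] index order used in this answer, which is the transpose of the [old, new] order used in the question.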