I'm trying to evaluate a polynomial (3rd degree) using numpy.
I found that plain Python code is much more efficient.
import numpy as np
import timeit
m = [3,7,1,2]
f = lambda m,x: m[0]*x**3 + m[1]*x**2 + m[2]*x + m[3]
np_poly = np.poly1d(m)
np_polyval = lambda m,x: np.polyval(m,x)
np_pow = lambda m,x: np.power(x,[3,2,1,0]).dot(m)
print 'result={}, timeit={}'.format(f(m,12),timeit.Timer('f(m,12)', 'from __main__ import f,m').timeit(10000))
result=6206, timeit=0.0036780834198
print 'result={}, timeit={}'.format(np_poly(12),timeit.Timer('np_poly(12)', 'from __main__ import np_poly').timeit(10000))
result=6206, timeit=0.180546045303
print 'result={}, timeit={}'.format(np_polyval(m,12),timeit.Timer('np_polyval(m,12)', 'from __main__ import np_polyval,m').timeit(10000))
result=6206, timeit=0.227771043777
print 'result={}, timeit={}'.format(np_pow(m,12),timeit.Timer('np_pow(m,12)', 'from __main__ import np_pow,m').timeit(10000))
result=6206, timeit=0.168987989426
Did I miss something?
Is there another way in numpy to evaluate a polynomial?
Some 23 years ago I checked out a copy of Press et al.'s Numerical Recipes in C from the university library. There was a lot of cool stuff in that book, but one passage has stuck with me over the years, on page 173:
We assume that you know enough never to evaluate a polynomial this
way:
p=c[0]+c[1]*x+c[2]*x*x+c[3]*x*x*x+c[4]*x*x*x*x;
or (even worse!),
p=c[0]+c[1]*x+c[2]*pow(x,2.0)+c[3]*pow(x,3.0)+c[4]*pow(x,4.0);
Come the (computer) revolution, all persons found guilty of such
criminal behavior will be summarily executed, and their programs won't
be! It is a matter of taste, however, whether to write
p = c[0]+x*(c[1]+x*(c[2]+x*(c[3]+x*c[4])));
or
p = (((c[4]*x+c[3])*x+c[2])*x+c[1])*x+c[0];
So if you are really worried about performance, you want to try that; the difference will be huge for higher-degree polynomials:
In [24]: fast_f = lambda m, x: m[3] + x*(m[2] + x*(m[1] + x*m[0]))
In [25]: %timeit f(m, 12)
1000000 loops, best of 3: 478 ns per loop
In [26]: %timeit fast_f(m, 12)
1000000 loops, best of 3: 374 ns per loop
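For an arbitrary-degree coefficient list (highest degree first, as in the question), the same Horner scheme is a two-line loop; a minimal sketch:

def horner(m, x):
    # Horner's rule: n-1 multiplies and adds for a degree-(n-1) polynomial
    y = 0
    for c in m:
        y = y * x + c
    return y

horner(m, 12)   # 6206, same as f(m, 12)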
If you want to stick with numpy, there is a newer polynomial class that runs 2x faster than poly1d on my system, but is still much slower than the plain Python lambdas above:
In [27]: np_fast_poly = np.polynomial.polynomial.Polynomial(m[::-1])
In [28]: %timeit np_poly(12)
100000 loops, best of 3: 15.4 us per loop
In [29]: %timeit np_fast_poly(12)
100000 loops, best of 3: 8.01 us per loop
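You can also call the new package's evaluation function directly instead of constructing a Polynomial object, which skips some of the class overhead; a small sketch (note that numpy.polynomial expects coefficients lowest degree first, the reverse of poly1d):

from numpy.polynomial import polynomial as P

c = m[::-1]               # [2, 1, 7, 3]: the new API wants lowest degree first
print P.polyval(12, c)    # 6206.0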
Well, looking at the implementation of polyval (the function that is eventually called when you evaluate a poly1d), it seems odd that the implementer decided to include an explicit Python loop. From the source of numpy 1.6.2:
def polyval(p, x):
    p = NX.asarray(p)
    if isinstance(x, poly1d):
        y = 0
    else:
        x = NX.asarray(x)
        y = NX.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y
On one hand, avoiding the power operation should be advantageous speed-wise; on the other hand, the Python-level loop pretty much screws things up.
Here's an alternative numpy-ish implementation:
POW = np.arange(100)[::-1]

def g(m, x):
    # the last m.size entries of POW are the exponents [m.size-1, ..., 1, 0]
    return np.dot(m, x ** POW[-m.size:])
For speed, I avoid recreating the power array on each call. Also, to be fair when benchmarking against numpy, you should start with numpy arrays, not lists, to avoid the penalty of converting the list to numpy on each call.
So, when adding m = np.array(m), my g above only runs about 50% slower than your f.
Despite being slower on the example you posted: for evaluating a low-degree polynomial at a scalar x, you really can't do much better than an explicit implementation like your f (of course you can, but probably not by much without resorting to lower-level code). However, for higher degrees (where you have to replace the explicit expression with some sort of loop), the numpy approach (e.g. g) proves much faster as the degree increases, and also for vectorized evaluation, i.e. when x is a vector.
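For example, a small sketch of the vectorized case: when x is an array, a single np.polyval call evaluates the polynomial at every point, where f would need a Python-level call per point:

xs = np.linspace(-1.0, 1.0, 100000)
ys = np.polyval(m, xs)    # evaluates the cubic at all 100000 points in one call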
Related
Recently I answered THIS question, which asked about multiplying two lists; one user suggested the following way using numpy, alongside mine, which I think is the proper way:
(a.T*b).T
I also found that array.resize() has about the same performance. Anyway, another answer suggested a solution using a list comprehension:
[[m*n for n in second] for m, second in zip(b,a)]
But after benchmarking I saw that the list comprehension performs much faster than numpy:
from timeit import timeit
s1="""
a=[[2,3,5],[3,6,2],[1,3,2]]
b=[4,2,1]
[[m*n for n in second] for m, second in zip(b,a)]
"""
s2="""
a=np.array([[2,3,5],[3,6,2],[1,3,2]])
b=np.array([4,2,1])
(a.T*b).T
"""
print ' first: ' ,timeit(stmt=s1, number=1000000)
print 'second : ',timeit(stmt=s2, number=1000000,setup="import numpy as np")
result :
first: 1.49778485298
second : 7.43547797203
As you can see, numpy is approximately 5 times slower. The most surprising thing was that numpy is faster without using transpose; yet even for the following code:
a=np.array([[2,3,5],[3,6,2],[1,3,2]])
b=np.array([[4],[2],[1]])
a*b
the list comprehension was still 5 times faster. So, beyond the fact that list comprehensions run in C while here we used two nested loops and a zip function, what can be the reason? Is it because of the * operation in numpy?
Also note that there is no problem with timeit here; I put the import in the setup.
I also tried it with larger arrays; the difference gets smaller but still doesn't make sense:
s1="""
a=[[2,3,5],[3,6,2],[1,3,2]]*10000
b=[4,2,1]*10000
[[m*n for n in second] for m, second in zip(b,a)]
"""
s2="""
a=np.array([[2,3,5],[3,6,2],[1,3,2]]*10000)
b=np.array([4,2,1]*10000)
(a.T*b).T
"""
print ' first: ' ,timeit(stmt=s1, number=1000)
print 'second : ',timeit(stmt=s2, number=1000,setup="import numpy as np")
result :
first: 10.7480301857
second : 13.1278889179
Creation of numpy arrays is much slower than creation of lists:
In [153]: %timeit a = [[2,3,5],[3,6,2],[1,3,2]]
1000000 loops, best of 3: 308 ns per loop
In [154]: %timeit a = np.array([[2,3,5],[3,6,2],[1,3,2]])
100000 loops, best of 3: 2.27 µs per loop
There can also be fixed costs incurred by NumPy function calls before the meat of the calculation is performed by a fast underlying C/Fortran function, such as ensuring the inputs are NumPy arrays.
These setup/fixed costs are something to keep in mind before assuming NumPy
solutions are inherently faster than pure-Python solutions. NumPy shines when
you set up large arrays once and then perform many fast NumPy operations
on the arrays. It may fail to outperform pure Python if the arrays are small
because the setup cost can outweigh the benefit of offloading the calculations
to compiled C/Fortran functions. For small arrays there simply may not be enough
calculations to make it worth it.
If you increase the size of the arrays a bit, and move creation of the arrays
into the setup, then NumPy can be much faster than pure Python:
import numpy as np
from timeit import timeit
N, M = 300, 300
a = np.random.randint(100, size=(N,M))
b = np.random.randint(100, size=(N,))
a2 = a.tolist()
b2 = b.tolist()
s1="""
[[m*n for n in second] for m, second in zip(b2,a2)]
"""
s2 = """
(a.T*b).T
"""
s3 = """
a*b[:,None]
"""
assert np.allclose([[m*n for n in second] for m, second in zip(b2,a2)], (a.T*b).T)
assert np.allclose([[m*n for n in second] for m, second in zip(b2,a2)], a*b[:,None])
print 's1: {:.4f}'.format(
    timeit(stmt=s1, number=10**3, setup='from __main__ import a2,b2'))
print 's2: {:.4f}'.format(
    timeit(stmt=s2, number=10**3, setup='from __main__ import a,b'))
print 's3: {:.4f}'.format(
    timeit(stmt=s3, number=10**3, setup='from __main__ import a,b'))
yields
s1: 4.6990
s2: 0.1224
s3: 0.1234
I expected this Python implementation of ThreeSum to be slow:
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
But I was shocked that this version looks pretty slow too:
import itertools

def count_python(a):
    """ThreeSum using itertools"""
    return sum(map(lambda X: sum(X) == 0, itertools.combinations(a, r=3)))
Can anyone recommend a faster Python implementation? Both implementations just seem so slow...
Thanks
...
ANSWER SUMMARY:
Here is how all the various O(N^3) versions provided in this thread (kept for educational purposes; not something you'd use in real life) ran on my machine:
56 sec RUNNING count_slow...
28 sec RUNNING count_itertools, written by Ashwini Chaudhary...
14 sec RUNNING count_fixed, written by roippi...
11 sec RUNNING count_itertools (faster), written by Veedrak...
08 sec RUNNING count_enumerate, written by roippi...
*Note: I needed to modify Veedrak's solution as follows to get the correct count output:
sum(1 for x, y, z in itertools.combinations(a, r=3) if x+y==-z)
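Wrapped up as a complete function (the name here is mine), the corrected version reads:

import itertools

def count_veedrak_fixed(a):
    return sum(1 for x, y, z in itertools.combinations(a, r=3) if x + y == -z)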
Supplying a second answer. From various comments, it looks like you're primarily concerned about why this particular O(n**3) algorithm is slow when being ported over from java. Let's dive in.
def count(a):
    """ThreeSum: Given N distinct integers, how many triples sum to exactly zero?"""
    N = len(a)
    cnt = 0
    for i in range(N):
        for j in range(i+1, N):
            for k in range(j+1, N):
                if sum([a[i], a[j], a[k]]) == 0:
                    cnt += 1
    return cnt
One major problem that immediately pops out is that you're doing something your java code almost certainly isn't doing: materializing a 3-element list just to add three numbers together!
if sum([a[i], a[j], a[k]]) == 0:
Yuck! Just write that as
if a[i] + a[j] + a[k] == 0:
Some benchmarking shows that you're adding 50%+ overhead just by doing that. Yikes.
The other issue here is that you're using indexing where you should be using iteration. In python try to avoid writing code like this:
for i in range(len(some_list)):
    do_something(some_list[i])
And instead just write:
for x in some_list:
    do_something(x)
And if you explicitly need the index that you're on (as you actually do in your code), use enumerate:
for i, x in enumerate(some_list):
    # etc
This is, in general, a style thing (though it goes deeper than that, with duck typing and the iterator protocol) - but it is also a performance thing. In order to look up the value of a[i], that call is converted to a.__getitem__(i), then python has to dynamically resolve a __getitem__ method lookup, call it, and return the value. Every time. It's not a crazy amount of overhead - at least on builtin types - but it adds up if you're doing it a lot in a loop. Treating a as an iterable, on the other hand, sidesteps a lot of that overhead.
So taking that change in mind, you can rewrite your function once again:
def count_enumerate(a):
    cnt = 0
    for i, x in enumerate(a):
        for j, y in enumerate(a[i+1:], i+1):
            for z in a[j+1:]:
                if x + y + z == 0:
                    cnt += 1
    return cnt
Let's look at some timings:
%timeit count(range(-100,100))
1 loops, best of 3: 394 ms per loop
%timeit count_fixed(range(-100,100)) #just fixing your sum() line
10 loops, best of 3: 158 ms per loop
%timeit count_enumerate(range(-100,100))
10 loops, best of 3: 88.9 ms per loop
And that's about as fast as it's going to go. You can shave off a percent or so by wrapping everything in a comprehension instead of doing cnt += 1 but that's pretty minor.
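For reference, a sketch of that comprehension variant of count_enumerate (same logic, marginally faster at best):

def count_comprehension(a):
    return sum(1 for i, x in enumerate(a)
                 for j, y in enumerate(a[i+1:], i+1)
                 for z in a[j+1:]
                 if x + y + z == 0)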
I've toyed around with a few itertools implementations but I actually can't get them to go faster than this explicit loop version. This makes sense if you think about it - for every iteration, the itertools.combinations version has to rebind what all three variables refer to, whereas the explicit loops get to "cheat" and rebind the variables in the outer loops far less often.
Reality check time, though: after everything is said and done, you can still expect cPython to run this algorithm an order of magnitude slower than a modern JVM would. There is simply too much abstraction built in to python that gets in the way of looping quickly. If you care about speed (and you can't fix your algorithm - see my other answer), either use something like numpy to spend all of your time looping in C, or use a different implementation of python.
postscript: pypy
For fun, I ran count_fixed on a 1000-element list, on both cPython and pypy.
cPython:
In [81]: timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
Out[81]: 19.230753898620605
pypy:
>>>> timeit.timeit('count_fixed(range(-500,500))', setup='from __main__ import count_fixed', number = 1)
0.6961538791656494
Speedy!
I might add some java testing in later to compare :-)
Algorithmically, both versions of your function are O(n**3) - so asymptotically neither is superior. You will find that the itertools version is in practice somewhat faster since it spends more time looping in C rather than in python bytecode. You can get it down a few more percentage points by removing map entirely (especially if you're running py2) but it's still going to be "slow" compared to whatever times you got from running it in a JVM.
Note that there are plenty of python implementations other than cPython out there - for loopy code, pypy tends to be much faster than cPython. So I wouldn't write python-as-a-language off as being slow, necessarily, but I would certainly say that the reference implementation of python is not known for its blazing loop speed. Give other python flavors a shot if that's something you care about.
Specific to your algorithm, an optimization will let you drop it down to O(n**2). Build up a set of your integers, s, and build up all pairs (a,b). You know that you can "zero out" (a+b) if and only if -(a+b) in (s - {a,b}).
Thanks to @Veedrak: unfortunately constructing s - {a,b} is itself a slow O(len(s)) operation, so simply check whether -(a+b) is equal to either a or b. If it is, you know there's no third c that can fulfill a+b+c == 0, since all numbers in your input are distinct.
def count_python_faster(a):
    s = frozenset(a)
    return sum(1 for x, y in itertools.combinations(a, 2)
               if -(x+y) not in (x, y) and -(x+y) in s) // 3
Note the divide-by-three at the end; this is because each successful combination is triple-counted. It's possible to avoid that but it doesn't actually speed things up and (imo) just complicates the code.
Some timings for the curious:
%timeit count(range(-100,100))
1 loops, best of 3: 407 ms per loop
%timeit count_python(range(-100,100)) #this is about 100ms faster on py3
1 loops, best of 3: 382 ms per loop
%timeit count_python_faster(range(-100,100))
100 loops, best of 3: 5.37 ms per loop
You haven't stated which version of Python you're using.
In Python 3.x, a generator expression is around 10% faster than either of the two implementations you listed. Using a random array of 100 numbers in the range [-100,100] for a:
count(a) -> 8.94 ms # as per your implementation
count_python(a) -> 8.75 ms # as per your implementation
def count_generator(a):
    return sum(sum(x) == 0 for x in itertools.combinations(a, r=3))
count_generator(a) -> 7.63 ms
But other than that, it's the sheer number of combinations that dominates execution time: O(N^3).
I should add that the times shown above are for loops of 10 calls each, averaged over 10 loops. And yeah, my laptop is slow too :)
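To put a number on the sheer amount of combinations: there are C(N, 3) = N(N-1)(N-2)/6 triples to test, so doubling the input means roughly 8x the work. A quick check:

def n_triples(N):
    # C(N, 3): the number of 3-element combinations
    return N * (N - 1) * (N - 2) // 6

print n_triples(100)   # 161700
print n_triples(200)   # 1313400, about 8x as many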
This code fragment is a bottleneck in a project of mine. Are there function calls that could replace the for loops and speed it up?
D = np.zeros((nOcc,nOcc,nVir,nVir))
for i in range(nOcc):
    for j in range(i+1):
        tmp = Ew[i] + Ew[j]
        for a in range(nVir):
            tmp2 = tmp - Ew[a+nOcc]
            for b in range(a+1):
                tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                D[j,i,a,b] = D[i,j,b,a]
                D[j,i,b,a] = D[i,j,a,b]
To start off, let's generate some arbitrary data that obeys a few required principles:
nOcc = 30
nVir = 120
Ew = np.random.rand(nOcc+nVir)
Ew[:nOcc]*=-1
Ia = np.random.rand(nOcc)
Ib = np.random.rand(nVir)
I = np.einsum('a,b,c,d->abcd',Ia,Ib,Ia,Ib)
Let's wrap your base code in a function as a baseline:
def oldcalc_D(Iiajb,nOcc,nVir,Ew):
    D = np.zeros((nOcc,nOcc,nVir,nVir))
    for i in range(nOcc):
        for j in range(i+1):
            tmp = Ew[i] + Ew[j]
            for a in range(nVir):
                tmp2 = tmp - Ew[a+nOcc]
                for b in range(a+1):
                    tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
                    D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
                    D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
                    D[j,i,a,b] = D[i,j,b,a]
                    D[j,i,b,a] = D[i,j,a,b]
    return D
Taking advantage of integral symmetry is typically a good tactic; however, in numpy alone it is not worth the cost so we are going to ignore this and simply vectorize your code:
def newcalc_D(I,nOcc,nVir,Ew):
    O = Ew[:nOcc]
    V = Ew[nOcc:]
    D = O[:,None,None,None] - V[:,None,None] + O[:,None] - V
    return (I/D).swapaxes(1,2)
Some timings:
np.allclose(oldcalc_D(I,nOcc,nVir,Ew),newcalc_D(I,nOcc,nVir,Ew))
True
%timeit newcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 142 ms per loop
%timeit oldcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 15 s per loop
So only about ~100x faster; as I said, this is a fairly simple pass to give you an idea of what to do. It can be done much better, but that should be a trivial part of the calculation, as the integral transformation is O(N^5) vs O(N^4) here. For these operations I use numba's autojit feature:
from numba import autojit
numba_D = autojit(oldcalc_D)
%timeit numba_D(I,nOcc,nVir,Ew)
10 loops, best of 3: 55.1 ms per loop
Unless this is Python 3, you may want to start by replacing range with xrange: the former creates the entire list, while the latter is just an iterator, which is all you need in this case.
For large Ns, speed difference should be noticeable.
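A quick illustration of the difference (Python 2 semantics; in Python 3, range is already lazy):

total = 0
for i in xrange(10**7):   # lazily yields values; range(10**7) would first build a 10-million-element list
    total += i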
Also, seeing as you're using numpy, there is probably a vectorized way to implement the algorithm. If so, the vectorized implementation should be orders of magnitude faster. But unless you explain the variables and the algorithm, we can't help you in that direction.
The following is the most basic way I know of to count transitions in a markov chain and use it to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
I've tried speeding it up in 3 different ways:
1) Using a sparse matrix one-liner based on this Matlab code:
transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))
Which in Numpy/SciPy, looks like this:
from scipy.sparse import coo_matrix

def get_sparse_counts_matrix(markov_chain, number_of_states):
    return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])),
                      shape=(number_of_states, number_of_states))
And I've tried a couple more Python tweaks, like using zip():
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
    transition_counts_matrix[old_state, new_state] += 1
And Queues:
from Queue import Queue

old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
    old_and_new_states_holder.put(new_state)
    old_state = old_and_new_states_holder.get()
    transition_counts_matrix[old_state, new_state] += 1
But none of these 3 methods sped things up. In fact, everything but the zip() solution was at least 10X slower than my original solution.
Are there any other solutions worth looking into?
Modified solution for building a transition matrix from lots of chains
The best answer to the above question specifically was DSM's. However, for anyone who wants to populate a transition matrix based on a list of millions of markov chains, the quickest way is this:
def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
    flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]),
                                          transition_counts_matrix.shape)
    transition_counts_matrix.flat += numpy.bincount(flat_coords,
                                                    minlength=transition_counts_matrix.size)

def get_fake_transitions(markov_chains):
    fake_transitions = []
    for i in xrange(1, len(markov_chains)):
        old_chain = markov_chains[i - 1]
        new_chain = markov_chains[i]
        end_of_old = old_chain[-1]
        beginning_of_new = new_chain[0]
        fake_transitions.append((end_of_old, beginning_of_new))
    return fake_transitions

def decrement_fake_transitions(fake_transitions, counts_matrix):
    for old_state, new_state in fake_transitions:
        counts_matrix[old_state, new_state] -= 1

def fast_get_transition_counts_matrix(markov_chains, number_of_states):
    """50% faster than the original, but must store 2 additional slice copies of all markov chains in memory at once.
    You might need to break up the chains into manageable chunks that don't exceed your memory.
    """
    transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
    fake_transitions = get_fake_transitions(markov_chains)
    markov_chains = list(itertools.chain(*markov_chains))
    fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
    decrement_fake_transitions(fake_transitions, transition_counts_matrix)
    return transition_counts_matrix
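A usage sketch with made-up data (the chains here are hypothetical; any list of integer-state sequences works):

import itertools
import numpy

number_of_states = 50
markov_chains = [list(numpy.random.randint(0, number_of_states, 500))
                 for _ in xrange(1000)]
counts = fast_get_transition_counts_matrix(markov_chains, number_of_states)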
Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that involves just adding a decorator (although I've made a direct call so I could test the jit variants that numba provides here):
import numpy as np
import numba
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1
autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)
t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))
And then timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop
The autojit method does some guessing based on runtime inputs, and the jit function has types dictated. You have to be a little careful since numba at these early stages doesn't communicate that there was an error with jit if you pass in the wrong type for an input. It will just spit out an incorrect answer.
That said though, getting a 35x and 485x speed-up without any code change and just adding a call to numba (can be also called as a decorator) is pretty impressive in my book. You could probably get similar results using cython, but it would require a bit more boilerplate and writing a setup.py file.
I also like this solution because the code remains readable and you can write it the way you originally thought about implementing the algorithm.
How about something like this, taking advantage of np.bincount? Not super-robust, but functional. [Thanks to @Warren Weckesser for the setup.]
import numpy as np
from collections import Counter
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
    for i in xrange(1, len(markov_chain)):
        old_state = markov_chain[i - 1]
        new_state = markov_chain[i]
        transition_counts_matrix[old_state, new_state] += 1

def using_counter(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] = counts.values()

def using_bincount(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)

def using_bincount_reshape(chain, counts_matrix):
    flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
    return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)
which gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop
[edit]
Avoiding flat (at the cost of not working in-place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop
Here's a faster method. The idea is to count the number of occurrences of each transition, and use the counts in a vectorized update of the matrix. (I'm assuming that the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the number of occurrences of each transition.
from collections import Counter
def update_matrix(chain, counts_matrix):
    counts = Counter(zip(chain[:-1], chain[1:]))
    from_, to = zip(*counts.keys())
    counts_matrix[from_, to] += counts.values()
Timing example, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop
It's faster, but not orders of magnitude faster. For a real speed up, you might consider implementing this in Cython.
OK, a few ideas to tinker with, with some slight improvement (at the cost of human understanding):
Let's start with a random vector of integers between 0 and 9 of length 3000:
L = 3000
N = 10
states = np.random.randint(N, size=L)
transitions = np.zeros((N, N))
Your method, on my machine, has a timeit performance of 11.4 ms.
The first thing for a little improvement is to avoid reading the data twice, storing it in a temporary variable:
old = states[0]
for i in range(1, len(states)):
    new = states[i]
    transitions[new, old] += 1
    old = new
This gives you a ~10% improvement and drops the time to 10.9 ms.
A more involved approach uses strides:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
state_2 = rolling(states, 2)
for i in range(len(state_2)):
    l, m = state_2[i, 0], state_2[i, 1]
    transitions[m, l] += 1
The strides let you read the consecutive numbers of the array by tricking it into thinking that the rows start at a different offset (OK, it's not well described, but if you take some time to read about strides you will get it).
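To see what rolling produces, here's a quick illustration (assuming the rolling defined above and a small int array):

a = np.arange(6)   # array([0, 1, 2, 3, 4, 5])
rolling(a, 2)
# array([[0, 1],
#        [1, 2],
#        [2, 3],
#        [3, 4],
#        [4, 5]])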
This approach loses performance, going to 12.2 ms, but it is the gateway to tricking the system even more. Flattening both the transition matrix and the strided array into one-dimensional arrays, you can speed things up a little more:
transitions = np.zeros(N * N)
state_2 = rolling(states, 2)
state_flat = np.sum(state_2 * np.array([1, 10]), axis=1)
for i in state_flat:
    transitions[i] += 1
transitions = transitions.reshape((N, N))
This goes down to 7.75 ms. It's not an order of magnitude, but it's 30% better anyway :)
I need an aggregator function that reduces two lists to a single total number. items is supposed to be a vector of Booleans.
So, I wrote these functions:
def element_wise_multiplication(weights, items):
    return map(lambda x, y: x * y, weights, items)

def total(weights, items):
    return sum(element_wise_multiplication(weights, items))
They look OK to me, but the profiler showed that the line with the lambda in it is responsible for 95% of the runtime, so its performance is pretty much unacceptable.
What is the most efficient way to implement it?
P.S. I am aware of NumPy's arrays, but I would like to use PyPy for this one. Or is it not worth it in this case?
You can take care of that with a generator like so:
from itertools import izip
value = sum((x * y) for x, y in izip(weights, items))
izip accomplishes the same thing as the built-in zip, but without the memory overhead.
Although you mention not wishing to use numpy in this case, it may be worth looking at the speed differences.
The best non-numpy solution appears to be a generator using izip which marginally outperforms zip.
In [31]: %timeit sum(x*y for x,y in zip(weights,items))
10000 loops, best of 3: 158 us per loop
In [32]: %timeit sum(x*y for x,y in izip(weights,items))
10000 loops, best of 3: 125 us per loop
However when we use the numpy arrays we get:
In [33]: %timeit (np_weights*np_items).sum()
100000 loops, best of 3: 9.08 us per loop
The numpy solution is a full 14 times faster. If this is really a bottleneck in your code then numpy is the way to go.
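For completeness, the cleanest numpy spelling of this reduction is a single dot product; a small sketch, assuming weights and items are the lists from the question:

import numpy as np

np_weights = np.asarray(weights, dtype=float)
np_items = np.asarray(items, dtype=float)   # Booleans become 0.0 / 1.0

value = np.dot(np_weights, np_items)        # same result as (np_weights * np_items).sum()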
Try just this:
def total(weights, items):
    return sum(x * y for x, y in zip(weights, items))