The following is the most basic way I know of to count transitions in a markov chain and use it to populate a transition matrix:
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
for i in xrange(1, len(markov_chain)):
old_state = markov_chain[i - 1]
new_state = markov_chain[i]
transition_counts_matrix[old_state, new_state] += 1
I've tried speeding it up in 3 different ways:
1) Using a sparse matrix one-liner based on this Matlab code:
transition_matrix = full(sparse(markov_chain(1:end-1), markov_chain(2:end), 1))
Which in Numpy/SciPy, looks like this:
def get_sparse_counts_matrix(markov_chain, number_of_states):
return coo_matrix(([1]*(len(markov_chain) - 1), (markov_chain[0:-1], markov_chain[1:])), shape=(number_of_states, number_of_states))
And I've tried a couple more Python tweaks, like using zip():
for old_state, new_state in zip(markov_chain[0:-1], markov_chain[1:]):
transition_counts_matrix[old_state, new_state] += 1
And Queues:
old_and_new_states_holder = Queue(maxsize=2)
old_and_new_states_holder.put(markov_chain[0])
for new_state in markov_chain[1:]:
old_and_new_states_holder.put(new_state)
old_state = old_and_new_states_holder.get()
transition_counts_matrix[old_state, new_state] += 1
But none of these 3 methods sped things up. In fact, everything but the zip() solution was at least 10X slower than my original solution.
Are there any other solutions worth looking into?
Modified solution for building a transition matrix from lots of chains
The best answer to the above question specifically was DSM's. However, for anyone who wants to populate a transition matrix based on a list of millions of markov chains, the quickest way is this:
def fast_increment_transition_counts_from_chain(markov_chain, transition_counts_matrix):
flat_coords = numpy.ravel_multi_index((markov_chain[:-1], markov_chain[1:]), transition_counts_matrix.shape)
transition_counts_matrix.flat += numpy.bincount(flat_coords, minlength=transition_counts_matrix.size)
def get_fake_transitions(markov_chains):
fake_transitions = []
for i in xrange(1,len(markov_chains)):
old_chain = markov_chains[i - 1]
new_chain = markov_chains[i]
end_of_old = old_chain[-1]
beginning_of_new = new_chain[0]
fake_transitions.append((end_of_old, beginning_of_new))
return fake_transitions
def decrement_fake_transitions(fake_transitions, counts_matrix):
for old_state, new_state in fake_transitions:
counts_matrix[old_state, new_state] -= 1
def fast_get_transition_counts_matrix(markov_chains, number_of_states):
"""50% faster than original, but must store 2 additional slice copies of all markov chains in memory at once.
You might need to break up the chains into manageable chunks that don't exceed your memory.
"""
transition_counts_matrix = numpy.zeros([number_of_states, number_of_states])
fake_transitions = get_fake_transitions(markov_chains)
markov_chains = list(itertools.chain(*markov_chains))
fast_increment_transition_counts_from_chain(markov_chains, transition_counts_matrix)
decrement_fake_transitions(fake_transitions, transition_counts_matrix)
return transition_counts_matrix
Just for kicks, and because I've been wanting to try it out, I applied Numba to your problem. In code, that involves just adding a decorator (although I've made a direct call so I could test the jit variants that numba provides here):
import numpy as np
import numba
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
for i in xrange(1, len(markov_chain)):
old_state = markov_chain[i - 1]
new_state = markov_chain[i]
transition_counts_matrix[old_state, new_state] += 1
autojit_func = numba.autojit()(increment_counts_in_matrix_from_chain)
jit_func = numba.jit(argtypes=[numba.int64[:,::1],numba.double[:,::1]])(increment_counts_in_matrix_from_chain)
t = np.random.randint(0,50, 500)
m1 = np.zeros((50,50))
m2 = np.zeros((50,50))
m3 = np.zeros((50,50))
And then timings:
In [10]: %timeit increment_counts_in_matrix_from_chain(t,m1)
100 loops, best of 3: 2.38 ms per loop
In [11]: %timeit autojit_func(t,m2)
10000 loops, best of 3: 67.5 us per loop
In [12]: %timeit jit_func(t,m3)
100000 loops, best of 3: 4.93 us per loop
The autojit method does some guessing based on runtime inputs, and the jit function has types dictated. You have to be a little careful since numba at these early stages doesn't communicate that there was an error with jit if you pass in the wrong type for an input. It will just spit out an incorrect answer.
That said though, getting a 35x and 485x speed-up without any code change and just adding a call to numba (can be also called as a decorator) is pretty impressive in my book. You could probably get similar results using cython, but it would require a bit more boilerplate and writing a setup.py file.
I also like this solution because the code remains readable and you can write it the way you originally thought about implementing the algorithm.
How about something like this, taking advantage of np.bincount? Not super-robust, but functional. [Thanks to #Warren Weckesser for the setup.]
import numpy as np
from collections import Counter
def increment_counts_in_matrix_from_chain(markov_chain, transition_counts_matrix):
for i in xrange(1, len(markov_chain)):
old_state = markov_chain[i - 1]
new_state = markov_chain[i]
transition_counts_matrix[old_state, new_state] += 1
def using_counter(chain, counts_matrix):
counts = Counter(zip(chain[:-1], chain[1:]))
from_, to = zip(*counts.keys())
counts_matrix[from_, to] = counts.values()
def using_bincount(chain, counts_matrix):
flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
counts_matrix.flat = np.bincount(flat_coords, minlength=counts_matrix.size)
def using_bincount_reshape(chain, counts_matrix):
flat_coords = np.ravel_multi_index((chain[:-1], chain[1:]), counts_matrix.shape)
return np.bincount(flat_coords, minlength=counts_matrix.size).reshape(counts_matrix.shape)
which gives:
In [373]: t = np.random.randint(0,50, 500)
In [374]: m1 = np.zeros((50,50))
In [375]: m2 = m1.copy()
In [376]: m3 = m1.copy()
In [377]: timeit increment_counts_in_matrix_from_chain(t, m1)
100 loops, best of 3: 2.79 ms per loop
In [378]: timeit using_counter(t, m2)
1000 loops, best of 3: 924 us per loop
In [379]: timeit using_bincount(t, m3)
10000 loops, best of 3: 57.1 us per loop
[edit]
Avoiding flat (at the cost of not working in-place) can save some time for small matrices:
In [80]: timeit using_bincount_reshape(t, m3)
10000 loops, best of 3: 22.3 us per loop
Here's a faster method. The idea is to count the number of occurrences of each transition, and use the counts in a vectorized update of the matrix. (I'm assuming that the same transition can occur multiple times in markov_chain.) The Counter class from the collections library is used to count the number of occurrences of each transition.
from collections import Counter
def update_matrix(chain, counts_matrix):
counts = Counter(zip(chain[:-1], chain[1:]))
from_, to = zip(*counts.keys())
counts_matrix[from_, to] += counts.values()
Timing example, in ipython:
In [64]: t = np.random.randint(0,50, 500)
In [65]: m1 = zeros((50,50))
In [66]: m2 = zeros((50,50))
In [67]: %timeit increment_counts_in_matrix_from_chain(t, m1)
1000 loops, best of 3: 895 us per loop
In [68]: %timeit update_matrix(t, m2)
1000 loops, best of 3: 504 us per loop
It's faster, but not orders of magnitude faster. For a real speed up, you might consider implementing this in Cython.
Ok, few ideas to tamper with, with some slight improvement (at cost of human undestanding)
Let's start with a random vector of integers between 0 and 9 of length 3000:
L = 3000
N = 10
states = array(randint(N),size=L)
transitions = np.zeros((N,N))
Your method, on my machine, has a timeit performance of 11.4 ms.
The first thing for a little improvement is to avoid to read the data twice, storing it in a temporary variable:
old = states[0]
for i in range(1,len(states)):
new = states[i]
transitions[new,old]+=1
old=new
This gives you a ~10% improvement and drops the time to 10.9 ms.
A more involuted approach uses the strides:
def rolling(a, window):
shape = (a.size - window + 1, window)
strides = (a.itemsize, a.itemsize)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
state_2 = rolling(states, 2)
for i in range(len(state_2)):
l,m = state_2[i,0],state_2[i,1]
transitions[m,l]+=1
The strides allow you to read the consecutive numbers of the array tricking the array to think that the rows start in a different way (ok, it's not well described, but if you take some time to read about strides you will get it)
This approach loses performance, going to 12.2 ms, but it is the hallway to trick the system even more. flattening both the transition matrix and the strided array to one dimensional arrays, you can speed up the performance a little more:
transitions = np.zeros(N*N)
state_2 = rolling(states, 2)
state_flat = np.sum(state_2 * array([1,10]),axis=1)
for i in state_flat:
transitions[i]+=1
transitions.reshape((N,N))
This goes down to 7.75 ms. It's not an order of magnitude, but it's a 30% better anyway :)
Related
Suppose I have a bunch of arrays, including x and y, and I want to check if they're equal. Generally, I can just use np.all(x == y) (barring some dumb corner cases which I'm ignoring now).
However this evaluates the entire array of (x == y), which is usually not needed. My arrays are really large, and I have a lot of them, and the probability of two arrays being equal is small, so in all likelihood, I really only need to evaluate a very small portion of (x == y) before the all function could return False, so this is not an optimal solution for me.
I've tried using the builtin all function, in combination with itertools.izip: all(val1==val2 for val1,val2 in itertools.izip(x, y))
However, that just seems much slower in the case that two arrays are equal, that overall, it's stil not worth using over np.all. I presume because of the builtin all's general-purposeness. And np.all doesn't work on generators.
Is there a way to do what I want in a more speedy manner?
I know this question is similar to previously asked questions (e.g. Comparing two numpy arrays for equality, element-wise) but they specifically don't cover the case of early termination.
Until this is implemented in numpy natively you can write your own function and jit-compile it with numba:
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def arrays_equal(a, b):
if a.shape != b.shape:
return False
for ai, bi in zip(a.flat, b.flat):
if ai != bi:
return False
return True
a = np.random.rand(10, 20, 30)
b = np.random.rand(10, 20, 30)
%timeit np.all(a==b) # 100000 loops, best of 3: 9.82 µs per loop
%timeit arrays_equal(a, a) # 100000 loops, best of 3: 9.89 µs per loop
%timeit arrays_equal(a, b) # 100000 loops, best of 3: 691 ns per loop
Worst case performance (arrays equal) is equivalent to np.all and in case of early stopping the compiled function has the potential to outperform np.all a lot.
Adding short-circuit logic to array comparisons is apparently being discussed on the numpy page on github, and will thus presumably be available in a future version of numpy.
Probably someone who understands the underlying data structure could optimize this or explain whether it's reliable/safe/good practice, but it seems to work.
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
%timeit np.all(a==b)
The slowest run took 10.82 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.2 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
The slowest run took 8.55 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.85 µs per loop
If I understand this correctly, ndarray.data creates a pointer to the data buffer and memoryview creates a native python type that can be short-circuited out of the buffer.
I think.
EDIT: further testing shows it may not be as big a time-improvement as shown. previously a=b=np.eye(5)
a=np.random.randint(0,10,(100,100))
b=a.copy()
%timeit np.all(a==b)
The slowest run took 6.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 17.7 µs per loop
%timeit memoryview(a.data)==memoryview(b.data)
10000 loops, best of 3: 30.1 µs per loop
np.all(a==b)
Out[]: True
memoryview(a.data)==memoryview(b.data)
Out[]: True
Hmmm, I know it is the poor answer but it seems there is no easy way for this. Numpy Creators should fix it. I suggest:
def compare(a, b):
if len(a) > 0 and not np.array_equal(a[0], b[0]):
return False
if len(a) > 15 and not np.array_equal(a[:15], b[:15]):
return False
if len(a) > 200 and not np.array_equal(a[:200], b[:200]):
return False
return np.array_equal(a, b)
:)
Well, not really an answer as I haven't checked if it break-circuits, but:
assert_array_equal.
From the documentation:
Raises an AssertionError if two array_like objects are not equal.
Try Except it if not on a performance sensitive code path.
Or follow the underlying source code, maybe it's efficient.
You could iterate all elements of the arrays and check if they are equal.
If the arrays are most likely not equal it will return much faster than the .all function.
Something like this:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([1, 3, 4])
areEqual = True
for x in range(0, a.size-1):
if a[x] != b[x]:
areEqual = False
break
else:
print "a[x] is equal to b[x]\n"
if areEqual:
print "The tables are equal\n"
else:
print "The tables are not equal\n"
As Thomas Kühn wrote in a comment to your post, array_equal is a function which should solve the problem. It is described in Numpy's API reference.
Breaking down the original problem to three parts: "(1) My arrays are really large, and (2) I have a lot of them, and (3) the probability of two arrays being equal is small"
All the solutions (to date) are focused on part (1) - optimizing the performance of each equality check, and some improve this performance by factor of 10. Points (2) and (3) are ignored. Comparing each pair has O(n^2) complexity, which can become huge for a lot of matrices, while needles as the probability of being duplicates is very small.
The check can become much faster with the following general algorithm -
fast hash of each array O(n)
check equality only for arrays with the same hash
A good hash is almost unique, so the number of keys can easily be a very large fraction of n. On average, number of arrays with the same hash will be very small, and almost 1 in some cases. Duplicate arrays will have the same hash, while having the same hash doesn't guarantee they are duplicates. In that sense, the algorithm will catch all the duplicates. Comparing images only with the same hash significantly reduces the number of comparisons, which becomes almost O(n)
For my problem, I had to check duplicates within ~1 million integer arrays, each with 10k elements. Optimizing only the array equality check (with #MB-F solution) estimated run time was 5 days. With hashing first it finished in minutes. (I used array sum as the hash, that was suited for my arrays characteristics)
Some psuedo-python code
def fast_hash(arr) -> int:
pass
def arrays_equal(arr1, arr2) -> bool:
pass
def make_hash_dict(array_stack, hush_fn=np.sum):
hash_dict = defaultdict(list)
hashes = np.squeeze(np.apply_over_axes(hush_fn, array_stack, range(1, array_stack.ndim)))
for idx, hash_val in enumerate(hashes):
hash_dict[hash_val].append(idx)
return hash_dict
def get_duplicate_sets(hash_dict, array_stack):
duplicate_sets = []
for hash_key, ind_list in hash_dict.items():
if len(ind_list) == 1:
continue
all_duplicates = []
for idx1 in range(len(ind_list)):
v1 = ind_list[idx1]
if v1 in all_duplicates:
continue
arr1 = array_stack[v1]
curr_duplicates = []
for idx2 in range(idx1+1, len(ind_list)):
v2 = ind_list[idx2]
arr2 = array_stack[v2]
if arrays_equal(arr1, arr2):
if len(curr_duplicates) == 0:
curr_duplicates.append(v1)
curr_duplicates.append(v2)
if len(curr_duplicates) > 0:
all_duplicates.extend(curr_duplicates)
duplicate_sets.append(curr_duplicates)
return duplicate_sets
The variable duplicate_sets is a list of lists, each internal list contains indices of all the same duplicates.
I have some data(stock data) and need to manipulate it by making some calculations on that data. I did it with numpy arrays. Numpy is pretty faster than python built-in functions. But, the execution time of my code is higher than expected. My code is in below and I tested it with ipython %timeit function. The result is like this: Total execution time is 5.44 ms, second 'for' loop takes most time 3.88ms and cause for that, is 'np.mean' function in that loop. So, alternatives to 'np.mean' and any other suggestions to speed up execution time would be helpful.
Code
data = my_class.Project.all_data["AAP_data"]
data = np.array(data[["High", "Low", "Close"]])
true_range = np.empty((data.shape[0]-1, 1))
for i in range(1, true_range.shape[0]+1):
true_range[i-1] = max((data[i, 0] - data[i, 1]), (abs(data[i, 0] - data[i-1, 2])),
(abs(data[i, 1] - data[i-1, 2])))
average_true_range = np.empty((true_range.shape[0]-13, 1))
for i in range(13, average_true_range.shape[0]+13):
lastn_tr = true_range[(i-13):(i+1)]
average_true_range[i-13] = np.mean(lastn_tr)
That is basically sliding window average calculations. This averaging could be thought of as summing in sliding windows and dividing by the length of window size. So, we can use 1D convolution with np.convolve for a vectorized solution to get rid of that entire loopy process to give us average_true_range, like so -
np.convolve(true_range,np.ones((14),dtype=int),'valid')/14.0
For further performance boost, as we might have learnt from studying how CPUs are more efficient in multiplications than divisions. So, let's employ it here for a improved version -
r = 1.0/14
out = np.convolve(true_range,np.ones((14),dtype=int),'valid')*r
Runtime test -
In [53]: def original_app(true_range):
...: average_true_range = np.zeros((true_range.shape[0]-13, 1))
...: for i in range(13, average_true_range.shape[0]+13):
...: lastn_tr = true_range[(i-13):(i+1)]
...: average_true_range[i-13] = np.mean(lastn_tr)
...: return average_true_range
...:
...: def vectorized_app(true_range):
...: return np.convolve(true_range,np.ones((14),dtype=int),'valid')/14.0
...:
...: def vectorized_app2(true_range):
...: r = 1.0/14
...: return np.convolve(true_range,np.ones((14),dtype=int),'valid')*r
...:
In [54]: true_range = np.random.rand(10000) # Input array
In [55]: %timeit original_app(true_range)
1 loops, best of 3: 180 ms per loop
In [56]: %timeit vectorized_app(true_range)
1000 loops, best of 3: 446 µs per loop
In [57]: %timeit vectorized_app2(true_range)
1000 loops, best of 3: 401 µs per loop
Massive speedups there!
Later on, the bottleneck might shift to the first part of getting true_range. To vectorize things there, here's an approach using slicing -
col0 = data[1:,0] - data[1:,1]
col1 = np.abs(data[1:,0] - data[:-1,2])
col2 = np.abs(data[1:,1] - data[:-1,2])
true_range = np.maximum(np.maximum(col0,col1),col2)
I would like to speed up this code :
import numpy as np
import pandas as pd
a = pd.read_csv(path)
closep = a['Clsprc']
delta = np.array(closep.diff())
upgain = np.where(delta >= 0, delta, 0)
downloss = np.where(delta <= 0, -delta, 0)
up = sum(upgain[0:14]) / 14
down = sum(downloss[0:14]) / 14
u = []
d = []
for x in np.nditer(upgain[14:]):
u1 = 13 * up + x
u.append(u1)
up = u1
for y in np.nditer(downloss[14:]):
d1 = 13 * down + y
d.append(d1)
down = d1
The data below:
0 49.00
1 48.76
2 48.52
3 48.28
...
36785758 13.88
36785759 14.65
36785760 13.19
Name: Clsprc, Length: 36785759, dtype: float64
The for loop is too slow, what can I do to speed up this code? Can I vectorize the entire operation?
It looks like you're trying to calculate an exponential moving average (rolling mean), but forgot the division. If that's the case then you may want to see this SO question. Meanwhile, here's a fast a simple moving average using the cumsum() function taken from the referenced link.
def moving_average(a, n=14) :
ret = np.cumsum(a, dtype=float)
ret[n:] = ret[n:] - ret[:-n]
return ret[n - 1:] / n
If this is not the case, and you really want the function described, you can increase the iteration speed by getting using the external_loop flag in your iteration. From the numpy documentation:
The nditer will try to provide chunks that are as large as possible to
the inner loop. By forcing ‘C’ and ‘F’ order, we get different
external loop sizes. This mode is enabled by specifying an iterator
flag.
Observe that with the default of keeping native memory order, the
iterator is able to provide a single one-dimensional chunk, whereas
when forcing Fortran order, it has to provide three chunks of two
elements each.
for x in np.nditer(upgain[14:], flags=['external_loop'], order='F'):
# x now has x[0],x[1], x[2], x[3], x[4], x[5] elements.
In simplified terms, I think this is what the loops are doing:
upgain=np.array([.1,.2,.3,.4])
u=[]
up=1
for x in upgain:
u1=10*up+x
u.append(u1)
up=u1
producing:
[10.1, 101.2, 1012.3, 10123.4]
np.cumprod([10,10,10,10]) is there, plus a modified cumsum for the [.1,.2,.3,.4] terms. But I can't off hand think of a way of combining these with compiled numpy functions. We could write a custom ufunc, and use its accumulate. Or we could write it in cython (or other c interface).
https://stackoverflow.com/a/27912352 suggests that frompyfunc is a way of writing a generalized accumulate. I don't expect big time savings, maybe 2x.
To use frompyfunc, define:
def foo(x,y):return 10*x+y
The loop application (above) would be
def loopfoo(upgain,u,u1):
for x in upgain:
u1=foo(u1,x)
u.append(u1)
return u
The 'vectorized' version would be:
vfoo=np.frompyfunc(foo,2,1) # 2 in arg, 1 out
vfoo.accumulate(upgain,dtype=object).astype(float)
The dtype=object requirement was noted in the prior SO, and https://github.com/numpy/numpy/issues/4155
In [1195]: loopfoo([1,.1,.2,.3,.4],[],0)
Out[1195]: [1, 10.1, 101.2, 1012.3, 10123.4]
In [1196]: vfoo.accumulate([1,.1,.2,.3,.4],dtype=object)
Out[1196]: array([1.0, 10.1, 101.2, 1012.3, 10123.4], dtype=object)
For this small list, loopfoo is faster (3µs v 21µs)
For a 100 element array, e.g. biggain=np.linspace(.1,1,100), the vfoo.accumulate is faster:
In [1199]: timeit loopfoo(biggain,[],0)
1000 loops, best of 3: 281 µs per loop
In [1200]: timeit vfoo.accumulate(biggain,dtype=object)
10000 loops, best of 3: 57.4 µs per loop
For an even larger biggain=np.linspace(.001,.01,1000) (smaller number to avoid overflow), the 5x speed ratio remains.
Input data
Produce n matrices of a given size (here, 3x2). I also chose n=25, but I let n to lay the emphasis on the fact that what we have is a bunch of matrices.
import numpy as np
n = 25
data = np.random.rand(n, 3, 2)
This is just a format example : I can't change it. Or if I do, one must take into account the computational cost of this change.
Current implementation
What I want to achieve atomically is:
output = []
for datum in data: # This outputs on (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
or, in a faster fashion:
output = [dr-d0 for (dr, d0) in zip(datum[:,0], datum[:,1:])]
Problem
This is too slow and:
output = datum[:,1:] - datum[:,0]
does not work since the behavior of the subtraction operation is not well defined in that case. Plus, this kind of slicing is not very efficient.
Cython/Nuitka/PyPy and the likes are possible solutions, but I'd like to stick with raw Numpy for now, if possible. Maybe some kind of function that can be applied on elements of the outer loop of a numpy array very quickly without the overhead of python stuff...
The np.vectorize function doesn't work on:
def get_diff(mat):
return mat[1:] - mat[0]
So I invoke ye, High Priests of Numpy, servants of Python to enlighten my poor soul!
EDIT:
XY Problem
(I didn't know it had a name)
What I actually want to do is to determine the content (read "volume") of a lot of simplices (read "tetrahedra"). The easiest and most efficient way to do it, AFAIK is to calculate:
np.linalg.det(mat[:1]-mat[0])
Then let me rephrase my question: how can I efficiently compute the content of any ensemble of simplices of dimension k using plain python and numpy?
I suggest data[:,1:] - data[:,0,None]. The None creates a new axis (officially you're supposed to use np.newaxis, which makes it very clear what you're doing), and then the subtraction will behave the way you want it to.
Correcting what I think are errors in your list comprehension:
def loop(data):
output = []
for datum in data: # This outputs on (3x2) matrix after the other
d0 = datum[0]
dr = datum[1:]
output.append(dr-d0)
return output
def listcomp(data):
output = [dr-d0 for (d0, dr) in zip(data[:,0], data[:,1:])]
return output
def sub(data):
output = data[:,1:] - data[:,0,None]
return output
we have
>>> import numpy as np
>>> n = 25
>>> data = np.random.rand(n, 3, 2)
>>> res_loop = loop(data)
>>> res_listcomp = listcomp(data)
>>> res_sub = sub(data)
>>> np.allclose(res_loop, res_listcomp)
True
>>> np.allclose(res_loop, res_sub)
True
>>>
>>> %timeit loop(data)
10000 loops, best of 3: 184 µs per loop
>>> %timeit listcomp(data)
10000 loops, best of 3: 158 µs per loop
>>> %timeit sub(data)
100000 loops, best of 3: 12.8 µs per loop
This code fragment is a bottleneck in a project of mine. Are there function calls that could replace the for loops and speed it up?
D = np.zeros((nOcc,nOcc,nVir,nVir))
for i in range(nOcc):
for j in range(i+1):
tmp = Ew[i] + Ew[j]
for a in range(nVir):
tmp2 = tmp - Ew[a+nOcc]
for b in range(a+1):
tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
D[j,i,a,b] = D[i,j,b,a]
D[j,i,b,a] = D[i,j,a,b]
To start off lets generate some arbitrary data, thats obeys a few required principles:
nOcc = 30
nVir = 120
Ew = np.random.rand(nOcc+nVir)
Ew[:nOcc]*=-1
Ia = np.random.rand(nOcc)
Ib = np.random.rand(nVir)
I = np.einsum('a,b,c,d->abcd',Ia,Ib,Ia,Ib)
Lets wrap your base code as an example:
def oldcalc_D(Iiajb,nOcc,nVir,Ew):
D = np.zeros((nOcc,nOcc,nVir,nVir))
for i in range(nOcc):
for j in range(i+1):
tmp = Ew[i] + Ew[j]
for a in range(nVir):
tmp2 = tmp - Ew[a+nOcc]
for b in range(a+1):
tmp3 = 1.0/(tmp2 - Ew[b+nOcc])
D[i,j,a,b] = Iiajb[i,a,j,b]*tmp3
D[i,j,b,a] = Iiajb[i,b,j,a]*tmp3
D[j,i,a,b] = D[i,j,b,a]
D[j,i,b,a] = D[i,j,a,b]
return D
Taking advantage of integral symmetry is typically a good tactic; however, in numpy alone it is not worth the cost so we are going to ignore this and simply vectorize your code:
def newcalc_D(I,nOcc,nVir,Ew):
O = Ew[:nOcc]
V = Ew[nOcc:]
D = O[:,None,None,None] - V[:,None,None] + O[:,None] - V
return (I/D).swapaxes(1,2)
Some timings:
np.allclose(oldcalc_D(I,nOcc,nVir,Ew),newcalc_D(I,nOcc,nVir,Ew))
True
%timeit newcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 142 ms per loop
%timeit oldcalc_D(I,nOcc,nVir,Ew)
1 loops, best of 3: 15 s per loop
So only about ~100x faster, as I said this is a fairly simple pass to give you an idea what to do. This can be done much better, but should be a trivial part of the calculation as the integral transformation is (O)N^5 vs this at (O)N^4. For these operations I use numba's autojit feature:
from numba import autojit
numba_D = autojit(oldcalc_D)
%timeit numba_D(I,nOcc,nVir,Ew)
10 loops, best of 3: 55.1 ms per loop
unless this is Python3, you may want to start by replacing range with xrange: the former creates the entire list, while the latter is just an iterator, which is all you need in this case.
For large Ns, speed difference should be noticeable.
Also, seeing as your using numpy, there is probably a vectorized way to implement the algorithm. If that's the case, the vectorized implementation should be orders of magnitude faster. But unless you explain the variables and the algorithm, we can't help you in that direction.