I've written a method that takes in an integer "n" and creates a square matrix where the values of each element are dictated by their respective i,j indices.
When I build a small matrix 30x30 it works just fine, but when I try to do something larger like 1000x1000 it takes very long. Is there any way that I can speed it up with multiprocessing?
def createMatrix(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
Parallelizing two computation-bound for loops in Python is not trivial because of GIL. The good news is that your case is perfectly vectorizeable:
def createMatrix(n):
return 1 / (np.arange(n)[None, :] + np.arange(n)[:, None] + 1)
Explanation:
essentially, your formula for the matrix is X[row][column] = 1/(row+column-1), where rows and columns are 1-based
np.arange(n) creates a range that can be used for rows or columns
[None, :] and [:, None] turn it into a 2d array, 1 x n or n x 1
numpy then broadcasts dimensions, replicating row and column indexes to match dimensions - thus, implicitly tiling both into n x n when added
since both ranges are 0-based, using +1 instead of -1
As a rule of thumb, it is almost never a good idea to use for loops on numpy arrays. A vectorized approach (i.e. matrix form computations) is orders of magnitude faster.
It's not a good idea to use fors to fill a list then convert it to a matrix. the operation that you have can be vectorized with numpy from scratch. if you think that given the i,j, M(i,j) = 1/(j+i-1) considering that both indices starts at 1.
Here's my proposal :
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
looking at Marat answer, I think his/her it's better, so tested the 3 methods:
EDIT: added wwii method as createMatrix4 (correcting the errors):
import numpy as np
from time import time
def createMatrix1(n):
matrix = []
for j in range(1,n+1):
row = []
for i in range(1,n+1):
value = 1/(i+j-1)
row.append(value)
matrix.append(row)
return np.array(matrix)
def createMatrix2(n):
arr =np.arange(1,n+1)
xx,yy = np.meshgrid(arr,arr)
matrix = 1/(xx+yy-1)
return matrix
def createMatrix3(n):
"""Marat's proposed matrix"""
return 1 / (1 + np.arange(n)[None, :] + np.arange(n)[:, None])
def createMatrix4(n):
""" wwii method"""
i,j = np.ogrid[1:n,1:n]
return 1/(i+j-1)
#test all the three methods
n = 10000
t1 = time()
m1 = createMatrix1(n)
t2 = time()
m2 = createMatrix2(n)
t3 = time()
m3 = createMatrix3(n)
t4 = time()
m4 = createMatrix4(n)
t5 = time()
print(np.allclose(m1,m2))
print(np.allclose(m1,m3))
print(np.allclose(m1,m4))
print("Matrix 1 (OP): ",t2-t1)
print("Matrix 2: (mine)",t3-t2)
print("Matrix 3: (Marat)",t4-t3)
print("Matrix 4: (wwii)",t5-t4)
# the output is:
#True
#True
#True
#Matrix 1 (OP): 18.4886577129364
#Matrix 2: (mine) 1.005324363708496
#Matrix 3: (Marat) 0.43033909797668457
#Matrix 4: (wwii) 0.5138359069824219
So Marat's solution is faster. As general comments:
Try to avoid fors loops
Think your problem as operation with indices and dessing operations with numpy arrays directly.
For last, given Marat's answer I thought my proposal is a easier to read, and understand. But it's just a subjective view
Your code can be written in another style, accelerated by numba library in a parallel no python mode:
import numba as nb
#nb.njit("float64[:, ::1](int64)", parallel=True, fastmath=True)
def createMatrix(n):
matrix = np.empty((n, n)) # np.zeros is slower than np.empty
for j in nb.prange(1, n + 1):
for i in range(1, n + 1):
matrix[j - 1, i - 1] = 1 / (i + j - 1)
return matrix
This solution will be faster than the Marat answer above 3 times.
Benchmarks: (temporary link to colab)
n = 1000
1000 loops, best of 5: 3.52 ms per loop # Marat
1000 loops, best of 5: 1.5 ms per loop # numba accelerated with np.zeros
1000 loops, best of 5: 1.05 ms per loop # numba accelerated with np.empty
n = 3000
1000 loops, best of 5: 39.5 ms per loop
1000 loops, best of 5: 19.3 ms per loop
1000 loops, best of 5: 8.91 ms per loop
n = 5000
1000 loops, best of 5: 109 ms per loop
1000 loops, best of 5: 53.5 ms per loop
1000 loops, best of 5: 24.8 ms per loop
Related
Ordered list reduction
I need to reduce some lists where, depending on element types, the speed and implementation of the binary operation varies, i.e. large speed reductions can be gained by reducing some pairs with specific functions first.
For example foo(a[0], bar(a[1], a[2]))
might be a lot slower than bar(foo(a[0], a[1]), a[2]) but in this case give the same result.
I have the code that produces an optimal ordering in the form of a list of tuples (pair_index, binary_function) already. I am struggling to implement an efficient function to perform the reduction, ideally one that returns a new partial function which can then be used repeatedly on lists of the same type-ordering but varying values.
Simple and slow(?) solution
Here is my naive solution involving a for loop, deletion of elements and closure over the (pair_index, binary_function) list to return a 'precomputed' function.
def ordered_reduce(a, pair_indexes, binary_functions, precompute=False):
"""
a: list to reduce, length n
pair_indexes: order of pairs to reduce, length (n-1)
binary_functions: functions to use for each reduction, length (n-1)
"""
def ord_red_func(x):
y = list(x) # copy so as not to eat up
for p, f in zip(pair_indexes, binary_functions):
b = f(y[p], y[p+1])
# Replace pair
del y[p]
y[p] = b
return y[0]
return ord_red_func if precompute else ord_red_func(a)
>>> foos = (lambda a, b: a - b, lambda a, b: a + b, lambda a, b: a * b)
>>> ordered_reduce([1, 2, 3, 4], (2, 1, 0), foos)
1
>>> 1 * (2 + (3-4))
1
And how pre-compution works:
>>> foo = ordered_reduce(None, (0, 1, 0), foos)
>>> foo([1, 2, 3, 4])
-7
>>> (1 - 2) * (3 + 4)
-7
However it involves copying the whole list and is also (therefore?) slow. Is there a better/standard way to do this?
(EDIT:) Some Timings:
from operators import add
from functools import reduce
from itertools import repeat
from random import random
r = 100000
xs = [random() for _ in range(r)]
# slightly trivial choices of pairs and functions, to replicate reduce
ps = [0]*(r-1)
fs = repeat(add)
foo = ordered_reduce(None, ps, fs, precompute=True)
>>> %timeit reduce(add, xs)
100 loops, best of 3: 3.59 ms per loop
>>> %timeit foo(xs)
1 loop, best of 3: 1.44 s per loop
This is kind of worst case scenario, and slightly cheating as reduce does not take a iterable of functions, but a function which does (but no order) is still pretty fast:
def multi_reduce(fs, xs):
xs = iter(xs)
x = next(xs)
for f, nx in zip(fs, xs):
x = f(x, nx)
return x
>>> %timeit multi_reduce(fs, xs)
100 loops, best of 3: 8.71 ms per loop
(EDIT2): and for fun, the performance of a massively cheating 'compiled' version, which gives some idea of the total overhead occurring.
from numba import jit
#jit(nopython=True)
def numba_sum(xs):
y = 0
for x in xs:
y += x
return y
>>> %timeit numba_sum(xs)
1000 loops, best of 3: 1.46 ms per loop
When I read this problem, I immediately thought of reverse Polish notation (RPN). While it may not be the best approach, it still gives a substantial speedup in this case.
My second thought is that you may get an equivalent result if you just reorder the sequence xs appropriately to get rid of del y[p]. (Arguably the best performance would be achieved if the whole reduce procedure is written in C. But it's a different kettle of fish.)
Reverse Polish Notation
If you are not familiar with RPN, please read the short explanation in the wikipedia article. Basically, all operations can be written down without parentheses, for example (1-2)*(3+4) is 1 2 - 3 4 + * in RPN, while 1-(2*(3+4)) becomes 1 2 3 4 + * -.
Here is a simple implementation of an RPN parser. I separated an list of objects from an RPN sequence, so that the same sequence can be used for directly for different lists.
def rpn(arr, seq):
'''
Reverse Polish Notation algorithm
(this version works only for binary operators)
arr: array of objects
seq: rpn sequence containing indices of objects from arr and functions
'''
stack = []
for x in seq:
if isinstance(x, int):
# it's an object: push it to stack
stack.append(arr[x])
else:
# it's a function: pop two objects, apply the function, push the result to stack
b = stack.pop()
#a = stack.pop()
#stack.append(x(a,b))
## shortcut:
stack[-1] = x(stack[-1], b)
return stack.pop()
Example of usage:
# Say we have an array
arr = [100, 210, 42, 13]
# and want to calculate
(100 - 210) * (42 + 13)
# It translates to RPN:
100 210 - 42 13 + *
# or
arr[0] arr[1] - arr[2] arr[3] + *
# So we apply `
rpn(arr,[0, 1, subtract, 2, 3, add, multiply])
To apply RPN to your case you'd need either to generate rpn sequences from scratch or to convert your (pair_indexes, binary_functions) into them. I haven't thought about a converter but it surely can be done.
Tests
Your original test comes first:
r = 100000
xs = [random() for _ in range(r)]
ps = [0]*(r-1)
fs = repeat(add)
foo = ordered_reduce(None, ps, fs, precompute=True)
rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add)) for x in (i,f)]
rpn_seq2 = list(range(r)) + list(repeat(add,r-1))
# Here rpn_seq denotes (_ + (_ + (_ +( ... )...))))
# and rpn_seq2 denotes ((...( ... _)+ _) + _).
# Obviously, they are not equivalent but with 'add' they yield the same result.
%timeit reduce(add, xs)
100 loops, best of 3: 7.37 ms per loop
%timeit foo(xs)
1 loops, best of 3: 1.71 s per loop
%timeit rpn(xs, rpn_seq)
10 loops, best of 3: 79.5 ms per loop
%timeit rpn(xs, rpn_seq2)
10 loops, best of 3: 73 ms per loop
# Pure numpy just out of curiosity:
%timeit np.sum(np.asarray(xs))
100 loops, best of 3: 3.84 ms per loop
xs_np = np.asarray(xs)
%timeit np.sum(xs_np)
The slowest run took 4.52 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 48.5 µs per loop
So, rpn was 10 times slower than reduce but about 20 times faster than ordered_reduce.
Now, let's try something more complicated: alternately adding and multiplying matrices. I need a special function for it to test against reduce.
add_or_dot_b = 1
def add_or_dot(x,y):
'''calls 'add' and 'np.dot' alternately'''
global add_or_dot_b
if add_or_dot_b:
out = x+y
else:
out = np.dot(x,y)
add_or_dot_b = 1 - add_or_dot_b
# normalizing out to avoid `inf` in results
return out/np.max(out)
r = 100001 # +1 for convenience
# (we apply an even number of functions)
xs = [np.random.rand(2,2) for _ in range(r)]
ps = [0]*(r-1)
fs = repeat(add_or_dot)
foo = ordered_reduce(None, ps, fs, precompute=True)
rpn_seq = [0] + [x for i, f in zip(range(1,r), repeat(add_or_dot)) for x in (i,f)]
%timeit reduce(add_or_dot, xs)
1 loops, best of 3: 894 ms per loop
%timeit foo(xs)
1 loops, best of 3: 2.72 s per loop
%timeit rpn(xs, rpn_seq)
1 loops, best of 3: 1.17 s per loop
Here, rpn was roughly 25% slower than reduce and more than 2 times faster than ordered_reduce.
I've recently been working on a project where a majority of my time is spent multiplying a dense matrix A and a sparse vector v (see here). In my attempts to reduce computation, I've noticed that the runtime of A.dot(v) is not affected by the number of zero entries of v.
To explain why I would expect the runtime to improve in this case, let result = A.dot.v so that result[j] = sum_i(A[i,j]*v[j]) for j = 1...v.shape[0]. If v[j] = 0 then clearly result[j] = 0 no matter the values A[::,j]. In this case, I would therefore expect numpy to just set result[j] = 0 but it seems as if it goes ahead and computes sum_i(A[i,j]*v[j]) anyways.
I went ahead and wrote a short sample script to confirm this behavior below.
import time
import numpy as np
np.__config__.show() #make sure BLAS/LAPACK is being used
np.random.seed(seed = 0)
n_rows, n_cols = 1e5, 1e3
#initialize matrix and vector
A = np.random.rand(n_rows, n_cols)
u = np.random.rand(n_cols)
u = np.require(u, dtype=A.dtype, requirements = ['C'])
#time
start_time = time.time()
A.dot(u)
print "time with %d non-zero entries: %1.5f seconds" % (sum(u==0.0), (time.time() - start_time))
#set all but one entry of u to zero
v = u
set_to_zero = np.random.choice(np.array(range(0, u.shape[0])), size = (u.shape[0]-2), replace=False)
v[set_to_zero] = 0.0
start_time = time.time()
A.dot(v)
print "time with %d non-zero entries: %1.5f seconds" % (sum(v==0.0), (time.time() - start_time))
#what I would really expect it to take
non_zero_index = np.squeeze(v != 0.0)
A_effective = A[::,non_zero_index]
v_effective = v[non_zero_index]
start_time = time.time()
A_effective.dot(v_effective)
print "expected time with %d non-zero entries: %1.5f seconds" % (sum(v==0.0), (time.time() - start_time))
Running this, I get that the runtime for matrix-vector multiplication is the same whether I use a dense matrix u or a sparse one v:
time with 0 non-zero entries: 0.04279 seconds
time with 999 non-zero entries: 0.04050 seconds
expected time with 999 non-zero entries: 0.00466 seconds
I am wondering if this is by design? Or am I missing something in the way that I'm running matrix-vector multiplication. Just as sanity checks: I've made sure that numpy is linked to a BLAS library on my machine and both arrays are C_CONTIGUOUS (since this is apparently required for numpy to call BLAS).
How about experimenting with a simple function like?
def dot2(A,v):
ind = np.where(v)[0]
return np.dot(A[:,ind],v[ind])
In [352]: A=np.ones((100,100))
In [360]: timeit v=np.zeros((100,));v[::60]=1;dot2(A,v)
10000 loops, best of 3: 35.4 us per loop
In [362]: timeit v=np.zeros((100,));v[::40]=1;dot2(A,v)
10000 loops, best of 3: 40.1 us per loop
In [364]: timeit v=np.zeros((100,));v[::20]=1;dot2(A,v)
10000 loops, best of 3: 46.5 us per loop
In [365]: timeit v=np.zeros((100,));v[::60]=1;np.dot(A,v)
10000 loops, best of 3: 29.2 us per loop
In [366]: timeit v=np.zeros((100,));v[::20]=1;np.dot(A,v)
10000 loops, best of 3: 28.7 us per loop
A fully iterative Python implentation would be:
def dotit(A,v, test=False):
n,m = A.shape
res = np.zeros(n)
if test:
for i in range(n):
for j in range(m):
if v[j]:
res[i] += A[i,j]*v[j]
else:
for i in range(n):
for j in range(m):
res[i] += A[i,j]*v[j]
return res
Obviously this won't be as fast as the compiled dot, but I expect the relative advantages of testing still apply. For further testing you could implement it in cython.
Notice that the v[j] test occurs deep in the iteration.
For a sparse v (3 out of 100 elements) testing saves time:
In [374]: timeit dotit(A,v,True)
100 loops, best of 3: 3.81 ms per loop
In [375]: timeit dotit(A,v,False)
10 loops, best of 3: 21.1 ms per loop
but it costs time if v is dense:
In [376]: timeit dotit(A,np.arange(100),False)
10 loops, best of 3: 22.7 ms per loop
In [377]: timeit dotit(A,np.arange(100),True)
10 loops, best of 3: 25.6 ms per loop
For simple arrays, Numpy doesn't perform such optimizations, but if You need, You may use sparse matrices which may improve dot product timings in that case.
For more on the topic, see: http://docs.scipy.org/doc/scipy/reference/sparse.html
Hi I'm writing program for AES mix column stage. Here I have to multiply two matrices of (4,4) shape. The only difference is that while multiplying two matrices I have to take 'xor' instead of where I have to add. e.g
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
np.dot(a,b) # this gives [[(1*5+2*7),(1*6+2*8)][(3*5+4*7),(3*6+4*8)]]
# but I want [[((1*5)^(2*7)),((1*6)^(2*8))][((3*5)^(4*7)),((3*6)^(4*8))]]
Here's the solution with loops
result = [[0,0,0,0],
[0,0,0,0],
[0,0,0,0],
[0,0,0,0]]
# iterate through rows of X
for i in range(len(X)):
# iterate through columns of Y
for j in range(len(Y[0])):
# iterate through rows of Y
for k in range(len(Y)):
result[i][j] = result[i][j] ^ (X[i][k] * Y[k][j])
How to achieve that without using loops?
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
For explanation, consider a rectangular problem for easier identification:
a=np.arange(12).reshape(4,3).astype(object)
b=np.arange(12).reshape(3,4).astype(object)
object is to provide the python int arbitrary precision for AES.
products are obtained by broadcasting,
c=a[...,None]*b # dims : (4,3,1) * ((1),3,4) -> (4,3,4) , c_ijk =a_ij*b_jk
The dot product it then obtained by :
dot_ab=c.sum(axis=1) # ->(4,4)
In [734]: (dot_ab==a.dot(b)).all()
Out[734]: True
Then change to the equivalent xor function :
xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
As an alternative, you can interpret your loops with numba (0.23):
from numba import jit
#jit(nopython=True)
def xor(X,Y):
result=np.zeros((4,4),np.uint64)
for i in range(len(X)):
# iterate through columns of Y
for j in range(Y.shape[1]):
# iterate through rows of Y
for k in range(len(Y)):
result[i,j] = result[i,j] ^ (X[i,k] * Y[k,j])
return result
for a impressive efficiency gain,due to optimal memory usage.
But you are limited to 32 bits for a and b:
In [790]: %timeit xor(a,b)
1000000 loops, best of 3: 580 ns per loop
In [791]: %timeit xor_ab=np.bitwise_xor.reduce(a[...,None]*b,axis=1)
100000 loops, best of 3: 13.2 µs per loop
In [792] (xor(a,b)==np.bitwise_xor.reduce(a[...,None]*b,axis=1)).all()
Out[792]: True
I need to generate 1D array where repeated sequences of integers are separated by a random number of zeros.
So far I am using next code for this:
from random import normalvariate
regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
n_iter = 10
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
# Generate list of concatenated zeros and regular sequences
seq = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
seq = np.concatenate(seq)
It works but looks very slow when I need a lot of long sequences. So, how can I optimize it?
You can pre-compute indices where repeated regular_sequence elements are to be put and then set those with regular_sequence in a vectorized manner. For pre-computing those indices, one can use np.cumsum to get the start of each such chunk of regular_sequence and then add a continuous set of integers extending to the size of regular_sequence to get all indices that are to be updated. Thus, the implementation would look something like this -
# Size of regular_sequence
N = regular_sequence.size
# Use cumsum to pre-compute start of every occurance of regular_sequence
offset_arr = np.cumsum(lag_seq)
idx = np.arange(offset_arr.size)*N + offset_arr
# Setup output array
out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)
# Broadcast the start indices to include entire length of regular_sequence
# to get all positions where regular_sequence elements are to be set
np.put(out,idx[:,None] + np.arange(N),regular_sequence)
Runtime tests -
def original_app(lag_seq, regular_sequence):
seq = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
return np.concatenate(seq)
def vectorized_app(lag_seq, regular_sequence):
N = regular_sequence.size
offset_arr = np.cumsum(lag_seq)
idx = np.arange(offset_arr.size)*N + offset_arr
out = np.zeros(idx.max() + N,dtype=regular_sequence.dtype)
np.put(out,idx[:,None] + np.arange(N),regular_sequence)
return out
In [64]: # Setup inputs
...: regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
...: n_iter = 1000
...: lag_mean = 10 # mean length of zeros sequence
...: lag_sd = 1 # standard deviation of zeros sequence length
...:
...: # Sequence of lags lengths
...: lag_seq = [int(round(normalvariate(lag_mean, lag_sd))) for x in range(n_iter)]
...:
In [65]: out1 = original_app(lag_seq, regular_sequence)
In [66]: out2 = vectorized_app(lag_seq, regular_sequence)
In [67]: %timeit original_app(lag_seq, regular_sequence)
100 loops, best of 3: 4.28 ms per loop
In [68]: %timeit vectorized_app(lag_seq, regular_sequence)
1000 loops, best of 3: 294 µs per loop
The best approach, I think, would be to use convolution. You can figure out the lag lengths, combine that with the length of the sequence, and use that to figure out the starting point of each regular sequence. Set those starting points to zero, then convolve with your regular sequence to fill in the values.
import numpy as np
regular_sequence = np.array([1,2,3,4,5], dtype=np.int)
n_iter = 10000000
lag_mean = 10 # mean length of zeros sequence
lag_sd = 1 # standard deviation of zeros sequence length
# Sequence of lags lengths
lag_lens = np.round(np.random.normal(lag_mean, lag_sd, n_iter)).astype(np.int)
lag_lens[1:] += len(regular_sequence)
starts_inds = lag_lens.cumsum()-1
# Generate list of convolved ones and regular sequences
seq = np.zeros(lag_lens.sum(), dtype=np.int)
seq[starts_inds] = 1
seq = np.convolve(seq, regular_sequence)
This approach takes something like 1/20th the time on large sequences, even after changing your version to use the numpy random number generator.
Not a trivial problem because data is misaligned. Performance depends on what is a long sequence. Take the example of a square problem : a lot of, long, regular and zeros sequences (n_iter==n_reg==lag_mean):
import numpy as np
n_iter = 1000
n_reg = 1000
regular_sequence = np.arange(n_reg, dtype=np.int)
lag_mean = n_reg # mean length of zeros sequence
lag_sd = lag_mean/10 # standard deviation of zeros sequence length
lag_seq=np.int64(np.random.normal(lag_mean,lag_sd,n_iter)) # Sequence of lags lengths
First your solution :
def seq_hybrid():
seqs = [np.concatenate((np.zeros(x, dtype=np.int), regular_sequence)) for x in lag_seq]
seq = np.concatenate(seqs)
return seq
Then a pure numpy one :
def seq_numpy():
seq=np.zeros(lag_seq.sum()+n_iter*n_reg,dtype=int)
cs=np.cumsum(lag_seq+n_reg)-n_reg
indexes=np.add.outer(cs,np.arange(n_reg))
seq[indexes]=regular_sequence
return seq
A for loop solution :
def seq_python():
seq=np.empty(lag_seq.sum()+n_iter*n_reg,dtype=int)
i=0
for lag in lag_seq:
for k in range(lag):
seq[i]=0
i+=1
for k in range(n_reg):
seq[i]=regular_sequence[k]
i+=1
return seq
And a just in time compilation with numba :
from numba import jit
seq_numba=jit(seq_python)
Tests now :
In [96]: %timeit seq_hybrid()
10 loops, best of 3: 38.5 ms per loop
In [97]: %timeit seq_numpy()
10 loops, best of 3: 34.4 ms per loop
In [98]: %timeit seq_python()
1 loops, best of 3: 1.56 s per loop
In [99]: %timeit seq_numba()
100 loops, best of 3: 12.9 ms per loop
Your hybrid solution is quite as speed as a pure numpy one in this case because
the performance depend essentially of the inner loop. And yours (zeros and concatenate) is a numpy one. Predictably , python solution is slower with a traditional about 40x factor. But numpy is not optimal here, because it uses fancy indexing, necessary with misaligned data . In this case numba can help you : minimal operations are done at C level, for a 120x factor gain this time compared to the python solution.
For other values of n_iter,n_reg the factor gains compared to the python solution are:
n_iter= 1000, n_reg= 1000 : seq_numba 124, seq_hybrid 49, seq_numpy 44.
n_iter= 10, n_reg= 100000 : seq_numba 123, seq_hybrid 104, seq_numpy 49.
n_iter= 100000, n_reg= 10 : seq_numba 127, seq_hybrid 1, seq_numpy 42.
I thought an answer posted on this question had a good approach using a binary mask and np.convolve but the answer got deleted and I don't know why. Here it is with 2 concerns addressed.
def insert_sequence(lag_seq, regular_sequence):
offsets = np.cumsum(lag_seq)
start_locs = np.zeros(offsets[-1] + 1, dtype=regular_sequence.dtype)
start_locs[offsets] = 1
return np.convolve(start_locs, regular_sequence)
lag_seq = np.random.normal(15,1,10)
lag_seq = lag_seq.astype(np.uint8)
regular_sequence = np.arange(1, 6)
seq = insert_sequence(lag_seq, regular_sequence)
print(repr(seq))
I need to extend this question, which sums values of an array based on indices from a second array. Let A be the result array, B be the index array, and C the array to be summed over. Then A[i] = sum over C such that index(B) == i.
Instead, my setup is
N = 5
M = 2
A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)
I need A[i,j] = sum_{k in 0...N} C[j,k] such that C[k] == i , i.e. a rowsum conditional on the indices of B matching i. Is there an efficient way to do this? For my application N is around 10,000 and M is around 20. This operation is called for every iteration in a minimization problem... my current looping method is terribly slow.
Thanks!
Following #DSM's comment, I'm assuming your C[k] == i supposed to be B[k] == i. If that's the case, does your loop version look something like this?
Nested Loop Version
import numpy as np
N = 5
M = 2
A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)
for i in range(M):
for j in range(N):
for k in range(N):
if B[k] == i:
A[i,j] += C[j,k]
There's more than one way to vectorize this problem. I'm going to show my thought process below, but there are more efficient ways to do it (e.g. #DSM's version that recognizes the matrix multiplication inherent in the problem).
For the sake of explanation, here's a walk-through of one approach.
Vectorizing the Inner Loop
Let's start by re-writing the inner k loop:
for i in range(M):
for j in range(N):
A[i,j] = C[j, B == i].sum()
It might be easier to think of this as C[j][B == i].sum(). We're just selecting jth row of C, selecting only the elements in that row where B is equal to i, and summing them.
Vectorizing the Outer-most Loop
Next let's break down the outer i loop. Now we're going to get to the point where readability will start to suffer, unfortunately...
i = np.arange(M)[:,np.newaxis]
mask = (B == i).astype(int)
for j in range(N):
A[:,j] = (C[j] * mask).sum(axis=-1)
There are a couple different tricks here. In this case, we're iterating over the columns of A. Each column of A is the sum of a subset of the corresponding row of C. The subset of the row of C is determined by where B is equal to the row index i.
To get around iterating through i, we're making a 2D array where B == i by adding a new axis to i. (Have a look at the documentation for numpy broadcasting if you're confused by this.) In other words:
B:
array([1, 1, 1, 1, 0])
i:
array([[0],
[1]])
B == i:
array([[False, False, False, False, True],
[ True, True, True, True, False]], dtype=bool)
What we want is to take two (M) filtered sums of C[j], one for each row in B == i. This will give us a two-element vector corresponding to the jth column in A.
We can't do this by indexing C directly because the result won't maintain it's shape, as each row may have a different number of elements. We'll get around this by multiplying the B == i mask by the current row of C, resulting in zeros where B == i is False, and the value in the current row of C where it's true.
To do this, we need to turn the boolean array B == i into integers:
mask = (B == i).astype(int):
array([[0, 0, 0, 0, 1],
[1, 1, 1, 1, 0]])
So when we multiply it by the current row of C:
C[j]:
array([ 0.19844887, 0.44858679, 0.35370919, 0.84074259, 0.74513377])
C[j] * mask:
array([[ 0. , 0. , 0. , 0. , 0.74513377],
[ 0.19844887, 0.44858679, 0.35370919, 0.84074259, 0. ]])
Then we can sum over each row to get the current column of A (This will be broadcast to a column when it's assigned to A[:,j]):
(C[j] * mask).sum(axis=-1):
array([ 0.74513377, 1.84148744])
Fully Vectorized Version
Finally, breaking down the last loop, we can apply the exact same principle to add a third dimension for the loop over j:
i = np.arange(M)[:,np.newaxis,np.newaxis]
mask = (B == i).astype(int)
A = (C * mask).sum(axis=-1)
#DSM's vectorized version
As #DSM suggested, you could also do:
A = (B == np.arange(M)[:,np.newaxis]).dot(C.T)
This is by far the fastest solution for most sizes of M and N, and arguably the most elegant (much more elegant than my solutions, anyway).
Let's break it down a bit.
The B == np.arange(M)[:,np.newaxis] is exactly equivalent to B == i in the "Vectorizing the Outer-most Loop" section above.
The key is in recognizing that all of the j and k loops are equivalent to matrix multiplication. dot will cast the boolean B == i array to the same dtype as C behind-the-scenes, so we don't need to worry about explicitly casting it to a different type.
After that, we're just performing matrix multiplication on the transpose of C (a 5x5 array) and the "mask" 0 and 1 array above, yielding a 2x5 array.
dot will take advantage of any optimized BLAS libraries you have installed (e.g. ATLAS, MKL), so it's very fast.
Timings
For small M's and N's, the differences are less apparent (~6x between looping and DSM's version):
M, N = 2, 5
%timeit loops(B,C,M)
10000 loops, best of 3: 83 us per loop
%timeit k_vectorized(B,C,M)
10000 loops, best of 3: 106 us per loop
%timeit vectorized(B,C,M)
10000 loops, best of 3: 23.7 us per loop
%timeit askewchan(B,C,M)
10000 loops, best of 3: 42.7 us per loop
%timeit einsum(B,C,M)
100000 loops, best of 3: 15.2 us per loop
%timeit dsm(B,C,M)
100000 loops, best of 3: 13.9 us per loop
However, once M and N start to grow, the difference becomes very significant (~600x) (note the units!):
M, N = 50, 20
%timeit loops(B,C,M)
10 loops, best of 3: 50.3 ms per loop
%timeit k_vectorized(B,C,M)
100 loops, best of 3: 10.5 ms per loop
%timeit ik_vectorized(B,C,M)
1000 loops, best of 3: 963 us per loop
%timeit vectorized(B,C,M)
1000 loops, best of 3: 247 us per loop
%timeit askewchan(B,C,M)
1000 loops, best of 3: 493 us per loop
%timeit einsum(B,C,M)
10000 loops, best of 3: 134 us per loop
%timeit dsm(B,C,M)
10000 loops, best of 3: 80.2 us per loop
I am assuming #DSM found your typo, and you want:
A[i,j] = sum_{k in 0...N} C[j,k] where B[k] == i
Then you can loop over i in range(M) since M is relatively small.
A = np.array([C[:,B == i].sum(axis=1) for i in range(M)])