x has shape [batch_size, n_time] where the batches are independent
If k=3, d=discount_rate. Pseudocode:
x[:,i] = x[:,i] + x[:,i+1]*(d**1) + x[:,i+2]*(d**2) + x[:,i+3]*(d**3)
Here's working code, but it's very slow. I'll be executing this function millions of times, so I'm hoping for a faster implementation
import numpy as np
def k_step_discount(x, k, discount_rate):
    n_time = x.shape[1]
    k_include_cur = k + 1  # k excludes current timestep
    for i in range(n_time):
        k_cur = min(n_time - i, k_include_cur)  # prevent out of bounds
        for j in range(1, k_cur):
            x[:, i] += x[:, i+j] * (discount_rate ** j)
    return x
x = np.array([
[0,0,0,1,0,0],
[0,1,2,3,4,5.]
])
y = k_step_discount(x+0, k=2, discount_rate=.9)
print('x\n{}\ny\n{}'.format(x, y))
>> x
[[ 0. 0. 0. 1. 0. 0.]
[ 0. 1. 2. 3. 4. 5.]]
>> y
[[ 0. 0.81 0.9 1. 0. 0. ]
[ 2.52 5.23 7.94 10.65 8.5 5. ]]
A scipy function that's similar is:
import scipy.signal
import numpy as np
x = np.array([[0,0,0,1,0,0.]])
discount_rate = .9
y = np.flip(scipy.signal.lfilter([1], [1, -discount_rate], np.flip(x+0, 1), axis=1), 1)
print('x\n{}\ny\n{}'.format(x, y))
>> x
[[ 0. 0. 0. 1. 0. 0.]]
>> y
[[ 0.729 0.81 0.9 1. 0. 0. ]]
However, it discounts until the end of n_time rather than only for k steps
I'm also interested in K-step discounting without batches, if that'd be easier/faster
import numpy as np
def k_step_discount_no_batch(x, k, discount_rate):
    n_time = x.shape[0]
    k_include_cur = k + 1  # k excludes current timestep
    for i in range(n_time):
        k_cur = min(n_time - i, k_include_cur)  # prevent out of bounds
        for j in range(1, k_cur):
            x[i] += x[i+j] * (discount_rate ** j)
    return x
x = np.array([8,0,0,0,1,2.])
y = k_step_discount_no_batch(x+0, k=2, discount_rate=.9)
print('x\n{}\ny\n{}'.format(x, y))
>> x
[ 8. 0. 0. 0. 1. 2.]
>> y
[ 8. 0. 0.81 2.52 2.8 2. ]
Similar no_batch scipy function
import scipy.signal
import numpy as np
x = np.array([8,0,0,0,1,2.])
discount_rate = .9
y = scipy.signal.lfilter([1], [1, -discount_rate], x[::-1], axis=0)[::-1]
print('x\n{}\ny\n{}'.format(x, y))
>> x
[ 8. 0. 0. 0. 1. 2.]
>> y
[ 9.83708 2.0412 2.268 2.52 2.8 2. ]
You could use 2D convolution here. To get the scaling right, we need the proper 2D kernel: the powers of discount_rate in flipped order. This follows from the definition of convolution, in which the kernel is slid across the input in flipped order, the input elements are multiplied by the corresponding kernel elements, and the products are summed up, which is precisely what is needed in this case.
Thus, the implementation would be simply -
from scipy.signal import convolve2d as conv2d
import numpy as np
def k_step_discount_conv2d(x, k, discount_rate, is_batch=True):
    if is_batch:
        # 2D kernel with a single row: [d**k, ..., d**1, d**0]
        kernel = discount_rate**np.arange(k+1)[::-1][None]
        return conv2d(x,kernel)[:,k:]
    else:
        kernel = discount_rate**np.arange(k+1)[::-1]
        return np.convolve(x, kernel)[k:]
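As a sanity check on the flipped kernel, here is a small sketch for the no-batch case that compares the convolution against the sum written out directly. The helper name k_step_discount_direct is mine and only serves as a reference implementation; the input is the no-batch example from the question.

```python
import numpy as np

def k_step_discount_direct(x, k, discount_rate):
    # Reference: out[i] = sum over j = 0..k of discount_rate**j * x[i+j],
    # truncated where i+j would run past the end of x.
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        for j in range(min(k, n - 1 - i) + 1):
            out[i] += discount_rate ** j * x[i + j]
    return out

x = np.array([8, 0, 0, 0, 1, 2.0])
k, d = 2, 0.9

kernel = d ** np.arange(k + 1)[::-1]      # [d**2, d**1, d**0]
via_conv = np.convolve(x, kernel)[k:]     # drop the first k "warm-up" outputs
via_loop = k_step_discount_direct(x, k, d)
print(via_conv)
print(via_loop)
```

Both lines should print the same values as the y shown in the question for this input.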
Sample run -
In [190]: x
Out[190]:
array([[ 0., 0., 0., 1., 0., 0.],
[ 0., 1., 2., 3., 4., 5.]])
# Proposed method
In [191]: k_step_discount_conv2d(x, k=2, discount_rate=0.9)
Out[191]:
array([[ 0. , 0.81, 0.9 , 1. , 0. , 0. ],
[ 2.52, 5.23, 7.94, 10.65, 8.5 , 5. ]])
# Original loopy method
In [192]: k_step_discount(x, k=2, discount_rate=.9)
Out[192]:
array([[ 0. , 0.81, 0.9 , 1. , 0. , 0. ],
[ 2.52, 5.23, 7.94, 10.65, 8.5 , 5. ]])
Runtime test
In [206]: x = np.random.randint(0,9,(100,1000)).astype(float)
In [207]: %timeit k_step_discount_conv2d(x, k=2, discount_rate=0.9)
1000 loops, best of 3: 1.27 ms per loop
In [208]: %timeit k_step_discount(x, k=2, discount_rate=.9)
100 loops, best of 3: 4.83 ms per loop
With bigger k's :
In [215]: x = np.random.randint(0,9,(100,1000)).astype(float)
In [216]: %timeit k_step_discount_conv2d(x, k=20, discount_rate=0.9)
100 loops, best of 3: 5.44 ms per loop
In [217]: %timeit k_step_discount(x, k=20, discount_rate=.9)
10 loops, best of 3: 44.8 ms per loop
Thus, the speedup over the loopy method grows with k: roughly 4x at k=2 and roughly 8x at k=20 in these runs.
Further boost
As suggested by @Eric, we could also leverage scipy.ndimage's 1D convolution here.
For a proper comparison, here are both the SciPy 2D and 1D convolution methods -
from scipy.ndimage import convolve1d as conv1d
def using_conv2d(x, k, discount_rate):
    kernel = discount_rate**np.arange(k+1)[::-1][None]
    return conv2d(x,kernel)[:,k:]
def using_conv1d(x, k, discount_rate):
    kernel = discount_rate**np.arange(k+1)[::-1]
    return conv1d(x,kernel, mode='constant', origin=k//2)
Timings -
In [100]: x = np.random.randint(0,9,(100,1000)).astype(float)
In [101]: out1 = using_conv2d(x, k=20, discount_rate=0.9)
...: out2 = using_conv1d(x, k=20, discount_rate=0.9)
...:
In [102]: np.allclose(out1, out2)
Out[102]: True
In [103]: %timeit using_conv2d(x, k=20, discount_rate=0.9)
100 loops, best of 3: 5.27 ms per loop
In [104]: %timeit using_conv1d(x, k=20, discount_rate=0.9)
1000 loops, best of 3: 1.43 ms per loop
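The scipy.signal.lfilter attempt from the question can also be turned into a k-step version. With a = [1, -discount_rate] the filter is recursive, so it keeps discounting to the end of n_time; putting the k+1 powers of discount_rate into b and setting a = [1] should give a finite window instead. A minimal sketch, where the function name k_step_discount_fir is my own and the flip trick is taken from the question:

```python
import numpy as np
from scipy.signal import lfilter

def k_step_discount_fir(x, k, discount_rate):
    # Non-recursive filter: y[n] = sum over j = 0..k of b[j] * x[n-j].
    b = discount_rate ** np.arange(k + 1)       # [d**0, d**1, ..., d**k]
    flipped = np.flip(x, axis=-1)               # look "forward" by filtering backwards
    return np.flip(lfilter(b, [1.0], flipped, axis=-1), axis=-1)

x = np.array([
    [0, 0, 0, 1, 0, 0],
    [0, 1, 2, 3, 4, 5.],
])
y = k_step_discount_fir(x, k=2, discount_rate=0.9)
print(y)
```

Because the last axis is used, the same function handles both the batched and the no-batch input. I have not timed it against the convolution versions above.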
I'm trying to create a certain style of band(ed) matrix (see Wikipedia). The following code works, but for large M (~300 or so) it becomes quite slow because of the for loop. Is there a way to vectorize it/make better use of NumPy and/or SciPy? I am having trouble figuring out the mathematical operation that this corresponds to, and hence I have not succeeded thus far.
The code I have is as follows
def banded_matrix(M):
    phis = np.linspace(0, 2*np.pi, M)
    i = 0
    ham = np.zeros((int(2*M), int(2*M)))
    for phi in phis:
        ham_phi = np.array([[1, 1],
                            [1, -1]])*(1+np.cos(phi))
        array_phi = np.zeros(M)
        array_phi[i] = 1
        mat_phi = np.diag(array_phi)
        ham += np.kron(mat_phi, ham_phi)
        i += 1
    return ham
With %timeit banded_matrix(M=300) it takes about 4 seconds on my computer.
Since the code is a bit opaque: what I want to do is construct a large 2M by 2M matrix. In a sense it has M entries on its 'width 2' diagonal, where the entries are 2x2 matrices ham_phi that depend on phi. The matrix will afterwards be diagonalized, so perhaps one could even make use of its structure, or the fact that it is rather sparse, to speed that up, but of that I am not sure.
If anyone has an idea where to go with this, I'd be happy to follow up on that!
Your matrix is diagonal by blocks, so you can use scipy.linalg.block_diag:
import numpy as np
from scipy.linalg import block_diag
def banded_matrix_scipy(M):
    ham = np.array([[1, 1], [1, -1]])
    phis = np.linspace(0, 2 * np.pi, M)
    ham_phis = ham * (1 + np.cos(phis))[:, None, None]
    return block_diag(*ham_phis)
Let's check that it works and is faster:
b1 = banded_matrix(300)
b2 = banded_matrix_scipy(300)
np.all(b1 == b2) # True
>>> %timeit banded_matrix(300)
1.51 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit banded_matrix_scipy(300)
1.24 ms ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
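On the diagonalization mentioned in the question: the eigenvalues of a block-diagonal matrix are just the eigenvalues of its blocks, so if only the eigenvalues are needed, the large matrix may not have to be built at all. A minimal sketch under that assumption (variable names are mine, and it relies on every 2x2 block being symmetric so that np.linalg.eigvalsh applies):

```python
import numpy as np
from scipy.linalg import block_diag

M = 50
phis = np.linspace(0, 2 * np.pi, M)
ham = np.array([[1.0, 1.0], [1.0, -1.0]])
blocks = ham * (1 + np.cos(phis))[:, None, None]    # shape (M, 2, 2)

# eigvalsh accepts a stack of matrices and diagonalizes each 2x2 block.
eig_blocks = np.sort(np.linalg.eigvalsh(blocks).ravel())

# Full 2M x 2M diagonalization, for comparison only.
eig_full = np.linalg.eigvalsh(block_diag(*blocks))

print(np.allclose(eig_blocks, eig_full))
```

The per-block route does M tiny eigenproblems instead of one of size 2M, which should scale much better for large M.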
The obligatory np.einsum benchmark
def banded_matrix_einsum(M):
    return np.einsum('ij, kl-> ikjl',
                     np.eye(M)*(1 + np.cos(np.linspace(0, 2 * np.pi, M))),
                     np.array([[1, 1], [1, -1]])).reshape(2*M, 2*M)
banded_matrix_einsum(4)
Output
array([[ 2. , 2. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 2. , -2. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.5, 0.5, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.5, -0.5, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.5, 0.5, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.5, -0.5, 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 2. , 2. ],
[ 0. , 0. , 0. , 0. , 0. , 0. , 2. , -2. ]])
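For readers wondering what the subscripts 'ij, kl-> ikjl' do: after the reshape, entry (2*i + k, 2*j + l) holds A[i, j] * B[k, l], which is the Kronecker product used in the question's loop. A short sketch checking this against np.kron (variable names are mine):

```python
import numpy as np

M = 4
scale = 1 + np.cos(np.linspace(0, 2 * np.pi, M))
ham = np.array([[1, 1], [1, -1]])

# Outer product of the two matrices with axes ordered (i, k, j, l),
# then merged pairwise into rows (i, k) and columns (j, l).
via_einsum = np.einsum('ij,kl->ikjl', np.diag(scale), ham).reshape(2 * M, 2 * M)

# The same matrix as a single Kronecker product.
via_kron = np.kron(np.diag(scale), ham)

print(np.allclose(via_einsum, via_kron))
```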
Benchmark results
import perfplot
perfplot.show(
    setup = lambda M: M,
    kernels = [banded_matrix_einsum, banded_matrix_scipy, banded_matrix],
    n_range = [50, 100, 150, 200, 250, 300],
    logx = False
)
scipy.linalg.block_diag vs np.einsum details
perfplot.show(
    setup = lambda M: M,
    kernels = [banded_matrix_einsum, banded_matrix_scipy],
    n_range = [50, 100, 150, 200, 250, 300, 350, 400],
    logx = False
)
As another option, you can use the numba accelerator to speed this up with JIT compilation. I propose a numba equivalent of scipy.linalg.block_diag, based on paime's answer:
import numba as nb
@nb.njit
def block_diag_numba(result, ham_phis):
    for i in range(ham_phis.shape[0]):
        for j in range(ham_phis.shape[1]):
            result[i * 2, i * 2:i * 2 + 2] = ham_phis[i, 0]
            result[i * 2 + 1, i * 2:i * 2 + 2] = ham_phis[i, 1]
    return result
def numba_(M):
    ham = np.array([[1, 1], [1, -1]])
    phis = np.linspace(0, 2 * np.pi, M)
    ham_phis = ham * (1 + np.cos(phis))[:, None, None]
    return block_diag_numba(np.zeros((M * ham.shape[1], M * ham.shape[1])), ham_phis)
For M up to 400, this method is at least 4-5 times faster than the previous ones (runtimes are on the microsecond scale). It can be adapted to other block shapes and improved further, for example by moving all of the code into the numba function or by parallelizing. I didn't go further because paime's answer seemed to satisfy the OP, who accepted it; the point is just to show that numba can be used to write a much faster equivalent of scipy.linalg.block_diag.
I have different sized vectors and want to do element-wise manipulations. How can I optimize the following for-loop in Python? (For instance with np.vectorize())
import numpy as np
n = 1000000
vec1 = np.random.rand(n)
vec2 = np.random.rand(3*n)
vec3 = np.random.rand(3*n)
for i in range(len(vec1)):
    if vec1[i] < 0.5:
        vec2[3*i : 3*(i+1)] = vec1[i]*vec3[3*i : 3*(i+1)]
    else:
        vec2[3*i : 3*(i+1)] = [0,0,0]
Thanks a lot for your help.
We could leverage broadcasting -
v = vec3.reshape(-1,3)*vec1[:,None]
m = vec1<0.5
vec2_out = (v*m[:,None]).ravel()
Another way to express that would be -
mask = vec1<0.5
vec2_out = (vec3.reshape(-1,3)*(vec1*mask)[:,None]).ravel()
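To check that the broadcasting version really reproduces the loop from the question, here is a small sketch on a tiny input. The size n = 5 and the use of np.random.default_rng are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
vec1 = rng.random(n)
vec3 = rng.random(3 * n)

# Reference result from the original loop.
expected = np.empty(3 * n)
for i in range(n):
    if vec1[i] < 0.5:
        expected[3*i : 3*(i+1)] = vec1[i] * vec3[3*i : 3*(i+1)]
    else:
        expected[3*i : 3*(i+1)] = 0

# Broadcasting: one row of vec3 per element of vec1, scaled by vec1 where
# vec1 < 0.5 and by 0 elsewhere.
mask = vec1 < 0.5
vec2_out = (vec3.reshape(-1, 3) * (vec1 * mask)[:, None]).ravel()

print(np.allclose(vec2_out, expected))
```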
And use multi-cores with numexpr module -
import numexpr as ne
d = {'V3r':vec3.reshape(-1,3),'vec12D':vec1[:,None]}
out = ne.evaluate('V3r*vec12D*(vec12D<0.5)',d).ravel()
Timings -
In [84]: n = 1000000
...: np.random.seed(0)
...: vec1 = np.random.rand(n)
...: vec2 = np.random.rand(3*n)
...: vec3 = np.random.rand(3*n)
In [86]: %%timeit
...: v = vec3.reshape(-1,3)*vec1[:,None]
...: m = vec1<0.5
...: vec2_out = (v*m[:,None]).ravel()
10 loops, best of 3: 23.2 ms per loop
In [87]: %%timeit
...: mask = vec1<0.5
...: vec2_out = (vec3.reshape(-1,3)*(vec1*mask)[:,None]).ravel()
100 loops, best of 3: 13.1 ms per loop
In [88]: %%timeit
...: d = {'V3r':vec3.reshape(-1,3),'vec12D':vec1[:,None]}
...: out = ne.evaluate('V3r*vec12D*(vec12D<0.5)',d).ravel()
100 loops, best of 3: 4.11 ms per loop
For a generic case, where the else-part could be something other than zeros, it would be -
mask = vec1<0.5
IF_vals = vec3.reshape(-1,3)*vec1[:,None]
ELSE_vals = np.array([1,1,1])
out = np.where(mask[:,None],IF_vals,ELSE_vals).ravel()
numpy.vectorize, as mentioned in the comments, is for convenience, not performance, per the docs:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
One solution to actually vectorize this would be:
vec2[:] = vec1.repeat(3) * vec3 # Bulk compute all results
vec2[(vec1 >= 0.5).repeat(3)] = 0 # Zero the results you meant to exclude (the else branch)
Another approach (that minimizes temporaries) would be to filter and reshape vec1 so it can be assigned to vec2, then multiply vec2 by vec3 in place to avoid a temporary (beyond the two n length arrays from the first step), e.g.:
vec2.reshape(-1, 3)[:] = (vec1 * (vec1 < 0.5)).reshape(-1, 1)
vec2 *= vec3
An additional temporary could be shaved if vec1 can be modified, simplifying to:
vec1 *= vec1 < 0.5
vec2.reshape(-1, 3)[:] = vec1.reshape(-1, 1)
vec2 *= vec3
The reshape/broadcasting that @Divakar demonstrates is equivalent to rewriting your iteration as:
In [5]: n = 10
...: vec1 = np.random.rand(n)
...: vec2 = np.zeros((n,3))
...: vec3 = np.random.rand(n,3)
...:
...: for i in range(len(vec1)):
...:     if vec1[i] < 0.5:
...:         vec2[i,:] = vec1[i]*vec3[i,:]
...:     else:
...:         vec2[i,:] = 0
...:
In [6]: vec2
Out[6]:
array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0.119655 , 0.05079028, 0.00392748],
[0.04529872, 0.04630456, 0.01565116],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0.08361475, 0.21825921, 0.1273483 ]])
In [7]: vec1
Out[7]:
array([0.934649 , 0.85309325, 0.50775071, 0.91246865, 0.12970539,
0.13075136, 0.89861756, 0.68921343, 0.80572879, 0.25996369])
By defining vec2 as a (n,3) array, we replace this indexing vec2[3*i : 3*(i+1)] with vec2[i,:] or vec2[i].
Using a mask to set values to 0 is a good basic numpy idea. But ufuncs also provide a where parameter, which can be used as:
In [11]: vec2 = np.zeros((n,3))
In [12]: np.multiply(vec1[:,None],vec3, out=vec2, where=vec1[:,None]<0.5);
In [13]: vec2
Out[13]:
array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0.119655 , 0.05079028, 0.00392748],
[0.04529872, 0.04630456, 0.01565116],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0.08361475, 0.21825921, 0.1273483 ]])
This where needs to be used in conjunction with an out parameter, since it only does the multiply for the True instances; the remaining entries of out are left as they were (zeros here).
I'm not sure how much of a time saver it is.
I have a NumPy array of coordinates. For example purposes, I will use this
In [1]: np.random.seed(123)
In [2]: coor = np.random.randint(10, size=12).reshape(-1,3)
In [3]: coor
Out[3]: array([[2, 2, 6],
[1, 3, 9],
[6, 1, 0],
[1, 9, 0]])
I want the triangular matrix of distances between all coordinates. A simple approach would be to code a double loop over all coordinates
In [4]: n_coor = len(coor)
In [5]: dist = np.zeros((n_coor, n_coor))
In [6]: for j in range(n_coor):
   ...:     for k in range(j+1, n_coor):
   ...:         dist[j, k] = np.sqrt(np.sum((coor[j] - coor[k]) ** 2))
with the result being an upper triangular matrix of the distances
In [7]: dist
Out[7]: array([[ 0. , 3.31662479, 7.28010989, 9.2736185 ],
[ 0. , 0. , 10.48808848, 10.81665383],
[ 0. , 0. , 0. , 9.43398113],
[ 0. , 0. , 0. , 0. ]])
Leveraging NumPy, I can avoid looping using
In [8]: dist = np.sqrt(((coor[:, None, :] - coor) ** 2).sum(-1))
but the result is the entire matrix
In [9]: dist
Out[9]: array([[ 0. , 3.31662479, 7.28010989, 9.2736185 ],
[ 3.31662479, 0. , 10.48808848, 10.81665383],
[ 7.28010989, 10.48808848, 0. , 9.43398113],
[ 9.2736185 , 10.81665383, 9.43398113, 0. ]])
This one line version takes roughly half the time when I use 2048 coordinates (4 s instead of 10 s) but this is doing twice as many calculations as it needs in order to get the symmetric matrix. Is there a way to adjust the one line command to only get the triangular matrix (and the additional 2x speedup, i.e. 2 s)?
We can use SciPy's pdist method to get those distances. So, we just need to initialize the output array and then set the upper triangular values with those distances
from scipy.spatial.distance import pdist
n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
row,col = np.triu_indices(n_coor,1)
dist[row,col] = pdist(coor)
Alternatively, we can use boolean-indexing to assign values, replacing the last two lines
dist[np.arange(n_coor)[:,None] < np.arange(n_coor)] = pdist(coor)
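Both assignments work because pdist returns the distances in "condensed" form, ordered row by row over the upper triangle, which is the same order in which np.triu_indices and the boolean mask visit the entries. A small sketch on the 4-point example from the question, cross-checked against squareform (the cross-check is my addition):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

coor = np.array([[2, 2, 6],
                 [1, 3, 9],
                 [6, 1, 0],
                 [1, 9, 0]])
n_coor = len(coor)
condensed = pdist(coor)                 # n*(n-1)/2 distances, pairs with i < j

# Integer indexing of the strict upper triangle.
dist = np.zeros((n_coor, n_coor))
rows, cols = np.triu_indices(n_coor, 1)
dist[rows, cols] = condensed

# Boolean mask of the same entries, filled in the same row-major order.
dist_bool = np.zeros((n_coor, n_coor))
r = np.arange(n_coor)
dist_bool[r[:, None] < r] = condensed

print(np.allclose(dist, np.triu(squareform(condensed))))
print(np.allclose(dist, dist_bool))
```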
Runtime test
Functions:
def subscripted_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    row,col = np.triu_indices(n_coor,1)
    dist[row,col] = pdist(coor)
    return dist
def boolean_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    r = np.arange(n_coor)
    dist[r[:,None] < r] = pdist(coor)
    return dist
Timings:
In [110]: # Setup input array
...: coor = np.random.randint(0,10, (2048,3))
In [111]: %timeit subscripted_indexing(coor)
10 loops, best of 3: 91.4 ms per loop
In [112]: %timeit boolean_indexing(coor)
10 loops, best of 3: 47.8 ms per loop
I have a large real 1-d data set called r. I would like to plot:
mean(log(1+a*r)) vs a, with a > -1 .
This is my code:
rr=pd.read_csv('goog.csv')
dd=rr['Close']
series=pd.Series(dd)
seriespct=series.pct_change()
seriespct[0]=seriespct.mean()
dum1 =[0]*len(dd)
a=1.
a_max = 1.
a_step = 0.01
a = scipy.arange(-3.+a_step, a_max, a_step)
n = len(a)
dum2 =[0]*n
m=len(dd)
for j in range(n):
    for i in range(m):
        dum1[i]=math.log(1+a[j]*seriespct[i])
    dum2[j]=scipy.mean(dum1)
plt.plot(a,dum2)
plt.show()
How can I do this in a more elegant way?
I would recommend this:
plt.plot(a, np.log(1 + r*a[:,None]).mean(1))
This has a big speed advantage because it avoids Python for-loops; the loops run inside numpy instead, which is significantly faster when your dataset is large.
In [49]: a = np.arange(a_step-.3, a_max, a_step)
In [50]: r = np.random.random(100)
In [51]: timeit [scipy.mean(log(1+a[i]*r)) for i in range(len(a))]
100 loops, best of 3: 5.47 ms per loop
In [52]: timeit np.log(1 + r*a[:,None]).mean(1)
1000 loops, best of 3: 384 µs per loop
It works by broadcasting so that a varies along one axis and r along another, then you can take the mean just along the axis that r varies along, so you still have an array that varies with a (and has the same shape as a):
import numpy as np
import matplotlib.pyplot as plt
r = np.random.random(100)
a = 1.
a_max = 1.
a_step = 0.01
a = np.arange(a_step-.3, a_max, a_step)
a.shape
#(129,)
a = a[:,None] #adds a new axis, making this a column vector, same as: a = a.reshape(-1,1)
a.shape
#(129, 1)
(a*r).shape
#(129, 100)
loga = np.log(1 + a*r)
loga.shape
#(129,100)
mloga = loga.mean(axis=1) #take the mean along the 2nd axis, where `r` varies
mloga.shape
#(129,)
plt.plot(a, mloga)
plt.show()
ADDENDUM:
To avoid dependency on broadcasting, you can use np.outer:
plt.plot(a, np.log(1 + np.outer(a,r)).mean(1))
Which has no need for reshaping a (skip the step a = a[:,None])
Here's a simpler example, so you can see what's happening:
r = np.exp(np.arange(1,5))
a = np.arange(5)
In [33]: r
Out[33]: array([ 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
In [34]: a
Out[34]: array([0, 1, 2, 3, 4])
In [39]: r*a[:,None]
Out[39]:
# this is 2.7... 7.3... 20.08... 54.5... # times:
array([[ 0. , 0. , 0. , 0. ], # 0
[ 2.71828183, 7.3890561 , 20.08553692, 54.59815003], # 1
[ 5.43656366, 14.7781122 , 40.17107385, 109.19630007], # 2
[ 8.15484549, 22.1671683 , 60.25661077, 163.7944501 ], # 3
[ 10.87312731, 29.5562244 , 80.34214769, 218.39260013]]) # 4
In [40]: np.outer(a,r)
Out[40]:
array([[ 0. , 0. , 0. , 0. ],
[ 2.71828183, 7.3890561 , 20.08553692, 54.59815003],
[ 5.43656366, 14.7781122 , 40.17107385, 109.19630007],
[ 8.15484549, 22.1671683 , 60.25661077, 163.7944501 ],
[ 10.87312731, 29.5562244 , 80.34214769, 218.39260013]])
# this is the mean of each row:
In [41]: (np.outer(a,r)).mean(1)
Out[41]: array([ 0. , 21.19775622, 42.39551244, 63.59326866, 84.79102488])
# and the log of 1 + the above is:
In [42]: np.log(1+(np.outer(a,r)).mean(1))
Out[42]: array([ 0. , 3.09999121, 3.77035604, 4.16811021, 4.4519144 ])
You can use NumPy to compute the means.
You can use matplotlib to do plotting.
import numpy as np
from matplotlib import pyplot
#convert r from a python list to a 1-D array
r = np.array(r)
#edit these
a_max = 100
a_step = 0.1
a = np.arange(-1+a_step, a_max, a_step)
n = len(a)
pyplot.plot(a, [np.mean(np.log(1+a[i]*r)) for i in range(n)], 'b-')
pyplot.show()
I have in my code the following expression:
a = (b / x[:, np.newaxis]).sum(axis=1)
where b is an ndarray of shape (M, N), and x is an ndarray of shape (M,). Now, b is actually sparse, so for memory efficiency I would like to substitute in a scipy.sparse.csc_matrix or csr_matrix. However, broadcasting in this way is not implemented and raises a NotImplementedError, even though division or multiplication is guaranteed to maintain sparsity (the entries of x are non-zero). Is there a sparse function I'm not aware of that would do what I want? (dot() would sum along the wrong axis.)
If b is in CSC format, then b.data has the non-zero entries of b, and b.indices has the row index of each of the non-zero entries, so you can do your division as:
b.data /= np.take(x, b.indices)
It's hackier than Warren's elegant solution, but it will probably also be faster in most settings:
import numpy as np
import scipy.sparse as sps
b = sps.rand(1000, 1000, density=0.01, format='csc')
x = np.random.rand(1000)
def row_divide_col_reduce(b, x):
    # b.indices holds the row index of every stored value (CSC format)
    data = b.data.copy() / np.take(x, b.indices)
    ret = sps.csc_matrix((data, b.indices.copy(), b.indptr.copy()),
                         shape=b.shape)
    return ret.sum(axis=1)
def row_divide_col_reduce_bis(b, x):
    d = sps.spdiags(1.0/x, 0, len(x), len(x))
    return (d * b).sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
1000 loops, best of 3: 210 us per loop
In [3]: %timeit row_divide_col_reduce_bis(b, x)
1000 loops, best of 3: 697 us per loop
In [4]: np.allclose(row_divide_col_reduce(b, x),
...: row_divide_col_reduce_bis(b, x))
Out[4]: True
You can cut the time almost in half in the above example if you do the division in-place, i.e.:
def row_divide_col_reduce(b, x):
    b.data /= np.take(x, b.indices)  # note: this modifies b itself
    return b.sum(axis=1)
In [2]: %timeit row_divide_col_reduce(b, x)
10000 loops, best of 3: 131 us per loop
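The same trick should carry over to CSR with one change: in CSR, b.indices holds column indices, so the row index of each stored value has to be rebuilt from b.indptr. A sketch, assuming b is a csr_matrix and x has no zeros (the variable names and the small test matrix are mine):

```python
import numpy as np
import scipy.sparse as sps

b = sps.csr_matrix(np.array([[1.0, 0.0, 2.0],
                             [0.0, 3.0, 0.0],
                             [4.0, 0.0, 5.0]]))
x = np.array([2.0, 3.0, 4.0])

# indptr[i+1] - indptr[i] is the number of stored values in row i,
# so repeating each row number that many times gives one row index per value.
rows = np.repeat(np.arange(b.shape[0]), np.diff(b.indptr))

# Same sparsity pattern as b, with every stored value divided by its row's x.
scaled = sps.csr_matrix((b.data / x[rows], b.indices, b.indptr), shape=b.shape)
a = np.asarray(scaled.sum(axis=1)).ravel()
print(a)
```

Unlike the in-place variant above, this leaves b.data untouched, at the cost of one extra array of stored values.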
To implement a = (b / x[:, np.newaxis]).sum(axis=1), you can use a = b.sum(axis=1).A1 / x. The A1 attribute returns the 1D ndarray, so the result is a 1D ndarray, not a matrix. This concise expression works because you are both scaling by x and summing along axis 1. For example:
In [190]: b
Out[190]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [191]: b.A
Out[191]:
array([[ 1., 0., 2.],
[ 0., 3., 0.],
[ 4., 0., 5.]])
In [192]: x
Out[192]: array([ 2., 3., 4.])
In [193]: b.sum(axis=1).A1 / x
Out[193]: array([ 1.5 , 1. , 2.25])
More generally, if you want to scale the rows of a sparse matrix with a vector x, you could multiply b on the left with a sparse matrix containing 1.0/x on the diagonal. The function scipy.sparse.spdiags can be used to create such a matrix. For example:
In [71]: from scipy.sparse import csc_matrix, spdiags
In [72]: b = csc_matrix([[1,0,2],[0,3,0],[4,0,5]], dtype=np.float64)
In [73]: b.A
Out[73]:
array([[ 1., 0., 2.],
[ 0., 3., 0.],
[ 4., 0., 5.]])
In [74]: x = array([2., 3., 4.])
In [75]: d = spdiags(1.0/x, 0, len(x), len(x))
In [76]: d.A
Out[76]:
array([[ 0.5 , 0. , 0. ],
[ 0. , 0.33333333, 0. ],
[ 0. , 0. , 0.25 ]])
In [77]: p = d * b
In [78]: p.A
Out[78]:
array([[ 0.5 , 0. , 1. ],
[ 0. , 1. , 0. ],
[ 1. , 0. , 1.25]])
In [79]: a = p.sum(axis=1)
In [80]: a
Out[80]:
matrix([[ 1.5 ],
[ 1. ],
[ 2.25]])