I'm writing a numerical code that uses scipy.sparse.dia_matrix. My matrices are quite large (up to about 1000000 x 1000000), but very sparse: sometimes tridiagonal, sometimes with a few more diagonals.
For various reasons, it is extremely convenient and clear from a coding point of view to just add together several of these matrices (of the same size, of course). However, I have found that adding these sparse matrices is very slow. The following example illustrates what I mean:
import numpy as np
from scipy.sparse import diags, dia_matrix
N = 100000
M1 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M2 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M3 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
def simple_add():
    M = M1 + M2 + M3

def complicated_add():
    M_ = dia_matrix((N, N))
    for d in [-1, 0, 1]:
        M_.setdiag(M1.diagonal(d) + M2.diagonal(d) + M3.diagonal(d), d)
%timeit simple_add()
%timeit complicated_add()
The output of the timing is:
16.9 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
959 µs ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I don't understand why adding the matrices together with the + operator is 17 times slower than creating an empty diagonal matrix and explicitly setting the diagonals. Is there anything I can do to speed this up? I would much prefer to keep the simpler expression with the + operator, as it's far more readable, but not at the expense of an order of magnitude increase in computational time.
Update:
I proposed a change in Scipy that would make addition of two instances of dia_matrix faster, and after a bit of discussion I submitted a pull request to Scipy, which has now been merged. So in the future, adding two instances of dia_matrix will no longer convert to csr_matrix.
https://github.com/scipy/scipy/pull/14004
diags makes a dia_matrix from the list of inputs:
In [84]: M=sparse.diags([np.arange(1,4),np.arange(1,5),np.arange(1,4)], offsets=[-1,0,1])
In [85]: M
Out[85]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements (3 diagonals) in DIAgonal format>
In [86]: M.offsets
Out[86]: array([-1, 0, 1], dtype=int32)
In [87]: M.data
Out[87]:
array([[1., 2., 3., 0.],
[1., 2., 3., 4.],
[0., 1., 2., 3.]])
The list of diagonals (of different lengths) has been transformed into a 2-D array, with offsets. The dia format is intended primarily as an input format. Most, if not all, math is implemented in the csr format. And even there, matrix multiplication is the relative strong point; element-wise math is distinctly slower than the NumPy array equivalents.
In [89]: Mr=M.tocsr()
In [90]: Mr
Out[90]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [91]: Mr.data
Out[91]: array([1., 1., 1., 2., 2., 2., 3., 3., 3., 4.])
In [92]: Mr.indices
Out[92]: array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3], dtype=int32)
In [93]: Mr.indptr
Out[93]: array([ 0, 2, 5, 8, 10], dtype=int32)
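To make the two layouts concrete, here is a small sketch that rebuilds the dense matrix by hand from the arrays printed above, using only NumPy. The variable names are mine, and the indexing rules in the comments are my reading of the DIA and CSR layouts, so treat it as an illustration rather than how SciPy does it internally.

```python
import numpy as np

# DIA storage copied from the session above.
dia_data = np.array([[1., 2., 3., 0.],
                     [1., 2., 3., 4.],
                     [0., 1., 2., 3.]])
offsets = np.array([-1, 0, 1])

# CSR storage copied from the session above.
csr_data = np.array([1., 1., 1., 2., 2., 2., 3., 3., 3., 4.])
indices = np.array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3])
indptr = np.array([0, 2, 5, 8, 10])

n = 4

# DIA: dia_data[k, j] is the entry in column j of the diagonal offsets[k],
# i.e. dense[j - offsets[k], j]. Entries that fall outside the matrix are padding.
dense_from_dia = np.zeros((n, n))
for k, off in enumerate(offsets):
    for j in range(n):
        i = j - off
        if 0 <= i < n:
            dense_from_dia[i, j] = dia_data[k, j]

# CSR: row i owns the slice indptr[i]:indptr[i + 1] of indices and csr_data.
dense_from_csr = np.zeros((n, n))
for i in range(n):
    start, stop = indptr[i], indptr[i + 1]
    dense_from_csr[i, indices[start:stop]] = csr_data[start:stop]

print(dense_from_dia)
print(np.array_equal(dense_from_dia, dense_from_csr))
```

Both loops should produce the same tridiagonal matrix, which shows why DIA is compact for banded matrices while CSR needs a column index per stored value.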
The dia format suggests a faster addition, if the offsets and shape are all the same.
In [94]: M.data += M.data + M.data
In [95]: M.data
Out[95]:
array([[ 3., 6., 9., 0.],
[ 3., 6., 9., 12.],
[ 0., 3., 6., 9.]])
In [96]: M.A
Out[96]:
array([[ 3., 3., 0., 0.],
[ 3., 6., 6., 0.],
[ 0., 6., 9., 9.],
[ 0., 0., 9., 12.]])
With any of the sparse formats, if the sparsity is the same for all arguments and output, you can often do math directly on the data attribute, leaving the implied 0's unchanged.
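Applied to the original question, that idea might look like the sketch below. The helper name add_same_layout is made up for illustration, and it assumes every input is a dia_matrix with identical shape, offsets and data layout (which is what you get when all of them come from the same diags call pattern). It refuses anything else rather than guessing.

```python
import numpy as np
from scipy.sparse import diags, dia_matrix

def add_same_layout(*mats):
    """Sum DIA matrices that share shape and offsets by adding their data arrays.

    Hypothetical helper: it never converts to CSR, it only adds the stored diagonals.
    """
    first = mats[0]
    for m in mats[1:]:
        same = (m.shape == first.shape
                and m.data.shape == first.data.shape
                and np.array_equal(m.offsets, first.offsets))
        if not same:
            raise ValueError("all matrices must share shape, offsets and data layout")
    total = np.sum([m.data for m in mats], axis=0)
    return dia_matrix((total, first.offsets.copy()), shape=first.shape)

rng = np.random.default_rng(0)
N = 6

def random_tridiagonal():
    return diags([rng.random(N - 1), rng.random(N), rng.random(N - 1)],
                 offsets=[-1, 0, 1])

M1, M2, M3 = (random_tridiagonal() for _ in range(3))
M = add_same_layout(M1, M2, M3)
print(np.allclose(M.toarray(), (M1 + M2 + M3).toarray()))
```

I have not timed this against the versions in the question, so measure it on your own sizes before relying on it.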
For dia_matrix, _add_sparse(self, other) is implemented as return self.tocsr()._add_sparse(other). The extra time goes into converting the matrix to CSR (which has a C extension for addition).
Could you create a sparse matrix that does what you want? Probably.
from scipy.sparse import dia_matrix, isspmatrix_dia

class dia_matrix_adder(dia_matrix):
    def __add__(self, other):
        if not isspmatrix_dia(other):
            return super(dia_matrix_adder, self).__add__(other)
        M_ = dia_matrix((self.shape[0], self.shape[1]))
        # Note: only the three tridiagonal offsets are summed here.
        for d in [-1, 0, 1]:
            M_.setdiag(self.diagonal(d) + other.diagonal(d), d)
        return M_
I would probably not do that, though, and would just write a function instead:
def add_dia_matrix(*mats):
    if len(mats) == 1:
        return mats[0]
    M_ = dia_matrix((mats[0].shape[0], mats[0].shape[1]))
    for d in [-1, 0, 1]:
        M_diag = mats[0].diagonal(d).copy()
        for i in range(1, len(mats)):
            M_diag += mats[i].diagonal(d)
        M_.setdiag(M_diag, d)
    return M_
This should be as readable as a bunch of + without having to deal with a new class.
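The function above hard-codes the offsets -1, 0 and 1. If your matrices sometimes carry more diagonals, a variant that takes the union of the stored offsets could look like this. It is only a sketch: the name add_dia_any_offsets is hypothetical, it assumes all inputs are dia_matrix instances of the same shape, and I have not benchmarked it.

```python
import numpy as np
from scipy.sparse import diags

def add_dia_any_offsets(*mats):
    """Sum same-shaped DIA matrices diagonal by diagonal over the union of their offsets."""
    shape = mats[0].shape
    # Every offset stored in at least one input.
    offsets = sorted(set().union(*(m.offsets.tolist() for m in mats)))
    # diagonal(k) returns zeros for a diagonal that a matrix does not store.
    summed = [sum(m.diagonal(k) for m in mats) for k in offsets]
    return diags(summed, offsets, shape=shape, format="dia")

rng = np.random.default_rng(1)
N = 7
T = diags([rng.random(N - 1), rng.random(N), rng.random(N - 1)], offsets=[-1, 0, 1])
P = diags([rng.random(N - 2), rng.random(N), rng.random(N - 2)], offsets=[-2, 0, 2])

total = add_dia_any_offsets(T, P)
print(np.allclose(total.toarray(), (T + P).toarray()))
```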
%timeit simple_add()
30.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit complicated_add()
1.28 ms ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit add_dia_matrix(M1, M2, M3)
1.22 ms ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Related
I have two lists of Pytorch 2D tensors, which are points on a plane:
ListA = tensor([ [1.0,2.0], [1.0,3.0], [4.0,8.0] ], device='cuda:0')
ListB = tensor([ [5.0,7.0], [1.0,2.0], [4.0,8.0] ], device='cuda:0')
How can I compute the points that appear in both?
Desired output = tensor([ [1.0,2.0] , [4.0,8.0] ], device='cuda:0')
I would like to find the Intersection between two lists ListA and ListB.
Note : Computation should be carried out only on CUDA.
There is no direct way in PyTorch to accomplish this (i.e., through a single function). However, there is a workaround.
Flattening both tensors:
combined = torch.cat((ListA.view(-1), ListB.view(-1)))
combined
Out[52]: tensor([1., 2., 1., 3., 4., 8., 5., 7., 1., 2., 4., 8.], device='cuda:0')
Finding unique elements:
unique, counts = combined.unique(return_counts=True)
intersection = unique[counts > 1].reshape(-1, ListA.shape[1])
intersection
Out[55]:
tensor([[1., 2.],
[4., 8.]], device='cuda:0')
Benchmarks:
def find_intersection_two_tensors(A: tensor, B: tensor):
    combined = torch.cat((A.view(-1), B.view(-1)))
    unique, counts = combined.unique(return_counts=True)
    return unique[counts > 1].reshape(-1, A.shape[1])
Timing it
%timeit find_intersection_two_tensors(ListA, ListB)
207 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
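One caveat with the flattened approach: it compares individual coordinate values, not whole points, so it only gives the right answer when the data happens to cooperate, as it does in this example. A row-wise variant using torch.unique with dim=0 is sketched below. The function name intersect_rows is mine, and the device fallback to CPU is only there so the snippet runs on machines without CUDA.

```python
import torch

def intersect_rows(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Rows present in both A and B (hypothetical helper).

    Each input is deduplicated first, so a row repeated inside A alone
    is not mistaken for a match.
    """
    A_unique = torch.unique(A, dim=0)
    B_unique = torch.unique(B, dim=0)
    combined = torch.cat((A_unique, B_unique), dim=0)
    rows, counts = torch.unique(combined, dim=0, return_counts=True)
    return rows[counts > 1]

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ListA = torch.tensor([[1.0, 2.0], [1.0, 3.0], [4.0, 8.0]], device=device)
ListB = torch.tensor([[5.0, 7.0], [1.0, 2.0], [4.0, 8.0]], device=device)
print(intersect_rows(ListA, ListB))

# The values 1.0 and 2.0 occur in both tensors, but no full point is shared,
# so the row-wise result is empty.
X = torch.tensor([[1.0, 2.0]], device=device)
Y = torch.tensor([[2.0, 1.0]], device=device)
print(intersect_rows(X, Y).shape)
```

Everything stays on the device the inputs live on, so this should still satisfy the CUDA-only requirement.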
If you are OK with moving to the CPU, NumPy could be a better solution in terms of performance:
def find_intersection_two_ndarray(AGPU: tensor, BGPU: tensor):
    A = AGPU.view(-1).cpu().numpy()
    B = BGPU.view(-1).cpu().numpy()
    C = np.intersect1d(A, B)
    return torch.from_numpy(C).cuda('cuda:0')
Timing it
%timeit find_intersection_two_ndarray(ListA, ListB)
85.4 µs ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have an array of values arr with shape (N,) and an array of coordinates coords with shape (N,2). I want to represent this in an (M,M) array grid such that grid takes the value 0 at coordinates that are not in coords, and for the coordinates that are included it should store the sum of all values in arr that have that coordinate. So if M=3, arr = np.arange(4)+1, and coords = np.array([[0,0,1,2],[0,0,2,2]]) then grid should be:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
The reason this is nontrivial is that I need to be able to repeat this step many times and the values in arr change each time, and so can the coordinates. Ideally I am looking for a vectorized solution. I suspect that I might be able to use np.where somehow but it's not immediately obvious how.
Timing the solutions
I have timed the solutions posted so far, and it appears that the first accumulator method is slightly faster than the sparse matrix method, with the second accumulator method being the slowest for the reasons explained in the comments:
%timeit for x in range(100): accumulate_arr(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): accumulate_arr_v2(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): sparse.coo_matrix((np.random.normal(0,1,10000),np.random.randint(100,size=(2,10000))),(100,100)).A
47.3 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
48.2 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
One way would be to create a sparse.coo_matrix and convert that to dense:
from scipy import sparse
sparse.coo_matrix((arr,coords),(M,M)).A
# array([[3, 0, 0],
# [0, 0, 3],
# [0, 0, 4]])
With np.bincount -
def accumulate_arr(coords, arr):
    # Get output array shape
    m,n = coords.max(1)+1
    # Get linear indices to be used as IDs with bincount
    lidx = np.ravel_multi_index(coords, (m,n))
    # Or lidx = coords[0]*(coords[1].max()+1) + coords[1]
    # Accumulate arr with IDs from lidx
    return np.bincount(lidx,arr,minlength=m*n).reshape(m,n)
Sample run -
In [58]: arr
Out[58]: array([1, 2, 3, 4])
In [59]: coords
Out[59]:
array([[0, 0, 1, 2],
[0, 0, 2, 2]])
In [60]: accumulate_arr(coords, arr)
Out[60]:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
Another approach, along similar lines but using np.add.at, might be easier to follow -
def accumulate_arr_v2(coords, arr):
    m,n = coords.max(1)+1
    out = np.zeros((m,n), dtype=arr.dtype)
    np.add.at(out, tuple(coords), arr)
    return out
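Both functions infer the output shape from the largest coordinate present, whereas the question asks for a fixed (M, M) grid. A variant that takes M explicitly might look like this. The name accumulate_on_grid and its signature are my own, and it assumes all coordinates lie in the range 0 to M-1.

```python
import numpy as np

def accumulate_on_grid(coords, arr, M):
    """Sum arr into an (M, M) grid at the (row, col) pairs given by coords, shape (2, N)."""
    # Flatten each (row, col) pair into a single index into the M*M grid.
    flat = np.ravel_multi_index(tuple(coords), (M, M))
    # bincount adds up the weights that share an index; minlength pads unused cells with 0.
    return np.bincount(flat, weights=arr, minlength=M * M).reshape(M, M)

arr = np.arange(4) + 1
coords = np.array([[0, 0, 1, 2],
                   [0, 0, 2, 2]])

# M is larger than the biggest coordinate here, so the last row and column stay zero.
grid = accumulate_on_grid(coords, arr, M=4)
print(grid)
```

Because the shape no longer depends on the data, the result has the same size on every repetition even when the coordinates change.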
I need to fill N rectangular regions of a zero-filled two-dimensional array with ones. The regions to fill are stored in an Nx4 numpy array, where each row contains rectangle bounds (x_low, x_high, y_low, y_high). This is currently the slowest part of what I'm working on, and I'm wondering if it can be done any faster.
Currently this is done by simply iterating over region array, and target array is filled with ones using slices:
import numpy as np
def fill_array_with_ones(coordinates_array, target_array):
    for row in coordinates_array:
        target_array[row[0]:row[1], row[2]:row[3]] = 1
coords = np.array([[1,3,1,3], [3,5,3,5]])
target = np.zeros((5,5))
fill_array_with_ones(coords, target)
print(target)
Output:
array([[0., 0., 0., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 0., 0., 1., 1.],
[0., 0., 0., 1., 1.]])
I was expecting that there is some numpy magic that would allow me to do it in a vectorized manner, which would get rid of iterating over rows and, hopefully, lead to faster execution:
target[bounds_to_slices(coords)] = 1
I ran some tests comparing the method mentioned in the comments with the for-loop method. Are you sure this is really your bottleneck?
# prepare data
import numpy as np
a = np.zeros((1024, 1024), '?')
bound = np.random.randint(0, 1024, (9999,4), 'H')
# build indices, as indices can be pre-computed, don't time it
x = np.arange(1024, dtype='H')[:,None]
y = x[:,None]
# the force vectorized method
%%timeit
ymask = y >= bound[:,0]
ymask &= y < bound[:,1]
xmask = x >= bound[:,2]
xmask &= x < bound[:,3]
a[(ymask & xmask).any(2)] = True
# outputs 3.06 s ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# the normal method
%%timeit
for i,j,k,l in bound:
    a[i:j,k:l] = True  # columns 0-1 bound the first axis, columns 2-3 the second, as in the masks above
# outputs 22.8 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not only is the "vectorized" method always slower, no matter how many bounding boxes there are, it also generates a 10 GB temporary array here. The normal method, on the other hand, is reasonably fast.
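If you still want to avoid the Python loop without building that huge mask, one alternative (not taken from the question or the comments) is a 2-D difference array: mark the four corners of each rectangle, then take cumulative sums along both axes. The sketch below assumes each row is (low, high, low, high) with low <= high and all values within the array bounds. The function name is mine, and I have not measured whether it beats the plain loop for your sizes.

```python
import numpy as np

def fill_rectangles(bounds, shape):
    """Boolean mask of the union of half-open rectangles [r0:r1, c0:c1]."""
    r0, r1, c0, c1 = bounds.T.astype(np.intp)
    # One extra row and column so that high == shape[k] is a valid index.
    diff = np.zeros((shape[0] + 1, shape[1] + 1), dtype=np.int64)
    # +1 where a rectangle opens and -1 where it closes, on each axis.
    # np.add.at accumulates correctly when several rectangles share a corner.
    np.add.at(diff, (r0, c0), 1)
    np.add.at(diff, (r0, c1), -1)
    np.add.at(diff, (r1, c0), -1)
    np.add.at(diff, (r1, c1), 1)
    # Two cumulative sums turn the corner marks into per-cell cover counts.
    cover = diff.cumsum(axis=0).cumsum(axis=1)[:-1, :-1]
    return cover > 0

# Compare against the straightforward loop on random, well-ordered bounds.
rng = np.random.default_rng(0)
size = 64
rows = np.sort(rng.integers(0, size + 1, size=(200, 2)), axis=1)
cols = np.sort(rng.integers(0, size + 1, size=(200, 2)), axis=1)
bounds = np.column_stack([rows, cols])

expected = np.zeros((size, size), dtype=bool)
for a0, a1, b0, b1 in bounds:
    expected[a0:a1, b0:b1] = True

mask = fill_rectangles(bounds, (size, size))
print(np.array_equal(mask, expected))
```

The temporary here is the size of the target array plus one row and column, regardless of how many rectangles there are.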
Consider the function below:
import numpy as np
a = np.ones(16).reshape(4,4)
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
print(fn(a))
That is, for a general function that calculates step t+1 from step t in an array, can I make this faster? I'm aware of np.vectorize, but it doesn't seem appropriate for this case.
For this particular example you can use cumsum. With pandas:
import numpy as np
import pandas as pd
a = np.ones(16).reshape(4,4)
df =pd.DataFrame(a)
df.cumsum(axis=1)
Or you can use np.cumsum():
np.cumsum(a,axis=1)
It is possible to reduce the two for loops to a single for loop, at the cost of a little copying overhead.
In [86]: a
Out[86]:
array([[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.],
[1., 1., 1., 1.]])
In [87]: b = a.copy()
In [88]: for col in range(b.shape[1]-1):
    ...:     b[:, col+1] = np.sum(a[:, :col+2], axis=1)
In [89]: b
Out[89]:
array([[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.],
[1., 2., 3., 4.]])
To make this work for a generic function, you can look for an equivalent function in numpy or implement one using numpy operations (vectorized one). For the example you provided, I just used numpy.sum() that does the job for us.
In terms of performance, this approach would be much better than operating using two for loops at the indices level, particularly for larger arrays. In the approach I used above, we work with slices of columns.
Here are the timings, which suggest a slightly better than 3x speedup over the native Python implementation.
Native Python:
def fn(a):
    b = np.array(a)
    for i in range(b.shape[0]):
        for j in range(b.shape[1] - 1):
            b[i][j+1] += b[i][j]
    return b
Slightly vectorized:
In [104]: def slightly_vectorized(b):
     ...:     for col in range(b.shape[1]-1):
     ...:         b[:, col+1] = np.sum(a[:, :col+2], axis=1)
     ...:     return b
In [100]: a = np.ones(625).reshape(25, 25)
In [101]: %timeit fn(a)
303 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [102]: b = a.copy()
In [103]: %timeit slightly_vectorized(b)
99.8 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
What you're looking for is called accumulate. Here's an example:
import numpy as np
from itertools import accumulate
def fn(a):
    acc = accumulate(a, lambda prev, row: prev + row)
    return np.array(list(acc))
a = np.arange(16).reshape(4, 4)
print(fn(a))
# [[ 0 1 2 3]
# [ 4 6 8 10]
# [12 15 18 21]
# [24 28 32 36]]
There is no optimized accumulate in numpy for arbitrary Python functions (only ufuncs such as np.add provide a compiled .accumulate method), because it's not really possible to write accumulate in a way that's both performant and general. The Python implementation is general, but will perform much like a hand-coded loop.
To get optimal performance you'll probably need to find or write a low level implementation of the specific accumulate function that you need. You've already mentioned numba and you could also look into cython.
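Before reaching for numba or cython, it is worth checking whether your update step is a binary ufunc, because every NumPy ufunc has an accumulate method that runs in compiled code. A short illustration follows. The running-maximum input is just made-up sample data.

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

# Running sum down the rows; this matches the itertools.accumulate version above.
running_sum = np.add.accumulate(a, axis=0)
print(running_sum)

# Other binary ufuncs work the same way, e.g. a running maximum along each row.
samples = np.array([[3, 1, 4, 1],
                    [2, 7, 1, 8]])
running_max = np.maximum.accumulate(samples, axis=1)
print(running_max)
```

This only helps when step t+1 is ufunc(step t, new value). Anything more involved still needs a loop or a compiled extension.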
Basically, I have two matrices A and B, and I want a matrix C whose entry C[m, n] is the sum of squared differences between row m of B and row n of A, i.e. C[m, n] = sum over d of (B[m, d] - A[n, d])**2.
The code below is what I do now. I take advantage of some broadcasting, but I am still left with a loop. I am new to Python so maybe I am wrong, but I have a hunch that this loop can be eliminated. Can anyone share some ideas?
EDIT: 2018-04-27 09:48:28
as requested, an example:
In [5]: A
Out[5]:
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
In [6]: B
Out[6]:
array([[0, 1],
[2, 3],
[4, 5],
[6, 7]])
In [7]: C = np.zeros ((B.shape[0], A.shape[0]))
In [8]: for m in range (B.shape[0]):
   ...:     C[m] = np.sum (np.square (B[m] - A), axis=1).flatten ()
...:
In [9]: C
Out[9]:
array([[ 0., 8., 32., 72., 128.],
[ 8., 0., 8., 32., 72.],
[ 32., 8., 0., 8., 32.],
[ 72., 32., 8., 0., 8.]])
This appears to work at the cost of some extra memory:
C = ((B[:, :, None] - A.T)**2).sum(axis=1)
Testing:
import numpy
D = 10
N = 20
M = 30
A = numpy.random.rand(N, D)
B = numpy.random.rand(M, D)
C = numpy.empty((M, N))
Timing:
for m in range(M):
    C[m] = numpy.sum((B[m, :] - A)**2, axis=1)
514 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
C2 = ((B[:, :, None] - A.T)**2).sum(axis=1)
53.6 µs ± 529 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
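If the (M, D, N) temporary created by the broadcast version becomes a problem for larger inputs, a common alternative is to expand the square, |b - a|^2 = |b|^2 + |a|^2 - 2 b.a, so that the heavy part is a single matrix product. This is a sketch with a function name of my choosing. Note that the expansion can lose a little precision to rounding, which is why the result is clipped at zero.

```python
import numpy as np

def squared_distances(B, A):
    """C[m, n] = sum_d (B[m, d] - A[n, d])**2 without an (M, D, N) temporary."""
    B_sq = (B ** 2).sum(axis=1)          # shape (M,)
    A_sq = (A ** 2).sum(axis=1)          # shape (N,)
    C = B_sq[:, None] + A_sq[None, :] - 2.0 * (B @ A.T)
    # Rounding can leave tiny negative values where the true distance is 0.
    return np.maximum(C, 0.0)

A = np.arange(10).reshape(5, 2)
B = np.arange(8).reshape(4, 2)
C = squared_distances(B, A)
print(C)

# Cross-check against the broadcast version from the answer above.
C_ref = ((B[:, :, None] - A.T) ** 2).sum(axis=1)
print(np.allclose(C, C_ref))
```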