I'm writing a numerical code where I'm using scipy.sparse.dia_matrix. My matrices are quite large (up to about 1000000 x 1000000), but very sparse: sometimes tridiagonal, sometimes with a few more diagonals.
For various reasons, it is extremely convenient and clear from a coding point of view to just add together several of these matrices (of the same size, of course). However, I have found that adding these sparse matrices is very slow. The following example illustrates what I mean:
import numpy as np
from scipy.sparse import diags, dia_matrix
N = 100000
M1 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M2 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
M3 = diags(diagonals = [np.random.random(N-1), np.random.random(N), np.random.random(N-1)], offsets = [-1, 0, 1])
def simple_add():
    M = M1 + M2 + M3

def complicated_add():
    M_ = dia_matrix((N, N))
    for d in [-1, 0, 1]:
        M_.setdiag(M1.diagonal(d) + M2.diagonal(d) + M3.diagonal(d), d)

%timeit simple_add()
%timeit complicated_add()
The output of the timing is:
16.9 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
959 µs ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I don't understand why adding the matrices together with the + operator is roughly 17 times slower than creating an empty diagonal matrix and explicitly setting the diagonals. Is there anything I can do to speed this up? I would much prefer to keep the simpler expression with the + operator, as it's far more readable, but not at the expense of an order-of-magnitude increase in computational time.
Update:
I proposed a change in Scipy that would make addition of two instances of dia_matrix faster, and after a bit of discussion I submitted a pull request to Scipy, which has now been merged. So in the future, adding two instances of dia_matrix will no longer convert to csr_matrix.
https://github.com/scipy/scipy/pull/14004
diags makes a dia_matrix from the list of inputs:
In [84]: M=sparse.diags([np.arange(1,4),np.arange(1,5),np.arange(1,4)], offsets=[-1,0,1])
In [85]: M
Out[85]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements (3 diagonals) in DIAgonal format>
In [86]: M.offsets
Out[86]: array([-1, 0, 1], dtype=int32)
In [87]: M.data
Out[87]:
array([[1., 2., 3., 0.],
[1., 2., 3., 4.],
[0., 1., 2., 3.]])
The list of diagonals (of different lengths) has been transformed into a 2D data array, together with the offsets. The dia format is intended primarily as an input format. Most, if not all, math is implemented in the csr format, and even there matrix multiplication is the relative strong point. Element-wise math is distinctly slower than the numpy array equivalents.
In [89]: Mr=M.tocsr()
In [90]: Mr
Out[90]:
<4x4 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in Compressed Sparse Row format>
In [91]: Mr.data
Out[91]: array([1., 1., 1., 2., 2., 2., 3., 3., 3., 4.])
In [92]: Mr.indices
Out[92]: array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3], dtype=int32)
In [93]: Mr.indptr
Out[93]: array([ 0, 2, 5, 8, 10], dtype=int32)
The dia format suggests a faster addition, if the offsets and shape are all the same.
In [94]: M.data += M.data + M.data
In [95]: M.data
Out[95]:
array([[ 3., 6., 9., 0.],
[ 3., 6., 9., 12.],
[ 0., 3., 6., 9.]])
In [96]: M.A
Out[96]:
array([[ 3., 3., 0., 0.],
[ 3., 6., 6., 0.],
[ 0., 6., 9., 9.],
[ 0., 0., 9., 12.]])
With any of the sparse formats, if the sparsity is the same for all arguments and output, you can often do math directly on the data attribute, leaving the implied 0's unchanged.
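For the question's M1, M2 and M3 this idea boils down to something like the following sketch (an assumption on my part: all three share the same shape and the same offsets stored in the same order, which holds because they were built by identical diags calls):
M = M1.copy()
M.data += M2.data + M3.data   # touches only the stored diagonals; the implied zeros never materialise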
The implementation of _add_sparse(self, other) is return self.tocsr()._add_sparse(other). The extra time is to turn it into a CSR matrix (which has a C extension for addition).
Could you create a sparse matrix that does what you want? Probably.
from scipy.sparse import dia_matrix, isspmatrix_dia

class dia_matrix_adder(dia_matrix):
    def __add__(self, other):
        if not isspmatrix_dia(other):
            return super(dia_matrix_adder, self).__add__(other)
        M_ = dia_matrix((self.shape[0], self.shape[1]))
        for d in [-1, 0, 1]:
            M_.setdiag(self.diagonal(d) + other.diagonal(d), d)
        return M_
I would probably not do that, though, and instead just write a function:
def add_dia_matrix(*mats):
    if len(mats) == 1:
        return mats[0]
    M_ = dia_matrix((mats[0].shape[0], mats[0].shape[1]))
    for d in [-1, 0, 1]:
        M_diag = mats[0].diagonal(d).copy()
        for i in range(1, len(mats)):
            M_diag += mats[i].diagonal(d)
        M_.setdiag(M_diag, d)
    return M_
This should be as readable as a bunch of + without having to deal with a new class.
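A quick sanity check (my addition, not part of the original answer) that the helper returns the same matrix as the + chain, using the M1, M2, M3 from the question:
M_plus = M1 + M2 + M3                 # the + chain (may go through CSR internally)
M_fast = add_dia_matrix(M1, M2, M3)   # the helper above
for d in [-1, 0, 1]:
    assert np.allclose(M_plus.diagonal(d), M_fast.diagonal(d))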
%timeit simple_add()
30.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit complicated_add()
1.28 ms ± 2.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit add_dia_matrix(M1, M2, M3)
1.22 ms ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have an upper-triangular matrix of np.float64 values, like this:
array([[ 1., 2., 3., 4.],
[ 0., 5., 6., 7.],
[ 0., 0., 8., 9.],
[ 0., 0., 0., 10.]])
I would like to convert this into the corresponding symmetric matrix, like this:
array([[ 1., 2., 3., 4.],
[ 2., 5., 6., 7.],
[ 3., 6., 8., 9.],
[ 4., 7., 9., 10.]])
The conversion can be done in place, or as a new matrix. I would like it to be as fast as possible. How can I do this quickly?
np.where seems quite fast in the out-of-place, no-cache scenario:
np.where(ut,ut,ut.T)
On my laptop:
timeit(lambda:np.where(ut,ut,ut.T))
# 1.909718865994364
If you have pythran installed you can speed this up about 3x with near-zero effort. But note that, as far as I know, pythran (currently) only understands contiguous arrays.
file <upp2sym.py>, compile with pythran -O3 upp2sym.py
import numpy as np
#pythran export upp2sym(float[:,:])
def upp2sym(a):
    return np.where(a,a,a.T)
Timing:
from upp2sym import *
timeit(lambda:upp2sym(ut))
# 0.5760842661838979
This is almost as fast as looping:
#pythran export upp2sym_loop(float[:,:])
def upp2sym_loop(a):
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i,i] = a[i,i]
        for j in range(i):
            out[i,j] = out[j,i] = a[j,i]
    return out
Timing:
timeit(lambda:upp2sym_loop(ut))
# 0.4794591029640287
We can also do it in place:
#pythran export upp2sym_inplace(float[:,:])
def upp2sym_inplace(a):
    for i in range(len(a)):
        for j in range(i):
            a[i,j] = a[j,i]
Timing:
timeit(lambda:upp2sym_inplace(ut))
# 0.28711927914991975
This is the fastest routine I've found so far that doesn't use Cython or a JIT like Numba. It takes about 1.6 μs on my machine to process a 4x4 array (average time over a list of 100K 4x4 arrays):
inds_cache = {}

def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    try:
        inds = inds_cache[n]
    except KeyError:
        inds = np.tri(n, k=-1, dtype=bool)
        inds_cache[n] = inds
    ut[inds] = ut.T[inds]
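As a quick sanity check (my addition, not from the original post), the cached routine can be compared against a transpose-based reference on a single 4x4 array:
import numpy as np

ut = np.triu(np.arange(1.0, 17.0).reshape(4, 4))   # example upper-triangular input
expected = np.where(ut, ut, ut.T)                  # reference symmetric result
upper_triangular_to_symmetric(ut)                  # fills the strict lower triangle in place
assert np.array_equal(ut, expected)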
Here are some other things I've tried that are not as fast:
The above code, but without the cache. Takes about 8.3 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    inds = np.tri(n, k=-1, dtype=bool)
    ut[inds] = ut.T[inds]
A plain Python nested loop. Takes about 2.5 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
Floating point addition using np.triu. Takes about 11.9 μs per 4x4 array:
def upper_triangular_to_symmetric(ut):
    ut += np.triu(ut, k=1).T
Numba version of Python nested loop. This was the fastest thing I found (about 0.4 μs per 4x4 array), and was what I ended up using in production, at least until I started running into issues with Numba and had to revert back to a pure Python version:
import numba
@numba.njit()
def upper_triangular_to_symmetric(ut):
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
Cython version of Python nested loop. I'm new to Cython so this may not be fully optimized. Since Cython adds operational overhead, I'm interested in hearing both Cython and pure-Numpy answers. Takes about 0.6 μs per 4x4 array:
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def upper_triangular_to_symmetric(np.ndarray[np.float64_t, ndim=2] ut):
    cdef int n, r, c
    n = ut.shape[0]
    for r in range(1, n):
        for c in range(r):
            ut[r, c] = ut[c, r]
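For completeness, here is a minimal build script for the Cython snippet above (my sketch; it assumes the code is saved as upper_tri.pyx and is built in place with python setup.py build_ext --inplace):
# setup.py -- hypothetical build script for the .pyx above (filename assumed)
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("upper_tri.pyx"),
    include_dirs=[np.get_include()],   # needed because the module cimports numpy
)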
You are mainly measuring function call overhead on such tiny problems.
Another way to do that would be to use Numba. Let's start with an implementation for only one (4x4) array.
Only one 4x4 array
import numpy as np
import numba as nb
@nb.njit()
def sym(A):
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[j,i]=A[i,j]
    return A
A=np.array([[ 1., 2., 3., 4.],
[ 0., 5., 6., 7.],
[ 0., 0., 8., 9.],
[ 0., 0., 0., 10.]])
%timeit sym(A)
#277 ns ± 5.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Larger example
@nb.njit(parallel=False)
def sym_3d(A):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            for k in range(A.shape[2]):
                A[i,k,j]=A[i,j,k]
    return A
A=np.random.rand(1_000_000,4,4)
%timeit sym_3d(A)
#13.8 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#13.8 ns per 4x4 submatrix
I need to fill N rectangle-shaped regions of a 0-filled two-dimensional array with ones. The regions to fill are stored in an Nx4 numpy array, where each row contains the rectangle bounds (x_low, x_high, y_low, y_high). This is currently the slowest part of what I'm working on, and I'm wondering if it can be done any faster.
Currently this is done by simply iterating over the region array and filling the target array with ones using slices:
import numpy as np
def fill_array_with_ones(coordinates_array, target_array):
    for row in coordinates_array:
        target_array[row[0]:row[1], row[2]:row[3]] = 1
coords = np.array([[1,3,1,3], [3,5,3,5]])
target = np.zeros((5,5))
fill_array_with_ones(coords, target)
print(target)
Output:
array([[0., 0., 0., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 1., 1., 0., 0.],
[0., 0., 0., 1., 1.],
[0., 0., 0., 1., 1.]])
I was expecting there to be some numpy magic that would allow me to do this in a vectorized manner, getting rid of the iteration over rows and, hopefully, leading to faster execution:
target[bounds_to_slices(coords)] = 1
I did some tests comparing the method mentioned in the comments with the for-loop method. I'm wondering: is your bottleneck really this?
# prepare data
import numpy as np
a = np.zeros((1024, 1024), '?')
bound = np.random.randint(0, 1024, (9999,4), 'H')
# build indices, as indices can be pre-computed, don't time it
x = np.arange(1024, dtype='H')[:,None]
y = x[:,None]
# the force vectorized method
%%timeit
ymask = y >= bound[:,0]
ymask &= y < bound[:,1]
xmask = x >= bound[:,2]
xmask &= x < bound[:,3]
a[(ymask & xmask).any(2)] = True
# outputs 3.06 s ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# the normal method
%%timeit
for i,j,k,l in bound:
    a[i:k,j:l] = True
# outputs 22.8 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Not only is the "vectorized" method always slower no matter how many bounding boxes there are, it also generates a 10 GB temporary array here. The normal method, on the other hand, is reasonably fast.
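A possible middle ground (my own sketch, not from the original answer) is a 2D difference array: bump the four corners of every rectangle, take cumulative sums, and threshold. This avoids both the Python-level loop and the huge boolean intermediate, at the cost of one (H+1)x(W+1) counter array. It assumes the bounds follow the question's convention target[r0:r1, c0:c1] with r0 <= r1 and c0 <= c1:
import numpy as np

def fill_with_ones_diff(coords, shape):
    # coords rows are (r0, r1, c0, c1), half-open like the slices in the question
    r0, r1, c0, c1 = coords.T
    acc = np.zeros((shape[0] + 1, shape[1] + 1), dtype=np.int64)
    np.add.at(acc, (r0, c0), 1)    # +1 at the top-left corner of each rectangle
    np.add.at(acc, (r0, c1), -1)   # cancel to the right of the rectangle
    np.add.at(acc, (r1, c0), -1)   # cancel below the rectangle
    np.add.at(acc, (r1, c1), 1)    # restore the doubly-cancelled corner
    # 2D prefix sum; a cell is covered if at least one rectangle contains it
    return (acc.cumsum(axis=0).cumsum(axis=1)[:-1, :-1] > 0).astype(np.uint8)
With the question's coords, fill_with_ones_diff(coords, (5, 5)) reproduces the output shown above; whether it beats the plain slicing loop depends on how many rectangles there are relative to the array size.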
I have two 1D numpy arrays A and B, of size (n,) and (m,) respectively, which correspond to the x positions of points on a line. I want to calculate the distance between every point in A and every point in B. I then need to use these distances at a set y distance, d, to work out the potential at each point in A.
I'm currently using the following:
V = numpy.zeros(n)
for i in range(n):
    xdist = A[i] - B
    r = numpy.sqrt(xdist**2 + d**2)
    dV = 1/r
    V[i] = numpy.sum(dV)
This works, but for large data sets it can take a while. I would like to use a function similar to scipy.spatial.distance.cdist, but that doesn't work for 1D arrays, and I don't want to add another dimension to the arrays as they would become too large.
Vectorized approach
One vectorized approach after extending A to 2D with the introduction of a new axis using np.newaxis/None and thus making use of broadcasting would be -
(1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)
Hybrid approach for large arrays
Now, for large arrays, we might have to divide the data into chunks.
Thus, with BSZ as the block size, we would have a hybrid approach, like so -
dsq = d**2
V = np.zeros((n//BSZ,BSZ))
for i in range(n//BSZ):
    V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
Runtime test
Approaches -
def original_app(A,B,d):
    V = np.zeros(n)
    for i in range(n):
        xdist = A[i] - B
        r = np.sqrt(xdist**2 + d**2)
        dV = 1/r
        V[i] = np.sum(dV)
    return V

def vectorized_app1(A,B,d):
    return (1/(np.sqrt((A[:,None] - B)**2 + d**2))).sum(1)

def vectorized_app2(A,B,d, BSZ = 100):
    dsq = d**2
    V = np.zeros((n//BSZ,BSZ))
    for i in range(n//BSZ):
        V[i] = (1/(np.sqrt((A[i*BSZ:(i+1)*BSZ,None] - B)**2 + dsq))).sum(1)
    return V.ravel()
Timings and verification -
In [203]: # Setup inputs
...: n,m = 10000,2000
...: A = np.random.rand(n)
...: B = np.random.rand(m)
...: d = 10
...:
In [204]: out1 = original_app(A,B,d)
...: out2 = vectorized_app1(A,B,d)
...: out3 = vectorized_app2(A,B,d, BSZ = 100)
...:
...: print np.allclose(out1, out2)
...: print np.allclose(out1, out3)
...:
True
True
In [205]: %timeit original_app(A,B,d)
10 loops, best of 3: 133 ms per loop
In [206]: %timeit vectorized_app1(A,B,d)
10 loops, best of 3: 138 ms per loop
In [207]: %timeit vectorized_app2(A,B,d, BSZ = 100)
10 loops, best of 3: 65.2 ms per loop
We can play around with the parameter block size BSZ -
In [208]: %timeit vectorized_app2(A,B,d, BSZ = 200)
10 loops, best of 3: 74.5 ms per loop
In [209]: %timeit vectorized_app2(A,B,d, BSZ = 50)
10 loops, best of 3: 67.4 ms per loop
Thus, the best one seems to be giving a 2x speedup with a block size of 100 at my end.
EDIT: My answer turned out to be nearly identical to Divakar's after a closer look. However, you can save some memory by doing the operations in place. Taking the sum along the second axis is more efficient than along the first.
import numpy
a = numpy.random.randint(0,10,10) * 1.
b = numpy.random.randint(0,10,10) * 1.
xdist = a[:,None] - b
xdist **= 2
xdist += d**2                 # d as defined in the question
xdist **= -0.5                # in-place reciprocal square root: 1/sqrt((a_i - b_j)**2 + d**2)
V = numpy.sum(xdist, axis=1)
which gives the same solution as your code.
I would like to use a function similar to scipy.spatial.distance.cdist which doesn't work for 1D arrays and I don't want to add another dimension to the arrays as they become too large.
cdist works fine, you just have to reshape the arrays to have shape (n, 1) instead of (n,). You can add another dimension to a one-dimensional array A without copying the underlying data by using A[:, None] or A.reshape(-1, 1).
For example,
In [56]: from scipy.spatial.distance import cdist
In [57]: A
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [58]: B
Out[58]: array([0, 2, 4, 6, 8])
In [59]: A[:, None]
Out[59]:
array([[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9]])
In [60]: cdist(A[:, None], B[:, None])
Out[60]:
array([[ 0., 2., 4., 6., 8.],
[ 1., 1., 3., 5., 7.],
[ 2., 0., 2., 4., 6.],
[ 3., 1., 1., 3., 5.],
[ 4., 2., 0., 2., 4.],
[ 5., 3., 1., 1., 3.],
[ 6., 4., 2., 0., 2.],
[ 7., 5., 3., 1., 1.],
[ 8., 6., 4., 2., 0.],
[ 9., 7., 5., 3., 1.]])
To compute V as shown in your code, you can use cdist with metric='sqeuclidean', as follows:
In [72]: d = 3.
In [73]: r = np.sqrt(cdist(A[:,None], B[:,None], metric='sqeuclidean') + d**2)
In [74]: V = (1/r).sum(axis=1)
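Putting the two steps together, a small wrapper (my phrasing, using the question's A, B and d) might look like this:
import numpy as np
from scipy.spatial.distance import cdist

def potential_cdist(A, B, d):
    # squared x-distances for every pair, shifted by d**2, then 1/r summed over B
    r = np.sqrt(cdist(A[:, None], B[:, None], metric='sqeuclidean') + d**2)
    return (1.0 / r).sum(axis=1)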
Given two numpy arrays of shape nx3 and mx3, what is an efficient way to determine the row indices (counters) at which the rows are common to the two arrays? For instance, I have the following solution, which becomes significantly slow for even moderately larger arrays:
def arrangment(arr1,arr2):
    hits = []
    for i in range(arr2.shape[0]):
        current_row = np.repeat(arr2[i,:][None,:],arr1.shape[0],axis=0)
        x = current_row - arr1
        for j in range(arr1.shape[0]):
            if np.isclose(x[j,0],0.0) and np.isclose(x[j,1],0.0) and np.isclose(x[j,2],0.0):
                hits.append(j)
    return hits
It checks whether rows of arr2 exist in arr1 and returns the row indices of arr1 where the rows match. I need this arrangement to always follow the order of the rows of arr2. For instance, given
arr1 = np.array([[-1., -1., -1.],
[ 1., -1., -1.],
[ 1., 1., -1.],
[-1., 1., -1.],
[-1., -1., 1.],
[ 1., -1., 1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
arr2 = np.array([[-1., 1., -1.],
[ 1., 1., -1.],
[ 1., 1., 1.],
[-1., 1., 1.]])
The function should return:
[3, 2, 6, 7]
Quick and dirty answer
(arr1[:, None] == arr2).all(-1).argmax(0)
array([3, 2, 6, 7])
Better answer
Takes care of the case where a row in arr2 doesn't match anything in arr1:
t = (arr1[:, None] == arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
array([ 3., 2., 6., 7.])
As pointed out by @Divakar, np.isclose accounts for rounding error when comparing floats:
t = np.isclose(arr1[:, None], arr2).all(-1)
np.where(t.any(0), t.argmax(0), np.nan)
I had a similar problem in the past and I came up with a fairly optimised solution for it.
First you need a generalisation of numpy.unique for multidimensional arrays, which, for the sake of completeness, I will copy here:
def unique2d(arr,consider_sort=False,return_index=False,return_inverse=False):
    """Get unique values along an axis for 2D arrays.

    input:
        arr:
            2D array
        consider_sort:
            Does permutation of the values within the axis matter?
            Two rows can contain the same values but with
            different arrangements. If consider_sort
            is True then those rows would be considered equal
        return_index:
            Similar to numpy unique
        return_inverse:
            Similar to numpy unique

    returns:
        2D array of unique rows
        If return_index is True also returns indices
        If return_inverse is True also returns the inverse array
    """

    if consider_sort is True:
        a = np.sort(arr,axis=1)
    else:
        a = arr
    b = np.ascontiguousarray(a).view(np.dtype((np.void,
        a.dtype.itemsize * a.shape[1])))

    if return_inverse is False:
        _, idx = np.unique(b, return_index=True)
    else:
        _, idx, inv = np.unique(b, return_index=True, return_inverse=True)

    if return_index == False and return_inverse == False:
        return arr[idx]
    elif return_index == True and return_inverse == False:
        return arr[idx], idx
    elif return_index == False and return_inverse == True:
        return arr[idx], inv
    else:
        return arr[idx], idx, inv
Now all you need is to concatenate (np.vstack) your arrays and find the unique rows. The inverse mapping, together with np.searchsorted, will give you the indices you need. So let's write another function, similar to numpy.in1d but for multidimensional (2D) arrays:
def in2d_unsorted(arr1, arr2, axis=1, consider_sort=False):
    """Find the elements in arr1 which are also in
    arr2 and sort them as they appear in arr2"""

    assert arr1.dtype == arr2.dtype

    if axis == 0:
        arr1 = np.copy(arr1.T,order='C')
        arr2 = np.copy(arr2.T,order='C')

    if consider_sort is True:
        sorter_arr1 = np.argsort(arr1)
        arr1 = arr1[np.arange(arr1.shape[0])[:,None],sorter_arr1]
        sorter_arr2 = np.argsort(arr2)
        arr2 = arr2[np.arange(arr2.shape[0])[:,None],sorter_arr2]

    arr = np.vstack((arr1,arr2))
    _, inv = unique2d(arr, return_inverse=True)

    size1 = arr1.shape[0]
    size2 = arr2.shape[0]

    arr3 = inv[:size1]
    arr4 = inv[-size2:]

    # Sort the indices as they appear in arr2
    sorter = np.argsort(arr3)
    idx = sorter[arr3.searchsorted(arr4, sorter=sorter)]

    return idx
Now all you need to do is call in2d_unsorted with your input parameters
>>> in2d_unsorted(arr1,arr2)
array([ 3, 2, 6, 7])
While it may not be fully optimised, this approach is much faster. Let's benchmark it against @piRSquared's solution:
def indices_piR(arr1,arr2):
    t = np.isclose(arr1[:, None], arr2).all(-1)
    return np.where(t.any(0), t.argmax(0), np.nan)
with the following arrays
n=150
arr1 = np.random.permutation(n).reshape(n//3, 3)
idx = np.random.permutation(n//3)
arr2 = arr1[idx]
In [13]: np.allclose(in2d_unsorted(arr1,arr2),indices_piR(arr1,arr2))
True
In [14]: %timeit indices_piR(arr1,arr2)
10000 loops, best of 3: 181 µs per loop
In [15]: %timeit in2d_unsorted(arr1,arr2)
10000 loops, best of 3: 85.7 µs per loop
Now, for n=1500
In [24]: %timeit indices_piR(arr1,arr2)
100 loops, best of 3: 10.3 ms per loop
In [25]: %timeit in2d_unsorted(arr1,arr2)
1000 loops, best of 3: 403 µs per loop
and for n=15000
In [28]: %timeit indices_piR(A,B)
1 loop, best of 3: 1.02 s per loop
In [29]: %timeit in2d_unsorted(arr1,arr2)
100 loops, best of 3: 4.65 ms per loop
So for largish arrays this is over 200x faster than @piRSquared's vectorised solution.