Initialize a numpy sparse matrix efficiently - python

I have an array with m rows, whose values are arrays of column indices bounded by a large number n.
E.g.:
Y = [[1,34,203,2032],...,[2984]]
Now I want an efficient way to initialize a sparse matrix X with dimensions (m, n) whose values correspond to Y (X[i,j] = 1 if j is in Y[i], and 0 otherwise).

Your data are already close to CSR format, so I suggest using that:
import numpy as np
from scipy import sparse
from itertools import chain
# create an example
m, n = 20, 10
X = np.random.random((m, n)) < 0.1
Y = [list(np.where(y)[0]) for y in X]
# construct the sparse matrix
indptr = np.fromiter(chain((0,), map(len, Y)), int, len(Y) + 1).cumsum()
indices = np.fromiter(chain.from_iterable(Y), int, indptr[-1])
data = np.ones_like(indices)
S = sparse.csr_matrix((data, indices, indptr), (m, n))
# or, letting the shape be inferred (n becomes max(indices) + 1, which may be smaller than the true n)
S = sparse.csr_matrix((data, indices, indptr))
# check
assert np.all(S==X)
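For comparison, a minimal sketch of the same construction via COO coordinates, reusing the arrays built above (COO wants one explicit row index per entry; usually a bit slower here, but easier to read):
rows = np.fromiter(chain.from_iterable([i] * len(y) for i, y in enumerate(Y)), int, indptr[-1])
S2 = sparse.coo_matrix((data, (rows, indices)), (m, n)).tocsr()
assert (S != S2).nnz == 0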

Related

Numpy array assignment by boolean indices array

I have a very large array, but I'll use a smaller one to explain.
Given source array X
X = [[1, 1, 1, 1],
     [2, 2, 2, 2],
     [3, 3, 3, 3]]
A target array with the same size Y
Y = [[-1, -1, -1, -1],
     [-2, -2, -2, -2],
     [-3, -3, -3, -3]]
And an assignment array IDX:
IDX = [[1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 1]]
I want to assign Y to X by IDX, assigning only where IDX == 1.
In this case, something like:
X[IDX] = Y[IDX]
will result in:
X = [[-1, 1, 1, 1],
     [2, 2, -2, 2],
     [3, -3, 3, -3]]
How can this be done efficiently (not a for-loop) in numpy/pandas?
Thx
If IDX is a NumPy array of Boolean type, and X and Y are NumPy arrays, then your intuition works:
X = np.array(X)
Y = np.array(Y)
IDX = np.array(IDX).astype(bool)
X[IDX] = Y[IDX]
This changes X in place.
If you don't want to do all this type casting, or don't want to overwrite X, then np.where() does what you want in one go:
np.where(IDX==1, Y, X)
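Putting the two variants together, a minimal runnable sketch using the arrays from the question:
import numpy as np
X = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
Y = np.array([[-1, -1, -1, -1], [-2, -2, -2, -2], [-3, -3, -3, -3]])
IDX = np.array([[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1]])
Z = np.where(IDX == 1, Y, X)  # out-of-place: X is untouched
mask = IDX.astype(bool)
X[mask] = Y[mask]             # in-place: X is modified
assert (Z == X).all()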

2D indexing of scipy sparse matrix

import numpy as np
import scipy.sparse
x = np.random.randint(0, 1000, (1000, 100))
# prob better way to do this
d = np.random.random((1000,1000))
d[d < 0.99] = 0
y = scipy.sparse.csr_matrix(d)
What I would like to do is to create a new matrix z containing the values of y at the indices in x.
i.e. z[0, 0] should contain y[0, x[0, 0]],
and z[0, 1] should contain y[0, x[0, 1]].
%time for i in range(1000): y[i, x[i]].todense()
~247 ms
%time for i in range(1000): np.take(y[i].todense(), x[i])
~150 ms
both of the above work, but I am looking for a faster method- this is currently the bottleneck on my code.
Please assume that representing the whole scipy.sparse matrix as dense isn't feasible.
edit:
%time z = np.vstack([q.todense()[0, p] for q, p in zip(y, x)])
is ~110 ms
The answer seems to be to use an appropriately shaped broadcasting index, as outlined here: How to generate multi-dimensional 2D numpy index using a sub-index for one dimension
(that answer deserves more upvotes!)
%time res = y[np.arange(0, 1000).reshape((-1, 1)), x].todense()
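A quick sanity check that the broadcast-indexed result matches the per-row loop from the edit (reusing x and y from above):
z_loop = np.vstack([q.todense()[0, p] for q, p in zip(y, x)])
z_fast = y[np.arange(0, 1000).reshape((-1, 1)), x].todense()
assert (z_loop == z_fast).all()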

Quickly generate numpy array

I want to generate a sparse numpy ndarray using the row-index vector, the column-index vector, and the value vector of its elements.
For example, if I have
row_index=np.array([0,1,2])
column_index=np.array([2,1,0])
value=np.array([4,5,6])
Then I want a matrix
[0, 0, 4
 0, 5, 0
 6, 0, 0]
Is there a function in numpy that does a similar thing to scipy.sparse.csc_matrix((data, (row_ind, col_ind)), [shape=(M, N)])? If not, is there a way to generate the matrix without for loops? I want to speed up the code, but scipy.sparse is quite slow during the calculation, and the matrix I want is not that large.
If the matrix you want is not very large, it might be faster to just create a regular (non-sparse) ndarray. For example, you can use the following code to generate a dense matrix using only numpy:
row_index = np.array([0, 1, 2])
column_index = np.array([2, 1, 0])
values = np.array([4, 5, 6])
# numpy dense
M = np.zeros((np.max(row_index) + 1, np.max(column_index) + 1))
M[row_index, column_index] = values
On my machine, creating the matrix (the last two lines) takes approximately 6.3 μs to run. I compared it to the following code, which uses scipy.sparse:
# scipy sparse
M = scipy.sparse.csc_matrix((values, (row_index, column_index)),
                            shape=(np.max(row_index) + 1, np.max(column_index) + 1))
This takes approximately 80 μs to run. Because you asked for a method to create a sparse array, I changed the first implementation to the following code, so that the created ndarray is converted into a sparse array:
# numpy sparse
M = np.zeros((np.max(row_index) + 1, np.max(column_index) + 1))
M[row_index, column_index] = values
M = scipy.sparse.csc_matrix(M)
This takes approximately 82 μs to run. The bottleneck in this code is clearly the conversion of the dense ndarray into a sparse matrix.
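One caveat when choosing between the two approaches (a small illustrative sketch, not from the original answer): with duplicate (row, column) pairs, dense fancy-index assignment keeps only the last value written, while the scipy.sparse constructors sum duplicates:
r = np.array([0, 0, 1])
c = np.array([0, 0, 1])
v = np.array([1, 2, 3])
M = np.zeros((2, 2))
M[r, c] = v  # duplicate (0, 0): last write wins, so M[0, 0] == 2
S = scipy.sparse.csc_matrix((v, (r, c)), shape=(2, 2))
# duplicates are summed, so S[0, 0] == 3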
Note that the scipy.sparse method scales very well as a function of matrix size, and eventually becomes the fastest for larger matrices (on my machine, starting from approximately 360×360). The figure produced by the code below gives an indication of the speed of each method as a function of matrix size, from a 10×10 matrix up to a 1000×1000 matrix. Some outliers are most likely due to other programs on my machine interfering. Furthermore, I am not sure of the technical details behind the 'jumps' in the numpy dense method at ~360×360 and ~510×510. I have included the code I used for this comparison so that you can run it on your own machine.
import timeit
import matplotlib.pyplot as plt
import numpy as np
import scipy.sparse
def generate_indices(num_values):
    row_index = np.arange(num_values)
    column_index = np.arange(num_values)[::-1]
    values = np.arange(num_values)
    return row_index, column_index, values

def numpy_dense(N, row_index, column_index, values):
    start = timeit.default_timer()
    for _ in range(N):
        M = np.zeros((np.max(row_index) + 1, np.max(column_index) + 1))
        M[row_index, column_index] = values
    end = timeit.default_timer()
    return (end - start) / N

def numpy_sparse(N, row_index, column_index, values):
    start = timeit.default_timer()
    for _ in range(N):
        M = np.zeros((np.max(row_index) + 1, np.max(column_index) + 1))
        M[row_index, column_index] = values
        M = scipy.sparse.csc_matrix(M)
    end = timeit.default_timer()
    return (end - start) / N

def scipy_sparse(N, row_index, column_index, values):
    start = timeit.default_timer()
    for _ in range(N):
        M = scipy.sparse.csc_matrix((values, (row_index, column_index)),
                                    shape=(np.max(row_index) + 1, np.max(column_index) + 1))
    end = timeit.default_timer()
    return (end - start) / N
ns = np.arange(10, 1001, 10) # matrix size to try with
runtimes_numpy_dense, runtimes_numpy_sparse, runtimes_scipy_sparse = [], [], []
for n in ns:
    print(n)
    indices = generate_indices(n)
    # number of iterations for timing
    # ideally, you want this to be as high as possible,
    # but I didn't want to wait very long for this plot
    N = 1000 if n < 500 else 100
    runtimes_numpy_dense.append(numpy_dense(N, *indices))
    runtimes_numpy_sparse.append(numpy_sparse(N, *indices))
    runtimes_scipy_sparse.append(scipy_sparse(N, *indices))
fig, ax = plt.subplots()
ax.plot(ns, runtimes_numpy_dense, 'x-', markersize=4, label='numpy dense')
ax.plot(ns, runtimes_numpy_sparse, 'x-', markersize=4, label='numpy sparse')
ax.plot(ns, runtimes_scipy_sparse, 'x-', markersize=4, label='scipy sparse')
ax.set_yscale('log')
ax.set_xlabel('Matrix size')
ax.set_ylabel('Runtime (s)')
ax.legend()
plt.show()
You can create your (sparse) array in coordinate (COO) format, where you pass:
values to be put at specified coordinates,
row coordinates,
column coordinates.
The code to do it can be:
import scipy.sparse as ss
arr = ss.coo_matrix((value, (row_index, column_index)))
Note that your row_index and column_index are already 0-based, matching how array indices work, so they can be passed to coo_matrix unchanged.
When you print arr.toarray(), you will get just what you wanted:
array([[0, 0, 4],
       [0, 5, 0],
       [6, 0, 0]])

Fancy indexing of a numpy ndarray

Suppose I have an array a shaped like this:
import numpy as np
n = 10
d = 5
a = np.zeros(shape = np.repeat(n,d))
And I want to obtain the values at indices (0,...,:,...,0), with the : running along each dimension in turn, resulting in an (n,d)-shaped array b with b[i,j] = a[0,...,0,i,0,...,0], where i is in the jth dimension.
How can I extract b from a?
Get the flattened indices and just index for a vectorized solution -
n = len(a)
d = a.ndim
# flat index (C order) of the element with index i along dimension j: i * n**(d-1-j)
idxs = np.multiply.outer(np.arange(n), n ** np.arange(d - 1, -1, -1))
out = a.flat[idxs]  # shape (n, d), with out[i, j] = a[0,...,0,i,0,...,0], i in dimension j
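Since a in the example is all zeros, any indexing looks correct; a quick check on a non-trivial array (hypothetical: a refilled with a ramp of values) confirms the flat-index formula:
a = np.arange(n ** d).reshape((n,) * d)
out = a.flat[idxs]
assert out[3, 1] == a[0, 3, 0, 0, 0]  # both equal 3 * 10**3 == 3000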
Easiest is to do a for loop:
# get the first slice of `a` along a given dimension `j`
def get_slice(a, j):
    idx = [0] * a.ndim
    idx[j] = slice(None)
    return a[tuple(idx)]
out = np.stack([get_slice(a, j) for j in range(a.ndim)], axis=-1)
And out.shape is (10, 5).

python sample from a scipy sparse matrix

I have a scipy sparse matrix as for example:
import numpy as np
import scipy as sp
from scipy import sparse
X = sparse.csr_matrix(np.random.randint(0, 10, (100, 10)))
I need to add K rows to this matrix. Each column of these new rows should be obtained by sampling from the same column of the original matrix.
So, for example, the desired result should be something like:
Z = np.concat(X, X_sampled, axis=0)
where X_sampled[:,i] = np.random.choice(X[:,i], k)
How can I do that without moving to a dense matrix?
EDIT: An example with dense array
import numpy as np
import scipy as sp
k = 20
X = np.random.randint(0, 10, (100, 10))
X2 = np.zeros(shape=(k, X.shape[1]))
for col_id in range(X.shape[1]):
    X2[:, col_id] = np.random.choice(X[:, col_id], k)
res = np.concatenate([X, X2])
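One possible sparse-friendly sketch, reusing the broadcast fancy-indexing trick from the 2D-indexing question above (this assumes uniform sampling with replacement, zeros included, mirroring np.random.choice):
import numpy as np
from scipy import sparse
k = 20
X = sparse.csr_matrix(np.random.randint(0, 10, (100, 10)))
m, n = X.shape
# one independent set of k uniform row picks per column;
# X[r, i] for a random r reproduces np.random.choice(X[:, i], k) without densifying X
rows = np.random.randint(0, m, size=(k, n))
cols = np.arange(n)                 # broadcasts against rows
X_sampled = X[rows, cols]           # stays sparse, shape (k, n)
Z = sparse.vstack([X, X_sampled])   # the desired concatenation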
