I have a sparse CSC matrix, "A", in which I want to replace the first row with a vector that is all zeros, except for the first entry which is 1.
So far I am doing the inefficient version, e.g.:
import numpy as np
from scipy.sparse import csc_matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))
replace = np.zeros(3)
replace[0] = 1
A[0,:] = replace
A.eliminate_zeros()
But I'd like to do it with .indptr, .data, etc. As it is a CSC, I am guessing that this might be inefficient as well? In my exact problem, the matrix is 66000 X 66000.
For a CSR sparse matrix I've seen it done as
A.data[1:A.indptr[1]] = 0
A.data[0] = 1.0
A.indices[0] = 0
A.eliminate_zeros()
So, basically I'd like to do the same for a CSC sparse matrix.
Expected result: To do exactly the same as above, just more efficiently (applicable to very large sparse matrices).
That is, start with:
[1, 0, 4],
[0, 0, 5],
[2, 3, 6]
and replace the top row with a vector that is as long as the matrix is wide and is all zeros except for a 1 at the beginning. One should then end up with
[1, 0, 0],
[0, 0, 5],
[2, 3, 6]
And be able to do it for large sparse CSC matrices efficiently.
Thanks in advance :-)
You can do it with indptr and indices. First, construct your matrix from the data, indices and indptr arrays:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, indices, indptr), shape=(3,3))
To zero every element of the first row, set the data values to zero wherever indices equals zero (in CSC format, indices holds the row index of each stored value). In other words:
data[indices == 0] = 0
The line above sets all the elements of the first row to 0. To avoid setting the first element to zero, we can do the following:
indices_tmp = indices == 0
indices_tmp[0] = False  # keep the first stored entry, which is A[0, 0] in this example
data[indices_tmp] = 0
A = csc_matrix((data, indices, indptr), shape=(3,3))
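The masking idea above relies on A[0, 0] being the first stored entry and already holding 1. A minimal sketch that drops both assumptions is given here; the helper name first_row_to_unit is hypothetical, and the sparse sum used to place the 1 is one option among several rather than the only way to do it.

```python
import numpy as np
from scipy.sparse import csc_matrix

def first_row_to_unit(A):
    """Return a CSC copy of A whose first row is [1, 0, ..., 0]."""
    A = csc_matrix(A, copy=True)
    # In CSC, `indices` holds the row index of every stored value,
    # so this mask selects all stored entries of row 0.
    A.data[A.indices == 0] = 0
    A.eliminate_zeros()
    # Place the single 1 at (0, 0) with a sparse sum; this also covers
    # the case where A[0, 0] was not stored to begin with.
    unit = csc_matrix(([1], ([0], [0])), shape=A.shape, dtype=A.dtype)
    return (A + unit).tocsc()

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))
B = first_row_to_unit(A)
print(B.toarray())
```

For the 3 x 3 example this gives the expected result from the question, and the original A is left untouched because the helper works on a copy.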
Hope it helps.
I have a matrix in a sparse csr format for example:
from scipy.sparse import csr_matrix
import numpy as np
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))
M.A =
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I am re-ordering the matrix with the index [2,0,1] using the following approach:
order = np.array([2,0,1])
M = M[order,:]
M = M[:,order]
M.A
array([[6, 4, 5],
[2, 1, 0],
[3, 0, 0]])
This approach works, but it is not feasible for my real csr_matrix, which has a size of 16580746 X 1672751804 and causes a memory error.
I took another approach like this:
edge_list = zip(row, col, data)
index = dict(zip(order, range(len(order))))
all_coeff = zip(*((index[u], index[v],d) for u,v,d in edge_list if u in index and v in index))
new_row,new_col,new_data = all_coeff
n = len(order)
graph = csr_matrix((new_data, (new_row, new_col)), shape=(n, n))
This also works, but it falls into the same trap of a memory error for a large sparse matrix. Any suggestions for doing this efficiently?
I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse
def permute_sparse_matrix(M, new_row_order=None, new_col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix
    using the specified array(s) of indexes
    e.g., [1,0,2,3,...] would swap the first and second row/col.
    Row/col k of M ends up at position new_order[k]; for a permutation
    that is not its own inverse this differs from M[new_order, :].
    """
    if new_row_order is None and new_col_order is None:
        return M
    new_M = M
    if new_row_order is not None:
        I = sparse.eye(M.shape[0]).tocoo()
        I.row = I.row[new_row_order]
        new_M = I.dot(new_M)
    if new_col_order is not None:
        I = sparse.eye(M.shape[1]).tocoo()
        I.col = I.col[new_col_order]
        new_M = new_M.dot(I)
    return new_M
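If the goal is to reproduce exactly what M[order, :][:, order] returns in the question, the permutation matrix has to carry a 1 at (k, order[k]). The following is a minimal sketch under that assumption for a square matrix; the function name reorder_like_indexing is hypothetical and not part of SciPy.

```python
import numpy as np
from scipy import sparse

def reorder_like_indexing(M, order):
    """Sparse equivalent of M[order, :][:, order] for a square sparse M."""
    order = np.asarray(order)
    n = len(order)
    # P[k, order[k]] = 1, so (P @ M)[k, :] equals M[order[k], :].
    P = sparse.csr_matrix(
        (np.ones(n, dtype=M.dtype), (np.arange(n), order)), shape=(n, n)
    )
    # Multiplying by P.T on the right applies the same reordering to columns.
    return P @ M @ P.T

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = sparse.csr_matrix((data, (row, col)), shape=(3, 3))
R = reorder_like_indexing(M, [2, 0, 1])
print(R.toarray())
```

The permutation matrix stores only n values, so the extra memory is small compared with M itself.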
Let's think smart.
Instead of reordering the matrix, why don't you work directly on the row and column indexes that you provided at the start?
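So for example, with the order [2,0,1] each old index moves to its position in the order (0 becomes 1, 1 becomes 2, 2 becomes 0), so you can replace your row indexes in this way, from:
[0, 0, 1, 2, 2, 2]
to:
[1, 1, 2, 0, 0, 0]
And your column indexes, from:
[0, 2, 2, 0, 1, 2]
to:
[1, 0, 0, 1, 2, 0]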
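The index remapping can be sketched as follows. Note that to reproduce M[order, :][:, order] from the question, each old index has to be sent to its position in order, that is, through the inverse permutation. The variable names inv, new_row and new_col are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
order = np.array([2, 0, 1])

# inv[old] = new position of `old` after reordering, i.e. inv[order[k]] = k.
inv = np.empty_like(order)
inv[order] = np.arange(len(order))

# Relabel the coordinates; the data array is reused unchanged.
new_row = inv[row]
new_col = inv[col]
M_perm = csr_matrix((data, (new_row, new_col)), shape=(3, 3))
print(M_perm.toarray())
```

This only touches the coordinate arrays, so no intermediate reordered matrix is built.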
I am trying to get a fast dot product function for multiplying a sparse matrix (3*3) and an ndarray (1*3) in such a way that every row of the matrix is dotted with the ndarray to give a (3*1) array.
My current implementation takes each row of the matrix and then does the dot product, but as the matrix dimension scales up, it gets too slow.
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matrix = csr_matrix((data, (row, col)), shape=(3, 3))
secondArray = np.random.rand(1, 3)
for idx in range(matrix.shape[0]):
    # densify one row at a time and dot it with the 1-D vector
    X_arr = matrix.getrow(idx).toarray()
    prod = np.dot(X_arr[0], secondArray[0])
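One way to avoid the Python loop is a single sparse matrix-vector product, sketched below. It assumes the second array can be flattened to a 1-D vector of length 3; the names second_array, prods and looped are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matrix = csr_matrix((data, (row, col)), shape=(3, 3))
second_array = np.random.rand(1, 3)

# One sparse matrix-vector product: entry i is row i dotted with the vector.
prods = matrix.dot(second_array.ravel())

# Row-by-row reference, kept only to check the vectorised result.
looped = np.array([
    np.dot(matrix.getrow(i).toarray()[0], second_array[0])
    for i in range(matrix.shape[0])
])
print(np.allclose(prods, looped))
```

The product never densifies the matrix, which is what makes it suitable for larger inputs.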
I have a large sparse matrix in which each row contains multiple nonzero elements, for example
a = np.array([[1, 1, 0, 0, 0, 0], [2, 0, 1, 0, 2, 0], [3, 0, 4, 0, 0, 3]])
I want to be able to randomly select one nonzero element per row without a for loop. Any good suggestions? As output, I am more interested in the chosen elements' indices than their values.
With a numpy array such as:
arr = np.array([5, 2, 6, 0, 2, 0, 0, 6])
you can do arr != 0, which gives a True / False array marking the values that pass the condition, in our case the values that are not equal (!=) to 0. So:
array([ True, True, True, False, True, False, False, True], dtype=bool)
from here, we can 'index' arr with this boolean array by doing arr[arr != 0] which gives us:
array([5, 2, 6, 2, 6])
So now that we have a way of removing the zero values from a numpy array, we can do a simple list comprehension on each row in your a array. For each row, we remove the zeros and then perform a random.choice on what is left. Like so:
np.array([np.random.choice(r[r!=0]) for r in a])
which gives you back an array of length 3 containing random non-zero items from each row in a. :)
Hope this helps!
Update
If you want the indexes of the random non-zero numbers in the array, you can use .nonzero().
So if we have this array:
arr = np.array([5, 2, 6, 0, 2, 0, 0, 6])
we can do:
arr.nonzero()
which gives a tuple of the indexes of non-zero elements:
(array([0, 1, 2, 4, 7]),)
So, as before, we can use this and np.random.choice() in a list comprehension to produce random indexes:
a = np.array([[1, 1, 0, 0, 0, 0], [2, 0, 1, 0, 2, 0], [3, 0, 4, 0, 0, 3]])
np.array([np.random.choice(r.nonzero()[0]) for r in a])
which returns an array of the form [x, y, z] where x, y and z are random indexes of non-zero elements from their corresponding rows.
E.g. one result could be:
array([1, 4, 2])
And if you want it to also return the rows, you could just add in a numpy.arange() call on the length of a to get an array of row numbers:
([np.arange(len(a))], np.array([np.random.choice(r.nonzero()[0]) for r in a]))
so an example random output could be:
([array([0, 1, 2])], array([1, 2, 5]))
for a as:
array([[1, 1, 0, 0, 0, 0],
[2, 0, 1, 0, 2, 0],
[3, 0, 4, 0, 0, 3]])
Hope this does what you want now :)
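The list comprehensions above still loop over rows in Python. A loop-free sketch is shown below, assuming a dense array in which every row has at least one nonzero entry: draw a random score per cell, make the zero cells unable to win, and take the argmax of each row. This is a swapped-in technique rather than the approach described above, and the names rng, scores and cols are illustrative.

```python
import numpy as np

a = np.array([[1, 1, 0, 0, 0, 0],
              [2, 0, 1, 0, 2, 0],
              [3, 0, 4, 0, 0, 3]])

rng = np.random.default_rng()

# Independent uniform scores in [0, 1) for every cell.
scores = rng.random(a.shape)
# Zero cells get a score below every possible draw, so they never win.
scores[a == 0] = -1.0
# The largest score in each row marks a uniformly chosen nonzero column.
cols = scores.argmax(axis=1)
rows = np.arange(a.shape[0])
print(rows, cols, a[rows, cols])
```

Because the scores are independent and identically distributed, each nonzero entry in a row is equally likely to be picked.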
I have a large sparse matrix from scipy (300k x 100k with all binary values, mostly zeros). I would like to set the rows of this matrix to be an RDD and then do some computations on those rows - evaluate a function on each row, evaluate functions on pairs of rows, etc.
Key thing is that it's quite sparse and I don't want to explode the cluster - can I convert the rows to SparseVectors? Or perhaps convert the whole thing to SparseMatrix?
Can you give an example where you read in a sparse array, set up the rows as an RDD, and compute something from the cartesian product of those rows?
I had this issue recently--I think you can convert directly by constructing the SparseMatrix with the scipy csc_matrix attributes. (Borrowing from Yang Bryan)
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
# convert to pyspark SparseMatrix
sparse_matrix = Matrices.sparse(sv.shape[0],sv.shape[1],sv.indptr,sv.indices,sv.data)
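For the per-row case raised in the question, a minimal sketch is to slice each row out of the CSR arrays as a (size, indices, values) triple. The sketch below uses only SciPy; it assumes that each triple would then be passed to pyspark.mllib.linalg.SparseVector and the resulting list to sc.parallelize, a step that is not run here. The name row_triples is illustrative.

```python
import numpy as np
import scipy.sparse as sps

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))

# CSR stores each row contiguously, so rows can be sliced via indptr.
csr = sv.tocsr()
csr.sort_indices()

# One (size, column indices, values) triple per row, without densifying.
# Assumption: each triple maps onto SparseVector(size, indices, values).
row_triples = [
    (csr.shape[1],
     csr.indices[csr.indptr[i]:csr.indptr[i + 1]].tolist(),
     csr.data[csr.indptr[i]:csr.indptr[i + 1]].tolist())
    for i in range(csr.shape[0])
]
print(row_triples)
```

Only the stored entries are copied, so memory use stays proportional to the number of nonzeros.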
The only thing you have to do is call toarray() (note that this builds a dense array, so it only suits matrices that fit in memory):
import numpy as np
import scipy.sparse as sps
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
sv.toarray()
> array([[1, 0, 4],
> [0, 0, 5],
> [2, 3, 6]])
type(sv)
> <class 'scipy.sparse.csc.csc_matrix'>
# read sv as an RDD (sc is an existing SparkContext)
sv_rdd = sc.parallelize(sv.toarray())  # convert the sparse matrix to a dense array
sv_rdd.collect()
> [array([1, 0, 4]), array([0, 0, 5]), array([2, 3, 6])]
type(sv_rdd)
> <class 'pyspark.rdd.RDD'>
I have a matrix and a list of column indices that I want to select from the matrix for each row. How can I do that in numpy?
my_matrix = np.array([[1, 2], [4, 5]])
col_idx = np.array([1, 0])
selected = .... # selects 1st element of row 0 and 0th element of row 1.
print(selected)
# np.array([2, 4])
You can index with an arange of row numbers alongside the column indices:
In [11]: my_matrix[np.arange(my_matrix.shape[0]), col_idx]
Out[11]: array([2, 4])
np.choose is very useful for making these sorts of selections:
>>> np.choose(col_idx, my_matrix.T)
array([2, 4])
And on a larger matrix:
>>> my_matrix_2 = np.array([[1, 2], [4, 5], [3, 7], [4, 1]])
>>> col_idx_2 = np.array([1, 0, 0, 1])
>>> np.choose(col_idx_2, my_matrix_2.T)
array([2, 4, 3, 1])
The method returns a new array with the selected values (not a view of the original array).
There are more examples of this (initially slightly non-obvious) method in the documentation, but I'll explain what's happening using the second example above.
We're using np.choose to return a new array from an array of choices called my_matrix_2.T, where col_idx_2 specifies which row of the choice array we should pick from each time.
Notice we transpose my_matrix_2 for this to work:
# my_matrix_2.T
array([[1, 4, 3, 4], # row 0
[2, 5, 7, 1]]) # row 1
We have col_idx_2 = [1, 0, 0, 1]. Now stepping through this array one value at a time:
the first element of the new array will be the first element of row 1 of my_matrix_2.T. This is 2.
the second element of the new array will be the second element of row 0 of my_matrix_2.T. This is 4.
the third element of the new array will be the third element of row 0 of my_matrix_2.T. This is 3.
the fourth element of the new array will be the fourth element of row 1 of my_matrix_2.T. This is 1.
Hence the method returns array([2, 4, 3, 1]).
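The walk-through above can be checked with a short sketch that performs the same selection explicitly and compares it with np.choose and with plain integer indexing. np.take_along_axis is included as a further alternative that is not mentioned above; the variable names are illustrative.

```python
import numpy as np

my_matrix_2 = np.array([[1, 2], [4, 5], [3, 7], [4, 1]])
col_idx_2 = np.array([1, 0, 0, 1])

# Step through one value at a time: element i comes from row c of the
# transposed matrix, at position i.
stepwise = np.array([my_matrix_2.T[c, i] for i, c in enumerate(col_idx_2)])

via_choose = np.choose(col_idx_2, my_matrix_2.T)
via_indexing = my_matrix_2[np.arange(len(col_idx_2)), col_idx_2]
# take_along_axis needs the index array to have the same number of dimensions.
via_take = np.take_along_axis(my_matrix_2, col_idx_2[:, None], axis=1).ravel()

print(stepwise, via_choose, via_indexing, via_take)
```

All four expressions produce the same selection for this input.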
In [211]: M = np.array([[1, 2], [4, 5]])
In [212]: cid = [1, 0]
In [213]: np.array([M[i, j] for i, j in zip(range(M.shape[0]), cid)])
Out[213]: array([2, 4])