Create sparse RDD from scipy sparse matrix - python

I have a large sparse matrix from scipy (300k x 100k with all binary values, mostly zeros). I would like to set the rows of this matrix to be an RDD and then do some computations on those rows - evaluate a function on each row, evaluate functions on pairs of rows, etc.
The key thing is that it's quite sparse and I don't want to explode the cluster - can I convert the rows to SparseVectors? Or perhaps convert the whole thing to a SparseMatrix?
Can you give an example where you read in a sparse array, set up the rows as an RDD, and compute something from the cartesian product of those rows?

I had this issue recently. I think you can convert directly by constructing the SparseMatrix from the scipy csc_matrix attributes. (Borrowing from Yang Bryan)
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
# convert to pyspark SparseMatrix
sparse_matrix = Matrices.sparse(sv.shape[0],sv.shape[1],sv.indptr,sv.indices,sv.data)

The only thing you have to do is call toarray():
import numpy as np
import scipy.sparse as sps
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
sv.toarray()
> array([[1, 0, 4],
> [0, 0, 5],
> [2, 3, 6]])
type(sv)
> <class 'scipy.sparse.csc.csc_matrix'>
# read sv as an RDD
sv_rdd = sc.parallelize(sv.toarray())  # convert the sparse matrix to a dense array
sv_rdd.collect()
> [array([1, 0, 4]), array([0, 0, 5]), array([2, 3, 6])]
type(sv_rdd)
> <class 'pyspark.rdd.RDD'>
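Note that toarray() materializes every zero, so a 300k x 100k matrix would become a dense array of 3e10 entries. A minimal sketch of a sparser route, assuming the matrix can be converted to CSR: slice indptr/indices/data per row into (size, indices, values) triples, which is the argument order of pyspark.mllib.linalg.SparseVector. The helper name csr_rows_as_triples is hypothetical, and the Spark calls appear only as comments because they assume an existing SparkContext sc.

```python
import numpy as np
import scipy.sparse as sps


def csr_rows_as_triples(matrix):
    """Yield (n_cols, column_indices, values) for each row, without densifying."""
    csr = sps.csr_matrix(matrix)  # converts other formats (e.g. CSC) to CSR
    csr.sort_indices()            # keep column indices in increasing order
    n_cols = csr.shape[1]
    for start, stop in zip(csr.indptr[:-1], csr.indptr[1:]):
        yield (n_cols,
               csr.indices[start:stop].tolist(),
               csr.data[start:stop].tolist())


row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))

triples = list(csr_rows_as_triples(sv))

# With Spark available (assumption: `sc` is an existing SparkContext):
#   from pyspark.mllib.linalg import SparseVector
#   rows_rdd = sc.parallelize(triples).map(lambda t: SparseVector(*t))
#   dots = rows_rdd.cartesian(rows_rdd).map(lambda p: p[0].dot(p[1]))
```

Only the stored entries travel to the cluster this way, so the cost scales with the number of nonzeros rather than with rows x columns.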

Related

Python: tf*idf transformation of existing sparse matrix

Suppose you have this SciPy sparse matrix:
from scipy.sparse import coo_matrix
X = coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 1, 1, 2], [0, 1, 2, 1, 2, 2])), shape=(3, 3))
How can I get the tf*idf transformed sparse matrix?
It seems that sklearn.feature_extraction.text.TfidfTransformer could be a way, but whatever help I find starts from creating X from a text corpus. But I already have the M*N matrix X.
My versions are: scipy=1.9.1, sklearn=1.0.2
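TfidfTransformer does not need a corpus: its fit_transform accepts a count matrix such as X directly. To show what that amounts to, here is a minimal SciPy-only sketch, assuming scikit-learn's default settings (smoothed idf ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization) and assuming X stores no explicit zeros. The variable names are illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

X = coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 1, 1, 2], [0, 1, 2, 1, 2, 2])), shape=(3, 3))
counts = X.tocsr().astype(float)
n_docs, n_terms = counts.shape

# Document frequency: number of rows in which each column has a stored entry.
df = np.bincount(counts.indices, minlength=n_terms)
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf (assumed sklearn default)

tfidf = counts @ diags(idf)  # scale every column by its idf weight

# L2-normalize each row; empty rows are left as they are.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
row_norms[row_norms == 0] = 1
tfidf = (diags(1 / row_norms) @ tfidf).tocsr()
```

The result stays sparse throughout, with the same sparsity pattern as X.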

Re-ordering of the rows and columns in a CSR matrix

I have a matrix in a sparse csr format for example:
from scipy.sparse import csr_matrix
import numpy as np
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))
M.A
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I am re-ordering the matrix with the index [2,0,1] using the following approach:
order = np.array([2,0,1])
M = M[order,:]
M = M[:,order]
M.A
array([[6, 4, 5],
[2, 1, 0],
[3, 0, 0]])
This approach works but it is not feasible for my real csr_matrix which has the size of 16580746 X 1672751804 and causes memory error.
I took another approach like this:
edge_list = zip(row,col,data)
index = dict(zip(order, range(len(order))))
all_coeff = zip(*((index[u], index[v],d) for u,v,d in edge_list if u in index and v in index))
new_row,new_col,new_data = all_coeff
n = len(order)
graph = csr_matrix((new_data, (new_row, new_col)), shape=(n, n))
This also works, but it falls into the same memory-error trap for a large sparse matrix. Any suggestions for doing this efficiently?
I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse
def permute_sparse_matrix(M, new_row_order=None, new_col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix
    using the specified array(s) of indexes
    e.g., [1,0,2,3,...] would swap the first and second row/col.
    """
    if new_row_order is None and new_col_order is None:
        return M
    new_M = M
    if new_row_order is not None:
        I = sparse.eye(M.shape[0]).tocoo()
        I.row = I.row[new_row_order]
        new_M = I.dot(new_M)
    if new_col_order is not None:
        I = sparse.eye(M.shape[1]).tocoo()
        I.col = I.col[new_col_order]
        new_M = new_M.dot(I)
    return new_M
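One caveat that is easy to check on the question's 3x3 example: assigning the order to I.row sends row k of M to position order[k], which is the inverse of what M[order, :] does. The two agree when the permutation is its own inverse, such as a single swap. A minimal sketch that matches the fancy-indexing result instead, using a hypothetical helper permutation_matrix:

```python
import numpy as np
from scipy import sparse


def permutation_matrix(order):
    """Sparse P with P[k, order[k]] = 1, so that (P @ M)[k] == M[order[k]]."""
    order = np.asarray(order)
    n = order.size
    ones = np.ones(n, dtype=int)
    return sparse.csr_matrix((ones, (np.arange(n), order)), shape=(n, n))


row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = sparse.csr_matrix((data, (row, col)), shape=(3, 3))

order = np.array([2, 0, 1])
P = permutation_matrix(order)

# Rows are permuted by P on the left, columns by P.T on the right.
reordered = P @ M @ P.T
```

Here reordered.toarray() equals the [[6, 4, 5], [2, 1, 0], [3, 0, 0]] shown in the question. P holds only n nonzeros and both products are sparse, so nothing dense is ever built.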
Let's think smart.
Instead of reordering the matrix, why not work directly on the row and column indexes that you provided at the start?
With the order [2,0,1], old index 2 becomes 0, old index 0 becomes 1, and old index 1 becomes 2. So, for example, you can replace your row indexes in this way, from:
[0, 0, 1, 2, 2, 2]
to:
[1, 1, 2, 0, 0, 0]
And your column indexes, from:
[0, 2, 2, 0, 1, 2]
to:
[1, 0, 0, 1, 2, 0]
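A sketch of this index-remapping idea, assuming the goal is to reproduce M[order, :][:, order]: each old index order[k] must become k, which is the inverse permutation of order. Building that inverse as an array lets NumPy remap all indices at once instead of going through a Python dict and generator.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
order = np.array([2, 0, 1])

# Inverse permutation: inverse[order[k]] = k.
inverse = np.empty_like(order)
inverse[order] = np.arange(order.size)

# Vectorized remap of every stored coordinate.
new_row = inverse[row]
new_col = inverse[col]

n = order.size
reordered = csr_matrix((data, (new_row, new_col)), shape=(n, n))
```

For a matrix that already exists in CSR form, the row and col arrays can be obtained from M.tocoo(); the remap itself only touches the stored entries.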

Indexing and replacing values in sparse CSC matrix (Python)

I have a sparse CSC matrix, "A", in which I want to replace the first row with a vector that is all zeros, except for the first entry which is 1.
So far I am doing the inefficient version, e.g.:
import numpy as np
from scipy.sparse import csc_matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))
replace = np.zeros(3)
replace[0] = 1
A[0,:] = replace
A.eliminate_zeros()
But I'd like to do it with .indptr, .data, etc. As it is a CSC, I am guessing that this might be inefficient as well? In my exact problem, the matrix is 66000 X 66000.
For a CSR sparse matrix I've seen it done as
A.data[1:A.indptr[1]] = 0
A.data[0] = 1.0
A.indices[0] = 0
A.eliminate_zeros()
So, basically I'd like to do the same for a CSC sparse matrix.
Expected result: To do exactly the same as above, just more efficiently (applicable to very large sparse matrices).
That is, start with:
[1, 0, 4],
[0, 0, 5],
[2, 3, 6]
and replace the upper row with a vector that is as long as the matrix, is all zeros except for 1 at the beginning. As such, one should end with
[1, 0, 0],
[0, 0, 5],
[2, 3, 6]
And be able to do it for large sparse CSC matrices efficiently.
Thanks in advance :-)
You can do it with indptr and indices. First, note that you can construct your matrix directly from the indptr and indices arrays:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, indices, indptr), shape=(3,3))
To zero every element of the first row except the first one, you need to set the data values to zero wherever indices is zero (in CSC format, indices holds the row index of each stored value). In other words:
data[indices == 0] = 0
The line above sets all the stored elements of the first row to 0. To avoid zeroing the first element, we can do the following:
indices_tmp = indices == 0
indices_tmp[0] = False # keep data[0], which is the stored entry at (0, 0) in this example
data[indices_tmp] = 0
A = csc_matrix((data, indices, indptr), shape=(3,3))
A.eliminate_zeros() # drop the explicit zeros from the sparse structure
Hope it helps.
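A minimal sketch of the same idea applied to the matrix's own arrays, assuming A is in CSC format so that A.indices holds row indices. Unlike masking out data[0], it does not rely on entry (0, 0) being the first stored value.

```python
import numpy as np
from scipy.sparse import csc_matrix

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))

A.data[A.indices == 0] = 0  # zero every stored value whose row index is 0
A[0, 0] = 1                 # then set the first entry of that row
A.eliminate_zeros()         # remove the explicit zeros left behind
```

The mask is a single vectorized pass over the stored values, so the cost depends on the number of nonzeros and not on the 66000 x 66000 shape. If (0, 0) is not already stored, the assignment still works, but SciPy emits a SparseEfficiencyWarning because the sparsity structure changes.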

Faster dot product of CSR matrix and ndarray

I am trying to get a fast dot product function for multiplying a sparse matrix (3*3) and an ndarray (1*3), such that every row of the matrix is dotted with the ndarray to give a (3*1) array.
My current implementation takes each row of the matrix and then computes the dot product, but it gets too slow as the matrix dimensions scale up.
import numpy as np
from scipy.sparse import csr_matrix
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matrix = csr_matrix((data, (row, col)), shape=(3, 3))
secondArray = np.random.rand(1, 3)
for idx in range(matrix.shape[0]):
    X_arr = matrix.getrow(idx).toarray()  # dense (1, 3) copy of row idx
    prod = np.dot(X_arr[0], secondArray[0])
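A minimal sketch of a vectorized alternative: a CSR matrix can multiply a dense array directly, so the per-row loop collapses into a single dot call. The fixed vector below is an assumption, chosen so the result is easy to check by hand.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
matrix = csr_matrix((data, (row, col)), shape=(3, 3))

second_array = np.array([[1.0, 2.0, 3.0]])  # shape (1, 3)

# One sparse matrix-vector product instead of a Python loop over rows.
result = matrix.dot(second_array.T)         # dense ndarray, shape (3, 1)
# The three row dot products are 7, 9 and 32.
```

No row is ever densified, and the work is proportional to the number of stored entries.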

Numpy: vectorize matrix creation

If I want to create a matrix, I simply call
m = np.matrix([[x00, x01],
[x10, x11]])
, where x00, x01, x10 and x11 are numbers. However, I would like to vectorize this process. For example, if the x's are one-dimensional arrays with length l, then I would like m to become an array of matrices, or a lx2x2-dimensional array. Unfortunately,
zeros = np.zeros(10)
ones = np.ones(10)
m = np.matrix([[zeros, ones],
[zeros, ones]])
raises an error ("matrix must be 2-dimensional") and
m = np.array([[zeros, ones],
[zeros, ones]])
gives a 2x2xl-dimensional array instead. In order to solve this, I could call np.moveaxis(m, 2, 0), but I am looking for a direct solution that doesn't need to change the order of axes of a (potentially huge) array. This also only sets the axis order right if I'm passing one-dimensional arrays as values for my matrix, not if they're higher dimensional.
Is there a general and efficient way of vectorizing the creation of matrices?
Let's try a 2d (4d after joining) case:
In [374]: ones = np.ones((3,4),int)
In [375]: arr = np.array([[ones*0, ones],[ones*2, ones*3]])
In [376]: arr
Out[376]:
array([[[[0, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]],
        [[1, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 1]]],
       [[[2, 2, 2, 2],
         [2, 2, 2, 2],
         [2, 2, 2, 2]],
        [[3, 3, 3, 3],
         [3, 3, 3, 3],
         [3, 3, 3, 3]]]])
In [377]: arr.shape
Out[377]: (2, 2, 3, 4)
Notice that the original array elements are 'together'. arr has its own databuffer, with copies of the original arrays, but it was made with relatively efficient block copies.
We can easily transpose axes:
In [378]: arr.transpose(2,3,0,1)
Out[378]:
array([[[[0, 1],
         [2, 3]],
        [[0, 1],
         [2, 3]],
        ...
        [[0, 1],
         [2, 3]]]])
Now it's 12 (2,2) arrays. It is a view, using arr's databuffer. It just has a different shape and strides. Doing this transpose is quite efficient, and isn't any slower when arr is very big. And a lot of math on the transposed array will be nearly as efficient as on the original arr (because of strided iteration). If there are differences in speed, they will come from caching at a deep level.
But some actions will require a copy. For example the transposed array can't be raveled without a copy. The original 0s,1s etc are no longer together.
In [379]: arr.transpose(2,3,0,1).ravel()
Out[379]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
       2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
       0, 1, 2, 3])
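The view-versus-copy distinction can be checked directly. A small sketch using np.shares_memory and the contiguity flag; the variable names are illustrative:

```python
import numpy as np

ones = np.ones((3, 4), int)
arr = np.array([[ones * 0, ones], [ones * 2, ones * 3]])  # shape (2, 2, 3, 4)

view = arr.transpose(2, 3, 0, 1)  # shape (3, 4, 2, 2), no data copied

shares_buffer = np.shares_memory(arr, view)  # True: same databuffer, new strides
is_contiguous = view.flags["C_CONTIGUOUS"]   # False: elements are not in C order

flat = view.ravel()                          # must copy to lay elements out in order
flat_shares = np.shares_memory(arr, flat)    # False: ravel produced a new buffer
```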
I could construct the same 1d array with
In [380]: tarr = np.empty((3,4,2,2), int)
In [381]: tarr[...,0,0] = ones*0
In [382]: tarr[...,0,1] = ones*1
In [383]: tarr[...,1,0] = ones*2
In [384]: tarr[...,1,1] = ones*3
In [385]: tarr.ravel()
Out[385]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
       2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
       0, 1, 2, 3])
This tarr is effectively what you are trying to produce 'directly'.
Another way to look at this construction is to assign the values to the array's .flat with strides: insert 0s at every 4th slot, 1s at the adjacent ones, etc.:
In [386]: tarr.flat[0::4] = ones*0
In [387]: tarr.flat[1::4] = ones*1
In [388]: tarr.flat[2::4] = ones*2
In [389]: tarr.flat[3::4] = ones*3
Here's another 'direct' way - use np.stack (a version of concatenate) to create a (3,4,4) array, which can then be reshaped:
np.stack((ones*0,ones*1,ones*2,ones*3),2).reshape(3,4,2,2)
That stack is, in essence:
In [397]: ones1 = ones[...,None]
In [398]: np.concatenate((ones1*0, ones1*1, ones1*2, ones1*3),axis=2)
Notice that this target (3,4,2,2) could be reshaped to (12,4) (and vice versa) at no cost. So the original problem becomes: is it easier to construct a (4,12) and transpose, or construct the (12,4) first? It's really a 2d problem, not a (m+n)d one.
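The np.stack route generalizes to inputs of any common shape. A minimal sketch, with a hypothetical helper name stack_as_matrices, that stacks along a new last axis and reshapes that axis into (2, 2):

```python
import numpy as np


def stack_as_matrices(x00, x01, x10, x11):
    """Return an array of shape x00.shape + (2, 2) holding one 2x2 matrix per element."""
    stacked = np.stack((x00, x01, x10, x11), axis=-1)  # shape (..., 4)
    return stacked.reshape(stacked.shape[:-1] + (2, 2))


zeros = np.zeros(10)
ones = np.ones(10)
m = stack_as_matrices(zeros, ones, ones * 2, ones * 3)  # shape (10, 2, 2)

grid = np.ones((3, 4), int)
m4 = stack_as_matrices(grid * 0, grid, grid * 2, grid * 3)  # shape (3, 4, 2, 2)
```

Because the matrix axes are created last, no axis reordering is needed afterwards, and the reshape of the freshly stacked array is free.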
np.matrix must be a 2D array. From the numpy documentation of np.matrix:
Returns a matrix from an array-like object, or from a string of data.
A matrix is a specialized 2-D array that retains its 2-D nature
through operations. It has certain special operators, such as *
(matrix multiplication) and ** (matrix power).
Note
It is no longer recommended to use this class, even for linear
algebra. Instead use regular arrays. The class may be removed in the
future.
Is there any reason you want np.matrix? Most numpy operations should be doable in the array object as the matrix class is quasi-deprecated.
From your example I tried using the transpose (.T) method:
zeros = np.zeros(10)
ones = np.ones(10)
twos = np.ones(10) * 2
threes = np.ones(10) * 3
m = np.array([[zeros, ones], [twos, threes]]).T
>> array([[0,2],[1,3]],...)
or
m = np.transpose(np.array([[zeros, ones], [twos, threes]]), (2,0,1))
>> array([[0,1],[2,3]],...)
This yields a (10, 2, 2) array
