Re-ordering of the rows and columns in a CSR matrix - python

I have a matrix in a sparse csr format for example:
from scipy.sparse import csr_matrix
import numpy as np
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))
M.A =
array([[1, 0, 2],
       [0, 0, 3],
       [4, 5, 6]])
I am re-ordering the matrix with the index [2,0,1] using the following approach:
order = np.array([2,0,1])
M = M[order,:]
M = M[:,order]
M.A
array([[6, 4, 5],
       [2, 1, 0],
       [3, 0, 0]])
This approach works, but it is not feasible for my real csr_matrix, which has a size of 16580746 x 1672751804 and causes a memory error.
I took another approach like this:
edge_list = zip(row, col, data)
index = dict(zip(order, range(len(order))))
all_coeff = zip(*((index[u], index[v],d) for u,v,d in edge_list if u in index and v in index))
new_row,new_col,new_data = all_coeff
n = len(order)
graph = csr_matrix((new_data, (new_row, new_col)), shape=(n, n))
This also works but falls into the same trap of a memory error for a large sparse matrix. Any suggestions on how to do this efficiently?

I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse
def permute_sparse_matrix(M, new_row_order=None, new_col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix
    using the specified array(s) of indexes
    e.g., [1,0,2,3,...] would swap the first and second row/col.
    """
    if new_row_order is None and new_col_order is None:
        return M
    new_M = M
    if new_row_order is not None:
        I = sparse.eye(M.shape[0]).tocoo()
        I.row = I.row[new_row_order]
        new_M = I.dot(new_M)
    if new_col_order is not None:
        I = sparse.eye(M.shape[1]).tocoo()
        I.col = I.col[new_col_order]
        new_M = new_M.dot(I)
    return new_M
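One thing to watch with the function above: it moves row k of M to position new_row_order[k], which is the opposite direction from fancy indexing with M[order, :] (the two only agree when the permutation is its own inverse, such as a plain swap). Below is a minimal sketch of the same permutation-matrix idea, assuming a square matrix and assuming you want exactly the result of M[order, :][:, order]. The helper name permute_like_indexing is hypothetical.

```python
import numpy as np
from scipy import sparse

def permute_like_indexing(M, order):
    """Return P @ M @ P.T, where P[i, order[i]] = 1.

    The result satisfies result[i, j] == M[order[i], order[j]],
    i.e. it matches M[order, :][:, order] for a square M.
    """
    n = M.shape[0]
    order = np.asarray(order)
    ones = np.ones(n, dtype=M.dtype)
    # P has a single 1 per row: row i of P selects row order[i] of M.
    P = sparse.csr_matrix((ones, (np.arange(n), order)), shape=(n, n))
    return P @ M @ P.T

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = sparse.csr_matrix((data, (row, col)), shape=(3, 3))
order = np.array([2, 0, 1])

reordered = permute_like_indexing(M, order)
print(reordered.toarray())
```

On the 3 x 3 example this reproduces the matrix shown in the question. Whether it fits in memory for the very large matrix there has not been checked.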

Let's think smart.
Instead of reordering the matrix, why don't you work directly on the row and column indexes that you provided at the start?
So for example, you can replace your row indexes in this way, from:
[0, 0, 1, 2, 2, 2]
to:
[2, 2, 0, 1, 1, 1]
And your column indexes, from:
[0, 2, 2, 0, 1, 2]
to:
[2, 1, 1, 2, 0, 1]
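The relabelling idea can be sketched in code. The lists above correspond to order[row] and order[col]. To reproduce the matrix from the question (the result of M[order, :][:, order]), the lookup has to go the other way, from an old index to its new position, which is what np.argsort(order) provides. This is a sketch that assumes the coordinate arrays are still available; the names position and relabelled are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
order = np.array([2, 0, 1])

# position[u] is the new index of old row/column u (the inverse of `order`).
position = np.argsort(order)

new_row = position[row]
new_col = position[col]
relabelled = csr_matrix((data, (new_row, new_col)), shape=(3, 3))
print(relabelled.toarray())
```

If only the CSR matrix exists, M.tocoo() exposes the same information as its .row, .col and .data attributes.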

Related

Diagonal array in numpy

If I have the array [[1,0,0],[0,1,0],[0,0,1]] (let's call it So) which is done as numpy.eye(3).
How can I set the elements below the diagonal to 2 and 3, like this: [[1,0,0],[2,1,0],[3,2,1]]? How can I assign a different set of values to the vectors (diagonals) of an array?
I know I could use numpy.concatenate to join 3 vectors and I know how to change rows/columns but I can't figure out how to change diagonals below the main diagonal.
I tried to do np.diagonal(So,-1)=2*np.diagonal(So,-1) to change the diagonal right below the main diagonal but I get the error message cannot assign to function call.
I would not start from numpy.eye but rather numpy.ones and use numpy.tril+cumsum to compute the next numbers on the lower triangle:
import numpy as np
np.tril(np.ones((3,3))).cumsum(axis=0).astype(int)
output:
array([[1, 0, 0],
       [2, 1, 0],
       [3, 2, 1]])
reversed output (from comment)
Assuming the array is square
n = 3
a = np.tril(np.ones((n,n)))
(a*(n+2)-np.eye(n)*n-a.cumsum(axis=0)).astype(int)
Output:
array([[1, 0, 0],
       [3, 1, 0],
       [2, 3, 1]])
Output for n=5:
array([[1, 0, 0, 0, 0],
       [5, 1, 0, 0, 0],
       [4, 5, 1, 0, 0],
       [3, 4, 5, 1, 0],
       [2, 3, 4, 5, 1]])
You can use np.fill_diagonal and slice the matrix so that the diagonal you want becomes the principal diagonal of the slice. This is a good solution if you want to fill in values other than 2 and 3:
import numpy as np
q = np.eye(3)
# to reach the first diagonal below the principal one,
# pass the view q[1:, :] (it is 2x3 rather than square, but fill_diagonal still works)
val = 2
np.fill_diagonal(q[1:, :], val)
# note that here you can use a single value 'val' or
# an array of values of the corresponding size
# np.fill_diagonal(q[1:, :], [2, 2])
# then you can do the same for the next diagonal down
np.fill_diagonal(q[2:, :], 3)
You could follow this approach:
import numpy as np
def func(n):
    return np.array([np.array(list(range(i, 0, -1)) + [0,] * (n - i)) for i in range(1, n + 1)])
func(3)
OUTPUT
array([[1, 0, 0],
       [2, 1, 0],
       [3, 2, 1]])
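Another way to look at all of the answers above: the value at position (i, j) depends only on the offset i - j. Here is a minimal sketch built on that observation, assuming you supply one value per diagonal, starting with the main one. The function name lower_band_values is hypothetical.

```python
import numpy as np

def lower_band_values(values):
    """Build an n x n lower-triangular array whose k-th sub-diagonal
    holds values[k] (values[0] fills the main diagonal)."""
    values = np.asarray(values)
    n = values.size
    i, j = np.indices((n, n))
    offset = i - j  # 0 on the main diagonal, k on the k-th sub-diagonal
    # Clip negative offsets so the lookup is valid, then zero the upper triangle.
    return np.where(offset >= 0, values[np.clip(offset, 0, n - 1)], 0)

result = lower_band_values([1, 2, 3])
print(result)
```

Passing [1, 3, 2] instead gives the "reversed" 3 x 3 variant shown earlier.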

How to modify a list if a matrix contain a value?

I have a list with values [5, 5, 5, 5, 5] and a matrix filled with 1s and 0s.
I want to build a new list as follows:
wherever the matrix contains a 1, add 2 to the corresponding value of v if the 1 is in the first row, and add 3 if it is in the second row.
example:
list:
v = [5,5,5,5,5]
matrix:
m = [[0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
final result:
v1 = [5,7,10,8,5]
Create a function that adds two array lines, taking 1D numeric arrays as its parameters. It loops through the arrays and returns a result array that holds the element-wise sum.
If your task requires it, add a check that the lines are of equal length and abort the function with an error if they are not.
Run this function on all of the matrix lines, and then run it on that result and the input array.
Hope I managed to be comprehensive enough
You can use NumPy package for efficient code.
import numpy as np
v = [5,5,5,5,5]
matrix = [[0, 1, 1, 0, 0],
          [0, 0, 1, 1, 0]]
weights = np.array([2,3])
w_matrix = np.multiply(matrix, weights[:, np.newaxis]).sum(axis=0)
v1 = v + w_matrix
classical python:
You can use a list comprehension:
factors = [2, 3]
to_add = [sum((A*B) for A,B in zip(factors,x)) for x in zip(*m)]
[a+b for a,b in zip(v, to_add)]
output: [5, 7, 10, 8, 5]
numpy:
That said, this is a perfect use case for numpy that is more efficient and less verbose:
import numpy as np
v = [5,5,5,5,5]
m = [[0, 1, 1, 0, 0], [0, 0, 1, 1, 0]]
factors = [2,3]
V = np.array(v)
M = np.array(m)
F = np.array(factors)
V+(M*F[:,None]).sum(0)
output: array([ 5, 7, 10, 8, 5])
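The multiply-then-sum step in both NumPy answers is a vector-matrix product, so it can also be written with the @ operator. A short sketch using the same v, m and factors as above:

```python
import numpy as np

v = np.array([5, 5, 5, 5, 5])
m = np.array([[0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0]])
factors = np.array([2, 3])

# factors @ m computes, for each column, the sum of factors[r] * m[r, column].
v1 = v + factors @ m
print(v1)
```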

Indexing and replacing values in sparse CSC matrix (Python)

I have a sparse CSC matrix, "A", in which I want to replace the first row with a vector that is all zeros, except for the first entry which is 1.
So far I am doing the inefficient version, e.g.:
import numpy as np
from scipy.sparse import csc_matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))
replace = np.zeros(3)
replace[0] = 1
A[0,:] = replace
A.eliminate_zeros()
But I'd like to do it with .indptr, .data, etc. As it is a CSC, I am guessing that this might be inefficient as well? In my exact problem, the matrix is 66000 X 66000.
For a CSR sparse matrix I've seen it done as
A.data[1:A.indptr[1]] = 0
A.data[0] = 1.0
A.indices[0] = 0
A.eliminate_zeros()
So, basically I'd like to do the same for a CSC sparse matrix.
Expected result: To do exactly the same as above, just more efficiently (applicable to very large sparse matrices).
That is, start with:
[1, 0, 4],
[0, 0, 5],
[2, 3, 6]
and replace the upper row with a vector that is as long as the matrix, is all zeros except for 1 at the beginning. As such, one should end with
[1, 0, 0],
[0, 0, 5],
[2, 3, 6]
And be able to do it for large sparse CSC matrices efficiently.
Thanks in advance :-)
You can do it with indptr and indices. First, suppose you construct your matrix from the indptr and indices parameters:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, indices, indptr), shape=(3,3))
To zero all elements in the first row except the first element of row 0, you need to set the data values to zero wherever indices is zero (in CSC format, indices stores the row index of each stored value). In other words:
data[indices == 0] = 0
The above line sets all the elements of the first row to 0. To avoid setting the first element to zero we can do the following instead:
indices_tmp = indices == 0
indices_tmp[0] = False # keep the first element in row 0 (assumes data[0] is the stored A[0, 0])
data[indices_tmp] = 0
A = csc_matrix((data, indices, indptr), shape=(3,3))
A.eliminate_zeros() # drop the explicit zeros from the stored structure
Hope it helps.
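The approach above relies on A[0, 0] already being stored as the first value with the right content. Here is a minimal sketch that avoids that assumption by zeroing every stored entry of row 0 and then adding a matrix that contains only the 1 at position (0, 0). The name unit is illustrative, and this has not been timed on a 66000 x 66000 matrix.

```python
import numpy as np
from scipy.sparse import csc_matrix

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
A = csc_matrix((data, (row, col)), shape=(3, 3))

# In CSC format, A.indices holds the row index of every stored value,
# so this mask selects all stored entries of row 0.
A.data[A.indices == 0] = 0
A.eliminate_zeros()

# Add a matrix whose only entry is a 1 at position (0, 0).
unit = csc_matrix(([1], ([0], [0])), shape=A.shape, dtype=A.dtype)
A = A + unit
print(A.toarray())
```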

Assigning to slices of 2D NumPy array

I want to assign 0 to different length slices of a 2d array.
Example:
import numpy as np
arr = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]])
idxs = np.array([0,1,2,0])
Given the above array arr and indices idxs, how can you assign to slices of different lengths, such that the result is:
arr = np.array([[0,2,3,4],
                [0,0,3,4],
                [0,0,0,4],
                [0,2,3,4]])
These don't work
slices = np.array([np.arange(i) for i in idxs])
arr[slices] = 0
arr[:, :idxs] = 0
You can use broadcasted comparison to generate a mask, and index into arr accordingly:
arr[np.arange(arr.shape[1]) <= idxs[:, None]] = 0
print(arr)
array([[0, 2, 3, 4],
       [0, 0, 3, 4],
       [0, 0, 0, 4],
       [0, 2, 3, 4]])
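To make the broadcasting step visible, here is the same mask built piece by piece. This is only an illustrative sketch: the intermediate names cols, limits, mask and zeroed are not part of the answer above, and np.where is used so that arr itself is left untouched.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4]] * 4)
idxs = np.array([0, 1, 2, 0])

cols = np.arange(arr.shape[1])   # shape (4,):   [0, 1, 2, 3]
limits = idxs[:, None]           # shape (4, 1): one limit per row
mask = cols <= limits            # broadcasts to shape (4, 4)

zeroed = np.where(mask, 0, arr)  # non-mutating alternative to arr[mask] = 0
print(mask.astype(int))
print(zeroed)
```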
This does the trick:
import numpy as np
arr = np.array([[1,2,3,4],
                [1,2,3,4],
                [1,2,3,4],
                [1,2,3,4]])
idxs = [0,1,2,0]
for i,j in zip(range(arr.shape[0]),idxs):
    arr[i,:j+1]=0
import numpy as np
arr = np.array([[1, 2, 3, 4],
                [1, 2, 3, 4],
                [1, 2, 3, 4],
                [1, 2, 3, 4]])
idxs = np.array([0, 1, 2, 0])
for i, idx in enumerate(idxs):
    arr[i,:idx+1] = 0
Here is a sparse solution that may be useful in cases where only a small fraction of places should be zeroed out:
>>> idx = idxs+1
>>> I = idx.cumsum()
>>> cidx = np.ones((I[-1],), int)
>>> cidx[0] = 0
>>> cidx[I[:-1]]-=idx[:-1]
>>> cidx=np.cumsum(cidx)
>>> ridx = np.repeat(np.arange(idx.size), idx)
>>> arr[ridx, cidx]=0
>>> arr
array([[0, 2, 3, 4],
       [0, 0, 3, 4],
       [0, 0, 0, 4],
       [0, 2, 3, 4]])
Explanation: We need to construct the coordinates of the positions we want to put zeros in.
The row indices are easy: we just need to go from 0 to 3 repeating each number to fill the corresponding slice.
The column indices start at zero and most of the time are incremented by 1. So to construct them we use cumsum on mostly ones. Only at the start of each new row we have to reset. We do that by subtracting the length of the corresponding slice such as to cancel the ones we have summed in that row.

Create sparse RDD from scipy sparse matrix

I have a large sparse matrix from scipy (300k x 100k with all binary values, mostly zeros). I would like to set the rows of this matrix to be an RDD and then do some computations on those rows - evaluate a function on each row, evaluate functions on pairs of rows, etc.
Key thing is that it's quite sparse and I don't want to explode the cluster - can I convert the rows to SparseVectors? Or perhaps convert the whole thing to SparseMatrix?
Can you give an example where you read in a sparse array, setup rows into an RDD, and compute something from the cartesian product of those rows?
I had this issue recently--I think you can convert directly by constructing the SparseMatrix with the scipy csc_matrix attributes. (Borrowing from Yang Bryan)
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
# convert to pyspark SparseMatrix
sparse_matrix = Matrices.sparse(sv.shape[0],sv.shape[1],sv.indptr,sv.indices,sv.data)
The only thing you have to do is call toarray():
import numpy as np
import scipy.sparse as sps
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
sv.toarray()
> array([[1, 0, 4],
> [0, 0, 5],
> [2, 3, 6]])
type(sv)
<class 'scipy.sparse.csc.csc_matrix'>
# read sv as an RDD
sv_rdd = sc.parallelize(sv.toarray()) # convert the sparse matrix to a dense array
sv_rdd.collect()
> [array([1, 0, 4]), array([0, 0, 5]), array([2, 3, 6])]
type(sv_rdd)
> <class 'pyspark.rdd.RDD'>
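Since toarray() produces a dense array, a 300k x 100k matrix may not fit in memory that way. As a hedged sketch of the row-wise alternative the question asks about, the generator below walks the CSR structure and yields one (size, indices, values) triple per row without densifying. The name iter_sparse_rows is hypothetical, and the block deliberately uses only scipy.

```python
import numpy as np
import scipy.sparse as sps

def iter_sparse_rows(matrix):
    """Yield (size, indices, values) for each row of a scipy sparse matrix."""
    csr = matrix.tocsr()
    csr.sort_indices()  # make column indices ascending within each row
    n_cols = csr.shape[1]
    for start, stop in zip(csr.indptr[:-1], csr.indptr[1:]):
        yield n_cols, csr.indices[start:stop].tolist(), csr.data[start:stop].tolist()

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))

rows = list(iter_sparse_rows(sv))
print(rows)
```

Each triple has the shape of the arguments accepted by pyspark.mllib.linalg.SparseVector(size, indices, values), so, assuming a SparkContext sc as in the answer above, something like sc.parallelize([SparseVector(*r) for r in rows]) should give an RDD of sparse rows. That last step is untested here.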
