Is it possible to do operations on each row of a PyTorch MxN tensor, but only at certain indices (for instance nonzero) to save time?
I'm particularly interested in the case of M and N very large where only a few elements on each row aren't null.
(Toy example) From this large tensor:
Large = Tensor([[0, 1, 3, 0, 0, 0],
                [0, 0, 0, 0, 5, 0],
                [1, 0, 0, 5, 0, 1]])
I'd like to use something like the following smaller "tensor":
irregular_tensor = [[1, 3],
                    [5],
                    [1, 5, 1]]
and do the same exact computation on each row (for instance involving torch.cumsum and torch.exp) to obtain an output of size Mx1.
Is there a way to do that?
You might be interested in the Torch Sparse functionality. You can convert a PyTorch Tensor to a PyTorch Sparse tensor using the to_sparse() method of the Tensor class.
You can then access a tensor that contains all the indices in Coordinate format by the Sparse Tensor's indices() method, and a tensor that contains the associated values by the Sparse Tensor's values() method.
This also has the benefit of saving you memory in terms of storing the tensor.
There is some functionality for using other Torch functions on Sparse Tensors, but this is limited.
Also: be aware that this part of the API is still in Beta and subject to changes.
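For the toy tensor above, a minimal sketch of that conversion might look like this (output shown in comments; exact formatting may differ between PyTorch versions):
import torch

Large = torch.tensor([[0., 1., 3., 0., 0., 0.],
                      [0., 0., 0., 0., 5., 0.],
                      [1., 0., 0., 5., 0., 1.]])

sparse_t = Large.to_sparse()   # COO sparse tensor
idx = sparse_t.indices()       # shape (2, nnz): row and column coordinates
vals = sparse_t.values()       # shape (nnz,): the corresponding nonzero values

print(idx)
# tensor([[0, 0, 1, 2, 2, 2],
#         [1, 2, 4, 0, 3, 5]])
print(vals)
# tensor([1., 3., 5., 1., 5., 1.])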
The following reduces the tensor (a list of lists) using a list comprehension, which is fast.
# original large tensor
Large_tensor = [[0, 1, 3, 0, 0, 0],
                [0, 0, 0, 0, 5, 0],
                [1, 0, 0, 5, 0, 1]]
# size of tensor
imax = len(Large_tensor)
jmax = len(Large_tensor[0])
print(f'the tensor is of size {imax} x {jmax}')
small_tensor = [[x for x in row if x!=0] for row in Large_tensor]
# result
print('large tensor: ', Large_tensor)
print('small tensor: ', small_tensor)
The result is:
the tensor is of size 3 x 6
large tensor: [[0, 1, 3, 0, 0, 0], [0, 0, 0, 0, 5, 0], [1, 0, 0, 5, 0, 1]]
small tensor: [[1, 3], [5], [1, 5, 1]]
An alternative method reduces the tensor by iterating through the components of the large_tensor to create a small_tensor with non-zero values (as per the question).
# original large tensor
Large_tensor = [[0, 1, 3, 0, 0, 0],
                [0, 0, 0, 0, 5, 0],
                [1, 0, 0, 5, 0, 1]]
# example of how to reference tensor
i = 0
j = 2
imax = len(Large_tensor)
jmax = len(Large_tensor[0])
print(Large_tensor[i][j])
# the dimension of the tensor
print(f'the tensor is of size {imax} x {jmax}')
# empty list for the new small tensor
small_tensor = []
# process of reducing
for i in range(imax):
    small_tensor.append([])
    for j in range(jmax):
        if Large_tensor[i][j] != 0:
            small_tensor[i].append(Large_tensor[i][j])
            print(i, j)
# result
print('large tensor: ', Large_tensor)
print('small tensor: ', small_tensor)
The output is:
large tensor: [[0, 1, 3, 0, 0, 0], [0, 0, 0, 0, 5, 0], [1, 0, 0, 5, 0, 1]]
small tensor: [[1, 3], [5], [1, 5, 1]]
Conclusion:
Two fast methods are shown above. There are probably even faster methods using external modules such as NumPy and SciPy; a NumPy variant is sketched below.
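For example, a short NumPy-based sketch (hypothetical, not benchmarked here) could use a boolean mask per row:
import numpy as np

Large_tensor = [[0, 1, 3, 0, 0, 0],
                [0, 0, 0, 0, 5, 0],
                [1, 0, 0, 5, 0, 1]]

A = np.array(Large_tensor)
# keep only the nonzero entries of each row; rows keep their own lengths
small_tensor = [row[row != 0].tolist() for row in A]
print(small_tensor)   # [[1, 3], [5], [1, 5, 1]]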
I have a matrix in sparse CSR format, for example:
from scipy.sparse import csr_matrix
import numpy as np
row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
M = csr_matrix((data, (row, col)), shape=(3, 3))
M.A =
array([[1, 0, 2],
[0, 0, 3],
[4, 5, 6]])
I am re-ordering the matrix with the index [2,0,1] using the following approach:
order = np.array([2,0,1])
M = M[order,:]
M = M[:,order]
M.A
array([[6, 4, 5],
[2, 1, 0],
[3, 0, 0]])
This approach works, but it is not feasible for my real csr_matrix, which has a size of 16580746 x 1672751804 and causes a memory error.
I took another approach like this:
edge_list = zip(row, col, data)
index = dict(zip(order, range(len(order))))
all_coeff = zip(*((index[u], index[v],d) for u,v,d in edge_list if u in index and v in index))
new_row,new_col,new_data = all_coeff
n = len(order)
graph = csr_matrix((new_data, (new_row, new_col)), shape=(n, n))
This also works but falls into the same memory-error trap for a large sparse matrix. Any suggestions on how to do this efficiently?
I've found using matrix operations to be the most efficient. Here's a function which will permute the rows and/or columns to a specified order. It can be modified to swap two specific rows/columns if you would like.
from scipy import sparse
def permute_sparse_matrix(M, new_row_order=None, new_col_order=None):
    """
    Reorders the rows and/or columns in a scipy sparse matrix
    using the specified array(s) of indexes,
    e.g., [1, 0, 2, 3, ...] would swap the first and second row/col.
    """
    if new_row_order is None and new_col_order is None:
        return M
    new_M = M
    if new_row_order is not None:
        I = sparse.eye(M.shape[0]).tocoo()
        I.row = I.row[new_row_order]
        new_M = I.dot(new_M)
    if new_col_order is not None:
        I = sparse.eye(M.shape[1]).tocoo()
        I.col = I.col[new_col_order]
        new_M = new_M.dot(I)
    return new_M
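A hypothetical usage of the function above, on the 3x3 matrix M from the question (swapping the first two rows and columns, as in the docstring's example; the values come back as floats because sparse.eye defaults to float64):
# swap the first and second row and the first and second column
M2 = permute_sparse_matrix(M, new_row_order=[1, 0, 2], new_col_order=[1, 0, 2])
print(M2.toarray())
# [[0. 0. 3.]
#  [0. 1. 2.]
#  [5. 4. 6.]]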
Let's think smart.
Instead of reordering the matrix, why don't you work directly on the row and column indexes that you provided at the start?
So, since order = [2, 0, 1], old index 2 goes to new position 0, old index 0 to new position 1, and old index 1 to new position 2; in other words, you map every old index through the inverse of order. Your row indexes then change from:
[0, 0, 1, 2, 2, 2]
to:
[1, 1, 2, 0, 0, 0]
and your column indexes from:
[0, 2, 2, 0, 1, 2]
to:
[1, 0, 0, 1, 2, 0]
Building a new csr_matrix from the relabelled coordinates gives exactly M[order, :][:, order] without ever slicing the huge matrix, as sketched below.
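A minimal sketch of that idea, reusing the small row/col/data arrays from the question (inv is the inverse permutation):
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
order = np.array([2, 0, 1])

# invert the permutation: inv[old_index] = new_index
inv = np.empty_like(order)
inv[order] = np.arange(len(order))

# relabel the coordinates and build the reordered matrix directly
M_new = csr_matrix((data, (inv[row], inv[col])), shape=(len(order), len(order)))
print(M_new.toarray())
# [[6 4 5]
#  [2 1 0]
#  [3 0 0]]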
If I want to create a matrix, I simply call
m = np.matrix([[x00, x01],
[x10, x11]])
where x00, x01, x10 and x11 are numbers. However, I would like to vectorize this process. For example, if the x's are one-dimensional arrays with length l, then I would like m to become an array of matrices, or an l x 2 x 2-dimensional array. Unfortunately,
zeros = np.zeros(10)
ones = np.ones(10)
m = np.matrix([[zeros, ones],
[zeros, ones]])
raises an error ("matrix must be 2-dimensional") and
m = np.array([[zeros, ones],
[zeros, ones]])
gives a 2 x 2 x l-dimensional array instead. In order to solve this, I could call np.moveaxis(m, 2, 0), but I am looking for a direct solution that doesn't need to change the order of axes of a (potentially huge) array. This also only sets the axis order right if I'm passing one-dimensional arrays as values for my matrix, not if they're higher-dimensional.
Is there a general and efficient way of vectorizing the creation of matrices?
Let's try a 2d (4d after joining) case:
In [374]: ones = np.ones((3,4),int)
In [375]: arr = np.array([[ones*0, ones],[ones*2, ones*3]])
In [376]: arr
Out[376]:
array([[[[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]],
[[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1]]],
[[[2, 2, 2, 2],
[2, 2, 2, 2],
[2, 2, 2, 2]],
[[3, 3, 3, 3],
[3, 3, 3, 3],
[3, 3, 3, 3]]]])
In [377]: arr.shape
Out[377]: (2, 2, 3, 4)
Notice that the original array elements are 'together'. arr has its own databuffer, with copies of the original arrays, but it was made with relatively efficient block copies.
We can easily transpose axes:
In [378]: arr.transpose(2,3,0,1)
Out[378]:
array([[[[0, 1],
[2, 3]],
[[0, 1],
[2, 3]],
...
[[0, 1],
[2, 3]]]])
Now it's 12 (2,2) arrays. It is a view, using arr's databuffer. It just has a different shape and strides. Doing this transpose is quite efficient, and isn't any slower when arr is very big. And a lot of math on the transposed array will be nearly as efficient as on the original arr (because of strided iteration). If there are differences in speed, it will be because of caching at a deep level.
But some actions will require a copy. For example, the transposed array can't be raveled without a copy. The original 0s, 1s, etc. are no longer together.
In [379]: arr.transpose(2,3,0,1).ravel()
Out[379]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
0, 1, 2, 3])
I could construct the same 1d array with
In [380]: tarr = np.empty((3,4,2,2), int)
In [381]: tarr[...,0,0] = ones*0
In [382]: tarr[...,0,1] = ones*1
In [383]: tarr[...,1,0] = ones*2
In [384]: tarr[...,1,1] = ones*3
In [385]: tarr.ravel()
Out[385]:
array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1,
2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3,
0, 1, 2, 3])
This tarr is effectively what you are trying to produce 'directly'.
Another way to look at this construction is to assign the values to the array's .flat with strides: insert the 0s at every 4th slot, the 1s at the adjacent ones, etc.:
In [386]: tarr.flat[0::4] = ones*0
In [387]: tarr.flat[1::4] = ones*1
In [388]: tarr.flat[2::4] = ones*2
In [389]: tarr.flat[3::4] = ones*3
Here's another 'direct' way: use np.stack (a version of concatenate) to create a (3,4,4) array, which can then be reshaped:
np.stack((ones*0,ones*1,ones*2,ones*3),2).reshape(3,4,2,2)
That stack is, in essence:
In [397]: ones1 = ones[...,None]
In [398]: np.concatenate((ones1*0, ones1*1, ones1*2, ones1*3),axis=2)
Notice that this target (3,4,2,2) could be reshaped to (12,4) (and vice versa) at no cost. So the original problem becomes: is it easier to construct a (4,12) and transpose, or construct the (12,4) first? It's really a 2d problem, not an (m+n)d one.
np.matrix must be a 2D array. From the NumPy documentation of np.matrix:
Returns a matrix from an array-like object, or from a string of data.
A matrix is a specialized 2-D array that retains its 2-D nature
through operations. It has certain special operators, such as *
(matrix multiplication) and ** (matrix power).
Note
It is no longer recommended to use this class, even for linear
algebra. Instead use regular arrays. The class may be removed in the
future.
Is there any reason you want np.matrix? Most NumPy operations should be doable with the plain array object, as the matrix class is quasi-deprecated.
From your example I tried using the transpose (.T) method:
zeros = np.zeros(10)
ones = np.ones(10)
twos = np.ones(10) * 2
threes = np.ones(10) * 3
m = np.array([[zeros, ones], [twos, threes]]).T
>> array([[0,2],[1,3]],...)
or
m = np.transpose(np.array([[zeros, ones], [twos, threes]]), (2,0,1))
>> array([[0,1],[2,3]],...)
This yields a (10, 2, 2) array
I have a SciPy csr_matrix (a vector in this case) of 1 column and x rows. In it are float values which I need to convert to the discrete class labels -1, 0 and 1. This should be done with a threshold function which maps the float values to one of these 3 class labels.
Is there no way other than iterating over the elements as described in Iterating through a scipy.sparse vector (or matrix)? I would love to have some elegant way to just somehow map(thresholdfunc()) on all elements.
Note that while it is of type csr_matrix, it isn't actually sparse as it's just the return of another function where a sparse matrix was involved.
If you have an array, you can discretize based on some condition with the np.where function. e.g.:
>>> import numpy as np
>>> x = np.arange(10)
>>> np.where(x < 5, 0, 1)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
The syntax is np.where(BOOLEAN_ARRAY, VALUE_IF_TRUE, VALUE_IF_FALSE).
You can chain together two where statements to have multiple conditions:
>>> np.where(x < 3, -1, np.where(x > 6, 0, 1))
array([-1, -1, -1, 1, 1, 1, 1, 0, 0, 0])
To apply this to your data in the CSR or CSC sparse matrix, you can use the .data attribute, which gives you access to the internal array containing all the nonzero entries in the sparse matrix. For example:
>>> from scipy import sparse
>>> mat = sparse.csr_matrix(x.reshape(10, 1))
>>> mat.data = np.where(mat.data < 3, -1, np.where(mat.data > 6, 0, 1))
>>> mat.toarray()
array([[ 0],
[-1],
[-1],
[ 1],
[ 1],
[ 1],
[ 1],
[ 0],
[ 0],
[ 0]])
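If you want the map(thresholdfunc()) feel from the question, you could wrap this in a small helper (threshold_to_labels is a hypothetical name; the cut-offs below are just the ones from the example):
import numpy as np
from scipy import sparse

def threshold_to_labels(mat, low=3, high=6):
    """Hypothetical helper: map the stored values of a CSR/CSC matrix to
    the labels -1, 0 and 1 (values < low -> -1, values > high -> 0, else 1)."""
    out = mat.copy()
    out.data = np.where(out.data < low, -1, np.where(out.data > high, 0, 1))
    return out

x = np.arange(10)
labels = threshold_to_labels(sparse.csr_matrix(x.reshape(10, 1)))
print(labels.toarray().ravel())   # [ 0 -1 -1  1  1  1  1  0  0  0]
Entries that aren't stored in the sparse structure (implicit zeros) stay 0, which matches the behaviour of the snippet above.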
I have a large sparse matrix from scipy (300k x 100k with all binary values, mostly zeros). I would like to set the rows of this matrix to be an RDD and then do some computations on those rows - evaluate a function on each row, evaluate functions on pairs of rows, etc.
The key thing is that it's quite sparse and I don't want to blow up the cluster: can I convert the rows to SparseVectors? Or perhaps convert the whole thing to a SparseMatrix?
Can you give an example where you read in a sparse array, setup rows into an RDD, and compute something from the cartesian product of those rows?
I had this issue recently. I think you can convert directly by constructing the SparseMatrix from the scipy csc_matrix attributes (borrowing from Yang Bryan).
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
# convert to pyspark SparseMatrix
sparse_matrix = Matrices.sparse(sv.shape[0], sv.shape[1], sv.indptr, sv.indices, sv.data)
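If, as the question suggests, you want one SparseVector per row to parallelize as an RDD, a sketch along these lines should work without densifying the matrix (it assumes a SparkContext sc is available; row_as_sparse_vector is a hypothetical helper):
from pyspark.mllib.linalg import SparseVector

sv_csr = sv.tocsr()   # the row slicing below relies on the CSR layout
n_rows, n_cols = sv_csr.shape

def row_as_sparse_vector(i):
    # slice row i out of the CSR structure: indptr delimits each row's entries
    start, end = sv_csr.indptr[i], sv_csr.indptr[i + 1]
    return SparseVector(n_cols, sv_csr.indices[start:end], sv_csr.data[start:end])

rows_rdd = sc.parallelize([row_as_sparse_vector(i) for i in range(n_rows)])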
The only thing you have to do is call toarray():
import numpy as np
import scipy.sparse as sps
# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
sv.toarray()
> array([[1, 0, 4],
> [0, 0, 5],
> [2, 3, 6]])
type(sv)
<class 'scipy.sparse.csc.csc_matrix'>
#read sv as RDD
sv_rdd = sc.parallelize(sv.toarray())  # convert the sparse matrix to a dense array
sv_rdd.collect()
> [array([1, 0, 4]), array([0, 0, 5]), array([2, 3, 6])]
type(sv_rdd)
> <class 'pyspark.rdd.RDD'>