Pandas sparse dataframe multiplication

I have two pandas sparse dataframes, big_sdf and bigger_sdf.
When I try to multiply them:
result = big_sdf # bigger_sdf
I get an error:
"numpy.core._exceptions.MemoryError: Unable to allocate 3.6 TiB for an array with shape (160815, 3078149) and data type int64"
So I tried to convert these sparse dataframes to SciPy's CSR matrices and multiply those, but the conversion doesn't succeed:
from scipy.sparse import csr_matrix
csr_big = csr_matrix(big_sdf)
csr_bigger = csr_matrix(bigger_sdf)
When I run the last line I get an error message:
"ValueError: unrecognized csr_matrix constructor usage"
It only happens for the bigger matrix; the smaller one is converted successfully.
Any ideas? Maybe there's a Pandas native method to multiply sparse dataframes which I missed?
Thanks in advance!
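
One possible direction, sketched under assumptions: csr_matrix(DataFrame) likely falls back to building a dense NumPy array first, which would explain why only the bigger frame fails. pandas exposes DataFrame.sparse.to_coo(), which hands the data to SciPy without densifying. The small random frames below are stand-ins for big_sdf and bigger_sdf; their shapes and contents are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Small stand-ins for big_sdf and bigger_sdf (shapes and values are assumptions).
left = sparse.random(4, 6, density=0.3, format="csr", random_state=0)
right = sparse.random(6, 5, density=0.3, format="csr", random_state=1)
big_sdf = pd.DataFrame.sparse.from_spmatrix(left)
bigger_sdf = pd.DataFrame.sparse.from_spmatrix(right)

# Convert through the .sparse accessor instead of csr_matrix(df),
# so no dense intermediate array is created.
csr_big = big_sdf.sparse.to_coo().tocsr()
csr_bigger = bigger_sdf.sparse.to_coo().tocsr()

# Sparse matrix product; the result stays sparse.
product = csr_big @ csr_bigger

# Wrap the result back into a sparse DataFrame if needed.
result = pd.DataFrame.sparse.from_spmatrix(product)
```

Whether the product itself fits in memory depends on how many nonzeros it ends up with, so this only removes the dense conversion step.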

Related

petsc4py converting rectangular numpy matrix to petsc matrix

The code below works only if input is a square numpy matrix (eg np.eye(3)) but not if input is a rectangular matrix.
import numpy as np
from petsc4py import PETSc
input = np.array([[0,1,0,0],[1,0,0,0],[0,0,0,1]])
#input = np.eye(3)
indices = np.nonzero(input)
A = PETSc.Mat().create()
A.setSizes(input.shape)
A.setType("aij")
A.setUp()
# First arg is list of row indices, second list of column indices
A.setValues(list(indices[0]),list(indices[1]), input)
A.assemble()
If I run the above I get the error message:
ValueError: incompatible array sizes: ni=3, nj=3, nv=12
Do PETSc matrices have to be square, or can I modify the code above to make this work?
I tried transposing input.shape, but that did not help.
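
The error message suggests that setValues treats its arguments as a dense block: ni row indices, nj column indices, and ni*nj values. The NumPy-only sketch below just reproduces that size check; block_sizes_match is a hypothetical helper, not part of petsc4py, and reading the check this way is an assumption based on the error text.

```python
import numpy as np

dense = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1]])


def block_sizes_match(rows, cols, values):
    """Hypothetical check: can `values` fill a len(rows) x len(cols) block?"""
    return len(rows) * len(cols) == np.size(values)


# Indices of the nonzeros: 3 rows and 3 columns, but 12 values are passed.
rows_nz, cols_nz = np.nonzero(dense)
nonzero_ok = block_sizes_match(rows_nz, cols_nz, dense)

# All row and column indices: 3 * 4 == 12, so the sizes agree.
all_rows = np.arange(dense.shape[0])
all_cols = np.arange(dense.shape[1])
full_ok = block_sizes_match(all_rows, all_cols, dense)

# np.eye(3) passes by coincidence: 3 nonzeros give 3 * 3 == 9 values.
eye = np.eye(3)
eye_rows, eye_cols = np.nonzero(eye)
eye_ok = block_sizes_match(eye_rows, eye_cols, eye)
```

If that reading is right, passing the full index ranges (all_rows, all_cols) together with the whole array would make the sizes consistent for rectangular input.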

Memory error when converting matrix to sparse matrix, specified dtype is invalid

Purpose:
I want to invert (bitwise-complement) a dense matrix and then convert it to a sparse matrix, but the conversion raises a memory error.
Description:
The original CSR sparse matrix (only 1 and 0) is:
<5910x4403279 sparse matrix of type '<class 'numpy.int64'>' with 73906823 stored elements in Compressed Sparse Row format>
I want to calculate the Tversky similarity between rows. To do that, I convert the sparse matrix to dense, invert the dense matrix, and then use matrix * invert_matrix.T to calculate the relative complement.
https://en.wikipedia.org/wiki/Tversky_index
So I invert the dense matrix after changing the dtype to bool:
bool_mat = mat.astype(bool)
invert_arr = ~(bool_mat.todense())
invert_arr = invert_arr.astype("uint8")
invert_mat = np.asmatrix(invert_arr, dtype="uint8")
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
However, when I convert invert_mat to a CSR matrix with dtype="uint8", a memory error is raised. (I have tried dtype=int8, bool and uint8, but the program still throws a memory error.)
Error
Traceback (most recent call last):
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 131, in <module>
tversjt_similarities(data)
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 88, in tversjt_similarities
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py", line 86, in __init__
self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/coo.py", line 189, in __init__
self.row, self.col = M.nonzero()
numpy.core._exceptions.MemoryError: Unable to allocate 387. GiB for an array with shape (25949472067, 2) and data type int64
Problem
The problem is: I have specified dtype='uint8', but the dtype in the error message is int64, and int64 requires more memory.
I searched for related issues and found this explanation: numpy automatically converts int8 to int64.
int8 scipy sparse matrix creation errors creating int64 structure?
The core of this package was developed for linear algebra work (e.g. finite element ODE solutions). The csr format in particular is optimized for matrix multiplication. It uses compiled code (cython), which uses the standard c data types - integers, floats and doubles. In numpy terms that means int64, float32 and float64. Selected formats accept other dtypes like int8, but maintaining that dtype during conversions to other formats and calculations is difficult.
I know that the invert_mat is very dense, because there are so many 1s.
But is there a way to bypass or solve this problem?
I would be very grateful if anyone could give some advice
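
One way to sidestep the dense inverse, offered as a sketch rather than a tested solution for the real data: for 0/1 rows, the relative complement size is |X \ Y| = |X| - |X ∩ Y|, so only the row sums and the product mat @ mat.T are needed. The function name tversky_similarity and the alpha/beta defaults are illustrative assumptions.

```python
import numpy as np
from scipy import sparse


def tversky_similarity(mat, alpha=0.5, beta=0.5):
    """Pairwise Tversky index between the rows of a 0/1 sparse matrix."""
    mat = sparse.csr_matrix(mat, dtype=np.int64)
    # |X ∩ Y| for every pair of rows; the result is only n_rows x n_rows.
    common = (mat @ mat.T).toarray()
    # |X| for every row.
    sizes = np.asarray(mat.sum(axis=1)).ravel()
    # |X \ Y| = |X| - |X ∩ Y| and |Y \ X| = |Y| - |X ∩ Y|, no inverse needed.
    only_x = sizes[:, None] - common
    only_y = sizes[None, :] - common
    denom = common + alpha * only_x + beta * only_y
    out = np.zeros(denom.shape, dtype=float)
    # Leave 0 where both rows are empty to avoid dividing by zero.
    np.divide(common, denom, out=out, where=denom > 0)
    return out


demo = sparse.csr_matrix(np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]))
sim = tversky_similarity(demo)
```

For 5910 rows the pairwise arrays are 5910 x 5910, which is far smaller than the 5910 x 4403279 inverted matrix.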

How can I do same calculation in numpy array list at once?

I want to normalize my list containing 2D NumPy arrays of different sizes with axis=1.
My NumPy array list is like below:
First, I tried to get the mean and std using numpy.hstack.
That step succeeded, but the following operation raised an error:
mean,std = get_mean_std(uttrSpectro) # get_mean_std : my function which returns mean and std
uttrSpectro = (uttrSpectroList-mean)/std
Error Message : could not broadcast input array from shape (801,284) into shape (801)
How can I do this calculation without for-loop?
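
A possible loop-free approach, assuming every array has the same number of rows and that the mean and std are taken per row over all arrays together: normalize the hstack-ed array once, then cut it back into the original widths with np.split. The random arrays and the name uttr_spectro_list are stand-ins for the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: same row count, different widths (an assumption).
uttr_spectro_list = [rng.normal(size=(5, width)) for width in (3, 7, 4)]

# Concatenate along axis=1 and compute per-row statistics once.
stacked = np.hstack(uttr_spectro_list)
mean = stacked.mean(axis=1, keepdims=True)
std = stacked.std(axis=1, keepdims=True)
normalized = (stacked - mean) / std

# Split the normalized block back into the original column widths.
widths = [arr.shape[1] for arr in uttr_spectro_list]
split_points = np.cumsum(widths)[:-1]
normalized_list = np.split(normalized, split_points, axis=1)
```

Subtracting directly from the list fails because NumPy first tries to pack the ragged list into a single array, which is where the broadcast error comes from.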

Memory issues with creating an adjacency matrix using Coo-matrix

Hi, I am trying to generate an adjacency matrix with a dimension of about 24,000 from a CSV with two columns showing combinations of pairs of genes and a column of 1's to indicate a present interaction. My goal is to have it be square and populated with zeros for combinations not in the two columns.
I am using the following Python script:
import numpy as np
from scipy.sparse import coo_matrix
l, c, v = np.loadtxt("biogrid2.csv", dtype=(int), skiprows=0, delimiter=",").T[:3, :]
m =coo_matrix((l, (v-1, c-1)), shape=(v.max(), c.max()))
m.toarray()
and it runs fine until it hits the following error:
File "/home/charlie/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 1184, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Any ideas about how to get around the memory limit in SciPy?
Thanks

Most likely what you want isn't m.toarray() but m.tocsr(). A CSR matrix can do simple linear algebra (like .dot() and matrix powers) natively. For instance, this works:
m = m.tocsr()  # tocsr() returns a new matrix; it does not convert in place
random_walk_2 = m.dot(m)
n = 3  # number of steps (example value)
random_walk_n = m ** n
# see https://stackoverflow.com/questions/28702416/matrix-power-for-sparse-matrix-in-python
Covariance should be implementable as well, but I'm not sure what the specific implementation would be without seeing what your current process is.
EDIT: To turn the output back into a simpler format to write out to CSV, you can follow up by returning to COO with .tocoo():
m = m.tocoo()  # tocoo() also returns a new matrix
out = np.c_[m.data, m.row, m.col].T
np.savetxt("foo.csv", out, delimiter=",")
# see https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file

The function toarray() converts your 24000*24000 sparse matrix (coo_matrix) into a dense 24000*24000 array, which (assuming 4-byte integers) needs at least
24000*24000*4 bytes = about 2.3 GB (2.15 GiB). With 8-byte integers, the default int on most 64-bit Linux systems, it is twice that.
To avoid using so much memory, avoid converting to a dense matrix (with toarray()) and do your operations on the sparse matrix.
If you need the matrix product of m with itself, you can just do m*m (or m.dot(m)); m.multiply(m) squares each element instead. Either way you get a sparse matrix.
To save your matrix you have several options.
The simplest one is NPZ; see https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.save_npz.html or "Save / load scipy sparse csr_matrix in portable data format".
If you want your result in the same form as your initial CSV file, coo_matrix has the attributes
data COO format data array of the matrix
row COO format row index array of the matrix
col COO format column index array of the matrix
see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
which can be used to create the CSV file.
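
Putting the two answers together, a minimal sketch might look like the following. The tiny edge list stands in for biogrid2.csv, and the column meaning (two 1-based gene ids per row, weight 1) is an assumption taken from the question's description.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical edge list standing in for the CSV: gene_a, gene_b are 1-based ids.
gene_a = np.array([1, 2, 4])
gene_b = np.array([2, 3, 1])
weights = np.ones(len(gene_a), dtype=np.int8)

# Use one common size so the adjacency matrix is square.
n = int(max(gene_a.max(), gene_b.max()))
adj = coo_matrix((weights, (gene_a - 1, gene_b - 1)), shape=(n, n)).tocsr()

# Sparse matrix product: entry (i, j) counts two-step paths from i to j.
two_step = adj @ adj

# Size a dense int64 copy would need, for comparison with adj.nnz stored values.
dense_bytes = n * n * np.dtype(np.int64).itemsize
```

For n around 24,000, dense_bytes comes to roughly 4.6 GB, while the sparse matrix only stores the interactions that are actually present.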

"Killed: 9" error when trying to construct a Scipy csr_matrix from a large NumPy array

I'm trying to solve a Markov chain problem in which the transition matrix has about 150,000 rows and columns but is sparse (only about 450,000 elements are nonzero).
I notice that trying to construct a csr_matrix from an np.zeros array of that size leads to a Killed: 9 error:
In [139]: N = 150000
In [140]: T = np.zeros((N, N))
In [142]: import scipy.sparse
In [143]: _T = scipy.sparse.csr_matrix(T)
Killed: 9
Is it possible to construct a csr_matrix of this size? Do I need to initiate the matrix T as a csr_matrix and dispense with NumPy arrays altogether?

Your process gets "Killed: 9" most likely because it uses too much memory and is terminated by the OS (signal 9 is SIGKILL): a dense 150,000 x 150,000 float64 array needs about 180 GB. As suggested in the comments, you can construct a sparse matrix directly using csr_matrix:
_T = scipy.sparse.csr_matrix((N,N))
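
To fill such a matrix without ever creating the dense array, one common pattern is to collect (row, column, value) triplets and build the matrix from them. The sketch below uses made-up transitions (three random targets per state, probability 1/3 each) purely as placeholder data.

```python
import numpy as np
from scipy import sparse

N = 150_000
rng = np.random.default_rng(0)

# Placeholder transitions: each state moves to 3 randomly chosen states.
rows = np.repeat(np.arange(N), 3)
cols = rng.integers(0, N, size=rows.size)
vals = np.full(rows.size, 1.0 / 3.0)

# Build from triplets; duplicate (row, col) pairs are summed on conversion.
T = sparse.coo_matrix((vals, (rows, cols)), shape=(N, N)).tocsr()

# For comparison: bytes needed by np.zeros((N, N)) with float64 entries.
dense_bytes = N * N * 8
```

Only the roughly 450,000 stored entries take up memory here, compared with dense_bytes (about 180 GB) for the dense version.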
