Initialize high dimensional sparse matrix - python

I want to initialize a 300,000 x 300,000 sparse matrix using scipy, but it requires memory as if it were not sparse:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.1)
it gives the error:
MemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type float64
which is the same error as if I initialize using numpy:
np.random.normal(size=[300000, 300000])
Even when I go to a very low density, it reproduces the error:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.000000000001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.8/site-packages/scipy/sparse/construct.py", line 842, in rand
return random(m, n, density, format, dtype, random_state)
File ".../lib/python3.8/site-packages/scipy/sparse/construct.py", line 788, in random
ind = random_state.choice(mn, size=k, replace=False)
File "mtrand.pyx", line 980, in numpy.random.mtrand.RandomState.choice
File "mtrand.pyx", line 4528, in numpy.random.mtrand.RandomState.permutation
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
Is there a more memory-efficient way to create such a sparse matrix?

Just generate only what you need.
from scipy import sparse
import numpy as np
n, m = 300000, 300000
density = 0.00000001                         # fraction of cells that are non-zero
size = int(n * m * density)                  # number of non-zero entries to generate
rows = np.random.randint(0, n, size=size)    # random row indices
cols = np.random.randint(0, m, size=size)    # random column indices
data = np.random.rand(size)                  # random values
arr = sparse.csr_matrix((data, (rows, cols)), shape=(n, m))
This lets you build monster sparse arrays provided they're sparse enough to fit into memory.
>>> arr
<300000x300000 sparse matrix of type '<class 'numpy.float64'>'
with 900 stored elements in Compressed Sparse Row format>
This is probably how the sparse.rand constructor should work anyway. If any (row, col) pairs collide, the constructor adds the corresponding data values together, which is probably fine for most applications I can think of.
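As a quick toy check (my own example, not part of the original answer) that duplicate (row, col) pairs really are summed by the csr_matrix constructor:
from scipy import sparse
import numpy as np
rows = np.array([0, 0, 2])          # two entries share (row, col) = (0, 1)
cols = np.array([1, 1, 2])
data = np.array([1.0, 2.0, 5.0])
m = sparse.csr_matrix((data, (rows, cols)), shape=(3, 3))
print(m.toarray())
# [[0. 3. 0.]
#  [0. 0. 0.]
#  [0. 0. 5.]]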

Try passing a reasonable density argument, as shown in the docs. With roughly 90 billion cells (300,000 x 300,000), you need a density on the order of 0.00000001 or so:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.rand.html#scipy.sparse.rand

@hpaulj's comment is spot on. There is also a clue in the error message:
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
Note the reference to int64 rather than float64, and the 1-D array of length 90,000,000,000 = 300,000 x 300,000. It comes from an intermediate random-sampling step in the creation of the sparse matrix, which itself occupies a lot of memory.
Note that when creating any sparse matrix (irrespective of the format), you have to account for memory both for the non-zero values and for representing their positions in the matrix.
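As a rough back-of-the-envelope illustration of that accounting (my own arithmetic): in COO form each stored element needs a float64 value plus an int64 row and column index, roughly 24 bytes, so even without the intermediate sampling array a density of 0.1 is far beyond ordinary RAM:
nnz = int(300000 * 300000 * 0.1)     # non-zeros requested at density 0.1
bytes_per_entry = 8 + 8 + 8          # float64 value + int64 row + int64 col (COO)
print(nnz * bytes_per_entry / 1e9)   # about 216.0 GB just for the stored entries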

Memory error when converting matrix to sparse matrix, specified dtype is invalid

Purpose:
I want to invert the dense matrix first and then convert it to a sparse matrix, but it will report a memory error.
Description:
The original CSR sparse matrix (containing only 1s and 0s) is:
<5910x4403279 sparse matrix of type '<class 'numpy.int64'>' with 73906823 stored elements in Compressed Sparse Row format>
I want to calculate the Tversky similarity between rows. Thus, I need to convert the sparse matrix to dense, invert (logically negate) the dense matrix first, and then use matrix * invert_matrix.T to calculate the relative complement.
https://en.wikipedia.org/wiki/Tversky_index
So, I invert the dense matrix after changing the dtype to "bool":
bool_mat = mat.astype(bool)
invert_arr = ~(bool_mat.todense())
invert_arr = invert_arr.astype("uint8")
invert_mat = np.asmatrix(invert_arr, dtype="uint8")
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
However, when I convert invert_mat to a CSR matrix with dtype="uint8", a memory error is raised. (I have tried dtype=int8, bool, and uint8, but the program still throws a memory error.)
Error
Traceback (most recent call last):
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 131, in <module>
tversjt_similarities(data)
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 88, in tversjt_similarities
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py", line 86, in __init__
self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/coo.py", line 189, in __init__
self.row, self.col = M.nonzero()
numpy.core._exceptions.MemoryError: Unable to allocate 387. GiB for an array with shape (25949472067, 2) and data type int64
Problem
The problem is: I have specified dtype='uint8', but the dtype in the error message is int64, and int64 requires much more memory.
I have searched related issues and found the problem: numpy will automatically convert int8 to int64.
int8 scipy sparse matrix creation errors creating int64 structure?
The core of this package was developed for linear algebra work (e.g. finite element ODE solutions). The csr format in particular is optimized for matrix multiplication. It uses compiled code (cython), which uses the standard c data types - integers, floats and doubles. In numpy terms that means int64, float32 and float64. Selected formats accept other dtypes like int8, but maintaining that dtype during conversions to other formats and calculations is difficult.
I know that the invert_mat is very dense, because there are so many 1s.
But is there a way to bypass or solve this problem?
I would be very grateful if anyone could give some advice.
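One possible way to sidestep the dense complement entirely (a rough sketch, assuming the rows really are 0/1 indicator vectors): the set sizes needed for the Tversky index can be derived from sparse products alone, since |A \ B| = |A| - |A intersect B|.
import numpy as np
from scipy import sparse
# toy stand-in for the 5910 x 4403279 binary matrix
mat = sparse.random(100, 500, density=0.05, format="csr")
mat.data[:] = 1                                   # make it a 0/1 indicator matrix
mat = mat.astype(np.int64)
inter = (mat @ mat.T).toarray()                   # |A intersect B| for every pair of rows
sizes = np.asarray(mat.sum(axis=1)).ravel()       # |A| for each row
a_minus_b = sizes[:, None] - inter                # |A \ B|
b_minus_a = sizes[None, :] - inter                # |B \ A|
alpha = beta = 1.0                                # Tversky parameters (1, 1 gives Jaccard)
tversky = inter / (inter + alpha * a_minus_b + beta * b_minus_a)  # NaN if both rows are empty
Here inter stays a small n_rows x n_rows array (5910 x 5910 in this case), so the 4-million-column dense complement is never built.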

Import large .tiff file as sparse matrix

I have a large .tiff file (4.4 GB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I thought it would be better to import the file as a sparse matrix to save RAM. When I first open it as an np.array and then transform it into a sparse matrix using csr_matrix(), my kernel crashes. See the code below.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)
Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)
Can you tell where the crash occurs?
band = ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp) # if temp is already an array, you don't need this
csr_matrix(array)
If array is 4.4 GB with shape (79530, 54980):
In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594 # 4.4 GB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16 # 16% density
Out[63]: 699609504.0 # number of nonzero values
Creating a csr matrix requires doing np.nonzero(array) to get the indices. That produces 2 arrays of 0.7e9 * 8 bytes (about 5.6 GB) each, since the indices are 8-byte ints. The coo format actually requires those 2 index arrays plus about 0.7 GB for the nonzero values - about 12 GB in total. Converted to csr, the row index array is reduced to an indptr of 79530+1 elements - so about 7 GB (corrected for 8 bytes/element).
So at 16% density the sparse format is, at its best, still larger than the dense version.
Memory error when converting matrix to sparse matrix, specified dtype is invalid
is a recent case of a memory error which occurred in that nonzero step.
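Spelling out that arithmetic (my own restatement, assuming 1-byte pixel values and 8-byte indices):
nnz = int(79530 * 54980 * 0.16)               # about 0.7e9 non-zero values
coo_bytes = nnz * (8 + 8 + 1)                 # int64 row + int64 col + 1-byte value
csr_bytes = nnz * (8 + 1) + (79530 + 1) * 8   # int64 col indices + values + indptr
dense_bytes = 79530 * 54980 * 1               # the original dense array, 1 byte/element
print(coo_bytes / 1e9, csr_bytes / 1e9, dense_bytes / 1e9)
# about 11.9 GB (coo), 6.3 GB (csr), 4.4 GB (dense): sparse is larger than dense at 16%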
Assuming you know the size of your matrix, you can create an empty sparse matrix and then set only the valid values one by one.
from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix
ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
matrix_size = (1000, 1000)  # set your size
matrix = csr_matrix(matrix_size)
# fill in each valid value; valid_values is a placeholder for your own iterable of (row, col, value) triples
for i, j, your_value in valid_values:
    matrix[i, j] = your_value
Edit 1
If you don't know the size of your matrix, you should be able to check it like this:
from osgeo import gdal
ds = gdal.Open("file.tif")
width = ds.GetRasterXSize()
height = ds.GetRasterYSize()
matrix_size = (width, height)
Edit 2
I measured the matrix types suggested in the comments, filling them completely. This is how I measured memory usage.
size 500x500
matrix        empty size    full size     filling time
csr_matrix    2856          2992          477.67 s
dok_matrix    726           35807578      3.15 s
lil_matrix    8840          8840          0.54 s
size 1000x1000
matrix        empty size    full size     filling time
csr_matrix    4856          4992          7164.94 s
dok_matrix    726           150578858     12.81 s
lil_matrix    16840         16840         2.19 s
Probably the best solution would be to use lil_matrix.
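Following that conclusion, here is a rough sketch of the lil_matrix route (my own illustration, not from the answer): read the raster in blocks of rows with GDAL, store only the valid values, and convert to CSR at the end for calculations. The block size and the assumption that 0 marks an invalid pixel are placeholders.
from osgeo import gdal
import numpy as np
from scipy.sparse import lil_matrix

ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
height, width = ds.RasterYSize, ds.RasterXSize

matrix = lil_matrix((height, width), dtype=np.float32)
block = 1024                                        # rows per chunk (placeholder)
for y0 in range(0, height, block):
    nrows = min(block, height - y0)
    chunk = band.ReadAsArray(0, y0, width, nrows)   # (xoff, yoff, xsize, ysize)
    r, c = np.nonzero(chunk)                        # assumes 0 == invalid / no data
    for i, j in zip(r, c):
        matrix[y0 + i, j] = chunk[i, j]
matrix = matrix.tocsr()                             # CSR is better for arithmetic later
At this scale a per-element Python loop is still slow; if your scipy version supports fancy assignment on lil_matrix, assigning whole index arrays at once (matrix[y0 + r, c] = chunk[r, c]) is usually much faster.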

Numpy memory error

I'm running into a memory error issue with numpy. The following line of code seems to be the issue:
self.D_r = numpy.diag(1/numpy.sqrt(self.r))
Where self.r is a relatively small numpy array.
The interesting thing is that I monitored the memory usage and the process took up at most 3% of the RAM on the machine. So I'm thinking something is killing the script before all the RAM is taken up, because it expects that the process would use it all. If anybody has any ideas I would be very grateful.
Edit 1:
Here's the traceback:
Traceback (most recent call last):
File "/path_to_file/my_script.py", line 82, in <module>
mca_X = mca.mca(X)
File "/path_to_file/mca.py", line 54, in __init__
self.D_r = numpy.diag(1/numpy.sqrt(self.r.values))
File "/path_to_file/numpy/lib/twodim_base.py", line 302, in diag
res = zeros((n, n), v.dtype)
MemoryError
Running the script on KDD Cup 99 data (with one-hot-encoded nominal variables).
If the argument to np.diag() is a 1-D array, it creates a 2-D array using the 1-D array as the diagonal:
Signature: np.diag(v, k=0)
Parameters
v : array_like
If `v` is a 2-D array, return a copy of its `k`-th diagonal.
If `v` is a 1-D array, return a 2-D array with `v` on the `k`-th
diagonal.
This squares the number of elements, and hence the memory size, of the array.
If self.r is a modest 1-D array of a little more than 51,000 elements, that alone can trigger a memory error:
In [85]: a=np.diag(np.arange(5e4))
In [86]: a.shape
Out[86]: (50000, 50000)
In [88]: a.size*a.itemsize
Out[88]: 20000000000 # 20 GB
In [87]: a=np.diag(np.arange(5.1e4))
---------------------------------------------------------------------------
MemoryError
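If D_r is only ever used for multiplication, one possible alternative (my own suggestion, not from the question) is to keep the diagonal sparse with scipy.sparse.diags instead of building the full n x n dense array:
import numpy as np
from scipy import sparse

r = np.random.rand(51000) + 0.1            # stand-in for self.r (placeholder values)
D_r = sparse.diags(1 / np.sqrt(r))         # n x n diagonal matrix storing only n values
print(D_r.shape, D_r.nnz)                  # (51000, 51000) 51000
Products with D_r then never materialise the n x n diagonal densely.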

Performing PCA on large sparse matrix by using sklearn

I am trying to apply PCA to a huge sparse matrix. The following link says that RandomizedPCA from sklearn can handle a sparse matrix in scipy sparse format:
Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim
2
>>>X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check this with a quick example:
>>> import numpy as np
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
which suggests that 95000 of the entries are non-zero; however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (< 1e-15), but not exactly 0.
Which means, in short, that for a PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course cannot be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem of scikit-learn's implementation; it is simply the mathematical definition of their sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start with a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of the variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
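A rough sketch of that incremental strategy (my own illustration; the component counts and the 90% threshold are arbitrary placeholders):
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

X = sparse.random(10000, 2000, density=0.001, format="csr")   # toy sparse data
for k in (50, 100, 200):                        # grow the number of components
    svd = TruncatedSVD(n_components=k)
    svd.fit(X)
    kept = svd.explained_variance_ratio_.sum()  # fraction of variance explained
    print(k, round(kept, 3))
    if kept > 0.9:                              # stop once enough variance is kept
        break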
PCA(X) is SVD(X - mean(X)).
Even if X is a sparse matrix, X - mean(X) is always a dense matrix.
Thus, randomized SVD (TruncatedSVD) is not as efficient as a randomized SVD of a sparse matrix.
However, delayed evaluation,
delay(X - mean(X)),
can avoid expanding the sparse matrix X into the dense matrix X - mean(X).
Delayed evaluation enables efficient PCA of a sparse matrix using randomized SVD.
This mechanism is implemented in my package :
https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism :
https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons with existing methods show that this mechanism drastically reduces the required memory size:
https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent:
https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
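As a rough illustration of the same idea using plain scipy (my own sketch, not the author's package): wrap the centred matrix in a LinearOperator so that X - mean(X) is never formed explicitly, and feed it to a truncated sparse SVD.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

X = sparse.random(5000, 300, density=0.01, format="csr")
mean = np.asarray(X.mean(axis=0)).ravel()   # column means, a dense length-300 vector
ones = np.ones(X.shape[0])

def centered_matvec(v):
    v = np.ravel(v)
    return X @ v - ones * (mean @ v)        # (X - 1*mean) @ v, without densifying X

def centered_rmatvec(v):
    v = np.ravel(v)
    return X.T @ v - mean * (ones @ v)      # (X - 1*mean).T @ v

centered = LinearOperator(shape=X.shape, matvec=centered_matvec,
                          rmatvec=centered_rmatvec, dtype=np.float64)
U, s, Vt = svds(centered, k=10)             # truncated SVD of the centred data
X_pca = U * s                               # PCA scores for 10 components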

Python: L1-norm of a sparse non-square matrix

I have a problem when trying to compute the 1-norm of a sparse matrix. I am using the function scipy.sparse.linalg.onenormest, but it gives me an error because the operator can only act on square matrices.
Here is a code example:
from numpy import array
from scipy import sparse
from scipy.sparse.linalg import onenormest
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
onenormest(A)
this is the error:
Traceback (most recent call last):
File "<ipython console>", line 1, in <module>
File "C:\Python27\lib\site-packages\scipy\sparse\linalg\_onenormest.py", line 76, in onenormest
raise ValueError('expected the operator to act like a square matrix')
ValueError: expected the operator to act like a square matrix
The operator onenormest works if I define A as a square matrix, but this is not what I want.
Anyone knows how to calculate the 1-norm of a sparse non-square matrix?
I think that you want numpy.linalg.norm instead:
from numpy import array, linalg
from scipy import sparse
row = array([0,2,2,0,1,2])
col = array([0,0,1,2,2,2])
data = array([1,2,3,4,5,6])
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print linalg.norm(A.todense(), ord=1) #15
It does not work to pass A.data to the norm, since .data of a sparse matrix object holds only the stored values, so it behaves like a flat vector.
If your sparse matrix is small, then this is fine. If it is large, then obviously this is a problem, in which case you can write your own routine.
If you are only interested in the L^1-norm, and casting to dense is not possible, then you could do it via something like this:
import numpy
sparseL1Norm = lambda A: max(numpy.abs(A).getcol(i).sum() for i in range(A.shape[1]))
This finds the L1-norm of each column:
from scipy import sparse
import numpy as np
row = np.array([0,2,2,0,1,2])
col = np.array([0,0,1,2,2,2])
data = np.array([1,2,3,-4,-5,-6]) # made negative to exercise abs
A = sparse.csc_matrix( (data,(row,col)), shape=(5,3) )
print(abs(A).sum(axis=0))
yields
[[ 3 3 15]]
You could then take the max to find the L1-norm of the matrix:
print(abs(A).sum(axis=0).max())
# 15
abs(A) is a sparse matrix:
In [29]: abs(A)
Out[29]:
<5x3 sparse matrix of type '<type 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Column format>
and sum and max are methods of the sparse matrix, so abs(A).sum(axis=0).max() computes the L1-norm without densifying the matrix.
Note: Most NumPy functions (such as np.abs) are not designed to work with sparse matrices. Although np.abs(A) returns the correct result, it arrives there through an indirect route. The more direct route is to use abs(A), which calls A.__abs__(). Thanks to pv. for pointing this out.
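For completeness, more recent scipy versions also provide scipy.sparse.linalg.norm, which accepts non-square sparse matrices directly (a quick check along the same lines, my own addition):
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, -4, -5, -6])
A = sparse.csc_matrix((data, (row, col)), shape=(5, 3))
print(sparse_norm(A, 1))   # 15, the maximum absolute column sum, computed without densifying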
