Numpy memory error - python

I'm running into a memory error issue with numpy. The following line of code seems to be the issue:
self.D_r = numpy.diag(1/numpy.sqrt(self.r))
Where self.r is a relatively small numpy array.
The interesting thing is that I monitored the memory usage, and the process took up at most 3% of the RAM on the machine. So I suspect something is killing the script before all the RAM is used up, in anticipation that the process would eventually consume it all. If anybody has any ideas I would be very grateful.
Edit 1:
Here's the traceback:
Traceback (most recent call last):
File "/path_to_file/my_script.py", line 82, in <module>
mca_X = mca.mca(X)
File "/path_to_file/mca.py", line 54, in __init__
self.D_r = numpy.diag(1/numpy.sqrt(self.r.values))
File "/path_to_file/numpy/lib/twodim_base.py", line 302, in diag
res = zeros((n, n), v.dtype)
MemoryError
Running the script on KDD Cup 99 data (with one-hot-encoded nominal variables).

If the argument to np.diag() is a 1-D array, it creates a 2-D array using the 1-D array as the diagonal:
Signature: np.diag(v, k=0)
Parameters
v : array_like
If `v` is a 2-D array, return a copy of its `k`-th diagonal.
If `v` is a 1-D array, return a 2-D array with `v` on the `k`-th
diagonal.
This squares the number of elements: a 1-D array of n values becomes an n x n array.

Even if self.r is a small 1-D array, anything a bit over 51,000 elements can trigger a MemoryError, since 51,000² x 8 bytes is roughly 20 GB:
In [85]: a = np.diag(np.arange(5e4))
In [86]: a.shape
Out[86]: (50000, 50000)
In [88]: a.size * a.itemsize
Out[88]: 20000000000          # 20 GB
In [87]: a = np.diag(np.arange(5.1e4))
---------------------------------------------------------------------------
MemoryError
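If only the diagonal scaling is needed (as with D_r above), a sparse diagonal matrix avoids the n x n dense allocation entirely. A minimal sketch, assuming self.r is a 1-D positive array and that the downstream code can work with scipy.sparse matrices:
import numpy as np
from scipy import sparse

r = np.arange(1, 5e4 + 1)           # stand-in for self.r
D_r = sparse.diags(1 / np.sqrt(r))  # stores only the 50,000 diagonal values

print(D_r.shape)         # (50000, 50000)
print(D_r.data.nbytes)   # 400000 bytes, instead of ~20 GB dense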

Related

Memory error when converting matrix to sparse matrix, specified dtype is invalid

Purpose:
I want to invert the dense matrix first and then convert it to a sparse matrix, but it will report a memory error.
Description:
The original CSR sparse matrix (only 1 and 0) is:
<5910x4403279 sparse matrix of type '<class 'numpy.int64'>' with 73906823 stored elements in Compressed Sparse Row format>
I want to calculate the Tversky similarity between rows. Thus, I need to convert the sparse matrix to dense and invert the dense matrix first, and then use matrix * invert_matrix.T to calculate the relative complement.
https://en.wikipedia.org/wiki/Tversky_index
So, I invert the dense matrix after changing the dtype to "bool":
bool_mat = mat.astype(bool)
invert_arr = ~(bool_mat.todense())
invert_arr = invert_arr.astype("uint8")
invert_mat = np.asmatrix(invert_arr, dtype="uint8")
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
However, when I convert invert_mat to a CSR matrix with dtype="uint8", a memory error is raised. (I have tried dtype=int8, bool, and uint8, but the program still throws a memory error.)
Error
Traceback (most recent call last):
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 131, in <module>
tversjt_similarities(data)
File "/home/**/Project/**/Src/calculate_distance/tversky_distance.py", line 88, in tversjt_similarities
s_invert_mat = scipy.sparse.csr_matrix(invert_mat, dtype="uint8")
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/compressed.py", line 86, in __init__
self._set_self(self.__class__(coo_matrix(arg1, dtype=dtype)))
File "/home/**/.local/lib/python3.8/site-packages/scipy/sparse/coo.py", line 189, in __init__
self.row, self.col = M.nonzero()
numpy.core._exceptions.MemoryError: Unable to allocate 387. GiB for an array with shape (25949472067, 2) and data type int64
Problem
The problem is: I have specified dtype='uint8', but the dtype in the error message is int64, and int64 requires more memory.
I have searched related issues and found the problem: numpy will automatically convert int8 to int64.
int8 scipy sparse matrix creation errors creating int64 structure?
The core of this package was developed for linear algebra work (e.g. finite element ODE solutions). The csr format in particular is optimized for matrix multiplication. It uses compiled code (cython), which uses the standard c data types - integers, floats and doubles. In numpy terms that means int64, float32 and float64. Selected formats accept other dtypes like int8, but maintaining that dtype during conversions to other formats and calculations is difficult.
I know that the invert_mat is very dense, because there are so many 1s.
But is there a way to bypass or solve this problem?
I would be very grateful if anyone could give some advice.
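One way to sidestep the dense complement entirely (not from the original thread; a sketch assuming the matrix is strictly binary) is the set identity |A \ B| = |A| - |A ∩ B|, which only needs sparse products and row sums:
import numpy as np
# mat is the original 5910 x 4403279 binary CSR matrix from above
inter = mat.dot(mat.T).toarray().astype(np.float64)   # |A_i ∩ A_j|, 5910 x 5910 fits in RAM
sizes = np.asarray(mat.sum(axis=1), dtype=np.float64).ravel()  # |A_i|

only_a = sizes[:, None] - inter   # |A_i \ A_j|
only_b = sizes[None, :] - inter   # |A_j \ A_i|

alpha = beta = 1.0                # Tversky weights; alpha = beta = 1 gives the Jaccard index
# Note: a pair of entirely empty rows would produce 0/0 here.
tversky = inter / (inter + alpha * only_a + beta * only_b)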

Memory error utilizing numpy arrays Python

My original list_ array has over 2 million rows, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could get around it? The list_ down below is a portion of the actual numpy array.
Pandas data:
import pandas as pd
import numpy as np

bigdata = 'input.csv'
data = pd.read_csv(bigdata, low_memory=False)
# reverses all the table data values
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])
Code:
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

std = np.std(rolling_window(list_, number), axis=1)
Error Message: MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Full length of the error message:
MemoryError Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
5 return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
6
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)
<__array_function__ internals> in std(*args, **kwargs)
C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
3495
3496 return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497 **kwargs)
3498
3499
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
233 ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234 keepdims=keepdims)
235
236 if isinstance(ret, mu.ndarray):
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
200 # Note that x may not be inexact and that we need it to be an array,
201 # not a scalar.
--> 202 x = asanyarray(arr - arrmean)
203
204 if issubclass(arr.dtype.type, (nt.floating, nt.integer)):
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Do us the favor of referencing your previous related questions (at least 2). I happened to recall seeing something similar and so looked up your previous questions.
Also, when asking about an error, show the full traceback (if possible). It helps us (and you) identify where the problem occurs and narrow down possible reasons and fixes.
With the sample list_ (why such a bad name for a numpy array?) of only (30,) shape, the rolling_window array isn't that large. Plus it's a view:
In [90]: x =rolling_window(list_, number)
In [91]: x.shape
Out[91]: (26, 5)
However an operation on this array might produce a copy, boosting memory use.
In [96]: np.std(x, axis=1)
Out[96]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, ...
8.07511323])
In [97]: _.shape
Out[97]: (26,)
np.std does:
std = sqrt(mean(abs(x - x.mean())**2))
x.mean(axis=1) is one value per row, but
In [102]: x.mean(axis=1).shape
Out[102]: (26,)
In [103]: (x-x.mean(axis=1, keepdims=True)).shape
Out[103]: (26, 5)
In [106]: (abs(x-x.mean(axis=1, keepdims=True))**2).shape
Out[106]: (26, 5)
produces an array as big as x, and it will be a full copy, not a strided virtual copy.
Does the error message shape make sense? (2659448, 10000) Is your window size 10000? And the expected number of windows the other value?
198. GiB is a reasonable number given those dimensions (198 GiB ≈ 213 GB):
In [94]: 2659448*10000*8/1e9
Out[94]: 212.75584
I'm not going to test your code with a large enough array to produce a memory error.
as_strided is a nice way of generating moving windows, and fast - but it easily blows up the memory usage.
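As an aside (not from the original answers): since the data already comes in through pandas, the same rolling standard deviation can be computed without ever building the (n_windows, window) intermediate; pandas uses a streaming window algorithm, so memory stays proportional to n. Note ddof=0 to match np.std's default:
import numpy as np
import pandas as pd

number = 5
list_ = np.array([457.334015, 424.440002, 394.795990, 408.903992,
                  398.821014, 402.152008, 435.790985, 423.204987])

# rolling(...).std() defaults to ddof=1, while np.std defaults to ddof=0
std = pd.Series(list_).rolling(number).std(ddof=0).dropna().to_numpy()
print(std.shape)   # (len(list_) - number + 1,) == (4,)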
Generally, there are two ways to deal with "cannot allocate 198 GiB of memory":
Process the data in chunks, or line by line.
Your algorithm appears to be suitable for this; rather than reading the data all at once, rewrite the rolling_window function so that it loads the initial window (the first n lines of the file), then repeatedly drops one line and reads one line from the file. That way, you'll never have more than n lines in memory and it will all work fine (a sketch of the chunked variant appears after this answer).
If it's a local file, it can be kept open during the whole calculation, which is easiest. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or each additional chunk, which you buffer locally).
Alternately, buy (rent) a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off-the-shelf at AWS (and presumably GCP and Azure; or for direct purchase).
This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It'll save you rewriting your code to handle this, but it's not a sustainable solution in a longer term.
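A minimal sketch of the chunked idea, applied to an in-memory array for simplicity (rolling_std_chunked is a hypothetical helper; the same approach works when reading the file piecewise, and chunk should be chosen so that chunk x window floats fit comfortably in RAM). It reuses the rolling_window function from the question:
import numpy as np

def rolling_std_chunked(a, window, chunk=10_000):
    # Rolling std computed block by block: each np.std call only ever
    # materializes a (chunk, window) intermediate instead of the full
    # (n_windows, window) copy all at once.
    n_windows = len(a) - window + 1
    out = np.empty(n_windows)
    for start in range(0, n_windows, chunk):
        stop = min(start + chunk, n_windows)
        block = a[start:stop + window - 1]   # overlapping slice of the input
        out[start:stop] = np.std(rolling_window(block, window), axis=1)
    return out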

Initialize high dimensional sparse matrix

I want to initialize a 300,000 x 300,000 sparse matrix using scipy, but it requires memory as if it were not sparse:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.1)
it gives the error:
MemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type float64
which is the same error as if I initialize using numpy:
np.random.normal(size=[300000, 300000])
Even when I go to a very low density, it reproduces the error:
>>> from scipy import sparse
>>> sparse.rand(300000,300000,.000000000001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.8/site-packages/scipy/sparse/construct.py", line 842, in rand
return random(m, n, density, format, dtype, random_state)
File ".../lib/python3.8/site-packages/scipy/sparse/construct.py", line 788, in random
ind = random_state.choice(mn, size=k, replace=False)
File "mtrand.pyx", line 980, in numpy.random.mtrand.RandomState.choice
File "mtrand.pyx", line 4528, in numpy.random.mtrand.RandomState.permutation
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
Is there a more memory-efficient way to create such a sparse matrix?
Just generate only what you need.
from scipy import sparse
import numpy as np
n, m = 300000, 300000
density = 0.00000001
size = int(n * m * density)
rows = np.random.randint(0, n, size=size)
cols = np.random.randint(0, m, size=size)
data = np.random.rand(size)
arr = sparse.csr_matrix((data, (rows, cols)), shape=(n, m))
This lets you build monster sparse arrays provided they're sparse enough to fit into memory.
>>> arr
<300000x300000 sparse matrix of type '<class 'numpy.float64'>'
with 900 stored elements in Compressed Sparse Row format>
This is probably how the sparse.rand constructor should be working anyway. If any row, col pairs collide it'll add the data values together, which is probably fine for all applications I can think of.
Try passing a reasonable density argument as seen in the docs... with something like 90 billion cells, maybe a density around 0.00000001 or so...
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.rand.html#scipy.sparse.rand
@hpaulj's comment is spot on. There is a clue in the error message also.
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64
There is a reference to int64 (not float64) and to a flat 1-D array of size 300,000 x 300,000 = 90,000,000,000. This refers to an intermediate random-sampling step in the creation of the sparse matrix, which occupies a lot of memory anyway.
Note that while creating any sparse matrix (irrespective of the format), you have to account for memory both for the non-zero values and for representing their positions in the matrix.
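A back-of-the-envelope estimate makes the difference concrete (assuming 32-bit indices, which scipy uses when the dimensions allow it):
n = 300000          # rows == cols
nnz = 900           # stored non-zeros, as in the example above

# CSR keeps one float64 value and one int32 column index per non-zero,
# plus one int32 row pointer per row (+1).
csr_bytes = nnz * 8 + nnz * 4 + (n + 1) * 4
dense_bytes = n * n * 8

print(csr_bytes / 1e6)    # ~1.2 MB
print(dense_bytes / 1e9)  # 720.0 GB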

how to make matrix diagonal with larger shape

I have a 1-D array with shape (777599,). I want to turn my 1-D data array into a 2-D diagonal matrix, but I have a problem.
This is my code:
import numpy as np
a = np.linspace(0, 2000, 777599)
b = np.diag(a)
print(b.shape)
and the response is:
Traceback (most recent call last):
File "/home/willi/PycharmProjects/006_TA/017_gravkorCG5.py", line 29, in <module>
b = np.diag(a)
File "<__array_function__ internals>", line 6, in diag
File "/home/willi/PycharmProjects/venv/lib/python3.5/site-packages/numpy/lib/twodim_base.py", line 275, in diag
res = zeros((n, n), v.dtype)
MemoryError: Unable to allocate 4.40 TiB for an array with shape (777599, 777599) and data type float64
An array with 777599x777599 (i.e. 604660204801) elements is huge. Sparse matrices to the rescue (requires pip install scipy):
import numpy as np
from scipy import sparse
a = np.linspace(0, 2000, 777599)
b = sparse.csc_matrix((a, (range(a.shape[0]), range(a.shape[0]))))
It will be slower than a dense matrix... if a dense matrix could fit into memory at all. :)
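For what it's worth, scipy also has a dedicated constructor for this pattern; a roughly equivalent (and arguably more readable) version of the snippet above:
import numpy as np
from scipy import sparse

a = np.linspace(0, 2000, 777599)
b = sparse.diags(a).tocsc()   # stores only the diagonal entries
print(b.shape)                # (777599, 777599)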

Performing PCA on large sparse matrix by using sklearn

I am trying to apply PCA to a huge sparse matrix. In the following link it says that RandomizedPCA from sklearn can handle a sparse matrix in scipy sparse format:
Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>>type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>>X_train.shape
(2365436, 1617899)
>>>X_train.ndim
2
>>>X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>>from sklearn.decomposition import RandomizedPCA
>>>pca = RandomizedPCA()
>>>pca.fit(X_train)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
self._fit(check_array(X))
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
copy, force_all_finite)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory.
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
B = self._process_toarray_args(order, out)
File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Due to the nature of PCA, even if the input is a sparse matrix, the output is not. You can check that with a quick example:
>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy import sparse as sp
Create a random sparse matrix with 0.01% of its data as non-zeros.
>>> X = sp.rand(1000, 1000, density=0.0001)
Apply PCA to it:
>>> clf = TruncatedSVD(100)
>>> Xpca = clf.fit_transform(X)
Now, check the results:
>>> type(X)
scipy.sparse.coo.coo_matrix
>>> type(Xpca)
numpy.ndarray
>>> print np.count_nonzero(Xpca), Xpca.size
95000, 100000
which suggests that 95000 of the entries are non-zero, however,
>>> np.isclose(Xpca, 0, atol=1e-15).sum(), Xpca.size
99481, 100000
99481 elements are close to 0 (<1e-15), but not 0.
Which means, in short, that for a PCA, even if the input is a sparse matrix, the output is not. Thus, if you try to extract 100,000,000 (1e8) components from your matrix, you will end up with a 1e8 x n_features (in your example 1e8 x 1617899) dense matrix, which of course can't be held in memory.
I'm not an expert statistician, but I believe there is currently no workaround for this using scikit-learn. It is not a problem of scikit-learn's implementation; it is just the mathematical definition of their Sparse PCA (by means of sparse SVD) that makes the result dense.
The only workaround that might work for you is to start from a small number of components and increase it until you reach a balance between the data you can keep in memory and the percentage of variance explained (which you can calculate as follows):
>>> clf.explained_variance_ratio_.sum()
PCA(X) is SVD(X-mean(X)).
Even if X is a sparse matrix, X - mean(X) is always a dense matrix.
Thus, randomized SVD (TruncatedSVD) is not as efficient as randomized SVD of a sparse matrix.
However, delayed evaluation,
delay(X - mean(X)),
can avoid expanding the sparse matrix X into the dense matrix X - mean(X).
Delayed evaluation enables efficient PCA of a sparse matrix using randomized SVD.
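This is not the package's actual code, but the implicit-centering idea can be sketched with scipy.sparse.linalg.LinearOperator, which lets a truncated SVD see X - mean(X) without ever materializing it (the shapes, density, and k below are made up for illustration):
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

X = sparse.rand(10000, 2000, density=0.001, format="csr")
mu = np.asarray(X.mean(axis=0)).ravel()   # column means, shape (n_features,)
n, m = X.shape

# The centered matrix C = X - 1 * mu^T, represented only through its
# matrix-vector products; the dense C is never built.
centered = LinearOperator(
    shape=(n, m),
    matvec=lambda v: X.dot(v) - np.dot(mu, v) * np.ones(n),
    rmatvec=lambda u: X.T.dot(u) - mu * u.sum(),
    dtype=np.float64,
)

# Truncated SVD of the implicitly centered data gives the PCA directions.
u, s, vt = svds(centered, k=50)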
This mechanism is implemented in my package:
https://github.com/niitsuma/delayedsparse/
You can see the code of the PCA using this mechanism:
https://github.com/niitsuma/delayedsparse/blob/master/delayedsparse/pca.py
Performance comparisons to existing methods show that this mechanism drastically reduces the required memory size:
https://github.com/niitsuma/delayedsparse/blob/master/demo-pca.sh
A more detailed description of this technique can be found in my patent:
https://patentscope2.wipo.int/search/ja/detail.jsf?docId=JP225380312
