My actual list_ array has over 2 million values, and I get a memory error when I run the code that calculates the rolling standard deviation. Is there a way I could work around it? The list_ below is a portion of the actual numpy array.
Pandas data:
import pandas as pd
import math
import numpy as np
bigdata = 'input.csv'
data = pd.read_csv(bigdata, low_memory=False)
#reverses all the table data values
data1 = data.iloc[::-1].reset_index(drop=True)
list_ = np.array(data1['Close'])
Code:
number = 5
list_= np.array([457.334015,424.440002,394.795990,408.903992,398.821014,402.152008,435.790985,423.204987,411.574005,
404.424988,399.519989,377.181000,375.467010,386.944000,383.614990,375.071991,359.511993,328.865997,
320.510010,330.079010,336.187012,352.940002,365.026001,361.562012,362.299011,378.549011,390.414001,
400.869995,394.773010,382.556000])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
std = np.std(rolling_window(list_, number), axis=1)
Error Message: MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Full length of the error message:
MemoryError Traceback (most recent call last)
<ipython-input-7-df0ab5649b16> in <module>
5 return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
6
----> 7 std1 = np.std(rolling_window(PC_list, number), axis=1)
<__array_function__ internals> in std(*args, **kwargs)
C:\Python3.7\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
3495
3496 return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3497 **kwargs)
3498
3499
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
232 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
233 ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 234 keepdims=keepdims)
235
236 if isinstance(ret, mu.ndarray):
C:\Python3.7\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
200 # Note that x may not be inexact and that we need it to be an array,
201 # not a scalar.
--> 202 x = asanyarray(arr - arrmean)
203
204 if issubclass(arr.dtype.type, (nt.floating, nt.integer)):
MemoryError: Unable to allocate 198. GiB for an array with shape (2659448, 10000) and data type float64
Do us the favor of referencing your previous related questions (at least 2). I happened to recall seeing something similar, and so looked up your previous questions.
Also, when asking about an error, show the full traceback (if possible). It helps us (and you) identify where the problem occurs, and narrow down possible reasons and fixes.
With the sample list_ (why such a bad name for a numpy array?) of only (30,) shape, the rolling_window array isn't that large. Plus it's a view:
In [90]: x =rolling_window(list_, number)
In [91]: x.shape
Out[91]: (26, 5)
However an operation on this array might produce a copy, boosting memory use.
In [96]: np.std(x, axis=1)
Out[96]:
array([22.67653383, 10.3940773 , 14.60076482, 13.82801944, 13.68038469,
12.54834004, ...
8.07511323])
In [97]: _.shape
Out[97]: (26,)
np.std does:
std = sqrt(mean(abs(x - x.mean())**2))
x.mean(axis=1) is one value per row, but
In [102]: x.mean(axis=1).shape
Out[102]: (26,)
In [103]: (x-x.mean(axis=1, keepdims=True)).shape
Out[103]: (26, 5)
In [106]: (abs(x-x.mean(axis=1, keepdims=True))**2).shape
Out[106]: (26, 5)
produces an array as big as x, and will be a full copy; not a strided virtual copy.
Does the shape in the error message, (2659448, 10000), make sense? Is your window size 10000, and the expected number of windows the other value?
198 GiB is a reasonable number for that shape:
In [94]: 2659448*10000*8/1e9
Out[94]: 212.75584
I'm not going to test your code with an array large enough to produce a memory error.
as_strided is a nice way of generating moving windows, and fast - but it easily blows up the memory usage.
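For this particular statistic you may not need the (N, window) intermediate at all: pandas can compute a rolling standard deviation in a single pass. A minimal sketch, assuming the goal is the same values as the np.std call above (pandas defaults to ddof=1 while np.std defaults to ddof=0, so ddof is set explicitly; the random data is a stand-in for the Close prices):
import numpy as np
import pandas as pd

number = 5
list_ = np.random.rand(30) * 100      # stand-in for the Close prices in the question

# Rolling std over `number`-sized windows; only window-sized state is kept,
# never an (N - window + 1, window) copy of the data. The leading NaNs from
# the incomplete windows are dropped so the length matches the strided version.
std = pd.Series(list_).rolling(number).std(ddof=0).to_numpy()[number - 1:]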
Generally, there are two ways to deal with "cannot allocate 198GiB of memory":
Process the data in chunks, or line-by-line.
Your algorithm appears to be suitable for this; rather than reading the data all at once, rewrite the rolling_window function so that it loads the initial window (the first n lines of the file), then repeatedly drops one line and reads one more line from the file. That way, you'll never hold more than n lines in memory and it'll all work fine (see the sketch below).
If it's a local file, it can be kept open during the whole calculation, which is easiest. If it's a remote object, you may find connections timing out; if so, you may need to either copy the data to a local file, or use the relevant seek/offset parameter to reopen the file for each additional line (or each additional chunk, which you buffer locally).
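A rough sketch of the chunked variant, reusing the rolling_window function and the list_ / number names from the question, and processing a limited number of windows at a time (chunk_size is an arbitrary choice to tune against available memory):
def rolling_std_chunked(a, window, chunk_size=10_000):
    # Only a (chunk_size, window) block is ever materialized, instead of the
    # full (N - window + 1, window) intermediate that a single np.std call needs.
    n_windows = a.shape[0] - window + 1
    out = np.empty(n_windows)
    for start in range(0, n_windows, chunk_size):
        stop = min(start + chunk_size, n_windows)
        block = rolling_window(a[start:stop + window - 1], window)
        out[start:stop] = np.std(block, axis=1)
    return out

std = rolling_std_chunked(list_, number)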
Alternatively, buy (rent) a machine with more than 200 GiB of memory; machines with over 1 TiB of memory are available off the shelf at AWS (and presumably GCP and Azure, or for direct purchase).
This is especially suitable if you're reasonably sure your requirements won't grow further and you just need to get this one job done. It'll save you rewriting your code, but it's not a sustainable solution in the longer term.
Related
I have a rather large dataset X containing 60660 vectors, each with 36 features. I cannot share the data since it is confidential, but np.random.rand(60660, 36) creates a random matrix of the same shape with float64 values from 0 to 1, which is very similar to my matrix. I am trying to calculate the pairwise Euclidean distances of this matrix, but I get the following memory error.
Example Code:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
X = np.random.rand(60660, 36)
distances = euclidean_distances(X, squared=True)
MemoryError: Unable to allocate 27.4 GiB for an array with shape (60660, 60660) and data type float64
This makes sense because I am essentially calculating a 60660 x 60660 array of float64 values, which will certainly cause me to run out of RAM. What I do not understand is why, if I reduce the datatype of the array to float16 for example, I still get the same memory error.
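(A quick sanity check of the figure in the error message:)
60660 * 60660 * 8        # 29,437,084,800 bytes for a dense float64 matrix
29_437_084_800 / 2**30   # about 27.4 GiB, matching the error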
Example Code:
X = np.random.rand(60660, 36).astype('float16')
distances = euclidean_distances(X, squared=True)
MemoryError Traceback (most recent call last)
Input In [18], in <cell line: 2>()
1 X = np.random.rand(60660, 36).astype('float16')
----> 2 distances = euclidean_distances(X, squared=True)
File ~\Anaconda3\envs\gm_work\lib\site-packages\sklearn\metrics\pairwise.py:328, in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
322 if Y_norm_squared.shape != (1, Y.shape[0]):
323 raise ValueError(
324 f"Incompatible dimensions for Y of shape {Y.shape} and "
325 f"Y_norm_squared of shape {original_shape}."
326 )
--> 328 return _euclidean_distances(X, Y, X_norm_squared, Y_norm_squared, squared)
File ~\Anaconda3\envs\gm_work\lib\site-packages\sklearn\metrics\pairwise.py:369, in _euclidean_distances(X, Y, X_norm_squared, Y_norm_squared, squared)
366 distances = _euclidean_distances_upcast(X, XX, Y, YY)
367 else:
368 # if dtype is already float64, no need to chunk and upcast
--> 369 distances = -2 * safe_sparse_dot(X, Y.T, dense_output=True)
370 distances += XX
371 distances += YY
MemoryError: Unable to allocate 27.4 GiB for an array with shape (60660, 60660) and data type float64
Why do I still get the same error, with the data type still reported as float64, even though I am casting X to float16?
I tried running this with a smaller input array (10000 x 36) with float16 datatype and it works just fine, but the output array is still float64, so is it not possible to reduce the datatype? Any suggestions on how I can get this to work with my original array?
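One way around this, regardless of dtype, is to never materialize the full 60660 x 60660 result: scikit-learn's pairwise_distances_chunked yields the distance matrix in row blocks, and a reduce_func lets you keep only what you need from each block. A minimal sketch, assuming a per-row reduction (here the smallest non-self distance, purely as an example) is what's actually wanted:
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.rand(60660, 36)

def reduce_func(D_chunk, start):
    # Skip the zero self-distance and keep the nearest-neighbour distance
    # for each row of the chunk; the full chunk is then discarded.
    return np.partition(D_chunk, 1, axis=1)[:, 1]

nearest = np.concatenate(list(
    pairwise_distances_chunked(X, reduce_func=reduce_func, metric='euclidean')
))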
So I have three arrays of dimension (1949, 2649):
Jun_1TMean = xr.DataArray(Jun_1T.variables['__xarray_dataarray_variable__'])
lon = xr.DataArray(lon2)
lat = xr.DataArray(lat2)
When I do
June_1T = np.array([Jun_1TMean, lat, lon])
June_1T.shape
I get (3, 1949, 2649)
However I actually want shape (1949, 2649, 1949, 2649, 1949, 2649) instead
Apart from the fact that you can't just 'stack' NumPy arrays as separate axes without a broadcastable function or ufunc like +, *, etc., I don't think you want to be doing that. A NumPy array with the dimensions you suggest, with dtype int64 (float64 would be no smaller), would take:
array_space = (1949*2649*1949*2649*1949*2649) * 8 bytes
            = 1,100,959,591,182,509,749,608 bytes
            ≈ 1,100,959,591,182.51 GB
            ≈ 1,100,959.59 petabytes
For reference, the combined data of Google, Amazon, Microsoft and Facebook is estimated to be about 1,200 petabytes.
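If the intent is just to keep the three fields together per grid cell, stacking along a new last axis gives a (1949, 2649, 3) array, which is perfectly manageable. A sketch with stand-in arrays (the real ones being Jun_1TMean, lat and lon from the question):
import numpy as np

# Stand-ins for Jun_1TMean, lat and lon (each (1949, 2649)).
t = np.zeros((1949, 2649))
lat = np.zeros((1949, 2649))
lon = np.zeros((1949, 2649))

stacked = np.stack([t, lat, lon], axis=-1)   # one (T, lat, lon) triple per grid point
print(stacked.shape)                         # (1949, 2649, 3)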
I want to reshape a list of images; train_data holds 639976 of them.
This is how I am importing the images:
import os
import cv2
import numpy as np
from tqdm import tqdm

train_data = []
for img in tqdm(os.listdir('Images/train/images')):
    path = os.path.join('Images/train/images/', img)
    image = cv2.imread(path, cv2.IMREAD_COLOR)
    image = cv2.resize(image, (28, 28)).astype('float32') / 255
    train_data.append(image)
np.reshape(train_data,(-1,28,28,3))
I am getting memory error here.
np.reshape(train_data,(-1,28,28,3))
Error:
return array(a, dtype, copy=False, order=order)
MemoryError
Looks like train_data is a large list of small arrays. I'm not familiar with cv2, so I'm guessing that the
image=cv2.resize(image, (28,28)).astype('float32')/255
creates a (28,28) or (28,28,3) array of floats. By itself, not very big. Apparently that works.
The error is in:
np.reshape(train_data,(-1,28,28,3))
Since train_data is a list, reshape has to first create an array, probably with np.array(train_data). If all the components are (28,28,3), this array will already have (n,28,28,3) shape. But that's where the memory error occurs: apparently there are so many of these small(ish) arrays that it doesn't have the memory to assemble them into one big array.
I'd experiment with a subset of the files.
In [1]: 639976*28*28*3
Out[1]: 1505223552 # floats
In [2]: _*8
Out[2]: 12041788416 # bytes
What's that, a 12 GB array (assuming float64; at float32 it would be about half that, still roughly 6 GB)? I'm not surprised you get a memory error. The list of arrays takes up at least that much space as well, but it can be scattered in small blocks throughout memory and swap. Make an array from the list and you roughly double the memory usage.
Just for fun, try to make a blank array of that size:
np.ones((639976,28,28,3), 'float32')
If that works, try to make two.
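One way to avoid paying for both the list and the assembled array is to preallocate the final array (or an np.memmap backed by a file, if even one copy is too big for RAM) and fill it image by image. A rough sketch, with the directory path from the question assumed:
import os
import cv2
import numpy as np

files = os.listdir('Images/train/images')
train_data = np.empty((len(files), 28, 28, 3), dtype='float32')   # single allocation
for i, img in enumerate(files):
    image = cv2.imread(os.path.join('Images/train/images', img), cv2.IMREAD_COLOR)
    train_data[i] = cv2.resize(image, (28, 28)).astype('float32') / 255
# Already (N, 28, 28, 3); no reshape (and no second copy) needed.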
I am working with binary (only 0's and 1's) matrices with rows and columns on the order of a few thousand. For example, the number of rows is between 2000 and 7000 and the number of columns is between 4000 and 15000. My computer has more than 100 GB of RAM.
I'm surprised that even with these sizes I am getting a MemoryError with the following code. For reproducibility, I'm including an example with a smaller matrix (10*20). Note that both of the following raise this error:
import numpy as np
my_matrix = np.random.randint(2,size=(10,20))
tr, tc = np.triu_indices(my_matrix.shape[0],1)
ut_sums = np.sum(my_matrix[tr] * my_matrix[tc], 1)
denominator = 100
value = 1 - ut_sums.astype(float)/denominator
np.einsum('i->', value)
I tried to replace the elementwise multiplication in the above code with einsum as below, but it also generates the same MemoryError:
import numpy as np
my_matrix = np.random.randint(2,size=(10,20))
tr, tc = np.triu_indices(my_matrix.shape[0],1)
ut_sums = np.einsum('ij,ij->i', my_matrix[tr], my_matrix[tc])
denominator = 100
value = 1 - ut_sums.astype(float)/denominator
np.einsum('i->', value)
In both cases, the printed Traceback points to the line where ut_sums is being calculated.
Please note that my code has other operations too, and there are other statistics calculated on matrices of similar sizes, but with more than 100 GB of RAM, I thought it should not be a problem.
Just because your computer has 100 GB of physical memory does not mean that your operating system is willing or able to allocate such large amounts of contiguous memory. And it does have to be contiguous, because that's how NumPy arrays usually are.
You should figure out how large your output matrix is meant to be, and then try creating a similar one by itself:
arr = np.zeros((10000, 10000))
See if you're able to allocate a single array as large as you want.
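For scale: the intermediate that fails here is my_matrix[tr] (and my_matrix[tc]), which holds one full-length row per pair of rows. At the upper end of the sizes mentioned that is enormous, while the same sums can be read off the much smaller Gram matrix. A sketch of both the estimate and an equivalent computation:
import numpy as np

n_rows, n_cols = 7000, 15000
pairs = n_rows * (n_rows - 1) // 2            # 24,496,500 row pairs
print(pairs * n_cols * 8 / 2**40)             # ~2.7 TiB for one int64 copy

# Same result as np.sum(my_matrix[tr] * my_matrix[tc], 1), without the big copies:
my_matrix = np.random.randint(2, size=(10, 20))
tr, tc = np.triu_indices(my_matrix.shape[0], 1)
gram = my_matrix @ my_matrix.T                # only (n_rows, n_rows)
ut_sums = gram[tr, tc]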
I'm running into a memory error issue with numpy. The following line of code seems to be the issue:
self.D_r = numpy.diag(1/numpy.sqrt(self.r))
Where self.r is a relatively small numpy array.
The interesting thing is that I monitored the memory usage and the process took up at most 3% of the RAM on the machine. So I'm thinking something is killing the script before all the RAM is actually used, because it anticipates that the process would take it all. If anybody has any ideas I would be very grateful.
Edit 1:
Here's the traceback:
Traceback (most recent call last):
File "/path_to_file/my_script.py", line 82, in <module>
mca_X = mca.mca(X)
File "/path_to_file/mca.py", line 54, in __init__
self.D_r = numpy.diag(1/numpy.sqrt(self.r.values))
File "/path_to_file/numpy/lib/twodim_base.py", line 302, in diag
res = zeros((n, n), v.dtype)
MemoryError
Running the script on KDD Cup 99 data (with one-hot-encoded nominal variables).
If the argument to np.diag() is 1-D, it creates a 2-D array, using the 1-D array as the diagonal:
Signature: np.diag(v, k=0)
Parameters
v : array_like
If `v` is a 2-D array, return a copy of its `k`-th diagonal.
If `v` is a 1-D array, return a 2-D array with `v` on the `k`-th
diagonal.
This squares the number of elements in the array (and hence its memory footprint).
If self.r is a 1-D array of more than about 51000 elements, this can create a memory error:
In [85]: a = np.diag(np.arange(5e4))
In [86]: a.shape
Out[86]: (50000, 50000)
In [88]: a.size*a.itemsize
Out[88]: 20000000000   # 20 GB
In [87]: a = np.diag(np.arange(5.1e4))
---------------------------------------------------------------------------
MemoryError
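If D_r is only ever used to scale the rows (or columns) of another matrix, the dense diagonal never needs to exist; broadcasting, or a sparse diagonal from scipy, keeps the memory at O(n). A sketch with stand-ins for self.r and the matrix it multiplies:
import numpy as np
from scipy.sparse import diags

r = np.random.rand(51000) + 0.1     # stand-in for self.r
X = np.random.rand(51000, 10)       # stand-in for whatever D_r would multiply

# Instead of numpy.diag(1/numpy.sqrt(r)) @ X, scale the rows directly:
scaled = X / np.sqrt(r)[:, None]

# Or keep an explicit diagonal operator without the n x n dense matrix:
D_r = diags(1 / np.sqrt(r))         # stores only the 51000 diagonal entries
scaled2 = D_r @ X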