It is odd that there are no such questions on Stack Overflow yet.
I am using MXNet, trying to run the VQA example on my PC and using mx.io.DataIter to read the data, but I get
mxnet.base.MXNetError: [11:05:07] c:\projects\mxnet-distro-win\mxnet-build\src\storage\./cpu_device_storage.h:70: Failed to allocate CPU Memory
on these two lines of code:
File "E:\PyProjects\TestVqa\VQAtrainIter.py", line 71, in reset
self.nd_img.append(mx.ndarray.array(self.img, dtype=self.dtype))
File "E:\PyProjects\Pro\venv\lib\site-packages\mxnet\ndarray\utils.py", line 146, in array
return _array(source_array, ctx=ctx, dtype=dtype)
The first frame is from my code and the second is from the library code.
So what exactly does "Failed to allocate CPU Memory" mean?
The relevant part of VQAtrainIter.py is below:
def reset(self):
    self.curr_idx = 0
    self.nd_text = []
    self.nd_img = []
    self.ndlabel = []
    for buck in self.data:
        label = self.answer  # self.answer and self.img are loaded from .npy files
        self.nd_text.append(mx.ndarray.array(buck, dtype=self.dtype))
        self.nd_img.append(mx.ndarray.array(self.img, dtype=self.dtype))
        self.ndlabel.append(mx.ndarray.array(label, dtype=self.dtype))
Based on what you mentioned, you're using npz, which is always loaded fully into memory. DataIter does fetch the data by batch, but if its source is an npz archive, which is itself in-memory, then everything still has to fit into memory. What DataIter does is let you avoid needing twice as much memory, because it copies only a small part of the data from the numpy array into an mxnet.nd.NDArray for processing.
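For example, here is a minimal sketch of that idea (file and function names are hypothetical; it assumes the data can be stored as a plain .npy file, since members of an .npz archive cannot be memory-mapped):

import numpy as np
import mxnet as mx

# Hypothetical setup: the images live in a plain .npy file that numpy can
# memory-map, so pages are read from disk lazily instead of all at once.
img = np.load('img.npy', mmap_mode='r')

def get_batch(start, batch_size):
    # Only this slice is copied into an mxnet NDArray; the rest stays on disk.
    return mx.ndarray.array(img[start:start + batch_size], dtype='float32')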
I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch

# q = collections.deque(maxlen=1500)  # Uses around 6.4GB
# q = collections.deque(maxlen=3000)  # Uses around 12GB
q = collections.deque(maxlen=5000)    # Uses around 18GB

def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details in the commented lines above). From previous experience with my original program, if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the expected increase in memory from maxlen=1500 to maxlen=3000 (i.e. 1500 more uint8 arrays of shape [4, 84, 84]) to be:
4 * 84 * 84 * 1500 / (1024**2) ≈ 40 MB.
But we see an increase of roughly 6GB.
There seems to be some sort of interaction between the deque and the tensor allocation: commenting out either one brings memory use back to expected levels. For example, commenting out the tensor line leads to total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try freeing the memory with del tens (deleting the reference to the variable after use), and let me know how it works? (See the PyTorch docs: https://pytorch.org/docs/stable/notes/faq.html)
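For example, a sketch of that suggestion applied to the function above (untested; whether it helps depends on what is actually holding the memory):

def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
    del tens  # drop the reference so the tensor's storage can be freed right away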
I am trying to create a long video (over 3 hours) with CompositeVideoClip using moviepy.
The problem is that it takes too much RAM (I have 32GB). It consumes almost all of it (99%) by spawning a bunch of ffmpeg-win64-v4.2.2.exe processes, and after a while it says: Unable to allocate 47.5 MiB for an array with shape (1080, 1920, 3) and data type float64.
Here is my code:
def CombieVideo():
    global curentVideoLengt
    masterVideo = None
    for videoUrl in videoFiles:
        print(videoUrl)
        video = VideoFileClip(videoUrl).fx(vfx.fadein, 1).fx(vfx.fadeout, 1)
        curentVideoLengt += video.duration
        if curentVideoLengt >= (audioLen * 60 * 60):
            break
        if masterVideo is None:
            masterVideo = video
        else:
            masterVideo = CompositeVideoClip([masterVideo, video])
    if curentVideoLengt < (audioLen * 60 * 60):
        videoUrl = random.choice(videoFiles)
        print(videoUrl)
        video = VideoFileClip(videoUrl).fx(vfx.fadein, 1).fx(vfx.fadeout, 1)
        curentVideoLengt += video.duration
        masterVideo = CompositeVideoClip([masterVideo, video])
        CombieVideo()
    else:
        masterVideo.audio = CompositeAudioClip(audios)
        masterVideo.write_videofile('./MasterVideo/output_video.avi', fps=30, threads=4, codec="png")

CombieVideo()
Your issue isn't allocating arrays; rather, you're opening too many instances of FFMPEG_VideoReader via the VideoFileClip instances. Each FFMPEG_VideoReader spawns an ffmpeg process, which eats up your memory.
You can try to close the reader immediately after instantiation:
video = VideoFileClip(videoUrl).fx(vfx.fadein, 1).fx(vfx.fadeout, 1)
video.close()
# continue as is
No guarantee that this will work, but it should point you in the right direction; the moviepy documentation for VideoFileClip suggests closing clips promptly.
On a side note, do you really want to recurse this function? After walking through videoFiles it picks one random file, and if the total is still not long enough it recurses, starting again from the top of the videoFiles list. An iterative rewrite is sketched below.
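For illustration, a rough iterative sketch (untested; it assumes the names videoFiles, audioLen, and audios from the question, and it joins the clips with concatenate_videoclips, which plays them back to back and may match the intent better than layering them with CompositeVideoClip):

import random
from moviepy.editor import (VideoFileClip, CompositeAudioClip,
                            concatenate_videoclips, vfx)

def combine_video():
    clips, total = [], 0.0
    pool = list(videoFiles)
    while total < audioLen * 60 * 60:
        # take files in order, then fall back to random picks if we run out
        url = pool.pop(0) if pool else random.choice(videoFiles)
        clip = VideoFileClip(url).fx(vfx.fadein, 1).fx(vfx.fadeout, 1)
        clips.append(clip)
        total += clip.duration
    master = concatenate_videoclips(clips)
    master.audio = CompositeAudioClip(audios)
    master.write_videofile('./MasterVideo/output_video.avi',
                           fps=30, threads=4, codec="png")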
We are working with large (1.2TB) uncompressed, unchunked hdf5 files with h5py in Python for a machine learning application, which requires us to work through the full dataset repeatedly, loading slices of ~15MB individually in a randomized order. We are working on a Linux (Ubuntu 18.04) machine with 192 GB RAM.
We noticed that the program is slowly filling the cache. When the total size of the cache reaches a size comparable to the machine's full RAM (free memory in top almost 0, but plenty of 'available' memory), swapping occurs, slowing down all other applications. In order to pinpoint the source of the problem, we wrote a separate minimal example to isolate our data-loading procedures, but found that the problem was independent of each part of our method.
We tried:
Building a numpy memmap and accessing the requested slice:
# on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = np.memmap(tv_path, mode="r", shape=hdf5_event_data.shape,
                            offset=hdf5_event_data.id.get_offset(), dtype=hdf5_event_data.dtype)
self.e = np.ones((512, 40, 40, 19))

# on __getitem__:
self.e = self.event_data[index, :, :, :19]
return self.e
Reopening the memmap on each call to __getitem__:
# on __getitem__:
self.event_data = np.memmap(self.path, mode="r", shape=self.shape,
                            offset=self.offset, dtype=self.dtype)
self.e = self.event_data[index, :, :, :19]
return self.e
Addressing the h5 file directly and converting to a numpy array:
# on init:
f = h5py.File(tv_path, 'r')
hdf5_event_data = f["event_data"]
self.event_data = hdf5_event_data
self.e = np.ones((512, 40, 40, 19))

# on __getitem__:
self.e = self.event_data[index, :, :, :19]
return self.e
We also tried the above approaches within the pytorch Dataset/DataLoader framework, but it made no difference.
We observe high memory fragmentation, as evidenced by /proc/buddyinfo. Dropping the cache via sync; echo 3 > /proc/sys/vm/drop_caches doesn't help while the application is running. Cleaning the cache before the application starts removes the swapping behaviour until the cache eats up the memory again, and then the swapping starts again.
Our working hypothesis is that the system is trying to hold on to cached file data, which leads to memory fragmentation. Eventually, when new memory is requested, swapping is performed even though most memory is still 'available'.
As such, we turned to ways of changing the Linux environment's behaviour around file caching and found this post. Is there a way to apply the POSIX_FADV_DONTNEED flag when opening an h5 file in Python (or to the portion of it we accessed via numpy memmap), so that this accumulation of cache does not occur? In our use case we will not be re-visiting a particular file location for a long time (not until we have accessed all the other remaining 'slices' of the file).
You can use os.posix_fadvise to tell the OS how the regions you plan to load will be used. This naturally requires a bit of low-level tweaking to determine your file descriptor and to get an idea of the regions you plan on reading.
The easiest way to get the file descriptor is to supply it yourself:
pf = open(tv_path, 'rb')
f = h5py.File(pf, 'r')
You can now set the advice. For the entire file:
os.posix_fadvise(os.fileno(pf), 0, f.id.get_filesize(), os.POSIX_FADV_DONTNEED)
Or for a particular dataset:
os.posix_fadvise(os.fileno(pf), hdf5_event_data.id.get_offset(),
hdf5_event_data.id.get_storage_size(), os.POSIX_FADV_DONTNEED)
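If you want to drop only the pages you have just read, you can advise per slice. Here is a sketch, under the assumptions that the dataset is contiguous and unchunked (as in the question) and that whole rows are read, so the byte offset of row index is simply the dataset offset plus index times the row size:

import os
import numpy as np

row_bytes = int(np.prod(hdf5_event_data.shape[1:])) * hdf5_event_data.dtype.itemsize

def read_slice(index):
    e = hdf5_event_data[index, :, :, :19]  # normal h5py read
    # tell the kernel the pages backing this row won't be needed again soon
    os.posix_fadvise(os.fileno(pf),
                     hdf5_event_data.id.get_offset() + index * row_bytes,
                     row_bytes, os.POSIX_FADV_DONTNEED)
    return e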
Other things to look at
H5py does its own chunk caching. You may want to try turning this off:
f = h5py.File(..., rdcc_nbytes=0)
As an alternative, you may want to try using one of the other drivers provided in h5py, like 'sec2':
f = h5py.File(..., driver='sec2')
I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expect that it has to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with little memory (< 4 GB available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do the changes go? Are they kept just in memory? If so, if I change the whole array, won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 GB of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, which tries to modify the memmapped array (it fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i, :] = i * np.arange(a.shape[1])
Why does this fail (and especially, why does it fail even in r+ mode, which can write to disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force numpy to flush the changes every once in a while, both r+ and c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything in c mode. The changes aren't written to disk, so they must be kept in memory, and yet somehow all the changes, which must be over 3GB, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i, :] = i * np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the built-in mmap module, whose access argument:
accepts one of the values ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively.
On Linux, this works by calling the mmap system call with MAP_PRIVATE:

MAP_PRIVATE
    Create a private copy-on-write mapping. Updates to the mapping are
    not visible to other processes mapping the same file, and are not
    carried through to the underlying file.
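You can see the same behaviour by using the mmap module directly. A small illustration (assuming the a.npy file created earlier): writes through an ACCESS_COPY mapping modify the private copy only, never the file.

import mmap

with open('a.npy', 'rb') as fh:
    m = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_COPY)
    m[0:4] = b'XXXX'  # changes the private in-memory copy
    print(m[0:4])     # b'XXXX', but the bytes on disk are unchanged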
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, just not to the file you opened; they're likely paged out to virtual memory (swap) somewhere.
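One way to observe this (a Linux-only sketch, using the a.npy file created above): dirty some pages of a copy-on-write map and look at the process's memory counters; the dirtied pages count as private memory of the process and are eligible for swap, while a.npy itself stays unchanged.

import numpy as np

a = np.load('a.npy', mmap_mode='c')  # MAP_PRIVATE under the hood
a[:1000] = 1                         # dirtied pages become private copies

with open('/proc/self/status') as f:
    for line in f:
        if line.startswith(('VmRSS', 'VmSwap')):
            print(line.strip())      # resident and swapped-out memory of this process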
I have a huge dataset on which I wish to perform PCA. I am limited by RAM and by the computational efficiency of PCA.
Therefore, I shifted to using Iterative PCA.
Dataset size: (140000, 3504)
The documentation states: "This algorithm has constant memory complexity, on the order of batch_size, enabling use of np.memmap files without loading the entire file into memory."
This is really good, but I am unsure how to take advantage of it.
I tried loading one memmap, hoping it would be accessed in chunks, but my RAM blew up.
My code below ends up using a lot of RAM:
ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)
When I say "my RAM blew", the Traceback I see is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
How can I improve this without compromising on accuracy by reducing the batch size?
My attempts to diagnose:
I looked at the sklearn source code, and in the fit() function I can see the following. This makes sense to me, but I am still unsure what is wrong in my case.
for batch in gen_batches(n_samples, self.batch_size_):
    self.partial_fit(X[batch])
return self
Edit:
Worst case, I will have to write my own code for iterative PCA, batch-processing by reading and closing .npy files. But that would defeat the purpose of taking advantage of the existing implementation.
Edit 2:
If somehow I could delete a batch of the processed memmap file after use, it would make much more sense.
Edit 3:
Ideally, if IncrementalPCA.fit() is just using batches, it should not crash my RAM. I am posting the whole code, just to make sure I am not making a mistake in flushing the memmap completely to disk beforehand.
temp_train_data = X_train[1000:]
temp_labels = y[1000:]
# out must be a memmap for out.flush() to work (np.empty has no flush());
# the filename here is illustrative
out = np.memmap('out.mmap', dtype=np.int64, mode='w+', shape=(200001, 3504))
for index, row in enumerate(temp_train_data):
    actual_index = index + 1000
    data = X_train[actual_index-1000:actual_index+1].ravel()
    __, cd_i = pywt.dwt(data, 'haar')
    out[index] = cd_i
out.flush()
pca_obj = IncrementalPCA()
clf = pca_obj.fit(out)
Surprisingly, I note that out.flush() doesn't free my memory. Is there a way to use del out to free my memory completely, and then pass a pointer to the file to IncrementalPCA.fit()?
You have hit a problem with sklearn in a 32-bit environment. I presume you are using np.float16 because you're in a 32-bit environment and you need it to be able to create the memmap object without numpy throwing errors.
In a 64-bit environment (tested with Python 3.3 64-bit on Windows), your code just works out of the box. So, if you have a 64-bit computer available, install 64-bit Python with 64-bit numpy, scipy, and scikit-learn, and you are good to go.
Unfortunately, if you cannot do this, there is no easy fix. I have raised an issue on github here, but it is not easy to patch. The fundamental problem is that, within the library, if your type is float16, a copy of the array into memory is triggered. The details are below.
So, I hope you have access to a 64-bit environment with plenty of RAM. If not, you will have to split up your array yourself and batch process it, a rather larger task (a sketch follows)...
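For reference, a rough sketch of that manual batching (untested; n_components and the batch size are illustrative): feed the memmap to partial_fit one slice at a time, upcasting only the current batch, so the full float64 copy that crashes fit() is never materialised.

import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='r', shape=(140000, 3504))
clf = IncrementalPCA(n_components=50)
batch = 1000
for start in range(0, ut.shape[0], batch):
    # only this batch is copied and upcast; 1000 x 3504 float64 is about 28 MB
    clf.partial_fit(ut[start:start + batch].astype(np.float64))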
N.B. It's really good to see you going to the source to diagnose your problem :) However, if you look at the line where the code fails (from the Traceback), you will see that the for batch in gen_batches loop you found is never reached.
Detailed diagnosis:
The actual error generated by the OP's code:
import numpy as np
from sklearn.decomposition import IncrementalPCA

ut = np.memmap('my_array.mmap', dtype=np.float16, mode='w+', shape=(140000, 3504))
clf = IncrementalPCA(copy=False)
X_train = clf.fit_transform(ut)
is
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 433, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Python27\lib\site-packages\sklearn\decomposition\incremental_pca.py", line 171, in fit
    X = check_array(X, dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 347, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
MemoryError
The call to check_array (code link) uses dtype=np.float, but the original array has dtype=np.float16. Even though check_array() defaults to copy=False and passes this on to np.array(), the flag is ignored (as per the docs) because the requested dtype differs from the array's dtype; therefore np.array makes a copy.
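A tiny demonstration of the underlying behaviour (illustrative, with a small array; np.asarray copies only when it has to, which is the semantics in play here):

import numpy as np

x = np.zeros(10, dtype=np.float16)
y = np.asarray(x, dtype=np.float64)      # dtype differs, so a copy is forced
print(y.dtype, np.shares_memory(x, y))   # float64 False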
This could be solved in the IncrementalPCA code by ensuring that the dtype was preserved for arrays with dtype in (np.float16, np.float32, np.float64). However, when I tried that patch, it only pushed the MemoryError further along the chain of execution.
The same copying problem occurs when the code calls linalg.svd() from the main scipy code, and this time the error occurs during a call to gesdd(), a wrapped native function from LAPACK. Thus, I do not think there is a way to patch this (at least not an easy way; it would at minimum require altering code in core scipy).
Does the following alone trigger the crash?
X_train_mmap = np.memmap('my_array.mmap', dtype=np.float16,
mode='w+', shape=(n_samples, n_features))
clf = IncrementalPCA(n_components=50).fit(X_train_mmap)
If not, then you can use that model to transform (project) your data iteratively into a smaller representation, using batches:
from sklearn.utils import gen_batches

X_projected_mmap = np.memmap('my_result_array.mmap', dtype=np.float16,
                             mode='w+', shape=(n_samples, clf.n_components))
for batch in gen_batches(n_samples, clf.batch_size_):
    X_batch_projected = clf.transform(X_train_mmap[batch])
    X_projected_mmap[batch] = X_batch_projected
I have not tested that code but I hope that you get the idea.