Numpy memmap with file deletion - python

Is it possible to have a Numpy memmap whose backing file will be deleted when the memmap array is garbage collected?
I have tried:
import os
import tempfile
import numpy as np

arr = np.memmap(tempfile.NamedTemporaryFile(), mode='w+',
                shape=(10, 10), dtype=int)
os.path.exists(arr.filename)  # False
But it appears the reference to the temporary file isn't kept, so the file is deleted immediately.
I don't want to use a context manager on the temporary file, as I want to be able to return the array from a function and have the file live until the array is no longer used.
NB: similar question here: In Python, is it possible to overload Numpy's memmap to delete itself when the memmap object is no longer referenced? but the asker exhibits poor knowledge of Python scoping and the tempfile module.

As it turns out, jtaylor's answer to the question originally linked is correct. So the code:
import tempfile
import numpy as np

with tempfile.NamedTemporaryFile() as ntf:
    temp_name = ntf.name
    arr = np.memmap(ntf, mode='w+',
                    shape=(10, 10), dtype=int)

print(arr)
This works as desired: even though os.path.exists(temp_name) is False afterwards, the array remains usable because of the way the OS manages files. The file path (temp_name) is unlinked and no longer reachable through the filesystem, but the file's data stays on disk until the last open reference to it is closed, and the memmap object keeps such a reference.
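For completeness, here is a minimal sketch of wrapping that pattern in a function, assuming POSIX unlink-while-open semantics (so not Windows); temp_memmap is just an illustrative name:
import tempfile
import numpy as np

def temp_memmap(shape, dtype=np.float64):
    # The temp file's path is unlinked when the with block exits, but the
    # memmap keeps the data alive until the array is garbage collected.
    with tempfile.NamedTemporaryFile() as ntf:
        return np.memmap(ntf, mode='w+', shape=shape, dtype=dtype)

arr = temp_memmap((10, 10))
arr[:] = 1.0
del arr  # last reference gone; the disk space is released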

Related

How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write mode (mmap_mode='c'). Since nothing is written to the original array on disk, I expected that it would have to store all changes in memory, and so could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce the memory usage of my machine learning scripts, which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with a small memory footprint (< 4 GB available).
However, each instance might modify the data differently (e.g. it might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do the changes go? Are they kept just in memory? If so, if I change the whole array, won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 GB of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i, :] = i * np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force the numpy to flush the changes every once in a while, both r+/c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything for c mode? The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i, :] = i * np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the built-in mmap module, whose access argument:
accepts one of the values ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively.
On Linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, just not to the file you opened. With a MAP_PRIVATE mapping, modified pages become anonymous memory, which the kernel can page out to swap space under memory pressure rather than back to the mapped file.
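A small sketch that makes the copy-on-write behaviour visible (demo.npy is just a throwaway file created for the test): writes in 'c' mode never reach the file on disk, whereas the same writes in 'r+' mode would.
import numpy as np

np.save('demo.npy', np.zeros(5, dtype='float32'))

a = np.load('demo.npy', mmap_mode='c')
a[:] = 1.0     # dirty pages live in anonymous (swap-backed) memory
a.flush()      # does not write back to the file in 'c' mode
del a

print(np.load('demo.npy'))  # still all zeros; the file was never modified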

Why is concurrent.futures holding onto memory when returning np.memmap?

The problem
My application is extracting a list of zip files in memory and writing the data to a temporary file. I then memory map the data in the temp file for use in another function. When I do this in a single process, it works fine, reading the data doesn't affect memory, max RAM is around 40MB. However when I do this using concurrent.futures the RAM goes up to 500MB.
I have looked at this example and I understand I could be submitting the jobs in a nicer way to save memory during processing. But I don't think my issue is related, as I am not running out of memory during processing. The issue I don't understand is why it is holding onto the memory even after the memory maps are returned. Nor do I understand what is in the memory, since doing this in a single process does not load the data in memory.
Can anyone explain what is actually in the memory and why this is different between single and parallel processing?
PS I used memory_profiler for measuring the memory usage
Code
Main code:
def main():
    datadir = './testdata'
    files = os.listdir('./testdata')
    files = [os.path.join(datadir, f) for f in files]
    datalist = download_files(files, multiprocess=False)
    print(len(datalist))
    time.sleep(15)
    del datalist  # See here that memory is freed up
    time.sleep(15)
Other functions:
def download_files(filelist, multiprocess=False):
    datalist = []
    if multiprocess:
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            returned_future = [executor.submit(extract_file, f) for f in filelist]
            for future in returned_future:
                datalist.append(future.result())
    else:
        for f in filelist:
            datalist.append(extract_file(f))
    return datalist

def extract_file(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile() as temp_logfile:
        temp_logfile.write(buffer)
        del buffer
        data = memmap(temp_logfile, dtype='float32', shape=(2000000, 4), mode='r')
    return data

def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as input_zip:
        return {name: input_zip.read(name) for name in input_zip.namelist()}
Helper code for data
I can't share my actual data, but here's some simple code to create files that demonstrate the issue:
for i in range(1, 16):
    outdir = './testdata'
    outfile = 'file_{}.dat'.format(i)
    fp = np.memmap(os.path.join(outdir, outfile), dtype='float32', mode='w+',
                   shape=(2000000, 4))
    fp[:] = np.random.rand(*fp.shape)
    del fp
    with ZipFile(outdir + '/' + outfile[:-4] + '.zip', mode='w',
                 compression=ZIP_DEFLATED) as z:
        z.write(outdir + '/' + outfile, outfile)
The problem is that you're trying to pass an np.memmap between processes, and that doesn't work.
The simplest solution is to instead pass the filename, and have the child process memmap the same file.
When you pass an argument to a child process or pool method via multiprocessing, or return a value from one (including doing so indirectly via a ProcessPoolExecutor), it works by calling pickle.dumps on the value, passing the pickle across processes (the details vary, but it doesn't matter whether it's a Pipe or a Queue or something else), and then unpickling the result on the other side.
A memmap is basically just an mmap object with an ndarray allocated in the mmapped memory.
And Python doesn't know how to pickle an mmap object. (If you try, you will either get a PicklingError or a BrokenProcessPool error, depending on your Python version.)
An np.memmap can be pickled, because it's just a subclass of np.ndarray—but pickling and unpickling it actually copies the data and gives you a plain in-memory array. (If you look at data._mmap, it's None.) It would probably be nicer if it gave you an error instead of silently copying all of your data (the pickle-replacement library dill does exactly that: TypeError: can't pickle mmap.mmap objects), but it doesn't.
It's not impossible to pass the underlying file descriptor between processes—the details are different on every platform, but all of the major platforms have a way to do that. And you could then use the passed fd to build an mmap on the receiving side, then build a memmap out of that. And you could probably even wrap this up in a subclass of np.memmap. But I suspect if that weren't somewhat difficult, someone would have already done it, and in fact it would probably already be part of dill, if not numpy itself.
Another alternative is to explicitly use the shared memory features of multiprocessing, and allocate the array in shared memory instead of a mmap.
But the simplest solution is, as I said at the top, to just pass the filename instead of the object, and let each side memmap the same file. This does, unfortunately, mean you can't just use a delete-on-close NamedTemporaryFile (although the way you were using it was already non-portable and wouldn't have worked on Windows the same way it does on Unix), but changing that is still probably less work than the other alternatives.
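For reference, a rough sketch of that filename-passing approach, assuming the same (2000000, 4) float32 layout as above and reusing the question's extract_zip helper (extract_file_to_path and the '.extracted' suffix are made up for illustration):
import concurrent.futures
import numpy as np

def extract_file_to_path(input_zip, out_path):
    # Worker: extract the bytes and write them to a named file on disk.
    buffer = next(iter(extract_zip(input_zip).values()))
    with open(out_path, 'wb') as f:
        f.write(buffer)
    return out_path

def download_files(filelist):
    out_paths = [f + '.extracted' for f in filelist]
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        paths = list(executor.map(extract_file_to_path, filelist, out_paths))
    # Only path strings crossed the process boundary; memmap lazily here.
    return [np.memmap(p, dtype='float32', shape=(2000000, 4), mode='r')
            for p in paths]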

Load numpy array in google-cloud-ml job

In the model I want to launch, I have some variables which have to be initialized with specific values.
I currently store these variables into numpy arrays but I don't know how to adapt my code to make it work on a google-cloud-ml job.
Currently I initialize my variable like this:
my_variable = variables.model_variable('my_variable', shape=None, dtype=tf.float32, initializer=np.load('datasets/real/my_variable.npy'))
Can someone help me?
First, you'll need to copy/store the data on GCS (using, e.g., gsutil) and ensure your training script has access to that bucket. The easiest way to do so is to copy the array to the same bucket as your data, since you'll likely already have configured that bucket for read access. If the bucket is in the same project as your training job and you followed these instructions (particularly, gcloud beta ml init-project), you should be set. If the data will be in another bucket, see these instructions.
Then you'll need to use a library capable of loading data from GCS. Tensorflow includes a module that can do this, although you're free to use any client library that can read from GCS. Here's an example of using TensorFlow's file_io module:
from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io
# Create a variable initialized to the value of a serialized numpy array
f = StringIO(file_io.read_file_to_string('gs://my-bucket/123.npy'))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')
Note that we have to read the file into a string and use StringIO, since file_io.FileIO does not fully implement the seek function required by numpy.load.
Bonus: in case it's useful, you can directly store a numpy array to GCS using the file_io module, e.g.:
np.save(file_io.FileIO('gs://my-bucket/123', 'w'), np.array([[1,2,3], [4,5,6]]))
For Python 3, use from io import StringIO instead of from StringIO import StringIO.
I tried the accepted answer but ran into some problems. Eventually this worked for me (Python 3):
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
To save:
array = np.ones((100, ))
dest = 'gs://[BUCKET-NAME]/' # Destination to save in GCS
np.save(file_io.FileIO(dest, 'w'), array)
To load:
f = BytesIO(file_io.read_file_to_string(src, binary_mode=True))
arr = np.load(f)

Copying a PIL image as quickly as I can open it

I'm finding that in PIL I can load an image from disk substantially more quickly than I can copy it. Is there a faster way to copy an image than by calling image.copy()? (and how is this even possible?)
Sample code:
import os, PIL.Image, timeit
test_filepath = os.path.expanduser("~/Test images/C.jpg")
load_image_cmd = "PIL.Image.open('{}')".format(test_filepath)
print((PIL.Image.open(test_filepath)).__class__)
print(min(timeit.repeat(load_image_cmd, setup='import PIL.Image', number=10000)))
print(min(timeit.repeat("img.copy()", setup='import PIL.Image; img = {}'.format(load_image_cmd), number=10000)))
Produces:
PIL.JpegImagePlugin.JpegImageFile
0.916192054749
1.85366988182
Adding gc.enable() to the setup for timeit doesn't change things much.
According to the PIL documentation, open() is a lazy operation, which means that it's not really doing all the work to use the image yet.
To do a copy() however, it almost certainly has to read the whole thing in and process it.
EDIT:
To test whether this is true, you should access a pixel in each image as part of your timeit.
EDIT 2:
Another glance at the doc shows that a load() after the open() ought to do the trick of making it do all its work.
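For example, a minimal sketch of that comparison (the file path is just a placeholder; the numbers will of course differ), forcing load() so both timings include the actual JPEG decode:
import timeit
import PIL.Image

test_filepath = "C.jpg"  # placeholder path to any JPEG

# open + full decode
t_open_load = min(timeit.repeat(
    "PIL.Image.open({!r}).load()".format(test_filepath),
    setup="import PIL.Image", number=100))

# copy of an already-decoded image
t_copy = min(timeit.repeat(
    "img.copy()",
    setup="import PIL.Image; img = PIL.Image.open({!r}); img.load()".format(test_filepath),
    number=100))

print(t_open_load)
print(t_copy)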

In Python, is it possible to overload Numpy's memmap to delete itself when the memmap object is no longer referenced?

I am trying to use memmap when certain data doesn't fit in memory and employ memmap's ability to trick code into thinking it's just an ndarray. To further expand on this way of using memmap I was wondering if it would be possible to overload memmap's dereference operator to delete the memmap file.
So for example:
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'tmpfile.dat')
{
    out = np.memmap(filename, dtype=a.dtype, mode='w+', shape=a.shape)
}
# At this point out is out of scope, so the overloaded
# dereference function would delete tmpfile.dat
Does this sound feasible/has this been done? Is there something I am not thinking of?
Thank you!
Just delete the file after it has been opened by np.memmap.
The file will then be deleted by the system after the last reference to the file descriptor is closed.
Python's temporary files work like this and can very conveniently be used with the with context-manager construct:
with tempfile.NamedTemporaryFile() as f:
    # open the temp file a second time by name (will not work on Windows)
    x = np.memmap(f.name, dtype=np.float64, mode='w+', shape=(3, 4))
# the file is now gone from the filesystem: its path was unlinked when the
# with block closed f, but x still holds an open reference, so the data
# still exists and uses space on disk (see /proc/<pid>/fd)
del x
# now all references are gone and the file is properly deleted
For the case where we do not want to use with and would rather have a class handle it for us:
import os
import tempfile
from copy import copy

import numpy as np


class tempmap(np.memmap):
    """
    Extension of numpy memmap to automatically map to a file stored
    in a temporary directory.

    Useful as a fast storage option when numpy arrays become large and
    we just want to do some quick experimental stuff.
    """
    def __new__(subtype, dtype=np.uint8, mode='w+', offset=0,
                shape=None, order='C'):
        ntf = tempfile.NamedTemporaryFile()
        self = np.memmap.__new__(subtype, ntf, dtype, mode, offset, shape, order)
        self.temp_file_obj = ntf
        return self

    def __del__(self):
        if hasattr(self, 'temp_file_obj') and self.temp_file_obj is not None:
            self.temp_file_obj.close()
            del self.temp_file_obj


def np_as_tmp_map(nparray):
    tmpmap = tempmap(dtype=nparray.dtype, mode='w+', shape=nparray.shape)
    tmpmap[...] = nparray
    return tmpmap


def test_memmap():
    """Test that deleting a temp memmap also deletes the backing file."""
    x = np_as_tmp_map(np.zeros((10, 10), np.float64))
    name = copy(x.temp_file_obj.name)
    del x
    x = None
    assert not os.path.isfile(name)
