Pyopencl: difference between to_device and Buffer - python

Let
import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags
What is the difference between
a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
and
a_gpu = cl_array.to_device(self.ctx, self.queue, a)
?
And what is the difference between
result = numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)
and
result = result_gpu.get()
?

Buffers are CL's version of malloc, while pyopencl.array.Array is a workalike of numpy arrays on the compute device.
So for the second version of the first part of your question, you may write a_gpu + 2 to get a new array that has 2 added to each number in your array, whereas in the case of the Buffer, PyOpenCL only sees a bag of bytes and cannot perform any such operation.
The second part of your question is the same in reverse: If you've got a PyOpenCL array, .get() copies the data back and converts it into a (host-based) numpy array. Since numpy arrays are one of the more convenient ways to get contiguous memory in Python, the second variant with enqueue_copy also ends up in a numpy array--but note that you could've copied this data into an array of any size (as long as it's big enough) and any type--the copy is performed as a bag of bytes, whereas .get() makes sure you get the same size and type on the host.
Bonus fact: There is of course a Buffer underlying each PyOpenCL array. You can get it from the .data attribute.
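The "bag of bytes" point can be illustrated on the host alone, without an OpenCL device. The following is only an analogy, assuming a bytearray as a stand-in for the device Buffer; none of the names below come from PyOpenCL.

```python
import numpy as np

a = np.arange(4, dtype=np.float32)

# Stand-in for a device Buffer: raw bytes with no dtype or shape attached.
raw = bytearray(a.tobytes())

# A byte-wise copy accepts any destination that is big enough, whatever its
# dtype. This mirrors what enqueue_copy does with a host array.
wrong_type = np.empty(4, dtype=np.int32)
wrong_type.view(np.uint8)[:] = np.frombuffer(raw, dtype=np.uint8)

# Only a destination with the matching dtype reproduces the original values.
# This is the bookkeeping that .get() does for you.
right_type = np.empty_like(a)
right_type.view(np.uint8)[:] = np.frombuffer(raw, dtype=np.uint8)

print(right_type)  # the original floats
print(wrong_type)  # the same bytes reinterpreted as 32-bit integers
```

The copy into wrong_type succeeds silently, which is why the typed Array interface is the safer default when the data really is an array.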

To answer the first question: Buffer(hostbuf=...) can be called with anything that implements the buffer interface, whereas pyopencl.array.to_device(...) must be called with an ndarray. An ndarray implements the buffer interface and works in either place. However, only hostbuf=... would be expected to work with, for example, a bytearray (which also implements the buffer interface). I have not confirmed this, but it appears to be what the docs suggest.
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?). In any case, enqueue_copy() works between any combination of Buffer, Image and host memory, can take offsets and regions, and can be asynchronous (with is_blocking=False). I think these capabilities are only available that way, whereas get() would be blocking and return the whole buffer.

Related

Python equivalent of free() for numpy arrays?

I have a number of large numpy arrays that need to be stored as dask arrays. While trying to load each array from .npy and then convert it into dask.array, I noticed the RAM usage was almost just as much as regular numpy arrays even after I del arr after loading arr into dask.array.
In this example:
import sys
import numpy as np
import dask.array as da
arr = np.random.random((100, 300))
print(f'Array ref count before conversion: {sys.getrefcount(arr) - 1}') # output: 1
dask_arr = da.from_array(arr)
print(f'Distributed array ref count: {sys.getrefcount(dask_arr) - 1}') # output: 1
print(f'Array ref count after conversion: {sys.getrefcount(arr) - 1}') # output: 3
My only guess is that while dask was loading the array, it created references to the numpy array.
How can I free up the memory and delete all references to the memory location (like free(ptr) in C)?
If you're getting a MemoryError you may have a few options:
Break your data into smaller chunks.
Manually trigger garbage collection and/or tweak the gc settings on the workers through a Worker Plugin (the OP has tried this without success; I'll include it anyway for other readers)
Trim memory using malloc_trim (esp. if working with non-NumPy data or small NumPy chunks)
Make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
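As a rough sketch of the malloc_trim option, the function below asks glibc to hand free heap pages back to the OS. It assumes Linux with glibc available as libc.so.6; the fallback that returns None on other platforms is an addition for illustration, not part of any library.

```python
import ctypes
import sys


def trim_memory():
    """Call glibc's malloc_trim(0); return its result, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6")
        malloc_trim = libc.malloc_trim
    except (OSError, AttributeError):
        # Not glibc (for example musl), so there is nothing to call.
        return None
    # 1 means some memory was released to the OS, 0 means none was.
    return malloc_trim(0)


result = trim_memory()
print(result)
```

With a dask.distributed client, the same function can be run on every worker with client.run(trim_memory).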
From this resource:
"Another important cause of unmanaged memory on Linux and MacOSX, which is not widely known about, derives from the fact that the libc malloc()/free() manage a user-space memory pool, so free() won’t necessarily release memory back to the OS."

How does numpy's memmap copy-on-write mode work?

I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode='c'). Since nothing is written to the original array on disk, I expected it to store all changes in memory, so that it could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce my memory usage for my machine learning scripts which I run on a shared cluster (the less mem each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 Gb). My hope is to use np.memmap to work with these arrays with small memory (<4Gb available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 Gb of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force the numpy to flush the changes every once in a while, both r+/c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything for c mode? The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
NumPy isn't doing anything clever here; it just defers to the built-in mmap module, whose access argument:
accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.
On Linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes are likely written to disk, just not to the file you opened. Modified pages of a private mapping can be paged out to swap like any other process memory.
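A small experiment with the mmap module itself shows that writes to a copy-on-write mapping never reach the underlying file. This is a minimal sketch using a 16-byte scratch file (a hypothetical stand-in for a.npy), not a reproduction of the multi-gigabyte test above.

```python
import mmap
import os
import tempfile

# Create a small scratch file containing 16 zero bytes.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(16))
os.close(fd)

# Map it copy-on-write; the file only needs to be readable.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)

mm[:4] = b"abcd"           # modifies this process's private copy of the page
in_memory = bytes(mm[:4])  # the mapping sees the change
mm.close()

# The file on disk is untouched.
with open(path, "rb") as f:
    on_disk = f.read(4)

os.remove(path)
print(in_memory, on_disk)
```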

share variable (data from file) among multiple python scripts with not loaded duplicates

I would like to load a big matrix contained in matrix_file.mtx. The load must happen only once. Once the matrix is loaded into memory, I would like many Python scripts to share it without duplicates, so that I have a memory-efficient multi-script program driven from bash (or from Python itself). I can imagine some pseudocode like this:
# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))
# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...
The idea is for pointer_to_matrix to point to matrix in RAM, which is loaded only once rather than once per script. The scripts are called separately from a bash script (or, if possible, from a Python main):
$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &
I'd also be interested in solutions via hard disk, i.e. matrix could be stored at disk while the share object access to it as being required. Nonetheless, the object (a kind of pointer) in RAM can be seen as the whole matrix.
Thank you for your help.
Between the mmap module and numpy.frombuffer, this is fairly easy:
import mmap
import numpy as np
with open("matrix_file.mtx","rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems in Py3.3+, add:
    # os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger background read in of the file to the system cache,
    # minimizing page faults when you use it
matrix = np.frombuffer(mm, np.uint8)
Each process would perform this work separately, and get a read only view of the same memory. You'd change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to shared data, though it would require synchronization and possibly explicit calls to mm.flush to actually ensure the data was reflected in other processes.
A more complex solution that follows your initial design more closely might be to use multiprocessing.managers.SyncManager to create a connectable shared "server" for data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then registering a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that would share it automatically thanks to fork semantics), but it's the closest to the concept you describe.
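To make the "change the dtype as needed" step concrete, here is a minimal sketch of the mmap approach with a typed, shaped view. It assumes the matrix is stored as raw binary float64 values; the file name matrix.bin and the helper open_shared are hypothetical, and a text format would have to be converted to binary first.

```python
import mmap
import os
import tempfile

import numpy as np

# One-time preparation: dump a small matrix as raw binary float64 values.
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
np.arange(6, dtype=np.float64).tofile(path)


def open_shared(path, dtype, shape):
    """Map the file read-only and view it as an array without copying."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The array keeps the mapping alive after the file object is closed.
    return np.frombuffer(mm, dtype=dtype).reshape(shape)


# Each script would make this call itself; the OS backs every mapping with
# the same cached pages, so the data is not duplicated in RAM.
m1 = open_shared(path, np.float64, (2, 3))
m2 = open_shared(path, np.float64, (2, 3))
print(m1)
```

Because the mapping is read-only, the resulting arrays are not writeable, which guards against one script accidentally changing data that the others rely on.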

recv_into a numpy array

I am transmitting images over sockets from a camera that runs WinCE :(
The images in the camera are just float arrays created using realloc for the given x * y size
On the other end, I am receiving these images in python.
I have this code working doing
img_dtype = np.float32
img_rcv = np.empty((img_y, img_x),
                   dtype = img_dtype)
p = sck.recv_into(img_rcv,
                  int(size_bytes),
                  socket.MSG_WAITALL)
if size_bytes != p:
    print("Mismatch between expected and received data amount")
return img_rcv
I am a little bit confused about the way numpy creates its arrays and I am wondering if this img_rcv will be compatible with the way recv_into works.
My questions are:
How safe is this?
Will recv_into know about the memory allocated for the numpy array?
Are the numpy array creation routines equivalent to a malloc?
Am I just getting lucky that it works?
The answers are:
safe
yes, via the buffer interface
yes, in the sense that you get a block of memory you can work with
no
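For illustration, here is a self-contained sketch of receiving straight into a numpy array through the buffer interface. It uses socket.socketpair() in place of the camera connection, and it loops on recv_into instead of relying on MSG_WAITALL; the helper name recv_array is hypothetical.

```python
import socket

import numpy as np


def recv_array(sock, shape, dtype=np.float32):
    """Fill a freshly allocated array directly from the socket."""
    out = np.empty(shape, dtype=dtype)
    view = memoryview(out).cast("B")  # flat byte view of the array's memory
    received = 0
    while received < out.nbytes:
        n = sock.recv_into(view[received:])
        if n == 0:
            raise ConnectionError("socket closed before the full image arrived")
        received += n
    return out


# Stand-in for the camera link: a connected pair of local sockets.
sender, receiver = socket.socketpair()
sent = np.arange(12, dtype=np.float32).reshape(3, 4)
sender.sendall(sent.tobytes())

got = recv_array(receiver, (3, 4))
sender.close()
receiver.close()
print(got)
```

The byte order and float layout must match on both ends; the sketch assumes the sender uses the same native float32 representation as the receiver.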

control memory alignment in python ctypes

I'm looking into using ctypes to call C functions that manipulate SSE (__m128) data, which has to be aligned on 16-byte boundaries.
I could not find a simple way to control the alignment of memory allocated by ctypes, so, right now, I'm making ctypes call a C function that provides a correctly aligned memory buffer.
The problem I have with this approach is that I have to release this memory explicitly to prevent it from being leaked.
Is there a way to control the alignment of memory allocated by ctypes? Or is there a way to register a cleanup function to release memory allocated by a C function called by ctypes (apart from the standard Python __del__ method)?
What is the best path to follow ?
After taking some time to investigate, I came up with a function that should let me allocate arbitrarily aligned memory with ctypes. It relies on ctypes keeping a reference to the unaligned memory buffer while exposing an instance that starts at an aligned position within that buffer.
Still have to test this in production.
import ctypes
def ctypes_alloc_aligned(size, alignment):
    bufSize = size+(alignment-1)
    raw_memory = bytearray(bufSize)
    ctypes_raw_type = (ctypes.c_char * bufSize)
    ctypes_raw_memory = ctypes_raw_type.from_buffer(raw_memory)
    raw_address = ctypes.addressof(ctypes_raw_memory)
    offset = raw_address % alignment
    offset_to_aligned = (alignment - offset) % alignment
    ctypes_aligned_type = (ctypes.c_char * (bufSize-offset_to_aligned))
    ctypes_aligned_memory = ctypes_aligned_type.from_buffer(raw_memory, offset_to_aligned)
    return ctypes_aligned_memory
I suppose that c_ulonglong (64 bit) must be 64-bit aligned; it's a start. Then the doc suggests that you can use _pack_ to control alignment of structures. These two are not exactly what you want, but by combining them you can allocate 8-byte aligned structs without holes.
Let's assume a struct with three 8-byte aligned elements .v0, .v1, .v2. Use addressof() to see if the struct is 16-byte aligned. If it is, use .v0 and .v1 for your 128-bit value; if it's not, use .v1 and .v2.
