I am trying to pass the output of a PyCUDA operation to the input of an MXNet computational graph.
I am able to achieve this via a NumPy conversion with the following code:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import mxnet as mx
batch_shape = (1, 1, 10, 10)
h_input = np.zeros(shape=batch_shape, dtype=np.float32)
# init output with ones to see if contents really changed
h_output = np.ones(shape=batch_shape, dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
stream = cuda.Stream()
cuda.memcpy_htod_async(d_input, h_input, stream)
# here some actions with d_input may be performed, e.g. kernel calls
# but for the sake of simplicity we'll just transfer it back to host
cuda.memcpy_dtoh_async(h_output, d_input, stream)
stream.synchronize()
mx_input = mx.nd.array(h_output, ctx=mx.gpu(0))
print('output after pycuda calls: ', h_output)
print('mx_input: ', mx_input)
However, I would like to avoid the overhead of the device-to-host and host-to-device memory copies.
I couldn't find a way to construct an mxnet.ndarray.NDArray directly from h_output.
The closest thing I was able to find is constructing an NDArray from a DLPack object.
But it is not clear how to work with a DLPack object from Python.
Is there a way to achieve NDArray <-> PyCUDA interoperability without copying memory via the host?
Unfortunately, it is not possible at the moment.
I have been trying to debug a program using vast amounts of memory and have distilled it into the following example:
# Caution, use carefully, this can utilise all available memory on your computer
# and render it effectively unresponsive, to the point where you cannot access
# the shell to kill the process; thus requiring reboot.
import numpy as np
import collections
import torch
# q = collections.deque(maxlen=1500) # Uses around 6.4GB
# q = collections.deque(maxlen=3000) # Uses around 12GB
q = collections.deque(maxlen=5000) # Uses around 18GB
def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)

while True:
    f()
Please note the cautionary message in the 1st line of this program. If you set maxlen to a level where it uses too much of your available RAM, it can crash your computer.
I measured the memory using top (VIRT column), and its memory use seems wildly excessive (details in the comments above). From previous experience with my original program, if maxlen is high enough it will crash my computer.
Why is it using so much memory?
I calculate the expected increase in memory from maxlen=1500 to maxlen=3000 (1500 additional uint8 arrays of shape [4, 84, 84]) to be:
4 * 84 * 84 * 1500 / (1024**2) ≈ 40MB.
But we see an increase of about 6GB.
There seems to be some interaction between the deque and the tensor allocation: commenting out either one brings memory use back to expected levels. For example, commenting out the tensor line leads to a total memory use of 2GB, which seems much more reasonable.
Thanks for any help or insight,
Julian.
I think PyTorch stores and updates the computational graph each time you call f(), so the graph just keeps getting bigger and bigger.
Can you try freeing the memory by using del tens (deleting the reference to the variable after use), and let me know how it works? (See the PyTorch documentation here: https://pytorch.org/docs/stable/notes/faq.html)
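For illustration, here is a minimal sketch of that suggestion applied to the f() from the question (my addition, untested against the original program):
def f():
    nparray = np.zeros([4, 84, 84], dtype=np.uint8)
    q.append(nparray)
    nparray1 = np.zeros([32, 4, 84, 84], dtype=np.float32)
    tens = torch.tensor(nparray1, dtype=torch.float32)
    # ... use tens here ...
    del tens  # drop the reference so the tensor's storage can be reclaimed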
I am using cupy to do the following operations and this is pretty fast:
import cupy as cp
shape = (256, 170, 256)
deformation = cp.meshgrid(cp.arange(shape[0]),
cp.arange(shape[1]),
cp.arange(shape[2]),
indexing='ij')
However, if I convert it to an array as:
deformation = cp.array(cp.meshgrid(cp.arange(shape[0]),
cp.arange(shape[1]),
cp.arange(shape[2]),
indexing='ij'))
This seems to be very slow or it just hangs (I gave up after 5 minutes). I am not sure what I am doing wrong here.
I also tried passing copy=False to the cp.array call but this did not change anything.
I don't believe this conversion of a list of cupy arrays to a cupy array is supported. If I make your shape much smaller, e.g. (8,8,8), I get a Python error.
If we study the documentation for cupy.meshgrid, we see that it returns:
Returns: list of cupy.ndarray
The cupy documentation specifically says:
Currently, cupy.array() or cupy.asarray() cannot create an array from Python object containing CuPy array (e.g., a list of CuPy arrays). Use cupy.stack() instead.
Using the suggestion there, this seems to work relatively quickly for me:
$ cat t6.py
import cupy as cp
shape = (256, 170, 256)
deformation = cp.stack(cp.meshgrid(cp.arange(shape[0]),
cp.arange(shape[1]),
cp.arange(shape[2]),
indexing='ij'))
$ time python t6.py
real 0m1.281s
user 0m0.608s
sys 0m0.492s
$
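As a quick sanity check (my addition, assuming the same shape as above), the stacked result carries the three meshgrid components along a new leading axis:
print(deformation.shape)  # prints (3, 256, 170, 256)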
I have some code that uses Numba's cuda.jit so that it runs on the GPU, and I would like to layer Dask on top of it if possible.
Example Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from numba import cuda, njit
import numpy as np
from dask.distributed import Client, LocalCluster
@cuda.jit()
def addingNumbersCUDA(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    big_array = np.random.random_sample((100, 3000))
    big_array2 = np.random.random_sample((100, 3000))
    save_array = np.zeros(shape=(100, 3000))

    arraysize = 100
    threadsperblock = 64
    blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock

    d_big_array = cuda.to_device(big_array)
    d_big_array2 = cuda.to_device(big_array2)
    d_save_array = cuda.to_device(save_array)

    addingNumbersCUDA[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)
    save_array = d_save_array.copy_to_host()
If my function addingNumbersCUDA didn't use any CUDA, I would just put client.submit in front of it (along with gather afterwards) and it would work. But since I'm using CUDA, putting submit in front of the function doesn't work. The Dask documentation says that you can target the GPU, but it's unclear how to actually set this up in practice. How would I set up my function to use Dask with the GPU targeted, and with cuda.jit, if possible?
You may want to look through Dask's documentation on GPUs
But, since I'm using CUDA putting submit in front of the function doesn't work.
There is no particular reason why this should be the case. All Dask does is run your function on a different computer. It doesn't change or modify your function in any way.
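To illustrate that point, here is a minimal sketch (my own, not from the original answer) that wraps the device transfers and kernel launch from the question in an ordinary Python function and submits it to the cluster; the helper name run_on_gpu is made up:
def run_on_gpu(big_array, big_array2):
    # a plain Python function: Dask simply runs it on a worker that has a GPU
    save_array = np.zeros_like(big_array)
    d_big_array = cuda.to_device(big_array)
    d_big_array2 = cuda.to_device(big_array2)
    d_save_array = cuda.to_device(save_array)
    threadsperblock = 64
    blockspergrid = (big_array.shape[0] + threadsperblock - 1) // threadsperblock
    addingNumbersCUDA[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)
    return d_save_array.copy_to_host()

future = client.submit(run_on_gpu, big_array, big_array2)
save_array = future.result()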
In the model I want to launch, I have some variables which have to be initialized with specific values.
I currently store these variables in numpy arrays, but I don't know how to adapt my code so that it works in a google-cloud-ml job.
Currently I initialize my variable like this:
my_variable = variables.model_variable('my_variable', shape=None, dtype=tf.float32, initializer=np.load('datasets/real/my_variable.npy'))
Can someone help me?
First, you'll need to copy/store the data on GCS (using, e.g., gsutil) and ensure your training script has access to that bucket. The easiest way to do so is to copy the array to the same bucket as your data, since you'll likely already have configured that bucket for read access. If the bucket is in the same project as your training job and you followed these instructions (particularly, gcloud beta ml init-project), you should be set. If the data will be in another bucket, see these instructions.
Then you'll need to use a library capable of loading data from GCS. Tensorflow includes a module that can do this, although you're free to use any client library that can read from GCS. Here's an example of using TensorFlow's file_io module:
from StringIO import StringIO
import tensorflow as tf
import numpy as np
from tensorflow.python.lib.io import file_io
# Create a variable initialized to the value of a serialized numpy array
f = StringIO(file_io.read_file_to_string('gs://my-bucket/123.npy'))
my_variable = tf.Variable(initial_value=np.load(f), name='my_variable')
Note that we have to read the file into a string and use StringIO, since file_io.FileIO does not fully implement the seek function required by numpy.load.
Bonus: in case it's useful, you can directly store a numpy array to GCS using the file_io module, e.g.:
np.save(file_io.FileIO('gs://my-bucket/123', 'w'), np.array([[1,2,3], [4,5,6]]))
For Python 3, use from io import StringIO instead of from StringIO import StringIO.
I tried the accepted answer but ran into some problems. Eventually this worked for me (Python 3):
from io import BytesIO
import numpy as np
from tensorflow.python.lib.io import file_io
To save:
array = np.ones((100, ))
dest = 'gs://[BUCKET-NAME]/' # Destination to save in GCS
np.save(file_io.FileIO(dest, 'w'), array)
To load:
src = 'gs://[BUCKET-NAME]/'  # placeholder: path the array was saved to
f = BytesIO(file_io.read_file_to_string(src, binary_mode=True))
arr = np.load(f)
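For completeness, a short usage sketch (my addition, not part of the original answer) that seeds the variable from the loaded array, as in the accepted answer:
import tensorflow as tf
my_variable = tf.Variable(initial_value=arr, name='my_variable')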
Let
import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags
What is the difference between
a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
and
a_gpu = cl_array.to_device(self.ctx, self.queue, a)
?
And what is the difference between
result = numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)
and
result = result_gpu.get()
?
Buffers are CL's version of malloc, while pyopencl.array.Array is a workalike of numpy arrays on the compute device.
So, for the second version in the first part of your question, you may write a_gpu + 2 to get a new array that has 2 added to each number in your array, whereas in the case of the Buffer, PyOpenCL only sees a bag of bytes and cannot perform any such operation.
The second part of your question is the same in reverse: if you've got a PyOpenCL Array, .get() copies the data back and converts it into a (host-based) numpy array. Since numpy arrays are one of the more convenient ways to get contiguous memory in Python, the second variant with enqueue_copy also ends up in a numpy array. But note that you could have copied this data into an array of any size (as long as it's big enough) and any type; the copy is performed as a bag of bytes, whereas .get() makes sure you get the same size and type on the host.
Bonus fact: There is of course a Buffer underlying each PyOpenCL array. You can get it from the .data attribute.
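To make the contrast concrete, here is a minimal sketch (my addition, using a freshly created context and queue instead of the self.ctx/self.queue from the question):
import pyopencl as cl
import pyopencl.array as cl_array
import numpy

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
a = numpy.random.rand(50000).astype(numpy.float32)

a_gpu = cl_array.to_device(queue, a)  # Array: numpy-like, lives on the device
b_gpu = a_gpu + 2                     # elementwise add runs on the device
result = b_gpu.get()                  # blocking copy back into a matching numpy array
raw_buf = a_gpu.data                  # the cl.Buffer underlying the Array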
To answer the first question, Buffer(hostbuf=...) can be called with anything that implements the buffer interface (reference), while pyopencl.array.to_device(...) must be called with an ndarray (reference). An ndarray implements the buffer interface and works in either place. However, only hostbuf=... would be expected to work with, for example, a bytearray (which also implements the buffer interface). I have not confirmed this, but it appears to be what the docs suggest.
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?). In any case, enqueue_copy() works between any combination of Buffer, Image and host memory, can take offsets and regions, and can be asynchronous (with is_blocking=False), and I think these capabilities are only available that way, whereas get() is blocking and returns the whole buffer. (reference)
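And a hedged sketch of the non-blocking copy path described above, assuming result_gpu is a cl.Buffer and that queue and a are set up as in the question:
result = numpy.empty_like(a)
evt = cl.enqueue_copy(queue, result, result_gpu, is_blocking=False)
evt.wait()  # wait for the asynchronous copy to complete before reading result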
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?) In any case, enqueue_copy() works between combination of Buffer, Image and host, can have offsets and regions, and can be asynchronous (with is_blocking=False), and I think these capabilities are only available that way (whereas get() would be blocking and return the whole buffer). (reference)