I am transmitting images over sockets from a camera that runs WinCE :(
The images in the camera are just float arrays created using realloc for the given x * y size
On the other end, I am receiving these images in python.
I have this code working:
img_dtype = np.float32
img_rcv = np.empty((img_y, img_x), dtype=img_dtype)
p = sck.recv_into(img_rcv, int(size_bytes), socket.MSG_WAITALL)
if size_bytes != p:
    print("Mismatch between expected and received data amount")
return img_rcv
I am a little bit confused about the way numpy creates its arrays and I am wondering if this img_rcv will be compatible with the way recv_into works.
My questions are:
How safe is this?
Will the memory allocation for the numpy array be visible to recv_into?
Are the numpy arrays creation routines equivalent to a malloc?
Is it just working because I am lucky?
The answers are:
safe
yes, via the buffer interface
yes, in the sense that you get a block of memory you can work with
no
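For readers who want to see this end to end, here is a minimal sketch of the pattern, using socket.socketpair() to stand in for the camera connection (the img_y/img_x sizes here are made up for the demo):

```python
import socket
import numpy as np

# A socketpair stands in for the camera link; numpy arrays implement
# the buffer protocol, so recv_into() writes straight into the array.
img_y, img_x = 4, 5
original = np.arange(img_y * img_x, dtype=np.float32).reshape(img_y, img_x)

sender, receiver = socket.socketpair()
sender.sendall(original.tobytes())

img_rcv = np.empty((img_y, img_x), dtype=np.float32)
size_bytes = img_rcv.nbytes
p = receiver.recv_into(img_rcv, size_bytes, socket.MSG_WAITALL)

assert p == size_bytes                    # all bytes arrived
assert np.array_equal(img_rcv, original)  # payload intact
sender.close()
receiver.close()
```

No intermediate bytes object is created on the receive side; the kernel copies directly into the array's memory.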
I have a number of large numpy arrays that need to be stored as dask arrays. While trying to load each array from .npy and then convert it into a dask.array, I noticed the RAM usage was almost as much as with regular numpy arrays, even after I del arr once arr had been loaded into the dask.array.
In this example:
arr = np.random.random((100, 300))
print(f'Array ref count before conversion: {sys.getrefcount(arr) - 1}') # output: 1
dask_arr = da.from_array(arr)
print(f'Distributed array ref count: {sys.getrefcount(dask_arr) - 1}') # output: 1
print(f'Array ref count after conversion: {sys.getrefcount(arr) - 1}') # output: 3
My only guess is that while dask was loading the array, it created references to the numpy array.
How can I free up the memory and delete all references to the memory location (like free(ptr) in C)?
If you're getting a MemoryError you may have a few options:
Break your data into smaller chunks.
Manually trigger garbage collection and/or tweak the gc settings on the workers through a Worker Plugin (which the OP has tried without success; I'll include it anyway for other readers)
Trim memory using malloc_trim (esp. if working with non-NumPy data or small NumPy chunks)
Make sure you can see the Dask Dashboard while your computations are running to figure out which approach is working.
From this resource:
"Another important cause of unmanaged memory on Linux and MacOSX, which is not widely known about, derives from the fact that the libc malloc()/free() manage a user-space memory pool, so free() won’t necessarily release memory back to the OS."
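For the malloc_trim option above, a minimal sketch (glibc/Linux only; the libc.so.6 name and the client.run call are assumptions on my part, not something from the question):

```python
import ctypes

def trim_memory() -> int:
    # Ask glibc to release free heap pages back to the OS.
    # malloc_trim returns 1 if memory was released, 0 otherwise.
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# On a Dask cluster you would run this on every worker, e.g.:
# client.run(trim_memory)
```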
I'm confused by how numpy's memmap handles changes to data when using copy-on-write (mmap_mode=c). Since nothing is written to the original array on disk, I'm expecting that it has to store all changes in memory, and thus could run out of memory if you modify every single element. To my surprise, it didn't.
I am trying to reduce my memory usage for my machine learning scripts which I run on a shared cluster (the less memory each instance takes, the more instances I can run at the same time). My data are very large numpy arrays (each > 8 GB). My hope is to use np.memmap to work with these arrays with small memory (< 4 GB available).
However, each instance might modify the data differently (e.g. might choose to normalize the input data differently each time). This has implications for storage space. If I use the r+ mode, then normalizing the array in my script will permanently change the stored array.
Since I don't want redundant copies of the data, and just want to store the original data on disk, I thought I should use the 'c' mode (copy-on-write) to open the arrays. But then where do your changes go? Are the changes kept just in memory? If so, if I change the whole array won't I run out of memory on a small-memory system?
Here's an example of a test which I expected to fail:
On a large memory system, create the array:
import numpy as np
GB = 1000**3
GiB = 1024**3
a = np.zeros((50000, 20000), dtype='float32')
bytes = a.size * a.itemsize
print('{} GB'.format(bytes / GB))
print('{} GiB'.format(bytes / GiB))
np.save('a.npy', a)
# Output:
# 4.0 GB
# 3.725290298461914 GiB
Now, on a machine with just 2 GB of memory, this fails as expected:
a = np.load('a.npy')
But these two will succeed, as expected:
a = np.load('a.npy', mmap_mode='r+')
a = np.load('a.npy', mmap_mode='c')
Issue 1: I run out of memory running this code, trying to modify the memmapped array (fails regardless of r+/c mode):
for i in range(a.shape[0]):
    print('row {}'.format(i))
    a[i,:] = i*np.arange(a.shape[1])
Why does this fail (especially, why does it fail even in r+ mode, where it can write to the disk)? I thought memmap would only load pieces of the array into memory?
Issue 2: When I force the numpy to flush the changes every once in a while, both r+/c mode successfully finish the loop. But how can c mode do this? I didn't think flush() would do anything for c mode? The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
for i in range(a.shape[0]):
    if i % 100 == 0:
        print('row {}'.format(i))
        a.flush()
    a[i,:] = i*np.arange(a.shape[1])
Numpy isn't doing anything clever here; it's just deferring to the built-in mmap module, which has an access argument that:
accepts one of four values: ACCESS_READ, ACCESS_WRITE, or ACCESS_COPY to specify read-only, write-through or copy-on-write memory respectively, or ACCESS_DEFAULT to defer to prot.
On Linux, this works by calling the mmap system call with
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the same
file, and are not carried through to the underlying file.
Regarding your question
The changes aren't written to disk, so they are kept in memory, and yet somehow all the changes, which must be over 3Gb, don't cause out-of-memory errors?
The changes likely are written to disk, just not to the file you opened. Copy-on-write pages are private anonymous memory, so under memory pressure the kernel can page them out to swap rather than keeping them all resident.
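A small-scale sketch of that behaviour (sizes shrunk so it runs anywhere): writes through mmap_mode='c' land in private pages and never reach the file.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'a.npy')
np.save(path, np.zeros((100, 100), dtype='float32'))

a = np.load(path, mmap_mode='c')   # copy-on-write mapping
a[:] = 1.0                         # changes go to private pages
a.flush()                          # no-op for the file in 'c' mode

b = np.load(path)                  # re-read the file itself
assert b.sum() == 0.0              # file on disk is untouched
assert a.sum() == 100 * 100        # private copy holds the changes
```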
I use the following code to load 24-bit binary data into a 16-bit numpy array:
temp = numpy.zeros((len(data) // 3, 4), dtype='b')
temp[:, 1:] = numpy.frombuffer(data, dtype='b').reshape(-1, 3)
temp2 = temp.view('<i4').flatten() >> 16 # >> 16 because I need to divide by 2**16 to load my data into 16-bit array, needed for my (audio) application
output = temp2.astype('int16')
I imagine that it's possible to improve the speed efficiency, but how?
It seems like you are being very roundabout here. Won't this do the same thing?
output = np.frombuffer(data,'b').reshape(-1,3)[:,1:].flatten().view('i2')
This would save some time from not zero-filling a temporary array, skipping the bitshift and avoiding some unnecessary data moves. I haven't actually benchmarked it yet, though, and I expect the savings to be modest.
Edit: I have now performed the benchmark. For len(data) of 12 million, I get 80 ms for your version and 39 ms for mine, so pretty much exactly a factor 2 speedup. Not a very big improvement, as expected, but then your starting point was already pretty fast.
Edit2: I should mention that I have assumed little endian here. However, the original question's code is also implicitly assuming little endian, so this is not a new assumption on my part.
(For big endian (data and architecture), you would replace 1: by :-1. If the data had a different endianness than the CPU, then you would also need to reverse the order of the bytes (::-1).)
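As a quick sanity check (little-endian host assumed, random bytes as stand-in data), the one-liner agrees with the original four-step version:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=3 * 1000, dtype=np.uint8).tobytes()

# original four-step version
temp = np.zeros((len(data) // 3, 4), dtype='b')
temp[:, 1:] = np.frombuffer(data, dtype='b').reshape(-1, 3)
reference = (temp.view('<i4').flatten() >> 16).astype('int16')

# one-liner: keep only the two high bytes of each 24-bit sample
fast = np.frombuffer(data, 'b').reshape(-1, 3)[:, 1:].flatten().view('i2')

assert np.array_equal(reference, fast)
```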
Edit3: For even more speed, I think you will have to go outside Python. This Fortran function, which also uses OpenMP, gets me a factor 2+ speedup compared to my version (so 4+ times faster than yours).
subroutine f(a,b)
    implicit none
    integer*1, intent(in)  :: a(:)
    integer*1, intent(out) :: b(size(a)*2/3)
    integer :: i
    !$omp parallel do
    do i = 1, size(a)/3
        b(2*(i-1)+1) = a(3*(i-1)+2)
        b(2*(i-1)+2) = a(3*(i-1)+3)
    end do
    !$omp end parallel do
end subroutine
Compile with FOPT="-fopenmp" f2py -c -m basj{,.f90} -lgomp. You can then import and use it in python:
import basj
def convert(data): return basj.f(np.frombuffer(data,'b')).view('i2')
You can control the number of cores to use via the environment variable OMP_NUM_THREADS; it defaults to using all available cores.
Inspired by @amaurea's answer, here is a Cython version (I already used Cython in my original code, so I'll continue with Cython instead of mixing Cython + Fortran):
import cython
import numpy as np
cimport numpy as np
def binary24_to_int16(char *data):
    cdef int i
    res = np.zeros(len(data) // 3, np.int16)
    b = <char *>((<np.ndarray>res).data)
    for i in range(len(data) // 3):
        b[2*i] = data[3*i+1]
        b[2*i+1] = data[3*i+2]
    return res
There is a factor 4 speed gain :)
I'd like to create polygons with draggable vertices in PyOpenGL. Having read around a bit, VBOs seemed like a sensible way to achieve this.
Having never used VBOs before, I'm having trouble figuring out how to dynamically update them - ideally I'd like to just modify elements of a numpy array of vertices, then propagate only the elements that changed up to the GPU. I had assumed that the OpenGL.arrays.vbo.VBO wrapper did this automagically with its copy_data() method, but it seems not.
Here's a silly example:
from OpenGL import GL as gl
from OpenGL import GLUT as glut
from OpenGL.arrays import vbo
import numpy as np
class VBOJiggle(object):
    def __init__(self, nvert=100, jiggliness=0.01):
        self.nvert = nvert
        self.jiggliness = jiggliness
        verts = 2*np.random.rand(nvert, 2) - 1
        self.verts = np.require(verts, np.float32, 'F')
        self.vbo = vbo.VBO(self.verts)

    def draw(self):
        gl.glClearColor(0, 0, 0, 0)
        gl.glClear(gl.GL_COLOR_BUFFER_BIT)
        gl.glEnableClientState(gl.GL_VERTEX_ARRAY)
        self.vbo.bind()
        gl.glVertexPointer(2, gl.GL_FLOAT, 0, self.vbo)
        gl.glColor(0, 1, 0, 1)
        gl.glDrawArrays(gl.GL_LINE_LOOP, 0, self.vbo.data.shape[0])
        gl.glDisableClientState(gl.GL_VERTEX_ARRAY)
        self.vbo.unbind()
        self.jiggle()
        glut.glutSwapBuffers()

    def jiggle(self):
        # jiggle half of the vertices around randomly
        delta = (np.random.rand(self.nvert//2, 2) - 0.5)*self.jiggliness
        self.verts[:self.nvert:2] += delta

        # the data attribute of the vbo is the same as the numpy array
        # of vertices
        assert self.verts is self.vbo.data

        # # Approach 1:
        # # it seems like this ought to work, but it doesn't - all the
        # # vertices remain static even though the vbo's data gets updated
        # self.vbo.copy_data()

        # Approach 2:
        # this works, but it seems unnecessary to copy the whole array
        # up to the GPU, particularly if the array is large and I have
        # modified only a small subset of vertices
        self.vbo.set_array(self.verts)

if __name__ == '__main__':
    glut.glutInit()
    glut.glutInitDisplayMode(glut.GLUT_DOUBLE | glut.GLUT_RGB)
    glut.glutInitWindowSize(250, 250)
    glut.glutInitWindowPosition(100, 100)
    glut.glutCreateWindow(None)

    demo = VBOJiggle()
    glut.glutDisplayFunc(demo.draw)
    glut.glutIdleFunc(demo.draw)
    glut.glutMainLoop()
To completely answer this question, I have to mention the OpenGL buffer update first.
The OpenGL instruction glBufferData creates and initializes a buffer object's data store. An existing data store of a buffer object is completely destroyed and a new data store (possibly with a different size) is created. If a data pointer is passed to the function, then the data store is completely initialized from that data. The size of the buffer and the size of the provided data are assumed to be equal.
glBufferSubData updates the entire data or a subset of the data of an existing data store. The data store is assumed to be created before, by glBufferData. No data store is destroyed or created.
Of course, technically glBufferData can always be used instead of glBufferSubData, but glBufferSubData will perform much better, because the expensive buffer creation (allocation) is eliminated.
Using
self.vbo.set_array(self.verts)
is a bad idea, because as seen in the implementation (PyOpenGL/OpenGL/arrays/vbo.py), this method creates a completely new buffer, with a possibly new size and will force the recreation of the buffer object's data store (because of self.copied = False).
If the buffer was created before, then self.vbo.copy_data() will update the data by glBufferSubData (see if self.copied: in copy_data). To make this work the buffer has to be the currently bound buffer (self.vbo.bind()). Further a copy information has to be set (VBO.copy_segments). The copy information is stated by the item setter (VBO.__setitem__).
This means, in "Approach 1" you would have to do something like the following:
self.vbo[:] = self.verts
self.vbo.bind()
self.vbo.copy_data()
Since OpenGL.arrays.vbo is nothing more than a wrapper for the OpenGL buffer instructions, I would prefer to use glBufferSubData directly, which will perform best in cases like this:
# Approach 3:
# Direct buffer update by `glBufferSubData`
self.vbo.bind()
self.vbo.implementation.glBufferSubData(self.vbo.target, 0, self.vbo.data)
With this approach even subsets of the data store can be updated. Note that the 2nd parameter of glBufferSubData is a byte offset into the buffer object's data store. Further, there is an overloaded implementation which can take a buffer size and a direct data pointer.
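A GL context can't run here, so the sketch below only shows the offset/size arithmetic for such a partial update; lo/hi are hypothetical bounds of the modified rows, and the actual glBufferSubData call is left as a comment:

```python
import numpy as np

verts = np.zeros((100, 2), dtype=np.float32)   # (nvert, 2) vertex array
lo, hi = 10, 20                                # rows that were modified

offset = lo * verts.strides[0]                 # byte offset into the store
chunk = verts[lo:hi]                           # contiguous rows to upload

assert offset == lo * 2 * 4                    # 2 floats * 4 bytes per row
assert chunk.nbytes == (hi - lo) * 2 * 4

# with the VBO bound, upload only that slice:
# self.vbo.implementation.glBufferSubData(self.vbo.target, offset, chunk)
```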
Let
import pyopencl as cl
import pyopencl.array as cl_array
import numpy
a = numpy.random.rand(50000).astype(numpy.float32)
mf = cl.mem_flags
What is the difference between
a_gpu = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
and
a_gpu = cl_array.to_device(self.ctx, self.queue, a)
?
And what is the difference between
result = numpy.empty_like(a)
cl.enqueue_copy(self.queue, result, result_gpu)
and
result = result_gpu.get()
?
Buffers are CL's version of malloc, while pyopencl.array.Array is a workalike of numpy arrays on the compute device.
So for the second version of the first part of your question, you may write a_gpu + 2 to get a new array that has 2 added to each number in your array, whereas in the case of the Buffer, PyOpenCL only sees a bag of bytes and cannot perform any such operation.
The second part of your question is the same in reverse: If you've got a PyOpenCL array, .get() copies the data back and converts it into a (host-based) numpy array. Since numpy arrays are one of the more convenient ways to get contiguous memory in Python, the second variant with enqueue_copy also ends up in a numpy array--but note that you could've copied this data into an array of any size (as long as it's big enough) and any type--the copy is performed as a bag of bytes, whereas .get() makes sure you get the same size and type on the host.
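The "bag of bytes" distinction can be illustrated in pure numpy (a rough analogy, not actual CL code): the same bytes can be copied into a destination of a completely different dtype and shape, as long as it is big enough.

```python
import numpy as np

src = np.arange(4, dtype=np.float32)   # 16 bytes of float data
dst = np.empty(16, dtype=np.int8)      # different type and shape

# copy as raw bytes, the way enqueue_copy treats a plain Buffer
dst[:] = np.frombuffer(src.tobytes(), dtype=np.int8)

# reinterpreting recovers the original values, like .get() would
assert np.array_equal(dst.view(np.float32), src)
```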
Bonus fact: There is of course a Buffer underlying each PyOpenCL array. You can get it from the .data attribute.
To answer the first question, Buffer(hostbuf=...) can be called with anything that implements the buffer interface (reference). pyopencl.array.to_device(...) must be called with an ndarray (reference). ndarray implements the buffer interface and works in either place. However, only hostbuf=... would be expected to work with for example a bytearray (which also implements the buffer interface). I have not confirmed this, but it appears to be what the docs suggest.
On the second question, I am not sure what type result_gpu is supposed to be when you call get() on it (did you mean Buffer.get_host_array()?) In any case, enqueue_copy() works between combination of Buffer, Image and host, can have offsets and regions, and can be asynchronous (with is_blocking=False), and I think these capabilities are only available that way (whereas get() would be blocking and return the whole buffer). (reference)