Numpy array from cStringIO object and avoiding copies - python

This is to understand things better; it is not an actual problem I need to fix. A cStringIO object is supposed to emulate a string, a file, and an iterator over lines. Does it also emulate a buffer? In any case, ideally one should be able to construct a numpy array as follows:
import numpy as np
import cStringIO
c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')

# Trying the iterator abstraction
b = np.fromiter(c, int)
# The above fails with: ValueError: setting an array element with a sequence.

# Trying the file abstraction
b = np.fromfile(c, int)
# The above fails with: IOError: first argument must be an open file

# Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number

# Trying the string abstraction
b = np.fromstring(c)
# The above fails with: TypeError: argument 1 must be string or read-only buffer

b = np.fromstring(c.getvalue(), int)  # does work
My question is: why does it behave this way?
The practical problem where this came up is the following: I have an iterator which yields a tuple. I am interested in making a numpy array from one of the components of each tuple with as little copying and duplication as possible. My first cut was to keep writing the interesting component of each yielded tuple into a StringIO object and then use its memory buffer for the array. I can of course use getvalue(), but that will create and return a copy. What would be a good way to avoid the extra copying?

The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python single characters and strings have the same type; numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.
The other problem is that a cStringIO iterates over its lines, not its characters.
Something like the following iterator should get around both of these problems:
def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)
Use it like so (note the seek!):
c.seek(0)
b = np.fromiter(chariter(c), int)

As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.
If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.
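For illustration, a minimal sketch of that frombuffer route (using dtype=np.uint8 rather than 'S1' so the elements come out as integers; assumes Python 2's cStringIO, as in the question):
import numpy as np
import cStringIO

c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')
s = c.getvalue()                      # note: this call may itself copy
b = np.frombuffer(s, dtype=np.uint8)  # wraps the string's bytes, no further copy
# b is array([1, 0, 0, 0, 1, 0, 0, 0], dtype=uint8), and b.flags.writeable is False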
The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns a line at a time, similarly to files:
>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']
Such a long string cannot be converted to a single integer.
***
It's best not to worry too much about making copies unless it really turns out to be a problem. The reason is that the extra overhead of, say, using a generator and passing it to numpy.fromiter may actually be larger than that of constructing a list and passing it to numpy.array: the copies may be cheap compared to the Python runtime overhead.
However, if the issue is memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it; if the size is unknown, you can use the array's .resize() method to grow it as needed, as in the sketch below.
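A minimal sketch of the grow-as-needed approach (the source iterable here is just a stand-in for the real data source):
import numpy as np

def collect(source, chunk=1024):
    arr = np.empty(chunk, dtype=np.int64)
    n = 0
    for value in source:
        if n == arr.size:
            arr.resize(2 * arr.size)  # grows the buffer, preserving existing data
        arr[n] = value
        n += 1
    arr.resize(n)  # trim the unused tail
    return arr

b = collect(iter([1, 0, 0, 0, 1, 0, 0, 0]))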

Related

cython insert element in array.array

I am trying to convert some Python code into Cython. In the Python code I use data of type
array.array('i', [...]) and use the method array.insert to insert an element at a specific index. In Cython, however, when I try to insert an element using the same method I get this error: BufferError: cannot resize an array that is exporting buffers
basically:
from cpython cimport array
cdef array.array[int] a = array.array('i', [1,2,3,3])
a.insert(1, 5)  # insert 5 at index 1 -> throws the error above
I have been looking at cyappend3 from this answer, but I am using libcpp and am not sure I understand the magic written there.
Any idea how to insert an element at a specific index in an array.array?
Partial answer:
BufferError: cannot resize an array that is exporting buffers
This is telling you that you have a memoryview (or similar) of the array somewhere. It isn't possible to resize it because that memoryview is looking directly into that array's data and resizing the array will require reallocating the data. You can replicate this error in Python too if you do view = memoryview(arr) before you try to insert.
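For example, the same error reproduced in plain Python:
from array import array

a = array('i', [1, 2, 3, 3])
view = memoryview(a)  # exports a buffer into a's data
a.insert(1, 5)        # BufferError: cannot resize an array that is exporting buffers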
In your case:
cdef array.array[int] a = array.array('i', [1,2,3,3])
cdef array.array[int] a defines an array with a fast buffer into its elements, and it's this exported buffer that prevents you from resizing the array. If you just do cdef array.array a, it works fine. Obviously you lose the fast buffer access to individual elements, but that's inherent: you're trying to change the data out from under the buffer.
I strongly recommend you don't resize arrays, though. Not only does an insert involve an O(n) copy of every element of the array; unlike a Python list, array doesn't over-allocate, so even append causes a complete reallocation and copy every time (i.e. it is O(n) rather than amortized O(1)).
Instead I'd suggest keeping the data as a Python list (or maybe something else) until you've finalized the length and only then converting to array.
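A minimal sketch of that build-then-convert pattern:
from array import array

data = []             # list inserts/appends are cheap (amortized O(1) append)
data.append(1)
data.insert(0, 5)
a = array('i', data)  # one O(n) conversion once the length is final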
What has been answered in this post (https://stackoverflow.com/a/74285371/4529589) is correct, and I have the same recommendation.
However, I want to add that if you want to use insert and still have a C-level buffer, you could use std::vector. This will be faster.
from libcpp.vector cimport vector
cdef vector[int] vect = array.array('i', [1,2,3,3])
vect.insert(vect.begin() + 1, 5)
I also recommend that if you use this solution, you drop the array entirely and just use vector initialization from the beginning.

Numpy array: get the raw bytes without copying

I am trying to concatenate the bytes of multiple Numpy arrays into a single bytearray to send in an HTTP POST request.
The most efficient way of doing this that I can think of is to create a sufficiently large bytearray object and then write the bytes of all the numpy arrays into it contiguously.
The code will look something like this:
import numpy as np

list_arr = [np.array([1, 2, 3]), np.array([4, 5, 6])]
total_nb_bytes = sum(a.nbytes for a in list_arr)
cb = bytearray(total_nb_bytes)
# Too Lazy Didn't Do: generate the list of delimiters and information needed
# to decode the concatenated bytes array
# concatenate the bytes
offset = 0
for arr in list_arr:
    _bytes = arr.tobytes()
    cb[offset:offset + len(_bytes)] = _bytes
    offset += len(_bytes)
The method tobytes() isn't a zero-copy method. It will copy the raw data of the numpy array into a bytes object.
In Python, buffers allow access to an object's raw inner data (this is the buffer protocol at the C level); see the Python documentation. numpy had this capability (up to around numpy 1.13) via the getbuffer() method (link), but that method has since been removed!
What is the right way of doing this?
You can make a numpy-compatible buffer out of your message bytearray and write to that efficiently using np.concatenate's out argument.
list_arr = [np.array([1,2,3]), np.array([4,5,6])]
total_nb_bytes = sum(a.nbytes for a in list_arr)
total_size = sum(a.size for a in list_arr)
cb = bytearray(total_nb_bytes)
np.concatenate(list_arr, out=np.ndarray(total_size, dtype=list_arr[0].dtype, buffer=cb))
And sure enough,
>>> cb
bytearray(b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00')
This method requires that your arrays all share the same format. To fix that, view your original arrays as np.uint8:
np.concatenate([a.view(np.uint8) for a in list_arr],
               out=np.ndarray(total_nb_bytes, dtype=np.uint8, buffer=cb))
This way, you don't need to compute total_size either, since you've already computed the number of bytes.
This approach is likely more efficient than looping through the list of arrays. You were right that the buffer protocol is your ticket to a solution. You can create an array object wrapped around the memory of any object supporting the buffer protocol using the low level np.ndarray constructor. From there, you can use all the usual numpy functions to interact with the buffer.
Just use arr.data. This returns a memoryview object which references the array’s memory without copying. It can be indexed and sliced (creating new memoryviews without copying) and appended to a bytearray (copying just once into the bytearray).
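A small sketch of the arr.data route:
import numpy as np

arr = np.arange(6, dtype=np.uint8)
mv = arr.data        # memoryview over the array's memory, no copy
part = mv[:3]        # slicing a memoryview makes another view, still no copy
out = bytearray()
out.extend(part)     # the single copy happens here, into the bytearray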

pythonic way to get first element of multidimensional numpy array

I am looking for a simple, pythonic way to get the first element of a numpy array, no matter its dimension. For example:
For [1,2,3,4] that would be 1
For [[3,2,4],[4,5,6]] it would be 3
Is there a simple, pythonic way of doing this?
Using a direct index:
arr[(0,) * arr.ndim]
The commas in a normal index expression make a tuple. You can pass in a manually-constructed tuple as well.
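For example:
import numpy as np

arr = np.array([[3, 2, 4], [4, 5, 6]])
arr[(0,) * arr.ndim]  # same as arr[0, 0] -> 3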
You can get the same result from np.unravel_index:
arr[np.unravel_index(0, arr.shape)]
On the other hand, using the very tempting arr.ravel()[0] is not always safe. ravel will generally return a view, but if your array is non-contiguous, it will make a copy of the entire thing.
A relatively cheap solution is
arr.flat[0]
flat is an indexable iterator. It will not copy your data.
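For example, on a non-contiguous view where ravel() would have to copy:
import numpy as np

arr = np.array([[3, 2, 4], [4, 5, 6]])[:, ::2]  # a non-contiguous view
arr.flat[0]     # -> 3, no copy
arr.ravel()[0]  # also 3, but ravel() copies the whole array here first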
Consider using .item, for example:
a = np.identity(3)
a.item(0)
# 1.0
But note that unlike regular indexing .item strives to return a native Python object, so for example an np.uint8 will be returned as plain int.
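For instance:
import numpy as np

a = np.arange(4, dtype=np.uint8)
type(a[0])       # numpy.uint8
type(a.item(0))  # int: .item converts to a native Python object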
If that's acceptable this method seems a bit faster than other methods:
from timeit import timeit
timeit(lambda: a.flat[0])
# 0.3602013469208032
timeit(lambda: a[a.ndim * (0,)])
# 0.3502263119444251
timeit(lambda: a.item(0))
# 0.2366882530041039

Python `ctypes` - How to copy buffer returned by C function into a bytearray

A pointer to a buffer of type POINTER(c_ubyte) is returned by the C function (the image_data variable in the following code). I want this data to be managed by Python, so I want to copy it into a bytearray. Here's the function call:
image_data = stb_image.stbi_load(filename_cstr, byref(width),
byref(height), byref(num_channels),
c_int(expected_num_channels))
We only get to know the width and height of the image after that call, so we can't pre-allocate a bytearray.
I would have used
array_type = c.c_ubyte * (num_channels.value * width.value * height.value)
image_data_bytearray = bytearray(cast(image_data, array_type))
But the type to cast to must be a pointer, not an array, so I get an error:
TypeError: cast() argument 2 must be a pointer type, not c_ubyte_Array_262144
What should I do?
OK, reading the answer to the question linked in the comments (thanks, @John Zwinck and @eryksun), there are two ways of storing the data: in a bytearray or in a numpy.array. In all these snippets, image_data is of type POINTER(c_ubyte), and we have array_type defined as
array_type = c_ubyte * (num_channels.value * width.value * height.value)
We can create a bytearray first and then loop over the buffer and set the bytes:
array_size = num_channels.value * width.value * height.value
arr_bytes = bytearray(array_size)
for i in range(array_size):
    arr_bytes[i] = image_data[i]
Or a better way is to create a C array instance using from_address and then initialize a bytearray with it:
image_data_carray = array_type.from_address(addressof(image_data.contents))
# Copy into bytearray
image_data_bytearray = bytearray(image_data_carray)
And when writing the image (not part of the question, just shared for completeness), we can obtain a pointer to the bytearray's data like this and give it to stbi_write_png:
image_data_carray = array_type.from_buffer(image_data_bytearray)
image_data = cast(image_data_carray, POINTER(c_ubyte))
The numpy-based way of doing it is as answered in the linked question:
address = addressof(image_data.contents)
image_data_ptr = np.ctypeslib.as_array(array_type.from_address(address))
This alone, however, only points at the memory returned by the C function; it doesn't copy it into a Python-managed array object. We can copy it by creating a numpy array:
image_data = np.array(image_data_ptr)
To confirm I have done an assert all(arr_np == arr_bytes) there. And arr_np.dtype is uint8.
And during writing the image, we can obtain a pointer to the numpy array's data like this
image_data = image_data_numpy.ctypes.data_as(POINTER(c_ubyte))
Your variable array_type shouldn't really be called that: it is not an initialized C array, but a ctypes type object prepared for creating one.
(Well, an initialized array shouldn't be called array_type either. :D)
You should be doing the equivalent of:
unsigned char array[channels*width*height];
in C. Then array is a pointer to channels*width*height unsigned chars, pointing at the first byte of the array (index 0).
cast() expects a pointer type as its second argument, so it knows how to view the data. So doing:
array = (c.c_ubyte * (channels * width * height))()
should do the trick. But you don't need the extra allocated memory, so you can instead create a pointer as suggested in a comment.
But I suggest you use:
image_data = bytearray(c.string_at(image_data))
It should work, assuming, of course, that the returned image is null-terminated. (string_at() nominally deals in chars, but whether they are signed doesn't matter here.)
If you wrote the C portion, just allocate one extra byte for the memory that will contain the image (declared/cast to unsigned chars) and set the last item to 0.
Then leave the algorithm to work as before. If you do not null-terminate the buffer, string_at() will keep reading until it happens to hit a zero byte, so you may get truncated data or a few stray bytes past the end. Very undesirable.
I used this trick in my C module for colorspace conversion. It works extremely fast, as there are no loops or anything extra: string_at() just pulls in the buffer and creates a Python string from it.
Then you can use numpy.fromstring(...), array.array('B', image_data), bytearray() as above, etc.
Otherwise, well, I saw your answer just now. You can do it as you wrote, but I think my dirty trick is better (if you can change the C code, of course).
P.S. Whoops! I just saw in a docstring that string_at() takes an optional size argument. Using it ignores termination entirely, so there wouldn't be any of the issues above. I am asking myself now why I didn't use it in my project and instead messed around with null termination.
Perhaps out of laziness. Using size shouldn't require any modification to the C code. So it would be:
image_data = bytearray(c.string_at(image_data, channels * width * height))

Copying bytes in Python from Numpy array into string or bytearray

I am reading data from a UDP socket in a while loop. I need the most efficient way to
1) read the data (*) (that's kind of solved, but comments are appreciated);
2) dump the (manipulated) data periodically to a file (**) (the question).
I am anticipating a bottleneck in numpy's tostring method. Let's consider the following piece of (incomplete) code:
import socket
import numpy

nbuf = 4096
buf = numpy.zeros(nbuf, dtype=numpy.uint8)  # i.e., an array of bytes
f = open('dump.data', 'w')
datasocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# ETC.. (code missing here) .. the datasocket is, of course, non-blocking

while True:
    gotsome = True
    try:
        N = datasocket.recv_into(buf)  # no memory allocation here .. (*)
    except socket.error:
        # do nothing ..
        gotsome = False
    if gotsome:
        # the bytes in "buf" will be manipulated in various ways ..
        # the following write is done frequently (not necessarily in each pass of the while loop):
        f.write(buf[:N].tostring())  # (**) The question: what is the most efficient way to do this?

f.close()
Now, at (**), as I understand it:
1) buf[:N] allocates memory for a new array object, having the length N+1, right? (maybe not)
.. and after that:
2) buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
That seems like a lot of memory allocation and copying. In this same loop, in the future, I will read several sockets and write into several files.
Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?
I.e., to do this in the spirit of the buffer interface and avoid those two extra memory allocations?
P. S. f.write(buf[:N].tostring()) is equivalent to buf[:N].tofile(f)
Basically, it sounds like you want to use the array's tofile method or directly use the ndarray.data buffer object.
For your exact use-case, using the array's data buffer is the most efficient, but there are a lot of caveats that you need to be aware of for general use. I'll elaborate in a bit.
However, first let me answer a couple of your questions and provide a bit of clarification:
buf[:N] allocates memory for a new array object, having the length N+1, right?
It depends on what you mean by "new array object". Very little additional memory is allocated, regardless of the size of the arrays involved.
It does allocate memory for a new array object (a few bytes), but it does not allocate additional memory for the array's data. Instead, it creates a "view" that shares the original array's data buffer. Any changes you make to y = buf[:N] will affect buf as well.
buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
Yes, that's correct.
On a side note, you can actually go the opposite way (string to array) without allocating any additional memory:
somestring = 'This could be a big string'
arr = np.frombuffer(buffer(somestring), dtype=np.uint8)
However, because python strings are immutable, arr will be read-only.
Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?
Yep!
Basically, you'd want:
f.write(buf[:N].data)
This is very efficient and will work for any file-like object. It's almost definitely what you want in this exact case. However, there are several caveats!
First off, note that N counts items in the array, not bytes. They're equivalent in your example code (because dtype=np.uint8 is an 8-bit datatype).
If you did want to write a number of bytes, you could do
f.write(buf.data[:N])
...but slicing the arr.data buffer will allocate a new string, so it's functionally similar to buf[:N].tostring(). At any rate, be aware that f.write(buf[:N].tostring()) is different from f.write(buf.data[:N]) for most dtypes, though both will allocate a new string.
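To illustrate the element-versus-byte distinction with a wider dtype:
import numpy as np

buf = np.arange(4, dtype=np.int32)
len(buf[:2].tostring())  # 8: the bytes of the first two *elements*
len(buf.data[:2])        # 2: the first two *bytes* of the buffer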
Next, numpy arrays can share data buffers. In your example case, you don't need to worry about this, but in general, using somearr.data can lead to surprises for this reason.
As an example:
x = np.arange(10, dtype=np.uint8)
y = x[::2]
Now, y shares the same memory buffer as x, but it's not contiguous in memory (have a look at x.flags vs y.flags). Instead it references every other item in x's memory buffer (compare x.strides to y.strides).
If we try to access y.data, we'll get an error telling us that this is not a contiguous array in memory, and we can't get a single-segment buffer for it:
In [5]: y.data
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-364eeabf8187> in <module>()
----> 1 y.data
AttributeError: cannot get single-segment buffer for discontiguous array
This is a large part of the reason that numpy arrays have a tofile method (it also pre-dates python's buffers, but that's another story).
tofile will write the data in the array to a file without allocating additional memory. However, because it's implemented at the C level, it only works for real file objects, not file-like objects (e.g. a socket, StringIO, etc.).
For example:
buf[:N].tofile(f)
This does allow you to use arbitrary array indexing, however:
buf[someslice].tofile(f)
This will make a new view (same memory buffer) and efficiently write it to disk. In your exact case, it will be slightly slower than slicing the arr.data buffer and writing that directly to disk.
If you'd prefer to use array indexing (and not number of bytes) then the ndarray.tofile method will be more efficient than f.write(arr.tostring()).
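Putting it together, a short sketch of the two efficient options for the original loop (N here is a stand-in for the count returned by recv_into):
import numpy as np

buf = np.zeros(4096, dtype=np.uint8)
N = 100  # stand-in for the count returned by recv_into

with open('dump.data', 'wb') as f:
    f.write(buf[:N].data)  # zero-copy: hands the view's memory straight to write
    # or, equivalently for a real file object:
    # buf[:N].tofile(f)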
