I am trying to concatenate the bytes of multiple Numpy arrays into a single bytearray to send it in an HTTP post request.
The most efficient way of doing this, that I can think of, is to create a sufficiently large bytearray object and then write into it the bytes from all the numpy arrays contiguously.
The code will look something like this:
import numpy as np

list_arr = [np.array([1, 2, 3]), np.array([4, 5, 6])]
total_nb_bytes = sum(a.nbytes for a in list_arr)
cb = bytearray(total_nb_bytes)
# TODO: generate a list of delimiters and the information needed to decode the concatenated byte array
# concatenate the bytes
for arr in list_arr:
    _bytes = arr.tobytes()
    cb.extend(_bytes)
The method tobytes() isn't a zero-copy method. It will copy the raw data of the numpy array into a bytes object.
In Python, buffers allow access to an object's inner raw data (this is called the buffer protocol at the C level; see the Python documentation); numpy had this capability up to numpy 1.13, through a method called getbuffer() (link). Yet, this method is deprecated!
What is the right way of doing this?
You can make a numpy-compatible buffer out of your message bytearray and write to that efficiently using np.concatenate's out argument.
list_arr = [np.array([1,2,3]), np.array([4,5,6])]
total_nb_bytes = sum(a.nbytes for a in list_arr)
total_size = sum(a.size for a in list_arr)
cb = bytearray(total_nb_bytes)
np.concatenate(list_arr, out=np.ndarray(total_size, dtype=list_arr[0].dtype, buffer=cb))
And sure enough,
>>> cb
bytearray(b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00')
This method assumes that all your arrays have the same dtype. To handle arrays of mixed dtypes, view your original arrays as np.uint8:
np.concatenate([a.view(np.uint8) for a in list_arr],
               out=np.ndarray(total_nb_bytes, dtype=np.uint8, buffer=cb))
This way, you don't need to compute total_size either, since you've already computed the number of bytes.
This approach is likely more efficient than looping through the list of arrays. You were right that the buffer protocol is your ticket to a solution. You can create an array object wrapped around the memory of any object supporting the buffer protocol using the low level np.ndarray constructor. From there, you can use all the usual numpy functions to interact with the buffer.
Just use arr.data. This returns a memoryview object which references the array’s memory without copying. It can be indexed and sliced (creating new memoryviews without copying) and appended to a bytearray (copying just once into the bytearray).
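For instance, a minimal sketch of that approach (reusing list_arr from the question; arr.data is the memoryview that gets copied into the bytearray):
import numpy as np

list_arr = [np.array([1, 2, 3]), np.array([4, 5, 6])]
cb = bytearray()
for arr in list_arr:
    cb += arr.data  # buffer-protocol append: the raw bytes are copied once, into cb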
Related
I am trying to convert some Python code into Cython. In the Python code I use data of type
array.array('i', [...]) and use the method array.insert to insert an element at a specific index. In Cython, however, when I try to insert an element using the same method I get this error: BufferError: cannot resize an array that is exporting buffers
basically:
from cpython cimport array
import array

cdef array.array[int] a = array.array('i', [1, 2, 3, 3])
a.insert(1, 5)  # insert 5 at index 1 -> throws BufferError
I have been looking at cyappend3 from this answer, but I am using libcpp and I'm not sure I understand the magic written there.
Any idea how to insert an element at a specific index in an array.array?
Partial answer:
BufferError: cannot resize an array that is exporting buffers
This is telling you that you have a memoryview (or similar) of the array somewhere. It isn't possible to resize it because that memoryview is looking directly into that array's data and resizing the array will require reallocating the data. You can replicate this error in Python too if you do view = memoryview(arr) before you try to insert.
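For example, the following plain-Python snippet reproduces the same error with the standard array module:
from array import array

a = array('i', [1, 2, 3, 3])
view = memoryview(a)  # the memoryview exports a buffer over a's data
a.insert(1, 5)        # raises BufferError: cannot resize an array that is exporting buffers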
In your case:
cdef array.array[int] a = array.array('i', [1,2,3,3])
cdef array.array[int] a defines the array together with a fast buffer into its elements, and it's this buffer that prevents you from resizing the array. If you just do cdef array.array a it works fine. Obviously you lose the fast buffer access to individual elements, but that's the trade-off: you can't keep a buffer into the data and also change that data out from under it.
I strongly recommend you don't resize arrays, though. Not only does it involve an O(n) copy of every element of the array; also, unlike a Python list, array doesn't over-allocate, so even append causes a complete reallocation and copy every time (i.e. it is O(n) rather than amortized O(1)).
Instead I'd suggest keeping the data as a Python list (or maybe something else) until you've finalized the length and only then converting to array.
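For instance, a plain-Python sketch of that pattern (the values are arbitrary placeholders):
from array import array

items = [1, 2, 3, 3]     # build the data up in a list: appends/inserts are cheap
items.insert(1, 5)       # no exported buffer here, so this is fine
a = array('i', items)    # convert once, when the length is final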
What has been answered in this post (https://stackoverflow.com/a/74285371/4529589) is correct, and I have the same recommendation.
However, I want to add that if you want to use insert and still declare the variable with a C-level buffer, you could use std::vector. This will be faster.
from cpython cimport array
import array
from libcpp.vector cimport vector

cdef vector[int] vect = array.array('i', [1, 2, 3, 3])
vect.insert(vect.begin() + 1, 5)  # insert 5 at index 1
I also recommend that, if you go with this solution, you just drop array.array and use the vector initialization from the beginning.
I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt that it is working deterministically, at least for arrays of dtype=object.
import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'
In the sample code, a list of mixed python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().
Why do the resulting byte-representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw python bytes, but it does not refer to any limitations. So far, I observed this just for dtype=object. Numeric arrays always yield the same byte sequence:
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh#l\xee?Qg\x1e\x8f~l\xe7?'
Have I missed something elementary about python's/numpy's memory architecture? I tested with numpy version 1.17.2 on a Mac.
Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I hoped that I can rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.
An array of dtype object stores pointers to the objects it contains. In CPython, this corresponds to the id. Every time you create a new list, it will be allocated at a new memory address. However, small integers are interned, so 1 will reference the same integer object every time.
You can see exactly how this works by checking the IDs of some sample objects:
>>> x = np.array([1, [2]])
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1) # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little') # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little') # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'
As you can see, it is quite deterministic, but it serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a view or copy of the underlying buffer. The contents of the buffer are what's throwing you off.
Since you mentioned that you are computing hashes, keep in mind that there is a reason that python lists are unhashable. You can have lists that are equal at one time and different at another. Using IDs is generally not a good idea for an effective hash.
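For numeric dtypes, by contrast, the buffer holds the values themselves, so hashing the raw bytes is repeatable; a small sketch (sha256 is just an example choice, and note that dtype, shape and byte order all affect the digest):
import hashlib
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])
# identical values and dtype give identical raw bytes, hence identical digests
print(hashlib.sha256(a.tobytes()).hexdigest() == hashlib.sha256(b.tobytes()).hexdigest())  # True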
I am reading data from a UDP socket in a while loop. I need the most efficient way to
1) Read the data (*) (that's kind of solved, but comments are appreciated)
2) Dump the (manipulated) data periodically in a file (**) (The Question)
I am anticipating a bottleneck in numpy's tostring method. Let's consider the following piece of (incomplete) code:
import socket
import numpy

nbuf = 4096
buf = numpy.zeros(nbuf, dtype=numpy.uint8)  # i.e., an array of bytes
f = open('dump.data', 'w')

datasocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# ETC.. (code missing here) .. the datasocket is, of course, non-blocking

while True:
    gotsome = True
    try:
        N = datasocket.recv_into(buf)  # no memory allocation here .. (*)
    except socket.error:
        # do nothing ..
        gotsome = False
    if gotsome:
        # the bytes in "buf" will be manipulated in various ways ..
        # the following write is done frequently (not necessarily in each pass of the while loop):
        f.write(buf[:N].tostring())  # (**) The question: what is the most efficient way to do this?

f.close()
Now, at (**), as I understand it:
1) buf[:N] allocates memory for a new array object, having the length N+1, right? (maybe not)
.. and after that:
2) buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
That seems like a lot of memory allocation & swapping. In this same loop, in the future, I will read several sockets and write into several files.
Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?
I.e., to do this in the spirit of the buffer interface and avoid those two extra memory allocations?
P. S. f.write(buf[:N].tostring()) is equivalent to buf[:N].tofile(f)
Basically, it sounds like you want to use the array's tofile method or directly use the ndarray.data buffer object.
For your exact use-case, using the array's data buffer is the most efficient, but there are a lot of caveats that you need to be aware of for general use. I'll elaborate in a bit.
However, first let me answer a couple of your questions and provide a bit of clarification:
buf[:N] allocates memory for a new array object, having the length N+1, right?
It depends on what you mean by "new array object". Very little additional memory is allocated, regardless of the size of the arrays involved.
It does allocate memory for a new array object (a few bytes), but it does not allocate additional memory for the array's data. Instead, it creates a "view" that shares the original array's data buffer. Any changes you make to y = buf[:N] will affect buf as well.
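A quick illustration of that sharing, with a small throwaway buf:
import numpy as np

buf = np.zeros(8, dtype=np.uint8)
y = buf[:4]      # a view: no copy of the data
y[0] = 255
print(buf[0])    # 255 -- the original array sees the change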
buf[:N].tostring() allocates memory for a new string, and the bytes from buf are copied into this string
Yes, that's correct.
On a side note, you can actually go the opposite way (string to array) without allocating any additional memory:
somestring = 'This could be a big string'
arr = np.frombuffer(buffer(somestring), dtype=np.uint8)
However, because python strings are immutable, arr will be read-only.
Is there a way to just tell f.write to access directly the memory address of "buf" from 0 to N bytes and write them onto the disk?
Yep!
Basically, you'd want:
f.write(buf[:N].data)
This is very efficient and will work for any file-like object. It's almost definitely what you want in this exact case. However, there are several caveats!
First off, note that N counts items in the array, not bytes. They're equivalent in your example code (due to dtype=np.uint8, or any other 8-bit datatype).
If you did want to write a number of bytes, you could do
f.write(buf.data[:N])
...but slicing the buf.data buffer will allocate a new string, so it's functionally similar to buf[:N].tostring(). At any rate, be aware that f.write(buf[:N].tostring()) is different from f.write(buf.data[:N]) for most dtypes, but both will allocate a new string.
Next, numpy arrays can share data buffers. In your example case, you don't need to worry about this, but in general, using somearr.data can lead to surprises for this reason.
As an example:
x = np.arange(10, dtype=np.uint8)
y = x[::2]
Now, y shares the same memory buffer as x, but it's not contiguous in memory (have a look at x.flags vs y.flags). Instead it references every other item in x's memory buffer (compare x.strides to y.strides).
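Concretely, the strides and flags show the difference:
x.strides, y.strides                              # ((1,), (2,)) -- y skips every other byte
x.flags['C_CONTIGUOUS'], y.flags['C_CONTIGUOUS']  # (True, False)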
If we try to access y.data, we'll get an error telling us that this is not a contiguous array in memory, and we can't get a single-segment buffer for it:
In [5]: y.data
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-364eeabf8187> in <module>()
----> 1 y.data
AttributeError: cannot get single-segment buffer for discontiguous array
This is a large part of the reason that numpy arrays have a tofile method (it also pre-dates python's buffers, but that's another story).
tofile will write the data in the array to a file without allocating additional memory. However, because it's implemented at the C-level it only works for real file objects, not file-like objects (e.g. a socket, StringIO, etc).
For example:
buf[:N].tofile(f)
This does allow you to use arbitrary array indexing, however.
buf[someslice].tofile(f)
Will make a new view (same memory buffer), and efficiently write it to disk. In your exact case, it will be slightly slower than slicing the arr.data buffer and directly writing it to disk.
If you'd prefer to use array indexing (and not number of bytes) then the ndarray.tofile method will be more efficient than f.write(arr.tostring()).
This is to understand things better; it is not an actual problem that I need to fix. A cStringIO object is supposed to emulate a string, a file, and also an iterator over the lines. Does it also emulate a buffer? In any case, ideally one should be able to construct a numpy array as follows:
import numpy as np
import cStringIO

c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')
#Trying the iterator abstraction
b = np.fromiter(c,int)
# The above fails with: ValueError: setting an array element with a sequence.
#Trying the file abstraction
b = np.fromfile(c,int)
# The above fails with: IOError: first argument must be an open file
#Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number
#Trying the string abstraction
b = np.fromstring(c)
#The above fails with: TypeError: argument 1 must be string or read-only buffer
b = np.fromstring(c.getvalue(), int) # does work
My question is: why does it behave this way?
The practical problem where this came up is the following: I have an iterator which yields tuples. I am interested in making a numpy array from one of the components of each tuple with as little copying and duplication as possible. My first cut was to keep writing the interesting component of each yielded tuple into a StringIO object and then use its memory buffer for the array. I can of course use getvalue(), but that will create and return a copy. What would be a good way to avoid the extra copying?
The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python, single characters and strings have the same type — numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.
The other problem is that a cStringIO iterates over its lines, not its characters.
Something like the following iterator should get around both of these problems:
def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)
Use it like so (note the seek!):
c.seek(0)
b = np.fromiter(chariter(c), int)
As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.
If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.
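As an illustration that frombuffer itself is zero-copy, here is a small sketch on a plain byte string (dtype=np.int32 is assumed for the 4-byte little-endian layout of the question's data, on a little-endian machine):
import numpy as np

raw = b'\x01\x00\x00\x00\x01\x00\x00\x00'  # same payload as in the question
b = np.frombuffer(raw, dtype=np.int32)     # wraps the existing buffer, no copy of the data
print(b)                                   # [1 1]
print(b.flags.writeable)                   # False: the underlying bytes are immutable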
The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns a line at a time, similarly as files:
>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']
Such a long string cannot be converted to a single integer.
***
It's best not to worry too much about making copies unless it really turns out to be a problem. The reason is that the extra overhead of, e.g., using a generator and passing it to numpy.fromiter may actually be larger than what is involved in constructing a list and then passing that to numpy.array: making the copies may be cheap compared to the Python runtime overhead.
However, if the issue is with memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it. If the size is unknown, you can use the .resize() method in the array to grow it as needed.
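A rough sketch of that pattern (range(1000) stands in for the real data source, and the initial size is an arbitrary guess):
import numpy as np

out = np.empty(256, dtype=np.int64)               # pre-allocated guess
n = 0
for value in range(1000):                         # placeholder for the real iterator
    if n == out.size:
        out.resize(2 * out.size, refcheck=False)  # enlarge when full; may reallocate and copy
    out[n] = value
    n += 1
out = out[:n]                                     # trim to the filled part (a view, no copy)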
I'm generating many largish 'random' files (~500MB) in which the contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs to that buffer, and periodically flush that buffer to disk. I am currently using array.array() but I can't see a way to create a view into this buffer. I need to do this so that I can feed the part of the buffer with valid data into hashlib.update(...) and to write the valid part of the buffer to the file. I could use the slice operator but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700MB of data in 11s, while my python equivalent using numpy takes around 700s! It's hard to believe that that's the difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy at least supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view of the original array that was created.
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the memory with the data all as 8-bit integers (the data itself remains unchanged, so each 64-bit int now appears as 8 8-bit ints). These buffer objects (from a.data) are standard python buffer objects and so can be used in all the places that are defined to work with buffers.
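Tying this back to your use case, here is a sketch of feeding the valid part of a buffer to hashlib and to a file without an intermediate string (sha1 and the file name are just illustrative choices):
import hashlib
import numpy

a = numpy.random.randint(0, 10, size=10)
valid = a[:6]                    # a view over the valid part, no copy
h = hashlib.sha1()
h.update(valid.data)             # hash straight from the buffer
with open('dump.bin', 'wb') as f:
    f.write(valid.data)          # write the same buffer to disk, no tostring()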
The same is true for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
raises an error about being unable to get a single-segment buffer for a discontiguous array. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory as it's a view into the other array, which need not be (and in this case isn't) a view of contiguous memory. You can get info about the array using the flags attribute on an array:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (in 2D arrays, at most one of them can be true).
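To make that concrete, a quick check of those flags (values shown for a C-ordered 10x10 array as above):
import numpy

a = numpy.random.randint(0, 10, size=(10, 10))
print(a[:, 3].flags['C_CONTIGUOUS'], a[:, 3].flags['F_CONTIGUOUS'])  # False False
print(a[3, :].flags['C_CONTIGUOUS'], a[3, :].flags['F_CONTIGUOUS'])  # True True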