Best Way to Append C Arrays (or any arrays) in Cython - python

I have a function in an inner loop that takes two arrays and combines them. To get a feel for what it's doing look at this example using lists:
a = [[1,2,3]]
b = [[4,5,6],
     [7,8,9]]

def combinearrays(a, b):
    a = a + b
    return a

def main():
    print(combinearrays(a, b))
The output would be:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
The key thing here is that I always have the same number of columns, but I want to append rows together. Also, the values are always ints.
As an added wrinkle, I cheated and created a as a list within a list. But in reality, it might be a single dimensional array that I want to still combine into a 2D array.
I am currently doing this using Numpy in real life (i.e. not the toy problem above) and this works. But I really want to make this as fast as possible, and it seems like C arrays should be faster. Obviously one problem with C arrays, if I pass them as parameters, is that I won't know the actual number of rows in the arrays passed. But I can always add additional parameters to pass that.
So it's not like I don't have a solution to this problem using Numpy, but I really want to know what the single fastest way to do this is in Cython. Since this is a call inside an inner loop, it's going to get called thousands of times. So every little savings is going to count big.
One obvious idea here would be to use malloc or something like that.

While I'm not convinced this is the only option, let me recommend the simple option of building a standard Python list using append and then using np.vstack or np.concatenate at the end to build a full Numpy array.
Numpy arrays store all the data essentially contiguously in memory (this isn't 100% true if you're taking slices, but for freshly allocated memory it's basically true). When you resize the array it may get lucky, have unallocated memory after the array, and be able to reallocate in place. However, in general this won't happen and the entire contents of the array will need to be copied to the new location. (This will likely apply to any solution you devise yourself with malloc/realloc.)
Python lists are good for two reasons:
1. They are internally a list of PyObject* (in this case, pointers to the Numpy arrays it contains). If copying is needed during the resize you are only copying the pointers to the arrays, and not the whole arrays.
2. They are designed to handle resizing/appending intelligently by over-allocating the space needed, so that they need only re-allocate more memory occasionally. Numpy arrays could have this feature, but it's less obviously a good thing for Numpy than it is for Python lists (if you have a 10GB data array that barely fits in memory, do you really want it over-allocated?)
My proposed solution uses the flexible, easily-resized list class to build your array, and then only finalizes to the inflexible but faster Numpy array at the end, therefore (largely) getting the best of both.
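A minimal, untested sketch of that approach (compute_row is a hypothetical stand-in for whatever generates each row):
import numpy as np

def compute_row(i, row_length=3):
    # hypothetical stand-in for whatever produces one row of ints
    return np.arange(i, i + row_length)

rows = []
for i in range(1000):              # the inner loop that gets called thousands of times
    rows.append(compute_row(i))    # cheap: the list only stores a pointer to each row

result = np.vstack(rows)           # one contiguous (1000, 3) int array, built once at the end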
A completely untested outline of the same structure using C to allocate would look like:
import numpy as np
from libc.stdlib cimport malloc, free, realloc

def build_array(int num_rows, int row_length):
    # hypothetical wrapper so the outline compiles; num_rows/row_length come from your caller
    cdef int** ptr_array = NULL
    cdef int* current_row = NULL
    # just to be able to return a numpy array
    cdef int[:,::1] out

    rows_allocated = 0
    try:
        for row in range(num_rows):
            ptr_array = <int**>realloc(ptr_array, sizeof(int*)*(row+1))
            current_row = ptr_array[row] = <int*>malloc(sizeof(int)*row_length)
            rows_allocated = row+1
            # fill in data on current_row

        # pass to numpy so we can access in Python. There are other
        # ways of transferring the data to Python...
        out = np.empty((rows_allocated, row_length), dtype=np.intc)
        for row in range(rows_allocated):
            for n in range(row_length):
                out[row, n] = ptr_array[row][n]
        return out.base
    finally:
        # clean up memory we have allocated
        for row in range(rows_allocated):
            free(ptr_array[row])
        free(ptr_array)
This is unoptimized - a better version would over-allocate ptr_array to avoid resizing each time. Because of this I don't actually expect it to be quick, but it's meant as an indication of how to start.
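For what it's worth, a rough sketch of that over-allocation, meant to replace the per-row realloc in the loop above (capacity is an extra int variable you would declare alongside rows_allocated):
# grow ptr_array geometrically rather than one row at a time
if row >= capacity:
    capacity = capacity * 2 if capacity > 0 else 4
    ptr_array = <int**>realloc(ptr_array, sizeof(int*) * capacity)
current_row = ptr_array[row] = <int*>malloc(sizeof(int) * row_length)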

Related

cython insert element in array.array

I am trying to convert some Python code into Cython. In the Python code I use data of type
array.array('i', [...]) and use the method array.insert to insert an element at a specific index. In Cython, however, when I try to insert an element using the same method I get this error: BufferError: cannot resize an array that is exporting buffers
basically:
from cpython cimport array
cdef array.array[int] a = array.array('i', [1,2,3,3])
a.insert(1, 5)  # insert 5 at index 1 -> throws error
I have been looking at cyappend3 of this answer but I am using libcpp and not sure I understand the magic written there.
Any idea how to insert an element at a specific index in an array.array?
Partial answer:
BufferError: cannot resize an array that is exporting buffers
This is telling you that you have a memoryview (or similar) of the array somewhere. It isn't possible to resize it because that memoryview is looking directly into that array's data and resizing the array will require reallocating the data. You can replicate this error in Python too if you do view = memoryview(arr) before you try to insert.
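For example, a minimal pure-Python reproduction:
import array

arr = array.array('i', [1, 2, 3, 3])
view = memoryview(arr)   # exports a buffer into arr's data
arr.insert(1, 5)         # raises BufferError: cannot resize an array that is exporting buffers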
In your case:
cdef array.array[int] a = array.array('i', [1,2,3,3])
cdef array.array[int] a is defining an array with a fast buffer to elements of the array, and it's this buffer that prevents you from resizing it. If you just do cdef array.array a it works fine. Obviously you lose the fast buffer access to individual elements, but that's because you're trying to change the data out from under the buffer.
I strongly recommend you don't resize arrays though. Not only does it involve an O(n) copy of every element of the array; also, unlike a Python list, array doesn't over-allocate, so even append causes a complete reallocation and copy every time (i.e. it is O(n) rather than amortized O(1)).
Instead I'd suggest keeping the data as a Python list (or maybe something else) until you've finalized the length and only then converting to array.
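A minimal sketch of that pattern (the values are just placeholders):
import array

data = []                      # plain Python list while the length is still changing
for value in (1, 2, 3, 3):
    data.append(value)         # amortized O(1) appends
data.insert(1, 5)              # cheap pointer shuffling, no exported buffer to worry about

a = array.array('i', data)     # one final O(n) conversion once the length is fixed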
What has been answered in this post (https://stackoverflow.com/a/74285371/4529589) is correct, and I have the same recommendation.
However, I want to add that if you want to use insert and still keep a typed C buffer, you could use std::vector. This will be faster.
from libcpp.vector cimport vector
cdef vector[int] vect = array.array('i', [1,2,3,3])
vect.insert(vect.begin() + 1, 5)
I also recommend that if you go with this solution, you drop array.array entirely and just build the vector directly from the beginning.
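A hedged sketch of what that could look like (the function name is illustrative):
# distutils: language = c++
from libcpp.vector cimport vector

def build():
    cdef vector[int] vect = [1, 2, 3, 3]   # Cython coerces the Python list into the vector
    vect.insert(vect.begin() + 1, 5)       # insert 5 at index 1
    return vect                            # coerced back to a Python list on return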

Fill up a 2D array while iterating through it

An example what I want to do is instead of doing what is shown below:
Z_old = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]
for each_axes in range(len(Z_old)):
    for each_point in range(len(Z_old[each_axes])):
        Z_old[len(Z_old)-1-each_axes][each_point] = arbitrary_function(each_point, each_axes)
I now want to not initialize the Z_old array with zeroes but rather fill it with values while iterating through it, which would look something like the code below, although its syntax is horribly wrong; that's what I want to reach in the end.
Z = np.zeros((len(x_list), len(y_list))) for Z[len(x_list) -1 - counter_1][counter_2] is equal to power_at_each_point(counter_1, counter_2] for counter_1 in range(len(x_list)) and counter_2 in range(len(y_list))]
As I explained in my answer to your previous question, you really need to vectorize arbitrary_function.
You can do this by just calling np.vectorize on the function, something like this:
Z = np.vectorize(arbitrary_function)(np.arange(3), np.arange(5).reshape(5, 1))
But that will only give you a small speedup. In your case, since arbitrary_function is doing a huge amount of work (including opening and parsing an Excel spreadsheet), it's unlikely to make enough difference to even notice, much less to solve your performance problem.
The whole point of using NumPy for speedups is to find the slow part of the code that operates on one value at a time, and replace it with something that operates on the whole array (or at least a whole row or column) at once. You can't do that by looking at the very outside loop, you need to look at the very inside loop. In other words, at arbitrary_function.
In your case, what you probably want to do is read the Excel spreadsheet into a global array, structured in such a way that each step in your process can be written as an array-wide operation on that array. Whether that means multiplying by a slice of the array, indexing the array using your input values as indices, or something completely different, it has to be something NumPy can do for you in C, or NumPy isn't going to help you.
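As a hedged sketch of that idea (power_table here is just a stand-in for the parsed spreadsheet, not your actual data layout):
import numpy as np

power_table = np.arange(15.0).reshape(3, 5)   # hypothetical: the sheet parsed once, outside every loop

x_list = range(3)                             # counter_1 values
y_list = range(5)                             # counter_2 values

rows = np.arange(len(x_list))[:, None]        # counter_1 as a column
cols = np.arange(len(y_list))[None, :]        # counter_2 as a row
Z = power_table[rows, cols][::-1, :]          # one array-wide lookup, plus the len(x_list)-1-counter_1 flip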
If you can't figure out how to do that, you may want to consider not using NumPy, and instead compiling your inner loop with Cython, or running your code under PyPy. You'll still almost certainly need to move the "open and parse a whole Excel spreadsheet" outside of the inner loop, but at least you won't have to figure out how to rethink your problem in terms of vectorized operations, so it may be easier for you.
rows = 10
cols = 10
Z = numpy.array([ arbitrary_function(each_point, each_axes) for each_axes in range(cols) for each_point in range(rows) ]).reshape((rows,cols))
maybe?

Why does hstack() copy data but hsplit() create a view on it?

In NumPy, why does hstack() copy the data from the arrays being stacked:
A, B = np.array([1,2]), np.array([3,4])
C = np.hstack((A,B))
A[0]=99
gives for C:
array([1, 2, 3, 4])
whereas hsplit() creates a view on the data:
a = np.array(((1,2),(3,4)))
b, c = np.hsplit(a,2)
a[0][0]=99
gives for b:
array([[99],
       [ 3]])
I mean: what is the reasoning behind implementing this behaviour (which I find inconsistent and hard to remember)? I accept that this happens because it's coded that way...
Basically the underlying ndarray data structure only has a single pointer to the start of its data's memory and then stride information about how to move through each dimension. If you concatenate two arrays, it won't know how to move from one memory location to the other. On the other hand, if you split an array into two arrays, each can easily store a pointer to the first element (which is somewhere inside the original array).
The basic C implementation is here, and there is a good discussion at:
http://scipy-lectures.github.io/advanced/advanced_numpy/index.html#life-of-ndarray
NumPy generally tries to create views whenever possible, since memory copies are inefficient and can quite quickly eat up a lot of cycles.
hsplit splits the input array into multiple output arrays. The output arrays can each be views into a portion of the original parent array (since they are basically simple slices). Thus, for efficiency, NumPy creates views, instead of copies.
hstack combines two completely separate arrays into a single output array. The underlying array implementation cannot handle two separate data sources in a single array, so there is no way to share the data with the original. Thus, NumPy is forced to create a copy.
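You can verify both claims directly with np.shares_memory, for example:
import numpy as np

a = np.array(((1, 2), (3, 4)))
b, c = np.hsplit(a, 2)
print(np.shares_memory(a, b))   # True: hsplit returned a view into a's data

A, B = np.array([1, 2]), np.array([3, 4])
C = np.hstack((A, B))
print(np.shares_memory(A, C))   # False: hstack had to copy into fresh memory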

How can one efficiently remove a range of rows from a large numpy array?

Given a large 2d numpy array, I would like to remove a range of rows, say rows 10000:10010 efficiently. I have to do this multiple times with different ranges, so I would like to also make it parallelizable.
Using something like numpy.delete() is not efficient, since it needs to copy the array, taking too much time and memory. Ideally I would want to do something like create a view, but I am not sure how I could do this in this case. A masked array is also not an option since the downstream operations are not supported on masked arrays.
Any ideas?
Because of the strided data structure that defines a numpy array, what you want will not be possible without using a masked array. Your best option might be to use a masked array (or perhaps your own boolean array) to mask the deleted rows, and then do a single real delete operation of all the rows to be deleted before passing it downstream.
There isn't really a good way to speed up the delete operation; as you've already alluded to, this kind of deleting requires the data to be copied in memory. The one thing you can do, as suggested by @WarrenWeckesser, is combine multiple delete operations and apply them all at once. Here's an example:
ranges = [(10, 20), (25, 30), (50, 100)]
mask = np.ones(len(array), dtype=bool)

# Update the mask with all the rows you want to delete
for start, end in ranges:
    mask[start:end] = False

# Apply all the changes at once
new_array = array[mask]
It doesn't really make sense to parallelize this because you're just copying stuff in memory, so it will be memory bound anyway; adding more CPUs will not help.
I don't know how fast this is, relative to the above, but say you have a list L of row indices of the rows you wish to keep from array A (by "rows" I mean the first index, for higher dimensional arrays). All other rows will be deleted. We'll let A hold the result.
A = A[np.ix_(L)]
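For instance, with made-up sizes:
import numpy as np

A = np.arange(20).reshape(5, 4)
L = [0, 2, 4]            # indices of the rows to keep
A = A[np.ix_(L)]         # keeps rows 0, 2 and 4; the rest are dropped (this copies)
print(A.shape)           # (3, 4)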

Is there a way to get a view into a python array.array()?

I'm generating many largish 'random' files (~500MB) in which the contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs to that buffer, and periodically flush that buffer to disk. I am currently using array.array() but I can't see a way to create a view into this buffer. I need to do this so that I can feed the part of the buffer with valid data into hashlib.update(...) and to write the valid part of the buffer to the file. I could use the slice operator but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700MB of data in 11s, while my python equivalent using numpy takes around 700s! It's hard to believe that that's the difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy at least supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view of the original array that was created.
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the memory with the data all as 8-bit integers (the data itself remains unchanged, so each 64-bit int now becomes 8 8-bit ints). These buffer objects (from a.data) are standard Python buffer objects and so can be used in all the places that are defined to work with buffers.
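For the buffer-flushing use case in the question, a rough sketch might look like this (the sizes and the hash are placeholders):
import hashlib
import numpy as np

buf = np.zeros(1_000_000, dtype=np.int64)   # preallocated buffer of longs
n_valid = 1000                              # however much of it currently holds real data

h = hashlib.sha256()
h.update(buf[:n_valid].data)                # hash only the valid prefix, without copying

with open('out.bin', 'wb') as f:
    f.write(buf[:n_valid].data)             # likewise write it out without copying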
The same is true for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
returns an error about being unable to get a single-segment buffer for discontiguous arrays. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory as it's a view into the other array, which need not be (and in this case isn't) a view of contiguous memory. You can get info about the array using the flags attribute on an array:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (in 2D arrays, at most one of them can be true).
