In NumPy, why does hstack() copy the data from the arrays being stacked:
A, B = np.array([1,2]), np.array([3,4])
C = np.hstack((A,B))
A[0]=99
gives for C:
array([1, 2, 3, 4])
whereas hsplit() creates a view on the data:
a = np.array(((1,2),(3,4)))
b, c = np.hsplit(a,2)
a[0][0]=99
gives for b:
array([[99],
[ 3]])
I mean: what is the reasoning behind implementing this behaviour (which I find inconsistent and hard to remember)? I accept that this happens because it's coded that way...
Basically the underlying ndarray data structure only has a single pointer to the start of its data's memory and then stride information about how to move through each dimension. If you concatenate two arrays, it won't know how to move from one memory location to the other. On the other hand, if you split an array into two arrays, each can easily store a pointer to the first element (which is somewhere inside the original array).
The basic C implementation is here, and there is a good discussion at:
http://scipy-lectures.github.io/advanced/advanced_numpy/index.html#life-of-ndarray
NumPy generally tries to create views whenever possible, since memory copies are inefficient and can quite quickly eat up a lot of cycles.
hsplit splits the input array into multiple output arrays. The output arrays can each be views into a portion of the original parent array (since they are basically simple slices). Thus, for efficiency, NumPy creates views, instead of copies.
hstack combines two completely separate arrays into a single output array. The underlying array implementation cannot handle two separate data sources in a single array, so there is no way to share the data with the original. Thus, NumPy is forced to create a copy.
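One quick way to see which calls copy and which return views is to check the base attribute and np.shares_memory on the results:

import numpy as np

A, B = np.array([1, 2]), np.array([3, 4])
C = np.hstack((A, B))
print(C.base is None)            # True: hstack allocated fresh memory
print(np.shares_memory(C, A))    # False: C shares nothing with A

a = np.array(((1, 2), (3, 4)))
b, c = np.hsplit(a, 2)
print(b.base is a)               # True: b is a view into a
print(np.shares_memory(b, a))    # True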
I have a function in an inner loop that takes two arrays and combines them. To get a feel for what it's doing look at this example using lists:
a = [[1, 2, 3]]
b = [[4, 5, 6],
     [7, 8, 9]]

def combinearrays(a, b):
    a = a + b
    return a

def main():
    print(combinearrays(a, b))
The output would be:
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
The key thing here is that I always have the same number of columns, but I want to append rows together. Also, the values are always ints.
As an added wrinkle, I cheated and created a as a list within a list. In reality, it might be a single-dimensional array that I still want to combine into a 2D array.
I am currently doing this using Numpy in real life (i.e. not the toy problem above) and this works. But I really want to make this as fast as possible, and it seems like C arrays should be faster. One obvious problem with C arrays is that if I pass them as parameters, I won't know the actual number of rows in the arrays passed. But I can always add additional parameters to pass that.
So it's not like I don't have a solution to this problem using Numpy, but I really want to know what the single fastest way to do this is in Cython. Since this is a call inside an inner loop, it's going to get called thousands of times. So every little savings is going to count big.
One obvious idea here would be to use malloc or something like that.
While I'm not convinced this is the only option, let me recommend the simple option of building a standard Python list using append and then using np.vstack or np.concatenate at the end to build a full Numpy array.
Numpy arrays store all the data essentially contiguously in memory (this isn't 100% true if you're taking slices, but for freshly allocated memory it's basically true). When you resize the array it may get lucky, have unallocated memory right after the array, and be able to reallocate in place. However, in general this won't happen and the entire contents of the array will need to be copied to the new location. (This will likely apply to any solution you devise yourself with malloc/realloc.)
Python lists are good for two reasons:
They are internally an array of PyObject* (in this case, pointers to the Numpy arrays the list contains). If copying is needed during the resize you are only copying the pointers to the arrays, and not the whole arrays.
They are designed to handle resizing/appending intelligently by over-allocating the space needed, so that they need only re-allocate more memory occasionally. Numpy arrays could have this feature, but it's less obviously a good thing for Numpy than it is for Python lists (if you have a 10GB data array that barely fits in memory do you really want it over-allocated?)
My proposed solution uses the flexible, easily-resized list class to build your array, and then only finalizes to the inflexible but faster Numpy array at the end, therefore (largely) getting the best of both.
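A minimal sketch of that approach (the function and variable names are placeholders, not from the question):

import numpy as np

def combine_rows(chunks):
    # chunks: 1-D rows or 2-D blocks, all with the same number of columns
    rows = []
    for chunk in chunks:
        rows.append(np.atleast_2d(chunk))   # cheap: promotes 1-D rows to shape (1, n)
    return np.vstack(rows)                  # single copy at the very end

a = np.array([1, 2, 3])                     # a 1-D row
b = np.array([[4, 5, 6],
              [7, 8, 9]])                   # a 2-D block
print(combine_rows([a, b]))
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]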
A completely untested outline of the same structure using C to allocate would look like:
from libc.stdlib cimport malloc, free, realloc
import numpy as np

def combine_rows(int num_rows, int row_length):
    # wrapper function and argument names are placeholders for the outline
    cdef int** ptr_array = NULL
    cdef int* current_row = NULL
    # just to be able to return a numpy array
    cdef int[:,::1] out
    cdef int row, n
    rows_allocated = 0
    try:
        for row in range(num_rows):
            # grow the array of row pointers by one (no error checking here)
            ptr_array = <int**>realloc(ptr_array, sizeof(int*)*(row+1))
            ptr_array[row] = <int*>malloc(sizeof(int)*row_length)
            current_row = ptr_array[row]
            rows_allocated = row+1
            # fill in data on current_row
        # pass to numpy so we can access it in Python. There are other
        # ways of transferring the data to Python...
        out = np.empty((rows_allocated, row_length), dtype=np.intc)
        for row in range(rows_allocated):
            for n in range(row_length):
                out[row, n] = ptr_array[row][n]
        return out.base
    finally:
        # clean up the memory we have allocated
        for row in range(rows_allocated):
            free(ptr_array[row])
        free(ptr_array)
This is unoptimized - a better version would over-allocate ptr_array to avoid resizing each time. Because of this I don't actually expect it to be quick, but it's meant as an indication of how to start.
As I understand it, a numpy array is an object storing values in a contiguous block of memory, while Python's built-in containers (list, tuple, set, dict) contain references to objects. Basic slicing of a numpy array returns a view containing a specified subset of those values.
On the surface, the view looks like another numpy array (type(aView) returns numpy.ndarray), but its values are not copies but rather the same values as those of the original array; altering the values of a view in-place alters the values of the original array as well.
How does a view do this? I would imagine a view should contain some sort of pointers to the values in the original array, but a couple things give me pause:
The values in the array aren't objects, and I don't know how references can be made to small pieces of the same object.
Creating a copy of an array is much slower than making a view, and if the view is just an array of some sort of pointer values, I would expect both copies and views to be made with about the same speed.
A NumPy array knows its base address, data type, shape, and strides. Most applications don't need to explicitly deal with the strides, but they are what make some of this work. The strides indicate how many bytes must be added to increment a given dimension by one logical unit (e.g. row).
If you start with a 3x3 array of float64 (aka f8) at address 0x1000, and you want a view of the 2x2 subarray that starts at the center element of the original, all you need is to increment the base address by 4 elements (3 for the entire first row, 1 to move from the left of the middle row to its center) and remember that each row starts 24 bytes after the previous one (despite being only 16 bytes long).
Conceptually we go from this:
base=0x1000
shape=(3,3)
strides=(24,8)
dtype='f8'
To this:
base=0x1020 (added 1*24 + 1*8 for [1:,1:] view)
shape=(2,2)
strides=(24,8)
dtype='f8'
And, numbering the original 3x3 elements 0 through 8 in row-major order, the view takes these elements:
. . .
. 4 5
. 7 8
Some flags are adjusted on the view, such as the C_CONTIGUOUS flag which needs to be unset because the view is not a contiguous region anymore.
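You can check this directly; for example, building the 3x3 array with values 0 to 8 (just for readability):

import numpy as np

a = np.array([[0., 1., 2.],
              [3., 4., 5.],
              [6., 7., 8.]])        # 3x3 float64, owns its memory
v = a[1:, 1:]                       # the 2x2 view described above

print(a.strides)                    # (24, 8)
print(v.strides)                    # (24, 8): same strides, shifted base
print(v)                            # [[4. 5.]
                                    #  [7. 8.]]
print(v.base is a)                  # True: v shares a's memory
print(v.flags['C_CONTIGUOUS'])      # False: its rows are no longer back to back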
Strides not only support views of NumPy arrays, but also views of data structures which did not originate in NumPy. For example if you have a C array of structs and the first member of each is a point (x,y), you can construct a view of only these points by setting the stride to the size of the entire struct, despite the dtype being just the two numbers.
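The same stride trick can be seen from Python with a structured array standing in for that C array of structs (the field names here are made up for the example):

import numpy as np

# each 24-byte "struct" holds a point (x, y) plus a field we want to skip
recs = np.zeros(4, dtype=[('x', 'f8'), ('y', 'f8'), ('weight', 'f8')])

xs = recs['x']                      # view of the x field only
print(xs.strides)                   # (24,): steps over the whole record each time
print(np.shares_memory(xs, recs))   # True: no data was copied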
I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them to memory? As far as this view is concerned, slicing and indexing over this view would behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
flist = [f['data'] for f in files] is a list of dataset objects. The actual data lives in the h5 files and remains accessible as long as those files stay open.
When you do
arr = np.concatenate(flist, axis=0)
I imagine concatenate first does
temp = [np.asarray(a) for a in flist]
that is, it constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed a couple of years ago in the linked SO question). I should do some time tests comparing that with
arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the files and is accessed only when you do something more.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory. It is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions, as I demonstrated in the previous answer, but with an access-time cost.
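One rough, untested sketch along those lines (reusing flist from above, and assuming the chunks are taken along axis 0 as in the question): map a requested row range of the virtual concatenation onto the individual datasets, so only one slice per file is ever read into memory:

import numpy as np

lengths = [d.shape[0] for d in flist]
offsets = np.concatenate(([0], np.cumsum(lengths)))   # global start row of each dataset

def read_chunk(start, stop):
    # return rows [start:stop) of the virtual concatenation along axis 0
    pieces = []
    for d, off in zip(flist, offsets[:-1]):
        lo = max(start - off, 0)
        hi = min(stop - off, d.shape[0])
        if lo < hi:
            pieces.append(d[lo:hi])     # h5py reads only this slice from disk
    return np.concatenate(pieces, axis=0)

block = read_chunk(100, 200)            # behaves like arr[100:200, :, :, :]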
I have a function that I want to have quickly access the first (aka zeroth) element of a given Numpy array, which itself might have any number of dimensions. What's the quickest way to do that?
I'm currently using the following:
a.reshape(-1)[0]
This reshapes the perhaps-multi-dimensional array into a 1D array and grabs the zeroth element, which is short, sweet and often fast. However, I think this would work poorly with some arrays, e.g., an array that is a transposed view of a large array, as I worry this would end up needing to create a copy rather than just another view of the original array, in order to get everything in the right order. (Is that right? Or am I worrying needlessly?) Regardless, it feels like this is doing more work than what I really need, so I imagine some of you may know a generally faster way of doing this?
Other options I've considered are creating an iterator over the whole array and drawing just one element from it, or creating a vector of zeroes containing one zero for each dimension and using that to fancy-index into the array. But neither of these seems all that great either.
a.flat[0]
This should be pretty fast and never require a copy. (Note that a.flat is an instance of numpy.flatiter, not an array, which is why this operation can be done without a copy.)
You can use a.item(0); see the documentation at numpy.ndarray.item.
A possible disadvantage of this approach is that the return value is a Python data type, not a numpy object. For example, if a has data type numpy.uint8, a.item(0) will be a Python integer. If that is a problem, a.flat[0] is better; see @user2357112's answer above.
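A quick illustration of the difference between the two suggestions (the array here is arbitrary):

import numpy as np

a = np.arange(6, dtype=np.uint8).reshape(2, 3).T   # a transposed, non-contiguous view

print(a.flat[0], type(a.flat[0]))   # 0 <class 'numpy.uint8'>  (numpy scalar)
print(a.item(0), type(a.item(0)))   # 0 <class 'int'>          (plain Python int)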
np.hsplit(x, 2)[0]
Note that this gives the first half of x after splitting it in two, rather than a single element.
Source: https://numpy.org/doc/stable/reference/generated/numpy.hsplit.html
(see also https://numpy.org/doc/stable/reference/generated/numpy.dsplit.html)
## y -- numpy array of shape (1, Ty)
If you want the first element of the shape (the number of rows), use y.shape[0]; if you want the second element of the shape (Ty), use y.shape[1]. Note that shape describes the array's dimensions, not its data.
You can also use numpy.take for more complicated extraction (to get a few elements):
numpy.take(a, indices, axis=None, out=None, mode='raise')
Take elements from an array along an axis.
Source: https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html
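For the question's specific case, note that with axis=None the indices index into the flattened array, for example:

import numpy as np

a = np.arange(12).reshape(3, 4).T    # any array, contiguous or not
print(np.take(a, 0))                 # 0: the first element in flattened (C) order
print(np.take(a, [0, 1, 2]))         # [0 4 8]: the first few flattened elements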
I'm generating many largish 'random' files (~500MB) in which the contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs to that buffer, and periodically flush that buffer to disk. I am currently using array.array() but I can't see a way to create a view into this buffer. I need to do this so that I can feed the part of the buffer with valid data into hashlib.update(...) and to write the valid part of the buffer to the file. I could use the slice operator but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700MB of data in 11s, while my python equivalent using numpy takes around 700s! It's hard to believe that that's the difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy at least supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view of the original array that was created.
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the same memory with the data interpreted as 8-bit integers (the data itself remains unchanged, so each 64-bit int is now seen as 8 8-bit ints). These buffer objects (from a.data) are standard Python buffer objects (memoryviews in Python 3) and so can be used in all the places that are defined to work with buffers.
The same is true for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
returns an error about being unable to get a single-segment buffer for a discontiguous array. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory as it's a view into the other array, which need not be (and in this case isn't) a view of contiguous memory. You can get info about the array using the flags attribute on an array:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (a[3,:] is 1-D and therefore trivially contiguous in both senses; for a 2-D array with more than one row and column, at most one of the two can be True).
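To tie this back to the original question, here is a small sketch (assuming a 1-D int64 buffer and a count of how many entries are valid so far): a contiguous slice is a view, and both hashlib and file writes accept it via the buffer protocol without an extra copy:

import hashlib
import numpy

buf = numpy.empty(1000000, dtype=numpy.int64)   # preallocated buffer of longs
count = 250000                                  # how much of it currently holds valid data

valid = buf[:count]                 # a view, no copy
h = hashlib.sha256()
h.update(valid)                     # hashlib accepts any contiguous buffer-like object
with open('chunk.bin', 'ab') as f:
    f.write(valid)                  # appends the raw bytes of the valid part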