I have some functions, part of a large analysis codebase, that require a boolean mask to divide array items into two groups. The functions look like this:
def process(data, a_mask):
    b_mask = -a_mask  # elementwise NOT of the boolean mask (modern NumPy requires ~a_mask instead)
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
Now I need to use these functions (without modifying them) on a big array whose items all belong to class "a", but I would like to save RAM and not pass a boolean mask that is all True. For example, I could pass a slice like slice(None, None).
The problem is that the line b_mask = -a_mask will fail if a_mask is a slice. Ideally, -a_mask should yield a zero-item selection.
I was thinking of creating a "modified" slice object whose __neg__() method returns a null slice (for example slice(0, 0)), but I don't know if this is possible.
Other solutions that leave the process() function untouched while avoiding the allocation of an all-True boolean array are welcome as well.
Unfortunately we can't add a __neg__() method to slice, since it cannot be subclassed. However, tuple can be subclassed, and we can use it to hold a single slice object.
This leads me to a very, very nasty hack which should just about work for you:
class NegTuple(tuple):
    def __neg__(self):
        return slice(0)
We can create a NegTuple containing a single slice object:
nt = NegTuple((slice(None),))
This can be used as an index, and negating it will yield an empty slice resulting in a 0-length array being indexed:
import numpy as np

a = np.arange(5)
print(a[nt])
# [0 1 2 3 4]
print(a[-nt])
# []
You would have to be very desperate to resort to something like this, though. Is it totally out of the question to modify process like this?
def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
This is way more explicit, and should not have any effect on its behavior for your current use cases.
Your solution is very similar to a degenerate sparse boolean array, although I don't know of any existing implementations. My knee-jerk reaction is one of dislike, but if you really can't modify process it's probably the best way.
If you are concerned about memory use, then advanced indexing may be a bad idea. From the docs:
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
As it stands, the process function has:
data of size n, say
a_mask of size n (assuming advanced indexing)
and creates:
b_mask of size n
data[a_mask] of size m, say
data[b_mask] of size n - m
This is effectively 4 arrays of size n.
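To put rough numbers on this (a sketch; the sizes are illustrative, and ~ is used for the complement because modern NumPy rejects unary - on boolean arrays):

import numpy as np

n = 1_000_000
data = np.arange(n)
a_mask = np.zeros(n, dtype=bool)
a_mask[:n // 2] = True  # m = n // 2 elements selected

b_mask = ~a_mask        # another n bytes
part_a = data[a_mask]   # advanced indexing: a copy of m elements
part_b = data[b_mask]   # advanced indexing: a copy of n - m elements

total = data.nbytes + a_mask.nbytes + b_mask.nbytes + part_a.nbytes + part_b.nbytes
print(total)  # the two copies alone roughly double data's footprint, plus 2n mask bytes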
Basic slicing seems to be your best option then; however, Python doesn't allow subclassing slice:
TypeError: Error when calling the metaclass bases
type 'slice' is not an acceptable base type
See ali_m's answer for a solution that incorporates slicing.
Alternatively, you could just bypass process and get your results as
result = func_a(data), func_b([])
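If func_b expects an ndarray rather than a plain list (an assumption about your functions), a zero-length slice of data keeps the dtype intact:

result = func_a(data), func_b(data[:0])  # data[:0] is an empty view with data's dtype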
Related
I want to assign values to part of an array which is specified by multiple conditions.
For example:
import numpy as np
temp = np.arange(120).reshape((2,3,4,5))
mask = temp > 22
submask = temp[mask] < 43
(temp[mask])[submask] = 0 # assign zeros to the part of the array specified via mask and submask
print(temp) # notice that temp is unchanged - I want it to be changed
This example is artificial. Generally I want to do something more complex which involves a combination of indexing and boolean masks. Using a list index fails in similar circumstances. For example: temp[:,:,0,[1,3,2]]=0 is a valid assignment, but temp[:,:,0,[1,3,2]][mask]=0 will fail.
My understanding is that the assignment is failing because the complex indexing is prompting numpy to make a copy of the array object and assigning to that, rather than to the original array. So it isn't that the assignment is failing per se, just that the assignment is directed towards the "wrong" object.
I have tried using functions such as np.copyto and np.putmask but these also fail, presumably because the backend implementation mimics the original problem. For example: np.putmask(temp, mask, 0) will work but np.putmask(temp[mask], submask, 0) will not.
Is there a good way to do what I want to do?
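For reference, a common workaround (a sketch, not taken from an answer in this thread) is to pull the masked values out into a temporary copy, modify that copy, and write it back with a single masked assignment:

import numpy as np

temp = np.arange(120).reshape((2, 3, 4, 5))
mask = temp > 22

vals = temp[mask]    # advanced indexing: vals is a copy
vals[vals < 43] = 0  # apply the second condition to the copy
temp[mask] = vals    # one masked assignment writes the changes back
print(temp)          # temp is now modified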
Question
Regarding NumPy indexing, which can return either a copy or a view, please confirm whether my understanding is correct. If not, please provide explanations and pointers to the related specifications.
Q: Basic slicing
The NumPy Indexing documentation has a Basic Slicing and Indexing section. I believe this section covers only basic slicing, one of the three kinds of indexing listed under Indexing.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
It uses only Python slice objects, and it returns a view only. Even when a slice is inside a tuple, which may fall under the section below, it still returns a view.
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
This code will return a view.
>>> z = np.arange(81).reshape(3,3,3,3)
>>> indices = (1,1,1,slice(0,2)) # same as z[1,1,1,0:2]
>>> z[indices]
array([39, 40])
Are these correct understandings?
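One way to check this understanding (a verification sketch, not part of the original question) is np.shares_memory:

import numpy as np

z = np.arange(81).reshape(3, 3, 3, 3)
view = z[(1, 1, 1, slice(0, 2))]  # a tuple of integers and one slice: basic indexing
print(np.shares_memory(z, view))  # True: the result is a view of z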
Q: Combining basic slicing and advanced indexing
There is a section in the document:
Combining advanced and basic indexing
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behavior can be more complicated.
I believe "basic indexing" in the section title means basic slicing; there is no basic indexing distinct from basic slicing, because Indexing says there are only three kinds: field access, basic slicing, and advanced indexing.
When advanced indexing and basic indexing are combined, the result is a copy.
Are these correct understandings?
Background
NumPy indexing has many options, and some return a view while others return a copy. It is quite hard to form a clear mental classification of which operations return a reference/view and which return a copy.
Indexing
There are three kinds of indexing available: field access, basic slicing, advanced indexing. Which one occurs depends on obj.
Basic Slicing and Indexing
Basic slicing occurs when obj is a slice object (constructed by start:stop:step notation inside of brackets), an integer, or a tuple of slice objects and integers. Ellipsis and newaxis objects can be interspersed with these as well.
NumPy slicing creates a view instead of a copy as in the case of builtin Python sequences such as string, tuple and list.
Advanced Indexing
Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean.
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
Dealing with variable numbers of indices within programs
If one supplies to the index a tuple, the tuple will be interpreted as a list of indices.
>>> z = np.arange(81).reshape(3,3,3,3)
>>> indices = (1,1,1,1)
>>> z[indices]
40
Field Access
If the ndarray object is a structured array, the fields of the array can be accessed by indexing the array with strings, dictionary-like. Indexing x['field-name'] returns a new view to the array.
Note
Which numpy operations copy and which mutate?
It's true that to get a good grasp of what returns a view and what returns a copy, you need to go through the documentation thoroughly (and sometimes it doesn't really mention it at all). I won't be able to provide a complete list of operations and their output types (view or copy), but maybe this can help you on your quest.
You can use np.shares_memory() to check whether a function returns a view or a copy of the original array.
import numpy as np

x = np.array([1, 2, 3, 4])
x1 = x                 # plain assignment: the same object
x2 = np.sqrt(x)        # ufunc: allocates a new array
x3 = x[1:2]            # basic slice
x4 = x[1::2]           # basic slice with a step
x5 = x.reshape(-1,2)   # reshape of a contiguous array
x6 = x[:,None]         # basic indexing with newaxis
x7 = x[None,:]         # basic indexing with newaxis
x8 = x6+x7             # broadcast arithmetic: allocates a new array
x9 = x5[1,0:2]         # integer + slice: basic indexing
x10 = x5[[0,1],0:2]    # list + slice: advanced indexing
print(np.shares_memory(x, x1))
print(np.shares_memory(x, x2))
print(np.shares_memory(x, x3))
print(np.shares_memory(x, x4))
print(np.shares_memory(x, x5))
print(np.shares_memory(x, x6))
print(np.shares_memory(x, x7))
print(np.shares_memory(x, x8))
print(np.shares_memory(x, x9))
print(np.shares_memory(x, x10))
True
False
True
True
True
True
True
False
True
False
Notice the last two advanced+basic indexing examples: one is a view while the other is a copy. The explanation of this difference, as given in the documentation (which also provides insight into how these are implemented), is:
When there is at least one slice (:), ellipsis (...) or newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element
From Python to Numpy, Nicolas P. Rougier: Views and copies.
First, we have to distinguish between indexing and fancy indexing. The first will always return a view while the second will return a copy. This difference is important because in the first case, modifying the view modifies the base array while this is not true in the second case:
If you are unsure whether the result of your indexing is a view or a copy, you can check the base of your result. If it is None, then your result is a copy:
>>> Z = np.random.uniform(0,1,(5,5))
>>> Z1 = Z[:3,:]
>>> Z2 = Z[[0,1,2], :]
>>> print(np.allclose(Z1,Z2))
True
>>> print(Z1.base is Z)
True
>>> print(Z2.base is Z)
False
>>> print(Z2.base is None)
True
If the base refers to the source, it is NOT None, hence a view.
I am currently struggling to write tidy code in NumPy using advanced indexing.
arr = np.arange(100).reshape(10,10) # array I want to manipulate
sl1 = arr[:,-1] # basic indexing
# Do stuff with sl1...
sl1[:] = -1
# arr has been changed as well
sl2 = arr[arr >= 50] # advanced indexing
# Do stuff with sl2...
sl2[:] = -2
# arr has not been changed,
# changes must be written back into it
arr[arr >= 50] = sl2 # What I'd like to avoid
I'd like to avoid this "write back" operation because it feels superfluous and I often forget it. Is there a more elegant way to accomplish the same thing?
Both boolean and integer array indexing fall under the category of advanced indexing. In the second example (boolean indexing), you'll see that the original array is not updated; this is because advanced indexing always returns a copy of the data (see the second paragraph of the advanced indexing section of the docs). This means that arr[arr >= 50] is already a copy of arr, and whatever changes you apply to it won't affect arr.
The reason it does not return a view is that advanced indexing cannot be expressed as a slice, and hence cannot be addressed with offsets, strides, and counts, which is what is required to take a view of the array's elements.
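The stride machinery mentioned above can be inspected directly (a sketch; the concrete numbers assume an 8-byte integer dtype):

import numpy as np

arr = np.arange(100, dtype=np.int64).reshape(10, 10)
col = arr[:, -1]   # basic slice: fully described by an offset plus strides into arr's buffer
print(col.strides) # (80,): consecutive elements lie one 80-byte row apart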
We can easily verify that we are viewing different objects in the case of advanced indexing with:
np.shares_memory(arr, arr[arr>50])
# False
np.shares_memory(arr, arr[:,-1])
# True
Views are only returned when performing basic slicing operations, so you'll have to assign back as you do in the last example. In reference to the question in the comments, when assigning back in the same expression:
arr[arr >= 50] = -2
This is translated by the Python interpreter as:
arr.__setitem__(arr >= 50, -2)
The thing to understand here is that the expression can be evaluated in place, so no new object needs to be created.
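A quick demonstration that the single-expression form updates the original array (an illustrative sketch):

import numpy as np

arr = np.arange(100).reshape(10, 10)
arr[arr >= 50] = -2       # arr.__setitem__(mask, -2): writes into arr directly
print((arr == -2).sum())  # 50: the elements were updated in place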
I want to implement a function that can compute basic math operations on large arrays (that won't fit in RAM as a whole). Therefore I wanted to create a function that applies a given operation block by block over a selected axis. The main idea of the function is like this:
def process_operation(inputs, output, operation, axis=0):
    shape = inputs[0].shape
    for index in range(shape[axis]):
        # '+' stands in for the requested operation
        output[index, :] = inputs[0][index, :] + inputs[1][index, :]
but I want to be able to change the axis along which the blocks are sliced/indexed.
Is it possible to do the indexing in some dynamic way, without the ':' syntactic sugar?
I found some help here, but so far it hasn't been much use:
Thanks
I think you could achieve what you want using Python's built-in slice type.
Under the hood, :-expressions used inside square brackets are transformed into instances of slice, but you can also use a slice to begin with. To iterate over different axes of your input you can use a tuple of slices of the correct length.
This might look something like:
def process_operation(inputs, output, axis=0):
    shape = inputs[0].shape
    for index in range(shape[axis]):
        my_slice = (slice(None),) * axis + (index,)
        output[my_slice] = inputs[0][my_slice] + inputs[1][my_slice]
I believe this should work with h5py datasets or memory-mapped arrays without any modifications.
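A quick usage sketch with in-memory arrays (the shapes and the final check are illustrative only):

import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)
y = np.ones((3, 4))
out = np.empty((3, 4))

process_operation([x, y], out, axis=1)  # iterate block by block over the columns
print(np.allclose(out, x + y))          # True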
Background on slice and __getitem__
slice works in conjunction with the __getitem__ method to evaluate the x[key] syntax. x[key] is evaluated in two steps:
1. If key contains any expressions such as :, i:j or i:j:k, these are de-sugared into slice instances.
2. key is passed to the __getitem__ method of the object x, which is responsible for returning the correct value of x[key].
For example, the expressions:
x[2]
y[:, ::2]
are equivalent to:
x.__getitem__(2)
y.__getitem__((slice(None), slice(None, None, 2)))
You can explore how values are converted to slices using a class like the following:
class Sliceable:
    def __getitem__(self, key):
        print(key)

x = Sliceable()
x[::2]  # prints "slice(None, None, 2)"
The benefits and the simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns but different numbers of rows, i.e., (A,N), (B,N), (C,N), etc.
I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).
For this purpose, h5py.Link classes do not help, as they work at the level of HDF5 nodes.
Here is some pseudocode:
import numpy as np
import h5py

a = h5py.Dataset('a', data=np.random.random((100, 50)))
b = h5py.Dataset('b', data=np.random.random((300, 50)))
c = h5py.Dataset('c', data=np.random.random((253, 50)))

# I want to view these arrays as a single array
combined = magic_array_linker([a, b, c], axis=0)
assert combined.shape == (100+300+253, 50)
For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this on the numpy level, but I don't find any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?
First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.
Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.
As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.
class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # We need to modify the indices, so make sure items is a list
        items = list(items)
        for item in items:
            if hasattr(item, 'start'):
                # item is a slice object
                raise ValueError('Slices not implemented')
        for ref in self.references:
            size = self.file[ref].shape[self.axis]
            # Check if the requested index is in this subarray.
            # If not, subtract the subarray size and move on.
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size
        return self.file[item_ref][tuple(items)]
Here's how you use it:
with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a', data=np.random.random((100, 50)))
    b = f.create_dataset('b', data=np.random.random((300, 50)))
    c = f.create_dataset('c', data=np.random.random((253, 50)))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)
    for i, key in enumerate([a, b, c]):
        ref_dataset[i] = key.ref

with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    print(foo[104, 4])
    print(f['b'][4, 4])
This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
You might be able to subclass from numpy.ndarray and get all the usual methods as well.