I want to implement a function that computes basic math operations on large arrays (that won't fit in RAM as a whole). Therefore I wanted to create a function that processes a given operation block by block along a selected axis. The main idea of this function is like this:
def process_operation(inputs, output, operation, axis=0):
    shape = inputs[0].shape
    for index in range(shape[axis]):
        # hard-coded to the first axis -- this is what I want to generalize
        output[index, :] = inputs[0][index, :] + inputs[1][index, :]
but I want to be able to change the axis along which the blocks are sliced/indexed.
Is it possible to do the indexing in some dynamic way, without using the ':' syntactic sugar?
I found some help here, but so far it hasn't been much use to me:
Thanks
I think you could achieve what you want using Python's built-in slice type.
Under the hood, :-expressions used inside square brackets are transformed into instances of slice, but you can also use a slice to begin with. To iterate over different axes of your input you can use a tuple of slices of the correct length.
This might look something like:
def process_operation(inputs, output, axis=0):
    shape = inputs[0].shape
    for index in range(shape[axis]):
        my_slice = (slice(None),) * axis + (index,)
        output[my_slice] = inputs[0][my_slice] + inputs[1][my_slice]
I believe this should work with h5py datasets or memory-mapped arrays without any modifications.
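For example, a quick check with small in-memory arrays (the shapes here are made up for illustration) shows that the tuple-of-slices index gives the same result as whole-array addition regardless of the chosen axis:

```python
import numpy as np

def process_operation(inputs, output, axis=0):
    shape = inputs[0].shape
    for index in range(shape[axis]):
        # Build an index tuple equivalent to [:, ..., :, index] along `axis`.
        my_slice = (slice(None),) * axis + (index,)
        output[my_slice] = inputs[0][my_slice] + inputs[1][my_slice]

a = np.arange(6).reshape(2, 3)
b = np.ones((2, 3), dtype=int)
out = np.empty((2, 3), dtype=int)

process_operation([a, b], out, axis=1)  # iterate column by column
print(out)  # same result as a + b
```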
Background on slice and __getitem__
slice works in conjunction with the __getitem__ to evaluate the x[key] syntax. x[key] is evaluated in two steps:
If key contains any expressions such as :, i:j or i:j:k then these are de-sugared into slice instances.
key is passed to the __getitem__ method of the object x. This method is responsible for returning the correct value of x[key].
For example, the expressions:
x[2]
y[:, ::2]
are equivalent to:
x.__getitem__(2)
y.__getitem__((slice(None), slice(None, None, 2)))
You can explore how values are converted to slices using a class like the following:
class Sliceable:
    def __getitem__(self, key):
        print(key)

x = Sliceable()
x[::2]  # prints "slice(None, None, 2)"
Related
In my code I have a large 2D array from which, inside a double for-loop, I extract one element at a time.
Now, there are situations where the values inside this matrix are all equal. I'm trying to create a "dummy" array that, when sliced, always returns the same value without actually performing the __getitem__ operation, which would be a useless waste of time.
A possible inelegant solution could be to use a lambda function and replace the __getitem__ with a __call__. Something like:
if <values are all equal>:
    x = lambda i, j: x0
else:
    x = lambda i, j: x_values[i, j]
Then I'd need to replace x[i,j] with x(i,j) inside the code, which would look something like:
for i in range(max_i):
    for j in range(max_j):
        ...
        x(i, j)
        ...
I find this, however, somewhat unintuitive to read and somewhat cumbersome. What I dislike most is replacing x[i,j] with x(i,j), as it makes it less obvious that x(i,j) is a kind of slicing rather than a real function call.
Another solution could be to code a class like:
class constant_array:
    def __init__(self, val):
        self.val = val
    def __getitem__(self, _):
        return self.val
Another issue with both this and the previous method is that x would no longer be a numpy.array. So calls like np.mean(x) would fail.
Is there a better way to create an object that, when sliced with x[i,j], always returns a constant value independently of i and j, without having to change all the following code?
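For reference, one way to get such a constant "array" without allocating the full matrix is numpy's broadcast_to, which returns a read-only zero-copy view backed by a single scalar; it supports x[i, j] indexing and functions like np.mean. The shape and value below are made up for illustration:

```python
import numpy as np

x0 = 7.0
# Looks like a (1000, 1000) array of x0, but only the single
# scalar is stored underneath (both strides are 0).
x = np.broadcast_to(x0, (1000, 1000))

print(x[3, 5])      # 7.0
print(np.mean(x))   # 7.0
print(x.strides)    # (0, 0) -- no 1000x1000 buffer is allocated
```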
EDIT:
As my question was badly formulated, I decided to rewrite it.
Does numpy allow creating an array from a function, without using a standard Python list comprehension?
With list comprehension I could have:
array = np.array([f(i) for i in range(100)])
with f a given function.
But if the constructed array is really big, using a Python list would be slow and would eat a lot of memory.
If such a way doesn't exist, I suppose I could first create an array of the wanted size:
array = np.arange(100)
and then map a function over it:
array = f(array)
According to results from another post, it seems that it would be a reasonable solution.
Let's say I want to use addition with a simple int value; it would be as follows:
array = np.array([i for i in range(5)])
array + 5
But now what if I want the value (here 5) to vary according to the index of the array element? For example, the operation:
array + [i for i in range(5)]
What object can I use to define special rules for a variable value within a vectorized operation?
You can add two arrays together like this:
Simple adding two arrays using numpy in python?
This assumes your "variable by index" is just another array.
For your specific example, a jury-rigged solution would be to use numpy.arange() as in:
In [4]: array + np.arange(5)
Out[4]: array([0, 2, 4, 6, 8])
In general, you can find some numpy ufunc that does the job of your custom function, or you can compose them in a Python function which then returns an ndarray, something like:
def custom_func():
    # code for your tasks
    return arr
You can then simply add the returned result to your already defined array as in:
array + custom_func()
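As a concrete sketch (the function f(i) = i**2 + 3*i is an arbitrary stand-in, not from the question), the index-dependent values can be built entirely with ufuncs and then added in one vectorized operation:

```python
import numpy as np

def custom_func(n):
    # f(i) = i**2 + 3*i, evaluated for i = 0..n-1 without a Python loop.
    idx = np.arange(n)
    return idx**2 + 3 * idx

array = np.arange(5)
result = array + custom_func(5)
print(result)  # [ 0  5 12 21 32]
```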
I am very new to Python, and I am trying to get used to performing Python's array operations rather than looping through arrays. Below is an example of the kind of looping operation I am doing, but am unable to work out a suitable pure array operation that does not rely on loops:
import numpy as np

def f(arg1, arg2):
    # an arbitrary function
    ...

def myFunction(a1DNumpyArray):
    A = a1DNumpyArray
    # Create a square array with each dimension the size of the argument array.
    B = np.zeros((A.size, A.size))
    # Function f is a function of two elements of the 1D array. For each
    # element, i, I want to perform the function on it and every element
    # before it, and store the result in the square array, multiplied by
    # the difference between the ith and (i-1)th element.
    for i in range(A.size):
        B[i, :i] = f(A[i], A[:i])*(A[i]-A[i-1])
    # Sum through j and return full sums as 1D array.
    return np.sum(B, axis=0)
In short, I am integrating a function which takes two elements of the same array as arguments, returning an array of results of the integral.
Is there a more compact way to do this, without using loops?
The use of an arbitrary f function, and this [i, :i] business, complicate bypassing the loop.
Most of the fast compiled numpy operations work on the whole array, or on whole rows and/or columns, and effectively do so in parallel. Loops that are inherently sequential (where a value in one iteration depends on the previous one) don't fit well. Different-sized lists or arrays in each iteration are also a good indicator that 'vectorizing' will be difficult.
for i in range(A.size):
    B[i, :i] = f(A[i], A[:i])*(A[i]-A[i-1])
With a sample A and a known f (as simple as arg1*arg2), I'd generate a B array and look for patterns that treat B as a whole. At first glance it looks like your B is a lower triangle. There are functions to help index those. But the final sum might change the picture.
Sometimes I tackle these problems with a bottom up approach, trying to remove inner loops first. But in this case, I think some sort of big-picture approach is needed.
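To illustrate the big-picture approach, assume the simple f(arg1, arg2) = arg1 * arg2 mentioned above. Then the whole loop collapses into an outer product masked to the lower triangle. Note one subtlety: in the loop, A[i-1] wraps around to A[-1] when i == 0, which np.roll reproduces. This is a sketch for that one choice of f, not a general replacement:

```python
import numpy as np

def looped(A, f):
    B = np.zeros((A.size, A.size))
    for i in range(A.size):
        B[i, :i] = f(A[i], A[:i]) * (A[i] - A[i - 1])
    return np.sum(B, axis=0)

def vectorized(A):
    # f(a, b) = a * b  ->  all pairwise products at once.
    pairwise = np.outer(A, A)
    # Keep only j < i, matching B[i, :i].
    lower = np.tril(pairwise, k=-1)
    # A[i] - A[i-1], with the same wrap-around at i = 0 as the loop.
    diff = A - np.roll(A, 1)
    return np.sum(lower * diff[:, None], axis=0)

A = np.array([1.0, 2.0, 4.0])
print(looped(A, lambda a, b: a * b))  # [10. 16.  0.]
print(vectorized(A))                  # [10. 16.  0.]
```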
This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas Series. When I create a boolean array using a Series, for instance
x = np.array([1,2,3,4,5,6,7])
y = pd.Series([1,2,3,4,5,6,7])
delta = np.percentile(x, 50)
deltamask = x - y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
you find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance
print(type(2*x < x*y))
print(type(2 < x*y))
will give you a pd.Series and an np.ndarray, respectively.
Also,
5 < x - y
results in a Series, so it seems that the Series takes precedence, whereas the boolean elements of a Series mask are promoted to integers when passed to a numpy array and result in an integer-indexed array.
What is the reason for this?
Fancy Indexing
As numpy currently stands, fancy indexing in numpy works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any array-like, whether a subclass like Series or just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask)==7 and x[0]==1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the FutureWarning: in the future, boolean array-likes will be handled as a boolean array index it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.
Relational Operators
Now let's address the second part of your question, about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If the type of y is a proper subclass of the type of x and overrides the reflected comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. In that subclass case, x.__lt__(y) will only get called if y.__gt__(x) returns NotImplemented, indicating that the comparison is not supported in that direction.
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
A much more succinct explanation of all this can be found in the Python docs.
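The subclass preference described above can be seen with a small, contrived pair of classes (these names are made up for illustration):

```python
class Base:
    def __lt__(self, other):
        return "Base.__lt__"
    def __gt__(self, other):
        return "Base.__gt__"

class Sub(Base):
    # Overriding the reflected method is what triggers the preference.
    def __gt__(self, other):
        return "Sub.__gt__"

b, s = Base(), Sub()
# Although written as b < s, Python calls s.__gt__(b) first,
# because Sub is a subclass of Base that overrides __gt__.
print(b < s)  # Sub.__gt__
```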
I have some functions, part of a big analysis software, that require a boolean mask to divide array items in two groups. These functions are like this:
def process(data, a_mask):
    b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
Now, I need to use these functions (with no modification) with a big array that has items of only class "a", but I would like to save RAM and not pass a boolean mask that is all True. For example, I could pass a slice like slice(None, None).
The problem is that the line b_mask = -a_mask will fail if a_mask is a slice. Ideally, -a_mask should give a 0-item selection.
I was thinking of creating a "modified" slice object that implements the __neg__() method as a null slice (for example slice(0, 0)). I don't know if this is possible.
Other solutions that leave the process() function unmodified but still avoid allocating an all-True boolean array are welcome as well.
Unfortunately we can't add a __neg__() method to slice, since it cannot be subclassed. However, tuple can be subclassed, and we can use it to hold a single slice object.
This leads me to a very, very nasty hack which should just about work for you:
class NegTuple(tuple):
    def __neg__(self):
        return slice(0)
We can create a NegTuple containing a single slice object:
nt = NegTuple((slice(None),))
This can be used as an index, and negating it will yield an empty slice resulting in a 0-length array being indexed:
a = np.arange(5)
print(a[nt])
# [0 1 2 3 4]
print(a[-nt])
# []
You would have to be very desperate to resort to something like this, though. Is it totally out of the question to modify process like this?
def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = -a_mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b
This is way more explicit and should not have any effect on its behavior for your current use cases.
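A runnable version of this idea is sketched below. The func_a/func_b stubs are placeholders (the real functions belong to the question's analysis software), and ~ is used for boolean negation, which current numpy requires in place of unary - on boolean arrays:

```python
import numpy as np

def func_a(x):
    return len(x)  # stub: the real function does the "a" analysis

def func_b(x):
    return len(x)  # stub: the real function does the "b" analysis

def process(data, a_mask=None):
    if a_mask is None:
        a_mask = slice(None)  # every element
        b_mask = slice(0)     # no elements
    else:
        b_mask = ~a_mask      # boolean complement of the mask
    res_a = func_a(data[a_mask])
    res_b = func_b(data[b_mask])
    return res_a, res_b

data = np.arange(5)
print(process(data))             # (5, 0): everything goes to func_a
mask = data > 2
print(process(data, mask))       # (2, 3): split by the mask
```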
Your solution is very similar to a degenerate sparse boolean array, although I don't know of any implementations of the same. My knee-jerk reaction is one of dislike, but if you really can't modify process it's probably the best way.
If you are concerned about memory use, then advanced indexing may be a bad idea. From the docs
Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view).
As it stands, the process function has:
data of size n say
a_mask of size n (assuming advanced indexing)
And creates:
b_mask of size n
data[a_mask] of size m say
data[b_mask] of size n - m
This is effectively 4 arrays of size n.
Basic slicing seems to be your best option then, however Python doesn't seem to allow subclassing slice:
TypeError: Error when calling the metaclass bases
type 'slice' is not an acceptable base type
See #ali_m's answer for a solution that incorporates slicing.
Alternatively, you could just bypass process and get your results as
result = func_a(data), func_b([])