Confusion about numpy's apply along axis and list comprehensions - python

Alright, so I apologize ahead of time if I'm just asking something silly, but I really thought I understood how apply_along_axis worked. I just ran into something that might be an edge case that I just didn't consider, but it's baffling me. In short, this is the code that is confusing me:
import numpy as np

class Leaf(object):
    def __init__(self, location):
        self.location = location
    def __len__(self):
        return self.location.shape[0]

def bulk_leaves(child_array, axis=0):
    test = np.array([Leaf(location) for location in child_array])  # This is what I want
    check = np.apply_along_axis(Leaf, 0, child_array)  # This returns an array of individual Leafs with the same shape as child_array
    return test, check

if __name__ == "__main__":
    test, check = bulk_leaves(np.random.rand(100, 50))
    test == check  # False
I always feel silly using a list comprehension with numpy and then casting back to an array, but I'm just not sure of another way to do this. Am I just missing something obvious?

apply_along_axis is pure Python code that you can look at and decode yourself. In this case it essentially does:
check = np.empty(child_array.shape, dtype=object)
for i in range(child_array.shape[1]):
    check[:, i] = Leaf(child_array[:, i])
In other words, it preallocates the container array, and then fills in the values with an iteration. That certainly is better than appending to the array, but rarely better than appending values to a list (which is what the comprehension is doing).
You could take the above template and adjust it to produce the array that you really want.
check = np.empty(child_array.shape[0], dtype=object)  # one slot per row
for i in range(check.shape[0]):
    check[i] = Leaf(child_array[i, :])
In quick tests this iteration times the same as the comprehension. The apply_along_axis, besides being wrong, is slower.
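A minimal sketch of such a timing comparison (using the Leaf class and the 100x50 array from the question; exact numbers will vary):

import timeit
import numpy as np

child_array = np.random.rand(100, 50)

def with_comprehension():
    # build a 1D object array from a list of Leaf instances
    return np.array([Leaf(row) for row in child_array])

def with_loop():
    # preallocate the object array and fill it row by row
    out = np.empty(child_array.shape[0], dtype=object)
    for i in range(child_array.shape[0]):
        out[i] = Leaf(child_array[i, :])
    return out

print(timeit.timeit(with_comprehension, number=1000))
print(timeit.timeit(with_loop, number=1000))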

The problem seems to be that apply_along_axis uses isscalar to determine whether the returned object is a scalar, but isscalar returns False for user-defined classes. The documentation for apply_along_axis says:
The shape of outarr is identical to the shape of arr, except along the axis dimension, where the length of outarr is equal to the size of the return value of func1d.
Since your class's __len__ returns the length of the array it wraps, numpy "expands" the resulting array into the original shape. If you don't define a __len__, you'll get an error, because numpy doesn't think user-defined types are scalars, so it will still try to call len on it.
As far as I can see, there is no way to make this work with a user-defined class. You can return 1 from __len__, but then you'll still get an Nx1 2D result, not a 1D array of length N. I don't see any way to make Numpy see a user-defined instance as a scalar.
There is a numpy bug about the apply_along_axis behavior, but surprisingly I can't find any discussion of the underlying issue that isscalar returns False for non-numpy objects. It may be that numpy just decided to punt and not guess whether user-defined types are vector or scalar. Still, it might be worth asking about this on the numpy list, as it seems odd to me that things like isscalar(object()) return False.
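For reference, this is how np.isscalar treats a few kinds of values (the Leaf class is the one from the question):

import numpy as np

print(np.isscalar(3.0))                # True
print(np.isscalar(np.float64(3.0)))    # True
print(np.isscalar(object()))           # False -- plain instances are not scalars to numpy
print(np.isscalar(Leaf(np.zeros(5))))  # False -- same for user-defined classes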
However, if as you say you don't care about performance anyway, it doesn't really matter. Just use your first way with the list comprehension, which already does what you want.

Related

I set 3 arrays to the same thing; changing a single entry in one of them also changes the other two arrays. How can I make the three arrays separate?

I am making a puzzle game in a command terminal. I have three arrays for the level: originlevel, which is the unaltered level that the game returns to if you restart the level; emptylevel, which is the level without the player; and level, which is just the level. I need all 3, because I will be changing the space around the player.
def Player(matrix, width):
    originlevel = matrix
    emptylevel = matrix
    emptylevel[PlayerPositionFind(matrix)] = "#"
    level = matrix
The expected result is that it would set one entry to "#" in the emptylevel array, but it actually sets all 3 arrays to the same thing! My theory is that the arrays are somehow linked because they are originally set to the same thing, but this ruins my code entirely! How can I make the arrays separate, so changing one would not change the others?
I should note that matrix is an array, it is not an actual matrix.
I tried a function which would take the array matrix, and then just return it, thinking that this layer would unlink the arrays. It did not. (I called the function IHATEPYTHON).
I've also read that setting them to the same array is supposed to do this, but I didn't actually find an answer for how to make them NOT do that. Do I make a function which is just something like
newarray = []
for i in range(0, len(array)):
    newarray.append(array[i])
return newarray
I feel like that would solve the issue but that's so stupid, can I not do it in another way?
This issue is caused by the way variables work in Python: all three names refer to the same underlying list object, so a change made through any one of them is visible through the others. If you want more background on why this is happening, you should look up 'pass by value versus pass by reference'.
In order for each of these arrays to be independent, you need to create a copy each time you assign it. The easiest way to do that is to use an array slice. This means you will get a new copy of the array each time.
def Player(matrix, width):
    originlevel = matrix[:]
    emptylevel = matrix[:]
    emptylevel[PlayerPositionFind(matrix)] = "#"
    level = matrix[:]
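A minimal illustration of the difference between aliasing a list and copying it (plain Python, independent of the game code):

import copy

a = [1, 2, 3]
b = a        # b is just another name for the same list object
c = a[:]     # c is a new (shallow) copy

b[0] = "#"
print(a)     # ['#', 2, 3] -- changed, because a and b are the same object
print(c)     # [1, 2, 3]   -- unaffected, c is a separate list

# Note: if the level is a nested list (a list of rows), a shallow copy still
# shares the inner lists, so use copy.deepcopy to make it fully independent.
nested = [[0, 0], [0, 0]]
independent = copy.deepcopy(nested)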

Ensure a variable is an array, regardless of whether it is a list or a scalar

I often find myself in the situation where I need to be sure that x is an array-like object, regardless of whether it comes to me as a float or as a list.
I ultimately need numpy arrays, so I expected that np.array() could be a straightforward solution. But a bare scalar still needs brackets around it, which remains a problem.
The best solution I have figured out is
def EnsureArray(x):
    if np.isscalar(x):
        return np.array([x])
    else:
        return np.array(x)
Is it OK, or is there something better (without defining my own function)?
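For comparison, a quick check of the function above next to numpy's built-in np.atleast_1d, which performs essentially the same conversion:

import numpy as np

def EnsureArray(x):
    # EnsureArray as defined above
    if np.isscalar(x):
        return np.array([x])
    return np.array(x)

print(EnsureArray(3.0))          # [3.]
print(EnsureArray([1, 2, 3]))    # [1 2 3]

print(np.atleast_1d(3.0))        # [3.]
print(np.atleast_1d([1, 2, 3]))  # [1 2 3]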

Application of numpy methods

I'm confused about how numpy methods are applied to nd-arrays. For example:
import numpy as np
a = np.array([[1,2,2],[5,2,3]])
b = a.transpose()
a.sort()
Here the transpose() method does not change a but returns a transposed version of it, while the sort() method sorts a in place and returns None. Does anybody have an idea why this is, and what the purpose of the different behaviour is?
Because numpy authors decided that some methods will be in place and some won't. Why? I don't know if anyone but them can answer that question.
'in-place' operations have the potential to be faster, especially when dealing with large arrays, as there is no need to re-allocate and copy the entire array, see answers to this question
BTW, most if not all arr methods have a function counterpart that returns a new array. For example, arr.sort has the counterpart numpy.sort(arr), which accepts an array and returns a new, sorted array (much like the relationship between the built-in sorted function and list.sort()).
In a Python class (OOP) methods which operate in place (modify self or its attributes) are acceptable, and if anything, more common than ones that return a new object. That's also true for built in classes like dict or list.
For example in numpy we often recommend the list append approach to building a new array:
In [296]: alist = []
In [297]: for i in range(3):
...: alist.append(i)
...:
In [298]: alist
Out[298]: [0, 1, 2]
This is common enough that we can readily write it as a list comprehension:
In [299]: [i for i in range(3)]
Out[299]: [0, 1, 2]
alist.sort operates in-place, sorted(alist) returns a new list.
In numpy methods that return a new array are much more common. In fact sort is about the only in-place method I can think of off hand. That and a direct modification of shape: arr.shape=(...).
A number of basic numpy operations return a view. That shares data memory with the source, but the array object wrapper is new. In fact even indexing an element returns a new object.
So while you ultimately need to check the documentation, it's usually safe to assume a numpy function or method returns a new object, as opposed to operating in-place.
More often users are confused by the numpy functions that have the same name as a method. In most of those cases the function makes sure the argument(s) is an array, and then delegates the action to its method. Also keep in mind that in Python operators are translated into method calls - + to __add__, [index] to __getitem__() etc. += is a kind of in-place operation.
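A small sketch of the three behaviours mentioned above: an in-place method, a function returning a copy, and a view that shares memory with its source:

import numpy as np

a = np.array([[3, 1, 2], [6, 5, 4]])

b = np.sort(a)   # function: returns a sorted copy, a is untouched
a.sort()         # method: sorts a in place and returns None

v = a[0, :]                    # basic slicing returns a view
print(np.shares_memory(a, v))  # True  -- the view shares a's data buffer
print(np.shares_memory(a, b))  # False -- the sorted copy does not

v[0] = 99        # modifying the view is visible through a
print(a[0, 0])   # 99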

Why does numpy have a corresponding function for many ndarray methods?

A few examples:
numpy.sum()
ndarray.sum()
numpy.amax()
ndarray.max()
numpy.dot()
ndarray.dot()
... and quite a few more. Is it to support some legacy code, or is there a better reason for that? And, do I choose only on the basis of how my code 'looks', or is one of the two ways better than the other?
I can imagine that one might want numpy.dot() to use reduce (e.g., reduce(numpy.dot, A, B, C, D)) but I don't think that would be as useful for something like numpy.sum().
As others have noted, the identically-named NumPy functions and array methods are often equivalent (they end up calling the same underlying code). One might be preferred over the other if it makes for easier reading.
However, in some instances the two behave slightly differently. In particular, using the ndarray method sometimes emphasises the fact that the method is modifying the array in-place.
For example, np.resize returns a new array with the specified shape. On the other hand, ndarray.resize changes the shape of the array in-place. The fill values used in each case are also different.
Similarly, a.sort() sorts the array a in-place, while np.sort(a) returns a sorted copy.
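A small sketch of those differences (the fill values shown are what I would expect from current numpy):

import numpy as np

a = np.array([1, 2, 3])
print(np.resize(a, 5))       # [1 2 3 1 2] -- new array, padded by repeating the data
a.resize(5, refcheck=False)  # in-place resize, padded with zeros
print(a)                     # [1 2 3 0 0]

b = np.array([3, 1, 2])
print(np.sort(b))            # [1 2 3] -- sorted copy, b unchanged
b.sort()                     # in-place
print(b)                     # [1 2 3]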
In most cases the method is the basic compiled version. The function uses that method when available, but also has some sort of backup when the argument(s) is not an array. It helps to look at the code and/or docs of the function or method.
For example, if in IPython I ask to look at the code for the sum method, I see that it is compiled code:
In [711]: x.sum??
Type: builtin_function_or_method
String form: <built-in method sum of numpy.ndarray object at 0xac1bce0>
...
Refer to `numpy.sum` for full documentation.
Doing the same on np.sum, I get many lines of documentation plus some Python code:
if isinstance(a, _gentype):
    res = _sum_(a)
    if out is not None:
        out[...] = res
        return out
    return res
elif type(a) is not mu.ndarray:
    try:
        sum = a.sum
    except AttributeError:
        return _methods._sum(a, axis=axis, dtype=dtype,
                             out=out, keepdims=keepdims)
    # NOTE: Dropping the keepdims parameters here...
    return sum(axis=axis, dtype=dtype, out=out)
else:
    return _methods._sum(a, axis=axis, dtype=dtype,
                         out=out, keepdims=keepdims)
If I call np.sum(x) where x is an array, it ends up calling x.sum():
sum = a.sum
return sum(axis=axis, dtype=dtype, out=out)
np.amax is similar (but simpler). Note that the np. form can handle an object that isn't an array (one that doesn't have the method), e.g. a list: np.amax([1,2,3]).
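A quick illustration of that delegation and fallback:

import numpy as np

x = np.array([1, 5, 3])
print(np.amax(x))          # 5 -- delegates to x.max()
print(x.max())             # 5

print(np.amax([1, 5, 3]))  # 5 -- works on a plain list as well
# [1, 5, 3].max()          # AttributeError: 'list' object has no attribute 'max'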
np.dot and x.dot both show as 'built-in' function, so we can't say anything about priority. They probably both end up calling some underlying C function.
np.reshape is another that delegates if possible:
try:
    reshape = a.reshape
except AttributeError:
    return _wrapit(a, 'reshape', newshape, order=order)
return reshape(newshape, order=order)
So np.reshape(x,(2,3)) is identical in functionality to x.reshape((2,3)). But the _wrapit expression enables np.reshape([1,2,3,4],(2,2)).
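For example, both forms below should give the same 2x2 result, and the function also accepts a plain list:

import numpy as np

x = np.arange(4)
print(np.reshape(x, (2, 2)))             # same result as x.reshape((2, 2))
print(np.reshape([1, 2, 3, 4], (2, 2)))  # works on a list via _wrapit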
np.sort returns a copy by doing an in-place sort on a copy:
a = asanyarray(a).copy()
a.sort(axis, kind, order)
return a
x.resize is built-in, while np.resize ends up doing a np.concatenate and reshape.
If your array is a subclass, like matrix or masked, it may have its own variant. The action of a matrix .sum is:
return N.ndarray.sum(self, axis, dtype, out, keepdims=True)._collapse(axis)
Elaborating on Peter's comment for visibility:
We could make it more consistent by removing methods from ndarray and sticking to just functions. But this is impossible because it would break everyone's existing code that uses methods.
Or, we could move all functions to also be methods. But this is impossible because new users and packages are constantly defining new functions. Plus continuing to multiply these duplicate methods violates "there should be one obvious way to do it".
If we could go back in time then I'd probably argue for not having these methods on ndarray at all, and using functions exclusively. ... So this all argues for using functions exclusively
numpy issue: More consistency with array-methods #7452

Mapping function to numpy array, varying a parameter

First, let me show you the codez:
a = array([...])
for n in range(10000):
    func_curry = functools.partial(func, y=n)
    result = array(map(func_curry, a))
    do_something_else(result)
    ...
What I'm doing here is trying to apply func to an array, changing the value of func's second parameter every iteration. This is SLOOOOW (creating a new function every iteration surely does not help), and I also feel I missed the pythonic way of doing it. Any suggestion?
Could a solution that gives me a 2D array be a good idea? I don't know, but maybe it is.
Answers to possible questions:
Yes, this is (using a broad definition), an optimization problem (do_something_else() hides this)
No, scipy.optimize hasn't worked because I'm dealing with boolean values and it never seems to converge.
Did you try numpy.vectorize?
...
vfunc_curry = vectorize(functools.partial(func, y=n))
result = vfunc_curry(a)
...
If a is of significant size the bottleneck should not be the creation of the function, but the duplication of the array.
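A runnable sketch of the vectorize suggestion; func here is a made-up placeholder, since the real one isn't shown in the question:

import functools
import numpy as np

def func(x, y):
    # placeholder for the real scalar function of x and y
    return x * y > 0.5

a = np.random.rand(100)

for n in range(10):
    vfunc_curry = np.vectorize(functools.partial(func, y=n))
    result = vfunc_curry(a)
    # do_something_else(result) would go here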
Can you rewrite the function? If possible, you should write the function to take two numpy arrays a and numpy.arange(n). You may need to reshape to get the arrays to line up for broadcasting.
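A rough sketch of that broadcasting idea, again with a made-up func; the a[:, None] reshape is what lines the two arrays up:

import numpy as np

def func(x, y):
    # placeholder: written with numpy operations so it broadcasts elementwise
    return x * y > 0.5

a = np.random.rand(100)
ns = np.arange(10000)

# a[:, None] has shape (100, 1) and ns has shape (10000,), so the result
# broadcasts to shape (100, 10000): one column per value of the parameter.
result = func(a[:, None], ns)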
