What is the difference between NumPy object methods and NumPy function calls? - python

If I have a NumPy array,
>>> x = np.arange(10)
what is the difference between getting information about that array using the object method
>>> x.mean()
4.5
compared to using the NumPy functions
>>> np.mean(x)
4.5
I expect the object method is calling the function, but there are examples where a function is not included as a method, such as
>>> np.median(x)
4.5
>>> x.median()
AttributeError: 'numpy.ndarray' object has no attribute 'median'
The exclusion of some functions seems to indicate a functional approach is more complete or preferred to the object oriented approach because it eliminates the need to switch back and forth. Is the exclusion of some methods intentional? Is there an inherent advantage for one approach compared to the other?

There is a notable difference between numpy.sort and ndarray.sort: the former returns a copy of the array, the latter sorts in-place.
For other methods, one can use equivalent functions just as well. The function form will accept array-like collections that can be converted to a NumPy array; this is sometimes convenient. On the other hand, this comes at the expense of a few more checks and function calls, so the method form should be a tiny bit faster. In practice, this is probably negligible; for me the deciding factor is usually that methods take fewer characters to type.
Some mathematical operations are more naturally written as methods: compare np.transpose(A) and A.T, when A is a 2-dimensional array.
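The differences described in this answer can be checked with a minimal sketch. The variable names below are only illustrative; the calls are standard NumPy.

```python
import numpy as np

x = np.array([3, 1, 2])

# Function form: returns a sorted copy, x itself is not modified.
sorted_copy = np.sort(x)
x_after_function = x.tolist()

# Method form: sorts x in place and returns None.
ret = x.sort()
x_after_method = x.tolist()

# The function form also accepts array-like input such as a plain list.
mean_of_list = np.mean([1, 2, 3, 4])

# A.T is the attribute spelling of np.transpose(A) for a 2-D array.
A = np.array([[1, 2], [3, 4]])
same_transpose = np.array_equal(np.transpose(A), A.T)
```

A plain list has no mean method, so the function form is the only option when the input has not been converted to an array yet.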

Related

Application of numpy methods

I'm confused about how numpy methods are applied to nd-arrays. For example:
import numpy as np
a = np.array([[1,2,2],[5,2,3]])
b = a.transpose()
a.sort()
Here the transpose() method does not change a but returns the transposed version of a, while the sort() method sorts a in place and returns None. Does anybody have an idea why this is, and what the purpose of this different behaviour is?
Because numpy authors decided that some methods will be in place and some won't. Why? I don't know if anyone but them can answer that question.
'in-place' operations have the potential to be faster, especially when dealing with large arrays, as there is no need to re-allocate and copy the entire array, see answers to this question
BTW, most if not all arr methods have a function version that returns a new array. For example, arr.sort has a function version numpy.sort(arr) which accepts an array and returns a new, sorted array (much like the built-in sorted function compared with list.sort()).
In a Python class (OOP) methods which operate in place (modify self or its attributes) are acceptable, and if anything, more common than ones that return a new object. That's also true for built in classes like dict or list.
For example, in numpy we often recommend the list append approach to building a new array:
In [296]: alist = []
In [297]: for i in range(3):
     ...:     alist.append(i)
     ...:
In [298]: alist
Out[298]: [0, 1, 2]
This is common enough that we can readily write it as a list comprehension:
In [299]: [i for i in range(3)]
Out[299]: [0, 1, 2]
alist.sort operates in-place, sorted(alist) returns a new list.
In numpy methods that return a new array are much more common. In fact sort is about the only in-place method I can think of off hand. That and a direct modification of shape: arr.shape=(...).
A number of basic numpy operations return a view. That shares data memory with the source, but the array object wrapper is new. In fact even indexing an element returns a new object.
So while you ultimately need to check the documentation, it's usually safe to assume a numpy function or method returns a new object, as opposed to operating in-place.
More often users are confused by the numpy functions that have the same name as a method. In most of those cases the function makes sure the argument(s) is an array, and then delegates the action to its method. Also keep in mind that in Python operators are translated into method calls - + to __add__, [index] to __getitem__() etc. += is a kind of in-place operation.
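To make the distinction between a view, an in-place operation and a new object concrete, here is a small sketch based on the array from the question. np.shares_memory is used only as a check; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 2], [5, 2, 3]])

# transpose() returns a new array object that is a view on the same data.
b = a.transpose()
is_new_object = b is not a
shares_data = np.shares_memory(a, b)

# Writing through the view is visible in the source array.
b[0, 0] = 99
a_first = a[0, 0]

# += modifies the existing array instead of binding a new one.
c = a.copy()
c_id_before = id(c)
c += 1
same_object = id(c) == c_id_before

# sort() works in place along the last axis and returns None.
ret = a.sort()
```

Because b is a view, the in-place sort of a also changes what b shows, which is one reason to check the documentation before relying on either behaviour.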

"Direct" numpy functions on an array vs numpy array functions

I have a question about the design of Python. I have realised that some functions are implemented directly on container classes (e.g. numpy arrays) while other functions that act on these containers must be called from numpy itself. An example would be:
import numpy as np
y = np.array([4,7,9,1])
m1 = np.mean(y) # Ok
m2 = y.mean() # Ok
print(m1 == m2) # True
x = [2,3]
r1 = np.concatenate([x, y]) # Ok
r2 = y.concatenate(x) # AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
print(r1 == r2)
Why can the mean be calculated directly from the array while the array has no method to concatenate another one to it? Is there a general rule for which functions can be called directly on the array and which cannot? And if both are possible, what is the pythonic way to do it?
The overview of NumPy history gives an indication of why not everything is consistent: it has two predecessors that were developed independently. Backward compatibility requires the project to keep array methods like max. Ongoing development favors the function syntax np.fun(array). I suppose one reason for the latter is that it allows array_like input (the term used throughout NumPy documentation): anything that NumPy can turn into an ndarray.
The question of why there are both methods and functions of the same name has been discussed and links provided.
But to focus on your two examples:
mean uses just one array. Logically it can be an ndarray method.
concatenate takes a list of arrays, and doesn't give priority to any one of them.
There is a np.append function that looks superficially like the list .append method. But it just passes the task on to concatenate with a few modifications. And it causes all kinds of newbie errors: it isn't in-place, it ravels its inputs when no axis is given, and it is slow compared to the list method.
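The np.append pitfalls mentioned above can be seen in a short sketch (the array contents are arbitrary):

```python
import numpy as np

y = np.array([[1, 2], [3, 4]])

# With no axis argument the inputs are flattened before joining,
# and the result is a new array; y keeps its shape and contents.
out = np.append(y, [5, 6])

# The same result written with concatenate directly.
same = np.concatenate([y.ravel(), [5, 6]])

# The usual alternative: collect rows in a list, convert once at the end.
rows = []
for i in range(3):
    rows.append([i, i * 2])
arr = np.array(rows)
```

Calling np.append inside a loop copies the whole array on every iteration, which is why collecting in a list first is usually recommended.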
Or consider the large family of ufunc. Those are functions, some take one array, others two. They share a common ufunc functionality.
np.add(a,b) <=> a+b <=> a.__add__(b)
np.sin(a) # no a.sin()
I suspect the choice to make sin a ufunc rather than a method has been influenced by common mathematical notation.
To me a big plus to the function approach is that it can be applied to a list or scalar. np.sin(1) works just as well as np.sin([0,.5,1]) or np.sin(np.arange(0,1,.5)).
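A quick sketch of the equivalences above, and of a ufunc accepting scalars, lists and arrays alike:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Three spellings of the same element-wise addition.
via_function = np.add(a, b)
via_operator = a + b
via_dunder = a.__add__(b)

# np.sin is a ufunc; there is no corresponding ndarray method.
sin_is_ufunc = isinstance(np.sin, np.ufunc)
has_sin_method = hasattr(a, "sin")

# The function form works on a scalar, a list, or an array.
s_scalar = np.sin(0.0)
s_list = np.sin([0.0, 0.5, 1.0])
s_array = np.sin(np.arange(0, 1, 0.5))
```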
Yes, history goes a long way toward excusing the mix of functions and methods, but many of the choices are logical.

numpy.product vs numpy.prod vs ndarray.prod

I'm reading through the Numpy docs, and it appears that the functions np.prod(...), np.product(...) and the ndarray method a.prod(...) are all equivalent.
Is there a preferred version to use, both in terms of style/readability and performance? Are there different situations where different versions are preferable? If not, why are there three separate but very similar ways to perform the same operation?
As of the master branch today (1.15.0), np.product just uses np.prod, and may be deprecated eventually. See MAINT: Remove duplicate implementation for aliased functions. #10653.
And np.prod and ndarray.prod both end up calling umath.multiply.reduce, so there is really no difference between them, besides the fact that the free functions can accept array-like types (like Python lists) in addition to NumPy arrays.
Prior to this, as in NumPy 1.14.2, the documentation claimed np.product and np.prod were the same, but there were bugs because of the duplicated implementation that Parag mentions. See, for instance, Eric Wieser's example from #10651:
>>> class CanProd(object):
...     def prod(self, axis, dtype, out): return "prod"
...
>>> np.product(CanProd())
<__main__.CanProd object at 0x0000023BAF7B29E8>
>>> np.prod(CanProd())
'prod'
So in short, now they're the same, and favor np.prod over np.product since the latter is an alias that may be deprecated.
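For ordinary input the remaining spellings agree, which a small check illustrates. np.product is deliberately left out of the sketch because it is the alias said above to be on its way out.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

p_function_list = np.prod(values)    # free function accepts a plain list
p_function_arr = np.prod(arr)        # free function on an array
p_method = arr.prod()                # ndarray method
p_reduce = np.multiply.reduce(arr)   # the underlying ufunc reduction
```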
This is what I could gather from the source code of NumPy 1.14.0. For the answer relevant to the current master branch (NumPy 1.15.0), see the answer of miradulo.
For an ndarray, np.prod() and np.product() are equivalent: both call um.multiply.reduce().
If the object is not an ndarray but still has a prod method, then np.prod() returns obj.prod(axis=axis, dtype=dtype, out=out, **kwargs), whereas np.product() tries to use um.multiply.reduce.
If the object is not an ndarray and does not have a prod method, then np.prod() behaves like np.product().
ndarray.prod() is equivalent to np.prod().
I am not sure about the latter part of your question regarding preference and readability.

Why can't I use x.unique() in numpy, while x.sum() and x.mean() work?

I'm learning numpy, but there is something I don't understand. For example:
import numpy as np
ints = np.array([3,3,3,2,2,1,1,4,4])
ints.unique() # this won't work
np.unique(ints) # this works
However, some functions work both ways:
ints.sum()
np.sum(ints)
Also, I was reading the numpy documentation: what's the difference between attributes and methods? Attributes return something, just as methods do.
Unlike sum, unique is only a free function and not a method of the array class (an instance method, to be precise). The difference between the two is
obj.foo() # instance method, obj is implicitly passed to foo()
foo(obj) # free function, obj is explicitly passed to foo()
Have a look here for some explanation on different variants of methods. In NumPy, this is mainly a design decision, I believe; however, there are certain reasons for some functions to be free functions. One reason that comes to mind is that, unlike in other technical languages (such as MATLAB), numpy arrays can be regular or ragged, and with dtype=object they can hold elements of different types (recent NumPy versions require dtype=object to be given explicitly for ragged input), for example
a = np.array([[1,2],[3,4]]) # regular 2-D array of integers
b = np.array([[1,2],[3,4,5]], dtype=object) # ragged: a 1-D object array holding two lists
c = np.array([[1,2],["abc",True]], dtype=object) # 2-D object array with mixed element types
In such scenarios, having to make every function/method an instance method would lead to confusing behaviour. Even the sum method behaves differently with regular and ragged arrays
In [18]: a.sum() # sums all elements of the array
Out[18]: 10
In [19]: b.sum() # concatenates the lists stored in the array
Out[19]: [1, 2, 3, 4, 5]
In contrast, some functions like unique have a much narrower scope in terms of their applications. For example, unique needs elements of a uniform, sortable data type, and by default it operates on the flattened (1-D) version of the array.
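The sum and unique behaviour described above can be reproduced with a short sketch. It assumes the ragged input is stored with an explicit dtype=object, which recent NumPy versions require.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])                    # regular 2-D integer array
b = np.array([[1, 2], [3, 4, 5]], dtype=object)   # 1-D array holding two lists

total_a = a.sum()   # ordinary numeric sum
total_b = b.sum()   # adds the list objects, i.e. concatenates them

# By default np.unique flattens its input and returns sorted unique values.
u = np.unique(np.array([[3, 3], [2, 1]]))
```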
Attributes of numpy arrays typically tell you about the underlying data type, shape, dimensionality, memory layout/strides and data ownership of the array, for instance:
In [20]: a=np.random.rand(3,4)
In [21]: a.flags
Out[21]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [22]: a.shape
Out[22]: (3, 4)
In [23]: a.dtype
Out[23]: dtype('float64')
These are all attributes and not array methods per se; in other words, they are properties.
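In code, the practical difference is that an attribute is read without parentheses while a method has to be called. A minimal sketch:

```python
import numpy as np

a = np.zeros((3, 4))

# Attributes: plain lookups that describe the array.
shape = a.shape
ndim = a.ndim
dtype_name = a.dtype.name

# Method: a callable bound to the array, so it needs parentheses.
total = a.sum()

sum_is_callable = callable(a.sum)
shape_is_callable = callable(a.shape)
```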
np.sum is a function that takes an array, or anything that can be turned into an array, and applies its sum method. See np.source(np.sum) for details.
arr.sum is a method of the arr array. For an ndarray it is compiled code. A subclassed array may have a different sum method.
In most of the cases where there are like-named functions and methods, a relationship like this holds.
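One way to observe this delegation is with a subclass that overrides sum. LoudArray below is a hypothetical class written only for this illustration; it records each call to its method so that the effect of np.sum can be inspected.

```python
import numpy as np

calls = []

class LoudArray(np.ndarray):
    """Hypothetical subclass that records calls to its own sum method."""

    def sum(self, *args, **kwargs):
        calls.append("LoudArray.sum")
        return super().sum(*args, **kwargs)

loud = np.array([1, 2, 3]).view(LoudArray)

# The free function hands the work to the subclass's method.
total = np.sum(loud)
```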
Look at the source for np.unique to see a different design. One difference that comes to mind is that unique only works with 1d arrays, or with a flattened array. It's not as general purpose a method like sum or mean.
Some of these differences follow a pattern, or are explained, others are probably more the result of a development history. Often it is easier to add new functionality by writing a 'stand-alone' function, rather than adding a method to an existing class. The method is more closely integrated with the class.
To get into more detail you'll have to spend time reading the development archives. For roughly the last 5 years, much of that can be found by searching the respective GitHub repository and its issues.

Why does numpy have a corresponding function for many ndarray methods?

A few examples:
numpy.sum()
ndarray.sum()
numpy.amax()
ndarray.max()
numpy.dot()
ndarray.dot()
... and quite a few more. Is it to support some legacy code, or is there a better reason for that? And, do I choose only on the basis of how my code 'looks', or is one of the two ways better than the other?
I can imagine that one might want numpy.dot() to use reduce (e.g., reduce(numpy.dot, [A, B, C, D])) but I don't think that would be as useful for something like numpy.sum().
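As a sketch of that idea: functools.reduce takes the function and a single iterable, so the matrices go in a list. The small matrices here are arbitrary.

```python
from functools import reduce

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

# The function form can be passed around as a first-class callable.
chained = reduce(np.dot, [A, B, C])

# The method form chains naturally instead.
explicit = A.dot(B).dot(C)
```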
As others have noted, the identically-named NumPy functions and array methods are often equivalent (they end up calling the same underlying code). One might be preferred over the other if it makes for easier reading.
However, in some instances the two behave slightly differently. In particular, using the ndarray method sometimes emphasises the fact that the method is modifying the array in-place.
For example, np.resize returns a new array with the specified shape. On the other hand, ndarray.resize changes the shape of the array in-place. The fill values used in each case are also different.
Similarly, a.sort() sorts the array a in-place, while np.sort(a) returns a sorted copy.
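The resize difference, including the different fill values, can be seen in a short sketch. refcheck=False is passed here only so the in-place call does not fail when something else (an interactive shell, for example) holds a reference to the array.

```python
import numpy as np

a = np.array([1, 2, 3])

# Function: returns a new array, filled by repeating the original data.
grown_copy = np.resize(a, 5)

# Method: resizes in place, pads with zeros, and returns None.
b = a.copy()
ret = b.resize((5,), refcheck=False)
```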
In most cases the method is the basic compiled version. The function uses that method when available, but also has some sort of backup when the argument(s) is not an array. It helps to look at the code and/or docs of the function or method.
For example, if in IPython I ask to look at the code for the sum method, I see that it is compiled code:
In [711]: x.sum??
Type: builtin_function_or_method
String form: <built-in method sum of numpy.ndarray object at 0xac1bce0>
...
Refer to `numpy.sum` for full documentation.
Doing the same on np.sum, I get many lines of documentation plus some Python code:
    if isinstance(a, _gentype):
        res = _sum_(a)
        if out is not None:
            out[...] = res
            return out
        return res
    elif type(a) is not mu.ndarray:
        try:
            sum = a.sum
        except AttributeError:
            return _methods._sum(a, axis=axis, dtype=dtype,
                                 out=out, keepdims=keepdims)
        # NOTE: Dropping the keepdims parameters here...
        return sum(axis=axis, dtype=dtype, out=out)
    else:
        return _methods._sum(a, axis=axis, dtype=dtype,
                             out=out, keepdims=keepdims)
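If I call np.sum(x) where x is a plain ndarray, the final else branch calls _methods._sum directly. If x is some other type that has its own sum method (an ndarray subclass, say), it ends up calling x.sum():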
sum = a.sum
return sum(axis=axis, dtype=dtype, out=out)
np.amax is similar (but simpler). Note that the np. form can handle an object that isn't an array (that doesn't have the method), e.g. a list: np.amax([1,2,3]).
np.dot and x.dot both show as 'built-in' function, so we can't say anything about priority. They probably both end up calling some underlying C function.
np.reshape is another that delegates if possible:
    try:
        reshape = a.reshape
    except AttributeError:
        return _wrapit(a, 'reshape', newshape, order=order)
    return reshape(newshape, order=order)
So np.reshape(x,(2,3)) is identical in functionality to x.reshape((2,3)). But the _wrapit expression enables np.reshape([1,2,3,4],(2,2)).
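A brief check of both cases, the array that has the method and the list that does not:

```python
import numpy as np

x = np.arange(6)

via_function = np.reshape(x, (2, 3))
via_method = x.reshape((2, 3))

# A list has no reshape method, yet the function form still works.
list_has_method = hasattr([1, 2, 3, 4], "reshape")
from_list = np.reshape([1, 2, 3, 4], (2, 2))
```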
np.sort returns a copy by doing an in-place sort on a copy:
a = asanyarray(a).copy()
a.sort(axis, kind, order)
return a
x.resize is built-in, while np.resize ends up doing a np.concatenate and reshape.
If your array is a subclass, like matrix or masked, it may have its own variant. The action of a matrix .sum is:
return N.ndarray.sum(self, axis, dtype, out, keepdims=True)._collapse(axis)
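The same point can be illustrated with a masked array, used here instead of matrix purely as an assumption of convenience: its sum method skips masked entries, and np.sum picks up that behaviour.

```python
import numpy as np

m = np.ma.array([1, 2, 3], mask=[False, True, False])

via_method = m.sum()       # the masked array's own sum ignores the masked 2
via_function = np.sum(m)   # the free function gives the same answer
plain = np.sum(m.data)     # the underlying plain ndarray sums everything
```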
Elaborating on Peter's comment for visibility:
We could make it more consistent by removing methods from ndarray and sticking to just functions. But this is impossible because it would break everyone's existing code that uses methods.
Or, we could move all functions to also be methods. But this is impossible because new users and packages are constantly defining new functions. Plus continuing to multiply these duplicate methods violates "there should be one obvious way to do it".
If we could go back in time then I'd probably argue for not having these methods on ndarray at all, and using functions exclusively. ... So this all argues for using functions exclusively
numpy issue: More consistency with array-methods #7452
