Why can't I use x.unique() in numpy, while x.sum() or x.mean() works? - python

I'm learning numpy, but I don't understand why, for example:
import numpy as np
ints = np.array([3,3,3,2,2,1,1,4,4])
ints.unique() # this won't work
np.unique(ints) # this works
However, some functions work both ways:
ints.sum()
np.sum(ints)
I was also reading the numpy documentation: what's the difference between attributes and methods? Attributes return something, just as methods do.

unique, unlike sum, is only a free function and not a class (instance, to be precise) method. The difference between the two is
obj.foo() # instance method, obj is implicitly passed to foo()
foo(obj) # free function, obj is explicitly passed to foo()
Have a look here for some explanation of the different variants of methods. In NumPy this is mainly a design decision, I believe; however, there are certain reasons for some functions to be free functions. One reason that comes to mind is that, unlike in other technical languages (such as MATLAB), numpy arrays can be structured or unstructured and can be flexible in terms of containing objects of different types, for example
a = np.array([[1,2],[3,4]]) # structured array
b = np.array([[1,2],[3,4,5]]) # unstructured array
c = np.array([[1,2],["abc",True]]) # unstructured array with flexible data type
In such scenarios, having to make every function/method an instance method would lead to confusing behaviour. Even the sum function behaves differently with structured and unstructured arrays:
In [18]: a.sum() # sums all elements of the array
Out[18]: 10
In [19]: b.sum() # concatenates all elements of the array
Out[19]: [1, 2, 3, 4, 5]
In contrast, some functions like unique have a much narrower scope in terms of their applications. For example, unique only works for arrays/buffers of a uniform data type and operates on the flattened (1-D) version of the array.
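A small sketch of that difference: np.unique flattens its input by default before finding the distinct values, whereas sum can work per axis:
import numpy as np
m = np.array([[3, 3, 1],
              [2, 2, 1]])
print(np.unique(m))      # [1 2 3] -- operates on the flattened array
print(m.sum())           # 12 -- all elements
print(m.sum(axis=0))     # [5 5 2] -- sum can work along any axis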
Attributes of numpy arrays typically tell you about the underlying data type, shape, dimensionality, memory layout/strides, and data ownership of the array, for instance:
In [20]: a=np.random.rand(3,4)
In [21]: a.flags
Out[21]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
In [22]: a.shape
Out[22]: (3, 4)
In [23]: a.dtype
Out[23]: dtype('float64')
These are all attributes and not array methods per se; in other words, they are properties.

np.sum is a function that takes an array, or anything that can be turned into an array, and applies its sum method. See np.source(np.sum) for details.
arr.sum is a method of the arr array. For an ndarray it is compiled code. A subclassed array may have a different sum method.
In most cases where there are like-named functions and methods, a relationship like this holds.
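A small sketch of that relationship: the function accepts anything array_like and delegates to the array's method:
import numpy as np
lst = [3, 3, 3, 2, 2, 1, 1, 4, 4]
print(np.sum(lst))     # 23 -- the list is first turned into an array
arr = np.asarray(lst)
print(arr.sum())       # 23 -- the method the function delegates to
# lst.sum()            # AttributeError: lists have no sum method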
Look at the source for np.unique to see a different design. One difference that comes to mind is that unique only works with 1d arrays, or with a flattened array. It's not as general-purpose a method as sum or mean.
Some of these differences follow a pattern, or are explained, others are probably more the result of a development history. Often it is easier to add new functionality by writing a 'stand-alone' function, rather than adding a method to an existing class. The method is more closely integrated with the class.
To get into more details you'll have to spend time reading the development archives. For roughly the last 5 years, much of that can be found by searching the respective github repository and its issues.

Related

Reinterpreting NumPy arrays as a different dtype

Say I have a large NumPy array of dtype int32
import numpy as np
N = 1000 # (large) number of elements
a = np.random.randint(0, 100, N, dtype=np.int32)
but now I want the data to be uint32. I could do
b = a.astype(np.uint32)
or even
b = a.astype(np.uint32, copy=False)
but in both cases b is a copy of a, whereas I want to simply reinterpret the data in a as being uint32, so as not to duplicate the memory. Similarly, using np.asarray() does not help.
What does work is
a.dtype = np.uint32
which simply changes the dtype without altering the data at all. Here's a striking example:
import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
print(a)
a.dtype = np.uint32
print(a) # shows "overflow", which is what I want
My questions are about the solution of simply overwriting the dtype of the array:
Is this legitimate? Can you point me to where this feature is documented?
Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
What if I want two arrays a and b sharing the same data, but view it as different dtypes? I've found the following to work, but again I'm concerned if this is really OK to do:
import numpy as np
a = np.array([0, 1, 2, 3], dtype=np.int32)
b = a.view(np.uint32)
print(a) # [0 1 2 3]
print(b) # [0 1 2 3]
a[0] = -1
print(a) # [-1 1 2 3]
print(b) # [4294967295 1 2 3]
Though this seems to work, I find it weird that the underlying data of the two arrays does not seem to be located at the same place in memory:
print(a.data)
print(b.data)
Actually, it seems that the above gives different results each time it is run, so I don't understand what's going on there at all.
This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats:
import numpy as np
a = np.array([0, 1, 2, np.pi], dtype=np.float32)
b = a.view(np.float64)
print(a) # [0. 1. 2. 3.1415927]
print(b) # [0.0078125 50.12387848]
b[0] = 8
print(a) # [0. 2.5 2. 3.1415927]
print(b) # [8. 50.12387848]
Again, is this condoned, if the obtained behaviour is really what I'm after?
Is this legitimate? Can you point me to where this feature is documented?
This is legitimate. However, using ndarray.view (which is equivalent) is better since it is compatible with static analysers (so it is somewhat safer). Indeed, the documentation states:
It’s possible to mutate the dtype of an array at runtime. [...]
This sort of mutation is not allowed by the types. Users who want to write statically typed code should instead use the numpy.ndarray.view method to create a view of the array with a different dtype.
Does it in fact leave the data of the array untouched, i.e. no duplication of the data?
Yes, since the array is still a view of the same internal memory buffer (a basic byte array). Numpy will just reinterpret it differently (this is done directly in the C code of each Numpy computing function).
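As a quick check (a small sketch), np.shares_memory confirms that nothing is copied:
import numpy as np
a = np.array([-1, 0, 1, 2], dtype=np.int32)
b = a.view(np.uint32)            # same buffer, different interpretation
print(np.shares_memory(a, b))    # True -- no duplication of the data
print(b)                         # [4294967295 0 1 2]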
What if I want two arrays a and b sharing the same data, but view it as different dtypes? [...]
ndarray.view can be used in this case, as you did in your example. However, the result is platform dependent. Indeed, Numpy just reinterprets bytes of memory, and theoretically the representation of negative numbers can change from one machine to another. Fortunately, nowadays, all mainstream modern processors use two's complement (source). This means that an np.int32 value like -1 will be reinterpreted as 2**32-1 = 4294967295 with a view of type np.uint32. Positive signed values are unchanged. As long as you are aware of this, this is fine and the behaviour is predictable.
This can be extended to other dtypes, the most extreme of which is probably mixing 32 and 64 bit floats.
Well, to put it shortly, this is really like playing with fire. In this case it is certainly unsafe, although it may work on your specific machine. Let us venture into troubled waters.
First of all, the documentation of ndarray.view states:
The behavior of the view cannot be predicted just from the superficial appearance of a. It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
The thing is that Numpy reinterprets the pointer in C code. Thus, AFAIK, the strict aliasing rule applies. This means that reinterpreting an np.float32 value as an np.float64 causes undefined behaviour. One reason is that the alignment requirements are not the same for np.float32 (typically 4) and np.float64 (typically 8), so reading an unaligned np.float64 value from memory can cause a crash on some architectures (e.g. POWER), although x86-64 processors support this. Another reason comes from the compiler, which can over-optimize the code due to the strict aliasing rule by making wrong assumptions in your case (e.g. that an np.float32 value and an np.float64 value cannot overlap in memory, so the modification of the view should not change the original array). However, since Numpy is called from CPython and no function calls are inlined by the interpreter (probably not with Cython either), this last point should not be a problem (it may be the case if you use Numba or any JIT, though). Note that it is safe to get an np.uint8 view of an np.float32 array since it does not break the strict aliasing rule (and the alignment is OK). This can be useful to efficiently serialize Numpy arrays. The opposite operation is not safe (especially due to the alignment).
Update about the last section: a deeper analysis of the Numpy code shows that some parts of the code, like the type-conversion functions, perform safe type punning using the memmove C call, while some other functions, like all the basic unary and binary operators, do not appear to do proper type punning yet! Moreover, this feature is barely tested by users, and tricky corner cases are likely to cause weird bugs (especially if you read and write two views of the same array). Thus, use it at your own risk.
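A minimal sketch of the np.uint8 view mentioned above (assuming you only need the raw bytes, e.g. for serialization):
import numpy as np
a = np.array([0.0, 1.0, np.pi], dtype=np.float32)
raw = a.view(np.uint8)                  # byte view, no copy, no aliasing issue
print(raw.nbytes)                       # 12 (3 elements * 4 bytes)
restored = raw.view(np.float32)         # round-trip back to float32
print(np.array_equal(a, restored))      # True
print(np.shares_memory(a, restored))    # True -- still the same buffer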

Application of numpy methods

I'm confused about how numpy methods are applied to nd-arrays. For example:
import numpy as np
a = np.array([[1,2,2],[5,2,3]])
b = a.transpose()
a.sort()
Here the transpose() method does not change a at all, but returns the transposed version of a, while the sort() method sorts a in place and returns None. Does anybody have an idea why this is and what the purpose of this different behaviour is?
Because the numpy authors decided that some methods would operate in place and some wouldn't. Why? I don't know if anyone but them can answer that question.
'in-place' operations have the potential to be faster, especially when dealing with large arrays, as there is no need to re-allocate and copy the entire array, see answers to this question
BTW, most if not all arr methods have a 'static' version that returns a new array. For example, arr.sort has a static version, numpy.sort(arr), which accepts an array and returns a new, sorted array (much like the built-in sorted function versus list.sort()).
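A small illustration of the difference:
import numpy as np
a = np.array([[1, 2, 2], [5, 2, 3]])
b = np.sort(a)     # returns a new, row-wise sorted array; a is untouched
a.sort()           # sorts a in place and returns None
print(np.array_equal(a, b))   # True -- both are now sorted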
In a Python class (OOP) methods which operate in place (modify self or its attributes) are acceptable, and if anything, more common than ones that return a new object. That's also true for built in classes like dict or list.
For example, in numpy we often recommend the list-append approach to building a new array:
In [296]: alist = []
In [297]: for i in range(3):
...: alist.append(i)
...:
In [298]: alist
Out[298]: [0, 1, 2]
This is common enough that we can readily write it as a list comprehension:
In [299]: [i for i in range(3)]
Out[299]: [0, 1, 2]
alist.sort operates in-place, sorted(alist) returns a new list.
In numpy, methods that return a new array are much more common. In fact, sort is about the only in-place method I can think of offhand. That, and a direct modification of shape: arr.shape = (...).
A number of basic numpy operations return a view. A view shares data memory with the source, but the array object wrapper is new. In fact, even indexing an element returns a new object.
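For instance (a small sketch), transpose returns a view:
import numpy as np
a = np.array([[1, 2, 2], [5, 2, 3]])
t = a.transpose()               # new array object, but a view of a's data
print(t.base is a)              # True -- t does not own its data
print(np.shares_memory(a, t))   # True
t[0, 0] = 99                    # writing through the view also changes a
print(a[0, 0])                  # 99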
So while you ultimately need to check the documentation, it's usually safe to assume a numpy function or method returns a new object, as opposed to operating in-place.
More often, users are confused by the numpy functions that have the same name as a method. In most of those cases the function makes sure the argument is an array, and then delegates the action to its method. Also keep in mind that in Python, operators are translated into method calls: + to __add__, [index] to __getitem__(), etc. += is a kind of in-place operation.
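A small sketch of the operator-to-method translation and the in-place behaviour of +=:
import numpy as np
a = np.arange(3)
b = np.ones(3, dtype=int)
print(np.array_equal(a + b, a.__add__(b)))   # True: + is sugar for __add__
c = a          # c is another name for the same array object
a += b         # in-place: modifies the existing buffer, no new array
print(c)       # [1 2 3] -- c sees the change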

"Direct" numpy functions on an array vs numpy array functions

I have a question about the design of Python. I have realised that some functions are implemented directly on container classes (e.g. numpy arrays), while other functions that act on these containers must be called from numpy itself. An example would be:
import numpy as np
y = np.array([4,7,9,1])
m1 = np.mean(y) # Ok
m2 = y.mean() # Ok
print(m1 == m2) # True
x = [2,3]
r1 = np.concatenate([x, y]) # Ok
r2 = y.concatenate(x) # AttributeError: 'numpy.ndarray' object has no attribute 'concatenate'
print(r1 == r2)
Why can the mean be calculated directly from the array, while the array has no method to concatenate another one to it? Is there a general rule for which functions can be called directly on the array and which cannot? And if both are possible, what is the pythonic way to do it?
The overview of NumPy history gives an indication of why not everything is consistent: it has two predecessors that were developed independently. Backward compatibility requires the project to keep array methods like max. Ongoing development favors the function syntax np.fun(array). I suppose one reason for the latter is that it allows array_like input (the term used throughout NumPy documentation): anything that NumPy can turn into an ndarray.
The question of why there are both methods and functions of the same name has been discussed and links provided.
But to focus on your two examples:
mean uses just one array. Logically it can be an ndarray method.
concatenate takes a list of arrays, and doesn't give priority to any one of them.
There is an np.append function that looks superficially like the list .append method. But it just passes the task on to concatenate with a few modifications. And it causes all kinds of newbie errors: it isn't in-place, it ravels, and it is slow compared to the list method.
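A small sketch of the most common of those errors:
import numpy as np
a = np.array([1, 2, 3])
np.append(a, 4)           # returns a new array; the result is easy to discard
print(a)                  # [1 2 3] -- unchanged, unlike list.append
b = np.append(a, 4)       # the result has to be kept explicitly
print(b)                  # [1 2 3 4]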
Or consider the large family of ufuncs. Those are functions; some take one array, others two. They share common ufunc machinery.
np.add(a,b) <=> a+b <=> a.__add__(b)
np.sin(a) # no a.sin()
I suspect the choice to make sin a ufunc rather than a method has been influenced by common mathematical notation.
To me a big plus to the function approach is that it can be applied to a list or scalar. np.sin(1) works just as well as np.sin([0,.5,1]) or np.sin(np.arange(0,1,.5)).
Yes, history goes a long way toward excusing the mix of functions and methods, but many of the choices are logical.

What is the identity of "ndim, shape, size, ..etc" of ndarray in numpy

I'm quite new to Python.
After using Matlab for many years, I recently started studying numpy/scipy.
The most basic element of numpy seems to be the ndarray.
In ndarray, there are following attributes:
ndarray.ndim
ndarray.shape
ndarray.size
...etc
I'm quite familiar with C++/JAVA classes, but I'm a novice at Python OOP.
Q1: My first question is what is the identity of the above attributes?
At first, I assumed that the above attributes might be public member variables. But soon I found that a.ndim = 10 doesn't work (assuming a is an ndarray object). So it seems they are not public member variables.
Next, I guessed that they might be public methods, similar to getter methods in C++. However, when I tried a.ndim() with parentheses, it didn't work. So it seems they are not public methods.
The other possibility might be that they are private member variables, but print a.ndim works, so they cannot be private data members.
So, I cannot figure out what is the true identity of the above attributes.
Q2. Where can I find the Python code implementing ndarray? Since I installed numpy/scipy on my local PC, I guess there might be some way to look at the source code; then I think everything might become clear.
Could you give some advice on this?
numpy is implemented as a mix of C code and Python code. The source is available for browsing on github, and can be downloaded as a git repository. But digging your way into the C source takes some work. A lot of the files are marked as .c.src, which means they pass through one or more layers of preprocessing before compiling.
And Python is written in a mix of C and Python as well. So don't try to force things into C++ terms.
It's probably better to draw on your MATLAB experience, with adjustments to allow for Python. And numpy has a number of quirks that go beyond Python. It is using Python syntax, but because it has its own C code, it isn't simply a Python class.
I use IPython as my usual working environment. With that I can use foo? to see the documentation for foo (same as the Python help(foo)), and foo?? to see the code, if it is written in Python (like the MATLAB/Octave type(foo)).
Python objects have attributes, and methods. Also properties which look like attributes, but actually use methods to get/set. Usually you don't need to be aware of the difference between attributes and properties.
x.ndim # as noted, has a get, but no set; see also np.ndim(x)
x.shape # has a get, but can also be set; see also np.shape(x)
x.<tab> in IPython shows me all the completions for an ndarray. There are 4*18 of them. Some are methods, some attributes. x._<tab> shows a bunch more that start with __. These are 'private' - not meant for public consumption, but that's just semantics. You can look at them and use them if needed.
Offhand, x.shape is the only ndarray property that I set, and even with that I usually use reshape(...) instead. Read their docs to see the difference. ndim is the number of dimensions, and it doesn't make sense to change that directly. It is len(x.shape); change the shape to change ndim. Likewise, x.size shouldn't be something you change directly.
Some of these properties are accessible via functions. np.shape(x) == x.shape, similar to MATLAB size(x). (MATLAB doesn't have . attribute syntax).
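A small illustration of the settable shape property and the read-only ndim:
import numpy as np
x = np.arange(12)
print(x.ndim, x.shape)    # 1 (12,)
x.shape = (3, 4)          # setting shape in place; ndim changes with it
print(x.ndim, x.shape)    # 2 (3, 4)
print(np.shape(x))        # (3, 4) -- the function form, like MATLAB size(x)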
x.__array_interface__ is a handy property that gives a dictionary with a number of the attributes:
In [391]: x.__array_interface__
Out[391]:
{'descr': [('', '<f8')],
'version': 3,
'shape': (50,),
'typestr': '<f8',
'strides': None,
'data': (165646680, False)}
The docs for ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None) - the __new__ method - list these attributes:
Attributes
----------
T : ndarray
Transpose of the array.
data : buffer
The array's elements, in memory.
dtype : dtype object
Describes the format of the elements in the array.
flags : dict
Dictionary containing information related to memory use, e.g.,
'C_CONTIGUOUS', 'OWNDATA', 'WRITEABLE', etc.
flat : numpy.flatiter object
Flattened version of the array as an iterator. The iterator
allows assignments, e.g., ``x.flat = 3`` (See `ndarray.flat` for
assignment examples; TODO).
imag : ndarray
Imaginary part of the array.
real : ndarray
Real part of the array.
size : int
Number of elements in the array.
itemsize : int
The memory use of each array element in bytes.
nbytes : int
The total number of bytes required to store the array data,
i.e., ``itemsize * size``.
ndim : int
The array's number of dimensions.
shape : tuple of ints
Shape of the array.
strides : tuple of ints
The step-size required to move from one element to the next in
memory. For example, a contiguous ``(3, 4)`` array of type
``int16`` in C-order has strides ``(8, 2)``. This implies that
to move from element to element in memory requires jumps of 2 bytes.
To move from row-to-row, one needs to jump 8 bytes at a time
(``2 * 4``).
ctypes : ctypes object
Class containing properties of the array needed for interaction
with ctypes.
base : ndarray
If the array is a view into another array, that array is its `base`
(unless that array is also a view). The `base` array is where the
array data is actually stored.
All of these should be treated as properties, though I don't think numpy actually uses the property mechanism. In general they should be considered to be 'read-only'. Besides shape, I only recall changing data (pointer to a data buffer), and strides.
Regarding your first question, Python has syntactic sugar for properties, including fine-grained control of getting, setting, deleting them, as well as restricting any of the above.
So, for example, if you have
class Foo(object):
    @property
    def shmip(self):
        return 3
then you can write Foo().shmip to obtain 3, but, if that is the class definition, you've disabled setting Foo().shmip = 4.
In other words, those are read-only properties.
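By contrast, here is a sketch of how a settable property (the way ndarray.shape behaves) could look; Bar is purely hypothetical:
class Bar(object):
    def __init__(self):
        self._shape = (3, 4)

    @property
    def shape(self):            # getter: Bar().shape reads like an attribute
        return self._shape

    @shape.setter
    def shape(self, value):     # setter: Bar().shape = (4, 3) is now allowed
        self._shape = tuple(value)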
Question 1
The list you're mentioning is one that contains attributes for a Numpy array.
For example:
a = np.array([1, 2, 3])
print(type(a))
# <class 'numpy.ndarray'>
Since a is a numpy.ndarray, you're able to use those attributes to find out more about it (e.g. a.size will result in 3). To get information about what each one does, visit SciPy's documentation about the attributes.
Question 2
You can start here to get familiar with some of the basic tools of Numpy, as well as the Reference Manual, assuming you're using v1.9. For information specific to the Numpy array you can go to Array Objects.
Their documentation is very extensive and very helpful, with examples provided throughout.

input/output validation/casting of a numpy calculation

This is a situation that happens quite often in my code. Say I have a function do_sth(a, b) that, only for the sake of this example, simply calculates a+b, with a, b either 1D numpy arrays or scalars. On many occasions, I need the function to broadcast the operation, so that if both a and b are 1D arrays, the result will be a 2D array. An example of what I mean follows:
do_sth(1,2) -> 3
do_sth([1,2],0) -> array([1, 2])
do_sth(0,[3,4]) -> array([3, 4])
do_sth([1,2],[3,4]) -> array([[4, 5], [5, 6]])
This is a bit similar to how a numpy ufunc behaves. A possible implementation follows:
from numpy import newaxis, atleast_1d

def do_sth(a, b):
    "a, b should be either 1d numpy arrays or scalars"
    a, b = map(atleast_1d, [a, b])
    # the line below mocks a more complicated calculation
    res = a[:, newaxis] + b[newaxis]
    conds = [a.size == 1, b.size == 1]
    if all(conds):
        return res[0, 0]
    elif any(conds):
        return res.ravel()
    else:
        return res
As you can see, there's quite a lot of boilerplate. The first question is: is this the right way to do this input/output casting? Is there any reason to not use a decorator to deal with a situation like this? Is there any guideline on the matter?
Moreover, the more complicated calculation, here mocked by the addition, often fails badly if a or b are numpy arrays with a 2D or 3D shape, for example. I say badly in the sense that the point where the calculation fails is not obvious, or may change over time in different revisions of the code, and it is hard to see the connection between the error and the wrong input shape. I think it is then NOT advisable to put the complicated calculation in a try/except block (following Python EAFP). In this case, is it correct to check the shape of the two arrays at the beginning of the function? Is there any alternative? Is there a numpy function that allows one, at the same time, to convert the input to a numpy array and also check that the input is compatible with a certain number of dimensions, something like asarray_withdim(arr, ndim=5)?
Regarding the use of decorators - I haven't seen much use of decorators in numpy code, but I think that's because most of the functionality was developed before decorators became common in Python. If you can make it work, there shouldn't be any downside (but I'm not an expert with either decorators or ufuncs).
Non-compiled numpy functions often have a lot of code that massages the inputs into convenient dimensions. Then they do the core action, followed by final reshaping and type wrapping. They might use functions like np.atleast_2d to ensure there are enough dimensions, and .reshape(-1, 1, 1) to compress excess dimensions.
np.tensordot is an example of one that performs axes transpose plus reshape on the inputs so it can apply the compiled np.dot. np.insert starts with a number of ndim and isinstance tests. Special cases are handled early, while the general one is left to the end. np.einsum is compiled, but there's a lot of preprocessing being done in C code, before it finally creates an nditer object and does the calculation.
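As a rough sketch of the decorator idea (broadcast_1d is a hypothetical name; it moves the input validation and output squeezing out of the core calculation):
import functools
import numpy as np

def broadcast_1d(func):
    # hypothetical decorator: promote scalars to 1-D, validate dimensions,
    # run the core calculation, then squeeze the result back down
    @functools.wraps(func)
    def wrapper(a, b):
        a = np.atleast_1d(np.asarray(a))
        b = np.atleast_1d(np.asarray(b))
        if a.ndim != 1 or b.ndim != 1:
            raise ValueError("a and b must be scalars or 1d arrays")
        res = func(a, b)
        if a.size == 1 and b.size == 1:
            return res[0, 0]
        if a.size == 1 or b.size == 1:
            return res.ravel()
        return res
    return wrapper

@broadcast_1d
def do_sth(a, b):
    # core calculation only; inputs are guaranteed to be 1d arrays here
    return a[:, np.newaxis] + b[np.newaxis]

print(do_sth(1, 2))              # 3
print(do_sth([1, 2], [3, 4]))    # [[4 5] [5 6]]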
