Numba signatures protocol - python

Despite searching Stack Overflow and the internet as a whole, reading several Stack Overflow questions and numba.pydata.org pages, and picking up some clues about how to tell Numba what types I want to give to and get from functions, I'm not finding the actual logic of how it works.
For example, I experimented with a function that takes a list of integers and produces another list of integers. The decorator @numba.jit(numba.int64[:](numba.int64[:])) worked, but the decorators @numba.njit(numba.int64[:](numba.int64[:])) and @numba.vectorize(numba.int64[:](numba.int64[:])) did not.
(njit got past the decorator and stumbled on the function body itself; I'm guessing that appending elements to a list is not available in 'no python' mode. vectorize, however, complains about the signature: TypeError: 'Signature' object is not iterable. Maybe it's worried that a 1D array could consist of a single element without brackets, which is not iterable?)
Is there a simple way to understand how Numba works to enough depth to anticipate how I should express the signature?

The simplest answer for jit (and njit, which is just an alias for jit with nopython=True) is to try to avoid writing signatures altogether - in the common cases, type inference will get you there.
Specific to your question, numba.int64[:](numba.int64[:]) is a valid signature and works for jit.
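For example, a minimal sketch (the function name and body are illustrative, not from the question):

import numba
import numpy as np

@numba.jit(numba.int64[:](numba.int64[:]))
def double_all(arr):
    # array in, array out, matching the int64[:](int64[:]) signature
    return arr * 2

double_all(np.array([1, 2, 3], dtype=np.int64))
# array([2, 4, 6], dtype=int64)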
numba.vectorize expects an iterable of signatures (hence the error message), so your signature(s) need to be wrapped in a list. Additionally, vectorize creates a NumPy ufunc, which is defined by scalar operations (which are then broadcast), so your signature must be written in terms of scalar types. E.g.
import numba
import numpy as np

@numba.vectorize([numba.int64(numba.int64)])
def add_one(v):
    return v + 1

add_one(np.array([4, 5, 6], dtype=np.int64))
# Out[117]: array([5, 6, 7], dtype=int64)

Related

numpy.product vs numpy.prod vs ndarray.prod

I'm reading through the Numpy docs, and it appears that the functions np.prod(...), np.product(...) and the ndarray method a.prod(...) are all equivalent.
Is there a preferred version to use, both in terms of style/readability and performance? Are there different situations where different versions are preferable? If not, why are there three separate but very similar ways to perform the same operation?
As of the master branch today (1.15.0), np.product just uses np.prod, and may be deprecated eventually. See MAINT: Remove duplicate implementation for aliased functions #10653.
And np.prod and ndarray.prod both end up calling umath.multiply.reduce, so there is really no difference between them, besides the fact that the free functions can accept array-like types (like Python lists) in addition to NumPy arrays.
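To illustrate the array-like point (my example, not from the original answer):

import numpy as np

np.prod([2, 3, 4])           # free function accepts a plain list -> 24
np.array([2, 3, 4]).prod()   # method form, only available on ndarrays -> 24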
Prior to this, as in NumPy 1.14.2, the documentation claimed np.product and np.prod were the same, but there were bugs because of the duplicated implementation that Parag mentions, e.g. Eric Weiser's example from #10651:
>>> class CanProd(object):
...     def prod(self, axis, dtype, out): return "prod"
...
>>> np.product(CanProd())
<__main__.CanProd object at 0x0000023BAF7B29E8>
>>> np.prod(CanProd())
'prod'
So in short, now they're the same, and favor np.prod over np.product since the latter is an alias that may be deprecated.
This is what I could gather from the source code of NumPy 1.14.0. For the answer relevant to the current master branch (NumPy 1.15.0), see miradulo's answer.
For an ndarray, prod() and product() are equivalent.
For an ndarray, prod() and product() will both call um.multiply.reduce().
If the object type is not ndarray but it still has a prod method, then prod() will return obj.prod(axis=axis, dtype=dtype, out=out, **kwargs), whereas product() will still try to use um.multiply.reduce.
If the object is not an ndarray and it does not have a prod method, then prod() will behave as product() does.
ndarray.prod() is equivalent to np.prod() on that array.
I am not sure about the latter part of your question regarding preference and readability.

What is the identity of "ndim, shape, size, ..etc" of ndarray in numpy

I'm quite new to Python.
After using Matlab for many, many years, I recently started studying numpy/scipy.
The most basic element of numpy seems to be the ndarray.
In ndarray, there are following attributes:
ndarray.ndim
ndarray.shape
ndarray.size
...etc
I'm quite familiar with C++/JAVA classes, but I'm a novice at Python OOP.
Q1: My first question is what is the identity of the above attributes?
At first, I assumed that the above attributes might be public member variables. But soon, I found that a.ndim = 10 doesn't work (assuming a is an object of ndarray), so it seems they are not public member variables.
Next, I guessed that they might be public methods, similar to getter methods in C++. However, when I tried a.ndim() with parentheses, it didn't work. So it seems they are not public methods either.
The other possibility might be that they are private member variables, but print a.ndim works, so they cannot be private data members.
So, I cannot figure out what is the true identity of the above attributes.
Q2. Where can I find the Python code implementation of ndarray? Since I installed numpy/scipy on my local PC, I guess there might be some way to look at the source code; then I think everything might become clear.
Could you give some advice on this?
numpy is implemented as a mix of C code and Python code. The source is available for browsing on github, and can be downloaded as a git repository. But digging your way into the C source takes some work. A lot of the files are marked as .c.src, which means they pass through one or more layers of preprocessing before compiling.
And Python is written in a mix of C and Python as well. So don't try to force things into C++ terms.
It's probably better to draw on your MATLAB experience, with adjustments to allow for Python. And numpy has a number of quirks that go beyond Python. It is using Python syntax, but because it has its own C code, it isn't simply a Python class.
I use IPython as my usual working environment. With that I can use foo? to see the documentation for foo (same as the Python help(foo)), and foo?? to see the code - if it is written in Python (like the MATLAB/Octave type(foo)).
Python objects have attributes and methods, and also properties, which look like attributes but actually use methods to get/set. Usually you don't need to be aware of the difference between attributes and properties.
x.ndim # as noted, has a get, but no set; see also np.ndim(x)
x.shape # has a get, but can also be set; see also np.shape(x)
x.<tab> in IPython shows me all the completions for an ndarray. There are 4*18 of them. Some are methods, some attributes. x._<tab> shows a bunch more that start with __. These are 'private' - not meant for public consumption, but that's just semantics. You can look at them and use them if needed.
Offhand, x.shape is the only ndarray property that I set, and even with that I usually use reshape(...) instead. Read their docs to see the difference. ndim is the number of dimensions, and it doesn't make sense to change that directly; it is len(x.shape), so change the shape to change ndim. Likewise, x.size shouldn't be something you change directly.
Some of these properties are accessible via functions. np.shape(x) == x.shape, similar to MATLAB size(x). (MATLAB doesn't have . attribute syntax).
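A quick illustration of these (my example values):

import numpy as np

x = np.zeros((3, 4))
x.ndim          # 2
x.shape         # (3, 4)
np.shape(x)     # (3, 4), like MATLAB's size(x)
len(x.shape)    # 2, which is where ndim comes from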
x.__array_interface__ is a handy property that gives a dictionary with a number of the attributes:
In [391]: x.__array_interface__
Out[391]:
{'descr': [('', '<f8')],
'version': 3,
'shape': (50,),
'typestr': '<f8',
'strides': None,
'data': (165646680, False)}
The docs for ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None), i.e. the __new__ method, list these attributes:
Attributes
----------
T : ndarray
Transpose of the array.
data : buffer
The array's elements, in memory.
dtype : dtype object
Describes the format of the elements in the array.
flags : dict
Dictionary containing information related to memory use, e.g.,
'C_CONTIGUOUS', 'OWNDATA', 'WRITEABLE', etc.
flat : numpy.flatiter object
Flattened version of the array as an iterator. The iterator
allows assignments, e.g., ``x.flat = 3`` (See `ndarray.flat` for
assignment examples; TODO).
imag : ndarray
Imaginary part of the array.
real : ndarray
Real part of the array.
size : int
Number of elements in the array.
itemsize : int
The memory use of each array element in bytes.
nbytes : int
The total number of bytes required to store the array data,
i.e., ``itemsize * size``.
ndim : int
The array's number of dimensions.
shape : tuple of ints
Shape of the array.
strides : tuple of ints
The step-size required to move from one element to the next in
memory. For example, a contiguous ``(3, 4)`` array of type
``int16`` in C-order has strides ``(8, 2)``. This implies that
to move from element to element in memory requires jumps of 2 bytes.
To move from row-to-row, one needs to jump 8 bytes at a time
(``2 * 4``).
ctypes : ctypes object
Class containing properties of the array needed for interaction
with ctypes.
base : ndarray
If the array is a view into another array, that array is its `base`
(unless that array is also a view). The `base` array is where the
array data is actually stored.
All of these should be treated as properties, though I don't think numpy actually uses the property mechanism. In general they should be considered to be 'read-only'. Besides shape, I only recall changing data (pointer to a data buffer), and strides.
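A sketch of the read-only behavior (my illustration):

import numpy as np

x = np.arange(12)
x.shape = (3, 4)   # shape is settable; this reshapes in place
try:
    x.ndim = 1     # ndim is not writable
except AttributeError as e:
    print(e)       # reports that the attribute is not writable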
Regarding your first question, Python has syntactic sugar for properties, including fine-grained control of getting, setting, deleting them, as well as restricting any of the above.
So, for example, if you have
class Foo(object):
    @property
    def shmip(self):
        return 3
then you can write Foo().shmip to obtain 3, but, if that is the class definition, you've disabled setting Foo().shmip = 4.
In other words, those are read-only properties.
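And for completeness, a sketch of the fine-grained control mentioned above: adding a setter re-enables assignment (my example):

class Foo(object):
    def __init__(self):
        self._shmip = 3

    @property
    def shmip(self):
        return self._shmip

    @shmip.setter
    def shmip(self, value):
        self._shmip = value

f = Foo()
f.shmip        # 3
f.shmip = 4    # now allowed; goes through the setter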
Question 1
The list you're mentioning contains the attributes of a NumPy array.
For example:
import numpy as np

a = np.array([1, 2, 3])
print(type(a))
# <class 'numpy.ndarray'>
Since a is a numpy.ndarray, you're able to use those attributes to find out more about it (e.g. a.size will result in 3). To get information about what each one does, visit SciPy's documentation about the attributes.
Question 2
You can start here to get yourself familiar with some of the basic tools of Numpy, as well as the Reference Manual, assuming you're using v1.9. For information specific to NumPy arrays you can go to Array Objects.
Their documentation is very extensive and very helpful, with examples provided throughout.

How to write a function which takes a slice?

I would like to write a function in Python which takes a slice as a parameter. Ideally, a user would be able to call the function as follows:
foo(a:b:c)
Unfortunately, this syntax is not permitted by Python - the use of a:b:c is only allowed within [], not ().
I therefore see three possibilities for my function:
Require the user to use a slice "constructor" (where s_ acts like the version provided by numpy):
foo(slice(a, b, c))
foo(s_[a:b:c])
Put the logic of my function into a __getitem__ method:
foo[a:b:c]
Give up trying to take a slice and take start, stop and step individually:
foo(a, b, c)
Is there a way to get the original syntax to work? If not, which of the workaround syntaxes would be preferred? Or is there another, better option?
Don't surprise your users.
If you use the slicing syntax consistently with what a developer expects from slicing syntax, that same developer will expect the square-bracket operation, i.e. a __getitem__() method.
If instead the returned object is not somehow a slice of the original object, people will be confused if you stick to a __getitem__() solution. Use a function call foo(a, b, c), don't mention slices at all, and optionally assign default values if that makes sense.
Slices make more sense when they're expressed as a slice of something. So, another alternative is to be more object-oriented: create a temporary object that represents your slice of something, and put your function as a method of it.
For example, if your function is really:
foo(bar, a:b:c)
or
bar.foo(a:b:c)
then you can replace this with:
bar[a:b:c].foo()
If bar[a:b:c] already has a different meaning, then come up with another name baz and do:
bar.baz[a:b:c].foo()
It's hard to give convincing examples without a real context, because you're trying to name related things with names that make intuitive sense, let you write unambiguous code, and are relatively short.
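Mechanically, though, the bar.baz[a:b:c].foo() shape can be wired up like this (all names here are hypothetical, and foo just sums as a stand-in for the real operation):

class BazView:
    # hypothetical object representing "a slice of bar"
    def __init__(self, data, key):
        self.part = data[key]
    def foo(self):
        return sum(self.part)   # stand-in for the real calculation

class BazIndexer:
    def __init__(self, data):
        self.data = data
    def __getitem__(self, key):
        return BazView(self.data, key)

class Bar:
    def __init__(self, data):
        self.data = data
        self.baz = BazIndexer(data)

Bar(list(range(10))).baz[1:7:2].foo()   # 1 + 3 + 5 = 9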
If you're really just writing a function on its own operating on a slice, then either:
Your function modifies a slice, returning a different slice:
bar[foo(a:b:c)]
If this is the case, whatever valid syntax you choose is going to look a little confusing. You probably don't want to use slices if you're aiming for a broad audience of Python programmers.
Your function really operates on a slice of the integers, so you can make that explicit with a temporary object:
the_integers[a:b:c].foo()
The use of [a:b:c] is, as you note, a syntax thing. The interpreter raises a syntax error for (a:b:c) right away, before your code has any chance to do something with the values. There isn't a way around this syntax without rewriting the interpreter.
It's worth keeping in mind that the interpreter translates foo[a:b:c] to
foo.__getitem__(slice(a,b,c))
The slice object itself is not very complicated. It just has 3 attributes (start, stop, step) and a method indices. It's the __getitem__ method that makes sense of those values.
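So option 2 from the question boils down to something like this (the class name and return value are illustrative):

class Foo:
    def __getitem__(self, key):
        # foo[a:b:c] arrives here as slice(a, b, c)
        if isinstance(key, slice):
            return key.start, key.stop, key.step
        return key

foo = Foo()
foo[1:10:2]   # (1, 10, 2)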
np.s_ and other functions/classes in np.lib.index_tricks are good examples of how __getitem__ and slice can be used to extend (or simplify) indexing. For example, these are equivalent:
np.r_[3:4:10j]
np.linspace(3,4,10)
As to the foo(a, b, c) syntax, the very common np.arange() uses it. As do range and xrange. So you, and your users, should be quite familiar with it.
Since the alternatives all end up giving you the start/step/stop trio of values, they are functionally equivalent (in speed). So the choice comes down to user preferences and familiarity.
While your function can't take a:b:c notation directly, it can be written to handle a variety of inputs - a slice, 3 positional arguments, a tuple, a tuple of slices (as from s_), or keyword arguments. And following the basic numpy indexing you could distinguish between tuples and lists.
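For instance, a sketch of such a flexible front end (my illustration, not from the answer):

def foo(*args):
    # accept foo(slice(a, b, c)), foo(a, b, c), foo(a, b), or foo((a, b, c))
    if len(args) == 1 and isinstance(args[0], (slice, tuple)):
        s = args[0]
        args = (s.start, s.stop, s.step) if isinstance(s, slice) else s
    start, stop, step = (tuple(args) + (None,) * 3)[:3]
    return start, stop, step

foo(slice(1, 10, 2))   # (1, 10, 2)
foo(1, 10, 2)          # (1, 10, 2)
foo(1, 10)             # (1, 10, None)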

input/output validation/casting of a numpy calculation

This is a situation that happens quite often in my code. Say I have a function do_sth(a, b) that, only for the sake of this example, simply calculates a + b, with a, b either 1D numpy arrays or scalars. On many occasions, I need the function to broadcast the operation, so that if both a, b are 1D arrays, the result will be a 2D array. An example of what I mean follows:
do_sth(1,2) -> 3
do_sth([1,2],0) -> array([1, 2])
do_sth(0,[3,4]) -> array([3, 4])
do_sth([1,2],[3,4]) -> array([[4, 5], [5, 6]])
This is a bit similar to how a numpy ufunc behaves. A possible implementation follows:
from numpy import newaxis, atleast_1d

def do_sth(a, b):
    """a, b should be either 1d numpy arrays or scalars"""
    a, b = map(atleast_1d, [a, b])
    # the line below mocks a more complicated calculation
    res = a[:, newaxis] + b[newaxis]
    conds = [a.size == 1, b.size == 1]
    if all(conds):
        return res[0, 0]
    elif any(conds):
        return res.ravel()
    else:
        return res
As you can see, there's quite a lot of boilerplate. The first question is: is this the right way to do this input/output casting? Is there any reason to not use a decorator to deal with a situation like this? Is there any guideline on the matter?
Moreover, the more complicated calculation, here mocked by the addition, often fails badly if a or b are numpy arrays with 2D or 3D shapes, for example. I say badly in the sense that the point where the calculation fails is not obvious, or may change over time across different revisions of the code, and it is hard to see the connection between the error and the wrong input shape. I think it is then NOT advisable to put the complicated calculation in a try/except block (following Python EAFP). In this case, is it correct to check the shapes of the 2 arrays at the beginning of the function? Is there any alternative? Is there a numpy function that at the same time converts the input to a numpy array and checks that it is compatible with a certain number of dimensions, something like asarray_withdim(arr, ndim=5)?
Regarding the use of decorators - I haven't seen much use of decorators in numpy code, but I think that's because most of the functionality was developed before decorators became common in Python. If you can make it work, there shouldn't be any downside (but I'm not an expert with either decorators or ufuncs).
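For illustration, the boilerplate from the question could be factored into a decorator along these lines (a sketch, with the squeeze rules copied from the question's do_sth; the decorator name is my own):

import functools
import numpy as np

def broadcast_1d(func):
    @functools.wraps(func)
    def wrapper(a, b):
        a, b = np.atleast_1d(a), np.atleast_1d(b)
        res = func(a, b)                    # func only ever sees 1d arrays
        conds = [a.size == 1, b.size == 1]
        if all(conds):
            return res[0, 0]                # scalars in -> scalar out
        elif any(conds):
            return res.ravel()              # one scalar -> 1d result
        return res                          # two arrays -> 2d result
    return wrapper

@broadcast_1d
def do_sth(a, b):
    return a[:, np.newaxis] + b[np.newaxis]

do_sth([1, 2], [3, 4])   # array([[4, 5], [5, 6]])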
Non-compiled numpy functions often have a lot of code that massages the inputs into convenient dimensions. Then they do the core action, followed by final reshaping and type wrapping. They might use functions like np.atleast_2d to ensure there are enough dimensions, and .reshape(-1, 1, 1) to compress excess dimensions.
np.tensordot is an example of one that performs axes transpose plus reshape on the inputs so it can apply the compiled np.dot. np.insert starts with a number of ndim and isinstance tests. Special cases are handled early, while the general one is left to the end. np.einsum is compiled, but there's a lot of preprocessing being done in C code, before it finally creates an nditer object and does the calculation.

Python and Numba for vectorized functions

Good day. I'm writing a Python module for some numeric work. Since there's a lot of stuff going on, I've been spending the last few days optimizing the code to improve calculation times.
However, I have a question concerning Numba.
Basically, I have a class with some fields which are numpy arrays, which I initialize in the following way:
def __init__(self):
    a = numpy.arange(0, self.max_i, 1)
    self.vibr_energy = self.calculate_vibr_energy(a)

def calculate_vibr_energy(self, i):
    return numpy.exp(-self.harmonic * i - self.anharmonic * (i ** 2))
So, the code is vectorized, and using Numba's JIT results in some improvement. However, sometimes I need to access the calculate_vibr_energy function from outside the class, and pass a single integer instead of an array in place of i.
As far as I understand, if I use Numba's JIT on the calculate_vibr_energy, it will have to always take an array as an argument.
So, which of the following options is better:
1) Create a new function calculate_vibr_energy_single(i), which will only take a single integer number, and use Numba on it too
2) Replace all usages of the function that are similar to this one:
myclass.calculate_vibr_energy(1)
with this:
tmp = np.array([1])
myclass.calculate_vibr_energy(tmp)[0]
Or are there other, more efficient (or at least more Pythonic) ways of doing that?
I have only played a little with numba so far, so I may be mistaken, but as far as I've understood it, using the autojit decorator should give functions that can take arguments of any type.
See e.g. http://numba.pydata.org/numba-doc/dev/pythonstuff.html
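In current Numba, autojit's behavior is what you get from jit/njit without a signature: types are inferred at call time, so one compiled function accepts either a scalar or an array. A minimal sketch (passing the class fields in as plain arguments is my own restructuring, for illustration):

import numba
import numpy as np

@numba.njit
def calculate_vibr_energy(harmonic, anharmonic, i):
    # works for a scalar i and for a 1d-array i alike
    return np.exp(-harmonic * i - anharmonic * (i ** 2))

calculate_vibr_energy(0.5, 0.01, 3)                # single integer
calculate_vibr_energy(0.5, 0.01, np.arange(5.0))   # whole array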
