Type hint 2D numpy array - python

As a follow-up to this question, I have a function that will return a 2D numpy.array with a fixed number of columns but a variable number of rows:
import numpy as np
import numpy.typing as npt

def example() -> npt.ArrayLike:
    data = np.array([[1, 2, 3],
                     [4, 5, 6],
                     ...,
                     [x, y, z]])
    return data
How can I specifically hint that the returned array will be 3 columns by N (variable) rows?

Update on October 24, 2022
This is still not possible, but according to this comment on a NumPy GitHub issue, it will be once mypy supports PEP 646. See the relevant issue on mypy's GitHub repo; it is still open at the time of writing.
Python 3.11 was released today with support for PEP 646. Once mypy supports PEP 646, users will be able to type-hint the shapes of NumPy arrays.
Older answer
It seems like it is not possible to type-hint the shape of a numpy.ndarray at this point (September 13, 2022), although the data type can already be hinted (see the sketch below). There are, however, some recent pull requests to numpy working towards shape typing.
https://github.com/numpy/numpy/pull/17719
makes the np.ndarray class generic w.r.t. its shape and dtype: np.ndarray[~Shape, ~DType]
However, an explicit non-goal of that PR is to make runtime-subscriptable aliases for numpy.ndarray. According to that PR, those changes will come in later PRs.
https://github.com/numpy/numpy/issues/16544
Issue discussing typing support for shapes. It is still open at the time of writing.
PEP 646 is related to this and has been accepted into Python 3.11. According to numpy/numpy issue #16544, one will be able to type-hint shape and data type of arrays after type checkers like mypy add support for PEP 646.
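What is already possible today is annotating the dtype via numpy.typing.NDArray, even though the "3 columns by N rows" part cannot be expressed. A minimal sketch, assuming NumPy 1.21 or newer:
import numpy as np
import numpy.typing as npt

# The element dtype is checked by mypy; the shape constraint is not.
def example() -> npt.NDArray[np.int_]:
    return np.array([[1, 2, 3],
                     [4, 5, 6]])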
This is possible to do with the nptyping package, but that is not part of numpy.
from typing import Any
from nptyping import NDArray
# Nx3 array with Any data type.
NDArray[(Any, 3), Any]
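A short usage sketch building on that alias (this uses the older nptyping API shown above; newer nptyping releases switched to a Shape-string based syntax, so treat the exact spelling as version-dependent):
import numpy as np
from typing import Any
from nptyping import NDArray

# Hypothetical alias name, purely for readability.
Nx3Array = NDArray[(Any, 3), Any]

def example() -> Nx3Array:
    return np.array([[1, 2, 3],
                     [4, 5, 6]])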

Related

What is the use of PyArrow Tensor class?

In the Arrow documentation there is a class named Tensor that is created from numpy ndarrays. However, the documentation is pretty sparse, and after playing around a bit I haven't found a use case for it. For example, you can't construct a table with it:
import pyarrow as pa
import numpy as np
x = np.random.normal(0, 1.5, size=(4, 3, 2))
T = pa.Tensor.from_numpy(x, dim_names="xyz")
# error
pa.table([pa.array([0, 1, 2, 3]), T], names=["f1", "f2"])
Nor is there a corresponding type for schemas and structs. So my question is: what is it there for? Can someone provide a simple example using them?
Here's a related question from over 5 years ago, but it asked about Parquet. While I'm interested in persisting these tensors, before that I should understand how to use them, and as of today, I don't.
AFAIK the pyarrow Tensor class is only used in IPC (serialization): https://arrow.apache.org/docs/dev/format/Other.html (that is, as a message in the IPC specification).
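A minimal sketch of that IPC use, assuming pyarrow's ipc.write_tensor/ipc.read_tensor functions: the Tensor is serialized as a standalone IPC message rather than stored as a Table column.
import numpy as np
import pyarrow as pa

x = np.random.normal(0, 1.5, size=(4, 3, 2))
tensor = pa.Tensor.from_numpy(x, dim_names=["a", "b", "c"])

# Serialize the tensor as an IPC message into an in-memory buffer.
sink = pa.BufferOutputStream()
pa.ipc.write_tensor(tensor, sink)
buf = sink.getvalue()

# Read it back from the buffer.
restored = pa.ipc.read_tensor(pa.BufferReader(buf))
assert restored.equals(tensor)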
To use tensors in a pyarrow Table you would have to use an extension type. We are currently working on that; here you can find the umbrella issue:
https://github.com/apache/arrow/issues/33924
And you can see how it will be used in the PyArrow implementation example:
https://github.com/apache/arrow/pull/33948/files

Why do we use array_name.dtype versus dtype(array_name) in python, numpy

In numpy, to check the type of an array the code is
type(array_name)
but to check the type of the values stored in the array the code is
array_name.dtype
I would have thought it would be
dtype(array_name)
This question keeps arising in different contexts as well.
dtype is the type of the contents of the array, and it's a numpy- (and pandas-) specific thing. It's easier and more convenient, both for the developers and the users of the library, to expose it as an attribute.
type, which returns the Python type of any object, is a built-in function. While the designers of Python could have made it a property on every object, they chose to make it a global function.
dtype and type seem very similar in this case, but in reality, they have nothing to do with each other.
In Python, type is a built-in function that returns the type of anything you pass to it in argument. You could call type(x) without any assumption on x and Python would tell you about the type of x.
On the other hand, numpy arrays are objects. As such they have a certain number of attributes and one of them is dtype. Only numpy arrays (and other objects that follow the same logic) have dtypes: it wouldn't make sense to ask for the dtype of an integer for example.
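A small illustration of the difference (the exact integer dtype is platform-dependent):
import numpy as np

a = np.array([1, 2, 3])
print(type(a))    # <class 'numpy.ndarray'> -- the Python type of the object itself
print(a.dtype)    # int64 -- the type of the elements stored in the array

print(type(42))   # <class 'int'> -- type() works on any Python object
# (42).dtype would raise AttributeError: plain ints have no dtype attribute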

Why does Mypy think adding two Jax arrays returns a numpy array?

Consider the following file:
import jax.numpy as jnp

def test(a: jnp.ndarray, b: jnp.ndarray) -> jnp.ndarray:
    return a + b
Running mypy mypytest.py returns the following error:
mypytest.py:4: error: Incompatible return value type (got "numpy.ndarray[Any, dtype[bool_]]", expected "jax._src.numpy.lax_numpy.ndarray")
For some reason it believes adding two jax.numpy.ndarrays returns a NumPy array of bools. Am I doing something wrong? Or is this a bug in MyPy, or Jax's type annotations?
At least statically, jnp.ndarray is a subclass of np.ndarray with very minimal modifications:
class ndarray(np.ndarray, metaclass=_ArrayMeta):
    dtype: np.dtype
    shape: Tuple[int, ...]
    size: int

    def __init__(shape, dtype=None, buffer=None, offset=0, strides=None,
                 order=None):
        raise TypeError("jax.numpy.ndarray() should not be instantiated explicitly."
                        " Use jax.numpy.array, or jax.numpy.zeros instead.")
As such, it inherits np.ndarray's method type signatures.
I guess the runtime behaviour is achieved via the jnp.array function. Unless I've missed some stub files or type trickery, the result of jnp.array matches jnp.ndarray simply because jnp.array is untyped. You can test this out with
def foo(_: str) -> None:
    pass

foo(jnp.array(0))
which passes mypy.
So to answer your questions: I don't think you're doing anything wrong. It's a bug in the sense that it's probably not what the JAX authors intended, but it's not strictly incorrect either; since a jnp.ndarray is (statically) an np.ndarray, adding two jnp.ndarrays does give you an np.ndarray as far as the type checker is concerned.
As for why bools, that's likely because your jnp.arrays are missing generic parameters and the first valid overload for __add__ on np.ndarray is
@overload
def __add__(self: NDArray[bool_], other: _ArrayLikeBool_co) -> NDArray[bool_]: ...  # type: ignore[misc]
so it's just defaulted to bool.
In general, JAX has very poor compatibility with mypy, because it's very difficult to satisfy mypy's constraints with JAX's transformation model, which often calls functions with transform-specific tracer values that act as stand-ins for arrays (See How To Think in JAX: JIT Mechanics for a brief discussion of this mechanism).
This use of tracer types as stand-ins for arrays means that mypy will raise errors when strictly-typed JAX functions are transformed, and for this reason throughout the JAX codebase we tend to alias Array to Any, and use this as the return type annotation for JAX functions that return arrays.
It would be good to improve on this, because an Any return type is not very useful for effective type checking, but it's just the first of many challenges for making mypy play well with JAX. If you want to read some of the last few years worth of discussions surrounding this issue, I would start here: https://github.com/google/jax/issues/943
And in the meantime, my suggestion would be to use Any as a type annotation for JAX arrays.
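A minimal sketch of that suggestion, aliasing Array to Any so mypy accepts both concrete arrays and tracer values (the Array name is just a local alias here, mirroring what the JAX codebase does internally):
from typing import Any

import jax.numpy as jnp

Array = Any  # stand-in annotation for JAX arrays, as suggested above

def test(a: Array, b: Array) -> Array:
    return a + b

print(test(jnp.array([1.0, 2.0]), jnp.array([3.0, 4.0])))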

numpy.product vs numpy.prod vs ndarray.prod

I'm reading through the Numpy docs, and it appears that the functions np.prod(...), np.product(...) and the ndarray method a.prod(...) are all equivalent.
Is there a preferred version to use, both in terms of style/readability and performance? Are there different situations where different versions are preferable? If not, why are there three separate but very similar ways to perform the same operation?
As of the master branch today (1.15.0), np.product just uses np.prod, and may be deprecated eventually. See MAINT: Remove duplicate implementation for aliased functions. #10653.
And np.prod and ndarray.prod both end up calling umath.multiply.reduce, so there is really no difference between them, besides the fact that the free functions can accept array-like types (like Python lists) in addition to NumPy arrays.
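A quick illustration of that equivalence, and of the free function accepting plain sequences:
import numpy as np

a = np.array([[1, 2], [3, 4]])
print(np.prod(a))           # 24 -- free function
print(a.prod())             # 24 -- ndarray method; same reduction under the hood
print(np.prod([1, 2, 3]))   # 6  -- the free function also accepts array-likes
# [1, 2, 3].prod() would raise AttributeError: lists have no prod method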
Prior to this, e.g. in NumPy 1.14.2, the documentation claimed np.product and np.prod were the same, but there were bugs because of the duplicated implementation that Parag mentions, e.g. Eric Wieser's example from #10651:
>>> class CanProd(object):
...     def prod(self, axis, dtype, out): return "prod"
...
>>> np.product(CanProd())
<__main__.CanProd object at 0x0000023BAF7B29E8>
>>> np.prod(CanProd())
'prod'
So in short, now they're the same, and favor np.prod over np.product since the latter is an alias that may be deprecated.
This is what I could gather from the source code of NumPy 1.14.0. For the answer relevant to the current master branch (NumPy 1.15.0), see the answer of miradulo.
For an ndarray, prod() and product() are equivalent; both end up calling um.multiply.reduce().
If the object is not an ndarray but still has a prod method, then np.prod() returns obj.prod(axis=axis, dtype=dtype, out=out, **kwargs), whereas np.product() still goes through um.multiply.reduce.
If the object is not an ndarray and does not have a prod method, then np.prod() behaves the same as np.product().
The ndarray.prod() method is equivalent to calling np.prod() on that array.
I am not sure about the latter part of your question regarding preference and readability.

What is the identity of "ndim, shape, size, ..etc" of ndarray in numpy

I'm quite new to Python.
After using MATLAB for many, many years, I recently started studying numpy/scipy.
It seems like the most basic element of numpy is the ndarray.
In ndarray, there are following attributes:
ndarray.ndim
ndarray.shape
ndarray.size
...etc
I'm quite familiar with C++/Java classes, but I'm a novice at Python OOP.
Q1: My first question is: what is the identity of the above attributes?
At first, I assumed that the above attributes might be public member variables. But I soon found that a.ndim = 10 doesn't work (assuming a is an ndarray object). So it seems they are not public member variables.
Next, I guessed that they might be public methods, similar to getter methods in C++. However, when I tried a.ndim() with parentheses, it didn't work. So it seems they are not public methods either.
The other possibility might be that they are private member variables, but print a.ndim works, so they cannot be private data members.
So, I cannot figure out what is the true identity of the above attributes.
Q2: Where can I find the Python code implementation of ndarray? Since I installed numpy/scipy on my local PC, I guess there might be some way to look at the source code; then I think everything would become clear.
Could you give some advice on this?
numpy is implemented as a mix of C code and Python code. The source is available for browsing on GitHub, and can be downloaded as a git repository. But digging your way into the C source takes some work. A lot of the files are marked as .c.src, which means they pass through one or more layers of preprocessing before compiling.
And Python is written in a mix of C and Python as well. So don't try to force things into C++ terms.
It's probably better to draw on your MATLAB experience, with adjustments to allow for Python. And numpy has a number of quirks that go beyond Python. It is using Python syntax, but because it has its own C code, it isn't simply a Python class.
I use IPython as my usual working environment. With it I can use foo? to see the documentation for foo (same as the Python help(foo)), and foo?? to see the code, if it is written in Python (like the MATLAB/Octave type(foo)).
Python objects have attributes, and methods. Also properties which look like attributes, but actually use methods to get/set. Usually you don't need to be aware of the difference between attributes and properties.
x.ndim # as noted, has a get, but no set; see also np.ndim(x)
x.shape # has a get, but can also be set; see also np.shape(x)
x.<tab> in IPython shows me all the completions for an ndarray. There are 4*18 of them. Some are methods, some attributes. x._<tab> shows a bunch more that start with __. These are 'private' - not meant for public consumption, but that's just semantics. You can look at them and use them if needed.
Offhand, x.shape is the only ndarray property that I set, and even with that I usually use reshape(...) instead. Read their docs to see the difference. ndim is the number of dimensions, and it doesn't make sense to change that directly: it is len(x.shape); change the shape to change ndim. Likewise, x.size shouldn't be something you change directly.
Some of these properties are accessible via functions. np.shape(x) == x.shape, similar to MATLAB size(x). (MATLAB doesn't have . attribute syntax).
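A small sketch of that distinction, using nothing beyond plain NumPy: shape has a setter, while ndim and size are derived and read-only.
import numpy as np

x = np.arange(12)
x.shape = (3, 4)        # shape is settable (reshapes in place, no copy)
print(x.ndim)           # 2 -- derived from the shape, read-only
print(np.shape(x))      # (3, 4) -- function form, similar to MATLAB's size(x)

# x.ndim = 3            # would raise AttributeError: the attribute is not writable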
x.__array_interface__ is a handy property that gives a dictionary with a number of these attributes:
In [391]: x.__array_interface__
Out[391]:
{'descr': [('', '<f8')],
'version': 3,
'shape': (50,),
'typestr': '<f8',
'strides': None,
'data': (165646680, False)}
The docs for ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None), i.e. the __new__ method, list these attributes:
Attributes
----------
T : ndarray
    Transpose of the array.
data : buffer
    The array's elements, in memory.
dtype : dtype object
    Describes the format of the elements in the array.
flags : dict
    Dictionary containing information related to memory use, e.g.,
    'C_CONTIGUOUS', 'OWNDATA', 'WRITEABLE', etc.
flat : numpy.flatiter object
    Flattened version of the array as an iterator. The iterator
    allows assignments, e.g., ``x.flat = 3`` (See `ndarray.flat` for
    assignment examples; TODO).
imag : ndarray
    Imaginary part of the array.
real : ndarray
    Real part of the array.
size : int
    Number of elements in the array.
itemsize : int
    The memory use of each array element in bytes.
nbytes : int
    The total number of bytes required to store the array data,
    i.e., ``itemsize * size``.
ndim : int
    The array's number of dimensions.
shape : tuple of ints
    Shape of the array.
strides : tuple of ints
    The step-size required to move from one element to the next in
    memory. For example, a contiguous ``(3, 4)`` array of type
    ``int16`` in C-order has strides ``(8, 2)``. This implies that
    to move from element to element in memory requires jumps of 2 bytes.
    To move from row-to-row, one needs to jump 8 bytes at a time
    (``2 * 4``).
ctypes : ctypes object
    Class containing properties of the array needed for interaction
    with ctypes.
base : ndarray
    If the array is a view into another array, that array is its `base`
    (unless that array is also a view). The `base` array is where the
    array data is actually stored.
All of these should be treated as properties, though I don't think numpy actually uses the property mechanism. In general they should be considered to be 'read-only'. Besides shape, I only recall changing data (pointer to a data buffer), and strides.
Regarding your first question, Python has syntactic sugar for properties, including fine-grained control of getting, setting, deleting them, as well as restricting any of the above.
So, for example, if you have
class Foo(object):
    @property
    def shmip(self):
        return 3
then you can write Foo().shmip to obtain 3, but, with that class definition, setting Foo().shmip = 4 is disallowed.
In other words, those are read-only properties.
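A short sketch of the same idea with both a getter and a setter, which is roughly how an ndarray behaves: shape acts like a property with a setter, while ndim is get-only (the Box class and its attributes are hypothetical, purely for illustration).
class Box:
    def __init__(self):
        self._shape = (3, 4)

    @property
    def shape(self):            # readable, and writable via the setter below
        return self._shape

    @shape.setter
    def shape(self, value):
        self._shape = tuple(value)

    @property
    def ndim(self):             # read-only: no setter defined
        return len(self._shape)

b = Box()
b.shape = (2, 6)   # allowed
print(b.ndim)      # 2
# b.ndim = 3       # would raise AttributeError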
Question 1
The list you're mentioning contains attributes of a NumPy array.
For example:
import numpy as np

a = np.array([1, 2, 3])
print(type(a))
# <class 'numpy.ndarray'>
Since a is a numpy.ndarray, you're able to use those attributes to find out more about it (e.g., a.size will return 3). To get information about what each one does, visit SciPy's documentation about the attributes.
Question 2
You can start here to familiarize yourself with some of the basic tools of NumPy, as well as the Reference Manual, assuming you're using v1.9. For information specific to NumPy arrays you can go to Array Objects.
Their documentation is very extensive and very helpful, with examples provided throughout the website.
