What is the use of PyArrow Tensor class?

In the Arrow documentation there is a class named Tensor that is created from numpy ndarrays. However, the documentation is pretty sparse, and after playing with it a bit I haven't found a use case for it. For example, you can't construct a table with it:
import pyarrow as pa
import numpy as np
x = np.random.normal(0, 1.5, size=(4, 3, 2))
T = pa.Tensor.from_numpy(x, dim_names="xyz")
# error
pa.table([pa.array([0, 1, 2, 3]), T], names=["f1", "f2"])
Nor is there a corresponding type for schemas and structs. So my question is: what is it there for? Can someone provide a simple example that uses it?
Here's a related question from over 5 years ago, but it asked about Parquet. While I'm interested in persisting these tensors, before that I should understand how to use them, and as of today, I don't.

AFAIK the pyarrow Tensor class is only used in IPC (serialization), i.e. as a message type in the IPC specification: https://arrow.apache.org/docs/dev/format/Other.html
To use tensors in a pyarrow Table you would have to use an extension type for it. We are currently working on that, and here you can find an umbrella issue:
https://github.com/apache/arrow/issues/33924
And you can see how it will be used in the PyArrow implementation example:
https://github.com/apache/arrow/pull/33948/files

Related

Type hint 2D numpy array

As a follow-up to this question I have a function that will return a 2D numpy.array of fixed columns but variable rows
import numpy as np
import numpy.typing as npt

def example() -> npt.ArrayLike:
    data = np.array([[1, 2, 3],
                     [4, 5, 6],
                     ...,
                     [x, y, z]])
    return data
How can I specifically hint that the returned array will be 3 columns by N (variable) rows?

Update on October 24, 2022
This is still not possible currently, but according to this comment on a Numpy GitHub issue, it will be possible once mypy supports PEP 646. Please see the relevant issue on mypy's GitHub repo. That issue is open at the time of writing.
Python 3.11 has been released today with support for PEP 646. Once mypy supports PEP 646, users will be able to type-hint the shapes of Numpy arrays.
Older answer
It seems like it is not possible to type-hint the shape (or data type) of a numpy.ndarray at this point (September 13, 2022). There are, however, some recent pull requests to numpy working towards this goal.
https://github.com/numpy/numpy/pull/17719
makes the np.ndarray class generic w.r.t. its shape and dtype: np.ndarray[~Shape, ~DType]
However, an explicit non-goal of that PR is to make runtime-subscriptable aliases for numpy.ndarray. According to that PR, those changes will come in later PRs.
https://github.com/numpy/numpy/issues/16544
Issue discussing typing support for shapes. It is still open at the time of writing.
PEP 646 is related to this and has been accepted into Python 3.11. According to numpy/numpy issue #16544, one will be able to type-hint shape and data type of arrays after type checkers like mypy add support for PEP 646.
This is possible to do with the nptyping package, but that is not part of numpy.
from typing import Any
from nptyping import NDArray
# Nx3 array with Any data type.
NDArray[(Any, 3), Any]
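As a stopgap that needs only NumPy, the sketch below hints the dtype with numpy.typing.NDArray (available in NumPy 1.21 and later, as far as I know) and checks the column count at run time. The function name and the n_rows parameter are hypothetical, and the shape check is an ordinary assertion, not something a type checker will verify.

```python
import numpy as np
import numpy.typing as npt

def example_rows(n_rows: int) -> npt.NDArray[np.float64]:
    """Return an (n_rows, 3) float array; the shape is checked at run time only."""
    data = np.zeros((n_rows, 3), dtype=np.float64)
    # The static hint covers the dtype; the shape still needs a runtime check.
    assert data.ndim == 2 and data.shape[1] == 3
    return data

result = example_rows(5)
```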

How does python matplotlib.pyplot support datatype from pandas, xarray etc.?

To put the question in the simplest way -- how does the last line of this code work?
import numpy as np
import xarray as xr
tmp = xr.DataArray(np.random.rand(100,100))
howthiswork = np.array(tmp)
I'm under the impression that pandas is built on top of numpy and that xarray is in turn based on pandas. However, I can't find any info stating that numpy itself supports xarray.DataArray, and far fewer people use xarray than numpy. So why can I initialize a numpy.ndarray with an xarray.DataArray object? One guess is that the support is provided in the xarray package, but I don't see a mechanism by which code from the xarray package could affect numpy.ndarray.__init__ (or whatever function from the numpy package is involved).
Can anyone give me an explanation of how this up-down support is achieved?

NumPy looks for an __array__ method to decide how to convert arbitrary objects to a numpy array, and both pandas and xarray objects define one.
This is pretty easy to implement for your own class, e.g.,
import numpy as np

class MyArray:
    def __array__(self, dtype=None):
        return np.arange(5)

np.array(MyArray())
# array([0, 1, 2, 3, 4])
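The same hook is presumably what lets library functions accept such objects: anything that calls np.asarray on its input ends up going through __array__. A small sketch, where Celsius and column_mean are hypothetical names rather than anything from pandas, xarray, or matplotlib:

```python
import numpy as np

class Celsius:
    """Hypothetical container that exposes its data through __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # Build a fresh ndarray from the stored values.
        return np.array(self._values, dtype=dtype)

def column_mean(data):
    # Library-style function: coerce whatever comes in to an ndarray first.
    arr = np.asarray(data)
    return arr.mean()

mean_temp = column_mean(Celsius([10.0, 20.0, 30.0]))
```

column_mean never has to know about Celsius; the conversion is driven entirely by the object being passed in.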

What is the identity of "ndim, shape, size, ..etc" of ndarray in numpy

I'm quite new at Python.
After using MATLAB for many years, I recently started studying numpy/scipy.
The most basic element of numpy seems to be ndarray.
In ndarray, there are following attributes:
ndarray.ndim
ndarray.shape
ndarray.size
...etc
I'm quite familiar with C++/JAVA classes, but I'm a novice at Python OOP.
Q1: My first question is what is the identity of the above attributes?
At first, I assumed that the above attributes might be public member variables. But soon I found that a.ndim = 10 doesn't work (assuming a is an ndarray object), so they don't seem to be public member variables.
Next, I guessed that they might be public methods similar to getter methods in C++. However, when I tried a.ndim() with parentheses, it didn't work, so they don't seem to be public methods either.
The other possibility is that they are private member variables, but print a.ndim works, so they cannot be private data members.
So, I cannot figure out what is the true identity of the above attributes.
Q2. Where can I find the Python code implementation of ndarray? Since I installed numpy/scipy on my local PC, I guess there must be some way to look at the source code, and then everything might become clear.
Could you give some advice on this?

numpy is implemented as a mix of C code and Python code. The source is available for browsing on github, and can be downloaded as a git repository. But digging your way into the C source takes some work. A lot of the files are marked as .c.src, which means they pass through one or more layers of preprocessing before compiling.
And Python is written in a mix of C and Python as well. So don't try to force things into C++ terms.
It's probably better to draw on your MATLAB experience, with adjustments to allow for Python. And numpy has a number of quirks that go beyond Python. It is using Python syntax, but because it has its own C code, it isn't simply a Python class.
I use IPython as my usual working environment. With that I can use foo? to see the documentation for foo (same as the Python help(foo)), and foo?? to see the code, if it is written in Python (like the MATLAB/Octave type(foo)).
Python objects have attributes, and methods. Also properties which look like attributes, but actually use methods to get/set. Usually you don't need to be aware of the difference between attributes and properties.
x.ndim # as noted, has a get, but no set; see also np.ndim(x)
x.shape # has a get, but can also be set; see also np.shape(x)
x.<tab> in IPython shows me all the completions for an ndarray. There are 4*18. Some are methods, some attributes. x._<tab> shows a bunch more that start with __. These are 'private', i.e. not meant for public consumption, but that's just semantics. You can look at them and use them if needed.
Off hand x.shape is the only ndarray property that I set, and even with that I usually use reshape(...) instead. Read their docs to see the difference. ndim is the number of dimensions, and it doesn't make sense to change that directly. It is len(x.shape); change the shape to change ndim. Likewise x.size shouldn't be something you change directly.
Some of these properties are accessible via functions. np.shape(x) == x.shape, similar to MATLAB size(x). (MATLAB doesn't have . attribute syntax).
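A quick sketch of that behaviour, using a made-up six-element array: assigning to ndim is rejected, while reshape gives an array whose ndim and shape follow from the new shape.

```python
import numpy as np

x = np.arange(6)

# ndim has a getter but no setter, so assignment raises AttributeError.
try:
    x.ndim = 10
    ndim_writable = True
except AttributeError:
    ndim_writable = False

# Change the number of dimensions by changing the shape instead.
y = x.reshape(2, 3)
same_shape = np.shape(y) == y.shape   # function form vs attribute form
ndim_from_shape = len(y.shape)        # ndim is just the length of shape
```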
x.__array_interface__ is a handy property, that gives a dictionary with a number of the attributes
In [391]: x.__array_interface__
Out[391]:
{'descr': [('', '<f8')],
'version': 3,
'shape': (50,),
'typestr': '<f8',
'strides': None,
'data': (165646680, False)}
The docs for ndarray(shape, dtype=float, buffer=None, offset=0, strides=None, order=None), the __new__ method lists these attributes:
Attributes
----------
T : ndarray
    Transpose of the array.
data : buffer
    The array's elements, in memory.
dtype : dtype object
    Describes the format of the elements in the array.
flags : dict
    Dictionary containing information related to memory use, e.g.,
    'C_CONTIGUOUS', 'OWNDATA', 'WRITEABLE', etc.
flat : numpy.flatiter object
    Flattened version of the array as an iterator. The iterator
    allows assignments, e.g., ``x.flat = 3`` (See `ndarray.flat` for
    assignment examples; TODO).
imag : ndarray
    Imaginary part of the array.
real : ndarray
    Real part of the array.
size : int
    Number of elements in the array.
itemsize : int
    The memory use of each array element in bytes.
nbytes : int
    The total number of bytes required to store the array data,
    i.e., ``itemsize * size``.
ndim : int
    The array's number of dimensions.
shape : tuple of ints
    Shape of the array.
strides : tuple of ints
    The step-size required to move from one element to the next in
    memory. For example, a contiguous ``(3, 4)`` array of type
    ``int16`` in C-order has strides ``(8, 2)``. This implies that
    to move from element to element in memory requires jumps of 2 bytes.
    To move from row-to-row, one needs to jump 8 bytes at a time
    (``2 * 4``).
ctypes : ctypes object
    Class containing properties of the array needed for interaction
    with ctypes.
base : ndarray
    If the array is a view into another array, that array is its `base`
    (unless that array is also a view). The `base` array is where the
    array data is actually stored.
All of these should be treated as properties, though I don't think numpy actually uses the property mechanism. In general they should be considered to be 'read-only'. Besides shape, I only recall changing data (pointer to a data buffer), and strides.

Regarding your first question, Python has syntactic sugar for properties, including fine-grained control of getting, setting, deleting them, as well as restricting any of the above.
So, for example, if you have
class Foo(object):
    @property
    def shmip(self):
        return 3
then you can write Foo().shmip to obtain 3, but, if that is the class definition, you've disabled setting Foo().shmip = 4.
In other words, those are read-only properties.
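A runnable sketch of that read-only behaviour, reusing the Foo and shmip names from above (the surrounding variables are just for illustration):

```python
class Foo:
    @property
    def shmip(self):
        # Getter only: no setter is defined, so the property is read-only.
        return 3

foo = Foo()
value = foo.shmip  # attribute-style access, no parentheses

try:
    foo.shmip = 4
    setter_blocked = False
except AttributeError:
    # Python refuses the assignment because the property has no setter.
    setter_blocked = True
```

This mirrors what the question observed with a.ndim = 10, even though numpy implements its attributes in C rather than with a Python-level property.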

Question 1
The list you're mentioning is one that contains attributes for a Numpy array.
For example:
a = np.array([1, 2, 3])
print(type(a))
>><class 'numpy.ndarray'>
Since a is a numpy.ndarray, you're able to use those attributes to find out more about it (e.g. a.size evaluates to 3). To get information about what each one does, visit SciPy's documentation about the attributes.
Question 2
You can start here to familiarize yourself with some of the basic tools of Numpy, as well as the Reference Manual, assuming you're using v1.9. For information specific to the Numpy array you can go to Array Objects.
Their documentation is very extensive and very helpful, with examples provided throughout the website.

Calculate mean of hue angles

I have been struggling with this for some time, despite there being related questions on SO (e.g. this one).
import numpy as np

def circmean(arr):
    arr = np.deg2rad(arr)
    return np.rad2deg(np.arctan2(np.mean(np.sin(arr)), np.mean(np.cos(arr))))
But the results I'm getting don't make sense! I regularly get negative values, e.g.:
test = np.array([323.64,161.29])
circmean(test)
>> -117.53500000000004
I don't know if (a) my function is incorrect, (b) the method I'm using is incorrect, or (c) I just have to do a transformation to the negative values (add 360 degrees?). My research suggests that the problem isn't (a), and I've seen implementations (e.g. here) matching my own, so I'm leaning towards (c), but I really don't know.

Following this question, I've done some research that led me to find the circmean function in the scipy library.
Since you're already using the numpy library, I thought that a proper implementation from the scipy library should suit your needs.
As noted in my answer to the aforementioned question, I haven't found any documentation of that function, but inspecting its source code revealed the proper way it should be invoked:
>>> import numpy as np
>>> from scipy import stats
>>>
>>> test = np.array([323.64,161.29])
>>> stats.circmean(test, high=360)
242.46499999999995
>>>
>>> test = np.array([5, 350])
>>> stats.circmean(test, high=360)
357.49999999999994
This might not be of any use to you, since some time passed since you posted your question and considering you've already implemented the function yourself, but I hope it may benefit future readers who are struggling with the same issue.
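For readers who would rather keep the hand-written function, the negative values appear to be just the same angle expressed in arctan2's (-180, 180] range, so option (c) from the question should be enough. A minimal sketch, assuming angles in degrees; circmean_deg is a hypothetical name:

```python
import numpy as np

def circmean_deg(angles_deg):
    """Circular mean of angles in degrees, wrapped into [0, 360)."""
    rad = np.deg2rad(angles_deg)
    # arctan2 returns an angle in (-pi, pi], hence the negative results.
    mean_angle = np.arctan2(np.mean(np.sin(rad)), np.mean(np.cos(rad)))
    # The modulo maps e.g. -117.535 to 242.465, the same direction.
    return np.rad2deg(mean_angle) % 360.0

hue_mean = circmean_deg(np.array([323.64, 161.29]))
wrap_mean = circmean_deg(np.array([5, 350]))
```

Up to floating-point error, these should agree with the stats.circmean(test, high=360) values shown above.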

Python Shared Memory Array, no attribute get_obj()

I am working on manipulating numpy arrays using the multiprocessing module and am running into an issue trying out some of the code I have run across here. Specifically, I am creating a ctypes array from a numpy array and then trying to return the ctypes array to a numpy array. Here is the code:
shared_arr = multiprocessing.RawArray(_numpy_to_ctypes[array.dtype.type],array.size)
I do not need any kind of synchronization lock, so I am using RawArray. The ctypes data type is pulled from a dictionary based on the dtype of the input array. That is working wonderfully.
shared_arr = numpy.ctypeslib.as_array(shared_arr.get_obj())
Here I get a stack trace stating:
AttributeError: 'c_double_Array_16154769' object has no attribute 'get_obj'
I have also tried the following from this post, but get an identical error.
def tonumpyarray(shared_arr):
return numpy.frombuffer(shared_arr.get_obj())
I am stuck running python 2.6 and do not know if that is the issue, if it is an issue with sharing the variable name (I am trying to keep memory usage as low as possible and am trying not to duplicate the numpy array and the ctypes array in memory), or something else as I am just learning about this component of python.
Suggestions?

Since you use RawArray, it's just a ctypes array allocated from shared memory. There is no wrapper object, so you don't need the get_obj() method to unwrap it:
>>> import multiprocessing
>>> import numpy as np
>>> shared_arr = multiprocessing.RawArray("d",10)
>>> t = np.frombuffer(shared_arr, dtype=float)
>>> t[0] = 2
>>> shared_arr[0]
2.0
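For contrast, here is a sketch of both variants side by side. It assumes the default lock=True for multiprocessing.Array, in which case the ctypes array is wrapped in a synchronized object and get_obj() is what unwraps it; the sizes and values are arbitrary.

```python
import multiprocessing
import numpy as np

# RawArray: a bare ctypes array in shared memory, usable directly as a buffer.
raw = multiprocessing.RawArray("d", 4)
raw_view = np.frombuffer(raw, dtype=np.float64)

# Array (with a lock): a synchronized wrapper, so unwrap it with get_obj().
locked = multiprocessing.Array("d", 4)
locked_view = np.frombuffer(locked.get_obj(), dtype=np.float64)

# Both numpy views share memory with the underlying ctypes arrays.
raw_view[0] = 2.0
locked_view[1] = 3.0
```

Neither view copies the data, which matters if the goal is to avoid holding the numpy array and the ctypes array in memory twice.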
