What is the use case of a numpy array of a scalar value? - python

In the latest scipy version, I found:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix((3, 4), dtype=np.int8)
>>> a[0,0]
array(0) #instead of `0`
and you can create a numpy array of a scalar value (instead of a vector/matrix), np.array(0), which is different from np.array([0]). What is the use case of np.array(0)? How do I get the value out of np.array(0) (without a type conversion like int())?

You've created a sparse matrix, shape (3,4), but with no stored elements:
In [220]: a = sparse.csr_matrix((3, 4), dtype=np.int8)
In [221]: a
Out[221]:
<3x4 sparse matrix of type '<class 'numpy.int8'>'
with 0 stored elements in Compressed Sparse Row format>
In [222]: a.toarray()
Out[222]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)
Selecting one element:
In [223]: a[0,0]
Out[223]: array(0, dtype=int8)
Converting it to a dense np.matrix:
In [224]: a.todense()
Out[224]:
matrix([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=int8)
In [225]: a.todense()[0,0]
Out[225]: 0
and to other sparse formats:
In [226]: a.tolil()[0,0]
Out[226]: 0
In [227]: a.todok()[0,0]
Out[227]: 0
It looks like csr is somewhat unique in returning a scalar array like this. I'm not sure if it's intentional, a feature, or a bug. I haven't noticed it before. Usually we work with the whole matrix, rather than specific elements.
But a 0d array is allowed, even if in most cases it isn't useful. If we can have 2d or 1d arrays, why not 0d?
There are a couple of ways of extracting that element from a 0d array:
In [233]: np.array(0, 'int8')
Out[233]: array(0, dtype=int8)
In [234]: _.shape
Out[234]: ()
In [235]: __.item()
Out[235]: 0
In [236]: ___[()] # index with an empty tuple
Out[236]: 0
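For reference, the same extractions as one self-contained snippet (the In [233]-In [236] session above relies on IPython's _, __, ___ output history, which can be confusing out of context):
import numpy as np

x = np.array(0, dtype=np.int8)   # a 0d array, shape ()
print(x.shape)    # ()
print(x.item())   # 0, as a plain Python scalar
print(x[()])      # 0, indexing with an empty tuple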
The scipy 1.3.0 release notes include:
CSR and CSC sparse matrix fancy indexing performance has been improved substantially
https://github.com/scipy/scipy/pull/7827 - looks like this pull request was a long time in coming, and had a lot of faults (and may still). If this behavior is a change from previous scipy releases, we need to see if there's a related issue (and possibly create one).
https://github.com/scipy/scipy/pull/10207 BUG: Compressed matrix indexing should return a scalar
Looks like it will be fixed in 1.4.

What are they?
They're 0d arrays: arrays that hold a single element, but with shape () rather than (1,).
How do I get the value out of it?
By using:
>>> np.array(0).item()
0
>>>

Related

What is the most efficient way to convert from a list of values to a scipy sparse matrix?

I have a list of values that I'm converting to a scipy.sparse.dok_matrix with a loop. I'm aware of numpy.bincount, but it doesn't work with sparse matrices. I'm wondering if there is a more efficient way to perform this conversion, because the construction time for a dok_matrix is really long.
Example below for one row but I'm scaling to a 2D matrix by looping. The number of times a value x appears in the input list is the value of the xth element of the result matrix.
values = [1, 3, 3, 4]
expected_result = [0, 1, 0, 2, 1]
matrix = dok_matrix((1, MAXIMUM_EXPECTED_VALUE))
for value in values:
    matrix[0, value] = matrix.get((0, value)) + 1
MAXIMUM_EXPECTED_VALUE is in the order of 100000000 but len(values) < 100, which is why I'm using a sparse matrix. Possibly off-topic: there are also only a little over 10000 actual values that are used in the range of MAXIMUM_EXPECTED_VALUE but I think hashing to a contiguous range and converting back might be more complicated.
Looks like the standard coo style inputs suit your case:
In [142]: import numpy as np
In [143]: from scipy import sparse
In [144]: values = [1,3,3,4]
In [145]: col = np.array(values)
In [146]: row = np.zeros_like(col)
In [147]: data = np.ones_like(col)
In [148]: M = sparse.coo_matrix((data, (row,col)), shape=(1,10))
In [149]: M
Out[149]:
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
In [150]: M.A
Out[150]: array([[0, 1, 0, 2, 1, 0, 0, 0, 0, 0]])
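To scale this to the 2d case from the question, build one set of (row, col, data) triples per input list. This is a sketch assuming the input is a list of per-row value lists (names like rows_of_values, and the small MAXIMUM_EXPECTED_VALUE, are made up for illustration). Note that coo_matrix sums duplicate (row, col) entries on conversion, which does the counting for you:
import numpy as np
from scipy import sparse

rows_of_values = [[1, 3, 3, 4], [0, 2, 2, 2]]   # one list of values per row
MAXIMUM_EXPECTED_VALUE = 10                     # stand-in for the real bound

row = np.concatenate([np.full(len(v), i) for i, v in enumerate(rows_of_values)])
col = np.concatenate([np.asarray(v) for v in rows_of_values])
data = np.ones_like(col)

# duplicate (row, col) pairs are summed when converting, producing the counts
M = sparse.coo_matrix((data, (row, col)),
                      shape=(len(rows_of_values), MAXIMUM_EXPECTED_VALUE)).tocsr()
print(M.toarray())
# [[0 1 0 2 1 0 0 0 0 0]
#  [1 0 3 0 0 0 0 0 0 0]]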

Pytorch: accessing a subtensor using lists of indices

I have a pair of tensors S and T of dimensions (s1,...,sm) and (t1,...,tn) with si < ti. I want to specify a list of indices in each dimension of T to "embed" S in T. If I1 is a list of s1 indices in (0,1,...,t1) and likewise for I2 up to In, I would like to do something like
T.select(I1,...,In)=S
that will have the effect that now T has entries equal to the entries of S over the indices (I1,...,In).
For example:
S =
[[1,1],
 [1,1]]
T =
[[0,0,0],
 [0,0,0],
 [0,0,0]]
T.select([0,2],[0,2]) = S
T =
[[1,0,1],
 [0,0,0],
 [1,0,1]]
If you're flexible about using NumPy just for the indices part, here's one approach: construct an open mesh using numpy.ix_() and use this mesh to fill in the values from the tensor S. If that is not acceptable, you can use torch.meshgrid() instead.
Below is an illustration of both approaches with descriptions interspersed in comments.
# input tensors to work with
In [174]: T
Out[174]:
tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])
# I'm using a tensor with unique values just for clarity; any tensor should work.
In [175]: S
Out[175]:
tensor([[10, 11],
        [12, 13]])
# indices where we want the values from `S` to be filled in, along both dimensions
In [176]: idxs = [[0,2], [0,2]]
Now we will leverage np.ix_() or torch.meshgrid() to generate a mesh from the indices (np.ix_() produces an open mesh, torch.meshgrid() a dense one; both work for this assignment):
# mesh using `np.ix_`
In [177]: mesh = np.ix_(*idxs)
# as an alternative, we can use `torch.meshgrid()`
In [191]: mesh = torch.meshgrid([torch.tensor(lst) for lst in idxs])
# fill in the values from tensor `S` using (advanced) integer-array indexing
In [178]: T[mesh] = S
# sanity check!
In [179]: T
Out[179]:
tensor([[10,  0, 11],
        [ 0,  0,  0],
        [12,  0, 13]])
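For completeness, here's the torch.meshgrid() route as one self-contained script. The indexing='ij' argument is an assumption about your PyTorch version (newer releases warn if it is omitted; older ones don't accept it, in which case drop it):
import torch

S = torch.tensor([[10, 11],
                  [12, 13]])
T = torch.zeros(3, 3, dtype=torch.long)
idxs = [[0, 2], [0, 2]]

# build the mesh, then assign via advanced indexing
mesh = torch.meshgrid(*[torch.tensor(i) for i in idxs], indexing='ij')
T[mesh] = S
print(T)
# tensor([[10,  0, 11],
#         [ 0,  0,  0],
#         [12,  0, 13]])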

Why does .nnz indicate more non-zero elements being present in sparse matrices?

In SciPy, I have a CSR matrix. I create a LiL matrix by choosing certain rows of each column from this matrix, then convert the resulting matrix into a CSR matrix.
I have a special case, where all rows of each column are selected. While doing this, I notice a slight perturbation of the values. Here is an MWE. It can be tried with the CSR matrix given here.
from scipy import sparse
import numpy as np
ip_mat = sparse.load_npz('a_di_mat.npz')
# this is not useful for me in practice
# this is being done for comparison
lil_mat_direct = ip_mat.tolil()
csr_mat_direct = lil_mat_direct.tocsr()
# I need to copy column-by-column.
# this is an MWE representing the special case where the entire column is copied
lil_mat_steps = sparse.lil_matrix((ip_mat.shape), dtype=np.float64)
for i_col in range(ip_mat.shape[1]):
    lil_mat_steps[:, i_col] = ip_mat[:, i_col]
csr_mat_steps = lil_mat_steps.tocsr()
diff_mat = csr_mat_direct - csr_mat_steps
print('nnz: direct copy: {} columnwise copy: {} diff: {}'.format(
    csr_mat_direct.nnz, csr_mat_steps.nnz, diff_mat.nnz))
# a colleague suggested the following
ind_x, ind_y = ip_mat.nonzero()
print('ip_mat: nonzero indices {} nnz {}'.format(len(ind_x), ip_mat.nnz))
In the first print statement, one expects:
nnz: direct copy: 2886100 columnwise copy: 2886100 diff: 0
However, one obtains:
nnz: direct copy: 2886100 columnwise copy: 2879757 diff: 0
The difference matrix being all zeros shows that the matrices are very close, if not exactly the same. How to explain the reduction in the number of non-zeros? It suggests that non-zero values are getting perturbed: values which are very close to zero in the original matrix become zero in the output matrix. I am afraid this perturbation happens to all the non-zero elements and could affect the more general case, where only a subset of the rows is selected for each column.
For the second print statement, one obtains:
ip_mat: nonzero indices 2879757 nnz 2886100
So, could the error be in the way .nnz is implemented? Did the column-wise copying remove some values that are meant to be zeros?
I haven't tried to follow your manipulations, but can suggest a couple of sources of differences. Some operations (on a csr) just set data elements to 0 without removing them from the sparsity structure. A sparse matrix stores "non-zero" elements in several arrays. nnz essentially reports the size of these arrays.
Look at the code for nonzero
A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask], A.col[nz_mask])
it does an extra data != 0 test, so its count can be smaller than nnz.
csr also has an eliminate_zeros method to clean up the sparsity in place. This operation is sufficiently expensive that csr does not perform it every time you do something to the matrix.
So yes, it is possible to have an nnz that's larger than the nonzero count. For a csr created from scratch (i.e. from a dense array, or from coo style inputs), nnz should match the nonzero count. But ip_mat was loaded from an npz file, with csr attribute arrays. If the saved csr was not "clean", the loaded one won't be either.
Illustration
In [103]: from scipy import sparse
In [104]: M = sparse.csr_matrix([[1,0,0,2],[0,3,4,0]])
In [105]: M
Out[105]:
<2x4 sparse matrix of type '<class 'numpy.longlong'>'
with 4 stored elements in Compressed Sparse Row format>
In [106]: M.A
Out[106]:
array([[1, 0, 0, 2],
       [0, 3, 4, 0]], dtype=int64)
In [107]: M.data
Out[107]: array([1, 2, 3, 4], dtype=int64)
Modifying elements:
In [108]: M[0,1] = 12
/usr/local/lib/python3.6/dist-packages/scipy/sparse/_index.py:84: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_intXint(row, col, x.flat[0])
In [109]: M.data
Out[109]: array([ 1, 12, 2, 3, 4], dtype=int64)
In [110]: M
Out[110]:
<2x4 sparse matrix of type '<class 'numpy.longlong'>'
with 5 stored elements in Compressed Sparse Row format>
In [111]: M.nnz
Out[111]: 5
In [112]: M[0,1] = 0 # data is set to 0, but indices do not change
In [113]: M
Out[113]:
<2x4 sparse matrix of type '<class 'numpy.longlong'>'
with 5 stored elements in Compressed Sparse Row format>
In [114]: M.data
Out[114]: array([1, 0, 2, 3, 4], dtype=int64)
cleaning up:
In [115]: M.eliminate_zeros()
In [116]: M
Out[116]:
<2x4 sparse matrix of type '<class 'numpy.longlong'>'
with 4 stored elements in Compressed Sparse Row format>
Making a csr matrix from csr style inputs (with a 0 in the data):
In [120]: M1 = sparse.csr_matrix(([1,0,2,3,4],[0,1,3,1,2],[0,3,5]))
In [121]: M1
Out[121]:
<2x4 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
In [122]: M1.data
Out[122]: array([1, 0, 2, 3, 4])
In [123]: M1.nonzero()
Out[123]: (array([0, 0, 1, 1], dtype=int32), array([0, 3, 1, 2], dtype=int32))
In [124]: M1.eliminate_zeros()
In [125]: M1
Out[125]:
<2x4 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>
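So if that diagnosis applies to your matrix, calling ip_mat.eliminate_zeros() right after loading should make nnz agree with nonzero(). A minimal sketch mirroring the illustration above:
from scipy import sparse

# same csr-style inputs as In [120], with an explicit 0 stored in data
M1 = sparse.csr_matrix(([1, 0, 2, 3, 4], [0, 1, 3, 1, 2], [0, 3, 5]))
print(M1.nnz, len(M1.nonzero()[0]))   # 5 4  - counts disagree
M1.eliminate_zeros()
print(M1.nnz, len(M1.nonzero()[0]))   # 4 4  - counts agree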

storing numpy object array of equal-size ndarrays to a .mat file using scipy.io.savemat

I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().
This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.
When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:
[array([[array([[0.82374894]]), array([[0.50730055]]),
         array([[0.36721625]]), array([[0.45036349]])],
        [array([[0.26119276]]), array([[0.16843872]]),
         array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]
This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.
I'm not quite clear about the problem. Let me try to recreate your case:
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)
In [61]: A[0,0]=np.arange(4).reshape(2,2)
In [62]: A[1,0]=np.arange(6).reshape(3,2)
In [63]: A
Out[63]:
array([[array([[0, 1],
               [2, 3]])],
       [array([[0, 1],
               [2, 3],
               [4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]
In [65]: B
Out[65]:
array([[array([[0, 1],
               [2, 3]])],
       [array([[0, 1],
               [2, 3]])]], dtype=object)
As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
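For context, a quick sketch of that special handling (preallocate an object array, then assign) versus the naive np.array call:
import numpy as np

a1 = np.arange(4).reshape(2, 2)
a2 = np.arange(4).reshape(2, 2)

# naive: np.array stacks same-shaped arrays into one 3d numeric array
print(np.array([a1, a2]).shape)   # (2, 2, 2)

# special handling: preallocate the object array, then fill it
A = np.empty((2, 1), object)
A[0, 0] = a1
A[1, 0] = a2
print(A.shape, A.dtype)           # (2, 1) object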
Saving:
In [66]: savemat('foo.mat', {'A':A, 'B':B})
Loading:
In [74]: loadmat('foo.mat')
Out[74]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
 '__version__': '1.0',
 '__globals__': [],
 'A': array([[array([[0, 1],
                     [2, 3]])],
             [array([[0, 1],
                     [2, 3],
                     [4, 5]])]], dtype=object),
 'B': array([[array([[0, 1],
                     [2, 3]])],
             [array([[0, 1],
                     [2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]
Out[75]:
array([[0, 1],
       [2, 3],
       [4, 5]])
Your problem case looks like it's an object dtype array containing numbers:
In [89]: C = np.arange(4).reshape(2,2).astype(object)
In [90]: C
Out[90]:
array([[0, 1],
       [2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})
In [92]: loadmat('foo1.mat')
Out[92]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
 '__version__': '1.0',
 '__globals__': [],
 'C': array([[array([[0]]), array([[1]])],
             [array([[2]]), array([[3]])]], dtype=object)}
Evidently savemat has converted the integer objects into 2d MATLAB compatible arrays. In MATLAB everything, even scalars, is at least 2d.
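If that is what happened to your data, one likely fix is to make sure the inner arrays have a numeric dtype, not object, before saving. A sketch under that assumption (the diagnosis is the guess here, not savemat's behavior):
import numpy as np
from scipy.io import savemat, loadmat

A = np.empty((2, 1), object)
A[0, 0] = np.random.rand(2, 4)                  # float64 inner array
A[1, 0] = np.random.rand(2, 4).astype(object)   # object dtype inner array

savemat('foo2.mat', {'A': A})
back = loadmat('foo2.mat')['A']
print(back[0, 0].dtype)   # float64 - loads back as a plain 2d array
print(back[1, 0].dtype)   # object  - loads back as nested (1,1) arrays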
===
And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:
>> load foo.mat
>> A
A =
{
  [1,1] =
     0   1
     2   3
  [2,1] =
     0   1
     2   3
     4   5
}
>> B
B =
{
  [1,1] =
     0   1
     2   3
  [2,1] =
     0   1
     2   3
}
>> load foo1.mat
>> C
C =
{
  [1,1] = 0
  [2,1] = 2
  [1,2] = 1
  [2,2] = 3
}
Python: Issue reading in str from MATLAB .mat file using h5py and NumPy
is a relatively recent SO question that showed there's a difference between the Octave HDF5 format and MATLAB's.

Pandas series to array conversion is getting me arrays of array objects

I have a Pandas series; here are the first two rows:
X.head(2)
Each row holds a 1D array; the column header is mels_flatten:
mels_flatten
0 [0.0171469795289, 0.0173154008662, 0.395695541...
1 [0.0471267533454, 0.0061760868171, 0.005647608...
I want to store the values in a single array to feed to a classifier model.
np.vstack(X.values)
or
np.array(X.values)
both return the following:
array([[array([ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
                2.35955651e-04, 8.64118460e-04, 7.74663408e-04])],
       [array([ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
                0.00205229, 0.00043118])]], dtype=object)
I am not sure how to process array of array objects.
My expected result is:
array([[ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
         2.35955651e-04, 8.64118460e-04, 7.74663408e-04],
       [ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
         0.00205229, 0.00043118]])
I have tried np.concatenate and np.resize, as some other posts suggested, with no luck.
I find it likely that not all of your 1d arrays are the same length, i.e. your series is not compatible with a rectangular 2d array.
Consider the following dummy example:
import pandas as pd
import numpy as np
X = pd.Series([np.array([1,2,3]),np.array([4,5,6])])
# 0 [1, 2, 3]
# 1 [4, 5, 6]
# dtype: object
np.vstack(X.values)
# array([[1, 2, 3],
#        [4, 5, 6]])
As the above demonstrates, a collection of 1d arrays (or lists) of the same size will be nicely stacked into a 2d array. Check the size of your arrays, and you'll probably find that there are some discrepancies:
>>> X.apply(len)
0 3
1 3
dtype: int64
If X.apply(len).unique() returns an array with more than one element, you'll see the proof of the problem. In the above rectangular case:
>>> X.apply(len).unique()
array([3])
In a non-conforming example:
>>> Y = pd.Series([np.array([1,2,3]),np.array([4,5])])
>>> np.array(Y.values)
array([array([1, 2, 3]), array([4, 5])], dtype=object)
>>> Y.apply(len).unique()
array([3, 2])
As you can see, the nested array result stems from the non-uniform lengths of the items inside the original series.
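If the lengths do differ and zero-padding to a rectangle is acceptable for your classifier (an assumption about your data, not the only option), one possible fix is:
import numpy as np
import pandas as pd

Y = pd.Series([np.array([1., 2., 3.]), np.array([4., 5.])])

# pad every array on the right with zeros up to the longest length
n = Y.apply(len).max()
padded = np.vstack([np.pad(a, (0, n - len(a))) for a in Y])
print(padded)
# [[1. 2. 3.]
#  [4. 5. 0.]]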
