I have a pandas Series features with the following values (features.values):
array([array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
array([0, 0, 0, ..., 0, 0, 0]), ...,
array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),
array([0, 0, 0, ..., 0, 0, 0])], dtype=object)
Now I really want this to be recognized as a matrix, but when I do
>>> features.values.shape
(10000,)
I get (10000,) rather than (10000, 3000), which is what I would expect.
How can I get this to be recognized as a 2d array rather than a 1d array with arrays as values? Also, why doesn't it automatically get detected as a 2d array?
In response to your comment question, let's compare two ways of creating an array.
First make an array from a list of arrays (all same length):
In [302]: arr = np.array([np.arange(3), np.arange(1,4), np.arange(10,13)])
In [303]: arr
Out[303]:
array([[ 0, 1, 2],
[ 1, 2, 3],
[10, 11, 12]])
The result is a 2d array of numbers.
If instead we make an object dtype array, and fill it with arrays:
In [304]: arr = np.empty(3,object)
In [305]: arr[:] = [np.arange(3), np.arange(1,4), np.arange(10,13)]
In [306]: arr
Out[306]:
array([array([0, 1, 2]), array([1, 2, 3]), array([10, 11, 12])],
dtype=object)
Notice that this display is like yours. This is, by design, a 1d array. Like a list, it contains pointers to arrays elsewhere in memory. Notice that it requires an extra construction step. The default behavior of np.array is to create a multidimensional array where it can.
It takes extra effort to get around that. Likewise it takes some extra effort to undo it - to recover the 2d numeric array.
Simply calling np.array on it does not change the structure.
In [307]: np.array(arr)
Out[307]:
array([array([0, 1, 2]), array([1, 2, 3]), array([10, 11, 12])],
dtype=object)
stack does change it to 2d. stack treats it as a list of arrays, which it joins on a new axis.
In [308]: np.stack(arr)
Out[308]:
array([[ 0, 1, 2],
[ 1, 2, 3],
[10, 11, 12]])
Shortening @hpauli's answer:
your_2d_array = np.stack(arr_of_arr_object)
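Applied to the original question, that is (a sketch, assuming every element of features has the same length, here 3000):

import numpy as np

X = np.stack(features.values)   # features is the Series from the question
X.shape                         # (10000, 3000)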
I'm working with a huge matrix of type csr from scipy.sparse, and I want to add a value at an empty spot that was never used or assigned to when the matrix was created, without converting it to a dense matrix. I have the desired row, col and data values; I just don't know how to access that specific element and add the value to it.
Update:
I tried the following, but I get a weird "kernel error" and it doesn't work. Assume we are at row k:
data = np.insert(data, index, 5)        # insert the new value into row k's data
col = np.insert(col, index, colIndex)   # insert the matching column index
row[k+1:] += 1                          # shift the row pointers after row k
I can't understand what I did wrong.
A sample csr matrix:
In [66]: M = (sparse.random(5,5,.2,'csr')*10).astype(int)
In [67]: M
Out[67]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
In [69]: M.A
Out[69]:
array([[0, 0, 4, 0, 0],
[2, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[3, 6, 0, 0, 0],
[0, 0, 0, 0, 0]])
Look at its attributes:
In [70]: M.indptr
Out[70]: array([0, 1, 2, 3, 5, 5], dtype=int32)
In [71]: M.indices
Out[71]: array([2, 0, 2, 0, 1], dtype=int32)
In [72]: M.data
Out[72]: array([4, 2, 0, 3, 6])
Now use indexing to create a new nonzero value:
In [73]: M[2,3] = 12
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:82: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_intXint(row, col, x.flat[0])
Look at the new attributes - all three have changed.
In [74]: M.indptr
Out[74]: array([0, 1, 2, 4, 6, 6], dtype=int32)
In [75]: M.indices
Out[75]: array([2, 0, 2, 3, 0, 1], dtype=int32)
In [76]: M.data
Out[76]: array([ 4, 2, 0, 12, 3, 6])
Values in indptr have changed, and values have been added to indices and data. While I have a general idea of what is happening, I have not worked out the details.
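For reference, here is a minimal sketch of performing such an insertion by hand on the raw csr attributes, consistent with the changes above (variable names are illustrative; scipy's own code handles more cases):

import numpy as np
from scipy import sparse

M = (sparse.random(5, 5, 0.2, 'csr') * 10).astype(int)

k, j, val = 2, 3, 12            # row, column, new value
index = M.indptr[k + 1]         # insert at the end of row k's slice

data = np.insert(M.data, index, val)        # grow data
indices = np.insert(M.indices, index, j)    # grow column indices
indptr = M.indptr.copy()
indptr[k + 1:] += 1                         # later row boundaries shift by one

M2 = sparse.csr_matrix((data, indices, indptr), shape=M.shape)
M2.sort_indices()               # restore sorted column order within each row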
As the warning says, it is easier to visualize, and to perform, such an addition on a lil format matrix:
In [88]: Ml = M.tolil()
In [89]: Ml.rows
Out[89]:
array([list([2]), list([0, 1, 3, 4]), list([1, 2, 4]), list([0, 1, 3]),
list([])], dtype=object)
In [90]: Ml.data
Out[90]:
array([list([5]), list([3, 0, 5, 1]), list([12, 1, 8]), list([7, 4, 8]),
list([])], dtype=object)
In [91]: Ml.rows[2].append(3)
In [92]: Ml.data[2].append(23)
In [93]: Ml.A
Out[93]:
array([[ 0, 0, 5, 0, 0],
[ 3, 0, 0, 5, 1],
[ 0, 12, 1, 23, 8],
[ 7, 4, 0, 8, 0],
[ 0, 0, 0, 0, 0]])
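When the edits are done, convert back for efficient arithmetic (lil is good for editing, csr for computation):

M2 = Ml.tocsr()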
I have two arrays, a and b.
a has shape (1, 2, 3, 4)
b has shape (4, 3, 2, 1)
I would like to make them both (4, 3, 3, 4) with the new positions filled with 0's.
I can do:
new_shape = (4, 3, 3, 4)
a = np.resize(a, new_shape)
b = np.resize(b, new_shape)
...but this repeats the elements of each array to fill the new shape, which does not work for me.
Instead I thought I could do:
a = a.resize(new_shape)
b = b.resize(new_shape)
...which according to the documentation pads with 0s.
But it doesn't work for multi-dimensional arrays, raising error:
ValueError: resize only works on single-segment arrays
So is there a different way to achieve this, i.e. the same as np.resize but with 0-padding?
NB: I am only looking for pure-numpy solutions.
EDIT: I'm using numpy version 1.20.2
EDIT: I just found out that it works for numbers but not for objects; I forgot to mention that it is an array of objects, not numbers.
The resize method pads with 0s (in a flattened sense); the np.resize function pads with repeats.
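For comparison, a quick demo of the function's repeat behavior:

import numpy as np

np.resize(np.arange(4), (2, 4))
# array([[0, 1, 2, 3],
#        [0, 1, 2, 3]])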
To illustrate how the resize method "flattens" before padding:
In [108]: a = np.arange(12).reshape(1,4,3)
In [109]: a
Out[109]:
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]]])
In [110]: a1 = a.copy()
In [111]: a1.resize((2,4,4))
In [112]: a1
Out[112]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 0, 0, 0]],
[[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0]]])
If instead I make a target array of the right shape, and copy, I can maintain the original multidimensional block:
In [114]: res = np.zeros((2,4,4),a.dtype)
In [115]: res[:a.shape[0],:a.shape[1],:a.shape[2]]=a
In [116]: res
Out[116]:
array([[[ 0, 1, 2, 0],
[ 3, 4, 5, 0],
[ 6, 7, 8, 0],
[ 9, 10, 11, 0]],
[[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0],
[ 0, 0, 0, 0]]])
I wrote out the slices explicitly for clarity. Such a tuple of slices could be created programmatically if needed.
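For example, the index tuple could be built from the source array's shape (a small sketch):

import numpy as np

a = np.arange(12).reshape(1, 4, 3)
res = np.zeros((2, 4, 4), a.dtype)
idx = tuple(slice(0, n) for n in a.shape)   # (slice(0, 1), slice(0, 4), slice(0, 3))
res[idx] = a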
I have a 2d ndarray of ndarrays, which looks like:
array([[array([0]), array([0, 1]), array([0, 1]), None, None, array([0, 1])],
       [array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1]), None],
       [None, None, array([0, 1]), None, None, None],
       [array([0, 1]), array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1])],
       [array([0, 1]), None, None, None, array([0, 1]), None],
       [array([0, 1]), None, array([0, 1]), array([0, 1]), array([0, 1]), None]], dtype=object)
My aim is to get the indices of the elements, sorted by len(element), skipping the elements which are None. Like:
array([[0,0],  --> len=1
       [0,1],  --> len=2
       [0,2],  --> len=2
       [0,5],  ...
       [1,0],
       [1,2],
       [1,3],
       [1,4],
       ... ])
I have tried first converting each element to its len, which gives something like:
array([[1, 2, 2, 0, 0, 2],
[2, 0, 2, 2, 2, 0],
[0, 0, 2, 0, 0, 0],
[2, 2, 0, 2, 2, 2],
[2, 0, 0, 0, 2, 0],
[2, 0, 2, 2, 2, 0]], dtype=object)
However, I couldn't find an effective way to generate the list of indices (an ndarray would also do).
Please help me with this. I'd appreciate anyone who could solve this problem or give me a clue.
Edit:
I have found a close but not perfect solution:
Due to the data constraints, the elements of "lenArray" can only have 3 kinds of values: 1, 2, inf.
So I can take advantage of this to do:
ones = np.column_stack(np.where(lenArray==1))
twos = np.column_stack(np.where(lenArray==2))
infs = np.column_stack(np.where(lenArray==inf))
sort_lenidx = np.concatenate((ones,twos,infs))
Where sort_lenidx will match my needs.
However, this is neither very general (if there are many possible values, this becomes useless) nor elegant. I still hope someone can give me a better way to do it.
I would appreciate your help in any form.
Let's call the array containing lengths lenArray.
The right way of doing this would be to create another 2d array, rowNcol, that contains the row and column indices of lenArray as elements, then implement a sorting algorithm on lenArray and perform identical operations on rowNcol to finally obtain the desired array of indices.
That being said, you could exploit the fact that we know beforehand the type (int) and range of elements inside lenArray, and simply iterate through the possible elements in the following manner:
from numpy import array, amax

lenArray = array([[1, 2, 2, 0, 0, 2],
                  [2, 0, 2, 2, 2, 0],
                  [0, 0, 2, 0, 0, 0],
                  [2, 2, 0, 2, 2, 2],
                  [2, 0, 0, 0, 2, 0],
                  [2, 0, 2, 2, 2, 0]])

rows, cols = lenArray.shape
lenMax = amax(lenArray)

# walk the lengths in increasing order; start at 1 to skip the None entries (length 0)
for lenVal in range(1, lenMax + 1):
    for i in range(rows):
        for j in range(cols):
            if lenArray[i, j] == lenVal:
                print(str(i) + ',' + str(j))
This is, however, extremely inefficient if lenArray is very large, since you are parsing it over and over.
Edit: I later came across numpy.argsort which appears to do exactly what you want.
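A sketch of that argsort route, using the integer lenArray from the code above (it assumes None entries were mapped to length 0 and should be skipped):

import numpy as np

flat = lenArray.ravel()
order = np.argsort(flat, kind='stable')   # stable sort keeps row-major order among equal lengths
order = order[flat[order] > 0]            # drop the None entries (length 0)
sort_lenidx = np.column_stack(np.unravel_index(order, lenArray.shape))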
I have two numpy arrays of arrays (A and B). They look something like this when printed:
A:
[array([0, 0, 0]) array([0, 0, 0]) array([1, 0, 0]) array([0, 0, 0])
array([0, 0, 0]) array([0, 0, 0]) array([0, 0, 0]) array([0, 0, 0])
array([0, 0, 0]) array([0, 0, 0]) array([0, 0, 1]) array([0, 0, 0])
array([1, 0, 0]) array([0, 0, 1]) array([0, 0, 0]) array([0, 0, 0])
array([0, 0, 0]) array([1, 0, 0]) array([0, 0, 1]) array([0, 0, 0])]
B:
[[ 4.302135e-01 4.320091e-01 4.302135e-01 4.302135e-01
1.172584e+08]
[ 4.097128e-01 4.097128e-01 4.077675e-01 4.077675e-01
4.397120e+07]
[ 3.796353e-01 3.796353e-01 3.778396e-01 3.778396e-01
2.643200e+07]
[ 3.871173e-01 3.890626e-01 3.871173e-01 3.871173e-01
2.161040e+07]
[ 3.984899e-01 4.002856e-01 3.984899e-01 3.984899e-01
1.836240e+07]
[ 4.227315e-01 4.246768e-01 4.227315e-01 4.227315e-01
1.215760e+07]
[ 4.433817e-01 4.451774e-01 4.433817e-01 4.433817e-01
9.340800e+06]
[ 4.620867e-01 4.638823e-01 4.620867e-01 4.620867e-01
1.173760e+07]]
type(A), type(A[0]), type(B), type(B[0]) are all <class 'numpy.ndarray'>.
However, A.shape is (20,), while B.shape is (8, 5).
Question 1: Why is A.shape one-dimensional, and how do I make it two-dimensional like B.shape? They're both arrays of arrays, right?
Question 2, possibly related to Q1: Why does printing A show the calls of array(), while printing B doesn't, and why do the elements of the subarrays of B not have commas in-between them?
Thanks in advance.
A.dtype is object ('O'); B.dtype is float.
A is a 1d array that contains objects, which happen to be arrays. They could just as well be lists or None.
B is a 2d array of floats. Indexing one row of B gives a 1d array.
So A[0] and B[0] can appear to produce the same thing, but the selection process is different.
Try np.concatenate(A) or np.vstack(A). Both treat A as a list of arrays, joining them into a 1d or 2d result.
Converting object arrays to regular ones comes up quite often.
Converting a 3D List to a 3D NumPy array
is a little more general than what you need, but gives a lot of useful information.
also
Convert a numpy array of lists to a numpy array
==================
In [28]: A=np.empty((5,),object)
In [31]: A
Out[31]: array([None, None, None, None, None], dtype=object)
In [32]: for i in range(5):A[i]=np.zeros((3,),int)
In [33]: A
Out[33]:
array([array([0, 0, 0]), array([0, 0, 0]), array([0, 0, 0]),
array([0, 0, 0]), array([0, 0, 0])], dtype=object)
In [34]: print(A)
[array([0, 0, 0]) array([0, 0, 0]) array([0, 0, 0]) array([0, 0, 0])
array([0, 0, 0])]
In [35]: np.vstack(A)
Out[35]:
array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
Edit: np.stack(A) can join the arrays on a new leading axis.
If the subarrays differ in shape, these 'stack' functions will raise an error. It's up to you to find the problem array(s).
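If stack does raise, a quick way to locate the offending elements (an illustrative snippet):

shapes = [getattr(x, 'shape', None) for x in A]            # None for non-array elements
bad = [i for i, s in enumerate(shapes) if s != shapes[0]]  # positions that differ from the first
print(bad)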
Take a look at this piece of code:
import numpy as np
a = np.random.random(10)
indicies = [
np.array([1, 4, 3]),
np.array([2, 5, 8, 7, 3]),
np.array([1, 2]),
np.array([3, 2, 1])
]
result = np.zeros(2)
result[0] = a[indicies[0]].sum()
result[1] = a[indicies[2]].sum()
Is there any way to get result more efficiently? In my case a is a very large array.
In other words I want to select elements from a with several varying size index arrays and then sum over them in one operation, resulting in a single array.
With your a and indicies list:
In [280]: [a[i].sum() for i in indicies]
Out[280]:
[1.3986792680307709,
2.6354365193743732,
0.83324677494990895,
1.8195179021311731]
Which of course could be wrapped in np.array().
For a subset of the indicies items use:
In [281]: [a[indicies[i]].sum() for i in [0,2]]
Out[281]: [1.3986792680307709, 0.83324677494990895]
A comment suggests indicies comes from an adjacency matrix, possibly sparse.
I could recreate such an array with:
In [289]: A=np.zeros((4,10),int)
In [290]: for i in range(4): A[i,indicies[i]]=1
In [291]: A
Out[291]:
array([[0, 1, 0, 1, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
and use a matrix product (np.dot) to do the selection and sum:
In [292]: A.dot(a)
Out[292]: array([ 1.39867927, 2.63543652, 0.83324677, 1.8195179 ])
A[[0,2],:].dot(a) would use a subset of rows.
A sparse matrix version has that list of row indices:
In [294]: Al=sparse.lil_matrix(A)
In [295]: Al.rows
Out[295]: array([[1, 3, 4], [2, 3, 5, 7, 8], [1, 2], [1, 2, 3]], dtype=object)
And a matrix product with that gives the same numbers:
In [296]: Al*a
Out[296]: array([ 1.39867927, 2.63543652, 0.83324677, 1.8195179 ])
If your array a is very large, you might run into memory issues when looping through a list of many large index arrays.
To avoid this, use an iterator instead of a list:
indices = iter(indices)
and then loop through your iterator.
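A minimal sketch of that suggestion, reusing a and indicies from the question (the per-array summing itself is unchanged):

indices_iter = iter(indicies)   # keeps the question's spelling
result = np.array([a[idx].sum() for idx in indices_iter])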