I'm trying to update an element in an array. If I've got an array, say:
[[0, 0],
[0, 0]]
As far as I knew, the way to update e.g. the first element to 0.5 was
array[0,0] = 0.5
However when I print the array the contents are unchanged. I read some things on Stack Overflow about copies being created of arrays but I don't know if this applies.
Any help would be great
Your problem is that your array is integer-valued (because you initialize it with integers), and when you write a float to it, the value is truncated to an integer, so 0.5 becomes 0. You can check that this is the case if you write
array = np.array([[0, 0], [0, 0]])
array[0, 0] = 1.5
>>> array
array([[1, 0],
       [0, 0]])
To get the expected behaviour, either initialize it with floats
array = np.array([[0., 0.], [0., 0.]])
or explicitly specify dtype
array = np.array([[0, 0], [0, 0]], dtype=np.float32)
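A minimal sketch for confirming the diagnosis on your own array: inspect `dtype` before assigning. The names `arr` and `float_arr` are made up for this illustration, and the exact integer dtype NumPy picks (for example int64) depends on the platform.

```python
import numpy as np

arr = np.array([[0, 0], [0, 0]])
print(arr.dtype)            # an integer dtype; the exact width is platform-dependent

# Assigning floats to an integer array silently drops the fractional part.
arr[0, 0] = 0.5
arr[0, 1] = -1.5
print(arr)

# Converting to a float dtype first keeps the fractional part.
float_arr = arr.astype(float)
float_arr[0, 0] = 0.5
print(float_arr)
```

The conversion drops the fraction (it truncates toward zero) rather than rounding to the nearest integer, which is why 1.5 is stored as 1 in the example above.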
You need to change the data type of the NumPy array to float before updating the value:
import numpy as np
a = [[0,0],[0,0]]
a = np.array(a)
a = a.astype('float64')
a[0,0] = 0.5
print(a)
this will give you
[[0.5 0. ]
[0. 0. ]]
The data type of the array is automatically set to int, and 0.5 converted to an int is 0.
# For example:
In [12]: int(0.5)
Out[12]: 0
# To construct the array try:
array = np.array([[0.0,0.0],[0.0,0.0]])
# or:
array = np.array([[0,0],[0,0]], dtype=float)
Then:
In [9]: array[0,0]=0.5
In [10]: array
Out[10]:
array([[0.5, 0. ],
[0. , 0. ]])
Python nested list objects don't support array-like indexing. You can use only a single value to index a list:
arr = [[0,0], [0,0]]
arr[0][0] = 0.5
arr # [[0.5, 0], [0, 0]]
To use the kind of indexing you mention in your post, you'll have to use a numpy array
import numpy as np
np_arr = np.array([[0,0], [0,0]], dtype=np.float32)
np_arr[0,0] = 0.5
Question
I have a CSR matrix, and I want to be able to retrieve the column indices and the values stored.
Data
For different reasons I'm not allowed to share my data, but here's a look (the numpy library is imported as np):
print(type(data) == type(ind) == list) # data and ind are lists
# OUT: True
print(len(data) == len(ind) == 134464) # data and ind have a size of 134,464
# OUT: True
print(np.alltrue([type(subarray) == np.ndarray for subarray in data])) # data (and ind) contains ndarray
# OUT: True
print(np.alltrue([len(data[i]) == len(ind[i]) for i in range(len(data))])) # each ndarray in data has the same length as the corresponding ndarray in ind
# OUT: True
print(min([len(data[i]) for i in range(len(data))]) >= 1) # each subarray of data (and of ind) has at least a length of 1
# OUT: True
print(np.alltrue([subarray.dtype == np.float64 for subarray in data])) # each subarray of data (and of ind) contains floats
# OUT: True
Code
Here is how I create the matrix (using csr_matrix from scipy.sparse):
indptr = np.empty(nbr_of_rows + 1) # nbr_of_rows = 134,464 = len(data)
indptr[0] = 0
for i in range(1, len(indptr)):
    indptr[i] = indptr[i-1] + len(data[i-1])
data = np.concatenate(data) # now I have type(data) = np.ndarray, data.dtype = np.float64 and len(data) = 2,821,574
ind = np.concatenate(ind) # same as above
X = csr_matrix((data, ind, indptr), shape=(nbr_of_rows, nbr_of_columns)) # nbr_of_columns = 3,991 = max(ind) + 1 (since min(ind) = 0)
print(f"The matrix has a shape of {X.shape} and a sparsity of {(1 - (X.nnz / (X.shape[0] * X.shape[1]))): .2%}.")
# OUT: The matrix has a shape of (134464, 3991) and a sparsity of 99.47%.
So far so good (at least I think so). But now, even though I manage to retrieve the column indices, I can’t successfully retrieve the values:
print(np.alltrue(ind == X.nonzero()[1])) # Retrieving the columns indices
# OUT: True
print(np.alltrue(data == X[X.nonzero()])) # Trying to retrieve the values
# OUT: False
print(np.alltrue(np.sort(data) == np.sort(X[X.nonzero()]))) # Seeing if the values are at least the same
# OUT: False
print(np.sum(data) == np.sum(X[X.nonzero()])) # Seeing if the values add up to the same total
# OUT: False
When I look deeper, I find that I get almost all the values (only a small number are wrong):
print(len(data) == len(X[X.nonzero()].tolist()[0]))
# OUT: True
print(len(np.argwhere((data != X[X.nonzero()]))))
# OUT: 2184
So I get "only" 2,184 wrong values out of 2,821,574 total values.
Can someone please help me in getting all the correct values from my CSR matrix?
EDIT
I now know, thanks to @hpaulj, that I can use the class attributes X.indices and X.data to retrieve the CSR format index array and the CSR format data array of the matrix. However, I would still like to know why, in my case, np.alltrue(X[X.nonzero()] == X.data) does not hold.
Without your data I can't replicate your problem, and probably wouldn't want to do so even with such a large array.
But I'll try to illustrate what I expect to happen when constructing a matrix this way. From another question I have a small matrix in an IPython session:
In [60]: Mx
Out[60]:
<1x3 sparse matrix of type '<class 'numpy.intc'>'
with 2 stored elements in Compressed Sparse Row format>
In [61]: Mx.A
Out[61]: array([[0, 1, 2]], dtype=int32)
nonzero returns the coo format indices, row, col
In [62]: Mx.nonzero()
Out[62]: (array([0, 0], dtype=int32), array([1, 2], dtype=int32))
The csr attributes are:
In [63]: Mx.data,Mx.indices,Mx.indptr
Out[63]:
(array([1, 2], dtype=int32),
array([1, 2], dtype=int32),
array([0, 2], dtype=int32))
Now let's make a new matrix, using the attributes of Mx. Assuming you constructed your indptr, indices, and data correctly, this should imitate what you've done:
In [64]: newM = sparse.csr_matrix((Mx.data, Mx.indices, Mx.indptr))
In [65]: newM.A
Out[65]: array([[0, 1, 2]], dtype=int32)
data matches between the two matrices:
In [68]: Mx.data==newM.data
Out[68]: array([ True, True])
The ids of the data arrays don't match, but their bases do. See my recent answer for why this is relevant:
https://stackoverflow.com/a/74543855/901925
In [75]: id(Mx.data.base), id(newM.data.base)
Out[75]: (2255407394864, 2255407394864)
That means changes to newM will appear in Mx:
In [77]: newM[0,1] = 100
In [78]: newM.A
Out[78]: array([[ 0, 100, 2]], dtype=int32)
In [79]: Mx.A
Out[79]: array([[ 0, 100, 2]], dtype=int32)
fuller test
Let's try a small scale test of your code:
In [92]: data = np.array([[1.23,2],[3],[]],object); ind = np.array([[1,2],[3],[]],object)
...: indptr = np.empty(4)
...: indptr[0] = 0
...: for i in range(1, 4):
...:     indptr[i] = indptr[i-1] + len(data[i-1])
...: data = np.concatenate(data).ravel()
...: ind = np.concatenate(ind).ravel() # same as above
In [93]: data,ind,indptr
Out[93]: (array([1.23, 2. , 3. ]), array([1., 2., 3.]), array([0., 2., 3., 3.]))
And the sparse matrix:
In [94]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,3))
In [95]: X
Out[95]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
data matches:
In [96]: X.data
Out[96]: array([1.23, 2. , 3. ])
In [97]: data == X.data
Out[97]: array([ True, True, True])
and is in fact a view:
In [98]: data[1]+=.23; data
Out[98]: array([1.23, 2.23, 3. ])
In [99]: X.A
Out[99]:
array([[0. , 1.23, 2.23],
[0. , 0. , 0. ],
[3. , 0. , 0. ]])
oops
I made an error in specifying the X shape:
In [110]: X = sparse.csr_matrix((data, ind, indptr), shape=(3,4))
In [111]: X.A
Out[111]:
array([[0. , 1.23, 2.23, 0. ],
[0. , 0. , 0. , 3. ],
[0. , 0. , 0. , 0. ]])
In [112]: X.data
Out[112]: array([1.23, 2.23, 3. ])
In [113]: X.nonzero()
Out[113]: (array([0, 0, 1], dtype=int32), array([1, 2, 3], dtype=int32))
In [114]: X[X.nonzero()]
Out[114]: matrix([[1.23, 2.23, 3. ]])
In [115]: data
Out[115]: array([1.23, 2.23, 3. ])
In [116]: data == X[X.nonzero()]
Out[116]: matrix([[ True, True, True]])
Depending on the type of the values you store in the matrix, numpy.float64 or numpy.int64, perhaps, the following post might answer your question: https://github.com/scipy/scipy/issues/13329#issuecomment-753541268
In particular, the comment "Apparently I don't get an error when data is a numpy array rather than a list." suggests that having data as numpy.array rather than a list could solve your problem.
Hopefully, this at least sets you on the right track.
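One thing the checks in the question would not catch, and which may or may not apply to the asker's data, is a row that lists the same column index more than once. The sketch below uses a made-up one-row matrix (the arrays `data`, `ind` and `indptr` here are hypothetical toy values, not the asker's) to show that `X.data` keeps duplicate entries exactly as stored, while element indexing adds them together, so the two can disagree at those positions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy input: row 0 stores column 1 twice.
data = np.array([1.0, 2.0, 3.0])
ind = np.array([1, 1, 2])
indptr = np.array([0, 3])
X = csr_matrix((data, ind, indptr), shape=(1, 3))

# The stored values come back unchanged through the .data attribute.
stored_matches = np.array_equal(X.data, data)
print(stored_matches)

# Element indexing sums every entry stored at (0, 1).
print(X[0, 1])

# sum_duplicates() merges repeated (row, col) entries in place;
# it is applied to a copy so X keeps its original layout.
Y = X.copy()
Y.sum_duplicates()
print(Y.nnz, Y.data)
```

If duplicates do turn out to be present, comparing against `X.data` and `X.indices` directly (as suggested in the edit to the question) avoids the issue, since those attributes return what was stored.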
I have a numpy array of shape (100, 100, 20) (in python 3)
I want to find for each 'pixel' the 15 channels with minimum values, and make them zeros (meaning: make the array sparse, keep only the 5 highest values).
Example:
input: array = [[1,2,3], [7,6,9], [12,71,3]], num_channels_to_zero = 2
output: [[0,0,3], [0,0,9], [0,71,0]]
How can I do it?
What I have so far:
array = numpy.random.rand(100, 100, 20)
inds = numpy.argsort(array, axis=-1) # also shape (100, 100, 20)
I want to do something like
array[..., inds[..., :15]] = 0
but it doesn't give me what I want
np.argsort outputs indices suitable for the [...]_along_axis functions of numpy. This includes np.put_along_axis:
import numpy as np
array = np.random.rand(100, 100, 20)
print(array[0,0])
#[0.44116124 0.94656705 0.20833932 0.29239585 0.33001399 0.82396784
# 0.35841905 0.20670957 0.41473762 0.01568006 0.1435386 0.75231818
# 0.5532527 0.69366173 0.17247832 0.28939985 0.95098187 0.63648877
# 0.90629116 0.35841627]
inds = np.argsort(array, axis=-1)
np.put_along_axis(array, inds[..., :15], 0, axis=-1)
print(array[0,0])
#[0. 0.94656705 0. 0. 0. 0.82396784
# 0. 0. 0. 0. 0. 0.75231818
# 0. 0. 0. 0. 0.95098187 0.
# 0.90629116 0. ]
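To tie this back to the small example in the question, here is a sketch that applies the same `np.put_along_axis` call to the 3x3 input. The `np.argpartition` variant is an addition of mine, not part of the answer above; it selects the k smallest entries per row without fully sorting, which is all that is needed here. The names `result` and `result_part` are illustrative.

```python
import numpy as np

array = np.array([[1, 2, 3], [7, 6, 9], [12, 71, 3]])
num_channels_to_zero = 2

# Same approach as above: full argsort, then zero the first k indices per row.
inds = np.argsort(array, axis=-1)
result = array.copy()
np.put_along_axis(result, inds[..., :num_channels_to_zero], 0, axis=-1)
print(result)

# Variant: argpartition only guarantees that the first k indices point at
# the k smallest values (in no particular order).
part = np.argpartition(array, num_channels_to_zero, axis=-1)
result_part = array.copy()
np.put_along_axis(result_part, part[..., :num_channels_to_zero], 0, axis=-1)
print(result_part)
```

Both versions work on a copy so the original `array` is left untouched; drop the `.copy()` if in-place modification is what you want.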
As mentioned in the NumPy documentation:
From each row, a specific element should be selected. The row index is just [0, 1, 2] and the column index specifies the element to choose for the corresponding row, here [0, 1, 0]. Using both together the task can be solved using advanced indexing:
>>> x = np.array([[1, 2], [3, 4], [5, 6]])
>>> x[[0, 1, 2], [0, 1, 0]]
array([1, 4, 5])
So, for your example:
a = np.array([[1,2,3], [7,6,9], [12,71,3]])
amax = a.argmax(axis=-1)
a[np.arange(a.shape[0]), amax] = 0
a
array([[ 1, 2, 0],
[ 7, 6, 0],
[12, 0, 3]])
I'm looking to see if there is a more efficient way (i.e. using native NumPy functionality) to achieve what I'm doing currently.
My process is I start with an array a:
a = np.array([[0,2,0,-1],[-0.2,0,-0.1,0],[0,0,-0.1,0],[0,0,0,0]])
array([[ 0. , 2. , 0. , -1. ],
[-0.2, 0. , -0.1, 0. ],
[ 0. , 0. , -0.1, 0. ],
[ 0. , 0. , 0. , 0. ]])
I then filter based on where the values are not equal to 0:
r_indices, c_indices = np.where(a != 0)
(array([0, 0, 1, 1, 2]), array([1, 3, 0, 2, 2]))
From there, I create a Python dictionary b like so:
b = {i: c_indices[r_indices == i] for i in np.unique(r_indices)}
{
0: array([1, 3]),
1: array([0, 2]),
2: array([2]),
}
I do this because I want to know for a given unique row index r, which column indices are not 0.
My own preference is to try to use NumPy as much as possible to take advantage of speed benefits. However, I'm not sure how else to structure this in NumPy since the values in the dictionary could range from a length of 0 (no values are not zero) to 4 (all values are not zero).
Am I being paranoid about the potential speed benefits?
You can use Pandas in the following way:
import pandas as pd
import numpy as np
if __name__=='__main__':
    a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
    rows, cols = np.where(a !=0)
    x = list(zip(rows, cols))
    df = pd.DataFrame.from_records(data=x)
    l = df.groupby(0)[1].apply(list)
    L = [np.array(a) for a in l.values]
    d = dict(zip(np.unique(rows), L))
Output
{0: array([1, 3]), 1: array([0, 2]), 2: array([2])}
Since pandas works with NumPy under the hood, this should scale better than the dictionary comprehension in the question, which rescans r_indices once per unique row. It is still worth timing both on your own data.
Also, if all you need is a dictionary-like object, you could improve performance further by using the grouped Series l directly:
l.loc[0]
which results in:
[1, 3]
which is equivalent to b[0] in your example.
You can then omit the last two lines altogether, since pandas provides very fast mechanisms for handling large amounts of tabular data and is generally preferable to a plain dict object when both are used for the same thing.
Cheers.
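For comparison, a NumPy-only sketch of the same grouping, assuming the row indices come back sorted (which holds for `np.nonzero` / `np.where` on a 2D array, since results are returned in row-major order). The variable names `rows`, `counts` and `groups` are illustrative.

```python
import numpy as np

a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])

# Row and column indices of the nonzero entries, in row-major order.
r_indices, c_indices = np.nonzero(a)

# Unique rows and how many nonzero entries each one has.
rows, counts = np.unique(r_indices, return_counts=True)

# Cut the column indices at the boundaries between rows.
groups = np.split(c_indices, np.cumsum(counts)[:-1])

b = dict(zip(rows.tolist(), groups))
print(b)
```

This makes a single pass over the index arrays instead of one boolean mask per row. As in the original dictionary, rows with no nonzero entries (row 3 here) simply have no key.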
I know something similar to this question has been asked many times over already, but all answers given to similar questions only seem to work for arrays with 2 dimensions.
My understanding of np.argsort() is that np.sort(array) == array[np.argsort(array)] should be True.
I have found out that this is indeed correct if np.ndim(array) == 2, but it gives different results if np.ndim(array) > 2.
Example:
>>> array = np.array([[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]])
>>> np.sort(array)
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
>>> array.argsort()
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]])
>>> array[array.argsort()]
array([[[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]],
[[[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]],
[[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]],
[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]]]]])
So, can anybody explain to me how exactly np.argsort() can be used as the indices to obtain the sorted array?
The only way I can come up with is:
args = np.argsort(array)
array_sort = np.zeros_like(array)
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        array_sort[i, j] = array[i, j, args[i, j]]
which is extremely tedious and cannot be generalized for any given number of dimensions.
Here is a general method:
import numpy as np
array = np.array([[[ 0.81774634, 0.62078744],
[ 0.43912609, 0.29718462]],
[[ 0.1266578 , 0.82282054],
[ 0.98180375, 0.79134389]]])
a = 1 # or 0 or 2
order = array.argsort(axis=a)
idx = np.ogrid[tuple(map(slice, array.shape))]
# if you don't need full ND generality: in 3D this can be written
# much more readable as
# m, n, k = array.shape
# idx = np.ogrid[:m, :n, :k]
idx[a] = order
print(np.all(array[idx] == np.sort(array, axis=a)))
Output:
True
Explanation: We must specify for each element of the output array the complete index of the corresponding element of the input array. Thus each index into the input array has the same shape as the output array or must be broadcastable to that shape.
The indices for the axes along which we do not sort/argsort stay in place. We therefore need to pass a broadcastable range(array.shape[i]) for each of those. The easiest way is to use ogrid to create such a range for all dimensions (if we used this directly, the array would come back unchanged) and then replace the index corresponding to the sort axis with the output of argsort.
UPDATE March 2019:
NumPy is becoming stricter about requiring multi-axis indices to be passed as tuples. Currently, array[idx] will trigger a deprecation warning. To be future-proof, use array[tuple(idx)] instead. (Thanks @Nathan)
Or use numpy's new (version 1.15.0) convenience function take_along_axis:
np.take_along_axis(array, order, a)
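As a quick check, the sketch below runs both routes (the `ogrid`-based index and `np.take_along_axis`) over every axis of the example array and compares each against `np.sort`. The loop and the `all_match` flag are my own scaffolding; the `list(...)` wrapper is there because, depending on the NumPy version, `ogrid` may return a tuple, which cannot be assigned into.

```python
import numpy as np

array = np.array([[[0.81774634, 0.62078744],
                   [0.43912609, 0.29718462]],
                  [[0.1266578, 0.82282054],
                   [0.98180375, 0.79134389]]])

all_match = True
for axis in range(array.ndim):
    order = np.argsort(array, axis=axis)
    expected = np.sort(array, axis=axis)

    # Route 1: the convenience function.
    by_take = np.take_along_axis(array, order, axis=axis)

    # Route 2: broadcastable ranges for every axis, with the sort axis
    # replaced by the argsort output, passed as a tuple.
    idx = list(np.ogrid[tuple(map(slice, array.shape))])
    idx[axis] = order
    by_index = array[tuple(idx)]

    all_match &= np.array_equal(by_take, expected)
    all_match &= np.array_equal(by_index, expected)

print(all_match)
```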
@Hameer's answer works, though it could use some simplification and explanation.
sort and argsort are working on the last axis. argsort returns a 3d array, same shape as the original. The values are the indices on that last axis.
In [17]: np.argsort(arr, axis=2)
Out[17]:
array([[[1, 0],
[1, 0]],
[[0, 1],
[1, 0]]], dtype=int32)
In [18]: _.shape
Out[18]: (2, 2, 2)
In [19]: idx=np.argsort(arr, axis=2)
To use this we need to construct indices for the other dimensions that broadcast to the same (2,2,2) shape. ix_ is a handy tool for this.
Just using idx as one of the ix_ inputs doesn't work:
In [20]: np.ix_(range(2),range(2),idx)
....
ValueError: Cross index must be 1 dimensional
Instead I use the last range, and then ignore it. @Hameer instead constructs the 2d ix_, and then expands them.
In [21]: I,J,K=np.ix_(range(2),range(2),range(2))
In [22]: arr[I,J,idx]
Out[22]:
array([[[ 0.62078744, 0.81774634],
[ 0.29718462, 0.43912609]],
[[ 0.1266578 , 0.82282054],
[ 0.79134389, 0.98180375]]])
So the indices for the other dimensions work with the (2,2,2) idx array:
In [24]: I.shape
Out[24]: (2, 1, 1)
In [25]: J.shape
Out[25]: (1, 2, 1)
That's the basics for constructing the other indices when you are given multidimensional index for one dimension.
@Paul constructs the same indices with ogrid:
In [26]: np.ogrid[slice(2),slice(2),slice(2)] # np.ogrid[:2,:2,:2]
Out[26]:
[array([[[0]],
[[1]]]), array([[[0],
[1]]]), array([[[0, 1]]])]
In [27]: _[0].shape
Out[27]: (2, 1, 1)
ogrid as a class works with slices, while ix_ requires a list/array/range.
argsort for a multidimensional ndarray (from 2015) works with a 2d array, but the same logic applies (find a range index(s) that broadcasts with the argsort).
Here's a vectorized implementation. It should be N-dimensional and quite a bit faster than what you're doing.
import numpy as np
def sort1(array, args):
    array_sort = np.zeros_like(array)
    for i in range(array.shape[0]):
        for j in range(array.shape[1]):
            array_sort[i, j] = array[i, j, args[i, j]]
    return array_sort
def sort2(array, args):
    shape = array.shape
    idx = np.ix_(*tuple(np.arange(l) for l in shape[:-1]))
    idx = tuple(ar[..., None] for ar in idx)
    array_sorted = array[idx + (args,)]
    return array_sorted
if __name__ == '__main__':
    array = np.random.rand(5, 6, 7)
    idx = np.argsort(array)
    result1 = sort1(array, idx)
    result2 = sort2(array, idx)
    print(np.array_equal(result1, result2))
I want to make a function that, when fed an array, returns an array of the same shape but with all zeros except for one value, the max. E.g. with an array like this:
my_array = np.arange(9).reshape((3,3))
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]]
When it is passed to the function, I want the output to look like this:
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 8.]]
Exception:
When there are several equal max values, I only want one of them kept and the rest zeroed out (the order doesn't matter).
I am honestly clueless as to how to do this in a way that is both elegant and efficient. How would you do it?
For efficiency, use array-initialization and argmax to get the max index (first one linearly indexed if more than one) -
def app_flat(my_array):
    out = np.zeros_like(my_array)
    idx = my_array.argmax()
    out.flat[idx] = my_array.flat[idx]
    return out
We can also use ndarray.ravel() in place of ndarray.flat, and I would expect the performance numbers to be comparable.
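A sketch of that `ravel()` variant, under one assumption worth stating: writing through `out.ravel()` only reaches `out` when `ravel()` returns a view, which it does for a C-contiguous array. The hypothetical `app_ravel` below therefore allocates the output with `np.zeros` (always C-contiguous) rather than `np.zeros_like`, which can follow the memory layout of its input.

```python
import numpy as np

def app_ravel(my_array):
    # np.zeros gives a C-contiguous array, so out.ravel() is a view of it.
    out = np.zeros(my_array.shape, dtype=my_array.dtype)
    idx = my_array.argmax()              # flat index of the first maximum
    out.ravel()[idx] = my_array.ravel()[idx]
    return out

my_array = np.array([[0, 1, 2],
                     [3, 4, 5],
                     [8, 7, 8]])
result = app_ravel(my_array)
print(result)
```

With two equal maxima, as in this sample, only the first one in row-major order is kept, matching the behaviour described above for `argmax`.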
For this sparse output, to gain memory efficiency and hence performance, you might want to use sparse matrices, especially for large arrays. For sparse matrix output, an alternative looks like this -
from scipy.sparse import coo_matrix
def app_sparse(my_array):
    idx = my_array.argmax()
    r,c = np.unravel_index(idx, my_array.shape)
    return coo_matrix(([my_array[r,c]],([r],[c])),shape=my_array.shape)
Sample run -
In [336]: my_array
Out[336]:
array([[0, 1, 2],
[3, 4, 5],
[8, 7, 8]])
In [337]: app_flat(my_array)
Out[337]:
array([[0, 0, 0],
[0, 0, 0],
[8, 0, 0]])
In [338]: app_sparse(my_array)
Out[338]:
<3x3 sparse matrix of type '<type 'numpy.int64'>'
with 1 stored elements in COOrdinate format>
In [339]: app_sparse(my_array).toarray() # just to confirm values
Out[339]:
array([[0, 0, 0],
[0, 0, 0],
[8, 0, 0]])
Runtime test on bigger array -
In [340]: my_array = np.random.randint(0,1000,(5000,5000))
In [341]: %timeit app_flat(my_array)
10 loops, best of 3: 34.9 ms per loop
In [342]: %timeit app_sparse(my_array) # sparse matrix output
100 loops, best of 3: 17.2 ms per loop
With a few lines:
my_array = np.arange(9).reshape((3,3))
my_array2 = np.zeros(len(my_array.ravel()))
my_array2[np.argmax(my_array)] = np.max(my_array)
my_array2 = my_array2.reshape(my_array.shape)