A 2D array consists of 2 axes: axis=0 represents the rows and axis=1 represents the columns.
aa = np.random.randn(10, 2) # a 2D array: the first axis has 10 rows and the second axis has 2 columns
array([[ 0.6999521 , -0.17597954],
[ 1.70622947, -0.85919459],
[-0.90019284, 0.80774052],
[-1.42953238, 0.19727917],
[-0.03416532, 0.49584749],
[-0.28981586, -0.77484498],
[-1.31129122, 0.423833 ],
[-0.43920016, -1.93541758],
[-0.06667634, 2.09925218],
[ 1.24633485, -0.04153847]])
Why, when I want to scatter the points, do I only use the first and second columns along axis=1? Do dimensions mean columns when plotting and axes at other times? Can you please explain the reasons for doing it like this? Good references on dimensions in this context would also help.
plt.scatter(x[:,0], x[:,1]) # do these also mean dimensions, or columns?
Why x[:,0], x[:,1] and not x[0,:], x[:,1]?
It can be difficult to visualize this, especially in multiple dimensions.
The parameters to the [] operator represent the dimensions. Your first dimension is the rows. The first row is array[0]. Your second dimension is the columns. The entire second column is called array[:,1] -- the ":" is a numpy notation that means "take all of this dimension". array[2,1] refers to the second column in the third row.
plt.scatter expects the x coordinate values as its first parameter, and the y coordinate values as its second parameter. plt.scatter(x[:,0], x[:,1]) means "take all of column 0" and "take all of column 1", which is the way scatter wants them.
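The slicing described above can be checked without plotting anything. This is a minimal sketch; the seeded generator and the names xs, ys and row0 are only illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only so the sketch is repeatable
aa = rng.standard_normal((10, 2))    # same shape as the array in the question

xs = aa[:, 0]     # every row, column 0 -> shape (10,)
ys = aa[:, 1]     # every row, column 1 -> shape (10,)
row0 = aa[0, :]   # row 0, every column -> shape (2,)

print(xs.shape, ys.shape, row0.shape)   # (10,) (10,) (2,)
```

xs and ys have one entry per point, which is what scatter needs; row0 holds the two coordinates of a single point, so it cannot be paired with a length-10 array.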
With this randn call you make a 2d array with the specified shape. The dimensions, 10 and 2, don't represent anything - that's an abstract (10,2) array. Meaning comes from how you use it.
In [50]: aa = np.random.randn(10, 2)
In [51]: aa
Out[51]:
array([[-0.26769106, 0.09882999],
[-1.5605514 , -1.38614473],
[ 1.23312852, 0.86838848],
[ 1.2603898 , 2.19895989],
[-1.66937976, 0.79666952],
[-0.15596669, 1.47848784],
[ 1.74964902, 0.39280584],
[-1.0982447 , 0.46888408],
[ 0.84396231, -0.34809148],
[-0.83489678, -1.8093045 ]])
That's a display - with rows and columns.
Rather than pass the slices directly to scatter, let's assign them to variables:
In [52]: x = aa[:,0]; y = aa[:,1]; x,y
Out[52]:
(array([-0.26769106, -1.5605514 , 1.23312852, 1.2603898 , -1.66937976,
-0.15596669, 1.74964902, -1.0982447 , 0.84396231, -0.83489678]),
array([ 0.09882999, -1.38614473, 0.86838848, 2.19895989, 0.79666952,
1.47848784, 0.39280584, 0.46888408, -0.34809148, -1.8093045 ]))
We now have two 1d arrays with shape (10,) (that's a 1 element tuple). We can then plot them with:
In [53]: plt.scatter(x,y)
I could just as well have used
x = np.arange(10); y = np.random.randn(10)
to make two 1d arrays.
The dimensions of the aa array have nothing to do with the axes of a scatter plot.
I could select a 'row' of aa, but will only get a (2,) shape array. That can't be plotted against a (10,) array:
In [53]: aa[0,:]
Out[53]: array([-0.26769106, 0.09882999])
As for the meaning of dimensions in sum/mean, why not experiment?
Sum all values:
In [54]: aa.sum()
Out[54]: 2.2598841819604134
sum down the columns, resulting in one value per column:
In [55]: aa.sum(axis=0)
Out[55]: array([-0.49960074, 2.75948492])
It can help to keepdims, producing a (1,2) array:
In [56]: aa.sum(axis=0, keepdims=True)
Out[56]: array([[-0.49960074, 2.75948492]])
or a (10,1) array:
In [57]: aa.sum(axis=1, keepdims=True)
Out[57]:
array([[-0.16886107],
[-2.94669614],
[ 2.101517 ],
[ 3.45934969],
[-0.87271024],
[ 1.32252115],
[ 2.14245486],
[-0.62936062],
[ 0.49587083],
[-2.64420128]])
There's some ambiguity when talking about summing along rows or columns when dealing with 2d arrays. It becomes clearer when we apply sum to 1d arrays (where there is only one axis to sum over) or to 3d arrays.
For example, note which dimension is missing when I do:
In [58]: np.arange(24).reshape(2,3,4).sum(axis=1).shape
Out[58]: (2, 4)
or
In [59]: np.arange(24).reshape(2,3,4).sum(axis=2)
Out[59]:
array([[ 6, 22, 38],
[54, 70, 86]])
Again - dimensions of numpy arrays are abstract things. An array can have 0, 1, 2 or more dimensions (up to 32 in NumPy 1.x, 64 since NumPy 2.0). Most of linear algebra deals with 2d arrays, matrices and "vectors". You can do LA with numpy, but numpy is used for much more.
edit
You could think of your aa as 10 2-element points. Then aa[:,0] are all the x coordinates. A mean with axis=0 would be the "center of mass" of those points.
In [60]: np.mean(aa, axis=0)
Out[60]: array([-0.04996007, 0.27594849])
Mean on axis=1 may not make sense, though you could calculate the norm of the points (sqrt(x^2+y^2)), or the length of the vectors represented by the points.
In [61]: np.linalg.norm(aa, axis=1)
Out[61]:
array([0.28535218, 2.08727523, 1.50821235, 2.53456249, 1.84973271,
1.48669159, 1.79320052, 1.19414978, 0.91292938, 1.99264533])
For the direction of these points I'd use:
np.arctan2(aa[:,1], aa[:,0])
(arctan2 takes the y values first, then the x values).
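To make the norm and direction calculation concrete, here is a small sketch on three made-up points (the array pts is an assumption for illustration, not data from the question). It treats column 0 as x and column 1 as y, and relies on np.arctan2 taking its arguments in (y, x) order.

```python
import numpy as np

# Three hypothetical points, one per row: (x, y)
pts = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [-1.0, -1.0]])

x, y = pts[:, 0], pts[:, 1]

r = np.linalg.norm(pts, axis=1)   # distance of each point from the origin
theta = np.arctan2(y, x)          # angle of each point, in radians

print(r)
print(np.degrees(theta))
```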
Say I have some matrix W of shape MxN and a long array of indices z of shape Mx1.
Now, assume I'd like to sum up the elements of each row in W, excluding the element whose column index is given for that row in z.
1-d example:
import numpy as np
W = np.array([1.0, 2.0, 8.0])
z = 2
np.sum(np.delete(W,z))
MxN example and desired output:
import numpy as np
W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
z = np.array([0,2]).reshape(2,1)
# desired output
# [10. 20.]
I tried to use np.delete with axis=1, with no success.
I managed to get around it using tricks like:
W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
z = np.array([0,2])
W[np.arange(z.shape[0]), z]=0
print(np.sum(W, axis=1))
# [10. 20.]
but I'm wondering if there's a more elegant way.
Using broadcasting to get the mask to simulate deletion and then sum-reduce -
(W*(z != np.arange(W.shape[-1]))).sum(-1)
Sample runs -
For 2D case :
In [61]: W = np.array([[1.0,2.0,8.0], [5.0,15.0,3.0]])
...: z = np.array([0,2]).reshape(2,1)
In [62]: (W*(z != np.arange(W.shape[-1]))).sum(-1)
Out[62]: array([10., 20.])
Works just as well for the 1D case :
In [59]: W = np.array([1.0, 2.0, 8.0])
...: z = 2
In [60]: (W*(z != np.arange(W.shape[-1]))).sum(-1)
Out[60]: 3.0
For 2D case :
With np.einsum for the sum-reduction -
In [53]: np.einsum('ij,ij->i',W,z != np.arange(W.shape[1]))
Out[53]: array([10., 20.])
Summing and then subtracting the z-indexed values for 2D case -
In [134]: W.sum(1) - np.take_along_axis(W,z,axis=1).squeeze(1)
Out[134]: array([10., 20.])
Extend to handle both 2D and 1D cases -
W.sum(-1)-np.take_along_axis(W,np.atleast_1d(z),axis=-1).squeeze(-1)
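The broadcasting mask can be unpacked step by step. The sketch below is one possible wrapper for the 2D case only; the function name row_sums_excluding is hypothetical, and it uses np.where instead of multiplying by the mask (a swapped-in variation, so that a NaN or inf in an excluded position does not leak into the sum).

```python
import numpy as np

def row_sums_excluding(W, z):
    """Sum each row of W, skipping the column index given for that row in z."""
    W = np.asarray(W, dtype=float)
    z = np.asarray(z).reshape(-1, 1)     # (M, 1) column of indices
    cols = np.arange(W.shape[1])         # (N,) column numbers
    keep = z != cols                     # broadcasts to an (M, N) boolean mask
    return np.where(keep, W, 0.0).sum(axis=1)

W = np.array([[1.0, 2.0, 8.0], [5.0, 15.0, 3.0]])
z = np.array([0, 2])
result = row_sums_excluding(W, z)
print(result)   # [10. 20.]
```

Unlike the W[np.arange(...), z] = 0 trick from the question, this leaves W unmodified.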
@Divaka's answers are pretty good; I'll just give another perspective on your question. If you need masking to ignore certain indices while doing multiple operations on an array, you should use a numpy masked array, np.ma.array, instead of a regular np.array. Masked arrays exist precisely for the purpose of ignoring certain indices.
See the numpy masked array documentation for more info.
z = np.array([0,2]).reshape(2,1)
W_ma = np.ma.array(W, mask=z == np.arange(W.shape[-1]))
In [36]: W_ma
Out[36]:
masked_array(
data=[[--, 2.0, 8.0],
[5.0, 15.0, --]],
mask=[[ True, False, False],
[False, False, True]],
fill_value=1e+20)
From this W_ma masked array, you can do almost all the same operations as on an np.array. For the sum:
W_ma.sum(1)
Out[44]:
masked_array(data=[10.0, 20.0],
mask=[False, False],
fill_value=1e+20)
To turn a masked array into a regular array, you can use compressed, filled, or compress_rowcols:
In [46]: W_ma.sum(1).compressed()
Out[46]: array([10., 20.])
Note: a masked array is useful when you do multiple operations that ignore the same indices. If you only need one or two such operations, there is no point in using a masked array.
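As an illustration of the "multiple operations" case, the sketch below builds the mask once and then reuses it for a sum, a mean and a max. The fill values passed to filled are arbitrary choices for this example.

```python
import numpy as np

W = np.array([[1.0, 2.0, 8.0], [5.0, 15.0, 3.0]])
z = np.array([0, 2]).reshape(2, 1)

# True marks the entries to ignore: column z[i] in row i
W_ma = np.ma.array(W, mask=(z == np.arange(W.shape[-1])))

# Each reduction skips the masked entries; filled() converts back to a plain ndarray
row_sum = W_ma.sum(axis=1).filled(0.0)
row_mean = W_ma.mean(axis=1).filled(np.nan)
row_max = W_ma.max(axis=1).filled(np.nan)

print(row_sum, row_mean, row_max)
```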
I'm trying to calculate the mean of non-zero values in each row of a sparse row matrix. Using the matrix's mean method doesn't do it:
>>> from scipy.sparse import csr_matrix
>>> a = csr_matrix([[0, 0, 2], [1, 3, 8]])
>>> a.mean(axis=1)
matrix([[ 0.66666667],
[ 4. ]])
The following works but is slow for large matrices:
>>> import numpy as np
>>> b = np.zeros(a.shape[0])
>>> for i in range(a.shape[0]):
... b[i] = a.getrow(i).data.mean()
...
>>> b
array([ 2., 4.])
Could anyone please tell me if there is a faster method?
With a CSR format matrix, you can do this even more easily:
sums = a.sum(axis=1).A1
counts = np.diff(a.indptr)
averages = sums / counts
Row-sums are directly supported, and the structure of the CSR format means that the difference between successive values in the indptr array corresponds exactly to the number of stored (normally nonzero) elements in each row.
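One thing the indptr approach does not cover is a row with no stored entries, where the count is zero. The sketch below is one way to handle that, assuming NaN is an acceptable result for empty rows; the three-row matrix is made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical matrix whose middle row has no nonzero entries
a = csr_matrix(np.array([[0, 0, 2],
                         [0, 0, 0],
                         [1, 3, 8]]))

sums = np.asarray(a.sum(axis=1)).ravel()   # row sums as a flat ndarray
counts = np.diff(a.indptr)                 # stored entries per row

# Divide only where a row has entries; empty rows keep the NaN placeholder
averages = np.full(a.shape[0], np.nan)
np.divide(sums, counts, out=averages, where=counts > 0)

print(counts)
print(averages)
```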
This seems like the typical problem where you can use numpy.bincount. For this I made use of three functions:
(x,y,z)=scipy.sparse.find(a)
returns the rows (x), columns (y) and values (z) of the nonzero entries of the sparse matrix. For instance, x is array([0, 1, 1, 1]).
numpy.bincount(x) returns, for each row number, how many nonzero elements you have.
numpy.bincount(x, weights=z) returns, for each row, the sum of the nonzero elements.
A final working code:
from scipy.sparse import csr_matrix
a = csr_matrix([[0, 0, 2], [1, 3, 8]])
import numpy
import scipy.sparse
(x,y,z)=scipy.sparse.find(a)
countings=numpy.bincount(x)
sums=numpy.bincount(x,weights=z)
averages=sums/countings
print(averages)
returns:
[ 2. 4.]
I always like summing the values over whatever axis you are interested in and dividing by the total of the nonzero elements in the respective row/column.
Like so:
sp_arr = csr_matrix([[0, 0, 2], [1, 3, 8]])
col_avg = sp_arr.sum(0) / (sp_arr != 0).sum(0)
row_avg = sp_arr.sum(1) / (sp_arr != 0).sum(1)
print(col_avg)
matrix([[ 1., 3., 5.]])
print(row_avg)
matrix([[ 2.],
[ 4.]])
Basically you are summing the total value of all entries along the given axis and dividing by the sum of the True entries where the matrix != 0 (which is the number of real entries).
I find this approach less complicated and easier than the other options.
A simple method to return the average values:
a.sum(axis=0) / a.getnnz(axis=0)
This assumes that you don't have any explicit zeros in your matrix.
Change the axis as needed.
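The caveat about explicit zeros can be seen in a small sketch. The matrix below is built from hypothetical (data, (row, col)) triplets so that it stores one zero explicitly; getnnz counts that stored zero until eliminate_zeros() removes it.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Same values as the example matrix, plus one explicitly stored zero at (0, 0)
data = np.array([2.0, 0.0, 1.0, 3.0, 8.0])
rows = np.array([0, 0, 1, 1, 1])
cols = np.array([2, 0, 0, 1, 2])
a = csr_matrix((data, (rows, cols)), shape=(2, 3))

before = a.getnnz(axis=1).copy()   # counts stored entries, including the zero
a.eliminate_zeros()                # drop explicitly stored zeros in place
after = a.getnnz(axis=1)

row_avg = np.asarray(a.sum(axis=1)).ravel() / after
print(before, after, row_avg)
```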
I'm a real beginner with Python, and I have a recurrent problem with my ndarrays.
I'm very confused by the brackets (is there a schematic summary of the use of brackets in Python anywhere?). I always end up with arrays that have too many dimensions.
Right now I have this one:
>>> values
Out[1]:
array([[[ array([[ 4.23156519, -0.93539198],
[ 3.50074853, -1.67043386],
[ 4.64192393, -1.03918172],
[ 4.52056725, 0.2561267 ],
[ 3.36400016, 0.26435125],
[ 3.82025672, 1.16503286]])]]], dtype=object)
From here, how can I reduce the dimensions? I just want a 6x2 array. I tried np.reshape, but since the current shape of values is (1,1,1), I can't directly reshape the array into a 6x2 one.
I'm sorry for the silly question, I'm seeking a general and schematic answer that would explain me how to pass from a higher dimension to a lower one and vice versa.
Here is the way I created the array. values is clustered_points
indices=[] # initialize indices
clustered_points=[] # initialize array containing points in different sub-arrays=clusters
for k in range(len(mu)):
a=r[:,k]
index=[t for t in range(len(a)) if a[t] == 1]
indices.append(index)
clustered_points.append(data[indices[k]])
clustered_points=np.reshape(clustered_points,(len(clustered_points),1,1))
To make an array that matches your initial display, I have to take special care to embed one array within another:
In [402]: x=np.array([[ 4.23156519, -0.93539198],
[ 3.50074853, -1.67043386],
[ 4.64192393, -1.03918172],
[ 4.52056725, 0.2561267 ],
[ 3.36400016, 0.26435125],
[ 3.82025672, 1.16503286]])
In [403]: a=array([[[None]]],dtype=object)
In [404]: a[0,0,0]=x
In [405]: a
Out[405]:
array([[[ array([[ 4.23156519, -0.93539198],
[ 3.50074853, -1.67043386],
[ 4.64192393, -1.03918172],
[ 4.52056725, 0.2561267 ],
[ 3.36400016, 0.26435125],
[ 3.82025672, 1.16503286]])]]], dtype=object)
In [406]: a.shape
Out[406]: (1, 1, 1)
In [407]: a[0,0,0].shape
Out[407]: (6, 2)
Simply doing a cut-n-paste from the display produces a different array with shape (1,1,1,6,2). That does not have the inner array marking. Either way a[0,0,0] gives the inner (6,2) array.
reshape and squeeze work on a (1,1,1,6,2) array, but not on a (6,2) nested inside a (1,1,1). You need to understand the difference.
(edit)
To run your 'how I did it' clip, I have to make some guesses as to the inputs (that almost merits a downvote).
I'll guess at some inputs:
In [420]: mu=np.arange(3); r=np.ones((4,3));data=np.ones(5)
In [421]: %paste
indices=[] # initialize indices
clustered_points=[] # initialize array containing points in different sub-arrays=clusters
for k in range(len(mu)):
a=r[:,k]
index=[t for t in range(len(a)) if a[t] == 1]
indices.append(index)
clustered_points.append(data[indices[k]])
## -- End pasted text --
In [422]: clustered_points
Out[422]:
[array([ 1., 1., 1., 1.]),
array([ 1., 1., 1., 1.]),
array([ 1., 1., 1., 1.])]
clustered_points is a list with several 1d arrays.
I can do
np.reshape(clustered_points,(12,1,1))
np.reshape(clustered_points,(3,4,1,1))
though it would be better, I think, to do np.array(clustered_points) first, and maybe even check its shape.
Since
np.reshape(clustered_points,(len(clustered_points),1,1))
supposedly works then clustered_points must be a list of n single element arrays. But this reshape should produce a (n,1,1) array, not your (1,1,1,...) array.
So that edit doesn't help.
=========================
I'm seeking a general and schematic answer that would explain me how to pass from a higher dimension to a lower one and vice versa.
The first step is to be clear, to yourself and others, about the structure of your array. That includes knowing shape and dtype. And if the dtype is anything other than simple numerics, pay attention to the structure of the elements (e.g. objects within the array).
Singleton dimensions (size 1) can be removed with indexing, [0], or squeeze. reshape also removes dimensions (or adds them), but you have to pay attention to the total number of elements. If the old shape had 12 elements, the new one has to have 12 as well. But reshape does not operate across dtype boundaries: it cannot reach inside an object element.
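The difference between a (6,2) array nested inside a (1,1,1) object array and a plain (1,1,1,6,2) array can be sketched as follows. The values in x are placeholders, and the names nested and flat5d are assumptions for this example.

```python
import numpy as np

x = np.arange(12.0).reshape(6, 2)     # stand-in for the 6x2 data

# Case 1: a (1,1,1) object array whose single element is the 6x2 array
nested = np.empty((1, 1, 1), dtype=object)
nested[0, 0, 0] = x

# Case 2: an ordinary 5-d numeric array
flat5d = x.reshape(1, 1, 1, 6, 2)

print(nested.shape, nested.squeeze().shape)   # (1, 1, 1) ()
print(nested[0, 0, 0].shape)                  # (6, 2)
print(flat5d.squeeze().shape)                 # (6, 2)
print(flat5d.reshape(6, 2).shape)             # (6, 2)
```

squeeze on the object array only strips the outer singleton axes and still leaves a 0-d object array; indexing with [0, 0, 0] is what actually returns the inner array.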
I think you're looking for numpy.squeeze:
#!/usr/bin/env python
import numpy
a = [[[[[ 4.23156519, -0.93539198],
[ 3.50074853, -1.67043386],
[ 4.64192393, -1.03918172],
[ 4.52056725, 0.2561267 ],
[ 3.36400016, 0.26435125],
[ 3.82025672, 1.16503286]]]]]
a = numpy.array(a)
print("a.shape=%s" % str(a.shape))
b = numpy.squeeze(a)
print("b.shape=%s" % str(b.shape))
gives
a.shape=(1, 1, 1, 6, 2)
b.shape=(6, 2)
The underlying issue is probably, how did you create this array?
A python list object cannot be manipulated in the same ways that np.ndarrays can be. In general, once your final list is created, you can cast it into a numpy array:
values = []
# fill values with values.append(...)
# ...
values = np.asarray(values)
To find the shape of an array, you can use either A.shape or np.shape(A). To remove dimensions of length one, the best approach is to use the squeeze method either as A.squeeze() or np.squeeze(A), i.e:
>>> values.squeeze()
array([[4.23156519, -0.93539198],
[3.50074853, -1.67043386],
[4.64192393, -1.03918172],
[4.52056725, 0.2561267],
[3.36400016, 0.26435125],
[3.82025672, 1.16503286]], dtype=object)
If your values array is really what you've said, then it should also be fine to use reshape
>>> values.reshape(6,2)
array([[4.23156519, -0.93539198],
[3.50074853, -1.67043386],
[4.64192393, -1.03918172],
[4.52056725, 0.2561267],
[3.36400016, 0.26435125],
[3.82025672, 1.16503286]], dtype=object)
If you're getting an error trying to reshape values, is it possible it is actually a list instead of an array?
If you want to create 6x2 array then just do this:
A = array([[4.23156519, -0.93539198],
[3.50074853, -1.67043386],
[4.64192393, -1.03918172],
[4.52056725, 0.2561267],
[3.36400016, 0.26435125],
[3.82025672, 1.16503286]], dtype=object)
If you want to reduce your array:
A = array([[[ array([[ 4.23156519, -0.93539198],
[ 3.50074853, -1.67043386],
[ 4.64192393, -1.03918172],
[ 4.52056725, 0.2561267 ],
[ 3.36400016, 0.26435125],
[ 3.82025672, 1.16503286]])]]], dtype=object)
you can get the 6x2 array by doing A[0][0][0] (this works whether A is a 1x1x1 object array holding the 6x2 array or a plain 1x1x1x6x2 array)
I have seen several pieces of code using numpy.apply_along_axis, and I always have to test the code to see how it works because I haven't understood the axis idea in Python yet.
For example, I tested this simple code from the reference.
I can see that in the first case the first column of each row of the matrix was taken, and in the second case the row itself was considered.
So I built an example to test how this works with an array of matrices (the problem that led me to this axis question), which can also be seen as a 3d matrix where each row is a matrix, right?
a = [[[1,2,3],[2,3,4]],[[4,5,6],[9,8,7]]]
import numpy
data = numpy.array([b for b in a])
def my_func(x):
return (x[0] + x[-1]) * 0.5
b = numpy.apply_along_axis(my_func, 0, data)
b = numpy.apply_along_axis(my_func, 1, data)
Which gave me:
array([[ 2.5, 3.5, 4.5],
[ 5.5, 5.5, 5.5]])
And:
array([[ 1.5, 2.5, 3.5],
[ 6.5, 6.5, 6.5]])
For the first result I got what I expected. But for the second one, I thought I would receive:
array([[ 2., 3.],
[ 5., 8.]])
Then I thought that maybe it should be axis=2, and testing it gave me the previous result. So, I'm wondering how this works so that I can use it properly.
Thank you.
First, data=numpy.array(a) is already enough, no need to use numpy.array([b for b in a]).
data is now a 3D ndarray with the shape (2,2,3), and has 3 axes 0, 1, 2. The first axis has a length of 2, the second axis's length is also 2 and the third axis's length is 3.
Therefore both numpy.apply_along_axis(my_func, 0, data) and numpy.apply_along_axis(my_func, 1, data) will result in a 2D array of shape (2,3). In both cases the shape is (2,3), those of the remaining axes, 2nd and 3rd or 1st and 3rd.
numpy.apply_along_axis(my_func, 2, data) returns the (2,2) shape array you showed, where (2,2) is the shape of the first 2 axes, as you apply along the 3rd axis (by giving index 2).
The way to understand it is that whichever axis you apply along will be 'collapsed' into the shape of what my_func returns, which in this case is a single value. The order and shape of the remaining axes will remain unchanged.
The alternative way to think of it is: apply_along_axis means apply that function to the values along that axis, for each combination of indices on the remaining axis/axes. Fetch the results, and organize them back into the shape of the remaining axis/axes. So, if my_func returns a tuple of 4 values:
def my_func(x):
return (x[0] + x[-1]) * 2,1,1,1
we will expect numpy.apply_along_axis(my_func, 0, data).shape to be (4,2,3).
See also numpy.apply_over_axes for applying a function repeatedly over multiple axes
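The shape rule can be checked for all three axes in one loop. This sketch reuses data and my_func from the question; expected_shape is simply data.shape with the applied axis removed, which is valid here because my_func returns a scalar.

```python
import numpy as np

data = np.array([[[1, 2, 3], [2, 3, 4]],
                 [[4, 5, 6], [9, 8, 7]]])   # shape (2, 2, 3)

def my_func(x):
    # x is a 1-d slice taken along the chosen axis
    return (x[0] + x[-1]) * 0.5

shapes = {}
for axis in range(data.ndim):
    out = np.apply_along_axis(my_func, axis, data)
    expected_shape = data.shape[:axis] + data.shape[axis + 1:]
    shapes[axis] = (out.shape, expected_shape)
    print(axis, out.shape, expected_shape)

along_last = np.apply_along_axis(my_func, 2, data)
print(along_last)
```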
Let there be an array of shape (2,2,3). Axis 0, axis 1 and axis 2 have 2, 2 and 3 entries respectively. These are the indices of the elements of the array:
[
[
[(0,0,0), (0,0,1), (0,0,2)],
[(0,1,0), (0,1,1), (0,1,2)]
],
[
[(1,0,0), (1,0,1), (1,0,2)],
[(1,1,0), (1,1,1), (1,1,2)]
]
]
Now if you apply some operation along some axis, you vary the index along that axis only, keeping the indices along the other two axes constant.
Example: If we apply some operation F along axis 0, then the elements of the result would be
[
[F((0,0,0),(1,0,0)), F((0,0,1),(1,0,1)), F((0,0,2),(1,0,2))],
[F((0,1,0),(1,1,0)), F((0,1,1),(1,1,1)), F((0,1,2),(1,1,2))]
]
Along axis 1:
[
[F((0,0,0),(0,1,0)), F((0,0,1),(0,1,1)), F((0,0,2),(0,1,2))],
[F((1,0,0),(1,1,0)), F((1,0,1),(1,1,1)), F((1,0,2),(1,1,2))]
]
Along axis 2:
[
[F((0,0,0),(0,0,1),(0,0,2)), F((0,1,0),(0,1,1),(0,1,2))],
[F((1,0,0),(1,0,1),(1,0,2)), F((1,1,0),(1,1,1),(1,1,2))]
]
Also the shape of the resulting array can be inferred by omitting the given axis in the shape of given data.
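The "vary one index, hold the others fixed" description can be written out as an explicit loop. This is only a sketch of the idea, not how numpy implements it; F is a stand-in reduction (a plain sum) and the input array is made up.

```python
import numpy as np

data = np.arange(12).reshape(2, 2, 3)

def F(values):
    # Stand-in operation on the 1-d run of values along the chosen axis
    return values.sum()

axis = 1
# Result shape: the input shape with the chosen axis omitted
rest_shape = data.shape[:axis] + data.shape[axis + 1:]
result = np.empty(rest_shape, dtype=data.dtype)

for idx in np.ndindex(*rest_shape):
    # Hold the other indices fixed and take everything along `axis`
    full = idx[:axis] + (slice(None),) + idx[axis:]
    result[idx] = F(data[full])

print(result)
```

For this choice of F the loop gives the same array as data.sum(axis=1).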
Perhaps checking the shape of your array will help clarify which axis is which:
print(data.shape)
# (2, 2, 3)
This means that calling
numpy.apply_along_axis(my_func, 2, data)
should indeed give a 2x2 matrix, namely
array([[ 2., 3.],
[ 5., 8.]])
because the 3rd axis (index 2) has length 3, while the remaining axes are both length 2.