In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient way to do so. E.g. I'll have an NxN matrix that I want to aggregate down to PxP, where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000)
m=np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
and a correspondence like this (schematically):
0 -> 0
1 -> 1
2 -> 0
3 -> 1
that I usually store in a dictionary. This means that indices 0 and 2 (for rows and columns) both get allocated to new index 0 and indices 1 and 3 (for rows and columns) both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
If the input matrix is A and the output matrix is B, then cell B[0, 0] would be the sum of A[0, 0] + A[0, 2] + A[2, 0] + A[2, 2] because new index 0 is made up of original indices 0 and 2.
The aggregation process here would lead to:
array([[ 1+3+9+11, 2+4+10+12 ],
[ 5+7+13+15, 6+8+14+16 ]])
= array([[ 24, 28 ],
[ 40, 44 ]])
I can do this by making an empty matrix of the right size and accumulating over all 4x4=16 cells of the initial matrix in nested loops, but this seems inefficient given how often the vectorised nature of numpy is emphasised. I have also built index sets with np.ix_ and used m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
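For concreteness, the nested-loop accumulation can be collapsed into a single vectorised call with np.add.at (a sketch; it assumes the dict correspondence has been converted to an index array idx first):

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
idx = np.array([0, 1, 0, 1])  # big2small correspondence as an index array

# Accumulate every cell of m into its target cell; np.add.at handles the
# duplicate (row, column) targets that plain fancy-index assignment would drop.
B = np.zeros((2, 2), dtype=m.dtype)
np.add.at(B, (idx[:, None], idx[None, :]), m)
print(B)  # [[24 28]
          #  [40 44]]
```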
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example with the same correspondence but in reverse I would go from:
array([[ 1, 2 ],
[ 3, 4 ]])
to
array([[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ],
[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ]])
where the values simply get replicated into the new cells.
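For the replication direction, plain fancy indexing already does the job (a sketch, again assuming the correspondence is held as an index array):

```python
import numpy as np

small = np.array([[1, 2], [3, 4]])
idx = np.array([0, 1, 0, 1])   # maps each big index to its small index

# np.ix_ builds the row/column index grids, so each big cell (i, j)
# receives small[idx[i], idx[j]].
big = small[np.ix_(idx, idx)]
print(big)
```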
In my attempts so far at the aggregation, I have used pandas with groupby on the index and columns and then extracted the final matrix with, e.g., df.values. However, I don't know the equivalent way to expand a matrix without a lot of unstack and join operations, and I often see people say that pandas is not time-efficient.
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
>>> mnew=np.zeros((2,2))
>>> big2small={0:0, 1:1, 2:0, 3:1}
>>> for i in range(4):
... inew = big2small[i]
... for j in range(4):
... jnew = big2small[j]
... mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
[40., 44.]])
Edit 2: Another comment asked for the aggregation example towards the start to be made more explicit, so I have done so.
Assuming your indices don't have a regular structure, I would try sparse matrices.
import scipy.sparse as ss
import numpy as np
# your current array of indices
g=np.array([[0,0],[1,1],[2,0],[3,1]])
# a sparse matrix of (data=ones, (row_ind=g[:,0], col_ind=g[:,1]))
# it is one for every pair (g[i,0], g[i,1]), zero elsewhere
u=ss.csr_matrix((np.ones(len(g)), (g[:,0], g[:,1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T @ m @ u
Expand
m2 = np.array([[1,2],[3,4]])
u @ m2 @ u.T
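As a quick sanity check, both products reproduce the example matrices from the question (a sketch; np.asarray is wrapped around the results to guarantee a plain dense ndarray):

```python
import numpy as np
import scipy.sparse as ss

g = np.array([[0, 0], [1, 1], [2, 0], [3, 1]])
u = ss.csr_matrix((np.ones(len(g)), (g[:, 0], g[:, 1])))

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
agg = np.asarray(u.T @ m @ u)    # aggregate NxN -> PxP
print(agg)

m2 = np.array([[1, 2], [3, 4]])
exp = np.asarray(u @ m2 @ u.T)   # expand PxP -> NxN
print(exp)
```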
Context
I have the following example-arrays in numpy:
import numpy as np
# All arrays in this example have the shape (15,)
# Note: All values > 0 are unique!
a = np.array([8,5,4,-1,-1, 7,-1,-1,12,11,-1,-1,14,-1,-1])
reference = np.array([0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14])
lookup = np.array([3,6,0,-2,-2,24,-2,-2,24,48,-2,-2,84,-2,-2])
My goal is to find the elements of reference in a, get their indices in a, and use those indices to extract the corresponding elements of lookup.
Finding out the matching elements and their indices works with np.flatnonzero( np.isin() ).
I can also look up the corresponding values:
# Example how to find the index
np.flatnonzero( np.isin( reference, a) )
# -> array([ 4, 5, 7, 8, 11, 12, 14])
# Example how to find corresponding values:
lookup[ np.flatnonzero( np.isin( a, reference) ) ]
# -> array([ 3, 6, 0, 24, 24, 48, 84], dtype=int64)
Problem
I want to fill an array z with the values I looked up, following the reference.
This means that, e.g., the 8th element of z corresponds to the lookup-value for the 8th element in reference (= 8). This value would be 3 (reference[8] -> a[0], because a[0] == 8 here -> lookup[0] -> 3).
z = np.zeros(reference.size)
z[np.flatnonzero(np.isin(reference, a))] = ? -> numpy-array of correctly ordered lookup_values
The expected outcome for z would be:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
I cannot get my head around this; I have to avoid for-loops for performance reasons and need a pure numpy solution (ideally without UDFs).
How can I fill z according with the lookup-values at the correct position?
Note: As stated in the code above, all values a > 0 are unique. Thus, there is no need to take care about the duplicated values for a < 0.
You say that you 'have to avoid for-loops due to performance reasons', so I assume that your real-world data structure a is going to be large (thousands or millions of elements?). Since np.isin(reference, a) effectively compares every element of reference against a, the runtime grows like O(len(reference) * len(a)) for small a (for larger inputs NumPy switches to a sort-based approach, which is still superlinear).
I would strongly suggest using a dict for a, allowing O(1) lookup per element of reference, and looping in Python with for. For sufficiently large a this will outperform the 'fast' vectorised search performed by np.isin.
The most natural way I can think of is to just treat a and lookup as a dictionary:
In [82]: d = dict(zip(a, lookup))
In [83]: np.array([d.get(i, 0) for i in reference])
Out[83]: array([ 0, 0, 0, 0, 0, 6, 0, 24, 3, 0, 0, 48, 24, 0, 84])
This does have a bit of memory overhead but nothing crazy if reference isn't too large.
I actually had an enlightenment.
# Initialize the result
# All non-indexed entries shall be 0
z = np.zeros(reference.size, dtype=np.int64)
Now evaluate which elements in a are relevant:
mask = np.flatnonzero(np.isin(a, reference))
# Short note: If we know that any positive element of a is a number
# Which has to be in the reference, we can also shorten this to
# a simple boolean mask. This will be significantly faster to process.
mask = (a > 0)
Now the following trick: all values a > 0 are unique, and each value corresponds to a position in reference (e.g. the value 8 in a corresponds to position 8 in reference). Thus, we can use the values themselves as indices:
z[ a[mask] ] = lookup[mask]
This results in the desired outcome:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
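Putting the pieces together, a complete runnable sketch of this approach:

```python
import numpy as np

a = np.array([8, 5, 4, -1, -1, 7, -1, -1, 12, 11, -1, -1, 14, -1, -1])
lookup = np.array([3, 6, 0, -2, -2, 24, -2, -2, 24, 48, -2, -2, 84, -2, -2])
reference = np.arange(15)

z = np.zeros(reference.size, dtype=np.int64)
mask = a > 0               # positive values of a are unique...
z[a[mask]] = lookup[mask]  # ...and equal to their target position in reference
print(z)                   # [ 0  0  0  0  0  6  0 24  3  0  0 48 24  0 84]
```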
I have this numpy matrix (ndarray).
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
I want to calculate the column-wise and row-wise sums.
I know this is done by calling respectively
np.sum(mat, axis=0) ### column-wise sums
np.sum(mat, axis=1) ### row-wise sums
but I cannot understand these two calls.
Why is axis 0 giving me the sums column-by-column?!
Shouldn't it be the other way around?
I thought the rows are axis 0, and the columns are axis 1.
The behavior I'm seeing here looks counter-intuitive (but I'm sure it's OK; I guess I'm just missing something important).
I am just looking for some intuitive explanation here.
Thanks in advance.
Intuition around arrays and axes
I want to offer 3 types of intuitions here.
Graphical (How to imagine them visually)
Physical (How they are physically stored)
Logical (How to work with them logically)
Graphical intuition
Consider a numpy array as an n-dimensional object, with elements laid out along each of its n directions.
Axes in this representation are the direction of the tensor. So, a 2D matrix has only 2 axes, while a 4D tensor has 4 axes.
Sum in a given axis can be essentially considered as a reduction in that direction. Imagine a 3D tensor being squashed in such a way that it becomes flat (a 2D tensor). The axis tells us which direction to squash or reduce it in.
Physical intuition
Numpy stores its ndarrays as contiguous blocks of memory: each element is stored sequentially, a fixed number of bytes after the previous one.
(images referenced from this excellent SO post)
So a 3D array, which you can picture as a stack of 2D matrices, is stored in memory as one flat sequence of its elements.
When retrieving an element (or a block of elements), NumPy calculates how many bytes (the stride) it needs to traverse to get to the next element along that direction/axis. For a 3D int64 array of shape (2, 2, 4), say, it traverses 8 bytes along axis=2, 8*4 bytes along axis=1, and 8*8 bytes along axis=0.
Axes in this representation are basically the series of elements you reach by repeatedly jumping a fixed stride. Consider the following array -
import numpy as np

X = np.array([[0, 2, 1, 4, 0, 0, 0],
              [5, 0, 0, 0, 0, 0, 0],
              [8, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0]], dtype=np.int64)
print(X)
print(X.strides)

#Strides (bytes) required to traverse in each axis.
(56, 8)
In the above array, the element 56 bytes after any element is the next element along axis=0, and the element 8 bytes after any element is the next along axis=1 (except at the ends of the array).
Sum or reduction in this regard means taking a sum of every element in such a strided series. So, a sum over axis=0 means summing [0,5,8,0,0,0], [2,0,0,0,0,0], ..., and a sum over axis=1 means summing [0,2,1,4,0,0,0], [5,0,0,0,0,0,0], ...
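This can be checked directly with a self-contained sketch (X is redefined here for completeness; the int64 dtype is forced so the strides come out as (56, 8) on any platform):

```python
import numpy as np

X = np.array([[0, 2, 1, 4, 0, 0, 0],
              [5, 0, 0, 0, 0, 0, 0],
              [8, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0, 0],
              [0, 0, 0, 1, 0, 0, 0]], dtype=np.int64)

print(X.strides)      # (56, 8): one row is 7 int64s = 56 bytes apart
print(X.sum(axis=0))  # sums each 56-byte-strided series (the columns)
print(X.sum(axis=1))  # sums each contiguous 8-byte-strided row
```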
Logical intuition
This interpretation has to do with element groupings. NumPy stores its ndarrays as groups of groups of groups ... of elements. The innermost grouping of elements is the last axis (axis=-1); grouping those creates the axis before it (axis=-2); the final, outermost grouping is axis=0.
An array of shape (3, 2, 5), for example, is 3 groups of 2 groups of 5 elements.
The shape of a NumPy array is determined by this same nesting:
arr_1d = [1, 2, 3]
arr_2d = [[1, 2, 3]]
arr_3d = [[[1, 2, 3]]]
...
Axes in this representation are the group in which elements are stored. The outermost group is axis=0 and the innermost group is axis=-1.
Sum or reduction in this regard means reducing the elements across a specific group or axis. So, a sum over axis=-1 means summing over the innermost groups. Consider a (6, 5, 8) tensor: when I say I want a sum over some axis, I want to reduce the elements lying along that grouping/direction to a single value equal to their sum.
So,
np.sum(arr, axis=-1) will reduce the innermost groups (of length 8) to a single value each, returning shape (6, 5) (or (6, 5, 1) with keepdims=True).
np.sum(arr, axis=-2) will reduce the elements lying along the -2nd axis, returning shape (6, 8) (or (6, 1, 8)).
np.sum(arr, axis=0) will similarly reduce the tensor to shape (5, 8) (or (1, 5, 8)).
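Those shapes can be verified with a quick sketch (keepdims=True is what preserves the reduced axis as length 1):

```python
import numpy as np

arr = np.arange(6 * 5 * 8).reshape(6, 5, 8)

print(np.sum(arr, axis=-1).shape)                 # (6, 5)
print(np.sum(arr, axis=-2).shape)                 # (6, 8)
print(np.sum(arr, axis=0).shape)                  # (5, 8)
print(np.sum(arr, axis=-1, keepdims=True).shape)  # (6, 5, 1)
```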
I hope these 3 intuitions help anyone trying to understand how axes and NumPy tensors work in general, and how to work with them more intuitively.
Let's start with a one dimensional example:
a, b, c, d, e = 0, 1, 2, 3, 4
arr = np.array([a, b, c, d, e])
If you do,
arr.sum(0)
Output
10
That is the sum of the elements of the array
a + b + c + d + e
Now, before moving on to a two-dimensional example, let's clarify that in numpy the sum of two one-dimensional arrays is done element-wise, for example:
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
print(a + b)
Output
[ 7 9 11 13 15]
Now if we change our initial variables to arrays, instead of scalars, to create a two dimensional array and do the sum
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
c = np.array([11, 12, 13, 14, 15])
d = np.array([16, 17, 18, 19, 20])
e = np.array([21, 22, 23, 24, 25])
arr = np.array([a, b, c, d, e])
print(arr.sum(0))
Output
[55 60 65 70 75]
The output is the same as for the 1 dimensional example, i.e. the sum of the elements of the array:
a + b + c + d + e
It's just that now the elements of the array are one-dimensional arrays, and the element-wise sum of those is applied. Now, before explaining the result for axis = 1, let's consider an alternative notation for the sum across axis = 0:
np.array([arr[0, :], arr[1, :], arr[2, :], arr[3, :], arr[4, :]]).sum(0) # [55 60 65 70 75]
That is, we took full slices in every index other than the first dimension. If we swap to:
res = np.array([arr[:, 0], arr[:, 1], arr[:, 2], arr[:, 3], arr[:, 4]]).sum(0)
print(res)
Output
[ 15 40 65 90 115]
We get the result of the sum along axis=1. So, to sum it up, you are always summing elements of the array; the axis indicates how these elements are constructed.
Intuitively, 'axis 0' goes from top to bottom and 'axis 1' goes from left to right. Therefore, when you sum along 'axis 0' you get the column sum, and along 'axis 1' you get the row sum.
As you go along 'axis 0', the row number increases. As you go along 'axis 1' the column number increases.
Think of a 1-dimension array:
mat=array([ 1, 2, 3, 4, 5])
Its items are accessed as mat[0], mat[1], etc.
If you do:
np.sum(mat, axis=0)
it will return 15
In the background, it sums all items while varying the first index (axis=0): mat[0], mat[1], mat[2], mat[3], mat[4].
Now consider a 2-D array:
mat=array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
When you ask for
np.sum(mat, axis=0)
it will again sum all items based on the first index (axis=0) keeping all the rest same. This means that
mat[0][1], mat[1][1], mat[2][1], mat[3][1], mat[4][1]
will give one sum
mat[0][2], mat[1][2], mat[2][2], mat[3][2], mat[4][2]
will give another one, etc
If you consider a 3-D array, the logic will be the same. Every sum will be calculated on the same axis (index) keeping all the rest same. Sums on axis=0 will be produced by:
mat[0][1][1], mat[1][1][1], mat[2][1][1], mat[3][1][1], mat[4][1][1]
etc
Sums on axis=2 will be produced by:
mat[2][3][0], mat[2][3][1], mat[2][3][2], mat[2][3][3], mat[2][3][4]
etc
I hope you understand the logic. To keep things simple in your mind, consider axis = the position of the index in a chained index; e.g. axis=3 on a 7-dimensional array will be:
mat[0][0][0][this is our axis][0][0][0]
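The chain-index picture can be sanity-checked on a small 3-D array (a sketch using an arbitrary 2x3x4 example, not the matrix from the question):

```python
import numpy as np

mat = np.arange(24).reshape(2, 3, 4)

# axis=0: vary the first index, keep the others fixed
assert np.sum(mat, axis=0)[1, 1] == mat[0][1][1] + mat[1][1][1]

# axis=2: vary the last index, keep the others fixed
assert np.sum(mat, axis=2)[1, 2] == sum(mat[1][2][k] for k in range(4))
```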
I am still new to scikit-learn and numpy.
I read the tutorial, but I can't understand how they define array dimensions.
In the following example:
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
The array has five variables in each row, so I expect it to have 5 dimensions.
Why is a.ndim equal to 2?
Given that you're using scikit learn, I'll explain this in the context of machine learning as it may make more sense...
Your feature matrix (which I assume is what you're talking about here) is typically going to be 2-dimensional (hence ndim = 2), because you have rows (which occupy one dimension) and columns (which occupy a second dimension).
In machine learning cases, I typically think of the rows as the samples and columns as the features.
Note, however, that each dimension can have multiple entries (e.g. you will have multiple samples/rows, and multiple columns/features). This tells you the size along that dimension.
So in your case:
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
You have one dimension that has a length/size of 3. And a second dimension that has 5 entries. You can think of this as a feature matrix containing 3 samples and 5 features/variables, for example.
All in all, you have 2 dimensions (ndim = 2), but the specific size of the array is represented by the shape tuple, which tells you how large each of the 2 dimensions are.
Furthermore, an array of shape (3, 5, 2) has 3 dimensions, where the 3rd dimension has size 2.
I think the key here, at least in the 2 dimensional case, is to not think of it as nested lists or nested vectors (which is what it looks like when you consider the []) but to just think of it as a table with rows and columns. The shape tuple and ndim will make more sense when you think of the data structure that way
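To make this concrete, a short sketch (the sample/feature reading of the axes is just the machine-learning framing used above):

```python
import numpy as np

a = np.arange(15).reshape(3, 5)  # think: 3 samples, 5 features
print(a.ndim)    # 2 -> two axes (rows, columns)
print(a.shape)   # (3, 5) -> the size along each of those axes

b = np.zeros((3, 5, 2))          # 3 axes; the third has size 2
print(b.ndim)    # 3
```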
The number of dimensions is the length of the a.shape tuple.
The shape of the ndarray is (3, 5) since it has 3 rows and 5 columns. This is exactly what you were trying to find, isn't it?
I think you are confusing data frames and arrays. An array is defined by how you locate a number, i.e. how many index steps it takes: in this case you locate the row and then the column. In a data frame, by contrast, each instance owns a row, and all the columns describe its attributes (call them variables, or dimensions). So a data frame can use a two-dimensional array to describe multi-dimensional instances.
Array
1 4 5
2 3 6
It takes two steps to find a number, so you can use a[i][j] to locate it.
But in a frame
Length Height Weight
1 23 34 56
2 89 87 63
each row describes an instance with three attribute dimensions, so the frame behaves like a 3-dimensional "array" even though it is stored as a two-dimensional table, not a 3-dimensional array.
How would be elegant solution for summing all 2's from an array based on their indices?
I have this array x = [2 2 2 3 2 2 2 2 3 3 2 3 2 2 3 3 2]
Then I found their positions with
y = np.where(np.isclose(x, 2))
and get a tuple containing an index array like this: y = (array([ 0, 1, 2, 4, 5, 6, 7, 10, 12, 13, 16]),)
So how I can use with numpy to calculate sum of elements in x based on indices in y.
You can simply use indexing to get the corresponding items, then use np.sum:
>>> np.sum(x[np.where(x==2)[0]])
22
Also note that you don't need isclose within np.where; you can just use x == 2. And, as said in the comments, this is not the proper way of doing this task if this is your only problem.
You don't need to use np.where for this - an array of booleans, like the one returned by np.isclose or the various comparison operators, works as an index to another array (provided the sizes match). This means you get all of the 2's with:
>>> x[np.isclose(x, 2)]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
And sum them directly:
>>> x[np.isclose(x, 2)].sum()
22
If x contains only non-negative ints, you could sum the occurrences of each value with
total = np.bincount(x, weights=x)
# array([ 0., 0., 22., 18.])
The value of total[2] is 22 since there are 11 twos in x.
The value of total[3] is 18 since there are 6 threes in x.
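A runnable check of the bincount approach, using the x from the question:

```python
import numpy as np

x = np.array([2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2])

# With weights=x, bin i accumulates i * count(i); bincount returns floats.
total = np.bincount(x, weights=x)
print(total)     # [ 0.  0. 22. 18.]
print(total[2])  # 22.0 -> sum of all the 2's
```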