I have this numpy matrix (ndarray).
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
I want to calculate the column-wise and row-wise sums.
I know this is done by calling respectively
np.sum(mat, axis=0) ### column-wise sums
np.sum(mat, axis=1) ### row-wise sums
but I cannot understand these two calls.
Why is axis 0 giving me the sums column-by-column?!
Shouldn't it be the other way around?
I thought the rows are axis 0, and the columns are axis 1.
What I am seeing as a behavior here looks counter-intuitive
(but I am sure it's OK, I guess I am just missing something important).
I am just looking for some intuitive explanation here.
Thanks in advance.
Intuition around arrays and axes
I want to offer 3 types of intuitions here.
Graphical (How to imagine them visually)
Physical (How they are physically stored)
Logical (How to work with them logically)
Graphical intuition
Consider a numpy array as a n-dimensional object. This n-dimensional object contains elements in each of the directions as below.
Axes in this representation are the direction of the tensor. So, a 2D matrix has only 2 axes, while a 4D tensor has 4 axes.
Sum in a given axis can be essentially considered as a reduction in that direction. Imagine a 3D tensor being squashed in such a way that it becomes flat (a 2D tensor). The axis tells us which direction to squash or reduce it in.
Physical intuition
Numpy stores its ndarrays as contiguous blocks of memory. Each element is stored in a sequential manner every n bytes after the previous.
(images referenced from this excellent SO post)
So if your 3D array looks like this -
Then in memory its stores as -
When retrieving an element (or a block of elements), NumPy calculates how many strides (bytes) it needs to traverse to get the next element in that direction/axis. So, for the above example, for axis=2 it has to traverse 8 bytes (depending on the datatype) but for axis=1 it has to traverse 8*4 bytes, and axis=0 it needs 8*8 bytes.
Axes in this representation is basically the series of next elements after a given stride. Consider the following array -
print(X)
print(X.strides)
[[0 2 1 4 0 0 0]
[5 0 0 0 0 0 0]
[8 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 1 0 0 0 0]
[0 0 0 1 0 0 0]]
#Strides (bytes) required to traverse in each axis.
(56, 8)
In the above array, every element after 56 bytes from any element is the next element in axis=0 and every element after 8 bytes from any element is in axis=1. (except from the last element)
Sum or reduction in this regards means taking a sum of every element in that strided series. So, sum over axis=0 means that I need to sum [0,5,8,0,0,0], [2,0,0,0,0,0], ... and sum over axis=1 means just summing [0 2 1 4 0 0 0] , [5 0 0 0 0 0 0], ...
Logical intuition
This interpretation has to do with element groupings. A numpy stores its ndarrays as groups of groups of groups ... of elements. Elements are grouped together and contain the last axis (axis=-1). Then another grouping over them creates another axis before it (axis=-2). The final outermost group is the axis=0.
These are 3 groups of 2 groups of 5 elements.
Similarly, the shape of a NumPy array is also determined by the same.
1D_array = [1,2,3]
2D_array = [[1,2,3]]
3D_array = [[[1,2,3]]]
...
Axes in this representation are the group in which elements are stored. The outermost group is axis=0 and the innermost group is axis=-1.
Sum or reduction in this regard means that I reducing elements across that specific group or axis. So, sum over axis=-1 means I sum over the innermost groups. Consider a (6, 5, 8) dimensional tensor. When I say I want a sum over some axis, I want to reduce the elements lying in that grouping / direction to a single value that is equal to their sum.
So,
np.sum(arr, axis=-1) will reduce the inner most groups (of length 8) into a single value and return (6,5,1) or (6,5).
np.sum(arr, axis=-2) will reduce the elements that lie in the 1st axis (or -2nd axis) direction and reduce those to a single value returning (6,1,8) or (6,8)
np.sum(arr, axis=0) will similarly reduce the tensor to (1,5,8) or (5,8).
Hope these 3 intuitions are beneficial to anyone trying to understand how axes and NumPy tensors work in general and how to build an intuitive understanding to work better with them.
Let's start with a one dimensional example:
a, b, c, d, e = 0, 1, 2, 3, 4
arr = np.array([a, b, c, d, e])
If you do,
arr.sum(0)
Output
10
That is the sum of the elements of the array
a + b + c + d + e
Now before moving on a 2 dimensional example. Let's clarify that in numpy the sum of two 1 dimensional arrays is done element wise, for example:
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
print(a + b)
Output
[ 7 9 11 13 15]
Now if we change our initial variables to arrays, instead of scalars, to create a two dimensional array and do the sum
a = np.array([1, 2, 3, 4, 5])
b = np.array([6, 7, 8, 9, 10])
c = np.array([11, 12, 13, 14, 15])
d = np.array([16, 17, 18, 19, 20])
e = np.array([21, 22, 23, 24, 25])
arr = np.array([a, b, c, d, e])
print(arr.sum(0))
Output
[55 60 65 70 75]
The output is the same as for the 1 dimensional example, i.e. the sum of the elements of the array:
a + b + c + d + e
Just that now the elements of the arrays are 1 dimensional arrays and the sum of those elements is applied. Now before explaining the results, for axis = 1, let's consider an alternative notation to the notation across axis = 0, basically:
np.array([arr[0, :], arr[1, :], arr[2, :], arr[3, :], arr[4, :]]).sum(0) # [55 60 65 70 75]
That is we took full slices in all other indices that were not the first dimension. If we swap to:
res = np.array([arr[:, 0], arr[:, 1], arr[:, 2], arr[:, 3], arr[:, 4]]).sum(0)
print(res)
Output
[ 15 40 65 90 115]
We get the result of the sum along axis=1. So to sum it up you are always summing elements of the array. The axis will indicate how this elements are constructed.
Intuitively, 'axis 0' goes from top to bottom and 'axis 1' goes from left to right. Therefore, when you sum along 'axis 0' you get the column sum, and along 'axis 1' you get the row sum.
As you go along 'axis 0', the row number increases. As you go along 'axis 1' the column number increases.
Think of a 1-dimension array:
mat=array([ 1, 2, 3, 4, 5])
Its items are called by mat[0], mat[1], etc
If you do:
np.sum(mat, axis=0)
it will return 15
In the background, it sums all items with mat[0], mat[1], mat[2], mat[3], mat[4]
meaning the first index (axis=0)
Now consider a 2-D array:
mat=array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]])
When you ask for
np.sum(mat, axis=0)
it will again sum all items based on the first index (axis=0) keeping all the rest same. This means that
mat[0][1], mat[1][1], mat[2][1], mat[3][1], mat[4][1]
will give one sum
mat[0][2], mat[1][2], mat[2][2], mat[3][2], mat[4][2]
will give another one, etc
If you consider a 3-D array, the logic will be the same. Every sum will be calculated on the same axis (index) keeping all the rest same. Sums on axis=0 will be produced by:
mat[0][1][1],mat[1][1][1],mat[2][1][1],mat[3][1][1],mat[4][1][1]
etc
Sums on axis=2 will be produced by:
mat[2][3][0], mat[2][3][1], mat[2][3][2], mat[2][3][3], mat[2][3][4]
etc
I hope you understand the logic. To keep things simple in your mind, consider axis=position of index in a chain index, eg axis=3 on a 7-mensional array will be:
mat[0][0][0][this is our axis][0][0][0]
Related
In my work I often need to aggregate and expand matrices of various quantities, and I am looking for the most efficient ways to do these actions. E.g. I'll have an NxN matrix that I want to aggregate from NxN into PxP where P < N. This is done using a correspondence between the larger dimensions and the smaller dimensions. Usually, P will be around 100 or so.
For example, I'll have a hypothetical 4x4 matrix like this (though in practice, my matrices will be much larger, around 1000x1000)
m=np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
>>> m
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
and a correspondence like this (schematically):
0 -> 0
1 -> 1
2 -> 0
3 -> 1
that I usually store in a dictionary. This means that indices 0 and 2 (for rows and columns) both get allocated to new index 0 and indices 1 and 3 (for rows and columns) both get allocated to new index 1. The matrix could be anything at all, but the correspondence is always many-to-one when I want to compress.
If the input matrix is A and the output matrix is B, then cell B[0, 0] would be the sum of A[0, 0] + A[0, 2] + A[2, 0] + A[2, 2] because new index 0 is made up of original indices 0 and 2.
The aggregation process here would lead to:
array([[ 1+3+9+11, 2+4+10+12 ],
[ 5+7+13+15, 6+8+14+16 ]])
= array([[ 24, 28 ],
[ 40, 44 ]])
I can do this by making an empty matrix of the right size and looping over all 4x4=16 cells of the initial matrix and accumulating in nested loops, but this seems to be inefficient and the vectorised nature of numpy is always emphasised by people. I have also done it by using np.ix_ to make sets of indices and use m[row_indices, col_indices].sum(), but I am wondering what the most efficient numpy-like way to do it is.
Conversely, what is the sensible and efficient way to expand a matrix using the correspondence the other way? For example with the same correspondence but in reverse I would go from:
array([[ 1, 2 ],
[ 3, 4 ]])
to
array([[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ],
[ 1, 2, 1, 2 ],
[ 3, 4, 3, 4 ]])
where the values simply get replicated into the new cells.
In my attempts so far for the aggregation, I have used approaches with pandas methods with groupby on index and columns and then extracting the final matrix with, e.g. df.values. However, I don't know the equivalent way to expand a matrix, without using a lot of things like unstack and join and so on. And I see people often say that using pandas is not time-efficient.
Edit 1: I was asked in a comment about exactly how the aggregation should be done. This is how it would be done if I were using nested loops and a dictionary lookup between the original dimensions and the new dimensions:
>>> m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
>>> mnew=np.zeros((2,2))
>>> big2small={0:0, 1:1, 2:0, 3:1}
>>> for i in range(4):
... inew = big2small[i]
... for j in range(4):
... jnew = big2small[j]
... mnew[inew, jnew] += m[i, j]
...
>>> mnew
array([[24., 28.],
[40., 44.]])
Edit 2: Another comment asked for the aggregation example towards the start to be made more explicit, so I have done so.
Assuming you don't your indices don't have a regular structure I would do it try sparse matrices.
import scipy.sparse as ss
import numpy as np
# your current array of indices
g=np.array([[0,0],[1,1],[2,0],[3,1]])
# a sparse matrix of (data=ones, (row_ind=g[:,0], col_ind=g[:,1]))
# it is one for every pair (g[i,0], g[i,1]), zero elsewhere
u=ss.csr_matrix((np.ones(len(g)), (g[:,0], g[:,1])))
Aggregate
m=np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
u.T # m # u
Expand
m2 = np.array([[1,2],[3,4]])
u # m2 # u.T
Context
I have the following example-arrays in numpy:
import numpy as np
# All arrays in this example have the shape (15,)
# Note: All values > 0 are unqiue!
a = np.array([8,5,4,-1,-1, 7,-1,-1,12,11,-1,-1,14,-1,-1])
reference = np.array([0,1,2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14])
lookup = np.array([3,6,0,-2,-2,24,-2,-2,24,48,-2,-2,84,-2,-2])
My goal is to find the elements inside the reference in a, then get the index in a and use it to extract the corresponding elements in lookup.
Finding out the matching elements and their indices works with np.flatnonzero( np.isin() ).
I can also lookup the correspodning values:
# Example how to find the index
np.flatnonzero( np.isin( reference, a) )
# -> array([ 4, 5, 7, 8, 11, 12, 14])
# Example how to find corresponding values:
lookup[ np.flatnonzero( np.isin( a, reference) ) ]
# -> array([ 3, 6, 0, 24, 24, 48, 84], dtype=int64)
Problem
I want to fill an array z with the values I looked up, following the reference.
This means, that e.g. the 8th element of z corresponds to the 8th element in the lookup-value for the 8th element in reference (= 8). This value would be 3 (reference[8] -> a[0] because a==8 here -> lookup[0] -> 3).
z = np.zeros(reference.size)
z[np.flatnonzero(np.isin(reference, a))] = ? -> numpy-array of correctly ordered lookup_values
The expected outcome for z would be:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
I cannot get my head around this; I have to avoid for-loops due to performance reasons and would need a pure numpy-solution (best without udfs).
How can I fill z according with the lookup-values at the correct position?
Note: As stated in the code above, all values a > 0 are unique. Thus, there is no need to take care about the duplicated values for a < 0.
You say that you 'have to avoid for-loops due to performance reasons', so I assume that your real-world datastructure a is going to be large (thousands or millions of elements?). Since np.isin(reference, a) performs a linear search in a for every element of reference, your runtime will be O(len(reference) * len(a)).
I would strongly suggest using a dict for a, allowing lookup in O(1) per element of reference, and loop in python using for. For sufficiently large a this will outperform the 'fast' linear search performed by np.isin.
The most natural way I can think of is to just treat a and lookup as a dictionary:
In [82]: d = dict(zip(a, lookup))
In [83]: np.array([d.get(i, 0) for i in reference])
Out[83]: array([ 0, 0, 0, 0, 0, 6, 0, 24, 3, 0, 0, 48, 24, 0, 84])
This does have a bit of memory overhead but nothing crazy if reference isn't too large.
I actually had an enlightenment.
# Initialize the result
# All non-indexed entries shall be 0
z = np.zeros(reference.size, dtype=np.int64)
Now evaluate which elements in a are relevant:
mask = np.flatnonzero(np.isin(a, reference))
# Short note: If we know that any positive element of a is a number
# Which has to be in the reference, we can also shorten this to
# a simple boolean mask. This will be significantly faster to process.
mask = (a > 0)
Now the following trick: All values a > 0 are unique. Additionally, their value corresponds to the position in reference (e.g. 8 in a shall correspond to the 8th position in reference. Thus, we can use the values as index themselves:
z[ a[mask] ] = lookup[mask]
This results in the desired outcome:
z = [ 0 0 0 0 0 6 0 24 3 0 0 48 24 0 84]
In [28]: arr = np.arange(16).reshape((2, 2, 4))
In [29]: arr
Out[29]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
In [32]: arr.transpose((1, 0, 2))
Out[32]:
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
When we pass a tuple of integers to the transpose() function, what happens?
To be specific, this is a 3D array: how does NumPy transform the array when I pass the tuple of axes (1, 0 ,2)? Can you explain which row or column these integers refer to? And what are axis numbers in the context of NumPy?
To transpose an array, NumPy just swaps the shape and stride information for each axis. Here are the strides:
>>> arr.strides
(64, 32, 8)
>>> arr.transpose(1, 0, 2).strides
(32, 64, 8)
Notice that the transpose operation swapped the strides for axis 0 and axis 1. The lengths of these axes were also swapped (both lengths are 2 in this example).
No data needs to be copied for this to happen; NumPy can simply change how it looks at the underlying memory to construct the new array.
Visualising strides
The stride value represents the number of bytes that must be travelled in memory in order to reach the next value of an axis of an array.
Now, our 3D array arr looks this (with labelled axes):
This array is stored in a contiguous block of memory; essentially it is one-dimensional. To interpret it as a 3D object, NumPy must jump over a certain constant number of bytes in order to move along one of the three axes:
Since each integer takes up 8 bytes of memory (we're using the int64 dtype), the stride value for each dimension is 8 times the number of values that we need to jump. For instance, to move along axis 1, four values (32 bytes) are jumped, and to move along axis 0, eight values (64 bytes) need to be jumped.
When we write arr.transpose(1, 0, 2) we are swapping axes 0 and 1. The transposed array looks like this:
All that NumPy needs to do is to swap the stride information for axis 0 and axis 1 (axis 2 is unchanged). Now we must jump further to move along axis 1 than axis 0:
This basic concept works for any permutation of an array's axes. The actual code that handles the transpose is written in C and can be found here.
As explained in the documentation:
By default, reverse the dimensions, otherwise permute the axes according to the values given.
So you can pass an optional parameter axes defining the new order of dimensions.
E.g. transposing the first two dimensions of an RGB VGA pixel array:
>>> x = np.ones((480, 640, 3))
>>> np.transpose(x, (1, 0, 2)).shape
(640, 480, 3)
In C notation, your array would be:
int arr[2][2][4]
which is an 3D array having 2 2D arrays. Each of those 2D arrays has 2 1D array, each of those 1D arrays has 4 elements.
So you have three dimensions. The axes are 0, 1, 2, with sizes 2, 2, 4. This is exactly how numpy treats the axes of an N-dimensional array.
So, arr.transpose((1, 0, 2)) would take axis 1 and put it in position 0, axis 0 and put it in position 1, and axis 2 and leave it in position 2. You are effectively permuting the axes:
0 -\/-> 0
1 -/\-> 1
2 ----> 2
In other words, 1 -> 0, 0 -> 1, 2 -> 2. The destination axes are always in order, so all you need is to specify the source axes. Read off the tuple in that order: (1, 0, 2).
In this case your new array dimensions are again [2][2][4], only because axes 0 and 1 had the same size (2).
More interesting is a transpose by (2, 1, 0) which gives you an array of [4][2][2].
0 -\ /--> 0
1 --X---> 1
2 -/ \--> 2
In other words, 2 -> 0, 1 -> 1, 0 -> 2. Read off the tuple in that order: (2, 1, 0).
>>> arr.transpose((2,1,0))
array([[[ 0, 8],
[ 4, 12]],
[[ 1, 9],
[ 5, 13]],
[[ 2, 10],
[ 6, 14]],
[[ 3, 11],
[ 7, 15]]])
You ended up with an int[4][2][2].
You'll probably get better understanding if all dimensions were of different size, so you could see where each axis went.
Why is the first inner element [0, 8]? Because if you visualize your 3D array as two sheets of paper, 0 and 8 are lined up, one on one paper and one on the other paper, both in the upper left. By transposing (2, 1, 0) you're saying that you want the direction of paper-to-paper to now march along the paper from left to right, and the direction of left to right to now go from paper to paper. You had 4 elements going from left to right, so now you have four pieces of paper instead. And you had 2 papers, so now you have 2 elements going from left to right.
Sorry for the terrible ASCII art. ¯\_(ツ)_/¯
It seems the question and the example originates from the book Python for Data Analysis by Wes McKinney. This feature of transpose is mentioned in Chapter 4.1. Transposing Arrays and Swapping Axes.
For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes (for extra mind bending).
Here "permute" means "rearrange", so rearranging the order of axes.
The numbers in .transpose(1, 0, 2) determines how the order of axes are changed compared to the original. By using .transpose(1, 0, 2), we mean, "Change the 1st axis with the 2nd." If we use .transpose(0, 1, 2), the array will stay the same because there is nothing to change; it is the default order.
The example in the book with a (2, 2, 4) sized array is not very clear since 1st and 2nd axes has the same size. So the end result doesn't seem to change except the reordering of rows arr[0, 1] and arr[1, 0].
If we try a different example with a 3 dimensional array with each dimension having a different size, the rearrangement part becomes more clear.
In [2]: x = np.arange(24).reshape(2, 3, 4)
In [3]: x
Out[3]:
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
In [4]: x.transpose(1, 0, 2)
Out[4]:
array([[[ 0, 1, 2, 3],
[12, 13, 14, 15]],
[[ 4, 5, 6, 7],
[16, 17, 18, 19]],
[[ 8, 9, 10, 11],
[20, 21, 22, 23]]])
Here, original array sizes are (2, 3, 4). We changed the 1st and 2nd, so it becomes (3, 2, 4) in size. If we look closer to see how the rearrangement exactly happened; arrays of numbers seems to have changed in a particular pattern. Using the paper analogy of #RobertB, if we were to take the 2 chunks of numbers, and write each one on sheets, then take one row from each sheet to construct one dimension of the array, we would now have a 3x2x4-sized array, counting from the outermost to the innermost layer.
[ 0, 1, 2, 3] \ [12, 13, 14, 15]
[ 4, 5, 6, 7] \ [16, 17, 18, 19]
[ 8, 9, 10, 11] \ [20, 21, 22, 23]
It could be a good idea to play with different sized arrays, and change different axes to gain a better intuition of how it works.
I ran across this in Python for Data Analysis by Wes McKinney as well.
I will show the simplest way of solving this for a 3-dimensional tensor, then describe the general approach that can be used for n-dimensional tensors.
Simple 3-dimensional tensor example
Suppose you have the (2,2,4)-tensor
[[[ 0 1 2 3]
[ 4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]]
If we look at the coordinates of each point, they are as follows:
[[[ (0,0,0) (0,0,1) (0,0,2) (0,0,3)]
[ (0,1,0) (0,1,1) (0,1,2) (0,1,3)]]
[[ (1,0,0) (1,0,1) (1,0,2) (0,0,3)]
[ (1,1,0) (1,1,1) (1,1,2) (0,1,3)]]
Now suppose that the array above is example_array and we want to perform the operation: example_array.transpose(1,2,0)
For the (1,2,0)-transformation, we shuffle the coordinates as follows (note that this particular transformation amounts to a "left-shift":
(0,0,0) -> (0,0,0)
(0,0,1) -> (0,1,0)
(0,0,2) -> (0,2,0)
(0,0,3) -> (0,3,0)
(0,1,0) -> (1,0,0)
(0,1,1) -> (1,1,0)
(0,1,2) -> (1,2,0)
(0,1,3) -> (1,3,0)
(1,0,0) -> (0,0,1)
(1,0,1) -> (0,1,1)
(1,0,2) -> (0,2,1)
(0,0,3) -> (0,3,0)
(1,1,0) -> (1,0,1)
(1,1,1) -> (1,1,1)
(1,1,2) -> (1,2,1)
(0,1,3) -> (1,3,0)
Now, for each original value, place it into the shifted coordinates in the result matrix.
For instance, the value 10 has coordinates (1, 0, 2) in the original matrix and will have coordinates (0, 2, 1) in the result matrix. It is placed into the first 2d tensor submatrix in the third row of that submatrix, in the second column of that row.
Hence, the resulting matrix is:
array([[[ 0, 8],
[ 1, 9],
[ 2, 10],
[ 3, 11]],
[[ 4, 12],
[ 5, 13],
[ 6, 14],
[ 7, 15]]])
General n-dimensional tensor approach
For n-dimensional tensors, the algorithm is the same. Consider all of the coordinates of a single value in the original matrix. Shuffle the axes for that individual coordinate. Place the value into the resulting, shuffled coordinates in the result matrix. Repeat for all of the remaining values.
To summarise a.transpose()[i,j,k] = a[k,j,i]
a = np.array( range(24), int).reshape((2,3,4))
a.shape gives (2,3,4)
a.transpose().shape gives (4,3,2) shape tuple is reversed.
when is a tuple parameter is passed axes are permuted according to the tuple.
For example
a = np.array( range(24), int).reshape((2,3,4))
a[i,j,k] equals a.transpose((2,0,1))[k,i,j]
axis 0 takes 2nd place
axis 1 takes 3rd place
axis 2 tales 1st place
of course we need to take care that values in tuple parameter passed to transpose are unique and in range(number of axis)
I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500),dtype=float)
# fill array with values.
bins = numpy.linspace(0,50,50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0,len(bins)):
for j in range(0,len(bins)):
k = len(data_matrix[digitized == i:digitized == j]) # <-not does not work
binned_data[i:j] = k
P.S. the [digitized == i] notation on an array will return an array of binary values. I cannot find documentation on this notation anywhere. A link would be appreciated.
You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24, 42],
[ 96, 114]])
If a has the shape m, n, the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
At first I was also going to suggest that you use np.histogram2d rather than reinventing the wheel, but then I realized that it would be overkill to use that and would need some hacking still.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: going over your output submatrix and summing up each subblock of your input:
import numpy as np
def submatsum(data,n,m):
# return a matrix of shape (n,m)
bs = data.shape[0]//n,data.shape[1]//m # blocksize averaged over
return np.reshape(np.array([np.sum(data[k1*bs[0]:(k1+1)*bs[0],k2*bs[1]:(k2+1)*bs[1]]) for k1 in range(n) for k2 in range(m)]),(n,m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of 2x3-reduced matrix, assume congruity
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24 42]
[ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to make sure it's python3-compatible, but in case of python2 you can just use / for division (due to the numbers involved being integers).
Another solution is to have a look at the binArray function on the comments here:
Binning a numpy array
To use your example :
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sum all square of size 10x10 in data_matrix (of size 500x500) to obtain a single value per square in binned_data (of size 50x50).
Hope this help !
I am attempting to generalize some Python code to operate on arrays of arbitrary dimension. The operations are applied to each vector in the array. So for a 1D array, there is simply one operation, for a 2-D array it would be both row and column-wise (linearly, so order does not matter). For example, a 1D array (a) is simple:
b = operation(a)
where 'operation' is expecting a 1D array. For a 2D array, the operation might proceed as
for ii in range(0,a.shape[0]):
b[ii,:] = operation(a[ii,:])
for jj in range(0,b.shape[1]):
c[:,ii] = operation(b[:,ii])
I would like to make this general where I do not need to know the dimension of the array beforehand, and not have a large set of if/elif statements for each possible dimension.
Solutions that are general for 1 or 2 dimensions are ok, though a completely general solution would be preferred. In reality, I do not imagine needing this for any dimension higher than 2, but if I can see a general example I will learn something!
Extra information:
I have a matlab code that uses cells to do something similar, but I do not fully understand how it works. In this example, each vector is rearranged (basically the same function as fftshift in numpy.fft). Not sure if this helps, but it operates on an array of arbitrary dimension.
function aout=foldfft(ain)
nd = ndims(ain);
for k = 1:nd
nx = size(ain,k);
kx = floor(nx/2);
idx{k} = [kx:nx 1:kx-1];
end
aout = ain(idx{:});
In Octave, your MATLAB code does:
octave:19> size(ain)
ans =
2 3 4
octave:20> idx
idx =
{
[1,1] =
1 2
[1,2] =
1 2 3
[1,3] =
2 3 4 1
}
and then it uses the idx cell array to index ain. With these dimensions it 'rolls' the size 4 dimension.
For 5 and 6 the index lists would be:
2 3 4 5 1
3 4 5 6 1 2
The equivalent in numpy is:
In [161]: ain=np.arange(2*3*4).reshape(2,3,4)
In [162]: idx=np.ix_([0,1],[0,1,2],[1,2,3,0])
In [163]: idx
Out[163]:
(array([[[0]],
[[1]]]), array([[[0],
[1],
[2]]]), array([[[1, 2, 3, 0]]]))
In [164]: ain[idx]
Out[164]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
Besides the 0 based indexing, I used np.ix_ to reshape the indexes. MATLAB and numpy use different syntax to index blocks of values.
The next step is to construct [0,1],[0,1,2],[1,2,3,0] with code, a straight forward translation.
I can use np.r_ as a short cut for turning 2 slices into an index array:
In [201]: idx=[]
In [202]: for nx in ain.shape:
kx = int(np.floor(nx/2.))
kx = kx-1;
idx.append(np.r_[kx:nx, 0:kx])
.....:
In [203]: idx
Out[203]: [array([0, 1]), array([0, 1, 2]), array([1, 2, 3, 0])]
and pass this through np.ix_ to make the appropriate index tuple:
In [204]: ain[np.ix_(*idx)]
Out[204]:
array([[[ 1, 2, 3, 0],
[ 5, 6, 7, 4],
[ 9, 10, 11, 8]],
[[13, 14, 15, 12],
[17, 18, 19, 16],
[21, 22, 23, 20]]])
In this case, where 2 dimensions don't roll anything, slice(None) could replace those:
In [210]: idx=(slice(None),slice(None),[1,2,3,0])
In [211]: ain[idx]
======================
np.roll does:
indexes = concatenate((arange(n - shift, n), arange(n - shift)))
res = a.take(indexes, axis)
np.apply_along_axis is another function that constructs an index array (and turns it into a tuple for indexing).
If you are looking for a programmatic way to index the k-th dimension an n-dimensional array, then numpy.take might help you.
An implementation of foldfft is given below as an example:
In[1]:
import numpy as np
def foldfft(ain):
result = ain
nd = len(ain.shape)
for k in range(nd):
nx = ain.shape[k]
kx = (nx+1)//2
shifted_index = list(range(kx,nx)) + list(range(kx))
result = np.take(result, shifted_index, k)
return result
a = np.indices([3,3])
print("Shape of a = ", a.shape)
print("\nStarting array:\n\n", a)
print("\nFolded array:\n\n", foldfft(a))
Out[1]:
Shape of a = (2, 3, 3)
Starting array:
[[[0 0 0]
[1 1 1]
[2 2 2]]
[[0 1 2]
[0 1 2]
[0 1 2]]]
Folded array:
[[[2 0 1]
[2 0 1]
[2 0 1]]
[[2 2 2]
[0 0 0]
[1 1 1]]]
You could use numpy.ndarray.flat, which allows you to linearly iterate over a n dimensional numpy array. Your code should then look something like this:
b = np.asarray(x)
for i in range(len(x.flat)):
b.flat[i] = operation(x.flat[i])
The folks above provided multiple appropriate solutions. For completeness, here is my final solution. In this toy example for the case of 3 dimensions, the function 'ops' replaces the first and last element of a vector with 1.
import numpy as np
def ops(s):
s[0]=1
s[-1]=1
return s
a = np.random.rand(4,4,3)
print '------'
print 'Array a'
print a
print '------'
for ii in np.arange(a.ndim):
a = np.apply_along_axis(ops,ii,a)
print '------'
print ' Axis',str(ii)
print a
print '------'
print ' '
The resulting 3D array has a 1 in every element on the 'border' with the numbers in the middle of the array unchanged. This is of course a toy example; however ops could be any arbitrary function that operates on a 1D vector.
Flattening the vector will also work; I chose not to pursue that simply because the book-keeping is more difficult and apply_along_axis is the simplest approach.
apply_along_axis reference page