Grouping 2D numpy array in average - python

I am trying to group a numpy array into smaller size by taking average of the elements. Such as take average foreach 5x5 sub-arrays in a 100x100 array to create a 20x20 size array. As I have a huge data need to manipulate, is that an efficient way to do that?

I have tried this for smaller array, so test it with yours:
import numpy as np
nbig = 100
nsmall = 20
big = np.arange(nbig * nbig).reshape([nbig, nbig]) # 100x100
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
An example with 6x6 -> 3x3:
nbig = 6
nsmall = 3
big = np.arange(36).reshape([6,6])
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])

This is pretty straightforward, although I feel like it could be faster:
from __future__ import division
import numpy as np
Norig = 100
Ndown = 20
step = Norig//Ndown
assert step == Norig/Ndown # ensure Ndown is an integer factor of Norig
x = np.arange(Norig*Norig).reshape((Norig,Norig)) #for testing
y = np.empty((Ndown,Ndown)) # for testing
for yr,xr in enumerate(np.arange(0,Norig,step)):
for yc,xc in enumerate(np.arange(0,Norig,step)):
y[yr,yc] = np.mean(x[xr:xr+step,xc:xc+step])
You might also find scipy.signal.decimate interesting. It applies a more sophisticated low-pass filter than simple averaging before downsampling the data, although you'd have to decimate one axis, then the other.

Average a 2D array over subarrays of size NxN:
height, width = data.shape
data = average(split(average(split(data, width // N, axis=1), axis=-1), height // N, axis=1), axis=-1)

Note that eumiro's approach does not work for masked arrays as .mean(3).mean(1) assumes that each mean along axis 3 was computed from the same number of values. If there are masked elements in your array, this assumption does not hold any more. In that case, you have to keep track of the number of values used to compute .mean(3) and replace .mean(1) by a weighted mean. The weights are the normalized number of values used to compute .mean(3).
Here is an example:
import numpy as np
def gridbox_mean_masked(data, Nbig, Nsmall):
# Reshape data
rshp = data.reshape([Nsmall, Nbig//Nsmall, Nsmall, Nbig//Nsmall])
# Compute mean along axis 3 and remember the number of values each mean
# was computed from
mean3 = rshp.mean(3)
count3 = rshp.count(3)
# Compute weighted mean along axis 1
mean1 = (count3*mean3).sum(1)/count3.sum(1)
return mean1
# Define test data
big = np.ma.array([[1, 1, 2],
[1, 1, 1],
[1, 1, 1]])
big.mask = [[0, 0, 0],
[0, 0, 1],
[0, 0, 0]]
Nbig = 3
Nsmall = 1
# Compute gridbox mean
print gridbox_mean_masked(big, Nbig, Nsmall)

Related

How to square a row in NumPy to go from a 2-d array to a 3-d one where each row was squared?

I am trying to figure out a way to get the rows of a 2-d matrix squared.
The behaviour I would like to have is something like this:
in[1] import numpy as np
in[2] a = np.array([[1,2,3],
[4,5,6]])
in[3] some_function(a) # for each row, row.reshape(-1,1); row # row.T
out[1] array([[[ 1, 2, 3],
[ 2, 4, 6],
[ 3, 6, 9]],
[[16, 20, 24],
[20, 25, 30],
[24, 30, 36]]])
I need this to make a softmax derivative for auto diff in a manual implementation of a feed-forward neural network.
The same derivative would look like this for a point:
in[4] def softmax_derivative(x):
in[5] s = x.reshape(-1,1)
in[6] return np.diagflat(s) - np.dot(s,s.T)
Instead of np.diagflat I am using:
in[7] matrix = np.array([[1,2,3],
[4,5,6])
in[8] matrix.shape
out[2] (2,3)
in[9] Id = np.eye(matrix.shape[-1])
in[10] (matrix[...,np.newaxis] * Id).shape
out[3] (2,3,3)
The reason I want a 3-d array of the squared rows is to subtract it from the 3-d array of the diagonal rows which I get in the same way as in the above example.
While I know that I can get the same multiplication result from
in[11] def get_squared_rows(matrix):
in[12] s = matrix.reshape(-1,1)
in[13] return s # s.T
I do not know how to get it to the correct shape in a fast way. Since, yes, the correct 2-d arrays are a part of the matrix on the diagonal, I have to get them together to match the shape of the diagonal 3-d matrix I got. This means I would somehow both have to extract the correct matrices and then turn that into a 3-d array of shape (n_samples,row,row). I do not know how to do that any faster than just a simple loop through all rows of the input matrix.
Use broadcasting:
>>> a[:, None, :] * a[:, :, None]
array([[[ 1, 2, 3],
[ 2, 4, 6],
[ 3, 6, 9]],
[[16, 20, 24],
[20, 25, 30],
[24, 30, 36]]])

identifying sub-arrays in numpy

I have two two dimensional arrays a and b (#columns of a <= #columns in b). I would like to find an efficient way of matching a row in array a to a contiguous part of a row in array b.
a = np.array([[ 25, 28],
[ 84, 97],
[105, 24],
[ 28, 900]])
b = np.array([[ 25, 28, 84, 97],
[ 22, 25, 28, 900],
[ 11, 12, 105, 24]])
The output should be np.array([[0,0], [0,1], [1,0], [2,2], [3,1]]). Row 0 in array a matches Row 0 in array b (first two positions). Row 1 in array a matches row 0 in array b (third and fourth positions).
We can leverage np.lib.stride_tricks.as_strided based scikit-image's view_as_windows for efficient patch extraction, and then compare those patches against each row off a, all of it in a vectorized manner. Then, get the matching indices with np.argwhere -
# a and b from posted question
In [325]: from skimage.util.shape import view_as_windows
In [428]: w = view_as_windows(b,(1,a.shape[1]))
In [429]: np.argwhere((w == a).all(-1).any(-2))[:,::-1]
Out[429]:
array([[0, 0],
[1, 0],
[0, 1],
[3, 1],
[2, 2]])
Alternatively, we could get the indices by the order of rows in a by pushing forward the first axis of a while performing broadcasted comparisons -
In [444]: np.argwhere((w[:,:,0] == a[:,None,None,:]).all(-1).any(-1))
Out[444]:
array([[0, 0],
[0, 1],
[1, 0],
[2, 2],
[3, 1]])
Another way I can think of is to loop over each row in a and perform a 2D correlation between the b which you can consider as a 2D signal a row in a.
We would find the results which are equal to the sum of squares of all values in a. If we subtract our correlation result with this sum of squares, we would find matches with a zero result. Any rows that give you a 0 result would mean that the subarray was found in that row. If you are using floating-point numbers for example, you may want to compare with some small threshold that is just above 0.
If you can use SciPy, the scipy.signal.correlate2d method is what I had in mind.
import numpy as np
from scipy.signal import correlate2d
a = np.array([[ 25, 28],
[ 84, 97],
[105, 24]])
b = np.array([[ 25, 28, 84, 97],
[ 22, 25, 28, 900],
[ 11, 12, 105, 24]])
EPS = 1e-8
result = []
for (i, row) in enumerate(a):
out = correlate2d(b, row[None,:], mode='valid') - np.square(row).sum()
locs = np.where(np.abs(out) <= EPS)[0]
unique_rows = np.unique(locs)
for res in unique_rows:
result.append((i, res))
We get:
In [32]: result
Out[32]: [(0, 0), (0, 1), (1, 0), (2, 2)]
The time complexity of this could be better, especially since we're looping over each row of a to find any subarrays in b.

getting multiple array after performing subtraction operation within array elements

import numpy as np
m = []
k = []
a = np.array([[1,2,3,4,5,6],[50,51,52,40,20,30],[60,71,82,90,45,35]])
for i in range(len(a)):
m.append(a[i, -1:])
for j in range(len(a[i])-1):
n = abs(m[i] - a[i,j])
k.append(n)
k.append(m[i])
print(k)
Expected Output in k:
[5,4,3,2,1,6],[20,21,22,10,10,30],[25,36,47,55,10,35]
which is also a numpy array.
But the output that I am getting is
[array([5]), array([4]), array([3]), array([2]), array([1]), array([6]), array([20]), array([21]), array([22]), array([10]), array([10]), array([30]), array([25]), array([36]), array([47]), array([55]), array([10]), array([35])]
How can I solve this situation?
You want to subtract the last column of each sub array from themselves. Why don't you use a vectorized approach? You can do all the subtractions at once by subtracting the last column from the rest of the items and then column_stack together with unchanged version of the last column. Also note that you need to change the dimension of the last column inorder to be subtractable from the 2D array. For that sake we can use broadcasting.
In [71]: np.column_stack((abs(a[:, :-1] - a[:, None, -1]), a[:,-1]))
Out[71]:
array([[ 5, 4, 3, 2, 1, 6],
[20, 21, 22, 10, 10, 30],
[25, 36, 47, 55, 10, 35]])

How to bin a 2D array in numpy?

I'm new to numpy and I have a 2D array of objects that I need to bin into a smaller matrix and then get a count of the number of objects in each bin to make a heatmap. I followed the answer on this thread to create the bins and do the counts for a simple array but I'm not sure how to extend it to 2 dimensions. Here's what I have so far:
data_matrix = numpy.ndarray((500,500),dtype=float)
# fill array with values.
bins = numpy.linspace(0,50,50)
digitized = numpy.digitize(data_matrix, bins)
binned_data = numpy.ndarray((50,50))
for i in range(0,len(bins)):
for j in range(0,len(bins)):
k = len(data_matrix[digitized == i:digitized == j]) # <-not does not work
binned_data[i:j] = k
P.S. the [digitized == i] notation on an array will return an array of binary values. I cannot find documentation on this notation anywhere. A link would be appreciated.
You can reshape the array to a four dimensional array that reflects the desired block structure, and then sum along both axes within each block. Example:
>>> a = np.arange(24).reshape(4, 6)
>>> a
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
>>> a.reshape(2, 2, 2, 3).sum(3).sum(1)
array([[ 24, 42],
[ 96, 114]])
If a has the shape m, n, the reshape should have the form
a.reshape(m_bins, m // m_bins, n_bins, n // n_bins)
At first I was also going to suggest that you use np.histogram2d rather than reinventing the wheel, but then I realized that it would be overkill to use that and would need some hacking still.
If I understand correctly, you just want to sum over submatrices of your input. That's pretty easy to brute force: going over your output submatrix and summing up each subblock of your input:
import numpy as np
def submatsum(data,n,m):
# return a matrix of shape (n,m)
bs = data.shape[0]//n,data.shape[1]//m # blocksize averaged over
return np.reshape(np.array([np.sum(data[k1*bs[0]:(k1+1)*bs[0],k2*bs[1]:(k2+1)*bs[1]]) for k1 in range(n) for k2 in range(m)]),(n,m))
# set up dummy data
N,M = 4,6
data_matrix = np.reshape(np.arange(N*M),(N,M))
# set up size of 2x3-reduced matrix, assume congruity
n,m = N//2,M//3
reduced_matrix = submatsum(data_matrix,n,m)
# check output
print(data_matrix)
print(reduced_matrix)
This prints
print(data_matrix)
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
print(reduced_matrix)
[[ 24 42]
[ 96 114]]
which is indeed the result for summing up submatrices of shape (2,3).
Note that I'm using // for integer division to make sure it's python3-compatible, but in case of python2 you can just use / for division (due to the numbers involved being integers).
Another solution is to have a look at the binArray function on the comments here:
Binning a numpy array
To use your example :
data_matrix = numpy.ndarray((500,500),dtype=float)
binned_data = binArray(data_matrix, 0, 10, 10, np.sum)
binned_data = binArray(binned_data, 1, 10, 10, np.sum)
The result sum all square of size 10x10 in data_matrix (of size 500x500) to obtain a single value per square in binned_data (of size 50x50).
Hope this help !

Applying an operation to the rows of a matrix, except for some rows

I have a numpy matrix M and I need to apply some operations to all the rows of the matrix, except for a determined rows.
For example, suppose I have rows [3,5] whose elements should be avoided from an operation like M[:,8] = 4. So I want to have all the rows of the 8th column to be set to 4, but I want to avoid doing so to rows 3 and 5. How can I do this in numpy?
Edit: basically I need that to avoid a division by zero when doing a normalization by the sum of the elements of a row. Some rows are all zeros, so doing the summation (which is zero) then dividing by the summation will give a division by zero. What I'm doing is that I find out which rows are all zeros and then I want not to do the normalization operation for those specific rows.
Perhaps something like this?
>>> import numpy as np
>>> M = np.arange(32).reshape(8, 4)
>>> ignore = {3, 5}
>>> rest = [i for i in xrange(M.shape[0]) if i not in ignore]
>>> M[rest, 3] = 4
>>> M
array([[ 0, 1, 2, 4],
[ 4, 5, 6, 4],
[ 8, 9, 10, 4],
[12, 13, 14, 15],
[16, 17, 18, 4],
[20, 21, 22, 23],
[24, 25, 26, 4],
[28, 29, 30, 4]])
Based on your edit, in order to solve your specific problem, where you seem to manipulating a matrix with non-negative entries, you may exploit the following trick
import numpy as np
rng = np.random.RandomState(42)
M = rng.randn(10, 10) ** 2
M[[0, 5]] = 0. # set 2 lines to 0
M_norm = M / (M.sum(axis=1) + 1e-18)[:, np.newaxis]
Obviously this result is not exact, but exact enough to not notice the difference. To make it slightly better, you can also write
M_norm = M / np.maximum(M.sum(axis=1), 1e-18)[:, np.newaxis]
If this still isn't sufficient, and you want it exact, for the general case (negativity allowed) you can write
row_sums = M.sum(axis=1)
row_sums[row_sums == 0] = 1.
M_norm = M / row_sums[:, np.newaxis] # dividing the zeros by 1 still yields 0
To add some robustness, you could also do
tolerance = 1e-6
row_sums = M.sum(axis=1)
OK_rows = np.abs(row_sums) > tolerance
M_norm = np.zeros_like(M)
M_norm[OK_rows] = M[OK_rows] / row_sums[OK_rows][:, np.newaxis]

Categories

Resources