Numpy median-of-means computation across unequal-sized buckets - python

Assume a numpy array X of shape m x n and type float64. The rows of X need to pass through an element-wise median-of-means computation. Specifically, the m row indices are partitioned into b "buckets", each containing m/b such indices. Next, within each bucket I compute the element-wise mean, and across the resulting means I take a final element-wise median.
An example that clarifies it is
import numpy as np
m = 10
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]
# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)
# Compute the median-of-means
median = np.median(bucket_means, axis=0)
# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)
This program works fine if b divides m, since np.array_split() then partitions the indices into equal parts and the array buckets is a 2D array.
However, it does not work if b does not divide m. In that case, np.array_split() still splits into b buckets, but of unequal sizes, which is fine for my purposes. For example, if b = 3 it will split the indices {0,1,...,9} into [0 1 2 3], [4 5 6] and [7 8 9]. Those arrays cannot be stacked onto one another, so the array buckets is not a 2D array and it cannot be used to index X to form X_bucketed.
How can I make this work for unequal-sized buckets, i.e., to have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?
I cannot fully grasp masked arrays and I am not sure if those can be used here.
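To see the ragged split concretely:
print(np.array_split(np.arange(10), 3))
# [array([0, 1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]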

You can compute each bucket's mean separately, then stack the means and compute the median. Also, you can apply array_split to X directly; there is no need to index it with a split index array (maybe this was your main question?).
m = 11
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
buckets = np.array_split(X, b, axis=0)
# Compute the mean within each bucket
b_means = [np.mean(x, axis=0) for x in buckets]
# Compute the median-of-means
median = np.median(np.array(b_means), axis=0)
print(median) #(10000,) shaped array
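If this is needed in more than one place, the recipe can be wrapped in a small function; a minimal sketch (the name median_of_means and the shuffle flag are my own):
def median_of_means(X, b, shuffle=False):
    """Median across b bucket means of the rows of X; buckets may differ in size."""
    rows = np.arange(X.shape[0])
    if shuffle:
        np.random.shuffle(rows)  # optional, as in "Method 2" above
    bucket_means = [np.mean(chunk, axis=0) for chunk in np.array_split(X[rows], b, axis=0)]
    return np.median(bucket_means, axis=0)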

Related

looking for a simplified approach for calculating pairwise correlation among arrays

I have n arrays of length m, I want to take pairwise Pearson correlation among arrays, and take average of them.
The arrays are saved as a numpy array with shape (n, m)
One way to do it is with two nested for loops. However, I would like to know whether this can be written in Python in a more simplified way?
My current code looks like this:
import numpy as np

sum_dd = 0
counter_dd = 0
for i in range(len(stc_data_roi)):
    for j in range(i + 1, len(stc_data_roi)):
        # np.corrcoef returns a 2 x 2 matrix; take the off-diagonal entry
        sum_dd += np.corrcoef(stc_data_roi[i], stc_data_roi[j])[0, 1]
        counter_dd += 1
Suppose you have n=4 arrays of length m=5
n = 4
m = 5
X = np.random.rand(n, m)
print(X)
array([[0.49017121, 0.58751099, 0.87868983, 0.75328938, 0.16491984],
[0.81175397, 0.26486309, 0.42424784, 0.37485824, 0.66667452],
[0.80901099, 0.84121723, 0.36623767, 0.59928036, 0.22773295],
[0.59606777, 0.63301654, 0.30963807, 0.82884099, 0.95136045]])
Now transpose the array and convert it to a DataFrame, so that each column represents one of the arrays; then use the pandas corr function.
import pandas as pd
df = pd.DataFrame(X.T)
corr_coef = df.corr(method="pearson")
print(corr_coef)
Each column of corr_coef holds the correlation coefficients with the other arrays, including the array itself (where the coefficient is one).
# Sum of the relevant coefficients, as per your code
# Subtract n because we don't want the self-correlations (each equals 1)
# Divide by 2 because each pair is counted twice
corr_coef_sum = (corr_coef.sum().sum() - n) / 2
corr_coef_avg = corr_coef_sum / 6  # 6 = n*(n-1)/2 pairs in our example
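If you'd rather avoid pandas altogether, np.corrcoef treats each row as a variable and builds the same n x n matrix in one call; a minimal NumPy-only sketch of the same computation:
corr = np.corrcoef(X)
iu = np.triu_indices(n, k=1)       # upper triangle, excluding the diagonal
corr_coef_avg = corr[iu].mean()    # average over all n*(n-1)/2 pairs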

way to create a 3d matrix of 2 vectors and 1 matrix

Hello, I have a question regarding a problem I am facing in Python. I was studying tensors and I saw that each row/column of a tensor must have the same size. Is it possible to create a tensor, perhaps a 3d object or matrix, where let's say we have 3 axes: x, y, z?
In the x axis I want to create a vector to work as an index. So let x be from 0 to N
Then on the y axis I want to have N random integer vectors of size m (where m < N), and on the z axis N random matrices of size (m, m).
Is it possible?
My first approach was to create a big vector of size N*m and a big matrix of dimensions (N*m, N*m) where I would store all my random vectors and matrices; then, if I wanted to change, for example, my second vector, I would have to play with the indices. However, is there another way to approach this problem with tensors or numpy that I'm unaware of?
Thank you in advance for your advice.
First vector, N = 3: [1, 2, 3]
Second, N vectors of length m, m = 2:
[[4,5], [6,7], [7,8]]
So, N matrices of size (m, m):
[[[1,1], [2,2]], [[1,1], [2,2]], [[1,1], [2,2]]]
Let's create numpy arrays from them.
import numpy as np
N = 3
m = 2
a = np.array([1,2,3])
b = np.random.randn(N, m)
c = np.random.randn(N, m, m)
Do you see the problem here? The last array c already has 3 dimensions according to your definitions.
Your argument can be simplified.
Let's say our final matrix is -
a = np.zeros((3,2,2)) # 3 dimensions, x,y,z
1) For the first dimension -
a[0,:,:] = 0 # first axis, first index = 0
a[1,:,:] = 1 # first axis, 2nd index = 1
a[2,:,:] = 2 # first axis, 3rd index = 2
2) Now, we need to fill up the rest of the positions, but dimensions don't match up.
So, it's better to create separate tensors for them.
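To illustrate the separate-tensors suggestion (a small sketch reusing the names b, c and m from above), replacing one entry then becomes a plain row assignment instead of index arithmetic on one big array:
b[1] = [9, 9]      # overwrite the second length-m vector
c[1] = np.eye(m)   # overwrite the second (m, m) matrix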

Python: Filtering numpy values based on certain columns

I'm trying to create a method for evaluating co-ordinates for a project that's due in about a week.
Assume that I'm working in a 3D Cartesian co-ordinate system whose points are stored as rows in a numpy array. I am trying to read whether 'z' (n[i, 2]) values exist given the corresponding, predetermined 'x' (n[i, 0]) and 'y' (n[i, 1]) values.
In the case where the values to match are scalars, I am content with the following:
# Given that n is some numpy array
x, y = 2, 3
out = []
for i in range(0, n.shape[0]):
    if n[i, 0] == x and n[i, 1] == y:
        out.append(n[i, 2])
However, the trouble comes in when I have to check whether the values in another numpy array are in the original numpy array 'n'.
# Given that n is the numpy array that is to be searched
# Given that x contains the 'search elements'
out = []
for i in range(0, n.shape[0]):
    for j in range(0, x.shape[0]):
        if n[i, 0] == x[j, 0] and n[i, 1] == x[j, 1]:
            out.append(n[i, 2])
The issue with doing it this way is that the 'n' matrix in my application may well be in excess of 100,000 rows.
Is there a more efficient way of performing this function?
This might be more efficient than nested loops:
out = []
for row in x:
    idx = np.equal(n[:, :2], row).all(1)
    out.extend(n[idx, 2].tolist())
Note this assumes that x is of shape (?, 2). Otherwise, if it has more than two columns, just change row to row[:2] in the loop body.
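A fully vectorized variant of the same idea, using broadcasting instead of the loop (a sketch; note the intermediate boolean array has shape (len(n), len(x), 2), so memory grows with both sizes):
mask = (n[:, None, :2] == x[None, :, :]).all(-1).any(1)
out = n[mask, 2]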
Numpythonic solution without loops.
This solution works provided the x and y coordinates are non-negative.
import numpy as np
# Using a for x and b for n, to avoid confusion with x,y coordinates and array names
a = np.array([[1,2],[3,4]])
b = np.array([[1,2,10],[1,2,11],[3,4,12],[5,6,13],[3,4,14]])
# Adjust the shapes by taking the z coordinate as 0 in a and take the dot product with b transposed
a = np.insert(a,2,0,axis=1)
dot_product = np.dot(a,b.T)
# Reshape a**2 to check the dot product values corresponding to exact values in the x, y coordinates
sum_reshaped = np.sum(a**2,axis=1).reshape(a.shape[0],1)
# Match values for individual elements of a. Can be used if you want z coordinates corresponding to some x, y separately
individual_indices = ( dot_product == np.tile(sum_reshaped, b.shape[0]) )
# Take OR over the column values and keep z if at least one x, y pair is present
indices = np.any(individual_indices, axis=0)
print(b[:,2][indices]) # prints [10 11 12 14]

Permute rows in "slices" of 3d array to match each other

I have a series of 2d arrays where the rows are points in some space. Many similar points occur across all arrays but in different row order. I want to sort the rows so they have the most similar order. Also the points are too different for clustering with K-means or DBSCAN. The problem can also be cast like this. If I stack the arrays into a 3d array, how do I permute the rows to minimize the average standard deviation (SD) along the 2nd axis? What's a good sorting algorithm for this problem?
I've tried the following approaches.
1) Create a reference 2d array and sort the rows in each array to minimize the mean Euclidean distance to it. This, I am afraid, gives biased results.
2) Sort rows in the arrays pairwise, then pairs of pair-medians, then pairs of that, etc. This doesn't really work and I'm not sure why.
3) A third approach could be brute-force optimization, but I try to avoid that since I have multiple sets of arrays to run the procedure on.
This is my code for the 2nd approach (Python):
import numpy as np

def reorder_to(A, B):
    """Reorder rows in A to best match rows in B.

    Input
    -----
    A : N x M numpy.array
    B : N x M numpy.array

    Output
    ------
    perm_order : permutation order
    """
    if A.shape != B.shape:
        print("A and B must have the same shape")
        return None
    N = A.shape[0]
    # Create a distance matrix of distances between rows in A and B
    distance_matrix = np.ones((N, N)) * np.inf
    for i, a in enumerate(A):
        for ii, b in enumerate(B):
            ba = b - a
            distance_matrix[i, ii] = np.sqrt(np.dot(ba, ba))
    # Choose permutation order by smallest distances first
    perm_order = [[] for _ in range(N)]
    for _ in range(N):
        ind = np.argmin(distance_matrix)
        i, ii = ind // N, ind % N
        perm_order[ii] = i
        distance_matrix[i, :] = np.inf
        distance_matrix[:, ii] = np.inf
    return perm_order

def permute_tensor_rows(A):
    """Permute 1d rows in 3d array along the 0th axis to minimize average SD along 2nd axis.

    Input
    -----
    A : numpy.3darray
        Each "slice" in the 2nd direction is an independent array whose rows can be permuted
        to decrease the average SD in the 2nd direction.

    Output
    ------
    A : numpy.3darray
        A with sorted rows in each "slice".
    """
    step = 2
    while step <= A.shape[2]:
        for k in range(0, A.shape[2], step):
            # If last, reorder to previous
            if k + step > A.shape[2]:
                A_kk = A[:, :, k:(k + step)]
                kk_order = reorder_to(np.median(A_kk, axis=2), np.median(A_k, axis=2))
                A[:, :, k:(k + step)] = A[kk_order, :, k:(k + step)]
                continue
            k_0, k_1 = k, k + step // 2
            kk_0, kk_1 = k + step // 2, k + step
            A_k = A[:, :, k_0:k_1]
            A_kk = A[:, :, kk_0:kk_1]
            order = reorder_to(np.median(A_k, axis=2), np.median(A_kk, axis=2))
            A[:, :, k_0:k_1] = A[order, :, k_0:k_1]
        print("Step:", step, "\t ... Average SD:", np.mean(np.std(A, axis=2)))
        step *= 2
    return A
Sorry I should have looked at your code sample; that was very informative.
Seems like this here gives an out-of-the-box solution to your problem:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linear_sum_assignment.html#scipy.optimize.linear_sum_assignment
Only really feasible for a few hundred points at most though, in my experience.
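For what it's worth, a hedged sketch of that route as a drop-in replacement for the greedy matching in reorder_to (the helper name reorder_to_lsa is made up here):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def reorder_to_lsa(A, B):
    """Permutation of A's rows that minimizes the total distance to B's rows."""
    cost = cdist(A, B)                 # pairwise Euclidean distances between rows
    row_ind, col_ind = linear_sum_assignment(cost)
    perm = np.empty(len(B), dtype=int)
    perm[col_ind] = row_ind            # row perm[j] of A is matched to row j of B
    return perm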

Accessing ndarray and discarding invalid positions - Python

I have one question about accessing a matrix position that in fact does not exist.
First, I have a matrix with rows rows and cols columns. From this matrix, I have to get sets of n x n sub-matrices. For example, to get 3 x 3 sub-matrices, I do the following:
for x, y in product(range(1, matrix.rows-1), range(1, matrix.cols-1)):
    bootstrap_3x3 = npr.choice(matrix.data[x-1:x+2, y-1:y+2].flatten(), size=(3, 3), replace=True)
But, as can be seen, I'm not considering the extremes, and I have to. For x = 0 and y = 0, for example, I should consider matrix.data[x:x+2, y:y+2] (the center should be the current x and y), returning a 3 x 3 with the first row/column = 0.
I know that I can achieve this with some if statements. But I guess Python should have a clever way to do this properly.
Thank you in advance.
I would make a new matrix, padded with (n-1)/2 zeros around it:
import numpy as np

rows, cols = 4, 6
n = 3
d = (n - 1) // 2
data = np.arange(rows * cols).reshape(rows, cols)
padded = np.pad(data, d, mode='constant')
for x, y in np.indices(data.shape).reshape(2, -1).T:
    sub = padded[x:x+n, y:y+n]
    print(sub)
    bootstrap_nxn = np.random.choice(sub.ravel(), (n, n))
This assumes n is odd, and that the submatrix center is always within the the original data matrix. If n is even, the center of the submatrix isn't well defined.
If you actually want the submatrix to overlap the data matrix by as little as one row, then you'd need to pad with n-1 zeros (and in that case even vs odd n won't matter).
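On recent NumPy (1.20 or later) the same neighborhoods can also be produced in one call with sliding_window_view; a minimal sketch on the padded array from above:
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(padded, (n, n))  # shape (rows, cols, n, n): one n x n neighborhood per cell of data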
