Finding mean of centers in k-means - python

Since the K-means algorithm is susceptible to the order of the columns, I am executing it 100 times and storing the final centers of each run in an array.
I want to calculate the mean centers of the array, but I am getting only a single value using this:
a =np.mean(center_array)
vmean = np.vectorize(np.mean)
vmean(a)
How can I calculate the median centers?
This is the structure of my centers array
[[ 1.39450598, 0.65213679, 1.37195399, 0.02577591, 0.17637011,
0.44572744, 1.50699298, -0.02577591, -0.17637011, -0.48222273,
-0.14651225, -0.12975152],
[-0.40910528, -0.18480587, -0.40459059, 1.00860933, -0.91902229,
-0.13536744, -0.45108061, -1.00860933, 0.91902229, 0.11367937,
0.19771608, 0.23722015],
[-0.46264585, -0.23289607, -0.45219009, 0.0290917 , 1.08811289,
-0.14996175, -0.48998741, -0.0290917 , -1.08811289, 0.19925625,
-0.14748408, -0.1943812 ]]), array([[ 0.20004497, -0.12493111, 0.99146416, -0.91902229, -0.17537297,
0.11154588, -0.41348193, -0.99146416, -0.45307083, -0.4091783 ,
0.18579957, 0.91902229]],

You need to specify the axis that indexes the final centers of each iteration; otherwise np.mean is calculated over the flattened array, resulting in a single value. From the documentation:
Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis.
import numpy as np
np.random.seed(42)
x = np.random.rand(5,3)
out1 = x.mean()
print(out1, out1.shape)
# 0.49456456164468965 ()
out2 = x.mean(axis=1) # rows
print(out2, out2.shape)
# [0.68574946 0.30355721 0.50845826 0.56618897 0.40886891] (5,)
out3 = x.mean(axis=0) # columns
print(out3, out3.shape)
# [0.51435949 0.44116654 0.52816766] (3,)
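Applied to the question, a minimal sketch might look like the following. It assumes the stored centers can be stacked into one array of shape (n_runs, n_clusters, n_features); the names n_runs, n_clusters and n_features and the random data are placeholders, not the asker's real centers. It also assumes the clusters appear in the same order in every run, which K-means does not guarantee, so the centers may need to be matched across runs before averaging.

```python
import numpy as np

# Hypothetical stand-in for the stored centers: 100 runs, 3 clusters, 12 features.
rng = np.random.default_rng(0)
n_runs, n_clusters, n_features = 100, 3, 12
center_array = rng.normal(size=(n_runs, n_clusters, n_features))

# Average over the run axis (axis 0) so that one center per cluster remains.
mean_centers = center_array.mean(axis=0)

# np.median accepts the same axis argument if a median is wanted instead.
median_centers = np.median(center_array, axis=0)

print(mean_centers.shape, median_centers.shape)
```

If the centers are kept in a Python list of 2D arrays, np.stack(list_of_centers) would produce the 3D array used here, provided every run has the same number of clusters.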

Related

Moving Window Calculation Across Multiple Arrays

I have several two-dimensional data arrays loaded into NumPy arrays, all of which have identical dimensions. These shared dimensions are 2606x and 1228y.
I am interested in computing calculations between the first two arrays (a1 & a2) across a moving window, using a window sized 2x by 2y, with the resultant calculation then applied to the third array. Specifically, the workflow would be:
Finding the maximum & minimum value of a1 within the current position of this moving window
Selecting the corresponding array indices of these values
Extracting the values at these indices of a2
Cast the calculation result to each of the indices in the third array (a3) inside the moving window.
I know that this process involves the following pieces of code to obtain the values I require:
idx1 = np.where(a1 == a1.max())
idx2 = np.where(a1 == a1.min())
val1 = a2[idx1[0], idx1[1]]
val2 = a2[idx2[0], idx2[1]]
What additional code is required to perform this moving window along the identically sized arrays?
Since your array shape is divisible by your window size, you can use numpy.reshape to split your array up into little windows such that your original array shape of (2606, 1228) becomes (2606/2, 2, 1228/2, 2).
If numpy.argmin accepted sequences of axes, this would be easier, but since it only accepts a single axis (or None, but we don't want that), we need to compress the two window axes into a single axis. To do that, we use numpy.moveaxis to make the shape (2606/2, 1228/2, 2, 2) and then numpy.reshape again to flatten the last two axes into (2606/2, 1228/2, 4).
With that headache over with, we can then use numpy.argmin and numpy.argmax on the last axis to compute the indices you're interested in and use advanced indexing to write the corresponding value of a2 to a3. After that, we just have to undo the reshape and moveaxis operations that were done to a3.
import numpy as np
shape = (4, 6)
a1 = np.random.random(shape)
a2 = np.random.random(shape)
a3 = np.zeros(shape)
win_x = 2
win_y = 2
shape_new = (shape[0] // win_x, win_x, shape[1] // win_y, win_y)
a1_r = np.moveaxis(a1.reshape(shape_new), 1, 2).reshape(*shape_new[::2], -1)
a2_r = np.moveaxis(a2.reshape(shape_new), 1, 2).reshape(*shape_new[::2], -1)
a3_r = np.moveaxis(a3.reshape(shape_new), 1, 2).reshape(*shape_new[::2], -1)
index_x, index_y = np.indices(shape_new[::2])
index_min = np.argmin(a1_r, axis=-1)
index_max = np.argmax(a1_r, axis=-1)
a3_r[index_x, index_y, index_min] = a2_r[index_x, index_y, index_min]
a3_r[index_x, index_y, index_max] = a2_r[index_x, index_y, index_max]
a3 = np.moveaxis(a3_r.reshape(*shape_new[::2], win_x, win_y), 2, 1).reshape(shape)
print(a1)
print()
print(a2)
print()
print(a3)
Outputs
[[0.54885307 0.74457945 0.84943538 0.14139329 0.68678556 0.03460323]
[0.74031057 0.5499962 0.03148748 0.13936734 0.05006111 0.88850868]
[0.97789608 0.13262023 0.76350358 0.74640822 0.7918286 0.80675845]
[0.35784598 0.20918229 0.82880072 0.06051794 0.0825886 0.6398353 ]]
[[0.66176657 0.10120202 0.15306892 0.05963046 0.79057051 0.08837686]
[0.78550049 0.09918834 0.00213652 0.61053454 0.42966757 0.25952916]
[0.00387273 0.78247644 0.65549303 0.39351233 0.11002493 0.55652453]
[0.06047582 0.87997514 0.60820023 0.06705212 0.34581512 0.93504438]]
[[0.66176657 0.10120202 0.15306892 0. 0. 0.08837686]
[0. 0. 0.00213652 0. 0. 0.25952916]
[0.00387273 0.78247644 0. 0. 0. 0.55652453]
[0. 0. 0.60820023 0.06705212 0.34581512 0. ]]
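One way to gain confidence in the reshape and moveaxis bookkeeping is to compare it against a plain double loop over the windows. The sketch below wraps the same steps in a helper and checks it against a loop version; the function names window_copy_vectorized and window_copy_loop are made up for this illustration, and the array shape must be divisible by the window size, as in the answer.

```python
import numpy as np

def window_copy_vectorized(a1, a2, win_x, win_y):
    """Copy a2 into a zero array at each window's argmin/argmax of a1."""
    rows, cols = a1.shape
    blocks = (rows // win_x, cols // win_y)
    shape_new = (blocks[0], win_x, blocks[1], win_y)

    def to_windows(a):
        # (bx, win_x, by, win_y) -> (bx, by, win_x * win_y)
        return np.moveaxis(a.reshape(shape_new), 1, 2).reshape(*blocks, -1)

    a1_r, a2_r = to_windows(a1), to_windows(a2)
    a3_r = np.zeros_like(a2_r)
    ix, iy = np.indices(blocks)
    for idx in (np.argmin(a1_r, axis=-1), np.argmax(a1_r, axis=-1)):
        a3_r[ix, iy, idx] = a2_r[ix, iy, idx]
    # Undo the flattening and the axis move.
    return np.moveaxis(a3_r.reshape(*blocks, win_x, win_y), 2, 1).reshape(rows, cols)

def window_copy_loop(a1, a2, win_x, win_y):
    """Reference implementation: visit every window explicitly."""
    a3 = np.zeros_like(a2)
    for i in range(0, a1.shape[0], win_x):
        for j in range(0, a1.shape[1], win_y):
            w = a1[i:i + win_x, j:j + win_y]
            for flat in (w.argmin(), w.argmax()):
                di, dj = np.unravel_index(flat, w.shape)
                a3[i + di, j + dj] = a2[i + di, j + dj]
    return a3

rng = np.random.default_rng(0)
a1 = rng.random((6, 8))
a2 = rng.random((6, 8))
fast = window_copy_vectorized(a1, a2, 3, 2)
slow = window_copy_loop(a1, a2, 3, 2)
print(np.array_equal(fast, slow))
```

The loop version is far slower on a 2606 by 1228 array, so it is only meant as a check on small inputs.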

How to sample `n=1000` vector from Multivariate Normal distribution?

I want to sample m=10 vectors of size n=1000 (1000-dimensional) from a multivariate normal distribution with mean vector (0,0,..,0) and identity covariance matrix I_n, and then divide each vector by its l_2 norm.
Based on the answer, I try the following code:
import random
import numpy as np
m = 2
n = 5
random.seed(1000001)  # seeds Python's random module only, not NumPy's generator
x = np.random.multivariate_normal(np.zeros(m), np.eye(m), size=n)
print(x)
[[ 0.93503543 -0.00605634]
[-0.42033252 0.08350352]
[ 0.58507136 -0.07849799]
[ 0.79762498 0.26868063]
[ 1.31544479 0.79820179]]
Normalized
# Calculate the norms on axis zero
axis_0_norms = np.linalg.norm(x,axis = 0)
#print(f"Norms on axis 0 = {axis_0_norms}\n")
# Normalise the arrays
normalized_x = x/axis_0_norms
print("Normalized data:\n", normalized_x)
Normalized data:
[[ 0.48221541 -0.00712517]
[-0.21677341 0.09824033]
[ 0.30173234 -0.09235142]
[ 0.41135025 0.31609774]
[ 0.6783997 0.93906949]]
But 0.48221541**2+(-0.00712517)**2 is not 1.
Use np.zeros(), np.eye(), and the size argument as the parameters of the multivariate_normal function to create the array. Then normalize the data with the normalize function from sklearn, using its l2 norm option. We can then validate this l2 normalization by checking the sum of the squared values in each row of the data.
So firstly, let us create the array:
import numpy as np
import pandas as pd
from sklearn import preprocessing
# Set the seed for reproducibility
rng = np.random.default_rng(42)
# Create the array
m = 10
n = 1000
X = rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)
# Display the data within a dataframe
df_X = pd.DataFrame(X)
print("Original X:\n", df_X.head(5))
OUTPUT:
Showing the first 5/1000 rows of the Original array (X)
Original X:
Now let us normalize the array using the preprocessing.normalize() function from sklearn.
# Normalize X using l2 norms
X_normalized = preprocessing.normalize(X, norm='l2')
# Display the normalized array within a dataframe
df_norm = pd.DataFrame(X_normalized)
print("X_normalized:\n", df_norm.head(5))
OUTPUT:
Showing the first 5/1000 rows of the normalized array.
X_normalized:
And finally, we can now check the validity of this normalized array by checking that the sum of the squared values in each row is equal to 1.
# Confirm l2 normalization by checking the sum of the squared values in each row.
# Should equal 1 in each row
X_normalized_squared = X_normalized ** 2
X_sum_squared = np.sum(X_normalized_squared, axis=1)
# Display the sum of the squared values for each row within a dataframe
df_sum = pd.DataFrame(X_sum_squared, columns=["Sum"])
print("X_sum_squared:\n", df_sum.head(5))
OUTPUT:
Showing the first 5/1000 rows.
Sum of the squared values for each row.
X_sum_squared:
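The code in the question divided by norms taken along axis 0, that is, by column norms, which is why the squared entries of a row did not sum to 1. A NumPy-only sketch of row-wise normalization is shown below. It assumes each row is one sample vector, uses the question's stated sizes (m=10 vectors of dimension n=1000), and draws the samples with standard_normal, which has the same distribution as a multivariate normal with zero mean and identity covariance.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 10, 1000  # m vectors, each n-dimensional (sizes taken from the question)

# Independent standard normals: zero mean vector, identity covariance.
x = rng.standard_normal(size=(m, n))

# One norm per row; keepdims=True keeps shape (m, 1) so the division broadcasts.
row_norms = np.linalg.norm(x, axis=1, keepdims=True)
x_unit = x / row_norms

# Every row should now have squared entries that sum to 1.
print(np.sum(x_unit ** 2, axis=1))
```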

How to sort each row of a 3D numpy array by another 2D array?

I have a 2D numpy array of 2D points:
np.random.seed(0)
a = np.random.rand(3, 4, 2) # each value is a 2D point
I would like to sort each row by the norm of every point
norms = np.linalg.norm(a, axis=2) # shape(3, 4)
indices = np.argsort(norms, axis=0) # indices of each sorted row
Now I would like to create an array with the same shape and values as a, in which each row of 2D points is sorted by norm.
How can I achieve that?
I tried variations of np.take & np.take_along_axis but with no success.
for example:
np.take(a, indices, axis=1) # shape (3,3,4,2)
This samples a 3 times, once for each row in indices.
I would like to sample a just once: each row in indices holds the columns that should be sampled from the corresponding row.
If I understand you correctly, you want this:
norms = np.linalg.norm(a,axis=2) # shape(3,4)
indices = np.argsort(norms , axis=1)
np.take_along_axis(a, indices[:,:,None], axis=1)
output for your example:
[[[0.4236548 0.64589411]
[0.60276338 0.54488318]
[0.5488135 0.71518937]
[0.43758721 0.891773 ]]
[[0.07103606 0.0871293 ]
[0.79172504 0.52889492]
[0.96366276 0.38344152]
[0.56804456 0.92559664]]
[[0.0202184 0.83261985]
[0.46147936 0.78052918]
[0.77815675 0.87001215]
[0.97861834 0.79915856]]]
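A short check, under the same seed as the question, that the result really is sorted by norm within each row. It also shows an equivalent form using integer array indexing; the variable names rows and alt are only for this sketch.

```python
import numpy as np

np.random.seed(0)
a = np.random.rand(3, 4, 2)  # 3 rows of 4 points, each point is 2D

norms = np.linalg.norm(a, axis=2)    # shape (3, 4)
indices = np.argsort(norms, axis=1)  # sort order within each row

# indices needs a trailing axis so it broadcasts over the point coordinates.
sorted_a = np.take_along_axis(a, indices[:, :, None], axis=1)

# Equivalent integer-array indexing: pair each row number with its column order.
rows = np.arange(a.shape[0])[:, None]
alt = a[rows, indices]

sorted_norms = np.linalg.norm(sorted_a, axis=2)
print(np.all(np.diff(sorted_norms, axis=1) >= 0), np.array_equal(sorted_a, alt))
```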

How to multiply a 3D matrix with a 2D matrix efficiently in numpy

I have two multidimensional arrays, which I want to multiply with each other. One has the shape N,N,3 and the other has the shape N,N.
Let me set the stage:
I have an array of atom positions of the shape N,3:
atom_positions = [[x1,y1,z1],
[x2,y2,z2],
[x3,y3,z3],
...
]
From these I calculate an upper triangular matrix of distance vectors so that the resulting N,N,3 matrix contains all unique pair distance vectors r_ij of the vectors inside atom_positions:
pair_distance_vectors = [[[0,0,0],[x2-x1,y2-y1,z2-z1],[x3-x1,y3-y1,z3-z1],...],
[[0,0,0],[0,0,0] ,[x3-x2,y3-y2,z3-z2],...],
...
]
Now I want to normalize each of these pair distance vectors. For that I want to use my N,N pair_distances array, which contains the length of every vector inside pair_distance_vectors.
The formula for a single vector is:
r_ij/|r_ij|
I want to do that by doing a matrix multiplication, where every entry in the N,N array becomes a scalar by which a vector inside the N,N,3 array is multiplied. I'm pretty sure that this can be achieved somehow with numpy by using numpy.dot() or a different function, but I just can't find the answer myself. Also, I'm afraid if I do find a transformation which allows for this, that my maths will be faulty.
Here's some demonstration code, which achieves what I want in a very inefficient fashion:
import numpy as np
pair_distance_vectors = np.ones(shape=(2,2,3))
pair_distances = np.array(((1,2),(3,4)))
normalized_pair_distance_vectors = np.zeros(shape=(2,2,3))
for i,vec_list in enumerate(pair_distance_vectors):
    for j,vec in enumerate(vec_list):
        normalized_pair_distance_vectors[i,j] = vec*pair_distances[i,j]
print(normalized_pair_distance_vectors)
Thanks in advance.
EDIT: Maybe this is clearer:
distance_vectors = [[[x11,y11,z11],[x12,y12,z12],[x13,y13,z13],...],
[[x21,y21,z21],[x22,y22,z22],[x23,y23,z23],...],
... ]
distance_matrix = [[r_11,r_12,r_13,...],
[r_21,r_22,r_23,...],
... ]
norm_distance_vectors = some_operation(distance_vectors,distance_matrix)
norm_distance_vectors = [[r_11*[x11,y11,z11],r_12*[x12,y12,z12],r_13*[x13,y13,z13],...],
[r_21*[x21,y21,z21],r_22*[x22,y22,z22],r_23*[x23,y23,z23],...],
... ]
You won't need a loop. The trick is to expand pair_distances along a third dimension by repeating it m times (m being the dimension of your vectors, here 3) and then divide the two arrays element-wise. This works for any m-dimensional vectors; just replace 3 with m:
pair_distances = np.repeat(pair_distances[:,:,None], 3, axis=2)
normalized_pair_distance_vectors = np.nan_to_num(pair_distance_vectors/ pair_distances)
Output for your example inputs:
[[[1. 1. 1. ]
[0.5 0.5 0.5 ]]
[[0.33333333 0.33333333 0.33333333]
[0.25 0.25 0.25 ]]]
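Because NumPy broadcasting stretches a trailing axis of length 1 automatically, the explicit np.repeat is arguably optional. The sketch below is one possible variant, not the answerer's code: it adds the trailing axis with None and uses the where argument of np.divide so that zero distances (such as the unused entries of the triangular matrix) yield zero vectors instead of NaN. The input values are illustrative.

```python
import numpy as np

pair_distance_vectors = np.ones((2, 2, 3))
pair_distances = np.array([[0.0, 2.0],
                           [3.0, 4.0]])  # one zero distance for illustration

# Shape (N, N, 1) broadcasts against (N, N, 3) without copying the data.
d = pair_distances[:, :, None]

# Divide only where the distance is non-zero; other entries keep the 0 from `out`.
normalized = np.divide(
    pair_distance_vectors,
    d,
    out=np.zeros_like(pair_distance_vectors),
    where=d != 0,
)
print(normalized)
```

The same broadcasting works for multiplication: pair_distance_vectors * pair_distances[:, :, None] reproduces the loop in the question.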

Representing a ragged array in numpy by padding

I have a 1-dimensional numpy array scores of scores associated with some objects. These objects belong to some disjoint groups, and all the scores of the items in the first group are first, followed by the scores of the items in the second group, etc.
I'd like to create a 2-dimensional array where each row corresponds to a group, and each entry is the score of one of its items. If all the groups are of the same size I can just do:
scores.reshape((numGroups, groupSize))
Unfortunately, my groups may be of varying size. I understand that numpy doesn't support ragged arrays, but it is fine for me if the resulting array simply pads each row with a specified value to make all rows the same length.
To make this concrete, suppose I have set A with 3 items, B with 2 items, and C with four items.
scores = numpy.array([f(a[0]), f(a[1]), f(a[2]), f(b[0]), f(b[1]),
f(c[0]), f(c[1]), f(c[2]), f(c[3])])
rowStarts = numpy.array([0, 3, 5])
paddingValue = -1.0
scoresByGroup = groupIntoRows(scores, rowStarts, paddingValue)
The desired value of scoresByGroup would be:
[[f(a[0]), f(a[1]), f(a[2]), -1.0],
[f(b[0]), f(b[1]), -1.0, -1.0],
[f(c[0]), f(c[1]), f(c[2]), f(c[3])]]
Is there some numpy function or composition of functions I can use to create groupIntoRows?
Background:
This operation will be used in calculating the loss for a minibatch for a gradient descent algorithm in Theano, so that's why I need to keep it as a composition of numpy functions if possible, rather than falling back on native Python.
It's fine to assume there is some known maximum row size.
The original objects being scored are vectors and the scoring function is a matrix multiplication, which is why we flatten things out in the first place. It would be possible to pad everything to the maximum item set size before doing the matrix multiplication, but the biggest set is over ten times bigger than the average set size, so this is undesirable for speed reasons.
Try this:
scores = np.random.rand(9)
row_starts = np.array([0, 3, 5])
row_ends = np.concatenate((row_starts, [len(scores)]))
lens = np.diff(row_ends)
pad_len = np.max(lens) - lens
where_to_pad = np.repeat(row_ends[1:], pad_len)
padding_value = -1.0
padded_scores = np.insert(scores, where_to_pad,
padding_value).reshape(-1, np.max(lens))
>>> padded_scores
array([[ 0.05878244, 0.40804443, 0.35640463, -1. ],
[ 0.39365072, 0.85313545, -1. , -1. ],
[ 0.133687 , 0.73651147, 0.98531828, 0.78940163]])
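An alternative that avoids np.insert is to build a boolean mask of valid positions and assign the scores through it. The sketch below wraps this in a function named group_into_rows, following the groupIntoRows name from the question; it is an illustration in plain NumPy, and whether the same operations translate to Theano is not verified here.

```python
import numpy as np

def group_into_rows(scores, row_starts, padding_value):
    """Pad consecutive groups of `scores` into the rows of a 2D array."""
    # Group lengths: differences between consecutive starts, plus the tail.
    lens = np.diff(np.append(row_starts, len(scores)))
    # mask[i, j] is True where row i has a real score in column j.
    mask = np.arange(lens.max()) < lens[:, None]
    out = np.full(mask.shape, padding_value, dtype=scores.dtype)
    # Boolean assignment fills the True cells row by row, matching the order of scores.
    out[mask] = scores
    return out

scores = np.arange(9, dtype=float)  # stand-in values 0..8
row_starts = np.array([0, 3, 5])
padded = group_into_rows(scores, row_starts, -1.0)
print(padded)
```

The mask itself can be kept alongside the padded array, which may be convenient when padded entries have to be excluded from a later reduction such as a loss.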
