I want to sample m=10 vectors of size n=1000 (i.e., 1000-dimensional) from a multivariate normal distribution with mean vector (0, 0, ..., 0) and identity covariance matrix I_n, and then divide each vector by its l_2 norm.
Based on the answer, I tried the following code:
import random
import numpy as np
m = 2
n = 5
random.seed(1000001)
x = np.random.multivariate_normal(np.zeros(m), np.eye(m), size=n)
print(x)
[[ 0.93503543 -0.00605634]
 [-0.42033252  0.08350352]
 [ 0.58507136 -0.07849799]
 [ 0.79762498  0.26868063]
 [ 1.31544479  0.79820179]]
Normalized
# Calculate the norms on axis zero
axis_0_norms = np.linalg.norm(x, axis=0)
#print(f"Norms on axis 0 = {axis_0_norms}\n")
# Normalise the arrays
normalized_x = x/axis_0_norms
print("Normalized data:\n", normalized_x)
Normalized data:
 [[ 0.48221541 -0.00712517]
 [-0.21677341  0.09824033]
 [ 0.30173234 -0.09235142]
 [ 0.41135025  0.31609774]
 [ 0.6783997   0.93906949]]
But 0.48221541**2+(-0.00712517)**2 is not 1.
Use np.zeros() and np.eye(), together with the size argument, to provide the parameters for the multivariate_normal function and create the array. Then normalize the data with the norm='l2' option of sklearn's normalize function. We can then validate the l2 normalization by checking that the squared values in each row sum to 1.
So firstly, let us create the array:
import numpy as np
import pandas as pd
from sklearn import preprocessing
# Set the seed for reproducibility
rng = np.random.default_rng(42)
# Create the array
m = 10
n = 1000
X = rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)
# Display the data within a dataframe
df_X = pd.DataFrame(X)
print("Original X:\n", df_X.head(5))
OUTPUT: the first 5 of the 1000 rows of the original array X (table omitted).
Now let us normalize the array using the preprocessing.normalize() function from sklearn.
# Normalize X using l2 norms
X_normalized = preprocessing.normalize(X, norm='l2')
# Display the normalized array within a dataframe
df_norm = pd.DataFrame(X_normalized)
print("X_normalized:\n", df_norm.head(5))
OUTPUT: the first 5 of the 1000 rows of the normalized array (table omitted).
And finally, we can check the validity of this normalized array by verifying that the sum of the squared values in each row equals 1.
# Confirm l2 normalization by checking the sum of the squared values in each row.
# Should equal 1 in each row
X_normalized_squared = X_normalized ** 2
X_sum_squared = np.sum(X_normalized_squared, axis=1)
# Display the sum of the squared values for each row within a dataframe
df_sum = pd.DataFrame(X_sum_squared, columns=["Sum"])
print("X_sum_squared:\n", df_sum.head(5))
OUTPUT: the first 5 of the 1000 row sums of squared values, each of which should equal 1 (table omitted).
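As an aside, the same result can be obtained with NumPy alone. The original attempt failed because axis=0 computes column norms rather than row norms; a minimal sketch, reusing the generator setup above:
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal(np.zeros(10), np.eye(10), size=1000)
# axis=1 gives one l2 norm per row; keepdims=True makes the (1000, 1)
# result broadcast correctly against the (1000, 10) array.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms
print(np.sum(X_normalized ** 2, axis=1)[:5])  # each value should be 1.0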
Since the K-means algorithm is sensitive to the order of the columns, I am executing it 100 times and storing the final centers of each iteration in an array.
I want to calculate the mean centers of the array, but I am getting only a single value with this:
a = np.mean(center_array)
vmean = np.vectorize(np.mean)
vmean(a)
How can I calculate the mean centers?
This is the structure of my centers array:
[array([[ 1.39450598,  0.65213679,  1.37195399,  0.02577591,  0.17637011,
          0.44572744,  1.50699298, -0.02577591, -0.17637011, -0.48222273,
         -0.14651225, -0.12975152],
        [-0.40910528, -0.18480587, -0.40459059,  1.00860933, -0.91902229,
         -0.13536744, -0.45108061, -1.00860933,  0.91902229,  0.11367937,
          0.19771608,  0.23722015],
        [-0.46264585, -0.23289607, -0.45219009,  0.0290917 ,  1.08811289,
         -0.14996175, -0.48998741, -0.0290917 , -1.08811289,  0.19925625,
         -0.14748408, -0.1943812 ]]),
 array([[ 0.20004497, -0.12493111,  0.99146416, -0.91902229, -0.17537297,
          0.11154588, -0.41348193, -0.99146416, -0.45307083, -0.4091783 ,
          0.18579957,  0.91902229]]),
 ...]
You need to specify the axis that contains the final centers of each iteration; otherwise np.mean is computed over the flattened array, resulting in a single value. From the documentation:
Returns the average of the array elements. The average is taken over the flattened array by default, otherwise over the specified axis.
import numpy as np
np.random.seed(42)
x = np.random.rand(5,3)
out1 = x.mean()
print(out1, out1.shape)
# 0.49456456164468965 ()
out2 = x.mean(axis=1) # rows
print(out2, out2.shape)
# [0.68574946 0.30355721 0.50845826 0.56618897 0.40886891] (5,)
out3 = x.mean(axis=0) # columns
print(out3, out3.shape)
# [0.51435949 0.44116654 0.52816766] (3,)
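Applied to the centers problem above, assuming all 100 runs produce centers of the same shape (k clusters in d dimensions) and are stacked into one 3-D array, a minimal sketch might look like:
import numpy as np

# Hypothetical shapes: 100 runs, k=3 clusters, d=12 dimensions.
rng = np.random.default_rng(0)
center_array = rng.normal(size=(100, 3, 12))
# Average over the runs axis only, keeping one mean center per cluster.
mean_centers = center_array.mean(axis=0)
print(mean_centers.shape)  # (3, 12)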
Assume a numpy array X of shape m x n and type float64. The rows of X need to pass through an element-wise median-of-means computation. Specifically, the m row indices are partitioned into b "buckets", each containing m/b such indices. Within each bucket I compute the element-wise mean, and across the resulting means I take a final element-wise median.
An example that clarifies this:
import numpy as np
m = 10
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
row_indices = np.arange(X.shape[0])
buckets = np.array(np.array_split(row_indices, b))
X_bucketed = X[buckets, :]
# Compute the mean within each bucket
bucket_means = np.mean(X_bucketed, axis=1)
# Compute the median-of-means
median = np.median(bucket_means, axis=0)
# Edit - Method 2 (based on answer)
np.random.shuffle(row_indices)
X = X[row_indices, :]
buckets2 = np.array_split(X, b, axis=0)
bucket_means2 = [np.mean(x, axis=0) for x in buckets2]
median2 = np.median(np.array(bucket_means2), axis=0)
This program works fine if b divides m, since np.array_split() then partitions the indices into equal parts and the array buckets is a 2-D array.
However, it does not work if b does not divide m. In that case, np.array_split() still splits into b buckets, but of unequal sizes, which is fine for my purposes. For example, b = 3 splits the indices {0, 1, ..., 9} into [0 1 2 3], [4 5 6], and [7 8 9]. Those arrays cannot be stacked onto one another, so the array buckets is not a 2-D array and cannot be used to index X_bucketed.
How can I make this work for unequal-sized buckets, i.e., to have the program compute the mean within each bucket (irrespective of its size) and then the median across the buckets?
I cannot fully grasp masked arrays and I am not sure if those can be used here.
You can compute each bucket's mean separately, then stack the results and compute the median. Also, you can apply array_split to X directly; there is no need to index X with a split index array (maybe this was your main question?).
import numpy as np

m = 11
n = 10000
# A random data matrix
X = np.random.uniform(low=0.0, high=1.0, size=(m,n)).astype(np.float64)
# Number of buckets to split rows into
b = 5
# Partition the rows of X into b buckets
buckets = np.array_split(X, b, axis=0)
# Compute the mean within each bucket
b_means = [np.mean(x, axis=0) for x in buckets]
# Compute the median-of-means
median = np.median(np.array(b_means), axis=0)
print(median)  # (10000,)-shaped array
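As a sanity check, here is a sketch assuming b divides m, in which case the fancy-indexing method and the direct-split method should agree:
import numpy as np

m, n, b = 10, 10000, 5
X = np.random.uniform(size=(m, n))
# Method 1: fancy indexing with a 2-D bucket index array (requires b | m).
buckets = np.array(np.array_split(np.arange(m), b))
median1 = np.median(np.mean(X[buckets, :], axis=1), axis=0)
# Method 2: split X directly; also works for unequal bucket sizes.
b_means = [np.mean(part, axis=0) for part in np.array_split(X, b, axis=0)]
median2 = np.median(np.array(b_means), axis=0)
print(np.allclose(median1, median2))  # True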
I'd like to get an NxM matrix where the numbers in each row are random samples drawn from different normal distributions (same mean but different standard deviations). The following code works:
import numpy as np
mean = 0.0 # same mean
stds = [1.0, 2.0, 3.0] # different stds
matrix = np.random.random((3, 10))  # placeholder values, overwritten below
for i, std in enumerate(stds):
    matrix[i] = np.random.normal(mean, std, matrix.shape[1])
However, this code is not very efficient, since there is a for loop involved. Is there a faster way to do this?
np.random.normal() is vectorized; you can pass the standard deviations as the scale parameter and transpose the result:
np.random.seed(444)
arr = np.random.normal(loc=0., scale=[1., 2., 3.], size=(1000, 3)).T
print(arr.mean(axis=1))
# [-0.06678394 -0.12606733 -0.04992722]
print(arr.std(axis=1))
# [0.99080274 2.03563299 3.01426507]
That is, the scale parameter specifies the column-wise standard deviations, hence the need to transpose via .T, since you want row-wise distributions.
How about this?
rows = 10000
stds = [1, 5, 10]
data = np.random.normal(size=(rows, len(stds)))
scaled = data * stds
print(np.std(scaled, axis=0))
Output:
[ 0.99417905 5.00908719 10.02930637]
This exploits the fact that two normal distributions can be interconverted by linear scaling (in this case, multiplying by the standard deviation). In the output, each column (second axis) contains a normally distributed variable corresponding to a value in stds.
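For what it's worth, the same idea carries over to the newer Generator API (a sketch, assuming NumPy >= 1.17):
import numpy as np

rng = np.random.default_rng(444)
stds = [1.0, 2.0, 3.0]
# scale broadcasts along the last axis, so each column gets its own std;
# transpose to get one distribution per row.
arr = rng.normal(loc=0.0, scale=stds, size=(10000, len(stds))).T
print(arr.std(axis=1))  # roughly [1. 2. 3.]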
I would like to use a generic filter to calculate the mean of values within a given window (or kernel), for values that fulfill a couple of conditions. I expected the following code to produce a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation.
from scipy import ndimage
import numpy as np
#some test data
tstArr = np.random.rand(3,7,7)
tstArr = tstArr*10
tstArr = np.int_(tstArr)
tstArr[1] = tstArr[1]*100
tstArr[2] = tstArr[2] *1000
#mean function
def testFun(tstData, processLayer, nLayers, kernelSize):
    funData = tstData.reshape((nLayers, kernelSize, kernelSize))
    meanLayer = funData[processLayer]
    maskedData = meanLayer[(funData[1] > 1) & (funData[2] < 9000)]
    returnMean = np.mean(maskedData)
    return returnMean
#number of layers in the array
nLayers = np.shape(tstArr)[0]
#window size
kernelSize = 5
#create a sampling window of 5x5 elements from each array
footprnt = np.ones((nLayers, kernelSize, kernelSize), dtype=int)
# calculate the mean of the first layer in the array (other two are for masking)
processLayer = 0
tstOut = ndimage.generic_filter(tstArr, testFun, footprint=footprnt, extra_arguments = (processLayer,nLayers,kernelSize))
I thought this would yield a 7x7 array of masked mean values from the first layer in the input array. The output is a 3x7x7 array, and I don't understand what the values represent. I'm not sure how to produce the "masked" mean-filtered array, or how to interpret the output as given.
Your code produces a mean filter of the first array in a 3-layer window, using the other two arrays to mask values from the mean calculation. You will find the result in tstOut[1].
What is going on? When you call ndimage.generic_filter with tstArr of shape (3, 7, 7) and footprint=np.ones((3, 5, 5)), then for all i from 0 to 2, all j from 0 to 6, and all k from 0 to 6, testFun is called with the subarray of tstArr centered at (i, j, k) and of shape (3, 5, 5) (the array is reflected at the boundary to supply missing values).
In the end:
tstOut[0] is the mean filter of tstArr[0] with tstArr[0] and tstArr[1] as masks
tstOut[1] is the mean filter of tstArr[0] with tstArr[1] and tstArr[2] as masks
tstOut[2] is the mean filter of tstArr[1] with tstArr[2] and tstArr[2] as masks
Again, the wanted result is in tstOut[1].
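In code, that just means slicing out the middle layer:
result = tstOut[1]   # 7x7 array: the masked mean filter of tstArr[0]
print(result.shape)  # (7, 7)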
I hope this will help you.
I have a relatively large array, e.g. 200 x 1000. The matrix is sparse, and its elements can be considered weights; the weights range from 0 to 500. I would like to generate a new array of the same size, 200 x 1000, where N of the elements of the new array are random integers {0, 1}. The probability of an element in the new matrix being 0 or 1 is proportional to the weights in the original array: the higher the weight, the higher the probability of 1 versus 0.
Stated another way: I would like to generate a zero matrix of size 200 x 1000 and then randomly choose N elements to flip to 1, based on a 200 x 1000 matrix of weights.
I'll throw my proposed solution in here as well:
import numpy as np

# for example: weights in [0, 500]
a = np.random.randint(0, 501, size=(200, 1000))
N = 200
result = np.zeros((200, 1000))
ia = np.arange(result.size)
tw = float(np.sum(a.ravel()))
result.ravel()[np.random.choice(ia, p=a.ravel() / tw, size=N, replace=False)] = 1
where a is the array of weights: that is, pick the indexes for the items to change to 1 from the array ia, weighted by a.
This can be done with numpy with
import numpy

# Compute probabilities as a 1-D array
probs = numpy.float64(weights).ravel()
probs /= numpy.sum(probs)
# Pick winner indexes
winners = numpy.random.choice(len(probs), N, False, probs)
# Build result
result = numpy.zeros(weights.shape, numpy.uint8)
result.ravel()[winners] = 1
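For instance, a small illustrative run (toy weights and N chosen here just for demonstration):
import numpy

weights = numpy.array([[1, 5], [500, 0]], dtype=numpy.float64)
N = 2
probs = weights.ravel() / numpy.sum(weights)
winners = numpy.random.choice(len(probs), N, False, probs)
result = numpy.zeros(weights.shape, numpy.uint8)
result.ravel()[winners] = 1
print(result)  # exactly N ones, biased toward the heavier weights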
Something like this should work, no reason to get too complicated:
>>> import random
>>> weights = [[1,5],[500,0]]
>>> output = []
>>> for row in weights:
...     outRow = []
...     for entry in row:
...         outRow.append(random.choice([0] + [1 for _ in range(entry)]))
...     output.append(outRow)
...
>>> output
[[1, 1], [1, 0]]
This chooses a random entry from a sequence that always has a single 0 followed by n 1s, where n is the corresponding entry in your weight matrix. In this implementation, a weight of 1 actually gives a 50/50 chance of either a 1 or a 0. If you want the 50/50 point to fall at a weight of 250, use outRow.append(random.choice([0 for _ in range(500 - entry)] + [1 for _ in range(entry)])).