I'd like to get an NxM matrix where the numbers in each row are random samples generated from different normal distributions (same mean but different standard deviations). The following code works:
import numpy as np
mean = 0.0 # same mean
stds = [1.0, 2.0, 3.0] # different stds
matrix = np.empty((3, 10))  # preallocate only; every row is overwritten below
for i, std in enumerate(stds):
    matrix[i] = np.random.normal(mean, std, matrix.shape[1])
However, this code is not very efficient, as there is a Python-level for loop involved. Is there a faster way to do this?
np.random.normal() is vectorized; you can switch axes and transpose the result:
np.random.seed(444)
arr = np.random.normal(loc=0., scale=[1., 2., 3.], size=(1000, 3)).T
print(arr.mean(axis=1))
# [-0.06678394 -0.12606733 -0.04992722]
print(arr.std(axis=1))
# [0.99080274 2.03563299 3.01426507]
That is, the scale parameter broadcasts over the last axis, so each column of the (1000, 3) draw gets its own standard deviation; transposing with .T turns those columns into the rows you want.
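You can also skip the transpose by making scale a column vector so it broadcasts along the first axis instead; a minimal sketch of that variant:
import numpy as np

np.random.seed(444)
# scale with shape (3, 1) broadcasts against size=(3, 1000),
# so each row gets its own standard deviation directly
arr = np.random.normal(loc=0.0, scale=np.array([1.0, 2.0, 3.0])[:, None], size=(3, 1000))
print(arr.std(axis=1))  # approximately [1. 2. 3.]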
How about this?
rows = 10000
stds = [1, 5, 10]
data = np.random.normal(size=(rows, len(stds)))
scaled = data * stds
print(np.std(scaled, axis=0))
Output:
[ 0.99417905 5.00908719 10.02930637]
This exploits the fact that any two zero-mean normal distributions can be interconverted by linear scaling (in this case, multiplying by the standard deviation). In the output, each column (second axis) contains a normally distributed variable corresponding to a value in stds.
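If you also want a nonzero mean, the same affine idea applies, since mean + std * N(0, 1) is distributed as N(mean, std**2); a minimal sketch:
import numpy as np

rows = 10000
mean = 2.0
stds = [1, 5, 10]
# draw standard normals, then scale by std and shift by mean
scaled = mean + np.random.normal(size=(rows, len(stds))) * stds
print(np.mean(scaled, axis=0))  # approximately [2. 2. 2.]
print(np.std(scaled, axis=0))   # approximately [ 1.  5. 10.]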
Related
I would like to get a NumPy array of shape (1000, 2).
The 1st column will contain Gaussian-distributed variables with mean 1 and standard deviation 2.
The 2nd column will contain Gaussian-distributed variables with mean -1 and standard deviation 0.5.
How can I create this array from given values of mean and std?
You can use numpy's random generators.
import numpy as np
# as per kwinkunks suggestion
rng = np.random.default_rng()
arr1 = rng.normal(1, 2, 1000).reshape(1000, 1)
arr2 = rng.normal(-1, 0.5, 1000).reshape(1000, 1)
arr1[:5]
array([[-2.8428678 ],
[ 2.52213097],
[-0.98329961],
[-0.87854616],
[ 0.65674208]])
arr2[:5]
array([[-0.85321735],
[-1.59748405],
[-1.77794019],
[-1.02239036],
[-0.57849622]])
After that, you can concatenate.
np.concatenate([arr1, arr2], axis=1)
# output
array([[-2.8428678 , -0.85321735],
[ 2.52213097, -1.59748405],
[-0.98329961, -1.77794019],
...,
[ 0.84249042, -0.26451526],
[ 0.6950764 , -0.86348222],
[ 3.53885426, -0.95546126]])
Use np.random.normal directly:
import numpy as np
np.random.normal([1, -1], [2, 0.5], (1000, 2))
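Here the list-valued loc and scale broadcast across the two columns of size=(1000, 2). A minimal sketch of the same call with the newer Generator API (seed chosen arbitrarily for reproducibility):
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
arr = rng.normal([1, -1], [2, 0.5], (1000, 2))
print(arr.mean(axis=0))  # approximately [ 1. -1.]
print(arr.std(axis=0))   # approximately [2.  0.5]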
You can just draw the two normal samples, one with each mean and std, and stack them:
np.hstack((np.random.normal(1, 2, size=(1000,1)), np.random.normal(-1, 0.5, size=(1000,1))))
I want to sample n = 1000 vectors of dimension m = 10 from a multivariate normal distribution with mean vector (0, 0, ..., 0) and identity covariance matrix I_m, and then divide each vector by its l2 norm. Based on the answer, I tried the following code (with smaller m and n):
import numpy as np
m = 2
n = 5
np.random.seed(1000001)  # note: the stdlib's random.seed would not seed NumPy
x = np.random.multivariate_normal(np.zeros(m), np.eye(m), size=n)
print(x)
[[ 0.93503543 -0.00605634]
[-0.42033252 0.08350352]
[ 0.58507136 -0.07849799]
[ 0.79762498 0.26868063]
[ 1.31544479 0.79820179]]
Normalized
# Calculate the norms on axis zero
axis_0_norms = np.linalg.norm(x,axis = 0)
#print(f"Norms on axis 0 = {axis_0_norms}\n")
# Normalise the arrays
normalized_x = x/axis_0_norms
print("Normalized data:\n", normalized_x)
Normalized data:
[[ 0.48221541 -0.00712517]
[-0.21677341 0.09824033]
[ 0.30173234 -0.09235142]
[ 0.41135025 0.31609774]
[ 0.6783997 0.93906949]]
But 0.48221541**2+(-0.00712517)**2 is not 1.
Use np.zeros() and np.eye(), together with size, to provide the parameters for the multivariate_normal function and create the array. Then normalize the rows with the l2 norm option of the normalize function from sklearn. We can validate this l2 normalization by checking that the sum of the squared values in each row of the data equals 1.
So firstly, let us create the array:
import numpy as np
import pandas as pd
from sklearn import preprocessing
# Set the seed for reproducibility
rng = np.random.default_rng(42)
# Create the array
m = 10
n = 1000
X = rng.multivariate_normal(np.zeros(m), np.eye(m), size=n)
# Display the data within a dataframe
df_X = pd.DataFrame(X)
print("Original X:\n", df_X.head(5))
OUTPUT (first 5 of 1000 rows of the original array X): [table omitted]
Now let us normalize the array using the preprocessing.normalize() function from sklearn.
# Normalize X using l2 norms
X_normalized = preprocessing.normalize(X, norm='l2')
# Display the normalized array within a dataframe
df_norm = pd.DataFrame(X_normalized)
print("X_normalized:\n", df_norm.head(5))
OUTPUT (first 5 of 1000 rows of the normalized array): [table omitted]
And finally, we can check the validity of this normalized array by verifying that the sum of the squared values in each row is equal to 1.
# Confirm l2 normalization by checking the sum of the squared values in each row.
# Should equal 1 in each row
X_normalized_squared = X_normalized ** 2
X_sum_squared = np.sum(X_normalized_squared, axis=1)
# Display the sum of the squared values for each row within a dataframe
df_sum = pd.DataFrame(X_sum_squared, columns=["Sum"])
print("X_sum_squared:\n", df_sum.head(5))
OUTPUT (first 5 of 1000 rows, the sum of squared values per row): [table omitted; each sum is 1 up to floating-point error]
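If you would rather stay in pure NumPy, the fix to the original code is to take norms along axis=1 (one norm per row) and keep the dimension for broadcasting; a minimal sketch:
import numpy as np

rng = np.random.default_rng(42)
X = rng.multivariate_normal(np.zeros(10), np.eye(10), size=1000)
# one norm per row; keepdims=True gives shape (1000, 1) for broadcasting
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normalized = X / row_norms
print(np.sum(X_normalized**2, axis=1)[:5])  # each value is 1 up to floating-point error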
a = (np.arange(12)).reshape(2,2,3)
I am simply trying to get the standard deviation of each column in my 3D numpy array. When I do this for the mean I get the expected result - each mean in the resulting array is a float.
a.mean(axis = 0).mean(axis = 0)
output:
array([4.5, 5.5, 6.5])
However, for the standard deviation:
a.std(axis = 0).std(axis = 0)
Returns:
array([0., 0., 0.])
When verifying that np.std works correctly on one column
np.std(np.array([1,4,7,10]))
it returns
3.3541019662496847
Why are the column standard deviations returning 0,0,0 ?
For a 2D NumPy array, finding the standard deviation and mean of each column can be done as:
a = (np.arange(12)).reshape(4,3)
a_mean = a.T.mean(axis=1)
a_std = a.T.std(axis=1)
As for 3D NumPy arrays, I am not sure what exactly you mean by column. Note that chaining a.std(axis=0).std(axis=0) does not give column standard deviations: the first call produces an array whose entries are all 3, and the standard deviation of constant values is 0; unlike the mean, std does not compose this way. So maybe the solution you are looking for is to first reshape the array into a 2D NumPy array and then use the code above, as sketched below.
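A minimal sketch of that reshape approach, assuming that by "column" you mean the last axis:
import numpy as np

a = np.arange(12).reshape(2, 2, 3)
# flatten everything except the last axis, then take per-column stats
flat = a.reshape(-1, a.shape[-1])  # shape (4, 3)
print(flat.std(axis=0))            # [3.35410197 3.35410197 3.35410197]
# equivalently, reduce over both leading axes in one call
print(a.std(axis=(0, 1)))          # same result
This matches np.std(np.array([1, 4, 7, 10])) from the question.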
Let's say I have a standard 2D NumPy array with values; call it my2darray. In this array there are two major sections: for each column, there is a specific row which separates "scenario1" and "scenario2". How can I create two masked arrays that represent the top section of my2darray and the bottom of my2darray? For example, I am interested in calculating the mean of the top half and the mean of the second half. One idea is to have a mask of the same shape as my2darray, but that seems like a waste of memory. Is there a better idea? Let's say I have a vector whose length is equal to the number of rows in my2darray (in this case 6), i.e. I have
myvector=np.array([9, 15, 5,7,11,11])
I am using python 2.6 with numpy 1.5.0
Using NumPy's broadcasted comparison, we can create such a 2D mask in a vectorized manner. The rest of the work is a sum-reduction along the first axis, for which we can take help from np.einsum. Thus, we would have an implementation like so -
N = my2darray.shape[0]
mask = myvector <= np.arange(N)[:,None]
uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
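As a side note on that einsum call: 'ij,ij->j' is just the masked column sum, equivalent to (A * M).sum(axis=0) but without materializing the product array. A tiny sketch with made-up data:
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)  # hypothetical data for illustration
M = A > 5                                     # hypothetical boolean mask
print(np.einsum('ij,ij->j', A, M))            # [15. 17. 19.]
print((A * M).sum(axis=0))                    # same values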
Sample run to verify results -
In [184]: N = my2darray.shape[0]
...: mask = myvector <= np.arange(N)[:,None]
...: uout = np.true_divide(np.einsum('ij,ij->j',my2darray,~mask),myvector)
...: lout = np.true_divide(np.einsum('ij,ij->j',my2darray,mask),N-myvector)
...:
In [185]: uout
Out[185]: array([ 6. , 4.6, 4. , 0. ])
In [186]: [my2darray[:item,i].mean() for i,item in enumerate(myvector)]
Out[186]: [6.0, 4.5999999999999996, 4.0, 0.0] # Loopy version results
In [187]: lout
Out[187]: array([ 5.2 , 4. , 2.66666667, 2. ])
In [188]: [my2darray[item:,i].mean() for i,item in enumerate(myvector)]
Out[188]: [5.2000000000000002, 4.0, 2.6666666666666665, 2.0] # Loopy version
Another potentially faster way would be to compute the sums for the upper mask once, store them, and subtract them from the column sums of the entire 2D input array; the difference gives the lower-part sums needed for the lower averages. Thus, after we store N and calculate mask, we would have -
usum = np.einsum('ij,ij->j',my2darray,~mask)
uout = np.true_divide(usum,myvector)
lout = np.true_divide(my2darray.sum(0) - usum,N-myvector)
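A quick self-contained check of this variant against the loopy version, with my2darray and myvector made up for illustration:
import numpy as np

my2darray = np.arange(24, dtype=float).reshape(6, 4)  # hypothetical 6x4 data
myvector = np.array([2, 3, 1, 4])                     # hypothetical split row per column

N = my2darray.shape[0]
mask = myvector <= np.arange(N)[:,None]
usum = np.einsum('ij,ij->j',my2darray,~mask)
uout = np.true_divide(usum,myvector)
lout = np.true_divide(my2darray.sum(0) - usum,N-myvector)

# compare against the straightforward loopy version
print(uout, [my2darray[:r,i].mean() for i,r in enumerate(myvector)])
print(lout, [my2darray[r:,i].mean() for i,r in enumerate(myvector)])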
If I have some NumPy array, I can measure its mean, median, standard deviation, and so on with NumPy routines: http://docs.scipy.org/doc/numpy/reference/routines.statistics.html
For example, for array arr, I would run
import numpy as np
print(np.mean(arr))    # prints the mean
print(np.median(arr))  # prints the median
However, for my purposes, instead of measuring the statistical properties after an array is created, I would like to create an array with data of specified statistical properties. So, for example, I would like to create an array shaped (1000,) with mean 2.5 and variance 10, whose data points are i.i.d. Gaussian draws.
How could one do this with numpy?
You can use numpy.random.randn(size), which gives you size samples from the standard normal distribution N(0, 1). So multiply by the standard deviation and add the mean:
import numpy as np
m = 2.5
std = np.sqrt(10)
v = m + std*np.random.randn(1000)
print(np.mean(v))  # 2.43375955445
print(np.var(v))   # 9.9049376296
Yes, you can do this with the numpy library:
>>> import numpy as np
>>> import math
>>> mean = 2.5
>>> deviation = math.sqrt(10)  # std is the square root of the variance
>>> s = np.random.normal(mean, deviation, 1000)
This gives you an array of 1000 data points with mean 2.5 and variance 10. For more information, see http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.normal.html
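On newer NumPy versions, a minimal sketch of the same draw with the Generator API (the seed is arbitrary, just for reproducibility):
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
v = rng.normal(loc=2.5, scale=np.sqrt(10), size=1000)
print(v.mean())  # close to 2.5 for large samples
print(v.var())   # close to 10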