random normal numbers using 2d means vector - python

I have a vector of 2d means.
means = np.array([[0,0], [0, 3], [3,0], [3,3], [0, 5]])
I want to generate random normal numbers using this means vector.
If the means were only along the x axis, I would do it like this:
x_samples = np.asarray(list(map(lambda mean: np.random.normal(mean, 1), x_means)))
Is there a simple way to generate the samples for x and y together?
Thanks

With two mean values (x and y) for each point, I am assuming you want a multivariate normal distribution with those means and a standard deviation of 1 along each axis (the standard deviation is 1 in your 1d example).
In that case you can use np.random.multivariate_normal:
xy_samples = np.asarray([np.random.multivariate_normal(mean, np.diag([1., 1.])) for mean in means])
Or, similar to your formulation, using map:
xy_samples = np.asarray(list(map(lambda mean: np.random.multivariate_normal(mean, np.diag([1., 1.])), means)))
The np.diag call is needed because you have to supply a covariance matrix, not a scalar variance.
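If the covariance really is the identity, as assumed above, the x and y coordinates are independent, so a single np.random.normal-style call with the whole means array as loc should give the same distribution without a loop. A minimal sketch, assuming unit standard deviation on both axes (the seeded generator, the name many, and the sample count k are illustrative choices, not part of the question):

```python
import numpy as np

means = np.array([[0, 0], [0, 3], [3, 0], [3, 3], [0, 5]])

# Seeded generator so the sketch is reproducible (the seed is arbitrary).
rng = np.random.default_rng(0)

# loc broadcasts element-wise: one draw per entry of `means`, each with std 1.
xy_samples = rng.normal(loc=means, scale=1.0)
print(xy_samples.shape)  # (5, 2)

# Several draws per mean: add a leading axis of size k to the output shape.
k = 1000
many = rng.normal(loc=means, scale=1.0, size=(k,) + means.shape)
print(many.mean(axis=0).round(1))  # should land close to `means`
```

This only works because the covariance is diagonal; with correlated axes you still need multivariate_normal.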

Related

Why are these 2 ways of finding eigenvectors different in python?

import numpy as np
#first way
A = np.array([[1, 0, 1], [-2, 1, 0]])
print(A)
B = A @ A.transpose()
print(B)
eig_val, eig_vec = np.linalg.eig(B)
print(eig_vec)
#second way
from sympy import *
G = Matrix([[2,-2], [-2,5]])
print(G.eigenvects())
Why do these two ways give different results when they both aim to find the eigenvectors?
It has already been mentioned that eigenvectors are only unique up to a scalar multiple. That's a mathematical fact. To dig into the implementations of the methods you're using, numpy.linalg.eig returns normalized eigenvectors (i.e. the norm of the vectors would be 1) whereas eigenvects() of sympy does not normalize the vectors.
In some sense, normalized vectors are unique (up to sign) precisely because they have unit norm. They can define a unit eigendirection (geometrically), just like unit vectors in coordinate geometry. (Not strictly important to know)
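To see that the two results describe the same directions, you can rescale an unnormalized eigenvector to unit length and compare it with the matching numpy column. A small sketch, assuming the unnormalized vectors below (they satisfy B v = lambda v for the matrix in the question, but the exact scaling sympy prints is an assumption here, and the names raw and checks are made up for illustration):

```python
import numpy as np

B = np.array([[2.0, -2.0], [-2.0, 5.0]])
eig_val, eig_vec = np.linalg.eig(B)  # columns of eig_vec have unit norm

# Unnormalized eigenvectors keyed by eigenvalue (assumed scaling).
raw = {1.0: np.array([2.0, 1.0]), 6.0: np.array([-0.5, 1.0])}

checks = []
for lam, v in raw.items():
    unit = v / np.linalg.norm(v)                       # rescale to unit length
    col = eig_vec[:, np.argmin(np.abs(eig_val - lam))]  # matching numpy column
    # Parallel unit vectors have a dot product of +1 or -1 (sign is arbitrary).
    checks.append(bool(np.isclose(abs(unit @ col), 1.0)))

print(checks)
```

Both entries come out True if the directions agree, regardless of the sign numpy happens to pick.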

How to calculate the standard deviation of a list of m x n matrices in Python?

Say I have a data set of 100 samples. The interesting part about this data set is that each sample is a 4x3 matrix. My question is how should I calculate the standard deviation of this data set? I tried the following code, but I don't know if the result is correct. If it is correct, I want to know how it works. I know the standard deviation equation for 1d data, but I don't know the definition of std for a collection of m x n data. There is only an explanation for 1d data in the docstring of np.std.
import numpy as np
datalist = []
for _ in range(100):
    data = np.random.random((4,3))
    datalist.append(data)
std = np.std(np.asarray(datalist))
print(std)
It looks like you have some unnecessary steps. To begin with, you can get 100 matrices of 4x3 like this:
x = np.random.rand(100, 4, 3)
Then just call np.std on it:
np.std(x)
0.2827262559096299
That's if you want the standard deviation of all values. If you want it per matrix cell, specify the axis argument:
np.std(x, axis=0)
array([[0.27863211, 0.2670126 , 0.28752064],
[0.28540484, 0.25365294, 0.28905531],
[0.28848584, 0.27695767, 0.26886147],
[0.27138472, 0.3135065 , 0.29361115]])
axis=0 means that axis 0 (the one with size 100) is collapsed, so the result is a 4x3 matrix.
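To make the axis argument concrete, the per-cell result can be reproduced by hand, and a tuple of axes gives one value per matrix instead. A short sketch (the seeded generator and the variable names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 4, 3))

# Per-cell std: collapse axis 0, leaving one value per (row, column) position.
per_cell = np.std(x, axis=0)

# Same thing written out: sqrt of the mean squared deviation along axis 0.
manual = np.sqrt(((x - x.mean(axis=0)) ** 2).mean(axis=0))

# One std per matrix: collapse the two matrix axes instead.
per_matrix = np.std(x, axis=(1, 2))

print(per_cell.shape, per_matrix.shape)  # (4, 3) (100,)
print(np.allclose(per_cell, manual))     # True
```

Note that np.std uses the population formula (ddof=0) by default; pass ddof=1 if you want the sample standard deviation.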

Finding standard deviations along x and y in 2D numpy array

If I have a 2D numpy array composed of points (x, y) that give some value z(x, y) at each point, can I find the standard deviation along the x-axis and along the y-axis? I know that np.std(data) will simply find the standard deviation of the entire dataset, but that's not what I want. Also, adding in axis=0 or axis=1 computes the standard deviations along each axis for as many rows or columns as you have. If I just want one standard deviation along the y-axis, and another along the x-axis, can I find these in a dataset like this? From my understanding, standard deviations along x and y normally make sense when you have points x with values y(x). But I need some sigma_x and sigma_y for a 2D Gaussian fit I'm trying to do. Is this possible?
Here is an oversimplified example, since my actual data is much larger.
import numpy as np
data = np.array([[1, 5, 0, 3], [3, 5, 1, 1], [41, 33, 9, 20], [11, 20, 4, 13]])
print(np.std(data)) #not what I want
>>> 11.78386
print(np.std(data, axis=0)) #this gives me as many results as there are rows/columns so it's not what I want
>>> [16.03 11.69 3.5 7.69]
I'm not sure how the output corresponding to what I want would look like, since I'm not even sure if it's possible in a 2D array with shape > nx2. But I want to know if it's possible to compute a standard deviation along the x-axis, and one along the y-axis. I'm not even sure if this makes sense for a 2D array... But if it doesn't, I'm not sure what to input as my sigma_x and sigma_y for a 2D Gaussian fit.
Standard deviation doesn't care whether y = f(x) or (x, y) are coordinates. It just measures how spread out a set of values is. If you have n points (x, y) that make up an array of shape (n, 2), then std(axis=0) is what you want. It returns an array of shape (2,), where the first element is the x-axis std and the second is the y-axis std. Whether that is useful depends on what you want, and it ignores the correlation between x and y.
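As a rough illustration of that last point, here is a sketch on a made-up (n, 2) point cloud (the mean, covariance, and seed are invented for the example). std(axis=0) gives the two per-axis values, while np.cov also exposes the x-y correlation that std alone drops:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated point cloud of shape (n, 2): columns are x and y.
points = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=5000)

# One std per coordinate: collapse the n points, keep the 2 columns.
sigma_x, sigma_y = points.std(axis=0)

# Full covariance matrix (ddof=0 to match np.std's default normalization).
cov = np.cov(points, rowvar=False, ddof=0)

print(sigma_x, sigma_y)
print(cov)  # off-diagonal entries carry the x-y covariance
```

The square roots of the diagonal of cov equal sigma_x and sigma_y; the off-diagonal term is what a correlated 2D Gaussian fit would additionally need.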
I think what you want is to separate the x axis in small intervals and compute the standard deviation of the y coordinates of the points within those intervals.
You could compute std(y_i), where y_i are the y coordinates for points x in the interval (x_min+i*delta_x, x_min+(i+1)*delta_x), choosing a small delta_x, such that enough points (x_j, y_j) lie within the interval.
import numpy as np
x = np.array([0, 0.11, 0.1, 0.01, 0.2, 0.22, 0.23])
y = np.array([1, 2, 3, 2, 2, 2.1, 2.2])
num_intervals = 3
#sort the arrays
sort_inds = np.argsort(x)
x = x[sort_inds]
y = y[sort_inds]
# create intervals
x_range = x.max() - x.min()
x_intervals = np.linspace(np.min(x)+x_range/num_intervals, x.max()-x_range/num_intervals, num_intervals)
print(x_intervals)
>> [0.07666667 0.115 0.15333333]
Next, we split the arrays y and x using these intervals:
# get indices of x where the elements of x_intervals
# should be inserted, in order to maintain the order
# for sufficiently large num_intervals it
# approximates the closest value in x to an element
# in x_intervals
split_indices = np.unique(np.searchsorted(x, x_intervals, side='left'))
ls_of_arrays_x = np.array_split(x, split_indices)
ls_of_arrays_y = np.array_split(y, split_indices)
print(ls_of_arrays_x)
print(ls_of_arrays_y)
>> [array([0. , 0.01]), array([0.1 , 0.11]), array([0.2 , 0.22, 0.23])]
>> [array([1., 2.]), array([3., 2.]), array([2. , 2.1, 2.2])]
Now compute the x coordinates and the corresponding y std:
y_stds = np.array([np.std(yi) for yi in ls_of_arrays_y])
x_mean = np.array([np.mean(xi) for xi in ls_of_arrays_x])
print(x_mean)
print(y_stds)
>> [0.005 0.105 0.21666667]
>> [0.5 0.5 0.08164966]
I hope it was what you were looking for.
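For comparison, the interval description (x_min+i*delta_x, x_min+(i+1)*delta_x) can also be written with fixed-width bins and np.digitize. This is a different technique from the searchsorted/array_split split above, sketched here on the same toy data; the names edges, bin_idx, x_means and y_stds are illustrative:

```python
import numpy as np

x = np.array([0, 0.11, 0.1, 0.01, 0.2, 0.22, 0.23])
y = np.array([1, 2, 3, 2, 2, 2.1, 2.2])
num_intervals = 3

# Equal-width bin edges from x.min() to x.max().
edges = np.linspace(x.min(), x.max(), num_intervals + 1)

# Using only the interior edges maps every point to a bin in 0..num_intervals-1
# (the maximum x falls into the last bin instead of an overflow bin).
bin_idx = np.digitize(x, edges[1:-1])

x_means, y_stds = [], []
for i in range(num_intervals):
    mask = bin_idx == i
    if mask.any():  # skip empty bins, whose std would be undefined
        x_means.append(x[mask].mean())
        y_stds.append(y[mask].std())

print(np.array(x_means))
print(np.array(y_stds))
```

No sorting is needed here, at the cost of possibly empty bins when the points are unevenly spread.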

Vector sum of multidimensional arrays in numpy

If I have an N^3 array of triplets in a numpy array, how do I do a vector sum on all of the triplets in the array? For some reason I just can't wrap my brain around the summation indices. Here is what I tried, but it doesn't seem to work:
a = np.random.random((5,5,5,3)) - 0.5
s = a.sum((0,1,2))
np.linalg.norm(s)
I would expect that as N gets large, if the sum is working correctly I should converge to 0, but I just keep getting bigger. The sum gives me a vector that is the correct shape (3x1), but obviously I must be doing something wrong. I know this should be easy, but I'm just not getting it.
Thanks in advance!
It is easier to understand your problem analytically if we use standard normal numbers instead of uniform random numbers; the qualitative results still apply to your particular case:
>>> a = np.random.normal(0, 1, size=(5, 5, 5, 3))
>>> s = a.sum(axis=(0, 1, 2))
So now each of the three items of s is the sum of 125 numbers, each drawn from a standard normal distribution. It is a well-established fact that the sum of two independent normal random variables is again normal, with mean equal to the sum of the means and variance equal to the sum of the variances. So each of the three values in s is distributed as a sample from a normal distribution with mean 0 and standard deviation sqrt(125) = 11.18.
Because the variance grows with the number of terms, any given run is likely to show larger offsets from 0 as N grows, even though each of those numbers averages to 0 over many runs.
Furthermore, you then compute the norm of those three values. Squaring three independent standard normal variables and adding them together gives a chi-squared distribution (here with 3 degrees of freedom, scaled by the variance of 125). If you then take the square root, you get a chi distribution. The former is easier to deal with, and it predicts that the average value of the square of the norm of your three values will be 3 * 125 = 375. And it most certainly seems to be:
>>> mean_norm_sq = 0
>>> for n in range(1000):
...     a = np.random.normal(0, 1, size=(5, 5, 5, 3))
...     s = a.sum(axis=(0, 1, 2))
...     mean_norm_sq += np.sum(s**2)
...
>>> mean_norm_sq / 1000
374.47629802482447
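The same argument carries over to the uniform numbers in the question. A value uniform on [-0.5, 0.5) has variance 1/12, so each component of s has variance N^3 / 12 and the expected squared norm is 3 * N^3 / 12, which grows with N rather than shrinking. A quick check under those assumptions (the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 2000

# `trials` independent N x N x N grids of triplets, uniform on [-0.5, 0.5).
a = rng.random((trials, N, N, N, 3)) - 0.5

# Vector sum over the three grid axes: one 3-vector per trial.
s = a.sum(axis=(1, 2, 3))

# Average squared norm across trials versus the prediction 3 * N^3 / 12.
mean_norm_sq = (s ** 2).sum(axis=1).mean()
expected = 3 * N**3 / 12  # 31.25 for N = 5

print(mean_norm_sq, expected)
```

The two numbers should agree to within sampling noise, which confirms the sum itself is computed correctly.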
As the comments note, there is no reason why the squared sum should approach zero. By the description, an array of N three-dimensional vectors sounds like it should have the shape of (N,3) not (N,N,N,3), but I may be misunderstanding it. Either way, it is simple to observe what happens in the two cases:
import numpy as np
avg_sum = []
sq_sum = []
N_val = 2**np.arange(15)
for N in N_val:
    A = np.random.random((N,3)) - 0.5
    avg_sum.append(A.sum(axis=1).mean())
    sq_sum.append((A**2).sum(axis=1).mean())
import pylab as plt
plt.plot(N_val, avg_sum, label="Average sum")
plt.plot(N_val, sq_sum, label="Squared sum")
plt.legend(loc="best")
plt.show()
The average sum goes to zero as your intuition expects.

Generating random correlated x and y points using Numpy

I'd like to generate correlated arrays of x and y coordinates, in order to test various matplotlib plotting approaches, but I'm failing somewhere, because I can't get numpy.random.multivariate_normal to give me the samples I want. Ideally, I want my x values between -0.51, and 51.2, and my y values between 0.33 and 51.6 (though I suppose equal ranges would be OK, since I can constrain the plot afterwards), but I'm not sure what mean (0, 0?) and covariance values I should be using to get these samples from the function.
As the name implies, numpy.random.multivariate_normal generates normally distributed samples, which means there is a nonzero probability of finding points outside any given interval. You can generate correlated uniform distributions, but this is a little more convoluted. Take a look here for two possible methods.
If you want to go with the normal distribution, you can set the sigmas so that your half-interval corresponds to 3 standard deviations (you can also filter out the bad points if needed). This way roughly 99% of your points will fall inside your interval, for example:
import numpy as np
from matplotlib.pyplot import scatter
xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()]
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8 # correlation
covs = [[stds[0]**2 , stds[0]*stds[1]*corr],
[stds[0]*stds[1]*corr, stds[1]**2]]
m = np.random.multivariate_normal(means, covs, 1000).T
scatter(m[0], m[1])
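The filtering step mentioned above could look roughly like this. It is a sketch that reuses the same means and covariance, drops the plotting, and uses a seeded generator; the names inside and kept are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

xx = np.array([-0.51, 51.2])
yy = np.array([0.33, 51.6])
means = [xx.mean(), yy.mean()]
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.8
covs = [[stds[0]**2, stds[0]*stds[1]*corr],
        [stds[0]*stds[1]*corr, stds[1]**2]]

# Shape (1000, 2): one row per sample, columns are x and y.
m = rng.multivariate_normal(means, covs, 1000)

# Boolean mask of samples that fall inside both requested ranges.
inside = ((m[:, 0] >= xx[0]) & (m[:, 0] <= xx[1]) &
          (m[:, 1] >= yy[0]) & (m[:, 1] <= yy[1]))
kept = m[inside]

print(inside.mean())                        # fraction of samples kept
print(np.corrcoef(m[:, 0], m[:, 1])[0, 1])  # empirical correlation, near corr
```

Dropping the few outliers truncates the tails slightly, so the kept points are no longer exactly normal, which is usually fine for testing plots.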
