A random vector sampled from the Dirichlet distribution contains values in [0, 1] that sum to 1. In NumPy, a vector of size 5 can be generated like this:
x = np.random.dirichlet(np.ones(5))
Instead, I would like a random vector with values in [-1, 1] that sum to 1, which I was told can be achieved by transforming the Dirichlet-generated vector x as y = 2x - 1.
Below is an attempt at this transformation. However, the script doesn't work as intended, because y doesn't sum to 1 as needed. How can it be fixed, or could it be that y = 2x - 1 does not do what they said?
import numpy as np

x = np.random.dirichlet(np.ones(5))
y = 2*x - 1
print(x, np.sum(x))
print(y, np.sum(y))
which outputs:
[0.0209344 0.44791586 0.21002354 0.04107336 0.28005284] 1.0
[-0.9581312 -0.10416828 -0.57995291 -0.91785327 -0.43989433] -3.0000000000000004
The thing is that there is only one increasing linear map from the interval [0, 1] onto [-1, 1], and it is indeed x -> 2x - 1. But that map cannot keep your sum stable. The reason can be seen in these observations:
np.sum(x)
0.9999999999999999
np.sum(2*x)
1.9999999999999998
np.sum(2*x-1)
-3.0
As you can see, the last sum doesn't decrease by 1 as you might expect. It actually decreases by 5, because each of the 5 entries is decreased by 1: sum(2x - 1) = 2*sum(x) - 5 = 2 - 5 = -3.
Try y = 3/dimension - 2*x. That worked for me: the sum comes out to 1, although the values are only guaranteed to stay within [-1, 1] when the dimension is 3.
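A quick numerical check of that suggestion (a minimal sketch, assuming NumPy is imported as np):
import numpy as np

n = 5
x = np.random.dirichlet(np.ones(n))
y = 3 / n - 2 * x          # the suggested transform
print(y, y.sum())          # the sum is 1, but entries can drop below -1 when n != 3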
As one of the answers on stats.stackexchange explains, there is only one way to map the variables of a Dirichlet distribution to [-1, 1] while keeping the sum equal to 1: the dimension must be 3 and the transform must be y = 1 - 2x.
import numpy
numpy.random.seed(seed = 1)
x = numpy.random.dirichlet(alpha = numpy.ones(3), size = 1)
y = 1-2*x
print(x, numpy.sum(x))
print(y, numpy.sum(y))
which prints:
[[2.97492728e-01 7.02444212e-01 6.30601451e-05]] 1.0000000000000002
[[0.40501454 -0.40488842 0.99987388]] 0.9999999999999998
Related
I want to generate some samples and use them to estimate the expectation and variance of a random variable.
Given the probability density function f(x) = 2x for 0 <= x <= 1, and 0 otherwise.
I already found that E(X) = 2/3 and Var(X) = 1/18; my detailed solution is here: https://math.stackexchange.com/questions/4430163/simulating-expectation-of-continuous-random-variable
But here is what I have when simulating using python:
import numpy as np
N = 100_000
X = np.random.uniform(size=N, low=0, high=1)
Y = [2*x for x in X]
np.mean(Y) # 1.00221 <- not equal to 2/3
np.var(Y) # 0.3323 <- not equal to 1/18
What am I doing wrong here? Thank you in advance.
You are generating the mean and variance of Y = 2X, when you want the mean and variance of the X's themselves. You know the density, but the CDF is more useful for random variate generation than the PDF. For your problem, the density is f(x) = 2x for 0 <= x <= 1 (and 0 otherwise),
so the CDF is F(x) = x^2 for 0 <= x <= 1.
Given that the CDF is an easily invertible function on the range [0, 1], you can use inverse transform sampling to generate X values by setting F(X) = U, where U is a Uniform(0,1) random variable, and inverting the relationship to solve for X. For your problem, X^2 = U yields X = U^(1/2), i.e. the square root of U.
In other words, you can generate X values with
import numpy as np
N = 100_000
X = np.sqrt(np.random.uniform(size = N))
and then do anything you want with the data, such as calculate mean and variance, plot histograms, use in simulation models, or whatever.
A histogram will confirm that the generated data have the desired density:
import matplotlib.pyplot as plt
plt.hist(X, bins = 100, density = True)
plt.show()
produces a histogram that rises linearly from near 0 at x = 0 to about 2 at x = 1, matching the density f(x) = 2x.
The mean and variance estimates can then be calculated directly from the data:
print(np.mean(X), np.var(X)) # => 0.6661509538922444 0.05556962913014367
But wait! There’s more...
Margin of error
Simulation generates random data, so estimates of mean and variance will be variable across repeated runs. Statisticians use confidence intervals to quantify the magnitude of the uncertainty in statistical estimates. When the sample size is sufficiently large to invoke the central limit theorem, an interval estimate of the mean is calculated as (x-bar ± half-width), where x-bar is the estimate of the mean. For a so-called 95% confidence interval, the half-width is 1.96 * s / sqrt(n) where:
s is the estimated standard deviation;
n is the number of samples used in the estimates of mean and standard deviation; and
1.96 is a scaling constant derived from the normal distribution and the desired level of confidence.
The half-width is a quantitative measure of the margin of error, a.k.a. precision, of the estimate. Note that as n gets larger, the estimate has a smaller margin of error and becomes more precise, but there are diminishing returns to increasing the sample size due to the square root. Increasing the precision by a factor of 2 would require 4 times the sample size if independent sampling is used.
In Python:
var = np.var(X)
print(np.mean(X), var, 1.96 * np.sqrt(var / N))
produces results such as
0.6666763186360812 0.05511848269208021 0.0014551397290634852
where the third column is the confidence interval half-width.
Improving precision
Inverse transform sampling can yield greater precision for a given sample size if we use a clever trick based on fundamental properties of expectation and variance. In intro prob/stats courses you probably were told that Var(X + Y) = Var(X) + Var(Y). The true relationship is actually Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y), where Cov(X,Y) is the covariance between X and Y. If they are independent, the covariance is 0 and the general relationship becomes the one we learn/teach in intro courses, but if they are not independent the more general equation must be used. Variance is always a positive quantity, but covariance can be either positive or negative. Consequently, it’s easy to see that if X and Y have negative covariance the variance of their sum will be less than when they are independent. Negative covariance means that when X is above its mean Y tends to be below its mean, and vice-versa.
So how does that help? It helps because we can use the inverse transform, along with a technique known as antithetic variates, to create pairs of random variables which are identically distributed but have negative covariance. If U is a random variable with a Uniform(0,1) distribution, then U' = 1 - U also has a Uniform(0,1) distribution. (In fact, flipping any symmetric distribution will produce the same distribution.) As a result, X = F^(-1)(U) and X' = F^(-1)(U') are identically distributed since they're defined by the same CDF, but will have negative covariance because they fall on opposite sides of their shared median and thus strongly tend to fall on opposite sides of their mean. If we average each pair to get A = (F^(-1)(u_i) + F^(-1)(1 - u_i)) / 2, the expected value is E[A] = E[(X + X')/2] = 2E[X]/2 = E[X], while the variance is Var(A) = [Var(X) + Var(X') + 2Cov(X,X')]/4 = 2[Var(X) + Cov(X,X')]/4 = [Var(X) + Cov(X,X')]/2. In other words, we get a random variable A whose average is an unbiased estimate of the mean of X but which has less variance.
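A quick numerical sanity check of that negative covariance (a sketch, reusing the sqrt inverse transform and the np import from above):
U = np.random.uniform(size = 100_000)
X_pair = np.sqrt(U)                    # F^(-1)(U)
X_anti = np.sqrt(1.0 - U)              # F^(-1)(1 - U), the antithetic partner
print(np.cov(X_pair, X_anti)[0, 1])    # negative, roughly -0.05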
To fairly compare antithetic results head-to-head with independent sampling, we take the original sample size and allocate it with half the data being generated by the inverse transform of the U’s, and the other half generated by antithetic pairing using 1-U’s. We then average the paired values and generate statistics as before. In Python:
U = np.random.uniform(size = N // 2)
antithetic_avg = (np.sqrt(U) + np.sqrt(1.0 - U)) / 2
anti_var = np.var(antithetic_avg)
print(np.mean(antithetic_avg), anti_var, 1.96*np.sqrt(anti_var / (N / 2)))
which produces results such as
0.6667222935263972 0.0018911848781598295 0.0003811869837216061
Note that the half-width produced with independent sampling is nearly 4 times as large as the half-width produced using antithetic variates. To put it another way, we would need more than an order of magnitude more data for independent sampling to achieve the same precision.
To approximate the integral of some function of x, say, g(x), over S = [0, 1], using Monte Carlo simulation, you
generate N random numbers in [0, 1] (i.e. draw from the uniform distribution U[0, 1])
calculate the arithmetic mean of g(x_i) over i = 1 to i = N where x_i is the ith random number: i.e. (1 / N) times the sum from i = 1 to i = N of g(x_i).
The result of step 2 is the approximation of the integral.
The expected value of continuous random variable X with pdf f(x) and set of possible values S is the integral of x * f(x) over S. The variance of X is the expected value of X-squared minus the square of the expected value of X.
Expected value: to approximate the integral of x * f(x) over S = [0, 1] (i.e. the expected value of X), set g(x) = x * f(x) and apply the method outlined above.
Variance: to approximate the integral of (x * x) * f(x) over S = [0, 1] (i.e. the expected value of X-squared), set g(x) = (x * x) * f(x) and apply the method outlined above. Subtract the result of this by the square of the estimate of the expected value of X to obtain an estimate of the variance of X.
Adapting your method:
import numpy as np
N = 100_000
X = np.random.uniform(size = N, low = 0, high = 1)
Y = [x * (2 * x) for x in X]
E = [(x * x) * (2 * x) for x in X]
# mean
print((a := np.mean(Y)))
# variance
print(np.mean(E) - a * a)
Output
0.6662016482614397
0.05554821798023696
Instead of making Y and E lists, a much better approach is
Y = X * (2 * X)
E = (X * X) * (2 * X)
Y, E in this case are numpy arrays. This approach is much more efficient. Try making N = 100_000_000 and compare the execution times of both methods. The second should be much faster.
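If you want to quantify the difference, here is one rough way to time the two approaches (a sketch; actual timings depend on your machine and on N):
import numpy as np
import time

N = 10_000_000
X = np.random.uniform(size = N)

t0 = time.perf_counter()
Y_list = [x * (2 * x) for x in X]      # list-comprehension version
t1 = time.perf_counter()
Y_vec = X * (2 * X)                    # vectorized version
t2 = time.perf_counter()

print(f"list: {t1 - t0:.2f} s, vectorized: {t2 - t1:.2f} s")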
What is the best way to create a NumPy array x of a given size, with values randomly (and uniformly?) spread between -1 and 1, that also sum to 1?
I tried 2*np.random.rand(size)-1 and np.random.uniform(-1,1,size), based on the discussion here. If I take a transformation approach and re-scale either result by its sum afterwards (x /= np.sum(x)), the elements do sum to 1, but then some elements are suddenly much greater than 1 or less than -1 (>1, <-1), which is not wanted.
In this case, let's let a uniform distribution start the process, but adjust the values to give a sum of 1. For the sake of illustration, I'll use an initial set of [-1, -0.75, 0, 0.25, 1]. This gives us a sum of -0.5, but we require 1.0.
STEP 1: Compute the amount of total change needed: 1.0 - (-0.5) = 1.5.
Now, we will apportion that change among the elements of the distribution in some appropriate fashion. One simple method I've used is to move the middle elements the most, while keeping the endpoints stable.
STEP 2: Compute the difference of each element from the nearer endpoint. For your nice range, this is 1 - abs(x)
STEP 3: Sum these differences and divide the required change by that sum. Multiplying each element's difference by this ratio gives the amount to adjust that element.
Putting this much into a chart:
    x     diff    adjust
-1.00     0.00    0.0000
-0.75     0.25    0.1875
 0.00     1.00    0.7500
 0.25     0.75    0.5625
 1.00     0.00    0.0000
Now, simply add the x and adjust columns to get the new values:
    x    adjust       new
-1.00    0.0000   -1.0000
-0.75    0.1875   -0.5625
 0.00    0.7500    0.7500
 0.25    0.5625    0.8125
 1.00    0.0000    1.0000
There is your adjusted data set: a sum of 1.0, the endpoints intact.
Simple Python code:
x = [-1, -0.75, 0, 0.25, 1.0]
diff = [1 - abs(q) for q in x]                    # distance from the nearer endpoint
total_diff = sum(diff)
needed = 1.0 - sum(x)                             # total change required
adjust = [q * needed / total_diff for q in diff]  # apportion the change
new = [x[i] + adjust[i] for i in range(len(x))]
for i in range(len(x)):
    print(f'{x[i]:8} {diff[i]:8} {adjust[i]:8} {new[i]:8}')
print(new, sum(new))
Output:
-1 0 0.0 -1.0
-0.75 0.25 0.1875 -0.5625
0 1 0.75 0.75
0.25 0.75 0.5625 0.8125
1.0 0.0 0.0 1.0
[-1.0, -0.5625, 0.75, 0.8125, 1.0] 1.0
I'll let you vectorize this in NumPy.
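For reference, one possible NumPy version of the same adjustment (a sketch; not the only way to vectorize it):
import numpy as np

x = np.array([-1, -0.75, 0, 0.25, 1.0])
diff = 1 - np.abs(x)                    # distance to the nearer endpoint
needed = 1.0 - x.sum()                  # total change required
new = x + diff * needed / diff.sum()    # apportion the change
print(new, new.sum())                   # [-1. -0.5625 0.75 0.8125 1.] 1.0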
You can create two different arrays: a first part whose values add up to 1 and a second part (centered around zero) whose values add up to 0.
import numpy as np
size = 10
x_pos = np.random.uniform(0, 1, int(np.floor(size/2)))
x_pos = x_pos/x_pos.sum()
x_neg = np.random.uniform(0, 1, int(np.ceil(size/2)))
x_neg = x_neg - x_neg.mean()
x = np.concatenate([x_pos, x_neg])
np.random.shuffle(x)
print(x.sum(), x.max(), x.min())
>>> 0.9999999999999998 0.4928358768227867 -0.3265210342316333
print(x)
>>>[ 0.49283588 0.33974127 -0.26079784 0.28127281 0.23749531 -0.32652103
0.12651658 0.01497403 -0.03823131 0.13271431]
Rejection sampling
You can use rejection sampling. The method below does this by sampling in a space of 1 dimension less than the original space.
Step 1: you sample x(1), x(2), ..., x(n-1) randomly by sampling each x(i) from a uniform distribution on [-1, 1]
Step 2: if the sum S = x(1) + x(2) + ... + x(n-1) is below 0 or above 2 then reject and start again in Step 1.
Step 3: compute the n-th variable as x(n) = 1-S
Intuition
You can view the vector (x(1), x(2), ..., x(n-1), x(n)) as a point in the interior of an n-dimensional cube with vertices at (±1, ±1, ..., ±1), so that the constraints -1 <= x(i) <= 1 are satisfied.
The additional constraint that the sum of the coordinates must equal 1 constrains the coordinates to a smaller space than the hypercube and will be a hyperplane with dimension n-1.
If you do regular rejection sampling, sampling all n coordinates from the uniform distribution, then you will never hit the constraint: the sampled point will never lie exactly in the hyperplane. Therefore you consider a subspace of n-1 coordinates, where rejection sampling does work.
Visually
Say you have dimension 4; then you could plot 3 of the 4 coordinates. This plot (homogeneously) fills a polyhedron. Below, this is illustrated by plotting the polyhedron in slices. Each slice corresponds to a different sum S = x(1) + x(2) + ... + x(n-1), and hence to a different value for x(n).
Image: domain for 3 coordinates. Each colored surface relates to a different value for the 4-th coordinate.
Marginal distributions
For large dimensions, rejection sampling will become less efficient because the fraction of rejections grows with the number of dimensions.
One way to 'solve' this would be to sample from the marginal distributions instead. However, it is a bit tedious to compute these marginal distributions. For comparison: a similar algorithm exists for generating samples from a Dirichlet distribution, but in that case the marginal distributions are relatively easy. (It is not impossible to derive the distributions here either; see 'Relationship with Irwin Hall distribution' below.)
In the example above the marginal distribution of the x(4) coordinate corresponds to the surface area of the cuts. So for 4 dimensions, you might be able to figure out the computation based on that figure (you'd need to compute the area of those irregular polygons) but it starts to get more complicated for larger dimensions.
Relationship with Irwin Hall distribution
To get the marginal distributions you can use truncated Irwin Hall distributions. The Irwin Hall distribution is the distribution of a sum of uniformly distributed variables, and it has a piecewise polynomial shape. This is demonstrated below for one example.
Code
Since my Python is rusty I will mostly add R code. The algorithm is very basic, so I imagine any Python coder can easily adapt it. The hard part of the question seems to me to be more about the algorithm than about how to code it in Python (and since I am not a Python coder, I leave that up to others).
Image: output from sampling. The 4 black curves are marginal distributions for the four coordinates. The red curve is a computation based on an Irwin Hall distribution. This can be extended to a sampling method by computing directly instead of rejection sampling.
The rejection sampling in Python:
import numpy as np
def sampler(size):
    while True:
        x = np.random.uniform(-1, 1, size - 1)  # step 1: sample n-1 coordinates uniformly on [-1, 1]
        S = np.sum(x)
        if 0 <= S <= 2:                          # step 2: otherwise the last coordinate would leave [-1, 1]
            break
    return np.append(x, 1 - S)                   # step 3: the n-th coordinate brings the sum to exactly 1
y = sampler(5)
print(y, np.sum(y))
Here is some more code in R, including the comparison with the Irwin Hall distribution. This distribution can be used to compute the marginal distributions and to devise an algorithm that is more efficient than rejection sampling.
### function to do rejection sample
samp <- function(n) {
  S <- -1
  ## a while loop that performs step 1 (sample) and 2 (compare sum)
  while((S<0) || (S>2) ) {
    x <- runif(n-1,-1,1)
    S <- sum(x)
  }
  x <- c(x,1-S) ## step 3 (generate n-th coordinate)
  x
}
### compute 10^5 samples
y <- replicate(10^5,samp(4))
### plot histograms
h1 <- hist(y[1,], breaks = seq(-1,1,0.05))
h2 <- hist(y[2,], breaks = seq(-1,1,0.05))
h3 <- hist(y[3,], breaks = seq(-1,1,0.05))
h4 <- hist(y[4,], breaks = seq(-1,1,0.05))
### histograms together in a line plot
plot(h1$mids,h1$density, type = 'l', ylim = c(0,1),
xlab = "x[i]", ylab = "frequency", main = "marginal distributions")
lines(h2$mids,h2$density)
lines(h3$mids,h3$density)
lines(h4$mids,h4$density)
### add distribution based on Irwin Hall distribution
### Irwin Hall PDF
dih <- function(x,n=3) {
k <- 0:(floor(x))
terms <- (-1)^k * choose(n,k) *(x-k)^(n-1)
sum(terms)/prod(1:(n-1))
}
dih <- Vectorize(dih)
### Irwin Hall CDF
pih <- function(x,n=3) {
k <- 0:(floor(x))
terms <- (-1)^k * choose(n,k) *(x-k)^n
sum(terms)/prod(1:(n))
}
pih <- Vectorize(pih)
### adding the line
### (note we need to rescale the variable for the Irwin Hall distribution)
xn <- seq(-1,1,0.001)
range <- c(-1,1)
cum <- pih(1.5+(1-range)/2,3)
scale <- 0.5/(cum[1]-cum[2]) ### renormalize
### (the factor 0.5 is due to the scale difference)
lines(xn,scale*dih(1.5+(1-xn)/2,3),col = 2)
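For completeness, a rough Python counterpart of the Irwin Hall pieces above (a sketch adapted from the R code, not the original author's Python):
import numpy as np
from math import comb, factorial, floor

def dih(x, n=3):
    # Irwin Hall PDF: density of the sum of n Uniform(0,1) variables
    if x < 0 or x > n:
        return 0.0
    k = np.arange(floor(x) + 1)
    terms = (-1.0) ** k * np.array([comb(n, int(j)) for j in k]) * (x - k) ** (n - 1)
    return terms.sum() / factorial(n - 1)

def pih(x, n=3):
    # Irwin Hall CDF
    if x < 0:
        return 0.0
    if x > n:
        return 1.0
    k = np.arange(floor(x) + 1)
    terms = (-1.0) ** k * np.array([comb(n, int(j)) for j in k]) * (x - k) ** n
    return terms.sum() / factorial(n)

# marginal density of the 4th coordinate, rescaled to the truncated region as in the R code
xn = np.linspace(-1, 1, 201)
scale = 0.5 / (pih(2.5) - pih(1.5))
marginal = scale * np.array([dih(1.5 + (1 - v) / 2) for v in xn])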
You have coded an algebraic contradiction. The assumption of the question you cite is that the random sample will approximately fill the range [-1, 1]. If you re-scale linearly, it is algebraically impossible to maintain that range unless the sum is already 1 before scaling, in which case the scaling changes nothing.
You have two immediate choices here:
Surrender the range idea. Make a simple change to ensure that the sum will be at least 1, and accept a smaller range after scaling. You can do this in any way you like that skews the choices toward the positive side.
Change your original "random" selection algorithm such that it tends to maintain a sum near to 1, and then add a final element that returns it to exactly 1.0. Then you don't have to re-scale.
Consider basic interval algebra. If you begin with the interval (range) of [-1,1] and multiply by a (which would be 1/sum(x) for you), then the resulting interval is [-a,a]. If a > 1, as in your case, the resulting interval is larger. If a < 0, then the ends of the interval are swapped.
From your comments, I infer that your conceptual problem is a bit more subtle. You are trying to force a distribution with an expected value of 0 to yield a sum of 1. This is unrealistic unless you agree to skew that distribution in some way. So far, you have declined my suggestions but have not offered anything you will accept; until you identify that, I cannot reasonably suggest a solution for you.
I tried using the random.gauss function in Python to get a random value within a range. The gauss function takes the mean and standard deviation as parameters. Shouldn't the return value lie within [mean - standard deviation, mean + standard deviation]?
Below is the code snippet:
import random

for y in [random.gauss(4, 1) for _ in range(50)]:
    if y > 5 or y < 3:  # shouldn't y be between (3, 5)?
        print(y)
Output of the code:
6.011096878888296
2.9192195126660403
5.020299287583643
2.9322959456674083
1.6704559841869528
"Shouldn't the return value be between [mean +- standard deviation]?"
No. For the Gaussian distribution slightly less than 32% of the outcomes will be more than one standard deviation away from the mean in either direction. There is no fixed range that contains 100% of the outcomes.
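For reference, the exact fraction outside one standard deviation can be computed from the normal distribution (a one-liner, assuming SciPy is available):
from scipy.stats import norm
print(2 * norm.sf(1))   # about 0.3173, i.e. slightly less than 32%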
The Gaussian distribution only describes the probability of a value deviating from the mean by a certain amount. So any value is possible; it is just that the probability is very low for some ranges of values.
A sample size of 50 is not enough for the sample mean and sample SD to be good estimators of the population mean/SD. Law of large numbers to the rescue:
from random import gauss
xs = []
for i in range(100000):
    xs.append(gauss(0, 1))

print("mean = ", sum(xs)/len(xs))  # should print a number close to 0

in_sd = len([x for x in xs if x > -1 and x < 1])
print("in SD = ", in_sd/len(xs))   # should print a number close to 0.68
(https://repl.it/#millimoose/SiennaFlatBackground)
To get random numbers between 3 and 5, use random.uniform(3, 5) instead.
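For example (a small illustrative snippet, not part of the original answer):
import random
ys = [random.uniform(3, 5) for _ in range(50)]
print(min(ys), max(ys))   # always within [3, 5]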
I have the following code for finding the coefficients for polynomial regression,
import numpy as np

x = [-1, 0, 1, 2, 3, 5, 7, 9, 4, 6, 8, 2, 4, 6, 8, 5, 2]
y = [-1, 3, 2.5, 5, 4, 2, 5, 4, 4, 7, 3, 2, 7, 9, 6, 2, 6]
degree = 4

x = np.asarray(x)
X = np.vstack((np.ones(len(x)), x))        # rows: x**0, x**1
for i in range(2, degree + 1):
    X = np.vstack((X, np.power(x, i)))     # append a row of x**i

y = np.matrix(y).T
X = np.matrix(X).T
parameters = (X.T * X).I * X.T * y         # normal equations: (X'X)^-1 X'y
This code works for lower degrees, but it begins to return results different from numpy.polyfit(x, y, degree) at around the 16th degree.
Is my code incorrect? Why does its result differ from the built-in function at higher degrees?
Given n data points, you can only fit a polynomial of degree up to n - 1; beyond that there is insufficient data to determine a unique polynomial. Effectively, you are dividing by zero: the matrix you invert in the normal equations becomes (numerically) singular.
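One way to see that numerically (a sketch; np.vander is used here only for illustration and is not the asker's code):
import numpy as np

x = np.array([-1, 0, 1, 2, 3, 5, 7, 9, 4, 6, 8, 2, 4, 6, 8, 5, 2], dtype=float)
for degree in (4, 10, 16):
    V = np.vander(x, degree + 1, increasing=True)   # columns 1, x, x**2, ...
    print(degree, np.linalg.cond(V.T @ V))          # the condition number explodes at high degree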
I have two arrays: one is an array of corrected values, x, and the other is an array of the original values (before a correction was applied), y. I know that if I want to do a two-tailed t-test to get the two-tailed p-value I need to do this:
t_statistic, pvalue = scipy.stats.ttest_ind(x, y, nan_policy='omit')
However, this only tells me whether the two arrays are significantly different from each other. I want to show that the corrected values, x, are significantly less than y. To do this it seems like I need the one-tailed p-value, but I can't seem to find a function that does this. Any ideas?
Consider these two arrays:
import scipy.stats as ss
import numpy as np
prng = np.random.RandomState(0)
x, y = prng.normal([1, 2], 1, size=(10, 2)).T
An independent sample t-test returns:
t_stat, p_val = ss.ttest_ind(x, y, nan_policy='omit')
print('t stat: {:.4f}, p value: {:4f}'.format(t_stat, p_val))
# t stat: -1.1052, p value: 0.283617
This p-value is actually calculated from the cumulative distribution function:
ss.t.cdf(-abs(t_stat), len(x) + len(y) - 2) * 2
# 0.28361693716176473
Here, len(x) + len(y) - 2 is the number of degrees of freedom.
Notice the multiplication by 2. If the test is one-tailed, you don't multiply. That's all. So your p-value for a left-tailed test is
ss.t.cdf(t_stat, len(x) + len(y) - 2)
# 0.14180846858088236
If the test was right tailed, you would use the survival function
ss.t.sf(t_stat, len(x) + len(y) - 2)
# 0.85819153141911764
which is the same as 1 - ss.t.cdf(...).
I assumed that the arrays have the same length. If not, you need to modify the degrees of freedom.
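As a side note, recent SciPy versions (1.6 and later) expose one-sided tests directly through the alternative keyword of ttest_ind, which should agree with the manual CDF calculation above:
t_stat, p_left = ss.ttest_ind(x, y, nan_policy='omit', alternative='less')
print(p_left)   # one-tailed (left) p-value, about 0.1418 for this data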