My goal is to draw 500 sample points from a distribution, take their mean, and then do this 6000 times. Basically:
Take sample lengths ranging from N = 1 to 500. For each sample length,
draw 6000 samples and estimate the mean from each of the samples.
Calculate the standard deviation from these means for each sample
length, and show graphically that the decrease in standard deviation
corresponds to a square root reduction.
I am trying to do this on a gamma distribution, but all of my standard deviations are coming out as zero... and I'm not sure why.
This is the program so far:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gamma
# now taking random gamma samples
stdevs = []
length = np.arange(1, 401)
mean = []
for i in range(400):
    sample = np.random.gamma(shape=i, size=1000)
    mean.append(np.mean(sample))
    stdevs.append(np.std(mean))
# then trying to plot the standard deviations, but it's just a line...
# thought there should be a decrease
plt.plot(length, stdevs, label='sampling')
plt.show()
I thought there should be a decrease in the standard deviation, not an increase. What might I be doing wrong when trying to draw 1000 samples from a gamma distribution and estimate the mean and standard deviation?
I think you are misusing shape. shape is the shape parameter of the distribution, not the number of independent draws.
import numpy as np
import matplotlib.pyplot as plt
# Reproducible
gen = np.random.default_rng(20210513)
# Generate 400 (max sample size) by 1000 (number of indep samples)
sample = gen.gamma(shape=2, size=(400, 1000))
# Use cumsum to compute the running sum of observations down each column
means = np.cumsum(sample, axis=0)
# Divide the cumulative sum by the number of observations used in each row.
# A little care is needed to get broadcasting to work right.
means = means / np.arange(1, 401)[:, None]
# Compute the std dev using the observations in each row
stdevs = means.std(axis=1)
# Plot
plt.plot(np.arange(1, 401), stdevs, label='sampling')
plt.show()
This produces the picture.
The problem is with the line stdevs.append(np.std(sample.mean(axis=0)))
This takes the standard deviation of a single value, i.e. the mean of your sample array, so it will always be 0.
You need to pass np.std() all the values in your sample, not just its mean.
stdevs.append(np.std(sample)) will give you the array of standard deviations for each sampling.
I am trying to answer this question:
Assume that a sample is created from a standard normal distribution (μ = 0, σ = 1). Take sample lengths ranging from N = 1 to 600. For each sample length, draw 5000 samples and estimate the mean from each of the samples. Find the standard deviation from these means, and show that the standard deviation corresponds to a square root reduction.
I'm not sure if I am interpreting the question properly, but my goal is to find the standard deviation of the means for each sample length and then show that the decrease in standard deviation is similar to a square root reduction:
This is what I have so far (does what I'm doing make sense in relation to the problem?):
First, making a normal distribution and plotting a simple one for reference:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
n = np.arange(1, 401)
mu = 0
sigma = 1
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 100)
pdf = stats.norm.pdf(x, mu, sigma)
# plot normal distribution
plt.plot(x,pdf)
plt.show()
Now for the sample lengths, and calculating the stdev and mean:
sample_means = []
sample_stdevs = []
for i in range(400):
    rand_list = np.random.randint(1, 400, 1000)  # 1000 draws with integer values from 1 to 399
    sample_means.append(np.mean(rand_list))
    sample_stdevs.append(np.std(sample_means))
plt.plot(sample_stdevs)
Does this make sense? Also, I am confused about the square root reduction part.
Take sample lengths ranging from N = 1 to 400. For each sample length, draw 1000 samples and estimate the mean from each of the samples.
A sample of length 200 means drawing 200 sample points. Take their mean. Now do this 1000 times for N = 200 and you have 1000 means. Calculate the std of these 1000 means; it tells you the spread of these means. Do this for all N to see how this spread changes with sample length.
The idea is that if you only draw 5 samples, it's quite likely their mean won't sit nicely near 0. If you collect 1000 of these means, they will vary wildly and you'll get a wide spread. If you collect a larger sample, due to the law of large numbers the mean will be very close to 0 and this will be reproducible even if you do this 1000 times. Therefore the spread of those means will be smaller.
The standard deviation of the mean is the standard deviation of the population (σ = 1 in our case) divided by the square root of the size of the sample we drew. See the wiki article for a derivation.
import numpy as np
import matplotlib.pyplot as plt
stdevs = []
lengths = np.arange(1, 401)
for length in lengths:
    # mean = 0, std = 1 by default
    sample = np.random.normal(size=(length, 1000))
    stdevs.append(sample.mean(axis=0).std())
plt.plot(lengths, stdevs)
plt.plot(lengths, 1 / np.sqrt(lengths))
plt.legend(['Sampling', 'Theory'])
plt.show()
Output
Suppose I have the following sample of 100,000 points drawn from the chi-square distribution.
x = np.random.chisquare(10, 100000)
We plot the histogram, which is asymmetric. Let us say the histogram represents the probability density.
I want to get the 68% of the sample having the highest probability. Or, in general, how do I get the N% of the samples with maximum probability? Note that as N tends to zero we would get the mode/maximum-likelihood point.
Please help.
P.S. I am not looking for quantile/percentile which would not give the part of the sample with highest probability if the distribution/histogram is asymmetric.
The most naive solution I can think of is to fit the chi-square distribution, evaluate the density at each sample point, and take the top k samples, where k is the N% fraction of your dataset.
from math import floor
import numpy as np
from scipy.stats import chi2
N = 100000
k = int(floor(0.68 * N))
x = np.random.chisquare(10, N)
# fit returns the df, loc and scale parameters of the best-fitting chi-square
dist = chi2.fit(x)
# sort samples by their fitted density, highest first, and keep the top k
top_k = x[np.argsort(chi2.pdf(x, *dist))][::-1][:k]
Let's say I generate 10000 normally distributed random variates with σ = 1 and μ = 0:
from scipy.stats import norm
x = norm.rvs(size=10000, loc=0, scale=1)
How can I count the percentage of random variates that fall into the intervals [-1,1] or [-3,3]?
You can do this:
import numpy as np
print(sum(np.abs(x) < 1) / len(x) * 100)
sum(np.abs(x)<1) counts the samples in the (-1, 1) range; dividing by the total number of samples gives the fraction, and multiplying by 100 gives the percentage.
Edit: You can replace np.abs(x)<1 with (x<1) & (-1<x) to make it work for non-symmetric ranges (this still requires a NumPy array; see the pure-Python variant in the sketch below).
I'm trying to pick numbers from an array at random.
I can easily do this by picking an element using np.random.randint(len(myArray)) - but that gives a uniform distribution.
For my needs I need to pick a random number with higher probability of picking a number near the beginning of the array - so I think something like an exponential probability function would suit better.
Is there a way for me to generate a random integer in a range (1, 1000) using an exponential (or other, non-uniform distribution) to use as an array index?
You can pass an exponential probability vector to NumPy's choice function. The probability vector should add up to 1, so you normalize it by the sum of all probabilities.
import numpy as np
from numpy.random import choice
arr = np.arange(0, 1001)
prob = np.exp(-arr / 1000)  # decaying weights, highest at index 0; dividing by 1000 keeps the exponent small
rand_draw = choice(arr, 1, p=prob/sum(prob))
To make sure the random draws follow the exponential behavior, you can plot a histogram of 100000 random draws between 0 and 1000.
import matplotlib.pyplot as plt
# above code here
rand_draw = choice(arr, 100000, p=prob/sum(prob))
plt.hist(rand_draw, bins=100)
plt.show()
I am trying to derive the PDF of a sum of independent random variables. At first I would like to do this for a simple case: the sum of Gaussian random variables.
I was surprised to see that I don't get a Gaussian density function when I sum an even number of Gaussian random variables. I actually get:
which looks like two halves of a Gaussian distribution.
On the other hand, when I sum an odd number of Gaussian distributions I get the right distribution:
Below is the code I used to produce the results above:
import numpy as np
from scipy.stats import norm
from scipy.fftpack import fft, ifft
import matplotlib.pyplot as plt
%matplotlib inline
a = 10**(-15)
end = norm(0, 1).ppf(a)
sample = np.linspace(end, -end, 1000)
pdf = norm(0, 1).pdf(sample)
plt.subplot(211)
plt.plot(np.real(ifft(fft(pdf)**2)))
plt.subplot(212)
plt.plot(np.real(ifft(fft(pdf)**3)))
Could someone help me understand why I get odd results for even sums of Gaussian distributions?
Even though your code creates a zero-mean Gaussian PDF:
sample = np.linspace(end, -end, 1000)
pdf = norm(0, 1).pdf(sample)
the FFT does not know about sample, and only sees pdf with samples at 0, 1, 2, 3, ... 999. The FFT expects the origin to be the first sample of the signal. To the FFT function, your PDF is not zero mean, but has a mean of 500.
Thus, what is going on here is that you are adding two PDFs with a 500 mean, leading to one with a 1000 mean. And because the FFT imposes a periodicity to the spatial domain signal, you are seeing the PDF exiting the graph on the right and coming back in on the left.
Adding 3 PDFs shifts the mean to 1500, which due to periodicity is the same as 500, meaning it ends up in the same place as the original PDF.
The solution is to shift the origin to the first sample for the FFT, and shift the result back:
from scipy.fftpack import fftshift, ifftshift
pdf2 = fftshift(ifft(fft(ifftshift(pdf))**2))
ifftshift shifts the signal so that the center sample ends up at the first sample, and fftshift shifts it back to where you wanted it for display.
But do note that the way you generate the PDF, the origin is not at a sample, and so the above will not work exactly. Instead, use:
sample = np.linspace(end, -end, 1001)
pdf = norm(0, 1).pdf(sample)
By picking 1001 samples instead of 1000, zero is exactly at the middle sample.
Use R!
library(ggplot2)
f <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  ds <- data.frame(X = x1 + x2, n = factor(n))
  return(ds)
}
ds.list <- lapply(10^(2:5), f)
ds <- Reduce(rbind, ds.list)
ggplot(ds, aes(X, fill = n)) + geom_density(alpha = 0.5) + xlab("")
Here's the distribution plot: