Sampling distribution normal approximation misfit - Python

I was trying to simulate the "Sampling Distribution of Sample Proportions" using Python, with a Bernoulli variable as in the example here.
The crux is that, out of a large number of gumballs, the yellow balls have a true proportion of 0.6. If we repeatedly take samples (of some size, say 10), take the mean of each sample and plot those means, we should get a normal distribution.
I have managed to obtain a sampling distribution that looks normal; however, the theoretical normal curve with the same mu and sigma does not fit it at all, but appears scaled up by some factor. I am not sure what is causing this; ideally it should fit well. Below are my code and output. I tried varying the amplitude and also sigma (dividing by sqrt(sample size)), but nothing helped. Kindly help.
Code:
from SDSP import create_bernoulli_population, get_frequency_df, plot_pdf
from random import choices
from bi_to_nor_demo import get_metrics
import matplotlib.pyplot as plt

N = 10000             # 10000 balls
p = 0.6               # probability of a yellow ball is 0.6; the others are (1-0.6) => 0.4
n_pickups = 10        # sample size
n_experiments = 2000  # number of times the sampling is repeated (I don't know what this is called)

# build the gumball population (helper from SDSP; exact call signature assumed)
population = create_bernoulli_population(N, p)

# STATISTICAL PDF
# Choose a sample, take its mean and append it to X_mean_list. Do this n_experiments times.
X_mean_list = []
for each_experiment in range(n_experiments):
    X_hat = choices(population, k=n_pickups)  # choose, say, 10 balls from the population (with replacement)
    X_mean = sum(X_hat) / len(X_hat)
    X_mean_list.append(X_mean)
stats_df = get_frequency_df(X_mean_list)
# plot both theoretical and statistical outcomes
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
mu, var, sigma = get_metrics(stats_df)
plot_pdf(stats_df, ax, n_pickups, mu, sigma, p=mu, bar_width=round(0.5/n_pickups, 3),
         title='Sampling Distribution of\n a Sample Proportion')
plt.tight_layout()
plt.show()
Output:
The red curve is the misfit normal approximation. Its mu and sigma are derived from the statistical discrete distribution (small blue bars) and fed into the formula for the normal curve, but the normal curve looks scaled up somehow.
Update:
Avoiding the division when taking the average fixes the shape of the graph, but then mu is scaled, so the issue is still not fully solved. :(
X_mean = sum(X_hat) # removed the division /len(X_hat)
Output after removing the above division (but isn't it needed?):
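For reference, a likely source of this kind of mismatch is a units problem: the blue bars show probability mass per bin, while the normal formula gives a probability density, so the curve sits higher by a factor of one over the bin width. Below is a minimal self-contained sketch (plain numpy in place of the custom SDSP helpers, so all names here are illustrative) that scales the density by the bar width before overlaying it:
import numpy as np
import matplotlib.pyplot as plt

p, n_pickups, n_experiments = 0.6, 10, 2000

# sample proportions: means of n_pickups Bernoulli draws
rng = np.random.default_rng(0)
means = rng.binomial(n_pickups, p, size=n_experiments) / n_pickups
mu, sigma = means.mean(), means.std()

# probability mass of each observed proportion (bins are 1/n_pickups wide)
values, counts = np.unique(means, return_counts=True)
plt.bar(values, counts / n_experiments, width=0.5 / n_pickups)

# normal density, scaled by the bin width so it is comparable to the masses
xs = np.linspace(0, 1, 200)
density = np.exp(-(xs - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
plt.plot(xs, density / n_pickups, 'r')
plt.show()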

Related

Generate random samples for each sample length for a distribution

My goal is to draw a number of sample points, take their mean, and repeat this 6000 times, for each sample length, from a distribution. Basically:
Take sample lengths ranging from N = 1 to 500. For each sample length, draw 6000 samples and estimate the mean from each of the samples. Calculate the standard deviation of these means for each sample length, and show graphically that the decrease in standard deviation corresponds to a square-root reduction.
I am trying to do this with a gamma distribution, but all of my standard deviations are coming out as zero, and I'm not sure why.
This is the program so far:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import gamma
# now taking random gamma samples
stdevs = []
length = np.arange(1, 401, 1)
mean = []
for i in range(400):
    sample = np.random.gamma(shape=i, size=1000)
    mean.append(np.mean(sample))
    stdevs.append(np.std(mean))
# then trying to plot the standard deviations but it's just a line..
# thought there should be a decrease
plt.plot(length, stdevs, label='sampling')
plt.show()
I thought there should be a decrease in the standard deviation, not an increase. What might I be doing wrong when trying to draw 1000 samples from a gamma distribution and estimating the mean and standard deviation?
I think you are misusing shape. shape is the shape parameter of the distribution, not the number of independent draws.
import numpy as np
import matplotlib.pyplot as plt

# Reproducible
gen = np.random.default_rng(20210513)
# Generate 400 (max sample size) by 1000 (number of independent samples)
sample = gen.gamma(shape=2, size=(400, 1000))
# Use cumsum to compute the cumulative sum down each column
means = np.cumsum(sample, axis=0)
# Divide the cumulative sum by the number of observations used in each;
# a little care is needed to get broadcasting to work right
means = means / np.arange(1, 401)[:, None]
# Compute the std dev across the independent means in each row
stdevs = means.std(axis=1)
# Plot
plt.plot(np.arange(1, 401), stdevs, label='sampling')
plt.show()
This produces the picture below.
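To make the square-root law explicit, you could also overlay the theoretical standard error. For a gamma distribution with shape 2 and scale 1, the population standard deviation is sqrt(2), so the standard deviation of the mean of N draws should be sqrt(2/N). A quick check, continuing from the code above:
n = np.arange(1, 401)
plt.plot(n, stdevs, label='sampling')
plt.plot(n, np.sqrt(2 / n), label='theory: sqrt(2/N)')
plt.legend()
plt.show()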
The problem is with the line stdevs.append(np.std(sample.mean(axis=0))).
This takes the standard deviation of a single value, i.e. the mean of your sample array, so it will always be 0.
You need to pass np.std() all the values in your sample, not just their mean.
stdevs.append(np.std(sample)) will give you your array of standard deviations for each sampling.

Using scipy to fit CDF with real data, but CDF start not from 0

Here are my samples and my code for fitting the CDF.
import numpy as np
import pandas as pd
import scipy.stats as st
samples = [2,3,10,7,9,6,1,3,7,2,5,4,6,3,4,1,4,6,3,10,3,7,5,6,6,5,4,2,2,5,4,5,6,4,4,6,3,3,3,2,2,2,4,2,6,2,7,4,3,2,2,1,4,2,2,5,3,9,6,8,3,6,6,3,9,2,3,3,3,5,4,4,5,4,1,8,5,8,6,6,7,6,3,2,4,2,16,6,2,3,4,2,2,9,9,5,5,5,1,5,2,8,5,3,5,8,11,4,7,4,11,3,7,3,6,6,1,4,2,1,1,1,9,4,15,2,1,3,4,9,3,3,4,3,6,3,3,5,5,6,3,3,4,8,4,4,2,5,6,7,3,5,5,2,5,9,7,6,1,3,4,9,3,2,4,8,5,8,4,4,5,6,5,8,6,1,3,7,9,6,7,12,4,1,4,5,5,7,1,7,1,15,3,3,2,3,7,7,15,6,5,1,7,4,2,10,1,3,3,8,3,8,1,5,4,7,4,2,9,2,1,3,6,1,6,10,6,3,4,7,5,7,3,3,7,4,4,3,5,3,5,2,2,1,2,3,1,1,2,1,1,2,3,10,7,3,2,6,5,6,5,11,1,7,5,2,9,5,12,6,3,9,9,4,3,4,6,4,10,4,8,6,1,7,2,5,8,3,1,3,1,1,3,3,2,2,6,3,3,2,6,6,6,4,2,4,1,10,5,3,5,6,3,4,1,1,7,6,6,5,7,6,3,4,6,6,5,3,2,3,2,1,2,4,1,1,1,3,7,1,6,3,4,3,3,6,7,3,7,4,1,1,7,1,4,4,3,4,2,4,2,6,6,2,2,6,5,4,6,5,6,3,5,1,5,3,3,2,2,2,2,3,3,3,2,2,1,4,2,3,5,7,2,5,1,2,2,5,6,5,2,1,2,4,5,2,3,2,4,9,3,5,2,2,5,4,2,3,4,2,3,1,3,6,7,2,6,3,5,4,2,2,2,2,1,2,5,2,2,3,4,2,5,2,2,3,5,3,2,4,3,2,5,4,1,4,8,6,8,2,2,3,1,2,3,8,2,3,4,3,3,2,1,1,1,3,3,4,3,4,1,2,8,2,2,7,3,1,2,3,3,2,3,1,2,1,1,1,3,2,2,2,4,7,2,1,2,3,1,3,1,1,6,2,1,1,3,1,4,4,1,3,1,1,4,1,1,2,4,4,3,2,3,2,1,2,1,4,2,5,3,4,2,1,1,1,3,1,2,1,1,4,2,1,3,2,1,3,2,1,1,1,2,1,1,1,1,2,1,1,1,1,1,1,1]
bins=np.arange(1, 18, 0.1)
# Because min(samples) = 1, I start the bins from 1.
y, x = np.histogram(samples, bins=bins, density=True)
params = st.lognorm.fit(samples)
# Separate parts of parameters
arg = params[:-2]
loc = params[-2]
scale = params[-1]
ccdf = st.lognorm.cdf(x, loc=loc, scale=scale, *arg)
cdf = pd.Series(ccdf, x)
#cdf[1.0] is not 0... That is the issue...
When I print out the first value, cdf[1.0], it does not equal 0. According to theory, it should be 0. As the picture below shows, the first CDF value is not 0. I have checked my code again and again, but I cannot fix the problem. Any suggestions would be much appreciated.
In your code you are trying to plot a bar chart from your sample. That is fine, but what your graph shows is not a histogram; it is the distribution function of the sample. The code does not match the picture.
Here is the pdf graph and histogram.
Code for graph above:
# ... insert your sample and calculate the lognorm parameters (already in your code)
import matplotlib.pyplot as plt
x = np.linspace(min(samples), max(samples), 100)
pdf = st.lognorm.pdf(x, loc=loc, scale=scale, *arg)
plt.plot(x, pdf)
plt.hist(samples, bins=max(samples) - min(samples), density=True, alpha=0.75)
plt.show()
Your code also asks scipy to fit the CDF parameters, and scipy finds them; what you draw on the graph is exactly that fitted CDF. The point you are missing is that the fitted CDF need not be zero at the minimum value in the sample. The fit function only brings the approximating curve close to your sample; it does not produce a curve that exactly reproduces the empirical distribution function. Scipy simply considers that your sample could have contained values less than one, even though no such values occur in the data. Likewise, the fitted PDF says that values greater than 14 are extremely unlikely, yet your sample does contain values larger than that. As a result, the CDF should not be expected to equal zero at your point cdf[1.0].
P.S. The CDF will still be zero at zero if you evaluate it at that point.
Code for graph above:
# ... insert your sample and calculate the lognorm parameters (already in your code)
import matplotlib.pyplot as plt
x = np.linspace(0, max(samples), 100)
cdf = st.lognorm.cdf(x, loc=loc, scale=scale, *arg)
plt.plot(x, cdf)
plt.show()
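To see the point numerically, you can evaluate the fitted CDF directly; the exact numbers depend on the fitted parameters, but something like the following (continuing from the question's variables) illustrates it:
print(st.lognorm.cdf(1.0, loc=loc, scale=scale, *arg))  # small but positive: the model puts some mass below the sample minimum
print(st.lognorm.cdf(0.0, loc=loc, scale=scale, *arg))  # 0.0, provided the fitted loc is non-negative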

How do I use the Monte Carlo method to find the uncertainties of a value?

I am trying to solve a physics equation using a Monte Carlo simulation, which I know is a long way round (I just need to use it to learn about the method).
I have around 5 values, one of which is time, and I have the random uncertainties (errors) for each of these values. For example, a mass might be (10 ± 0.1) kg, where the error is 0.1 kg.
How do I actually find the distribution of measurements I would get if I performed this experiment, say, 5000 times?
I know I could make two arrays of errors and maybe put them in a function. But what am I supposed to do then? Do I put the errors into the equation, add the answer to the arrays, then put the changed array values back into the equation, and repeat this a thousand times? Or do I actually calculate the real value and add it to the array?
Please can you help me understand this.
Edit:
The problem I have is basically a sphere of density ds falling a distance l in time t through a liquid of density dl; these quantities fit into an equation for viscosity, and I need to find the distribution of viscosity measurements.
The exact equation shouldn't matter at all; whatever equation I have, I should be able to use a method like this to find the distribution of measurements, whether I'm dropping a ball out of a window or whatever.
Basic Monte Carlo is very straightforward. The following might get you started:
import random, statistics, math

# The following function generates a random observation of f(x), where
# x is a vector of independent normal variables whose means are given
# by the vector mus and whose standard deviations are given by sigmas:
def sample(f, mus, sigmas):
    x = (random.gauss(m, s) for m, s in zip(mus, sigmas))
    return f(*x)

# do this n times, returning the sample mean and standard deviation:
def monte_carlo(f, mus, sigmas, n):
    samples = [sample(f, mus, sigmas) for _ in range(n)]
    return (statistics.mean(samples), statistics.stdev(samples))

# for testing purposes:
def V(r, h):
    return math.pi * r**2 * h

print(monte_carlo(V, (2, 4), (0.02, 0.01), 1000))
With output:
(50.2497301631037, 1.0215188736786902)
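As a sanity check, standard first-order error propagation on V = pi*r²*h with r = 2 ± 0.02 and h = 4 ± 0.01 gives sigma_V = sqrt((2*pi*r*h*0.02)² + (pi*r²*0.01)²) ≈ 1.01, which agrees well with the simulated standard deviation of about 1.02 above.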
OK, let's try a simple example: you have an air gun which shoots balls with mass m and velocity v, and you have to measure the kinetic energy
E = m*v²/2
There is a distribution of velocities: Gaussian, with a mean value of 10 and a standard deviation of 1.
There is a distribution of masses, but we cannot make it Gaussian; let's assume it is a truncated normal with a lower limit of 1 (so that there are no negative values), loc equal to 5, and scale equal to 3.
So what we will do is: sample a velocity, sample a mass, use them to find the kinetic energy, do it many times, build the energy distribution, get its mean value, get its standard deviation, draw graphs, etc.
Some simple Python code
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import truncnorm

def sampleMass(n, low, high, mu, sigma):
    """
    Sample n mass values from a truncated normal.
    Note that truncnorm takes its bounds in standard-deviation units
    relative to loc, so the raw limits are standardized first.
    """
    a = (low - mu) / sigma
    b = (high - mu) / sigma
    tn = truncnorm(a, b, loc=mu, scale=sigma)
    return tn.rvs(n)

def sampleVelocity(n, mu, sigma):
    return np.random.normal(loc=mu, scale=sigma, size=n)

mass_low = 1.
mass_high = 1000.
mass_mu = 5.
mass_sigma = 3.0
vel_mu = 10.0
vel_sigma = 1.0
nof_trials = 100000

mass = sampleMass(nof_trials, mass_low, mass_high, mass_mu, mass_sigma)  # get samples of mass
vel = sampleVelocity(nof_trials, vel_mu, vel_sigma)                      # get samples of velocity
kinenergy = 0.5 * mass * vel * vel                                       # distribution of kinetic energy
print("Mean value and stddev of the final distribution")
print(np.mean(kinenergy))
print(np.std(kinenergy))
print("Min/max values of the final distribution")
print(np.min(kinenergy))
print(np.max(kinenergy))
# plot a histogram of the distribution
n, bins, patches = plt.hist(kinenergy, 100, density=True, facecolor='green', alpha=0.75)
plt.xlabel('Energy')
plt.ylabel('Probability')
plt.title('Kinetic energy distribution')
plt.grid(True)
plt.show()
with output like
Mean value and stddev of the final distribution
483.8162951263243
118.34049421853899
Min/max values of the final distribution
128.86671038372
1391.400187563612
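Applying the same recipe to the question's falling-sphere problem: sample each measured quantity from its error distribution, push every draw through the viscosity equation, and look at the resulting distribution. Here is a sketch assuming the standard Stokes falling-sphere relation eta = 2*g*r²*(ds − dl)*t / (9*l); all the numbers below are made up, so substitute your own measured values and uncertainties:
import numpy as np

rng = np.random.default_rng(1)
n = 5000  # number of simulated experiments

# measured values and their uncertainties (illustrative numbers only)
r  = rng.normal(1.0e-3, 0.05e-3, n)  # sphere radius (m)
ds = rng.normal(7800.0, 50.0, n)     # sphere density (kg/m^3)
dl = rng.normal(1260.0, 10.0, n)     # liquid density (kg/m^3)
l  = rng.normal(0.30, 0.005, n)      # fall distance (m)
t  = rng.normal(12.0, 0.2, n)        # fall time (s)
g  = 9.81

# propagate every simulated experiment through the equation
eta = 2 * g * r**2 * (ds - dl) * t / (9 * l)

print(eta.mean(), eta.std())  # centre and spread of the viscosity distribution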

Numpy: how to generate a random noisy curve resembling a "training curve"

I'd like to know how I can generate random data whose plot resembles a "training curve," by which I mean an array of training-loss values from a learning model. These typically have larger values and variance at the beginning and over time converge to some value with very little variance; the shape looks a bit like a noisy exponential decay.
This is the closest I've gotten to making random data that resembles a training curve. The problems are that the curve does not flatten out or converge like a true loss curve, and there is too much variance in the flatter part.
import numpy as np
import matplotlib.pyplot as plt
num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)
signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
And here is an actual training loss curve for reference.
I do not know enough about probability distributions to know if this is a stupid question. I only want to generate a random curve so that others have a data array to work with, to help me with another question I have about logarithmic plots in matplotlib.
Here is an illustration of how to do it with a gamma distribution for the noise:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

x = np.arange(2000)
y = 0.00025 + 0.001 * np.exp(-x/100.) + scipy.stats.gamma(3).rvs(len(x)) * (1 - np.exp(-x/100.)) * 2e-5
plt.plot(x, y)
plt.show()
You can adjust the parameters here to reduce the amount of noise, etc.
Seems like you could add a dampener to the noise that is proportional to how far along the x axis a given value is. In this case the variance would decrease as the curve gets flatter. Something like:
import numpy as np
import matplotlib.pyplot as plt

num_iters = 2000
rand_curve = np.sort(np.random.exponential(size=num_iters))[::-1]
noise = np.random.normal(0, 0.2, num_iters)

# damp each noise value in proportion to its position along the x axis
damper = 1.0 - np.arange(num_iters) / num_iters
noise = noise * damper

signal = rand_curve + noise
noisy_curve = signal[signal > 0]
plt.plot(noisy_curve, c='r', label='random curve')
plt.show()
Thus the noise values get lower the further along x you go, and it should achieve the result you want!

Plotting only one side of gaussian in Python using matplotlib and scipy

I have a set of points in the first quadrant that look like a Gaussian, and I am trying to fit them using a Gaussian in Python. My code is as follows:
import pylab as plb
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from numpy import asarray as ar, exp
import math
x=ar([37,69,157,238,274,319,391,495,533,626,1366,1855,2821,3615,4130,4374,6453,6863,7021,
7951,8646,9656,10464,11400])
y=ar([1.77,1.67,1.65,1.17,1.34,1.46,0.75,1,0.8,1.02,0.65,0.69,0.44,0.44,0.55,0.43,0.75,0.27,0.26,
0.44,0.04,0.44,0.26,0.04])
n = 24  # the number of data points
mean = sum(x*y)/n  # note this correction
sigma = math.sqrt(sum(y*(x-mean)**2)/n)  # note this correction

def gaus(x, a, x0, sigma):
    return a*exp(-(x-x0)**2/(2*sigma**2))

popt, pcov = curve_fit(gaus, x, y, p0=None, sigma=None)  # '''p0=[1,mean,sigma]'''
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
And the output is this figure:
http://s2.postimg.org/wevggkc95/Workspace_1_022.png
Why are all the red points coming out below the data? Also note that I am interested in a half-Gaussian, as my data is like that: my y values are large at first and then decrease, like one side of the Gaussian bell. Can anyone tell me how to fit this curve in Python (in case it cannot be fit to a Gaussian)? In other words, I want code to fit the half (left-side) Gaussian to my points (in the first quadrant only). Note that my points cannot be fit as an exponentially decreasing curve; I tried that earlier, and it does not fit well at lower x values.
Apparently your data do not fit well or easily to a Gaussian function. You use the default initial guesses p0 = [1,1,1], which are so far away from any kind of optimal choice that curve_fit gives up before it gets started (check the values of popt = [1,1,1] and pcov = [inf, inf, inf]). You could try better guesses (e.g. p0 = [2, 0, 2000]), but on my system the fit won't converge: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
To fit a "half-Gaussian", don't float the centre position x0 (just leave it equal to 0):
def gaus(x, a, sigma):
    return a*exp(-(x)**2/(2*sigma**2))

p0 = [1.2, 4000]
popt, pcov = curve_fit(gaus, x, y, p0=p0)
Unless you have a particular reason for wanting to fit a Gaussian, why not do a more robust linear least squares fit to a polynomial, e.g.:
pfit = np.polyfit(x, y, 3)
poly = np.poly1d(pfit)
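For completeness, a small sketch of how the polynomial alternative could be plotted over the data (assuming x and y from the question, with numpy imported as np; the degree 3 here is just an example):
import numpy as np
import matplotlib.pyplot as plt

pfit = np.polyfit(x, y, 3)  # least-squares cubic fit
poly = np.poly1d(pfit)      # convenience wrapper for evaluating the polynomial

xs = np.linspace(x.min(), x.max(), 200)
plt.plot(x, y, 'b+', label='data')
plt.plot(xs, poly(xs), 'r-', label='cubic fit')
plt.legend()
plt.show()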
