I want to create a data set with a specific mean and standard deviation.
Using np.random.normal() only gives me an approximation; for what I want to test I need the exact mean and standard deviation.
I have tried using a combination of norm.pdf and np.linspace, but the generated data set doesn't match up either (it could just be me misusing it, though).
It really doesn't matter whether the data set is random or not, as long as I can set a specific sample size, mean, and standard deviation.
Help would be much appreciated.
The easiest approach is to generate some zero-mean samples with roughly the desired standard deviation, subtract the sample mean so they are truly zero-mean, scale them so the standard deviation is spot on, and finally add the desired mean.
Here is some example code:
import numpy as np
num_samples = 1000
desired_mean = 50.0
desired_std_dev = 10.0
samples = np.random.normal(loc=0.0, scale=desired_std_dev, size=num_samples)
actual_mean = np.mean(samples)
actual_std = np.std(samples)
print("Initial samples stats : mean = {:.4f} stdv = {:.4f}".format(actual_mean, actual_std))
zero_mean_samples = samples - (actual_mean)
zero_mean_mean = np.mean(zero_mean_samples)
zero_mean_std = np.std(zero_mean_samples)
print("True zero samples stats : mean = {:.4f} stdv = {:.4f}".format(zero_mean_mean, zero_mean_std))
scaled_samples = zero_mean_samples * (desired_std_dev/zero_mean_std)
scaled_mean = np.mean(scaled_samples)
scaled_std = np.std(scaled_samples)
print("Scaled samples stats : mean = {:.4f} stdv = {:.4f}".format(scaled_mean, scaled_std))
final_samples = scaled_samples + desired_mean
final_mean = np.mean(final_samples)
final_std = np.std(final_samples)
print("Final samples stats : mean = {:.4f} stdv = {:.4f}".format(final_mean, final_std))
Which produces output similar to this:
Initial samples stats : mean = 0.2946 stdv = 10.1609
True zero samples stats : mean = 0.0000 stdv = 10.1609
Scaled samples stats : mean = 0.0000 stdv = 10.0000
Final samples stats : mean = 50.0000 stdv = 10.0000
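One caveat worth adding (my note, not part of the original answer): np.std defaults to the population standard deviation (ddof=0). If you need the sample standard deviation (ddof=1) to hit the target exactly, scale by that instead, e.g.:
import numpy as np
num_samples, desired_mean, desired_std_dev = 1000, 50.0, 10.0
samples = np.random.normal(0.0, desired_std_dev, num_samples)
samples = samples - samples.mean()                           # exact zero mean
samples = samples * (desired_std_dev / samples.std(ddof=1))  # exact sample std (ddof=1)
samples = samples + desired_mean                             # exact mean
print(samples.mean(), samples.std(ddof=1))  # 50.0 and 10.0, up to floating-point rounding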
For others seeing this later, Python 3.8+ has the statistics.NormalDist class for exactly this purpose:
import statistics as s
n = s.NormalDist(mu=10, sigma=2)
samples = n.samples(100_000, seed=42) # remove seed if desired
print(s.mean(samples)) # 10.004521585462394
print(s.stdev(samples)) # 2.0052615406360457
Methods from @Spoonless's answer can be used to tweak the exact mean and stdev of the samples if needed, or one can just use a large enough number of samples to get exceedingly close -- this is statistics, after all.
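For example, here is a small sketch (my own, combining the two answers) that nails the sample mean and stdev exactly after sampling from NormalDist:
import statistics as s
n = s.NormalDist(mu=10, sigma=2)
samples = n.samples(100_000, seed=42)
# Shift and rescale so the sample statistics are exact, as in the accepted answer.
m, sd = s.mean(samples), s.stdev(samples)
samples = [(x - m) * (2 / sd) + 10 for x in samples]
print(s.mean(samples), s.stdev(samples))  # 10.0 and 2.0, up to floating-point rounding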
You can also do this with the random library.
import random as rand
mean = 20.9
stdd = 3
samples = 1000
data = [rand.normalvariate(mean, stdd) for i in range(samples)]
I also needed to generate data with residuals, so I simply added the product of rand.randrange(-1, 1) with the residual.
data = [rand.normalvariate(mean, stdd) + (rand.randrange(-1, 1) * residual) for i in range(samples)]
Note by adding residuals you will throw off the exact mean and stdd slightly.
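If the exact statistics matter, one option (my suggestion, not part of this answer) is to re-apply the shift-and-scale trick from the accepted answer after the residuals are added:
import random as rand
mean, stdd, samples = 20.9, 3, 1000
residual = 0.5  # hypothetical value, just for illustration
data = [rand.normalvariate(mean, stdd) + (rand.randrange(-1, 1) * residual) for i in range(samples)]
# Re-centre and re-scale so the sample mean and (population) std are exact again.
m = sum(data) / len(data)
sd = (sum((x - m) ** 2 for x in data) / len(data)) ** 0.5
data = [(x - m) * (stdd / sd) + mean for x in data]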
I am trying to set the NumPy random seed to 150 so that I get the same results each run.
I tried
simulations = np.random.seed(simulations)
but that gave me a float error.
# Generate sales samples
simulations = 10000
sales_sims = np.random.normal(sales_mean, sales_std, simulations)
print(sales_sims)
print("mean:", np.mean(sales_sims))
print("std:", np.std(sales_sims))
My question: how do I set the NumPy random seed for my 10,000 simulations?
Here’s a guide on setting the random state for numpy. What does numpy.random.seed(0) do?
np.random.seed(150) #set seed
sales_mean = 0 # replace with desired value
sales_std = 1 # replace with desired value
#Generate sales samples
simulations = 10000
sales_sims = np.random.normal(sales_mean, sales_std, simulations)
print(sales_sims)
print("mean:", np.mean(sales_sims))
print("std:", np.std(sales_sims))
Suppose I have a sample which I have reason to believe follows an exponential distribution. I want to estimate the distribution's parameter (lambda) and some indication of the confidence. Either the confidence interval or the standard error would be fine. Sadly, scipy.stats.expon.fit does not seem to permit that. Here's an example; I will use lambda=1/120=0.008333 for the test data:
""" Generate test data"""
import scipy.stats
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
""" Scipy.stats fit"""
fit = scipy.stats.expon.fit(test_data)
print("\nExponential parameters:", fit, " Specifically lambda: ", 1/fit[1], "\n")
# Exponential parameters: (0.0066790678905608875, 116.8376079908356) Specifically lambda: 0.008558887991599736
The answer to a similar question about gamma-distributed data suggests using GenericLikelihoodModel from the statsmodels module. While I can confirm that this works nicely for gamma-distributed data, it does not work for exponential distributions, because the optimization apparently results in a non-invertible Hessian matrix. This results either from non-finite elements in the Hessian or from np.linalg.eigh producing non-positive eigenvalues for the Hessian. (Source code here; HessianInversionWarning is raised in the fit method of the LikelihoodModel class.)
""" Statsmodel fit"""
from statsmodels.base.model import GenericLikelihoodModel
class Expon(GenericLikelihoodModel):
nparams = 2
def loglike(self, params):
return scipy.stats.expon.logpdf(self.endog, *params).sum()
res = Expon(test_data).fit(start_params=fit)
res.df_model = len(fit)
res.df_resid = len(test_data) - len(fit)
print(res.summary())
#Optimization terminated successfully.
# Current function value: 5.760785
# Iterations: 38
# Function evaluations: 76
#/usr/lib/python3.8/site-packages/statsmodels/tools/numdiff.py:352: RuntimeWarning: invalid value encountered in double_scalars
# hess[i, j] = (f(*((x + ee[i, :] + ee[j, :],) + args), **kwargs)
#/usr/lib/python3.8/site-packages/statsmodels/base/model.py:547: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
# warn('Inverting hessian failed, no bse or cov_params '
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in greater
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:903: RuntimeWarning: invalid value encountered in less
# return (a < x) & (x < b)
#/usr/lib/python3.8/site-packages/scipy/stats/_distn_infrastructure.py:1912: RuntimeWarning: invalid value encountered in less_equal
# cond2 = cond0 & (x <= _a)
# Expon Results
#==============================================================================
#Dep. Variable: y Log-Likelihood: -17282.
#Model: Expon AIC: 3.457e+04
#Method: Maximum Likelihood BIC: 3.459e+04
#Date: Thu, 06 Aug 2020
#Time: 13:55:24
#No. Observations: 3000
#Df Residuals: 2998
#Df Model: 2
#==============================================================================
# coef std err z P>|z| [0.025 0.975]
#------------------------------------------------------------------------------
#par0 0.0067 nan nan nan nan nan
#par1 116.8376 nan nan nan nan nan
#==============================================================================
This seems to happen every single time, so it might be something related to exponentially-distributed data.
Is there some other possible approach? Or am I maybe missing something or doing something wrong here?
Edit: It turns out I was doing something wrong; namely, I had mistakenly written
test_data = scipy.stats.expon(120).rvs(size=3000)
instead of
test_data = scipy.stats.expon(scale=120).rvs(size=3000)
and was correspondingly looking at the first element of the fit tuple when I should have been looking at the second.
As a result, the two other options I considered (manually computing the fit and confidence intervals following the standard procedure described on Wikipedia, and using scikits.bootstrap as suggested in this answer) actually do work and are part of the solution I'll add in a minute, not of the problem.
As mentioned in the edited question, part of the problem was that I was looking at the wrong parameter when creating the sample and again in the fit.
What remains is that scipy.stats.expon.fit does not offer the possibility of computing confidence intervals or standard errors, and that using GenericLikelihoodModel from the statsmodels module as suggested here fails because of a malformed Hessian.
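As an aside (my own sketch, not part of the original solution): the Hessian trouble seems to come from the two-parameter fit, where the log-likelihood is linear in the loc parameter (its second derivative with respect to loc is zero), so the Hessian cannot be negative definite. Fixing loc at zero and estimating only the scale may avoid this, assuming the data really start at zero:
""" Statsmodels fit with loc fixed at zero (sketch) """
import scipy.stats
from statsmodels.base.model import GenericLikelihoodModel
class ExponScale(GenericLikelihoodModel):
    # Only the scale (1/lambda) is estimated, so the Hessian is a well-behaved 1x1 matrix.
    def loglike(self, params):
        return scipy.stats.expon.logpdf(self.endog, loc=0.0, scale=params[0]).sum()
res_scale = ExponScale(test_data).fit(start_params=[test_data.mean()])
print("scale:", res_scale.params[0], "std err:", res_scale.bse[0])
print("implied lambda:", 1 / res_scale.params[0])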
In any case, there are three approaches that do work:
1. Using the simple inference procedure for confidence intervals for exponential data, as given in the Wikipedia article
""" Maximum likelihood"""
import numpy as np
ML_lambda = 1 / np.mean(test_data)
print("\nML lambda: {0:8f}".format(ML_lambda))
#ML lambda: 0.008558
""" Bias corrected ML"""
ML_BC_lambda = ML_lambda - ML_lambda / (len(test_data) - 1)
print("\nML bias-corrected lambda: {0:8f}".format(ML_BC_lambda))
#ML bias-corrected lambda: 0.008556
Computation of the confidence intervals:
""" Maximum likelihood 95% confidence"""
CI_distance = ML_BC_lambda * 1.96/(len(test_data)**0.5)
print("\nLambda with confidence intervals: {0:8f} +/- {1:8f}".format(ML_BC_lambda, CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(ML_BC_lambda - CI_distance, ML_BC_lambda + CI_distance))
#Lambda with confidence intervals: 0.008556 +/- 0.000306
#Confidence intervals: (0.008249, 0.008862)
A second option: the confidence interval equation should also be valid for a lambda estimate produced by a different method, such as the one from scipy.stats.expon.fit. (I thought that the fitting procedure in scipy.stats.expon.fit was more reliable, but it turns out it is actually the same, just without the bias correction (see above).)
""" Maximum likelihood 95% confidence based on scipy.stats fit"""
scipy_stats_lambda = 1 / fit[1]
scipy_stats_CI_distance = scipy_stats_lambda * 1.96/(len(test_data)**0.5)
print("\nOr, based on scipy.stats fit:")
print("Lambda with confidence intervals: {0:8f} +/- {1:8f}".format(scipy_stats_lambda, scipy_stats_CI_distance))
print("Confidence intervals: ({0:8f}, {1:9f})".format(scipy_stats_lambda - scipy_stats_CI_distance,
scipy_stats_lambda + scipy_stats_CI_distance))
#Or, based on scipy.stats fit:
#Lambda with confidence intervals: 0.008559 +/- 0.000306
#Confidence intervals: (0.008253, 0.008865)
2. Bootstrapping with scikits.bootstrap following the suggestion in this answer
This yields an InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal), so this should be treated a bit skeptically.
""" Bootstrapping with scikits"""
print("\n")
import scikits.bootstrap as boot
bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
print(bootstrap_result)
#tmp/expon_fit_test.py:53: InstabilityWarning: Some values were NaN; results are probably unstable (all values were probably equal)
# bootstrap_result = boot.ci(test_data, scipy.stats.expon.fit)
#[[6.67906789e-03 1.12615588e+02]
# [6.67906789e-03 1.21127091e+02]]
3. Using rpy2
""" Using r modules with rpy2"""
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
MASS = importr('MASS')
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
rpy_fit = MASS.fitdistr(test_data, "exponential")
rpy_estimate = rpy_fit.rx("estimate")[0][0]
rpy_sd = rpy_fit.rx("sd")[0][0]
rpy_lower = rpy_estimate - 2*rpy_sd
rpy_upper = rpy_estimate + 2*rpy_sd
print("\nrpy2 fit: \nLambda={0:8f} +/- {1:8f}, CI: ({2:8f}, {3:8f})".format(rpy_estimate, rpy_sd, rpy_lower, rpy_upper))
#rpy2 fit:
#Lambda=0.008558 +/- 0.000156, CI: (0.008246, 0.008871)
You have already found a solution to your problem, but here is a solution based on OpenTURNS. I think it relies on bootstrapping under the hood.
OpenTURNS requires you to reshape the data so that it is clear we are dealing with 3000 one-dimensional points and not a single 3000-dimensional point.
test_data = test_data.reshape(-1, 1)
The rest is rather straightforward.
import openturns as ot
confidence_level = 0.9
params_ci = (ot.ExponentialFactory()
             .buildEstimator(test_data)
             .getParameterDistribution()
             .computeBilateralConfidenceInterval(confidence_level))
lambda_ci = [params_ci.getLowerBound()[0], params_ci.getUpperBound()[0]]
# the index 0 means we are interested in the CI on lambda
print(lambda_ci)
I got the following output (but it depends on the random seed):
[0.008076302149561718, 0.008688296487447742]
I'm testing the ARCH package to forecast the variance (standard deviation) of two series using GARCH(1,1).
This is the first part of my code
import pandas as pd
import numpy as np
from arch import arch_model
returns = pd.read_csv('ret_full.csv', index_col=0)
returns.index = pd.to_datetime(returns.index)
Ibovespa Returns
The first series is the 1st futures contract on the Ibovespa Index; its observed annualized volatility is really close to the GARCH forecast.
The first problem that I've found is that you need to rescale your sample by 100. To do this, you can multiply your return series by 100 or set the parameter rescale=True in the arch_model function.
Why is it necessary to do this?
# Ibov
ret_ibov = returns['IBOV_1st']
model_ibov = arch_model(ret_ibov, vol='Garch', p=1, o=0, q=1, dist='Normal', rescale=True)
res_ibov = model_ibov.fit()
After fitting the model I forecast the variance (just 5 steps to illustrate the problem), take the standard deviation, and annualize it. Note: since I had to rescale my return series, I divide my forecast by 10000 (100**2, because of the rescaling).
# Forecast
forecast_ibov = res_ibov.forecast(horizon=5)
# Getting Annualized Standard Deviation
# Garch Vol
vol_ibov_for = (forecast_ibov.variance.iloc[-1]/10000)**0.5 * np.sqrt(252) * 100
# Observed Vol
vol_ibov = ret_ibov.std() * np.sqrt(252) * 100
And that's the forecast output
vol_ibov_for
h.1 24.563208
h.2 24.543245
h.3 24.523969
h.4 24.505357
h.5 24.487385
That is really close to the observed vol of 23.76.
This is the result I was expecting.
IRFM Returns
When I do exactly the same process on a less volatile series, I get a really weird result.
# IRFM
ret_irfm = returns['IRFM1M']
model_irfm = arch_model(ret_irfm, vol='Garch', p=1, o=0, q=1, dist='Normal', rescale=True)
res_irfm = model_irfm.fit()
# Forecast
forecasts_irfm = res_irfm.forecast(horizon=5)
# Getting Annualized Standard Deviation
# Garch Vol
vol_irfm_for = (forecasts_irfm.variance.iloc[-1]/10000)**0.5 * np.sqrt(252) * 100
# Observed Vol
vol_irfm = ret_irfm.std() * np.sqrt(252) * 100
Forecast output:
vol_irfm_for
h.1 47.879679
h.2 49.322351
h.3 50.519282
h.4 51.517356
h.5 52.352894
This is significantly different from the observed volatility of 5.39.
Why is this happening? Maybe because of the rescaling? Do I have to make another adjustment before the forecast?
Thanks
Found the answer.
The rescale=True argument is used when the model fails to converge to a result, so rescaling can be a solution to that problem. If the model doesn't need rescaling, the parameter does nothing even when it is set to True.
Point of attention: if rescale=True and the series was in fact rescaled, it's necessary to adjust the outputs. In my question I was confused about how high my volatility was; that's because I was assuming that my rescale value was 100, which is not necessarily true.
The correct thing to do is to set the parameter to True and read the actual rescale value from the fitted result afterwards.
To do this, you just need to insert the following code:
# IRFM
ret_irfm = returns['IRFM1M']
model_irfm = arch_model(ret_irfm, vol='Garch', p=1, o=0, q=1, dist='Normal', rescale=True, mean='Zero')
res_irfm = model_irfm.fit()
scale = res_irfm.scale # New part of the code
# Forecast
forecasts_irfm = res_irfm.forecast(horizon=5)
# Getting Annualized Standard Deviation
# Garch Vol
# New part of the code: Divide variance by scale^2
vol_irfm_for = (forecasts_irfm.variance.iloc[-1] / np.power(scale, 2))**0.5 * np.sqrt(252) * 100
# Observed Vol
vol_irfm = ret_irfm.std() * np.sqrt(252) * 100
Hope this helps other users with the same problem. It's a really simple thing.
Thanks.
I am building a neural network that makes use of T-distribution noise. I am using functions defined in the numpy library np.random.standard_t and the one defined in tensorflow tf.distributions.StudentT. The link to the documentation of the first function is here and that to the second function is here. I am using the said functions like below:
a = np.random.standard_t(df=3, size=10000) # numpy's function
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
sess = tf.Session()
b = sess.run(t_dist.sample(10000))
In the documentation provided for the Tensorflow implementation, there's a parameter called scale whose description reads
The scaling factor(s) for the distribution(s). Note that scale is not technically the standard deviation of this distribution but has semantics more similar to standard deviation than variance.
I have set scale to be 1.0 but I have no way of knowing for sure if these refer to the same distribution.
Can someone help me verify this? Thanks
I would say they are, as their sampling is defined in almost the exact same way in both cases. This is how the sampling of tf.distributions.StudentT is defined:
def _sample_n(self, n, seed=None):
# The sampling method comes from the fact that if:
# X ~ Normal(0, 1)
# Z ~ Chi2(df)
# Y = X / sqrt(Z / df)
# then:
# Y ~ StudentT(df).
seed = seed_stream.SeedStream(seed, "student_t")
shape = tf.concat([[n], self.batch_shape_tensor()], 0)
normal_sample = tf.random.normal(shape, dtype=self.dtype, seed=seed())
df = self.df * tf.ones(self.batch_shape_tensor(), dtype=self.dtype)
gamma_sample = tf.random.gamma([n],
0.5 * df,
beta=0.5,
dtype=self.dtype,
seed=seed())
samples = normal_sample * tf.math.rsqrt(gamma_sample / df)
return samples * self.scale + self.loc # Abs(scale) not wanted.
So it is a standard normal sample divided by the square root of a chi-square sample with parameter df divided by df. The chi-square sample is taken as a gamma sample with parameter 0.5 * df and rate 0.5, which is equivalent (chi-square is a special case of gamma). The scale value, like the loc, only comes into play in the last line, as a way to "relocate" the distribution sample at some point and scale. When scale is one and loc is zero, they do nothing.
Here is the implementation for np.random.standard_t:
double legacy_standard_t(aug_bitgen_t *aug_state, double df) {
double num, denom;
num = legacy_gauss(aug_state);
denom = legacy_standard_gamma(aug_state, df / 2);
return sqrt(df / 2) * num / sqrt(denom);
}
So it is essentially the same thing, slightly rephrased. Here we also have a gamma with shape df / 2, but it is standard (rate one). However, the missing 0.5 now appears in the numerator, as the / 2 inside the sqrt. So it's just moving the numbers around. Here there is no scale or loc, though.
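To make the equivalence concrete, here is a small NumPy sketch (my addition, not part of the answer) that builds a sample the same way and compares it with np.random.standard_t:
import numpy as np
rng = np.random.default_rng(0)
df, n = 3.0, 100_000
# X ~ Normal(0, 1), G ~ Gamma(shape=df/2, rate=1/2) (i.e. Chi2(df)), Y = X / sqrt(G / df)
x = rng.normal(size=n)
g = rng.gamma(shape=df / 2.0, scale=2.0, size=n)  # scale=2 is the same as rate=0.5
y = x / np.sqrt(g / df)
t = rng.standard_t(df, size=n)
print(np.percentile(y, [5, 25, 50, 75, 95]))
print(np.percentile(t, [5, 25, 50, 75, 95]))  # the quantiles should be very close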
In truth, the only additional thing the TensorFlow version does is apply loc and scale in that last line, giving a shifted and scaled t-distribution; with loc=0.0 and scale=1.0 it reduces to the standard one. A simple empirical check that they are the same in that case is to plot histograms for both distributions and see how close they look.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
np.random.seed(0)
t_np = np.random.standard_t(df=3, size=10000)
with tf.Graph().as_default(), tf.Session() as sess:
tf.random.set_random_seed(0)
t_dist = tf.distributions.StudentT(df=3.0, loc=0.0, scale=1.0)
t_tf = sess.run(t_dist.sample(10000))
plt.hist((t_np, t_tf), np.linspace(-10, 10, 20), label=['NumPy', 'TensorFlow'])
plt.legend()
plt.tight_layout()
plt.show()
Output:
That looks pretty close. Obviously, from the point of view of statistical samples, this is not any kind of proof. If you are still not convinced, there are statistical tools for testing whether a sample comes from a certain distribution, or whether two samples come from the same distribution.
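For instance (my addition, not part of the original answer), a two-sample Kolmogorov-Smirnov test from SciPy can compare the two samples generated above directly:
from scipy import stats
# A large p-value is consistent with both samples coming from the same distribution.
statistic, p_value = stats.ks_2samp(t_np, t_tf)
print(statistic, p_value)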
For a series of angle values in the (-pi, pi) range, I make a histogram. Is there an effective way to calculate the mean and the modal (most probable) value? Consider the following example:
import numpy as N, cmath
deg = N.pi/180.
d = N.array([-175., 170, 175, 179, -179])*deg
i = N.sum(N.exp(1j*d))
ave = cmath.phase(i)
i /= float(d.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print(ave/deg, stdev/deg)
Now, let's have a histogram:
counts, bins = N.histogram(d, N.linspace(-N.pi, N.pi, 360))
Is it possible to calculate the mean and mode from counts and bins? For non-periodic data, calculating a mean is straightforward:
ave = sum(counts*bins[:-1])
Calculating a modal value requires more effort. Actually, I'm not sure my code below is correct: first I identify the bins which occur most frequently, and then I calculate their arithmetic mean:
cmax = counts.max()
mode = N.mean(N.take(bins, N.nonzero(counts == cmax)[0]))
I have no idea how to calculate the standard deviation from such data, though. One obvious solution to all my problems (at least those described above) is to convert the histogram data back into a data series and use that in the calculations. This is neither elegant nor efficient, however.
Any hints will be very appreciated.
This is the partial solution I wrote.
import numpy as N, cmath
import scipy.stats as ST
d = [-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2]
deg = N.pi/180.
data = N.array(d)*deg
i = N.sum(N.exp(1j*data))
ave = cmath.phase(i) # correct and exact mean for periodic data
wrong_ave = N.mean(d)
i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
wrong_stdev = N.std(d)
bins = N.linspace(-N.pi, N.pi, 360)
counts, bins = N.histogram(data, bins, density=False)
# consider it weighted vector addition
nz = N.nonzero(counts)[0]
weight = counts[nz]
i = N.sum(weight * N.exp(1j*bins[nz])/len(nz))
pave = cmath.phase(i) # correct and approximated mean for periodic data
i /= sum(weight)/float(len(nz))
pstdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print()
print('scipy: %12.3f (mean) %12.3f (stdev)' % (ST.circmean(data)/deg,
                                                ST.circstd(data)/deg))
When run, it gives following results:
mean: 175.840 85.843 175.360
stdev: 0.472 151.785 0.430
scipy: 175.840 (mean) 3.673 (stdev)
A few comments now: the first column gives the calculated mean/stdev. As can be seen, the mean agrees well with scipy.stats.circmean (thanks @JoeKington for pointing it out); unfortunately the stdev differs, and I will look at it later. The second column gives completely wrong results (the non-periodic mean/std from numpy obviously does not work here). The 3rd column gives what I wanted to obtain from the histogram data (@JoeKington: my raw data won't fit in my computer's memory; @dmytro: thanks for your input: of course the bin size will influence the result, but in my application I don't have much choice, i.e. I have to reduce the data somehow). As can be seen, the mean (3rd column) is properly calculated; the stdev needs further attention :)
Have a look at scipy.stats.circmean and scipy.stats.circstd.
Or do you only have the histogram counts, and not the "raw" data? If so, you could fit a Von Mises distribution to your histogram counts and approximate the mean and stddev in that way.
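If only the counts and bins are available, a simpler sketch (mine, not a Von Mises fit) is to treat the bin centres as weighted angles and apply the usual circular-statistics definitions:
import numpy as np
# counts, bins as returned by np.histogram on the angle data
centres = 0.5 * (bins[:-1] + bins[1:])
R = np.sum(counts * np.exp(1j * centres)) / np.sum(counts)  # mean resultant vector
circ_mean = np.angle(R)
circ_std = np.sqrt(-2.0 * np.log(np.abs(R)))  # standard circular-std definition
print(circ_mean, circ_std)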
Here's how to get an approximation.
Since Var(x) = <x^2> - <x>^2, we have:
meanX = N.sum(counts * bins[:-1]) / N.sum(counts)
meanX2 = N.sum(counts * bins[:-1]**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)