I am trying to make a Q-Q plot in Python. I was looking at scipy.stats.probplot, and the input seems to be the measurements compared against a normal distribution.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
and in my code, I had
stats.probplot(mean, dist="norm", plot=plt)
to compare distributions.
But I am wondering where I can input the standard deviation. I thought that was a very important factor when comparing distributions, but so far I can only input the mean.
Thanks
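For what it is worth, scipy.stats.probplot accepts location and scale through its sparams argument; here is a minimal sketch (the loc/scale values are simply the ones used to generate the example data above):
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
measurements = np.random.normal(loc=20, scale=5, size=100)
# sparams holds the distribution parameters (here loc=20, scale=5),
# so the theoretical quantiles come from N(20, 5) rather than N(0, 1)
stats.probplot(measurements, sparams=(20, 5), dist="norm", plot=plt)
plt.show()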
Let's suppose you have a list of floats:
X = [-1.31,
4.82,
2.18,
1.99,
4.37,
2.58,
7.22,
3.93,
6.95,
2.41,
2.02,
2.48,
-1.01,
2.3,
2.87,
-0.06,
2.13,
3.62,
5.24,
0.57]
If you want to make a Q-Q plot you need to compare X against a distribution.
For example, N(0, 1), a normal distribution with mean = 0 and sigma = 1.
In OpenTURNS, it goes like this:
import openturns as ot
from openturns.viewer import View  # needed to display the graph
sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0, 1))
View(graph)
Explanation: I tell OpenTURNS that I have a sample of 20 points [p] coming from X, not one point in dimension 20. Then I call ot.VisualTest.DrawQQplot with two arguments: the sample and the normal distribution N(0, 1), i.e. ot.Normal(0, 1).
We see on the graph that the test fails:
The question now is: what is the best Normal Distribution fitting the sample?
Thanks to NormalFactory() the answer is simple:
BestNormalDistribution = ot.NormalFactory().build(sample)
If you print(BestNormalDistribution) you get the parameters of this distribution:
Normal(mu = 2.76832, sigma = 2.27773)
If we repeat the Q-Q plot of the sample against BestNormalDistribution, the result is much better.
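A minimal sketch of that second plot, reusing the sample and BestNormalDistribution objects defined above:
graph = ot.VisualTest.DrawQQplot(sample, BestNormalDistribution)
View(graph)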
I would like to apply the best-fit CDF found by Fitter to each value in a number of pandas DataFrame columns, ideally by passing the Fitter results to scipy.stats (or another library?).
I can get the distribution function easily enough from Fitter with the following code:
import numpy as np
import pandas as pd
import seaborn as sns
from fitter import Fitter
from fitter import get_common_distributions
from fitter import get_distributions
dataset = pd.read_csv("econ.csv")
dataset.head()
sns.set_style('white')
sns.set_context("paper", font_scale = 2)
sns.displot(data = dataset, x = "Value_1",kind = "hist", bins = 100, aspect = 1.5)
spac = dataset['Value_1'].values
f = Fitter(spac, distributions=get_distributions())
f.fit()
f.summary()
f.get_best(method='sumsquare_error')
This provides me with an output for Value_1:
{'norminvgauss': {'a': 1.87,
'b': -0.65,
'loc': 0.46,
'scale': 1.24}}
Now this is where I am stuck:
Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?
The dataset columns range from Value_1 to Value_99, with about 400 rows. Once I know how to feed the Fitter results back into scipy.stats, I should be able to write a simple for loop to apply this over each column.
An example of the result would be like:
ID     Value1    CDF_BestFit_Value1
n      0.9       0.33
n+1    0.7       0.07
Much appreciated in advance to anyone who is able to help with this.
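A minimal sketch of how the Fitter output could be fed back into scipy.stats (assuming the norminvgauss parameters above; the small DataFrame is a hypothetical stand-in for the real econ.csv data):
import pandas as pd
import scipy.stats as st
# hypothetical stand-in for the DataFrame read from econ.csv
dataset = pd.DataFrame({"Value_1": [0.9, 0.7, 1.4, 0.2]})
# parameters as returned by f.get_best(method='sumsquare_error') in the question
best = {'norminvgauss': {'a': 1.87, 'b': -0.65, 'loc': 0.46, 'scale': 1.24}}
dist_name, params = next(iter(best.items()))
dist = getattr(st, dist_name)  # look up the scipy.stats distribution by name
dataset['CDF_BestFit_Value1'] = dist.cdf(dataset['Value_1'], **params)
print(dataset)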
Given an array of values, I want to be able to fit a density function to it and find the pdf of an arbitrary input value. Is this possible, and how would I go about it? There aren't necessarily assumptions of normality, and I don't need the function itself.
For instance, given:
x = array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
I would like to find how a value of 0.3 compares to all of the above, and what percent of the above values it is greater than.
I personally like using the scipy.stats package. It has a useful implementation of kernel density estimation (KDE). Basically, it estimates the probability density function of the data as a combination of Gaussian kernels; the bandwidth (smoothing) is a parameter you can set. Look at the documentation and related examples here: https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#kernel-density-estimation
And for more about KDE: https://en.wikipedia.org/wiki/Kernel_density_estimation
Once you have built your KDE, you can perform operations on it to get probabilities. For example, to calculate the probability that a value as large as or larger than 0.3 occurs, you would do the following:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
kde = stats.gaussian_kde(np.array(x))
# visualize the KDE
fig = plt.figure()
ax = fig.add_subplot(111)
x_eval = np.linspace(-1.5, 3, num=200)  # span the range of the data
ax.plot(x_eval, kde(x_eval), 'k-')
# probability that a value >= 0.3 occurs
kde.integrate_box_1d(0.3, np.inf)
TLDR:
Calculate a KDE, then use the KDE as if it were a PDF.
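As a small follow-up to the original question (what fraction of the values 0.3 is greater than), the same kde object also gives the probability mass below 0.3:
# estimated probability that a value is smaller than 0.3,
# i.e. the fraction of the distribution that 0.3 exceeds
p_below = kde.integrate_box_1d(-np.inf, 0.3)
print(p_below)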
You can use OpenTURNS for that: Gaussian kernel smoothing does it easily. From the docs:
import openturns as ot
kernel = ot.KernelSmoothing()
estimated = kernel.build(x)
That's it, now you have a distribution object :)
This library is very cool for statistics! (I am not related to them).
We first have to create a Sample from the NumPy array.
Then we compute the complementary CDF with the computeComplementaryCDF method of the distribution (a small improvement over Yoda's answer).
import numpy as np
x = np.array([ 0.62529759, -0.08202699, 0.59220673, -0.09074541, 0.05517865,
0.20153703, 0.22773723, -0.26229708, 0.76137555, -0.61229314,
0.27292745, 0.35596795, -0.01373896, 0.32464979, -0.22932331,
1.14796175, 0.17268531, 0.40692172, 0.13846154, 0.22752953,
0.13087359, 0.14111479, -0.09932381, 0.12800392, 0.02605917,
0.18776078, 0.45872642, -0.3943505 , -0.0771418 , -0.38822433,
-0.09171721, 0.23083624, -0.21603973, 0.05425592, 0.47910286,
0.26359565, -0.19917942, 0.40182097, -0.0797546 , 0.47239264,
-0.36654449, 0.4513859 , -0.00282486, -0.13950512, -0.05375369,
0.03331833, 0.48951555, -0.13760504, 2.788 , -0.15017848,
0.02930675, 0.10910646, 0.03868301, -0.048482 , 0.7277376 ,
0.08841259, -0.10968462, 0.50371324, 0.86379698, 0.01674877,
0.19542421, -0.06639165, 0.74500856, -0.10148342, 0.02482331,
0.79195804, 0.40401969, 0.25120005, 0.21020794, -0.01767013,
-0.13453783, -0.09605592, -0.88044229, 0.04689623, 0.09043851,
0.21232286, 0.34129982, -0.3736799 , 0.17313858])
import openturns as ot
kernel = ot.KernelSmoothing()
sample = ot.Sample(x,1)
distribution = kernel.build(sample)
q = distribution.computeComplementaryCDF(0.3)
print(q)
which prints:
0.29136124840835353
from SALib.sample import saltelli
import numpy as np
problem = {
'num_vars': 3,
'names': ['x1', 'x2', 'x3'],
'bounds': [-10,10]
}
Must the bounds be for a uniform distribution? Could they be for another probability distribution, such as normal, binomial, Poisson, or beta?
I want to make an addition to Calvin Whealton's answer, which is based on this blog post by (I suppose) the same Calvin Whealton:
SALib supports defining different input parameter distributions in the problem dictionary, although this functionality does not seem to be documented on their official website.
You just need to add a new key 'dists' to the problem dictionary and give a list of strings which encode the distribution type.
The problem definition in that case looks like this:
# problem definition
prob_dists_code = {
'num_vars':6,
'names': ['P1','P2','P3','P4','P5','P6'],
'bounds':[[0.0,1.0], [1.0, 0.75], [0.0, 0.2], [0.0, 0.2], [-1.0,1.0], [1.0,0.25]],
'dists':['unif','triang','norm','lognorm','unif','triang']
}
The values given in 'bounds' then correspond to the parameters of each distribution, e.g. to the mean and standard deviation in the case of a normal distribution. More details can be found in the blog post.
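For reference, a minimal sketch of drawing samples with this problem definition (assuming a SALib version that honours the 'dists' key, as described in the blog post):
from SALib.sample import saltelli
# each column of the sample follows the distribution named in 'dists',
# parameterised by the corresponding entry of 'bounds'
param_values = saltelli.sample(prob_dists_code, 1024)
print(param_values.shape)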
If you want other distributions besides uniform from SALib you can do the following:
Generate uniform samples on the interval (0,1).
Use the inverse cumulative distribution function to convert the input for each parameter into the desired distribution. You can use scipy for the transformation, which should give plenty of flexibility for distributions.
Evaluate the model with these transformed inputs.
The following example to convert to normal distributions is based on code modified from the SALib site (https://github.com/SALib/SALib).
from SALib.sample import saltelli
from SALib.analyze import sobol
from SALib.test_functions import Ishigami
import numpy as np
from scipy import stats  # for the inverse CDF (ppf) of distributions
problem2 = {
'num_vars': 3,
'names': ['x1', 'x2', 'x3'],
'bounds': [[0,1]]*3
}
# Generate samples
param_values2 = saltelli.sample(problem2, 1000, calc_second_order=False)
# using normal inverse CDF, can change to other distributions as desired
# look at scipy documentation for other distributions and parameters
param_values2[:,0] = stats.norm.ppf(param_values2[:,0], 0, np.pi/2.)
param_values2[:,1] = stats.norm.ppf(param_values2[:,1], 0, np.pi/2.)
param_values2[:,2] = stats.norm.ppf(param_values2[:,2], 0, np.pi/2.)
# Run model (example)
Y = Ishigami.evaluate(param_values2)
# Perform analysis (calc_second_order must match the sampling step)
Si = sobol.analyze(problem2, Y, calc_second_order=False, print_to_console=True)
I have two arrays: one with 30 years of observations, and one with 30 years of historical model runs. I want to calculate the standard deviation between observations and model results, to see how much the model deviates from observations. How do I go about doing this?
Edit
Here are the two arrays (Each number represents a year(1971-2000)):
obs = [ 2790.90283203 2871.02514648 2641.31738281 2721.64453125
2554.19384766 2773.7746582 2500.95825195 3238.41186523
2571.62133789 2421.93017578 2615.80395508 2271.70654297
2703.82275391 3062.25366211 2656.18359375 2593.62231445
2547.87182617 2846.01245117 2530.37573242 2535.79931641
2237.58032227 2890.19067383 2406.27587891 2294.24975586
2510.43847656 2395.32055664 2378.36157227 2361.31689453 2410.75
2593.62915039]
model = [ 2976.01928711 3353.92114258 3000.92700195 3116.5078125 2935.31787109
2799.75805664 3328.06225586 3344.66333008 3318.31689453
3348.85302734 3578.70800781 2791.78198242 4187.99902344
3610.77124023 2991.984375 3112.97412109 4223.96826172
3590.92724609 3284.6015625 3846.34936523 3955.84350586
3034.26074219 3574.46362305 3674.80175781 3047.98144531
3209.56616211 2654.86547852 2780.55053711 3117.91699219
2737.67626953]
You want to compare two signals, e.g. A and B in the following example:
import numpy as np
A = np.random.rand(5)
B = np.random.rand(5)
print "A:", A
print "B:", B
Output:
A: [ 0.66926369 0.63547359 0.5294013 0.65333154 0.63912645]
B: [ 0.17207719 0.26638423 0.55176735 0.05251388 0.90012135]
Analyzing individual signals
The standard deviation of each single signal is not what you need:
print "standard deviation of A:", np.std(A)
print "standard deviation of B:", np.std(B)
Output:
standard deviation of A: 0.0494162021651
standard deviation of B: 0.304319034639
Analyzing the difference
Instead you might compute the difference and apply some common measure like the sum of absolute differences (SAD), the sum of squared differences (SSD) or the correlation coefficient:
print "difference:", A - B
print "SAD:", np.sum(np.abs(A - B))
print "SSD:", np.sum(np.square(A - B))
print "correlation:", np.corrcoef(np.array((A, B)))[0, 1]
Output:
difference: [ 0.4971865 0.36908937 -0.02236605 0.60081766 -0.2609949 ]
SAD: 1.75045448355
SSD: 0.813021824351
correlation: -0.38247081
Use numpy.
import numpy as np
data = [1.2, 2.3, 1.3, 1.2, 5.4]
np.std(data)
Or you could try this:
import numpy as np
obs = np.array([1.2, 2.3, 1.3, 1.2, 5.4])
model = np.array([1.1, 2.4, 1.2, 1.2, 5.3])
np.std(obs-model)
The standard deviation at the same index of multiple arrays (e.g. comparing model vs. measurement, or several measurement series), such as
import numpy as np
obs = np.array([0,1,2,3,4])
model = np.array([2,4,6,8,10])
can be calculated by stacking the data into one array:
arr = np.vstack((obs,model))
Now the standard deviation is calculated using np.std() with a specific axis
std = np.std(arr,axis=0)
Alternative one line solution:
std = np.std((model,obs),axis=0)
Output:
[1.0, 1.5, 2.0, 2.5, 3.0]
If you're doing anything more complicated than just finding the standard deviation and/or mean, use numpy/scipy. If that's all you need to do, use the statistics package from the Python Standard Library.
>>> import statistics
>>> statistics.stdev([1, 2, 3])
1.0
It was added in Python 3.4 (see PEP-450) as a lightweight alternative to Numpy for basic stats equations.
I am struggling with the Scipy truncnorm fit method and I would appreciate help so that the fitted parameter coefficients are consistent with the observed data.
As an example, I have created a small sample from the right-hand tail of an N(0,1) distribution (where the observations are larger than 2 standard deviations) and have thrown in a few outliers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import truncnorm
values = np.array([2.01, 2.06, 2.71, 2.31, 2.58, 2.17, 3.03, 2.24, 2.12,
2.72, 2.46, 2.66, 2.49, 3.41, 2.46, 2.12, 2.12, 2.65,
2.32, 2.49, 5.15, 2.62, 2.48, 2.27, 2.05])
pd.Series(values).describe()
This then produces the following summary statistics:
count 25.000
mean 2.548
std 0.633
min 2.01
25% 2.17
50% 2.46
75% 2.65
max 5.15
To illustrate the problem I am having with the scipy fit method, and to better understand the truncnorm implementation, I have built the following intuitive models by inspecting the summary statistics above and comparing sampled histograms to the observed values (see charts below). What I am struggling with is why the fit method gives such poor results when I attempt to sample using the estimated parameters. In case I am not using the fit results correctly or am making some other mistake, I would appreciate help with the transformations.
The code to build up these examples:
size = 10000
bins = 30
intuitive_models = {"model1":(2, 5),
"model2":(1, 4, 1),
"model3":(0.8, 4, 1, 1.25),
"fitted":truncnorm.fit(values)}
# store the truncnorm random samples in a dict
model_results = dict()
for model, params in intuitive_models.items():
    model_results[model] = truncnorm(*params).rvs(size)
# plot the random samples vs the observed values
for model, params in intuitive_models.items():
    plt.figure()
    plt.hist(model_results[model], bins=bins, density=True)
    plt.title(model + ': ' + repr(params))
    plt.hist(values, density=True, alpha=0.5)
# tabular comparison
print(pd.DataFrame(model_results).describe())
which produced the following tabular data:
fitted model1 model2 model3
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 1.024707 2.372819 2.524923 2.698601
std 0.014362 0.333144 0.443857 0.584215
min 1.000019 2.000040 2.000007 2.000019
25% 1.012248 2.121838 2.181642 2.245088
50% 1.024518 2.280975 2.407814 2.557983
75% 1.036996 2.534782 2.757778 2.998948
max 1.049991 4.829619 4.982337 5.905201
Thanks Bertie.
p.s. I hope this is accepted as a coding question and not a stats question - which is why I have posted it here.
-- Update 28-Aug-2014 --
The idea with this post was to get some help with the scipy.stats.truncnorm.fit method, and in the last couple of days I have built my own clunky algorithm. From my discussions with Robert, I get the impression that the R (or standard) implementation of truncnorm only takes 3 parameters. For those coming to this post later, once scipy has an improved fitting engine, this is what I have estimated (assuming we want an asymptotic right tail).
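For readers landing here later, a rough sketch of one way to fit a truncated normal with an asymptotic right tail (this is not the poster's algorithm: the truncation point is simply fixed at the sample minimum and loc/scale are estimated by maximum likelihood):
import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import minimize
# 'values' is the array defined in the question above
lower = values.min()
def nll(params):
    loc, scale = params
    if scale <= 0:
        return np.inf
    a = (lower - loc) / scale  # standardised lower truncation bound
    b = np.inf                 # asymptotic right tail
    return -np.sum(truncnorm.logpdf(values, a, b, loc=loc, scale=scale))
res = minimize(nll, x0=[0.0, 1.0], method="Nelder-Mead")
loc_hat, scale_hat = res.x
print(loc_hat, scale_hat)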