Transformation to dgamma function in Python

I want to transform the data
[0.54667, 0.471447, 0.826591, 0.330514, 0.7263, 0.496063, 0.520698, 0.321594, 0.351358, 0.894333]
to the distribution
'dgamma(a=0.91, loc=0.48, scale=0.15)'
How can I do this in Python?

First of all, you don't need to create a distribution object in advance. All you need are the distribution parameters, which you can estimate with the code below.
from scipy.stats import gamma
import numpy as np

data = [1, 2, 3, 4, 5]  # your data
# floc=0 and fscale=1 hold loc and scale fixed, so only the shape is estimated
fit_alpha, fit_loc, fit_beta = gamma.fit(np.array(data), floc=0, fscale=1)
Then you can use the scipy.stats.gamma functions to get the PDF/CDF/etc., like:
print(gamma.pdf(0.9, fit_alpha))
Check out the documentation to find other useful calls.
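If you actually want the double gamma from the question, the same pattern works with scipy.stats.dgamma — a minimal sketch with all three parameters left free:
from scipy.stats import dgamma
import numpy as np

data = np.array([0.54667, 0.471447, 0.826591, 0.330514, 0.7263,
                 0.496063, 0.520698, 0.321594, 0.351358, 0.894333])
# fit returns (a, loc, scale), matching dgamma(a=..., loc=..., scale=...)
a, loc, scale = dgamma.fit(data)
print(a, loc, scale)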


How to retrieve seaborn's histplot data

I use
sns.distplot
to plot a univariate distribution of observations. However, I need not only the chart but also the underlying data points. How do I get the data points from the matplotlib Axes returned by distplot?
distplot returns a matplotlib Axes, so you can use the Axes API. For instance, to get the first line:
sns.distplot(x).get_lines()[0].get_data()
This returns two numpy arrays containing the x and y values for the line.
For the bars, information is stored in:
sns.distplot(x).patches
You can access each bar's height via its get_height() method:
[h.get_height() for h in sns.distplot(x).patches]
If you want the KDE values of a histogram, you can use scikit-learn's KernelDensity estimator instead:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity

ds = pd.read_csv('data-to-plot.csv')
X = ds.loc[:, 'Money-Spent'].values[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)  # you can tune the bandwidth parameter
x = np.linspace(0, 5, 100)[:, np.newaxis]
log_density_values = kde.score_samples(x)  # log-density at each grid point
density = np.exp(log_density_values)
array([1.88878660e-05, 2.04872903e-05, 2.21864649e-05, 2.39885206e-05,
2.58965064e-05, 2.79134003e-05, 3.00421245e-05, 3.22855645e-05,
3.46465903e-05, 3.71280791e-05, 3.97329392e-05, 4.24641320e-05,
4.53246933e-05, 4.83177514e-05, 5.14465430e-05, 5.47144252e-05,
5.81248850e-05, 6.16815472e-05, 6.53881807e-05, 6.92487062e-05,
7.32672057e-05, 7.74479375e-05, 8.17953578e-05, 8.63141507e-05,
..........................
..........................
3.93779919e-03, 4.15788216e-03, 4.38513011e-03, 4.61925890e-03,
4.85992626e-03, 5.10672757e-03, 5.35919187e-03, 5.61677855e-03])
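To visualize the estimated density, you can plot it over the grid — a sketch using the X, x and density arrays from above:
import matplotlib.pyplot as plt

plt.hist(X[:, 0], bins=30, density=True)  # normalized histogram of the data
plt.plot(x[:, 0], density)                # KDE curve on the same scale
plt.show()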
With newer versions of seaborn this no longer works. First of all, distplot has been deprecated in favor of displot. Secondly, displot returns a FacetGrid, so calling get_lines() raises AttributeError: 'FacetGrid' object has no attribute 'get_lines'.
This will get the KDE curve you want (with the old distplot API):
line = sns.distplot(data).get_lines()[0]
plt.plot(line.get_xdata(), line.get_ydata())
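With the newer displot API the curve data is still reachable through the FacetGrid — a sketch, assuming a single-facet KDE plot:
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.displot(data, kind='kde')  # displot returns a FacetGrid
line = g.ax.get_lines()[0]         # g.ax is the single underlying Axes
plt.plot(line.get_xdata(), line.get_ydata())
plt.show()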

Python - fitting data with exponential function

I am aware that there are a few questions on a similar subject, but I couldn't find a proper answer.
I would like to fit some data with a function (called Bastenaire) and get the parameter values. Here is the code:
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize

def bastenaire(s, A, B, C, sd):
    logNB = np.log(A) - C*(s - sd) - np.log(s - sd)
    return np.exp(logNB) - B

S = np.array([659, 646, 634, 623, 613, 595, 580, 565, 551, 535,
              515, 493, 473, 452, 432, 413, 394, 374, 355, 345])
N = np.array([46963, 52934, 59975, 65522, 74241, 87237, 101977, 116751,
              133665, 157067, 189426, 233260, 281321, 355558, 428815,
              522582, 630257, 768067, 902506, 1017280])
fitmb, fitmob = optimize.curve_fit(bastenaire, S, N, p0=(30000, 2000000000, 0.2, 250))
plt.scatter(N, S)
plt.plot(bastenaire(S, *fitmb), S, label='bastenaire')
plt.legend()
plt.show()
However, the curve fit cannot identify the correct parameters, and I get OptimizeWarning: Covariance of the parameters could not be estimated.
I get the same results when I give no initial parameter values.
Is there any way to tweak something and get results? Should my dataset cover a wider range of values?
Thank you!
Broc
Fitting is tough; you need to constrain the parameter space using bounds and (often) sanity-check your initial values.
To make it work, I searched for initial values where the function had roughly the correct shape, then estimated some constraints:
bounds = np.array([(1e4, 1e12), (-np.inf, np.inf), (1e-20, 1e-2), (-2000., 20000)]).T
fitmb, fitmob = optimize.curve_fit(bastenaire, S, N, p0=(1e7, -100., 1e-5, 250.), bounds=bounds)
returns
(array([ 1.00000000e+10,  1.03174824e+04,  7.53169772e-03, -7.32901325e+01]),
 array([[ 2.24128391e-06,  6.17858390e+00, -1.44693602e-07, -5.72040842e-03],
        [ 6.17858390e+00,  1.70326029e+07, -3.98881486e-01, -1.57696515e+04],
        [-1.44693602e-07, -3.98881486e-01,  1.14650323e-08,  4.68707940e-04],
        [-5.72040842e-03, -1.57696515e+04,  4.68707940e-04,  1.93358414e+01]]))
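To inspect the result visually, you can evaluate the fitted curve on a grid — a sketch reusing bastenaire, S, N and fitmb from above:
s_grid = np.linspace(S.min(), S.max(), 200)
plt.scatter(N, S, label='data')
plt.plot(bastenaire(s_grid, *fitmb), s_grid, label='bastenaire fit')
plt.xscale('log')  # N spans several orders of magnitude
plt.legend()
plt.show()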

What produces the same plot as autocorrelation_plot()?

I need the values of the autocorrelation coefficients that autocorrelation_plot() computes. The problem is that the output of this function is not accessible, so I need another function to get those values. That's why I used acf() from statsmodels, but it didn't produce the same plot as autocorrelation_plot(). Here is my code:
from statsmodels.tsa.stattools import acf
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
import numpy as np
y = np.sin(np.arange(1,6*np.pi,0.1))
plt.plot(acf(y))
plt.show()
So the result is not the same as this:
autocorrelation_plot(y)
plt.show()
This seems to be related to the nlags parameter of acf:
nlags: int, optional
Number of lags to return autocorrelation for.
I don't know exactly what this does, but in the source of acf there is a slicing that shortens the array:
avf = acovf(x, unbiased=unbiased, demean=True, fft=fft, missing=missing)
acf = avf[:nlags + 1] / avf[0]
If you use statsmodels.tsa.stattools.acovf directly and normalize by the zero-lag value (avf / avf[0]), the result is the same as with autocorrelation_plot — the slicing above is what cuts it short. So you can call it like
plt.plot(acf(y, nlags=len(y)))
to make it work.
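To double-check that they match, you can overlay the two curves — a sketch (note that acf includes lag 0 while the pandas plot starts at lag 1, so expect a one-lag offset):
autocorrelation_plot(y)
plt.plot(acf(y, nlags=len(y)))
plt.show()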
An explanation of lag: https://math.stackexchange.com/questions/2548314/what-is-lag-in-a-time-series/2548350

How to check if data in a numpy array follows a Poisson distribution

I wanted to know if an array follows a Poisson distribution. I used the Kolmogorov–Smirnov test (scipy.stats.kstest), but it returns a p-value of zero, so I tested the following code, which also returns a p-value of zero:
from scipy.stats import kstest, poisson

mu = 2
poissonDis = poisson.rvs(mu, size=10000)
# note: args=(1,) tests against Poisson(mu=1), not the mu=2 used to generate the data
kstest(poissonDis, 'poisson', args=(1,), alternative='less', N=10000)
The output is
KstestResult(statistic=0.5972588823428846, pvalue=0.0)
Is there something I am missing?
I found out that I should set the alternative hypothesis to 'greater'; this works:
kstest(poissonDis, 'poisson', args=(1,), alternative='greater', N=10000)
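Note that args=(1,) is the hypothesized mean, while the sample above was drawn with mu=2; testing against the same mu is probably what was intended — a sketch:
kstest(poissonDis, 'poisson', args=(mu,), N=10000)  # two-sided test against Poisson(mu=2)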

Fitting a pareto distribution with (python) Scipy

I have a data set that I know has a Pareto distribution. Can someone point me to how to fit it in Scipy? I got the code below to run, but I have no idea what is being returned to me (a, b, c). Also, after obtaining a, b, c, how do I calculate the variance using them?
import scipy.stats as ss

a, b, c = ss.pareto.fit(data)  # returns (shape, loc, scale)
Be very careful fitting power laws!! Many reported power laws are actually badly fit by a power law. See Clauset et al. for all the details (also on arXiv if you don't have access to the journal). They have a companion website for the article which now links to a Python implementation. I don't know if it uses Scipy, because I used their R implementation the last time I used it.
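As for the variance: once you have the fitted parameters, scipy can compute it from a frozen distribution — a minimal sketch, assuming fit returned (shape, loc, scale) as above (the Pareto variance is only finite when the shape exceeds 2):
import scipy.stats as ss

a, b, c = ss.pareto.fit(data)
print(ss.pareto(a, loc=b, scale=c).var())  # inf when the fitted shape a <= 2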
Here's a quickly written version, taking some hints from the Reference page that Rupert gave.
This is currently work in progress in scipy and statsmodels and requires MLE with some fixed or frozen parameters, which is only available in the trunk versions.
No standard errors on the parameter estimates or other result statistics are available yet.
'''Estimating a Pareto with 3 parameters (shape, loc, scale) by nested
minimization: MLE on the inside, minimizing the Kolmogorov-Smirnov statistic
on the outside. Running some examples looks good.

Author: josef-pktd
'''
import numpy as np
from scipy import stats, optimize
# the following adds my frozen fit method to the distributions;
# scipy trunk also has a fit method with some parameters fixed.
import scikits.statsmodels.sandbox.stats.distributions_patch

true = (0.5, 10, 1.)  # try different values
shape, loc, scale = true
rvs = stats.pareto.rvs(shape, loc=loc, scale=scale, size=1000)
rvsmin = rvs.min()  # starting value for fmin

def pareto_ks(loc, rvs):
    # MLE with loc frozen, scored by the KS statistic
    est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, loc, np.nan])
    args = (est[0], loc, est[1])
    return stats.kstest(rvs, 'pareto', args)[0]

locest = optimize.fmin(pareto_ks, rvsmin * 0.7, (rvs,))
est = stats.pareto.fit_fr(rvs, 1., frozen=[np.nan, locest, np.nan])
args = (est[0], locest[0], est[1])
print('estimate')
print(args)
print('kstest')
print(stats.kstest(rvs, 'pareto', args))
print('estimation error', args - np.array(true))
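For what it's worth, the fixed-parameter fit has since landed: current scipy releases let you freeze parameters directly in fit via keyword arguments, so the sandbox patch is no longer needed — a sketch:
from scipy import stats

shape_e, loc_e, scale_e = stats.pareto.fit(rvs)           # all parameters free
shape_f, loc_f, scale_f = stats.pareto.fit(rvs, floc=10)  # loc held fixed at 10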
Let's say your data is formatted like this:
import openturns as ot

data = [
    [2.7018013],
    [8.53280352],
    [1.15643882],
    [1.03359467],
    [1.53152735],
    [32.70434285],
    [12.60709624],
    [2.012235],
    [1.06747063],
    [1.41394096],
]
sample = ot.Sample(data)  # each observation is already a one-element list
You can easily fit a Pareto distribution using the ParetoFactory of the OpenTURNS library:
distribution = ot.ParetoFactory().build(sample)
You can of course print it:
print(distribution)
>>> Pareto(beta = 0.00317985, alpha = 0.147365, gamma = 1.0283)
or plot its PDF:
from openturns.viewer import View
pdf_graph = distribution.drawPDF()
pdf_graph.setTitle(str(distribution))
View(pdf_graph, add_legend=False)
More details on the ParetoFactory are provided in the documentation.
Before passing a flat list to the build() function in OpenTURNS, make sure to convert it this way:
data = [[i] for i in data]
because the Sample() constructor may otherwise raise an error.
FYI @Tropilio
