Why do the Frechet distributions differ in scipy.stats vs R - python

I've fitted a Frechet distribution in R and would like to use it in a Python script. However, inputting the same distribution parameters into scipy.stats.frechet_r gives me a very different curve. Is this a mistake in my implementation or a fault in scipy?
R distribution:
vs Scipy distribution:
R frechet parameters: loc=17.440, shape=0.198, scale=8.153
python code:
from scipy.stats import frechet_r
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
F = frechet_r(c=0.198, loc=17.440, scale=8.153)
x = np.arange(0.01, 120, 0.01)
ax.plot(x, F.pdf(x), 'k-', lw=2)
plt.show()
Edit: relevant documentation.
The Frechet parameters were calculated in R using the fgev function in the 'evd' package http://cran.r-project.org/web/packages/evd/evd.pdf (page 40)
Link to the scipy documentation:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.frechet_r.html#scipy.stats.frechet_r

I haven't used the frechet_r function from scipy.stats (when I quickly tested it I got the same plot as you), but you can get the required behaviour from genextreme in scipy.stats. It is worth noting that for genextreme the Frechet and Weibull shape parameters have the 'opposite' sign to usual. That is, in your case you would need to use a shape parameter of -0.198:
from scipy.stats import genextreme as gev
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
x = np.arange(0.01, 120, 0.01)
# The order for this is: array of points, shape, loc, scale
F = gev.pdf(x, -0.198, loc=17.44, scale=8.153)
ax.plot(x, F, 'g', lw=2)
plt.show()
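As a side note, frechet_r was deprecated and later removed from SciPy (it was an alias of weibull_min rather than a true Frechet), which likely explains the mismatch seen above. If you want to double-check the sign convention, here is a small sketch (my addition, not part of the original answer) comparing genextreme against the textbook GEV density written with the R/evd convention, where the shape is +0.198:
import numpy as np
from scipy.stats import genextreme as gev

xi, mu, sigma = 0.198, 17.44, 8.153   # shape, loc, scale as reported by fgev
x = np.linspace(20, 100, 5)

z = (x - mu) / sigma
t = (1 + xi * z) ** (-1 / xi)                      # GEV auxiliary function (xi != 0)
manual_pdf = (1 / sigma) * t ** (xi + 1) * np.exp(-t)

# scipy's genextreme uses c = -xi, so the two should agree
print(np.allclose(gev.pdf(x, -xi, loc=mu, scale=sigma), manual_pdf))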

Related

Why does pdf of arange function have normal distribution?

arange produces stepwise incrementing values and is not a random function, so why does plotting it give a bell-shaped distribution?
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.arange(-3, 3, 0.001)
plt.plot(x, norm.pdf(x))
plt.show()
I expect a uniform distribution
scipy.stats.norm implements the Normal distribution, not the Uniform distribution. So when you evaluate its probability density function (pdf), you are not evaluating a constant function but something else entirely, also known as the bell curve:
https://en.wikipedia.org/wiki/Normal_distribution
So what you are seeing are the points between (-3, 3) evaluated on the probability density function of the Normal distribution. If you want to see a Uniform distribution:
import numpy as np
from scipy.stats import uniform
import matplotlib.pyplot as plt

# the default uniform covers [0, 1), so pass loc and scale to span (-3, 3)
x = np.arange(-3, 3, 0.001)
plt.plot(x, uniform.pdf(x, loc=-3, scale=6))
plt.show()
But that is just a very fancy way to draw a constant line.
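To make the contrast with randomness explicit, here is a small additional sketch (not in the original answer): the pdf is a deterministic function evaluated on the arange grid, whereas rvs() actually draws random samples.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.arange(-3, 3, 0.001)
samples = norm.rvs(size=10000, random_state=0)       # actual random draws
plt.hist(samples, bins=50, density=True, alpha=0.5)  # histogram of the samples
plt.plot(x, norm.pdf(x), 'k-')                       # deterministic pdf curve
plt.show()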

Interpretation of PP plot

I am playing around with PP plots in statsmodels and I wonder why comparing a Normal distribution with scale = 5 and loc = 20 to the Standard Normal distribution results in a straight line on the PP plot, when the distributions are quite different. Please find sample code below:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

test = np.random.normal(20, 5, 100000)
pp = sm.ProbPlot(test, loc=0, scale=1)
fig = pp.ppplot()
plt.show()
You can try reducing the sample size and you will see the effect:
test = np.random.normal(20, 5, 100)
fig = sm.ProbPlot(test, loc=0, scale=1, fit=False).ppplot(line='45')
plt.show()
From the ProbPlot documentation: if fit is False, loc, scale, and distargs are passed to the distribution. If fit is True, the parameters for dist are fit automatically using dist.fit; the quantiles are formed from the standardized data, after subtracting the fitted loc and dividing by the fitted scale. fit cannot be used if dist is a SciPy frozen distribution.
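For completeness, a hedged sketch (my addition) of the two fixes implied by that documentation: either pass the true loc=20 and scale=5, or set fit=True so ProbPlot estimates them from the data; the PP plot should then hug the 45-degree line even for a large sample.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

test = np.random.normal(20, 5, 100000)
# Option 1: supply the true parameters explicitly
fig1 = sm.ProbPlot(test, loc=20, scale=5).ppplot(line='45')
# Option 2: let ProbPlot estimate loc and scale from the data
fig2 = sm.ProbPlot(test, fit=True).ppplot(line='45')
plt.show()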

Python piecewise function interpolation

I am trying to construct a function which gives me interpolated values of a piecewise linear function. I tried linear spline interpolation (which should be able to do exactly this?) but without any luck. The problem is most visible on a log-scale plot. Below is the code of a small example I prepared:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
from scipy import interpolate
# Original data
pwl_data = np.array([[0, 1e3, 1e5, 1e8], [-90, -90, -90, -130]])
# spline interpolation
pwl_spline = interpolate.splrep(pwl_data[0], pwl_data[1])
spline_x = np.linspace(0, 1e8, 10000)
legend = []
plt.plot(pwl_data[0], pwl_data[1])
plt.plot(spline_x, interpolate.splev(spline_x, pwl_spline), '*')
legend.append("Data")
legend.append("Interpolated Data")
plt.xscale('log')
plt.legend(legend)
plt.grid(True)
plt.grid(True, which='minor', linestyle='--')
plt.show()
What am I doing wrong?
The spline fitting has to be performed on the linearized data, i.e. using log(x) instead of x:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import interpolate
#Original Data
pwl_data = np.array([[1, 1e3, 1e5, 1e8], [-90, -90, -90, -130]])
x = pwl_data[0]
y = pwl_data[1]
log_x = np.log(x)
# spline interpolation
pwl_spline = interpolate.splrep(log_x, y)
spline_log_x = np.linspace(0, 18, 30)
spline_y = interpolate.splev(spline_log_x, pwl_spline)
plt.plot(log_x, y, '-o')
plt.plot(spline_log_x, spline_y, '-*')
plt.xlabel('log(x)');
Note: I removed the zero from the data, since log(0) is undefined. Also, spline fitting may not be the best choice if you want a piecewise linear function; have a look at this question for example: https://datascience.stackexchange.com/q/8457/53362 (a piecewise linear alternative is sketched below).
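If a genuinely piecewise linear interpolant is what you are after, a minimal alternative sketch (my addition, not part of the answer above) is np.interp applied in log10(x) space:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 1e3, 1e5, 1e8])
y = np.array([-90, -90, -90, -130])

x_new = np.logspace(0, 8, 200)                       # query points on a log grid
y_new = np.interp(np.log10(x_new), np.log10(x), y)   # linear interpolation in log10(x)

plt.semilogx(x, y, 'o')
plt.semilogx(x_new, y_new, '-')
plt.show()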
For plotting with matplotlib, consider matplotlib's step, which internally performs a piecewise constant interpolation:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.step.html
You can invoke it simply via plt.step(x, y) given your inputs x and y (a minimal sketch follows below).
In plotly, the argument line_shape='hv' for the Scatter plot achieves a similar result; see https://plotly.com/python/line-charts/
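A minimal sketch of the step approach with the data from the question, assuming a piecewise constant curve (rather than a piecewise linear one) is acceptable:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 1e3, 1e5, 1e8])
y = np.array([-90, -90, -90, -130])

plt.step(x, y, where='post')   # hold each value until the next breakpoint
plt.xscale('log')
plt.show()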

Calculate Scipy LOGNORM.CDF() and get the same answer as MS Excel LOGNORM.DIST

I am reproducing a chart from a paper using LOGNORM.DIST in Microsoft Excel 2013 and would like to get the same chart in Python. I am getting the correct answer in Excel, but not in Python.
In Excel I have:
mean of ln(KE): 4.630495093
std dev of ln(KE): 0.560774853
I then plot x (KE) from 10 to 1000, use Excel's LOGNORM.DIST to calculate the probability of the event, and get the exact answers from the paper, so I'm confident in the calculation. The plot is below:
MS Excel 2013 Plot of LOGNORM.DIST
In python I'm using Python 3.4 and Scipy 0.16.0 and my code is as follows:
%matplotlib inline
from scipy.stats import lognorm
import numpy as np
import matplotlib.pyplot as plt

shape = 0.560774853  # standard deviation
scale = 4.630495093  # mean
loc = 0
dist = lognorm(shape, loc, scale)
x = np.linspace(10, 1000, 200)
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_xscale('log')
ax.set_xlim([10., 1000.])
ax.set_ylim([0., 1.])
ax.plot(x, dist.cdf(x)), dist.cdf(103)
and the plot is,
Python Plot of LOGNORM
I have messed around a lot with the loc parameter, but nothing works. The last line in the python code
dist.cdf(103)
should give me a 50% probability, but obviously I'm doing something wrong.
The scale parameter of the scipy lognorm distribution is exp(mean), where mean is the mean of the underlying normal distribution. So you should write:
scale = np.exp(mean)
Here's a script that generates a plot like the Excel plot:
import numpy as np
from scipy.stats import lognorm
import matplotlib.pyplot as plt
shape = 0.560774853
scale = np.exp(4.630495093)
loc = 0
dist = lognorm(shape, loc, scale)
x = np.linspace(10, 1000, 500)
plt.semilogx(x, dist.cdf(x))
plt.grid(True)
plt.grid(True, which='minor')
plt.show()
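As a quick sanity check (my addition): the median of a lognormal is exp(mu), so with the corrected scale the CDF evaluated there is exactly 0.5, matching the roughly 50% the question expects near x = 103.
import numpy as np
from scipy.stats import lognorm

dist = lognorm(0.560774853, 0, np.exp(4.630495093))
print(dist.cdf(np.exp(4.630495093)))   # 0.5: the lognormal median is exp(mu), about 102.6
print(dist.cdf(103))                   # ~0.5, as expected from the Excel calculation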

How to estimate density function and calculate its peaks?

I have started to use python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumption about the data? I am using data that change over time, so if the data have one distribution now (e.g. Gaussian), they could have another distribution later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to try to fit the data to every distribution in order to find the data distribution; so what is the difference between using gaussian_kde and the answer provided there? Is gaussian_kde a good way to estimate the pdf if the data change over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks? I used the code below:
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats

df = pd.read_csv(r'D:\dataset.csv')      # raw string so the backslash is not treated as an escape
data = df.iloc[:, 0].values              # values of the (single) data column
pdf = scipy.stats.gaussian_kde(data)
x = np.linspace(data.min() - 1, data.max() + 1, len(data))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data, density=True)              # 'normed' has been replaced by 'density'
pl.show(block=True)
I think you need to distinguish non-parametric density estimation (the one implemented in scipy.stats.gaussian_kde) from parametric density estimation (the one in the StackOverflow question you mention). To illustrate the difference between the two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all probability mass is put on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation a bell-shaped curve (the Gaussian kernel) represents that point and spreads its mass over a neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with any distributional assumption about the underlying data x; its sole purpose is smoothing.
To get the mode of the non-parametric density, we need to do an exhaustive search over the grid, as the density is not guaranteed to be unimodal. As shown in the example above, if a quasi-Newton optimization algorithm starts between [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get the mode: exhaustive search over the evaluated grid
x[np.argmax(nparam_density)]
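Since the question asks for peaks (plural), here is a hedged, self-contained extension (my addition) using scipy.signal.find_peaks on the evaluated KDE values to locate all local modes, not just the global one:
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks

np.random.seed(0)
data = np.hstack([-6 + 3 * np.random.randn(1700), 4 + 1.5 * np.random.randn(300)])
x = np.linspace(-20, 10, 200)
density = gaussian_kde(data)(x)

peaks, _ = find_peaks(density)       # indices of all local maxima on the grid
print(x[peaks])                      # approximate locations of the two modes
print(x[np.argmax(density)])         # the global mode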
