A lognormal distribution in python

I have seen several questions on Stack Overflow about how to fit a log-normal distribution. Still, there are two clarifications that I need.
I have sample data, the logarithm of which follows a normal distribution. So I can fit the data using scipy.stats.lognorm.fit (i.e. a log-normal distribution).
The fit works fine and also gives me the standard deviation. Here is my piece of code with the results.
import numpy as np
from scipy import stats
sample = np.log10(data) #taking the log10 of the data
scatter,loc,mean = stats.lognorm.fit(sample) #Gives the parameters of the fit
x_fit = np.linspace(13.0,15.0,100)
pdf_fitted = stats.lognorm.pdf(x_fit,scatter,loc,mean) #Gives the PDF
print "scatter for data is %s" %scatter
print "mean of data is %s" %mean
THE RESULT
scatter for data is 0.186415047243
mean of data is 1.15731050926
From the image you can clearly see that the mean is around 14.2, but what I get is 1.15! Why is this? Clearly the log(mean) is not near 14.2 either!
In THIS POST and in THIS QUESTION it is mentioned that the log(mean) is the actual mean.
But you can see from my code above that the fit I obtained uses sample = log(data), and it also seems to fit well. However, when I tried
sample = data
pdf_fitted = stats.lognorm.pdf(x_fit,scatter,loc,np.log10(mean))
The fit does not seem to work.
1) Why is the mean not 14.2?
2) How can I fill/draw vertical lines showing the 1-sigma confidence region?

You say
I have sample data, the logarithm of which follows a normal distribution.
Suppose data is the array containing the samples. To fit this data to
a log-normal distribution using scipy.stats.lognorm, use:
s, loc, scale = stats.lognorm.fit(data, floc=0)
Now suppose mu and sigma are the mean and standard deviation of the
underlying normal distribution. To get the estimate of those values
from this fit, use:
estimated_mu = np.log(scale)
estimated_sigma = s
(These are not the estimates of the mean and standard deviation of
the samples in data. See the Wikipedia page for the formulas
for the mean and variance of a log-normal distribution in terms of mu and sigma.)
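For reference, here is a short sketch (my addition, not part of the original answer) of those formulas, continuing from the fit above:
dist_mean = np.exp(estimated_mu + estimated_sigma**2 / 2)  # mean of the log-normal distribution
dist_var = (np.exp(estimated_sigma**2) - 1) * np.exp(2*estimated_mu + estimated_sigma**2)  # its variance
For a reasonably large sample, dist_mean and dist_var should be close to data.mean() and data.var().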
To combine the histogram and the PDF, you can use, for example,
import matplotlib.pyplot as plt
plt.hist(data, bins=50, density=True, color='c', alpha=0.75)
xmin = data.min()
xmax = data.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.lognorm.pdf(x, s, scale=scale)
plt.plot(x, pdf, 'k')
If you want to see the log of the data, you could do something like
the following. Note that the PDF of the normal distribution is used here.
logdata = np.log(data)
plt.hist(logdata, bins=40, density=True, color='c', alpha=0.75)
xmin = logdata.min()
xmax = logdata.max()
x = np.linspace(xmin, xmax, 100)
pdf = stats.norm.pdf(x, loc=estimated_mu, scale=estimated_sigma)
plt.plot(x, pdf, 'k')
By the way, an alternative to fitting with stats.lognorm is to fit log(data)
using stats.norm.fit:
logdata = np.log(data)
estimated_mu, estimated_sigma = stats.norm.fit(logdata)
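The second part of the question (marking the 1-sigma region) isn't addressed above; here is a minimal sketch of one way to do it on the log-scale histogram, reusing logdata, estimated_mu and estimated_sigma from the code above:
plt.hist(logdata, bins=40, density=True, color='c', alpha=0.75)
# vertical lines at mu - sigma and mu + sigma
plt.axvline(estimated_mu - estimated_sigma, color='k', linestyle='--')
plt.axvline(estimated_mu + estimated_sigma, color='k', linestyle='--')
# or shade the whole 1-sigma band instead
plt.axvspan(estimated_mu - estimated_sigma, estimated_mu + estimated_sigma, color='gray', alpha=0.2)
plt.show()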
Related questions:
Fitting lognormal distribution using Scipy vs Matlab
Lognormal Random Numbers Centered around a high value

Related

How to draw a consistent Probability Density Function (PDF) plot regardless of sample size in Python?

I have a question about drawing a Probability Density Function (PDF) plot that looks the same regardless of sample size in Python.
This is my code.
# Library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
# Data frame
x = np.random.normal(45, 9, 1000)
source = {"Genotype": ["CV1"]*1000, "AGW": x}
df=pd.DataFrame(source)
# Calculating PDF
df_mean = np.mean(df["AGW"])
df_std = np.std(df["AGW"])
pdf = stats.norm.pdf(df["AGW"].sort_values(), df_mean, df_std)
# Graph
plt.plot(df["AGW"].sort_values(), pdf, color="black")
plt.xlim([0,90])
plt.xlabel("Grain weight (mg)", size=12)
plt.ylabel("Frequency", size=12)
plt.grid(True, alpha=0.3, linestyle="--")
plt.show()
and this is the graph. However, when I change the sample number from 1000 to 100, e.g. x = np.random.normal(45, 9, 100), the shape of the graph changes.
This is because a small sample cannot represent the full normal distribution. If we draw a normal distribution graph in Excel with a limited sample size, we find the same problem.
However, in R, stat_function() always produces the same shape of normal distribution curve regardless of sample size.
When I run the R code below, I obtain the same curve regardless of sample size: it assumes the full normal distribution with the given mean and standard deviation.
Could you let me know how I can get such a consistent normal distribution graph in Python, as in R? Regardless of sample size, I'd like to obtain the same shape of normal distribution curve.
Always, many thanks!!
AGW<-rnorm(100, mean=45, sd=9)
Genotype<-c(rep("CV1",100))
df<- data.frame (Genotype, AGW)
ggplot () +
stat_function(data=df, aes(x=AGW), color="Black", size=1, fun = dnorm,
args = c(mean = mean(df$AGW), sd = sd(df$AGW))) +
scale_x_continuous(breaks = seq(0,90,10),limits = c(0,90)) +
scale_y_continuous(breaks = seq(0,0.05,0.01), limits = c(0,0.05)) +
labs(x="Grain weight (mg)", y="Frequency") +
theme_grey(base_size=15, base_family="serif")+
theme(axis.line= element_line(size=0.5, colour="black")) +
windows(width=6, height=5)
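A minimal sketch (not part of the original thread) of the equivalent approach in Python: instead of evaluating the PDF only at the sampled points, evaluate stats.norm.pdf on a fixed x grid, just as stat_function does in R:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = np.random.normal(45, 9, 100)   # sample size no longer affects the curve shape
grid = np.linspace(0, 90, 500)     # fixed x range, analogous to the stat_function limits
pdf = stats.norm.pdf(grid, np.mean(x), np.std(x))
plt.plot(grid, pdf, color="black")
plt.xlim([0, 90])
plt.xlabel("Grain weight (mg)", size=12)
plt.ylabel("Frequency", size=12)
plt.show()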

Calculating probability distribution from time series data in python

I have a question about probability distribution functions. I have time series data and I want to calculate the probability distribution of the data in different time windows.
I have developed the following code, but I could not find the value of the probability distribution for this data.
a = pd.DataFrame([0.0,
21.660332407421638,
20.56428943581567,
20.597329924045983,
19.313207915827956,
19.104973174542806,
18.031361568112377,
17.904747973652125,
16.705687654209264,
16.534206966165637,
16.347782724271802,
13.994284547628721,
12.870120434556945,
12.794530081249571,
10.660675400742669])
This is the histogram and density plot of my data:
a.plot.hist()
a.plot.density()
but I don't know how I can calculate the value of the area under the density curve.
You can directly use scipy.stats.gaussian_kde, which is also what pandas uses internally.
It returns the desired density function.
You can then use one of the routines from scipy.integrate to calculate areas under the kernel density estimate, e.g.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, integrate
kde = stats.gaussian_kde(a[0])
# Calculate the integral of the kde between 10 and 20:
xmin, xmax = 10, 20
integral, err = integrate.quad(kde, xmin, xmax)
x = np.linspace(-5,20,100)
x_integral = np.linspace(xmin, xmax, 100)
plt.plot(x, kde(x), label="KDE")
plt.fill_between(x_integral, 0, kde(x_integral),
alpha=0.3, color='b', label="Area: {:.3f}".format(integral))
plt.legend()

Python: how to fit a gamma distribution from data?

I have a dataset and I am trying to see which distribution fits it best.
In the first attempt I tried to fit it with a Rayleigh distribution:
y, x = np.histogram(data, bins=45, normed=True)
param = rayleigh.fit(y) # distribution fitting
# fitted distribution
xx = linspace(0,45,1000)
pdf_fitted = rayleigh.pdf(xx,loc=param[0],scale=param[1])
pdf = rayleigh.pdf(xx,loc=0,scale=8.5)
fig,ax = plt.subplots(figsize=(7,5))
plot(xx,pdf,'r-', lw=5, alpha=0.6, label='rayleigh pdf')
plot(xx,pdf,'k-', label='Data')
plt.bar(x[1:], y)
ax.set_xlabel('Distance, '+r'$x [km]$',size = 15)
ax.set_ylabel('Frequency, '+r'$P(x)$',size=15)
ax.legend(loc='best', frameon=False)
I am trying to do the same with a gamma distribution, without succeeding:
y, x = np.histogram(net1['distance'], bins=45, normed=True)
xx = linspace(0,45,1000)
ag,bg,cg = gamma.fit(y)
pdf_gamma = gamma.pdf(xx, ag, bg,cg)
fig,ax = plt.subplots(figsize=(7,5))
# fitted distribution
plot(xx,pdf_gamma,'r-', lw=5, alpha=0.6, label='gamma pdf')
plot(xx,pdf_gamma,'k-')
plt.bar(x[1:], y, label='Data')
ax.set_xlabel('Distance, '+r'$x [km]$',size = 15)
ax.set_ylabel('Frequency, '+r'$P(x)$',size=15)
ax.legend(loc='best', frameon=False)
Unfortunately scipy.stats.gamma is not well documented.
Suppose you have some "raw" data in the form data = array([a1, a2, a3, ...]); these can be the results of an experiment of yours.
You can give these raw values to the fit method: gamma.fit(data), and it will return three parameters: a, b, c = gamma.fit(data). These are the "shape", the "loc"ation and the "scale" of the gamma curve that best fits the DISTRIBUTION HISTOGRAM of your data (not the actual data values).
I noticed from the questions online that many people get confused: they have a distribution of their data (e.g. histogram values) and try to fit that with gamma.fit. This is wrong.
The method gamma.fit expects your raw data, not the distribution of your data.
This will hopefully solve the problem for a few of us.
GR
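To make that concrete, here is a minimal sketch (my addition, with synthetic data standing in for the original dataset): fit the raw values, then overlay the fitted PDF on a density histogram:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma
data = gamma.rvs(2.0, scale=3.0, size=5000, random_state=0)  # stand-in for raw experimental data
a, loc, scale = gamma.fit(data)   # fit the raw data, not a histogram of it
xx = np.linspace(0, data.max(), 200)
plt.hist(data, bins=45, density=True, alpha=0.5, label='Data')
plt.plot(xx, gamma.pdf(xx, a, loc, scale), 'r-', lw=2, label='gamma pdf')
plt.legend(loc='best', frameon=False)
plt.show()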
My guess is that you have much of the original data at 0, so the alpha of the fit ends up lower than 1 (0.34) and you get the decreasing shape with singularity at 0. The bar plot does not include the zero (x[1:]) so you don't see the huge bar on the left.
Am I right?

Plot a line graph over a histogram for residual plot in python

I have created a script to plot a histogram of the NO2 vs Temperature residuals in a dataframe called nighttime.
The histogram shows the normal distribution of the residuals from a regression line computed elsewhere in the script.
I am struggling to find a way to plot a bell curve over the histogram, like this example:
Plot Normal distribution with Matplotlib
How can I get a fitting normal distribution for my residual histogram?
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
#Histogram of residuals
ax = plt.hist(NO2_WS_RM_mod.resid)
plt.xlim(-40,50)
plt.xlabel('Residuals')
plt.show()
You can use the seaborn library to plot the distribution together with a bell curve. The residual variable is not entirely clear to me in the example you provided, so the code snippet below is just for your reference.
# y here is an arbitrary target variable for explaining this example
residuals = y_actual - y_predicted
import seaborn as sns
sns.distplot(residuals, bins = 10) # you may select the no. of bins; note distplot is deprecated in newer seaborn (histplot with kde=True is its replacement)
plt.title('Error Terms', fontsize=20)
plt.xlabel('Residuals', fontsize = 15)
plt.show()
Does the following work for you? (using some adapted code from the link you gave)
import scipy.stats as stats
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
#Histogram of residuals
ax = plt.hist(NO2_WS_RM_mod.resid)
plt.xlim(-40,50)
plt.xlabel('Residuals')
# New Code: Draw fitted normal distribution
residuals = sorted(NO2_WS_RM_mod.resid) # Just in case it isn't sorted
normal_distribution = stats.norm.pdf(residuals, np.mean(residuals), np.std(residuals))
plt.plot(residuals, normal_distribution)
plt.show()

How to estimate density function and calculate its peaks?

I have started to use Python for analysis. I would like to do the following:
Get the distribution of the dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? I am using data that change over time, so if the data has one distribution now (e.g. Gaussian), it could have another distribution later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in this question to try to fit the data to every distribution in order to find the data distribution, so what's the difference between using gaussian_kde and the answer provided there? I used the code below, and I was also wondering whether gaussian_kde is a good way to estimate the pdf if the data changes over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
df = pd.read_csv('D:\dataset.csv')
pdf = scipy.stats.kde.gaussian_kde(df)
x = np.linspace((df.min()-1),(df.max()+1), len(df))
y = pdf(x)
pl.plot(x, y, color = 'r')
pl.hist(data_column, normed= True)
pl.show(block=True)
I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In the histogram, for a particular observation x = x0 we use a bar to represent it (all the probability mass is put on that single point x = x0 and zero elsewhere), whereas in non-parametric density estimation we use a bell-shaped curve (the Gaussian kernel) to represent that point, spreading it over its neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with your distributional assumption on the underlying data x; its sole purpose is smoothing.
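To make that smoothing claim concrete, here is a small sketch (my addition, not part of the original answer): averaging one normal PDF centred on every observation reproduces exactly what gaussian_kde computes:
samples = df.values.ravel()
kde = stats.gaussian_kde(samples)
h = np.sqrt(kde.covariance[0, 0])   # bandwidth: the std of each Gaussian bump
manual = np.mean([stats.norm.pdf(x, loc=xi, scale=h) for xi in samples], axis=0)
print(np.allclose(manual, kde(x)))  # True: the KDE is just the average of the bumps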
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if a quasi-Newton optimization algorithm starts between [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search
x[np.argsort(nparam_density)[-1]]
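The question asks for peaks (plural), and the line above only returns the single highest one. A small sketch (my addition) that locates every local maximum of the density curve with scipy.signal.find_peaks:
from scipy.signal import find_peaks
peak_idx, _ = find_peaks(nparam_density)  # indices of all local maxima of the KDE curve
print(x[peak_idx])                        # x locations of the peaks (roughly -6 and 4 here)
print(nparam_density[peak_idx])           # density value at each peak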
