In Java, I usually rely on the org.apache.commons.math3.random.EmpiricalDistribution class to do the following:
Derive a probability distribution from observed data.
Generate random values from this distribution.
Is there any Python library that provides the same functionality? It seems like scipy.stats.gaussian_kde.resample does something similar, but I'm not sure if it implements the same procedure as the Java type I'm familiar with.
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
# This represents the original "empirical" sample -- I fake it by
# sampling from a normal distribution
orig_sample_data = np.random.normal(size=10000)
# Generate a KDE from the empirical sample
sample_pdf = scipy.stats.gaussian_kde(orig_sample_data)
# Sample new datapoints from the KDE
new_sample_data = sample_pdf.resample(10000).T[:,0]
# Histogram of initial empirical sample
cnts, bins, p = plt.hist(orig_sample_data, label='original sample', bins=100,
                         histtype='step', linewidth=1.5, density=True)
# Histogram of datapoints sampled from KDE
plt.hist(new_sample_data, label='sample from KDE', bins=bins,
         histtype='step', linewidth=1.5, density=True)
# Visualize the kde itself
y_kde = sample_pdf(bins)
plt.plot(bins, y_kde, label='KDE')
plt.legend()
plt.show(block=False)
new_sample_data should be drawn from roughly the same distribution as the original data (to the degree that the KDE is a good approximation to the original distribution).
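As I understand the Commons Math documentation, EmpiricalDistribution works on binned data (it groups the observations into bins and smooths within each bin), whereas gaussian_kde places a kernel on every observation, so the two are related but not the same procedure. If a binned approach is closer to what you want, scipy.stats.rv_histogram is one option. The following is a minimal sketch under that assumption, not a port of the Java class; the bin count of 100 and the seeds are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the observed data, as in the example above
orig_sample_data = rng.normal(size=10000)

# Build a distribution object directly from the histogram of the data
counts, edges = np.histogram(orig_sample_data, bins=100)
hist_dist = stats.rv_histogram((counts, edges))

# Draw new values from the histogram-based distribution
new_sample_data = hist_dist.rvs(size=10000, random_state=1)
```

Note that rv_histogram treats the density as constant within each bin, so the result is piecewise-constant rather than smooth.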
I have some histogram code as follows:
plt.subplots(figsize=(10,8), dpi=100)
sns.distplot(x1, color='k', label='a',norm_hist = True)
sns.distplot(x2, color='g', label='b',norm_hist = True)
sns.distplot(x3, color='b', label='b',norm_hist = True)
sns.distplot(x4, color='r', label='c',norm_hist = True)
sns.distplot(x5, color='y', label='c',norm_hist = True)
For my data, I get this plot
This is good, but what I'm really trying to do is fit the curve only on positive x values. A negative duration doesn't make physical sense. Is there any option for that?
From the sns.distplot() documentation:
... It can also fit scipy.stats distributions and plot the estimated PDF over the data.
So you can choose a non-negative distribution that would make sense for your data and use the fit argument to pass a SciPy distribution object that will be fitted to your data.
For example:
import seaborn as sns
from scipy import stats
iris = sns.load_dataset('iris')
sns.distplot(iris.petal_length, color='k', fit=stats.expon, kde=False)
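If you want to control the fit yourself, for example to pin the lower bound of the distribution at zero, you can also fit with SciPy directly and plot the resulting PDF. This is only a sketch: the gamma distribution and the synthetic `durations` array are assumptions standing in for your data and for whichever non-negative distribution suits it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical positive-only durations standing in for the real data
durations = rng.gamma(shape=2.0, scale=1.5, size=2000)

# floc=0 pins the lower bound of the support at zero during the fit
shape, loc, scale = stats.gamma.fit(durations, floc=0)
fitted = stats.gamma(shape, loc=loc, scale=scale)

x = np.linspace(-2, 15, 200)
pdf = fitted.pdf(x)  # zero for every x < 0
```

The curve can then be drawn with plt.plot(x, pdf) over a density-normalised histogram of the data.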
I have plotted a histogram and now I want a curve that represents the histogram's trend. I want the histogram binning to be logarithmic (as in the code below; Mass is a predefined variable ranging from 10^43 to 10^45 grams).
I have looked at many code examples but could not adapt any of them to my case. Do you know how I can make this curve? I just want to modify my code so that it also plots this curve over the histogram.
Thanks,
Salome
See the attached image
import matplotlib.pyplot as plt
import numpy as np
x=Mass
hist, bins = np.histogram(x, bins=10)
logbins = np.logspace(np.log10(bins[0]),np.log10(bins[-1]),len(bins))
n, bins, patches = plt.hist(x=Mass, bins=logbins, color='#0504aa', alpha=0.8, rwidth=0.85)
plt.xscale('log')
plt.xlabel('Mass $(g)$ ')
plt.ylabel('Number of halos')
plt.show()
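One possible approach, sketched under assumptions: estimate a KDE on log10(Mass), where the logarithmic bins have equal width, and convert the density into an expected count per bin so that it shares the histogram's y-axis. The synthetic `Mass` array below is a hypothetical stand-in for the real variable.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical stand-in for Mass: log-normal values around 10^44 g
Mass = 10 ** rng.normal(loc=44.0, scale=0.3, size=5000)

log_mass = np.log10(Mass)
logbins = np.logspace(log_mass.min(), log_mass.max(), 11)  # 10 log-spaced bins
counts, _ = np.histogram(Mass, bins=logbins)

# Smooth in log10 space, where the bins have equal width
kde = gaussian_kde(log_mass)
log_edges = np.log10(logbins)
log_centers = 0.5 * (log_edges[:-1] + log_edges[1:])
bin_width = log_edges[1] - log_edges[0]

# Density -> expected count per bin, so the curve matches the histogram's y-axis
curve = kde(log_centers) * len(Mass) * bin_width
```

Plotting plt.plot(10 ** log_centers, curve) after plt.hist(Mass, bins=logbins) and plt.xscale('log') should then place the curve over the bars.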
I have created a script to plot a histogram of the NO2 vs Temperature residuals in a dataframe called nighttime.
The histogram shows the normal distribution of the residuals from a regression line computed elsewhere in the Python script.
I am struggling to find a way to plot a bell curve over the histogram, as in this example:
Plot Normal distribution with Matplotlib
How can I get a fitting normal distribution for my residual histogram?
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
#Histogram of residuals
ax = plt.hist(NO2_WS_RM_mod.resid)
plt.xlim(-40,50)
plt.xlabel('Residuals')
plt.show()
You can use the seaborn library to plot the distribution together with a bell curve. The residual variable is not clear to me in the example you have provided, so the snippet below is just for reference.
import matplotlib.pyplot as plt
import seaborn as sns
# y_actual and y_predicted are placeholder arrays for this example
residuals = y_actual - y_predicted
sns.distplot(residuals, bins=10)  # choose the number of bins to suit your data
plt.title('Error Terms', fontsize=20)
plt.xlabel('Residuals', fontsize = 15)
plt.show()
Does the following work for you? (using some adapted code from the link you gave)
import scipy.stats as stats
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
#Histogram of residuals
ax = plt.hist(NO2_WS_RM_mod.resid, density=True)  # density=True puts the bars on the same scale as the PDF
plt.xlim(-40,50)
plt.xlabel('Residuals')
# New Code: Draw fitted normal distribution
residuals = sorted(NO2_WS_RM_mod.resid) # Just in case it isn't sorted
normal_distribution = stats.norm.pdf(residuals, np.mean(residuals), np.std(residuals))
plt.plot(residuals, normal_distribution)
plt.show()
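A normal PDF integrates to 1 while the bars of a count histogram sum to the number of observations, so the two only share a scale if one of them is rescaled. If you would rather keep raw counts on the y-axis, one option is to multiply the PDF by the sample size times the bin width. A minimal sketch, with synthetic residuals as a hypothetical stand-in for NO2_WS_RM_mod.resid:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical residuals standing in for NO2_WS_RM_mod.resid
residuals = rng.normal(loc=0.0, scale=12.0, size=1000)

mu, sigma = stats.norm.fit(residuals)  # maximum-likelihood mean and std
counts, edges = np.histogram(residuals, bins=10)
bin_width = edges[1] - edges[0]

x = np.linspace(edges[0], edges[-1], 200)
# Scale the unit-area PDF to the histogram's count axis
curve = stats.norm.pdf(x, mu, sigma) * len(residuals) * bin_width
```

Here plt.hist(residuals, bins=edges) followed by plt.plot(x, curve) would draw the bars and the bell curve on the same axis.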
I have started to use python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data changes over time, so if it follows one distribution now (e.g. Gaussian), it could follow another later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to try fitting the data to every distribution in order to find the data's distribution. What is the difference between using gaussian_kde and the answer provided in that question? I used the code below. Is gaussian_kde a good way to estimate the PDF if the data changes over time? I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
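df = pd.read_csv(r'D:\dataset.csv')  # raw string so the backslash is kept literally
# Assumes the file holds a single numeric column
data_column = df.iloc[:, 0]
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
pl.plot(x, y, color='r')
pl.hist(data_column, density=True)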
pl.show(block=True)
I think you need to distinguish the non-parametric density (the one implemented by scipy.stats.gaussian_kde) from the parametric density (the one in the StackOverflow question you mention). To illustrate the difference between the two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df.values.ravel())
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values.ravel(), bins=30, density=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
plt.show()
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, a particular observation x=x0 is represented by a bar (all of its probability mass is put on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation it is represented by a bell-shaped curve (the Gaussian kernel) that spreads the mass over its neighbourhood. The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with your distributional assumption on the underlying data x. Its sole purpose is smoothing.
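To make the "one bell curve per observation" picture concrete, here is a small sketch that rebuilds the estimate by hand and compares it with gaussian_kde. It assumes 1-D data and the default bandwidth rule; the sample and the grid are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=200)

kde = stats.gaussian_kde(data)
# Kernel width used by gaussian_kde for 1-D data: bandwidth factor times sample std
h = kde.factor * data.std(ddof=1)

x = np.linspace(-4, 4, 101)
# One Gaussian bump per observation, averaged over all observations
bumps = stats.norm.pdf(x[:, None], loc=data[None, :], scale=h)
manual_density = bumps.mean(axis=1)

# The hand-built curve agrees with the library estimate
matches = np.allclose(manual_density, kde(x))
```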
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if your quasi-Newton optimization algorithm starts in [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the evaluation grid
x[np.argmax(nparam_density)]
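If you need every peak rather than only the global mode, one option is to run scipy.signal.find_peaks on the density evaluated over a grid. The sketch below regenerates a similar two-component mixture; the prominence threshold of 0.005 is an arbitrary assumption chosen to ignore tiny ripples in the tails, and would need tuning for other data.

```python
import numpy as np
from scipy import stats
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
gaussian_mixture = np.concatenate([
    rng.normal(-6, 3, 1700),
    rng.normal(4, 1.5, 300),
])

x = np.linspace(-20, 10, 601)
density = stats.gaussian_kde(gaussian_mixture)(x)

# prominence filters out tiny ripples; 0.005 is an arbitrary threshold
peak_idx, _ = find_peaks(density, prominence=0.005)
peak_locations = x[peak_idx]

# The global mode is the highest of the peaks
global_mode = x[np.argmax(density)]
```

The resolution of the peak locations is limited by the grid spacing, so use a finer grid if you need more precision.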
How can I make a figure like the following one, but with a flat curve, using matplotlib in Python?
Instead of using a histogram to bin your data, have a look at using a KDE for a continuous estimate of the probability distribution. There is an implementation using a Gaussian kernel in scipy.stats.gaussian_kde.
As an example:
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
data = np.random.normal(0.0, 1.0, 10000) #Generate some data
kde = gaussian_kde(data)
xplot = np.linspace(-5,5,1000)
plt.plot( xplot, kde(xplot), label='KDE' )
plt.hist( data, bins=50, histtype='step', density=True, label='histogram' )
plt.legend()
plt.show()
Will produce the plot:
Note that when using KDEs, the bandwidth of the kernel you choose can have a very big impact on the representation of the data that gets produced. This is similar to the effect the bin size has when making a histogram. Both the SciPy documentation linked above and the Wikipedia page have good write-ups on how to make this selection in a well-motivated way.
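To see the effect of the bandwidth, you can pass a scalar bw_method to gaussian_kde, which is then used directly as the bandwidth factor. The values 0.02 and 1.0 below are deliberately extreme, arbitrary choices, and total_variation is a hypothetical helper used only to put a number on how wiggly each curve is.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 500)
xplot = np.linspace(-5, 5, 1000)

# A scalar bw_method is used directly as the bandwidth factor
narrow = gaussian_kde(data, bw_method=0.02)(xplot)   # undersmoothed, spiky
default = gaussian_kde(data)(xplot)                  # Scott's rule (the default)
wide = gaussian_kde(data, bw_method=1.0)(xplot)      # oversmoothed, flattened

def total_variation(y):
    """Sum of absolute point-to-point changes: a crude roughness measure."""
    return np.abs(np.diff(y)).sum()

roughness = [total_variation(y) for y in (narrow, default, wide)]
```

Plotting the three curves against xplot shows the narrow bandwidth chasing individual points and the wide one smearing out the peak.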