Modifying the histogram curve for positive x - python

I have some histogram code as follows:
plt.subplots(figsize=(10,8), dpi=100)
sns.distplot(x1, color='k', label='a',norm_hist = True)
sns.distplot(x2, color='g', label='b',norm_hist = True)
sns.distplot(x3, color='b', label='b',norm_hist = True)
sns.distplot(x4, color='r', label='c',norm_hist = True)
sns.distplot(x5, color='y', label='c',norm_hist = True)
For my data, I get this plot
This is good, but what I'm really trying to do is fit the curve only on positive x values. Negative duration doesn't make physical sense. Is there any option for that?

From the sns.distplot() documentation:
... It can also fit scipy.stats distributions and plot the estimated PDF over the data.
So you can choose a non-negative distribution that would make sense for your data and use the fit argument to pass a SciPy distribution object that will be fitted to your data.
For example:
import seaborn as sns
from scipy import stats
iris = sns.load_dataset('iris')
sns.distplot(iris.petal_length, color='k', fit=stats.expon, kde=False)
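If you also want the fitted parameters, or want to guarantee that the fitted density is zero below x = 0, you can do the fit with SciPy directly. This is a minimal sketch, not a recommendation for your data: the gamma distribution is an assumed model, and the synthetic `durations` array stands in for your real values.

```python
import numpy as np
from scipy import stats

# Synthetic positive "durations" standing in for the real data (assumption).
rng = np.random.default_rng(0)
durations = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Fit a gamma distribution with the location pinned at 0, so the fitted
# density is exactly zero for negative x.
shape, loc, scale = stats.gamma.fit(durations, floc=0)

# Evaluate the fitted PDF on a grid that deliberately includes negative x.
grid = np.linspace(-2, durations.max(), 200)
pdf = stats.gamma.pdf(grid, shape, loc=loc, scale=scale)
```

The `grid` and `pdf` arrays can then be drawn over a normalized histogram with `plt.plot(grid, pdf)`.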

Related

Seaborn regplot fit line does not match calculated fit from stats.linregress or stats model

I am trying to fit a log-linear regression (linear in log x). I used Seaborn regplot to plot the fit, which looks good (green line). Then, because regplot does not provide the coefficients, I used stats.linregress to find them. However, the line plotted from those coefficients (purple) does not match the fit from Seaborn regplot. I also used statsmodels to get the coefficients, which matched the linregress output. Is there a better way to get coefficients that match the regplot line? I am unable to reproduce the Seaborn regplot line, and I need the coefficients to report the fit for the model.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
sns.regplot(x, y, x_bins=100, logx=True, n_boot=2000, scatter_kws={"color": "black"},
            ci=None, label='logfit', line_kws={"color": "green"})
#Find the coefficients slope and intercept
slope, intercept, r_value, pv, se = stats.linregress(y, np.log10(x))
yy= np.linspace(-.01, 0.05, 400)
xx = 10**(slope*yy+intercept)
plt.plot(xx,yy,marker='.',color='purple')
#Label Figure
plt.tick_params(labelsize=18)
plt.xlabel('insitu', fontsize=22)
plt.ylabel('CI', fontsize=22)
I also used statsmodels for the fit and got the same coefficients as stats.linregress. I'm still unable to reproduce the Seaborn regplot line.
import statsmodels as sm
import statsmodels.formula.api as smf
results = smf.ols('np.log10(x) ~ (y)', data=df_data).fit()
# Inspect the results
print(results.summary())
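There are two issues with your attempt to recreate what seaborn is doing:
1. You have the arguments to stats.linregress backwards: seaborn regresses y on log(x), not log(x) on y.
2. That's not how yhat is computed: with logx=True the fitted line is yhat = intercept + slope * log(x), using the natural log.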
Here's how you could recreate the seaborn logx regression line:
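import numpy as np
import seaborn as sns
from scipy import stats

diamonds = sns.load_dataset("diamonds").sample(500, random_state=0)
x = diamonds["price"]
y = diamonds["carat"]
# seaborn's own fit, drawn as a thick green line
ax = sns.regplot(x=x, y=y, logx=True, line_kws=dict(color="g", lw=10))
# regress y on the natural log of x
fit = stats.linregress(np.log(x), y)
grid = np.linspace(x.min(), x.max())
# the manual line (red) should sit on top of the green one
ax.plot(grid, fit.intercept + fit.slope * np.log(grid), color="r", lw=5)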
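If you need to report the coefficients in base 10 rather than as a natural-log fit, the conversion only touches the slope. The sketch below uses synthetic data (an assumption, with a known slope of 0.5 and intercept of 2.0) and `np.polyfit` in place of `stats.linregress`, purely to keep it dependency-light.

```python
import numpy as np

# Synthetic data following y = 2.0 + 0.5 * ln(x) + noise (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(1, 1000, size=500)
y = 2.0 + 0.5 * np.log(x) + rng.normal(0, 0.1, size=500)

# Regress y on ln(x); polyfit returns the highest-degree coefficient first.
slope_ln, intercept = np.polyfit(np.log(x), y, 1)

# The same line expressed in base 10: y = intercept + slope_log10 * log10(x).
slope_log10 = slope_ln * np.log(10)

# Both parameterizations give identical predictions.
grid = np.linspace(x.min(), x.max(), 50)
yhat_ln = intercept + slope_ln * np.log(grid)
yhat_log10 = intercept + slope_log10 * np.log10(grid)
```

The intercept is the same in both forms because log10(x) and ln(x) differ only by a constant factor.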

Get actual numbers instead of normalized value in seaborn KDE plots

I have three dataframes and I plot their KDEs using the seaborn module in Python. The issue is that these plots make the area under each curve equal to 1 (which is how they are intended to behave), so the heights in the plots are normalized. Is there any way to show the actual values instead of the normalized ones? Also, is there any way to find the points of intersection of the curves?
Note: I do not want to use the curve_fit method of scipy as I am not sure about the distribution I will get for each dataframe, it can be multimodal also.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure()
sns.distplot(data_1['gap'],kde=True,hist=False,label='1')
sns.distplot(data_2['gap'],kde=True,hist=False,label='2')
sns.distplot(data_3['gap'],kde=True,hist=False,label='3')
plt.legend(loc='best')
plt.show()
The output of the code is attached in the link, as I can't post images: plot_link
You can just grab the line and rescale its y-values with set_data:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.lines import Line2D
# create some data
n = 1000
x = np.random.rand(n)
# plot stuff
fig, ax = plt.subplots(1, 1)
ax = sns.distplot(x, kde=True, hist=False, ax=ax)
# find the line and rescale y-values
children = ax.get_children()
for child in children:
    if isinstance(child, Line2D):
        xs, ys = child.get_data()
        ys = ys * n
        child.set_data(xs, ys)
# update y-limits (not done automatically)
ax.set_ylim(ys.min(), ys.max())
fig.canvas.draw()
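For the second part of the question (where the curves intersect), one option is to evaluate the KDEs yourself on a shared grid and look for sign changes in their difference. This is only a sketch: the two synthetic samples are assumptions standing in for the 'gap' columns, and `scipy.stats.gaussian_kde` with its default bandwidth may not match seaborn's curve exactly.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Two synthetic samples standing in for the 'gap' columns (assumption).
rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=2000)
sample_b = rng.normal(3.0, 1.0, size=2000)

# Shared evaluation grid covering both samples.
grid = np.linspace(min(sample_a.min(), sample_b.min()),
                   max(sample_a.max(), sample_b.max()), 2000)

# Scale each density by its sample size so the area equals the number of points.
counts_a = gaussian_kde(sample_a)(grid) * sample_a.size
counts_b = gaussian_kde(sample_b)(grid) * sample_b.size

# Crossings are where the difference between the curves changes sign.
diff = counts_a - counts_b
idx = np.nonzero(np.diff(np.sign(diff)))[0]

# Refine each crossing by linear interpolation between neighbouring grid points.
x0, x1 = grid[idx], grid[idx + 1]
d0, d1 = diff[idx], diff[idx + 1]
crossings = x0 - d0 * (x1 - x0) / (d1 - d0)
```

Multiplying a density by n gives an expected count per unit of x; to compare it with a histogram of raw counts, multiply by the bin width as well.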

Python: Generate random values from empirical distribution

In Java, I usually rely on the org.apache.commons.math3.random.EmpiricalDistribution class to do the following:
Derive a probability distribution from observed data.
Generate random values from this distribution.
Is there any Python library that provides the same functionality? It seems like scipy.stats.gaussian_kde.resample does something similar, but I'm not sure if it implements the same procedure as the Java type I'm familiar with.
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
# This represents the original "empirical" sample -- I fake it by
# sampling from a normal distribution
orig_sample_data = np.random.normal(size=10000)
# Generate a KDE from the empirical sample
sample_pdf = scipy.stats.gaussian_kde(orig_sample_data)
# Sample new datapoints from the KDE
new_sample_data = sample_pdf.resample(10000).T[:,0]
# Histogram of initial empirical sample
cnts, bins, p = plt.hist(orig_sample_data, label='original sample', bins=100,
                         histtype='step', linewidth=1.5, density=True)
# Histogram of datapoints sampled from KDE
plt.hist(new_sample_data, label='sample from KDE', bins=bins,
         histtype='step', linewidth=1.5, density=True)
# Visualize the kde itself
y_kde = sample_pdf(bins)
plt.plot(bins, y_kde, label='KDE')
plt.legend()
plt.show(block=False)
new_sample_data should be drawn from roughly the same distribution as the original data (to the degree that the KDE is a good approximation to the original distribution).
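To judge whether this matches the procedure you are used to, it helps to see what sampling from a Gaussian KDE amounts to in one dimension: pick observed points at random with replacement, then add Gaussian noise whose standard deviation is the kernel bandwidth. The sketch below does that with NumPy only; it assumes Scott's rule for the bandwidth (the `gaussian_kde` default) and is an illustration, not SciPy's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "empirical" sample, as in the example above.
orig_sample_data = rng.normal(size=10000)
n = orig_sample_data.size

# Scott's rule for 1-D data: bandwidth = sample std * n ** (-1/5).
bandwidth = orig_sample_data.std(ddof=1) * n ** (-1 / 5)

# Step 1: resample the observed points with replacement (a plain bootstrap).
picks = rng.choice(orig_sample_data, size=n, replace=True)

# Step 2: jitter each pick with Gaussian noise of width `bandwidth`.
new_sample_data = picks + rng.normal(0.0, bandwidth, size=n)
```

Skipping step 2 gives an ordinary bootstrap sample, which only ever returns values already present in the data.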

Plot a line graph over a histogram for residual plot in python

I have created a script to plot a histogram of a NO2 vs Temperature residuals in a dataframe called nighttime.
The histogram shows the normal distribution of the residuals from a regression line somewhere else in the python script.
I am struggling to find a way to plot a bell curve over the histogram, as in this example: Plot Normal distribution with Matplotlib
How can I get a fitting normal distribution for my residual histogram?
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
#Histogram of residuals
ax = plt.hist(NO2_WS_RM_mod.resid)
plt.xlim(-40,50)
plt.xlabel('Residuals')
plt.show()
You can use seaborn to plot the distribution together with a smooth density curve (a KDE by default). It is not clear to me which variable holds the residuals in your example, so the snippet below is for reference only.
import matplotlib.pyplot as plt
import seaborn as sns
# y_actual and y_predicted are placeholders for your observed and fitted target values
residuals = y_actual - y_predicted
sns.distplot(residuals, bins=10)  # you may choose the number of bins
plt.title('Error Terms', fontsize=20)
plt.xlabel('Residuals', fontsize=15)
plt.show()
Does the following work for you? (using some adapted code from the link you gave)
import scipy.stats as stats
plt.suptitle('NO2 and Temperature Residuals night-time', fontsize=20)
WSx_rm = nighttime['Temperature']
WSx_rm = sm.add_constant(WSx_rm)
NO2_WS_RM_mod = sm.OLS(nighttime.NO2, WSx_rm, missing = 'drop').fit()
NO2_WS_RM_mod_sr = (NO2_WS_RM_mod.resid / np.std(NO2_WS_RM_mod.resid))
# Histogram of residuals, normalized to a density so it shares a scale with the PDF
ax = plt.hist(NO2_WS_RM_mod.resid, density=True)
plt.xlim(-40,50)
plt.xlabel('Residuals')
# New Code: Draw fitted normal distribution
residuals = sorted(NO2_WS_RM_mod.resid) # sort so the curve is drawn left to right
normal_distribution = stats.norm.pdf(residuals, np.mean(residuals), np.std(residuals))
plt.plot(residuals, normal_distribution)
plt.show()
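If the histogram shows raw counts rather than a density, the normal PDF has to be rescaled before the two are comparable: multiply it by the number of observations times the bin width. A minimal NumPy-only sketch, using synthetic residuals (an assumption) in place of `NO2_WS_RM_mod.resid`:

```python
import numpy as np

# Synthetic residuals standing in for the model residuals (assumption).
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 12.0, size=1000)

# Histogram in raw counts, with equal-width bins.
counts, edges = np.histogram(residuals, bins=10)
width = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])

# Normal PDF with the sample mean and standard deviation, at the bin centers.
mu, sigma = residuals.mean(), residuals.std()
pdf = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Expected count per bin under the fitted normal.
expected_counts = pdf * residuals.size * width
```

Plotting `expected_counts` against `centers` over `plt.hist(residuals, bins=10)` then puts the bell curve on the same vertical scale as the bars.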

python, matplotlib: specgram data array values does not match specgram plot

I am using matplotlib.pyplot.specgram and matplotlib.pyplot.pcolormesh to make spectrogram plots of a seismic signal.
Background information: the reason for using pcolormesh is that I need to do arithmetic on the spectrogram data array and then replot the resulting spectrogram (for a three-component seismogram, east, north and vertical, I need to work out the horizontal spectral magnitude and divide the vertical spectra by the horizontal spectra). It is easier to do this using the spectrogram array data than on individual amplitude spectra.
I have found that the plots of the spectrograms after doing my arithmetic have unexpected values. Upon further investigation it turns out that the spectrogram plot made using the pyplot.specgram method has different values compared to the spectrogram plot made using pyplot.pcolormesh and the returned data array from the pyplot.specgram method. Both plots/arrays should contain the same values, I cannot work out why they do not.
Example:
The plot of
plt.subplot(513)
PxN, freqsN, binsN, imN = plt.specgram(trN.data, NFFT = 20000, noverlap = 0, Fs = trN.stats.sampling_rate, detrend = 'mean', mode = 'magnitude')
plt.title('North')
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.clim(0, 150)
plt.colorbar()
#np.savetxt('PxN.txt', PxN)
looks different to the plot of
plt.subplot(514)
plt.pcolormesh(binsZ, freqsZ, PxN)
plt.clim(0,150)
plt.colorbar()
even though the "PxN" data array (that is, the spectrogram data values for each segment) is generated by the first method and re-used in the second.
Is anyone aware why this is happening?
P.S. I realise that my value for NFFT is not a square number, but it's not important at this stage of my coding.
P.P.S. I am not aware of what the "imN" array (fourth returned variable from pyplot.specgram) is and what it is used for....
First off, let's show a stand-alone example of what you're describing so that other folks can reproduce it:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# Brownian noise sequence
x = np.random.normal(0, 1, 10000).cumsum()
fig, (ax1, ax2) = plt.subplots(nrows=2, figsize=(8, 10))
values, ybins, xbins, im = ax1.specgram(x, cmap='gist_earth')
ax1.set(title='Specgram')
fig.colorbar(im, ax=ax1)
mesh = ax2.pcolormesh(xbins, ybins, values, cmap='gist_earth')
ax2.axis('tight')
ax2.set(title='Raw Plot of Returned Values')
fig.colorbar(mesh, ax=ax2)
plt.show()
Magnitude Differences
You'll immediately notice the difference in magnitude of the plotted values.
By default, plt.specgram doesn't plot the "raw" values it returns. Instead, it scales them to decibels: with the default mode='psd' it plots 10 * log10 of the returned values, and with mode='magnitude' (as in your call) it uses 20 * log10. If you'd like it not to scale things, you'll need to specify scale="linear". However, for looking at frequency composition, a log scale is going to make the most sense.
With that in mind, let's mimic what specgram does:
plotted = 10 * np.log10(values)
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xbins, ybins, plotted, cmap='gist_earth')
ax.axis('tight')
ax.set(title='Plot of $10 * log_{10}(values)$')
fig.colorbar(mesh)
plt.show()
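When reproducing the decibel scaling by hand, two details are easy to trip over: the factor depends on whether the array holds power (10 * log10) or magnitude (20 * log10, the usual amplitude convention), and zeros turn into -inf. The sketch below illustrates both with a toy array; the `floor` value is a hypothetical choice, not something matplotlib applies.

```python
import numpy as np

# Toy magnitude values, including a zero that would otherwise map to -inf.
magnitude = np.array([0.0, 0.1, 1.0, 10.0, 100.0])
power = magnitude ** 2

# Hypothetical floor to keep log10 finite; choose one suited to your data.
floor = 1e-12

# 20 * log10(magnitude) and 10 * log10(power) describe the same decibel value.
db_from_magnitude = 20 * np.log10(np.maximum(magnitude, floor))
db_from_power = 10 * np.log10(np.maximum(power, floor ** 2))
```

Either array can be passed to pcolormesh in place of `plotted` above, as long as the factor matches the kind of values being plotted.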
Using a Log Color Scale Instead
Alternatively, we could use a log norm on the image and get a similar result, but communicate that the color values are on a log scale more clearly:
from matplotlib.colors import LogNorm
fig, ax = plt.subplots()
mesh = ax.pcolormesh(xbins, ybins, values, cmap='gist_earth', norm=LogNorm())
ax.axis('tight')
ax.set(title='Log Normalized Plot of Values')
fig.colorbar(mesh)
plt.show()
imshow vs pcolormesh
Finally, note that the examples we've shown have had no interpolation applied, while the original specgram plot did. specgram uses imshow, while we've been plotting with pcolormesh. In this case (regular grid spacing) we can use either.
Both imshow and pcolormesh are very good options in this case. However, imshow will have significantly better performance if you're working with a large array. Therefore, you might consider using it instead, even if you don't want interpolation (pass interpolation='nearest' to turn interpolation off).
As an example:
extent = [xbins.min(), xbins.max(), ybins.min(), ybins.max()]
fig, ax = plt.subplots()
mesh = ax.imshow(values, extent=extent, origin='lower', aspect='auto',
                 cmap='gist_earth', norm=LogNorm())
ax.axis('tight')
ax.set(title='Log Normalized Plot of Values')
fig.colorbar(mesh)
plt.show()
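Since the end goal is arithmetic on the spectrogram arrays, it may be simpler to compute them without plotting and only convert to decibels when drawing the result. The sketch below uses `matplotlib.mlab.specgram` for that. The three synthetic traces, the sampling rate `fs`, the NFFT value, and the definition of the horizontal magnitude as sqrt(E^2 + N^2) are all assumptions for illustration; substitute your own data and whichever horizontal combination you prefer.

```python
import numpy as np
from matplotlib import mlab

# Three synthetic components standing in for the seismogram (assumption).
rng = np.random.default_rng(1)
fs = 100.0  # hypothetical sampling rate in Hz
east, north, vertical = rng.normal(size=(3, 10000)).cumsum(axis=1)

def magnitude_spectrogram(trace):
    """Return (spectrum, freqs, times) without drawing anything."""
    return mlab.specgram(trace, NFFT=256, Fs=fs, noverlap=0,
                         detrend='mean', mode='magnitude')

spec_e, freqs, times = magnitude_spectrogram(east)
spec_n, _, _ = magnitude_spectrogram(north)
spec_z, _, _ = magnitude_spectrogram(vertical)

# Assumed definition of the horizontal spectral magnitude.
horizontal = np.sqrt(spec_e ** 2 + spec_n ** 2)

# Vertical-to-horizontal ratio on the raw (linear) values.
ratio = spec_z / horizontal

# Convert to decibels only for display; 20 * log10 because these are magnitudes.
ratio_db = 20 * np.log10(ratio)
```

`ratio_db` can then be drawn with `ax.pcolormesh(times, freqs, ratio_db)` or with imshow as in the example above.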
