Smooth histogram in python - python

In my program, I calculate N values for each of three parameters and want to create one histogram per parameter. I have strict conditions on the histograms. First, the range is constrained (the histogram must go strictly to zero at certain points); second, the curve should be smooth.
I use np.histogram as follows:
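import numpy as np
import matplotlib.pyplot as plt
hist, bins = np.histogram(Gamm1, bins=100)
center = (bins[:-1] + bins[1:]) / 2  # bin centers rather than left edges
plt.plot(center, hist)
plt.show()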
but the result is too jagged. After that, I tried the following with seaborn:
snsplot = sns.kdeplot(data['Third'], shade=True)
fig = snsplot.get_figure()
fig.savefig("output2.png")
but here the approximation extends outside the allowed range (the range comes from physical conditions).
I think that changing the bins in the seaborn solution, as can be done for np.histogram, might help.
In the end, though, I'm looking for any solution that is smooth and stays within the range I specify.
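One way to get a curve that is both smooth and confined to a fixed range is to evaluate a Gaussian kernel density estimate only on a grid spanning that range. The sketch below is a minimal NumPy version under stated assumptions: the helper name bounded_kde, the Scott's-rule bandwidth, the synthetic sample and the limits lo and hi are all illustrative and not taken from the question. seaborn's kdeplot has clip= and cut= parameters that restrict the evaluation grid in a similar spirit.

```python
import numpy as np

def bounded_kde(samples, lo, hi, gridsize=200):
    """Gaussian KDE evaluated only on [lo, hi] and renormalized on that interval."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule of thumb for the bandwidth (an assumption; tune as needed).
    bw = samples.std(ddof=1) * n ** (-1 / 5)
    grid = np.linspace(lo, hi, gridsize)
    # Sum one Gaussian bump per sample at every grid point.
    z = (grid[:, None] - samples[None, :]) / bw
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))
    # Renormalize so the curve integrates to 1 over [lo, hi] (trapezoid rule).
    area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
    return grid, density / area

rng = np.random.default_rng(0)
gamma1 = rng.normal(loc=0.5, scale=0.2, size=1000)  # synthetic stand-in for Gamm1
grid, density = bounded_kde(gamma1, lo=0.0, hi=1.0)
```

The result can be drawn with plt.plot(grid, density). Truncation keeps the curve inside the range but does not force it to reach exactly zero at lo and hi; if that is a hard requirement, the edges would need separate treatment, for example a boundary-corrected estimator.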

Related

Why do these plots with same parameters look so different? logging in matplotlib vs seaborn

With seaborn.histplot:
import seaborn as sns
plot = sns.histplot(data = adata.obs, x = 'n_counts', bins=50, log_scale=True)
plot.set_xlim(1, 100000)
With plt.hist
adata = org_1
data = adata.obs['n_counts']
plt.hist(data, bins=50, range=(1, 100000))
plt.xscale("log")
With plt.hist, but taking the log of the data before passing it to the plotting function (tangent: how can I get the x axis to be in 10^n notation, as in the first plot?):
data = np.log10(adata.obs['n_counts'])
plt.hist(data, bins=50)
plt.xlabel('log nUMI')
With plt.hist, again taking the log of the data first, but specifying the range as in plots 1 and 2:
data = np.log10(adata.obs['n_counts'])
plt.hist(data, bins=50, range = (1, 10000))
plt.xlabel('log nUMI')
(I don't have enough reputation to comment, but I would like to assist)
With plt.hist
Here you have exactly 50 buckets, each about 2,000 units wide (see how the first bucket ends at 2*10^3). Because you plotted them on a log scale, the buckets on the left appear unnaturally wide, which is an artifact of the log axis.
data = np.log10(adata.obs['n_counts'])
As a general rule, in most (if not all) plotting utilities that I have used, taking the log of the data yourself does not get you the log-spaced ticks you want. If you want that tick spacing, don't take the log before plotting; let the plot do it.
plt.hist(data, bins=50, range = (1, 10000))
Once you take the log, the range must be converted to the same scale: it should go from log10(1) = 0 to log10(10000) = 4. No data point has a value of 10000 after the log (that would imply numbers with about 10,000 digits to begin with).
Your plots do not have the same parameters.
When you pass plt.hist some data and ask for 50 bins, it has no way of knowing that you are later going to change the axis scale, so it computes 50 linearly-spaced breaks.
Because you passed log_scale=True to sns.histplot, it knows the scale at the time that it computes the bin breaks, and it can make them evenly spaced in log intervals.
(You could also set the axis scale to log before calling sns.histplot and without passing log_scale=True, but plt.hist does not work this way).
When you log the data first, there is no way for the function to know that the numbers represent log values. So you do get bins that appear evenly spaced (because everything matplotlib does happens on a linear scale now) and represent a lognormal distribution well, but those bins no longer correspond to the range covered by the original data, and you would need to manually change any tick labels to represent the original magnitudes.
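The difference between the two kinds of bin breaks can be checked directly with NumPy. This is a sketch on synthetic data: counts is a made-up lognormal sample standing in for adata.obs['n_counts'], and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.lognormal(mean=7.0, sigma=1.5, size=5000)  # synthetic stand-in for n_counts

# Equal-width breaks, as produced by bins=50 with range=(1, 100000).
linear_edges = np.linspace(1, 100000, 51)

# Breaks equally spaced in log10, i.e. equal visual width on a log axis.
log_edges = np.logspace(0, 5, 51)

linear_hist, _ = np.histogram(counts, bins=linear_edges)
log_hist, _ = np.histogram(counts, bins=log_edges)

print(linear_edges[1] - linear_edges[0])  # width of every linear bin
print(log_edges[1] / log_edges[0])        # constant ratio between neighboring log edges
```

Passing the precomputed edges to matplotlib, as in plt.hist(data, bins=log_edges) followed by plt.xscale("log"), should give bars of equal visual width together with 10^n tick labels.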

Why is distplot changing the range of values plotted?

I have an array of values for which I am trying to fit a probability density function. I plotted the histogram using distplot as shown below:
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.distplot(x)
and the plot looks like this:
The minimum value in my array is about 17 and the maximum is about 270, so I don't understand the range on the x-axis in the figure, especially as I have not passed any extra arguments. Does sns.distplot standardize the data before plotting?
A KDE curve is built by summing Gaussian (normal) curves, one centered on each data point. Each of those curves has an infinite tail, so the KDE extends past the smallest and largest values; here it is cut off once it gets close enough to zero height.
Note that sns.distplot has been deprecated since seaborn 0.11 and replaced by (in this case) sns.histplot(..., kde=True). kdeplot has a parameter cut= (default 3 in kdeplot itself) that sets how far beyond the extreme data points the curve is drawn, as a multiple of the bandwidth. With histplot(..., kde=True) it defaults to zero, cutting the curve at the data limits; cut is one of the kde_kws in histplot, e.g. sns.histplot(x, kde=True, kde_kws={'cut': 0}).
import seaborn as sns
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.histplot(x, kde=True)
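The effect of cut can be seen by computing the evaluation grid by hand. The sketch below assumes that the grid runs from min - cut*bw to max + cut*bw and uses a Scott's-rule bandwidth; kde_support is a hypothetical helper, and the bandwidth seaborn actually picks may differ.

```python
import numpy as np

def kde_support(samples, cut, gridsize=200):
    """Grid on which a KDE would be evaluated for a given cut (illustrative only)."""
    samples = np.asarray(samples, dtype=float)
    # Scott's rule of thumb for the bandwidth (an assumption).
    bw = samples.std(ddof=1) * samples.size ** (-1 / 5)
    return np.linspace(samples.min() - cut * bw, samples.max() + cut * bw, gridsize)

x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86,
     230.22, 208.66, 271.36, 122.68, 188.42, 171.82, 102.30, 196.40, 107.38,
     192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50,
     166.08, 207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]

wide = kde_support(x, cut=3)   # extends past the data on both sides
tight = kde_support(x, cut=0)  # stops exactly at the data limits

print(wide[0], wide[-1])
print(tight[0], tight[-1])
```

With cut=3 the grid starts below the smallest value and ends above the largest one, which is why the x-axis in the distplot figure is wider than the data.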

Seaborn distplot() won't display frequency in the y-axis

I am trying to display the weighted frequency on the y-axis of a seaborn.distplot() graph, but it keeps displaying the density (which is the default in distplot()).
I read the documentation and also many similar questions here on Stack Overflow.
The common answer is to set norm_hist=False and to pass the weights as a NumPy array, as in a standard histogram. However, it keeps showing the density and not the probability/frequency of each bin.
My code is
plt.figure(figsize=(10, 4))
plt.xlim(-0.145,0.145)
plt.axvline(0, color='grey')
data = df['col1']
x = np.random.normal(data.mean(), scale=data.std(), size=(100000))
normal_dist =sns.distplot(x, hist=False,color="red",label="Gaussian")
data_viz = sns.distplot(data,color="blue", bins=31,label="data", norm_hist=False)
# I also tried adding the weights inside the argument
#hist_kws={'weights': np.ones(len(data))/len(data)})
plt.legend(bbox_to_anchor=(1, 1), loc=1)
And I keep receiving this output:
Does anyone have an idea of what could be the problem here?
Thanks!
[EDIT]: The problem is that the y-axis is showing the KDE values and not those of the weighted histogram. If I set kde=False, I can display the frequency on the y-axis. However, I still want to keep the KDE, so I am not considering that option.
Keeping the KDE and the frequency/count on a single y-axis will not work because they have different scales. It might be better to create a plot with two y-axes, one for the KDE and one for the histogram.
From the documentation for norm_hist: "If True, the histogram height shows a density rather than a count. **This is implied if a KDE or fitted density is plotted**."
versusnja has posted a workaround in https://github.com/mwaskom/seaborn/issues/479:
# Plot hist without kde.
# Create another Y axis.
# Plot kde without hist on the second Y axis.
# Remove Y ticks from the second axis.
first_ax = sns.distplot(data, kde=False)
second_ax = first_ax.twinx()
sns.distplot(data, ax=second_ax, kde=True, hist=False)
second_ax.set_yticks([])
If you need this just for visualization it should be good enough.
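The scale mismatch between counts, per-bin probabilities and densities can be verified with np.histogram alone. This is a sketch on synthetic data: data is a made-up normal sample standing in for df['col1'], with 31 bins as in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.03, size=2000)  # synthetic stand-in for df['col1']

counts, edges = np.histogram(data, bins=31)
widths = np.diff(edges)

# Probability per bin: weights of 1/N make the bar heights sum to 1.
prob, _ = np.histogram(data, bins=edges, weights=np.ones(data.size) / data.size)

# Density: heights are scaled so that the total area is 1, like a KDE curve.
dens, _ = np.histogram(data, bins=edges, density=True)

print(prob.sum())                                        # heights sum to 1
print((dens * widths).sum())                             # area sums to 1
print(np.allclose(dens, counts / (data.size * widths)))  # density = count / (N * width)
```

Because the bin width here is much smaller than 1, the density values are far larger than the per-bin probabilities, so a KDE and a frequency histogram cannot share one y-axis without rescaling one of them.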

Is it possible to use a non-Gaussian kernel for the two marginal distributions in a seaborn jointplot?

My data look like:
s1 = sns.jointplot(data.columns[i],
data.columns[j],
data=data,
space=0, color="b", stat_func=None)
if I use kde instead
s1 = sns.jointplot(data.columns[i],
data.columns[j],
data=data, kind = 'kde',
space=0, color="b", stat_func=None)
I am quite happy with the two-dimensional KDE interpolation, but less so with the marginal ones. The problem is that the two estimates, placed so close together, suggest that the maximum of the distribution lies at two different points, which could be quite misleading.
So now the actual question: is it possible to specify something other than Gaussian as the kernel (blue) for the two marginal distributions? (I know that Gaussian is the only option in 2D.) For example, 'biw' (green) might look better aesthetically, although I am still not convinced that it is mathematically sound to place interpolations made with different kernels next to each other as if they were the same thing. Can I specify the different kernel somewhere in sns.jointplot, or is the only way to overwrite the marginal distribution afterwards with one calculated separately?
ax1 = sns.distplot(data[data.columns[j]])
sns.kdeplot(data[data.columns[j]], kernel= 'biw', ax = ax1)
You can set a different kernel for the marginal plots:
s1 = sns.jointplot(data.columns[i],
data.columns[j],
data=data, kind = 'kde',
space=0, color="b", stat_func=None,
marginal_kws={"kernel":"biw"}) # like this
or, if you want to change just one marginal plot, you can replot on them:
s1.ax_marg_y.cla() # clear axis
sns.kdeplot(data.y, ax=s1.ax_marg_y, # choose the ax
kernel="biw", # choose your kernel
legend=0, # remove the legend
vertical=True) # swap axis
vertical=True swaps the x and y axes, i.e. it is not needed if you change the top marginal plot.
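For reference, the biweight (quartic) kernel behind the 'biw' option can also be computed by hand, which may help where the kernel argument is unavailable: as far as I know, seaborn 0.11 and later only support the Gaussian kernel. The sketch below is a plain NumPy illustration, not the seaborn or statsmodels implementation; biweight_kde, the synthetic sample y and the hand-picked bandwidth are assumptions.

```python
import numpy as np

def biweight_kde(samples, grid, bw):
    """KDE with the biweight kernel K(u) = 15/16 * (1 - u^2)^2 for |u| <= 1."""
    samples = np.asarray(samples, dtype=float)
    u = (np.asarray(grid, dtype=float)[:, None] - samples[None, :]) / bw
    # The kernel has compact support: it is exactly zero for |u| > 1.
    kernel = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)
    return kernel.sum(axis=1) / (samples.size * bw)

rng = np.random.default_rng(0)
y = rng.normal(size=500)  # synthetic stand-in for one data column
bw = 0.5                  # hand-picked bandwidth, purely illustrative
grid = np.linspace(y.min() - bw, y.max() + bw, 400)
density = biweight_kde(y, grid, bw)
```

Unlike a Gaussian kernel, this curve reaches zero within one bandwidth of the extreme data points. To draw it on the right-hand marginal axis, something like s1.ax_marg_y.plot(density, grid) should work after clearing that axis.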

Displaying 3 histograms on 1 axis in a legible way - matplotlib

I have produced 3 sets of data which are organised in numpy arrays. I'm interested in plotting the probability distribution of these three sets of data as normed histograms. All three distributions should look almost identical so it seems sensible to plot all three on the same axis for ease of comparison.
By default matplotlib histograms are plotted as bars which makes the image I want look very messy. Hence, my question is whether it is possible to force pyplot.hist to only draw a box/circle/triangle where the top of the bar would be in the default form so I can cleanly display all three distributions on the same graph or whether I have to calculate the histogram data and then plot it separately as a scatter graph.
Thanks in advance.
There are two ways to plot three histograms simultaneously with hist, but neither is exactly what you asked for. To do what you ask, you must calculate the histogram yourself, e.g. with numpy.histogram, and then draw it with the plot method. Use scatter only if you want to attach other information to your points by setting a size for each one.
The first alternative approach to using hist involves passing all three data sets at once to the hist method. The hist method then adjusts the widths and placements of each bar so that all three sets are clearly presented.
The second alternative is to use the histtype='step' option, which makes clear plots for each set.
Here is a script demonstrating this:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(101)
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
c = np.random.normal(size=1000)
common_params = dict(bins=20,
range=(-5, 5),
density=True)  # 'normed' was removed in newer Matplotlib releases; 'density' replaces it
plt.subplots_adjust(hspace=.4)
plt.subplot(311)
plt.title('Default')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.subplot(312)
plt.title('Skinny shift - 3 at a time')
plt.hist((a, b, c), **common_params)
plt.subplot(313)
common_params['histtype'] = 'step'
plt.title('With steps')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.savefig('3hist.png')
plt.show()
And here is the resulting plot:
Keep in mind you could do all this with the object oriented interface as well, e.g. make individual subplots, etc.
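The approach that matches the original request (compute the histogram, then draw only a marker at the top of each bin) can be sketched as follows. histogram_points and plot_three are hypothetical helper names; the bin settings mirror the script above, and the samples come from a NumPy generator rather than np.random.seed.

```python
import numpy as np

def histogram_points(samples, bins=20, value_range=(-5, 5)):
    """Return bin centers and normalized heights, ready for a marker plot."""
    heights, edges = np.histogram(samples, bins=bins, range=value_range, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, heights

def plot_three(datasets, markers=("o", "s", "^")):
    """Draw one marker series per data set on a shared axis (needs matplotlib)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    for samples, marker in zip(datasets, markers):
        centers, heights = histogram_points(samples)
        ax.plot(centers, heights, marker=marker, linestyle="none")
    return fig, ax

rng = np.random.default_rng(101)
a, b, c = (rng.normal(size=1000) for _ in range(3))
centers, heights = histogram_points(a)
```

Calling plot_three((a, b, c)) then puts circles, squares and triangles for the three data sets on one axis; changing linestyle to "-" would join the markers if a connected curve reads better.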
