Seaborn distplot() won't display frequency in the y-axis - python

I am trying to display the weighted frequency in the y-axis of a seaborn.distplot() graph, but it keeps displaying the density (which is the default in distplot())
I read the documentation and also many similar questions here in Stack.
The common answer is to set norm_hist=False and also to pass the weights as a numpy array, as in a standard histogram. However, it keeps showing the density and not the probability/frequency of each bin.
My code is
plt.figure(figsize=(10, 4))
plt.xlim(-0.145,0.145)
plt.axvline(0, color='grey')
data = df['col1']
x = np.random.normal(data.mean(), scale=data.std(), size=(100000))
normal_dist =sns.distplot(x, hist=False,color="red",label="Gaussian")
data_viz = sns.distplot(data,color="blue", bins=31,label="data", norm_hist=False)
# I also tried adding the weights inside the argument
#hist_kws={'weights': np.ones(len(data))/len(data)})
plt.legend(bbox_to_anchor=(1, 1), loc=1)
And I keep receiving this output:
Does anyone have an idea of what could be the problem here?
Thanks!
[EDIT]: The problem is that the y-axis is showing the KDE values and not those from the weighted histogram. If I set kde=False then I can display the frequency in the y-axis. However, I still want to keep the KDE, so I am not considering that option.

Keeping the KDE and the frequency/count on one y-axis in one plot will not work because they have different scales. So it might be better to create a plot with two y-axes, one for the KDE and one for the histogram.
From the documentation for norm_hist: "If True, the histogram height shows a density rather than a count. **This is implied if a KDE or fitted density is plotted**."
versusnja in https://github.com/mwaskom/seaborn/issues/479 has a workaround:
# Plot hist without kde.
# Create another Y axis.
# Plot kde without hist on the second Y axis.
# Remove Y ticks from the second axis.
first_ax = sns.distplot(data, kde=False)
second_ax = first_ax.twinx()
sns.distplot(data, ax=second_ax, kde=True, hist=False)
second_ax.set_yticks([])
If you need this just for visualization it should be good enough.
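To make the scale mismatch concrete, here is a minimal numpy-only sketch. It assumes a hypothetical stand-in for df['col1'] (normally distributed values with a small spread, similar to the x-limits in the question) and compares the weighted histogram from the question with a density histogram on the same bins.

```python
import numpy as np

# Hypothetical stand-in for df['col1']: 5000 values with a small spread.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=0.03, size=5000)

# Weighted histogram: each sample contributes 1/N, so the bar heights are
# relative frequencies and sum to 1.
weights = np.ones(len(data)) / len(data)
freq, edges = np.histogram(data, bins=31, weights=weights)

# Density histogram: the bar areas (height * width) sum to 1 instead.
density, _ = np.histogram(data, bins=31, density=True)
widths = np.diff(edges)

print("largest frequency:", freq.max())
print("largest density:  ", density.max())
print("density * width == frequency:", np.allclose(density * widths, freq))
```

With narrow bins the density values are much larger than the frequencies, which is why a KDE (a density) and a frequency histogram need separate y-axes. In seaborn 0.11 and later, sns.histplot(data, stat="probability", kde=True) may be a simpler route, since histplot rescales its KDE curve to the chosen statistic.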

Related

Set log xticks in matplotlib for a linear plot

Consider
xdata=np.random.normal(5e5,2e5,int(1e4))
plt.hist(np.log10(xdata), bins=100)
plt.show()
plt.semilogy(xdata)
plt.show()
Is there any way to display the xticks of the first plot (plt.hist) like the second plot's yticks? For good reasons I want to histogram the np.log10(xdata) of xdata, but I'd like to set minor ticks to display as usual in a log scale (even considering that the exponent is linear...)
In other words, I want the x-axis of the first plot to look like the y-axis of the 2nd plot, without changing the spacing between major ticks (e.g., adding log marks between 5.5 and 6.0, without altering these values).
Proper histogram plot with logarithmic x-axis:
Explanation:
- Cut off negative values
  - The randomly generated example data likely still contains some negative values
    - activate the commented code lines at the beginning to see the effect
  - the logarithm isn't defined for values <= 0
  - while the 2nd plot just deals with y-axis log scaling (negative values are just out of range), the 1st plot doesn't work with negative values in the bins range
  - real world data probably won't be <= 0, otherwise keep that in mind
- Bins should be aligned to the log scale as well
  - otherwise the distribution of the bin widths looks off
  - switch the # between the plt.hist( statements in the 1st plot section to see the effect
- xdata (not np.log10(xdata)) is plotted in the histogram
  - that 'workaround' of plotting np.log10(xdata) was probably the root cause of the misunderstanding in the comments
Code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # just to have repeatable results for the answer
xdata=np.random.normal(5e5,2e5,int(1e4))
# MIN_xdata, MAX_xdata = np.min(xdata), np.max(xdata)
# print(f"{MIN_xdata}, {MAX_xdata}") # note the negative values
# cut off potential negative values (log function isn't defined for <= 0 )
xdata = np.ma.masked_less_equal(xdata, 0)
MIN_xdata, MAX_xdata = np.min(xdata), np.max(xdata)
# print(f"{MIN_xdata}, {MAX_xdata}")
# align the bins to fit a log scale
bins = 100
bins_log_aligned = np.logspace(np.log10(MIN_xdata), np.log10(MAX_xdata), bins)
# 1st plot
plt.hist(xdata, bins = bins_log_aligned) # note: xdata (not np.log10(xdata) )
# plt.hist(xdata, bins = 100)
plt.xscale('log')
plt.show()
# 2nd plot
plt.semilogy(xdata)
plt.show()
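As a small check of the bin alignment step, the sketch below builds log-aligned edges without any plotting. It is only an illustration under a few assumptions: it uses np.geomspace instead of np.logspace (geomspace returns the requested endpoints exactly), it asks for n_bins + 1 edges so that there really are n_bins bins, and the names positive and n_bins are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
xdata = rng.normal(5e5, 2e5, int(1e4))

# The log of a non-positive number is undefined, so keep positive values only.
positive = xdata[xdata > 0]

# Edges evenly spaced in log10; the first and last edge are exactly the data
# minimum and maximum. n_bins + 1 edges delimit n_bins bins.
n_bins = 100
edges = np.geomspace(positive.min(), positive.max(), n_bins + 1)

counts, _ = np.histogram(positive, bins=edges)
log_steps = np.diff(np.log10(edges))

print("bins:", counts.size, "samples binned:", counts.sum())
print("constant step in log10:", np.allclose(log_steps, log_steps[0]))
```

Passing these edges to plt.hist(positive, bins=edges) together with plt.xscale('log') should give bars of equal visual width.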
Just kept for now for clarification purpose. Will be deleted when the question is revised.
Disclaimer:
As Lucas M. Uriarte already mentioned, this isn't an expected way of changing axis ticks.
x axis ticks and labels don't represent the plotted data
You should at least always provide that information along with such a plot.
The plot
From seeing the result I kinda understand where that special plot idea is coming from - still there should be a preferred way (e.g. conversion of the data in advance) to do such a plot instead of 'faking' the axis.
Explanation of how that special axis transfer plot is done:
- the original x-axis is hidden
- a twiny axis is added
  - note that its y-axis is hidden by default, so that doesn't need handling
- the twiny x-axis is set to log and the 2nd plot's y-axis limits are transferred
  - subplots are used to directly transfer the 2nd plot's y-axis limits
  - use variables if you need to stick with your two plots
- the twiny x-axis is moved from top (twiny default position) to bottom (where the original x-axis was)
Code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # just to have repeatable results for the answer
xdata=np.random.normal(5e5,2e5,int(1e4))
# plt.subplots creates the figure itself; a separate plt.figure() call would only open an extra empty figure
fig, axs = plt.subplots(2, figsize=(7,10), facecolor=(1, 1, 1))
# 1st plot
axs[0].hist(np.log10(xdata), bins=100) # plot the data on the normal x axis
axs[0].axes.xaxis.set_visible(False) # hide the normal x axis
# 2nd plot
axs[1].semilogy(xdata)
# 1st plot - twin axis
axs0_y_twin = axs[0].twiny() # set a twiny axis, note twiny y axis is hidden by default
axs0_y_twin.set(xscale="log")
# transfer the limits from the 2nd plot y axis to the twin axis
axs0_y_twin.set_xlim(axs[1].get_ylim()[0],
                     axs[1].get_ylim()[1])
# move the twin x axis from top to bottom
axs0_y_twin.tick_params(axis="x", which="both", bottom=True, top=False,
                        labelbottom=True, labeltop=False)
# Disclaimer
disclaimer_text = "Disclaimer: x axis ticks and labels don't represent the plotted data"
axs[0].text(0.5,-0.09, disclaimer_text, size=12, ha="center", color="red",
            transform=axs[0].transAxes)
plt.tight_layout()
plt.subplots_adjust(hspace=0.2)
plt.show()
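One possible way to avoid 'faking' the axis is to keep the real x-axis of the log10 histogram and only add minor ticks at the positions a log scale would use. The sketch below is an assumption-laden illustration, not the method from the answer above: it drops non-positive values before taking the log, uses the non-interactive Agg backend so it runs without a display, and the variable names (log_x, decades, minor) are made up.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
xdata = rng.normal(5e5, 2e5, int(1e4))
log_x = np.log10(xdata[xdata > 0])  # drop non-positive values before the log

fig, ax = plt.subplots()
ax.hist(log_x, bins=100)

# Major ticks and labels (e.g. 5.5, 6.0) stay untouched. Minor ticks are put
# at log10(2 * 10**d), ..., log10(9 * 10**d) for every decade d in view.
lo, hi = ax.get_xlim()
decades = np.arange(np.floor(lo), np.ceil(hi))
minor = (decades[:, None] + np.log10(np.arange(2, 10))).ravel()
minor = minor[(minor >= lo) & (minor <= hi)]

ax.set_xticks(minor, minor=True)
ax.set_xlim(lo, hi)  # keep the original view limits
```

Here the ticks still describe the plotted data: a minor tick at 5.699 marks log10(5e5), so no disclaimer on the plot is needed.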

How to set a specific range of the x-axis and increase the length of the plot?

I'm looking to set a specific range of values on the x-axis of my plot and to increase the length of this axis.
I changed the range of the x-axis values; however, the ticks stay at the same fixed spacing.
Besides, I tried to increase the length of the x-axis, but I failed again.
For now, I'm only plotting an empty graph, because I need to set the specifications for the axes first.
Here is part of the code to the plot:
fig1, ax = plt.subplots()
ax.set_xlim(1, 1200)
ax.set_ylim(-800, 200)
ax.set_box_aspect(1)
plt.show()
This code gives me a square plot with these ticks:
x-axis = 0-200-400...1200,
I'm looking for:
x-axis = 0-50-100-150...1200
Also, I need to change the shape of the plot from square to rectangular, with a longer x-axis.
Any suggestion or comment is welcome!
Thanks!
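plt.figure(figsize=(15,2))
Put this on the first line to set the size of your plot. Since you want a longer x-axis, make the width (first value) larger than the height (second value) in figsize. If you keep fig1, ax = plt.subplots(), pass figsize there instead, and drop ax.set_box_aspect(1), which forces the axes to stay square.
l1=np.arange(0,1250,50)
plt.xticks(l1)
Add the two lines above after setting the y limits to place the x ticks from 0 to 1200 in steps of 50.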
You can change the size (and therefore the shape) of a pyplot figure like this:
fig1.set_size_inches(10, 8)
As for the ticks on the axis, this post gives a pretty in-depth answer on how to customize those.
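Combining both answers with the code from the question could look roughly like this. It is a sketch under a few assumptions: the figure size (15, 4) is an arbitrary choice, the x-limits start at 0 rather than 1 to match the requested ticks, the tick labels are rotated only to keep 25 labels readable, and the Agg backend is used so the snippet runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# A wide figure: width 15 in, height 4 in (arbitrary example values).
fig1, ax = plt.subplots(figsize=(15, 4))

ax.set_xlim(0, 1200)
ax.set_ylim(-800, 200)

# One tick every 50 units: 0, 50, 100, ..., 1200.
ax.set_xticks(np.arange(0, 1201, 50))
ax.tick_params(axis="x", labelrotation=90)

# Note: ax.set_box_aspect(1) is left out on purpose, because it would
# force the axes back into a square.
```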

Why is distplot changing the range of values plotted?

I have an array of values for which I am trying to fit a probability density function. I plotted the histogram using distplot as shown below:
x = [ 17.56,
162.52,
172.58,
160.82,
182.14,
165.86,
242.06,
135.76,
122.86,
230.22,
208.66,
271.36,
122.68,
188.42,
171.82,
102.30,
196.40,
107.38,
192.35,
179.66,
173.30,
254.66,
176.12,
75.365,
135.78,
103.66,
183.50,
166.08,
207.66,
146.22,
151.19,
172.20,
103.41,
133.93,
186.48,]
sns.distplot(x)
and the plot looks like this:
The minimum value in my array is about 17 and the maximum is around 270, so I don't understand the range on the x-axis in the figure, as I have not added any arguments either. Does sns.distplot standardize the data before plotting?
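No, the data is not standardized. A KDE curve is a sum of Gaussian (normal) curves, one centered on each data point. Such a normal curve has an infinite tail, so the summed curve extends beyond the smallest and largest values; the plot cuts it off once it gets close enough to zero height.
Note that sns.distplot has been deprecated since seaborn 0.11 and is replaced (in this case) by sns.histplot(..., kde=True). kdeplot has a cut= parameter that controls how far the curve extends past the extreme data points, in multiples of the bandwidth. For the KDE drawn by histplot it defaults to zero, which cuts the curve at the data limits. cut is one of the kde_kws in histplot, so it can also be set explicitly: sns.histplot(x, kde=True, kde_kws={'cut': 0}).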
import seaborn as sns
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.histplot(x, kde=True)
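To see where the extra x-range comes from, here is a hand-rolled Gaussian KDE in plain numpy. It is only a sketch: it assumes a Scott's-rule bandwidth and a curve that is extended 3 bandwidths past the extreme values, which may not match the exact settings distplot used for the figure in the question.

```python
import numpy as np

x = np.array([17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76,
              122.86, 230.22, 208.66, 271.36, 122.68, 188.42, 171.82, 102.30,
              196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365,
              135.78, 103.66, 183.50, 166.08, 207.66, 146.22, 151.19, 172.20,
              103.41, 133.93, 186.48])

n = x.size
bw = x.std(ddof=1) * n ** (-1 / 5)  # Scott's rule bandwidth (an assumption)

# Evaluate the curve on a grid reaching `cut` bandwidths past the data limits.
cut = 3
grid = np.linspace(x.min() - cut * bw, x.max() + cut * bw, 512)

# KDE = average of one Gaussian bump per data point.
z = (grid[:, None] - x[None, :]) / bw
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))

# Trapezoidal rule: the area under the curve should be close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))

print("data range: ", x.min(), "to", x.max())
print("curve range:", grid[0], "to", grid[-1])
print("area under the curve:", area)
```

The curve starts well below zero although the smallest data value is 17.56, simply because the bump centered on 17.56 is still visibly above zero there. Setting cut = 0 in this sketch stops the grid at the data limits.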

seaborn distplot different bar width on each figure

Sorry for giving an image however I think it is the best way to show my problem.
As you can see, the bin widths are all different. From my understanding, each bin shows a range of rent_hour. I am not sure why different figures have different bin widths even though I didn't set any.
My code is as follows:
figure, axes = plt.subplots(nrows=4, ncols=3)
figure.set_size_inches(18,14)
plt.subplots_adjust(hspace=0.5)
for ax, age_g in zip(axes.ravel(), age_cat):
    group = total_usage_df.loc[(total_usage_df.age_group == age_g) & (total_usage_df.day_of_week <= 4)]
    sns.distplot(group.rent_hour, ax=ax, kde=False)
    ax.set(title=age_g)
    ax.set_xlim([0, 24])
figure.suptitle("Weekday usage pattern", size=25);
additional question:
The answer to "Seaborn : How to get the count in y axis for distplot using PairGrid" says that kde=False makes the y-axis show the count. However, the example in the docs at http://seaborn.pydata.org/generated/seaborn.distplot.html uses kde=False and still seems to show something else. How can I set the y-axis to show the count?
I've tried
sns.distplot(group.rent_hour, ax=ax, norm_hist=True) and it still seems to give something other than the count.
sns.distplot(group.rent_hour, ax=ax, kde=False) gives me the count; however, I don't know why it does.
Answer 1:
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count.
This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
Answer 2:
# Plotting hist without kde
ax = sns.distplot(your_data, kde=False)
# Creating another Y axis
second_ax = ax.twinx()
#Plotting kde without hist on the second Y axis
sns.distplot(your_data, ax=second_ax, kde=True, hist=False)
#Removing Y ticks from the second axis
second_ax.set_yticks([])
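On the bin width part of the question: when no bins are given, distplot picks the number of bins from the data (as far as I know with a Freedman-Diaconis style reference rule), so groups of different size or spread end up with different bin widths. The sketch below illustrates this with numpy only; small_group and large_group are hypothetical stand-ins for the rent_hour values of two age groups.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the rent_hour column of two age groups.
small_group = rng.integers(0, 24, size=200)
large_group = rng.integers(0, 24, size=5000)

# Data-driven rule: the number of bins depends on sample size and spread.
auto_small = np.histogram_bin_edges(small_group, bins="fd")
auto_large = np.histogram_bin_edges(large_group, bins="fd")
print("automatic bins:", len(auto_small) - 1, "vs", len(auto_large) - 1)

# Fixed edges shared by all groups: one bin per hour, from 0 to 24.
hour_edges = np.arange(0, 25)
counts_small, _ = np.histogram(small_group, bins=hour_edges)
counts_large, _ = np.histogram(large_group, bins=hour_edges)
print("fixed bins:", counts_small.size, "vs", counts_large.size)
```

In the plotting loop this would translate to passing the same edges to every call, for example sns.distplot(group.rent_hour, ax=ax, kde=False, bins=np.arange(0, 25)).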

Matplotlib - Boxplot calculated on log10 values but shown in logarithmic scale

I think this is a simple question, but I just still can't seem to think of a simple solution. I have a set of data of molecular abundances, with values ranging many orders of magnitude. I want to represent these abundances with boxplots (box-and-whiskers plots), and I want the boxes to be calculated on log scale because of the wide range of values.
I know I can just calculate the log10 of the data and send it to matplotlib's boxplot, but this does not retain the logarithmic scale in plots later.
So my question is basically this:
When I have calculated a boxplot based on the log10 of my values, how do I convert the plot afterward to be shown on a logarithmic scale instead of linear with the log10 values?
I can change tick labels to partly fix this, but I have no clue how I get logarithmic scales back to the plot.
Or is there another more direct way to plotting this. A different package maybe that has this options already included?
Many thanks for the help.
I'd advise against doing the boxplot on the raw values and setting the y-axis to logarithmic, because the boxplot function is not designed to work across orders of magnitude and you may get too many outliers (depending on your data, of course).
Instead, you can plot the logarithm of the data and manually adjust the y-labels.
Here is a very crude example:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)
fig = plt.figure(figsize=(9, 3))
ax = plt.subplot(1, 3, 1)
ax.boxplot(np.log10(values))
ax.set_yticks(np.arange(-3, 4))
ax.set_yticklabels(10.0**np.arange(-3, 4))
ax.set_title('log')
ax = plt.subplot(1, 3, 2)
ax.boxplot(values)
ax.set_yscale('log')
ax.set_title('raw')
ax = plt.subplot(1, 3, 3)
ax.boxplot(values, whis=[5, 95])
ax.set_yscale('log')
ax.set_title('5%')
plt.show()
The middle figure shows the box plot on the raw values. This leads to many outliers, because the maximum whisker length is computed as a multiple (default: 1.5) of the interquartile range (the box height), which does not scale across orders of magnitude.
Alternatively, you could specify to draw the whiskers for a given percentile range:
ax.boxplot(values, whis=[5, 95])
In this case you get a fixed share of outliers (5%) above and below the whiskers, as in the right figure.
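The outlier effect can be checked without plotting. The sketch below applies the 1.5 * IQR whisker rule by hand to the same kind of data, once on the raw values and once on their log10; count_outliers is a hypothetical helper written for this illustration, not a matplotlib function.

```python
import numpy as np

rng = np.random.default_rng(42)
values = 10 ** rng.uniform(-3, 3, size=100)

def count_outliers(a, whis=1.5):
    """Count points lying more than whis * IQR outside the quartiles."""
    q1, q3 = np.percentile(a, [25, 75])
    iqr = q3 - q1
    return int(np.sum((a < q1 - whis * iqr) | (a > q3 + whis * iqr)))

raw_outliers = count_outliers(values)
log_outliers = count_outliers(np.log10(values))

print("outliers on raw values:  ", raw_outliers)
print("outliers on log10 values:", log_outliers)
```

On the raw values the interquartile range is tiny compared with the largest values, so many points end up beyond the upper whisker; on the log10 values the same rule flags none for this data.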
You can use plt.yscale:
plt.boxplot(data); plt.yscale('log')
