I tried looking this up on other users' questions, but I don't think I have found an answer.
I am attempting to plot a histogram from some data I have stored in a Pandas dataframe, and I want the y-axis value of each bin to equal the probability of that bin's event occurring. Since the density=True argument of matplotlib.pyplot.hist divides the counts in a bin by the total count and by the bin width, for bins whose width is not 1 the y-axis value of the histogram doesn't equal the probability of the event happening in that bin. It instead equals the probability in that bin per unit of x. I wish to make my bins 10 units wide, which has led to my issue.
My code to generate a Pandas dataframe with data similar to what I'm working with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import seed
from random import randint
data = pd.DataFrame(columns=['Col1'])
i = 0
while i < 49500:
    data.loc[len(data.index)] = [0]
    i += 1
seed(1)
j = 0
while j < 500:
    data.loc[len(data.index)] = [randint(1,500)]
    j += 1
My code to plot the histogram:
plt.figure(2)
fig2, ax2 = plt.subplots()
ax2.hist(data['Col1'], range=(0.0, 500.0), bins=50, label='50000 numbers\n in 10 unit bins', density=True)
plt.title('Probability Density of Some Numbers from 0 to 500', wrap=True)
plt.legend(loc='upper right')
plt.yscale('log')
plt.xticks()
plt.minorticks_on()
plt.ylabel('Probability')
plt.xlabel('Number')
plt.savefig('randnum.png')
My histogram (note the 0-10 bin, while composing roughly 99% of the data, is only at a probability of 0.1):
I do realize that by making the y-axis probability not inversely proportional to bin size, the integral of the histogram no longer equals 1 (it will equal 10 in my case), but this is precisely what I am seeking.
Is there a way to either 1) change the value the histogram is normalized to or 2) directly multiply y-values of the histogram by a value of my choosing?
I was able to accomplish this in pyplot with help from @JohanC's reference to Seaborn. The terminology I was looking for is 'probability mass' (the histogram bar heights sum to 1). Using the weights approach from the linked answer, I was able to properly plot my histogram. Below is my code and my new histogram:
plt.figure(2)
fig2, ax2 = plt.subplots()
weights = np.ones_like(data['Col1']) / len(data['Col1'])
ax2.hist(data['Col1'], range=(0.0, 500.0), weights=weights, bins=50, label='50000 numbers\n in 10 unit bins')
plt.title('Probability Density of Some Numbers from 0 to 500', wrap=True)
plt.legend(loc='upper right')
plt.yscale('log')
plt.xticks()
plt.minorticks_on()
plt.ylabel('Probability')
plt.xlabel('Number')
plt.savefig('randnum.png')
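As a quick numerical check of the weights approach above, here is a minimal sketch using np.histogram, which accepts the same bins, range, weights and density arguments. The data is a synthetic stand-in (an assumption: it is generated with NumPy rather than with the pandas loop from the question), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the question's data:
# 49500 zeros plus 500 random integers in 1..500 (inclusive)
values = np.concatenate([np.zeros(49500), rng.integers(1, 501, size=500)])

# Each sample contributes 1/N, so bar heights are probability masses
weights = np.ones_like(values) / len(values)
mass, edges = np.histogram(values, bins=50, range=(0.0, 500.0), weights=weights)

# density=True additionally divides by the bin width
density, _ = np.histogram(values, bins=50, range=(0.0, 500.0), density=True)

bin_width = edges[1] - edges[0]
print("bin width:", bin_width)
print("sum of masses:", mass.sum())
print("mass equals density * width:", np.allclose(mass, density * bin_width))
print("mass of first bin:", mass[0])
```

With weights, the bar heights sum to 1. With density=True, the heights have to be multiplied by the bin width (10 here) to give the same numbers.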
I need some help with a pyplot bar chart that isn't doing what it should, and I cannot figure out why.
So basically what I need to do is draw the power function of a binomial distribution test. First I plot the binomial distribution and mark important values.
from scipy.stats import binom
import numpy as np
import matplotlib.pyplot as plt
n = 20
p = 1/2
x_values = list(range(n + 1))
prob = [binom.pmf(x, n, p) for x in x_values ]
cumult = 0
index_count = 0
for px in prob:
    cumult += px
    print(cumult)
    if cumult > 0.1:
        print(index_count-1)
        break
    else:
        index_count = index_count + 1
plt.bar(x_values,prob)
plt.axvline(x=6, color='red', linestyle='-', label='Grenze')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
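The search loop above can also be written with a cumulative sum. This is a minimal sketch, assuming the same n = 20, p = 1/2 and threshold 0.1 as in the question. It uses math.comb for the pmf so that it does not depend on scipy, and the names cdf, first_over and boundary are illustrative.

```python
from math import comb

import numpy as np

n, p = 20, 0.5
# Binomial pmf for k = 0..n, computed directly from the formula
prob = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])

cdf = np.cumsum(prob)                   # P(X <= k) for k = 0..n
first_over = int(np.argmax(cdf > 0.1))  # first k where the cumulative sum exceeds 0.1
boundary = first_over - 1               # last k that stays at or below 0.1

print(first_over, boundary)
print(cdf[boundary], cdf[first_over])
```

The boundary found this way is the value marked with the red vertical line in the plot above.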
Binomial distribution plot
So far so good. It looks exactly like it should. Now, for the power function, I add up the individual probabilities from prob, and for each cumulative value I calculate the probability of failing the test. The graph for this should look something like this, for example:
Example Graph
(of course, as a bar chart in my case)
Yet, my code
p_values = []
err_p = []
cumul = 0
for p in prob:
    cumul = cumul + p
    p_values.append(cumul)
    err_p.append(1-cumul)
x_pos = np.arange(len(p_values))
plt.bar(p_values, err_p)
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Produces this weird bar chart
which has values in the negatives and over 1 on the x-axis even though there are no such values in the data. I know that it worked once, before I marked the values in this chart as well, but I haven't been able to reproduce it. I always get the one with non-existent values. I also don't know whether it has to do with the weirdly wide bars: in the first graph they look normal, but here they flow into each other.
For your task, you don't want to use a bar plot but a step plot:
plt.step(x=p_values, y=err_p, where="mid", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:
Bars usually have a constant width (0.8 by default in plt.bar), hence they leak into x-values that are not actually in your dataset. You could manually calculate the necessary width of each bar, but thankfully matplotlib provides the step function for this task.
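To see where the "non-existent" x-values come from, here is a small sketch that computes the horizontal extent of each bar without plotting anything. It assumes the default plt.bar settings (width 0.8, bars centered on x) and recomputes the question's p_values with math.comb instead of scipy; the names left and right are illustrative.

```python
from math import comb

import numpy as np

n, p0 = 20, 0.5
prob = np.array([comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)])
p_values = np.cumsum(prob)     # x positions used in the question, all within (0, 1]

width = 0.8                    # default bar width in matplotlib
left = p_values - width / 2    # left edge of each centered bar
right = p_values + width / 2   # right edge of each centered bar

print("x data range:", p_values.min(), p_values.max())
print("bar extent:  ", left.min(), right.max())
```

The data itself stays between 0 and 1, but the bars drawn around it reach from about -0.4 to 1.4 and overlap heavily, which matches the chart described in the question.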
If you wanted a filled plot like a histogram, you could use fill_between:
plt.fill_between(x=p_values, y1=err_p, step="mid", color="lightblue", label="err")
plt.axvline(x=0.5, color='red', linestyle='-', label='p0')
plt.axhline(y=0.1, color='green',linestyle='--',label='Signifikanzniveau')
plt.legend()
plt.show()
Sample output:
I have a question about seaborn kdeplot. In histplot one can choose which stat to show (count, frequency, density, probability), and when it is used with the kde argument, the choice also applies to the KDE curve. However, I have not found a way to change it directly in kdeplot, for the case where I want just the KDE curve scaled as probabilities. Alternatively, the same result could come from histplot if the bars could be switched off, but I have not found how to do that either. So how can one do that?
To give some visual example, I would like to have just the red curve, ie. either pass an argument to kdeplot to use probabilities, or to remove the bars from histplot:
import matplotlib.pyplot as plt
import seaborn as sns
penguins = sns.load_dataset("penguins")
sns.histplot(data=penguins, x="flipper_length_mm", kde=True, stat="probability", color="r", label="probabilities")
sns.kdeplot(data=penguins, x="flipper_length_mm", color="k", label="kde density")
plt.legend()
Thanks a lot.
The y-axis of a histplot with stat="probability" corresponds to the probability that a value belongs to a certain bar. The value of 0.23 for the highest bar means that there is a probability of about 23% that a flipper length is between 189.7 and 195.6 mm (the edges of that specific bin). Note that by default, 10 bins are spread out between the minimum and maximum value encountered.
The y-axis of a kdeplot is similar to a probability density function. The height of the curve is proportional to the approximate probability of a value being within a bin of width 1 of the corresponding x-value. A value of 0.031 for x=191 means there is a probability of about 3.1 % that the length is between 190.5 and 191.5.
Now, to directly get probability values from a kdeplot, a bin width needs to be chosen first. The y-values can then be multiplied by that bin width to give the approximate probability of an x-value falling within a bin of that width. The PercentFormatter provides a way to display such a correspondence without changing the data, using ax.yaxis.set_major_formatter(PercentFormatter(1/binwidth)).
The code below illustrates an example with a binwidth of 5 mm, and how a histplot can match a kdeplot.
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
fig, ax1 = plt.subplots()
penguins = sns.load_dataset("penguins")
binwidth = 5
sns.histplot(data=penguins, x="flipper_length_mm", kde=True, stat="probability", color="r", label="Probabilities",
             binwidth=binwidth, ax=ax1)
ax2 = ax1.twinx()
sns.kdeplot(data=penguins, x="flipper_length_mm", color="k", label="kde density", ls=':', lw=5, ax=ax2)
ax2.set_ylim(0, ax1.get_ylim()[1] / binwidth) # similar limits on the y-axis to align the plots
ax2.yaxis.set_major_formatter(PercentFormatter(1 / binwidth)) # show axis such that 1/binwidth corresponds to 100%
ax2.set_ylabel(f'Probability for a bin width of {binwidth}')
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()
PS: To only show the kdeplot with a probability, the code could be:
binwidth = 5
ax = sns.kdeplot(data=penguins, x="flipper_length_mm")
ax.yaxis.set_major_formatter(PercentFormatter(1 / binwidth)) # show axis such that 1/binwidth corresponds to 100%
ax.set_ylabel(f'Probability for a bin width of {binwidth}')
Another option could be to draw a histplot with kde=True, and remove the generated bars. To be interpretable, a binwidth should be set. With binwidth=1 you'd get the same y-axis as a density plot. (kde_kws={'cut': 3} lets the kde smoothly go to about zero; by default the kde curve is cut off at the minimum and maximum of the data.)
ax = sns.histplot(data=penguins, x="flipper_length_mm", binwidth=1, kde=True, stat='probability', kde_kws={'cut': 3})
ax.containers[0].remove() # remove the bars
ax.relim() # the axis limits need to be recalculated without the bars
ax.autoscale_view()
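The relation used throughout this answer (density times bin width is approximately the probability of landing in that bin) can be checked on a distribution with a known formula. The sketch below is only an illustration under stated assumptions: it uses a normal distribution with hypothetical parameters mu = 200 and sigma = 14 instead of the actual KDE of the penguins data, and the helper names pdf and cdf are made up for this example.

```python
from math import erf, exp, pi, sqrt

mu, sigma = 200.0, 14.0  # hypothetical parameters, not fitted to the penguins data

def pdf(x):
    """Normal probability density at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def cdf(x):
    """Normal cumulative distribution function at x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

x = 191.0
for binwidth in (1, 5):
    approx = pdf(x) * binwidth                               # density times bin width
    exact = cdf(x + binwidth / 2) - cdf(x - binwidth / 2)    # true probability of the bin
    print(f"binwidth={binwidth}: density*width={approx:.4f}, exact={exact:.4f}")
```

The approximation is best for narrow bins, because it treats the density as constant across the bin.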
Sometimes when I create a histogram, using say seaborn's distplot function with norm_hist=True, the y-axis stays below 1, as I would expect for a PDF. Other times it takes on values greater than one.
For example if I run
import numpy as np
import seaborn as sns
sns.set()
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can be quite large even though their areas sum to one. The height of a bar times its width is the probability that a value would fall in that range. To have the height equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because of the bin width of 1.
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
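A small sketch of the Pareto case mentioned above, using the standard closed-form density and distribution function. The answer only gives α = 3; the scale x_m = 1 is an assumption made here for illustration, and the function names are made up for this example.

```python
alpha = 3.0
x_m = 1.0  # assumed scale parameter (minimum possible value)

def pareto_pdf(x):
    """Pareto density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m."""
    return alpha * x_m**alpha / x ** (alpha + 1) if x >= x_m else 0.0

def pareto_cdf(x):
    """Pareto distribution function: 1 - (x_m / x)**alpha for x >= x_m."""
    return 1.0 - (x_m / x) ** alpha if x >= x_m else 0.0

density_at_min = pareto_pdf(1.0)
prob_small_interval = pareto_cdf(1.1) - pareto_cdf(1.0)

print("density at x = 1:", density_at_min)
print("P(1 <= X <= 1.1):", prob_small_interval)
```

The density at x = 1 is larger than 1, yet the probability of any interval stays between 0 and 1, because probability is the area under the curve and not its height.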
This is my first post, so please bear with me
Here is the code
import numpy as np
import matplotlib.pyplot as plt
sample_size = 100
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
ax1.hist(sample, bins=100)
ax1.set_title('n={}'.format(sample_size))
print(len(np.unique(sample)))  # outputs 100 as expected
My doubt is: if I pass bins=100 and the number of samples is also 100, why doesn't it plot a bar for every single sample, and why does the output plot contain frequencies greater than 1?
With default parameters, all bins get the same width. 100 bins means the width of each bin is 1/100th of the total width. The total width spans from the smallest to the largest sample.
Due to this choice of boundaries, at least one point will end up in the first bin and one in the last bin, but most will end up in the central bins, and many of the outermost bins stay empty.
Having all bins the same width is often desired. A histogram is meant to show in which regions there are more samples and where there are fewer, and whether there is just one peak or multiple peaks. Generally, to convey interesting information about the data, the number of bins should be much smaller than the number of samples.
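The points above can be checked numerically with np.histogram, which uses the same default binning rule. This is a minimal sketch; the seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

counts, edges = np.histogram(sample, bins=100)
widths = np.diff(edges)

print("all bins equally wide:", np.allclose(widths, widths[0]))
print("first edge is the minimum:", edges[0] == sample.min())
print("last edge is the maximum:", edges[-1] == sample.max())
print("total count:", counts.sum())
print("empty bins:", (counts == 0).sum())
print("largest count in one bin:", counts.max())
```

The counts always add up to the number of samples, so every empty bin has to be compensated by another bin holding more than one sample.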
Here is a plot to illustrate what's happening. As 100 bins create a very crowded plot, the example uses just 20 samples and 20 bins. With so few samples, they will be spread out a bit more than with more samples. hist returns 3 arrays: one with the contents of each bin, one with the boundaries between the bins (this is one more than the number of bins) and one with the graphical objects (rectangular patches). The boundaries can be used to show their position.
import matplotlib.pyplot as plt
import numpy as np
N = 20
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=N)
bin_values, bin_bounds, _ = ax1.hist(sample, bins=N, label='Histogram')
ax1.set_title(f'{len(np.unique(sample))} samples')
ax1.plot(np.repeat(bin_bounds, 3), np.tile([0, -1, np.nan], len(bin_bounds)), label='Bin boundaries' )
ax1.scatter(sample, np.full_like(sample, -0.5), facecolor='none', edgecolor='crimson', label='Sample values')
ax1.axhline(0, color='black')
plt.legend()
plt.show()
Here is what 100 samples and 100 bins look like:
I am having trouble using the pyplot.hist function to plot 2 histograms on the same figure. For each binning interval, I want the 2 bars to be centered between the bins (Python 3.6 user). To illustrate, here is an example:
import numpy as np
from matplotlib import pyplot as plt
bin_width=1
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=np.arange(0,np.round(max(A.max(),B.max())/bin_width)*bin_width+2*bin_width,bin_width)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(A,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='A')
ax.hist(B,bins,color='Orange',alpha=0.8,rwidth=0.4,align='mid',label='B')
ax.legend()
ax.set_ylabel('Count')
I get this:
Histogram_1
The A and B series are overlapping, which is not good. Knowing there are only 3 options for 'align' (centered on the left bin edge, in the middle between 2 edges, centered on the right bin edge), I see no other option than modifying the bins, by adding:
bins-=0.25*bin_width
Before plotting A, and adding:
bins+=0.5*bin_width
Before plotting B. That gives me: Histogram
That's better! However, I had to modify the binning, so it is not the same for A and B.
I searched for a simple way to use the same bins, and then shift the 1st and 2nd plot so they are correctly displayed in the binning intervals, but I didn't find it. Any advice?
I hope I explained my problem clearly.
As mentioned in the comment above, you do not need the hist plotting function. Use numpy's histogram function and plot its results with matplotlib's bar function.
From the number of bins and the number of data series you can calculate the bar width. The ticks can be adjusted with the xticks method:
import numpy as np
import matplotlib.pyplot as plt
A=10*np.random.random(100)
B=10*np.random.random(100)
bins=20
# calculate heights and bins for both lists
ahist, abins = np.histogram(A, bins)
bhist, bbins = np.histogram(B, abins)
fig = plt.figure()
ax = fig.add_subplot(111)
# calc bin width for two lists
w = (bbins[1] - bbins[0])/3.
# plot bars
ax.bar(abins[:-1]-w/2.,ahist,width=w,color='r')
ax.bar(bbins[:-1]+w/2.,bhist,width=w,color='orange')
# adjust xticks
plt.xticks(abins[:-1], np.arange(bins))
plt.show()
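As a variation on the approach above (not the answer's exact code), the bar positions can be computed from the bin centers, so that each group of bars sits in the middle of its bin and the same bin edges are used for every series. This is a sketch under assumptions: the bin edges 0 to 10 in steps of 1 and the names centers, positions and w are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = 10 * rng.random(100)
B = 10 * rng.random(100)

edges = np.arange(0, 11, 1.0)  # shared bin edges 0, 1, ..., 10
counts = [np.histogram(series, edges)[0] for series in (A, B)]

centers = (edges[:-1] + edges[1:]) / 2
bin_width = edges[1] - edges[0]
k = len(counts)                # number of data series
w = bin_width / (k + 1)        # bar width, leaving a gap between neighbouring groups

# x position of series i: shifted symmetrically around each bin center
positions = [centers + (i - (k - 1) / 2) * w for i in range(k)]

# With matplotlib, series i could then be drawn with:
#   ax.bar(positions[i], counts[i], width=w)
print(positions[0][:3], positions[1][:3])
```

Alternatively, passing both series in one call, as in ax.hist([A, B], bins), draws the bars of the two datasets side by side within each bin.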