Bar heights and widths in histogram plot of several data - python

I'm trying to plot a simple histogram with several data sets side by side.
My data are a set of 2D ndarrays, all with the same dimensions (256 x 256 in this example).
I use this function to plot the data set:
import matplotlib.pyplot as plt

def plot_data_histograms(data, bins, color, label, file_path):
    """
    Plot multiple data histograms in parallel
    :param data : a set of data to be plotted
    :param bins : the number of bins to be used
    :param color : the color of each data in the set
    :param label : the label of each color in the set
    :param file_path : the path where the output will be saved
    """
    plt.figure()
    plt.hist(data, bins, normed=1, color=color, label=label, alpha=0.75)
    plt.legend(loc='upper right')
    plt.savefig(file_path + '.png')
    plt.close()
And I'm passing my data as follows:
data = [sobel.flatten(), prewitt.flatten(), roberts.flatten(), scharr.flatten()]
labels = ['Sobel', 'Prewitt', 'Roberts Cross', 'Scharr']
colors = ['green', 'blue', 'yellow', 'red']
plot_data_histograms(data, 5, colors, labels, '../Visualizations/StatisticalMeasures/RMSEHistograms')
And I got this histogram:
I know this may be a silly question, but I don't understand why my yticks range from 0 to 4.5. I know it is due to the normed parameter, but even after reading this:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin). In a
probability density, the integral of the histogram should be 1; you
can verify that with a trapezoidal integration of the probability
density function.
I didn't really get how it works.
Also, since I set the number of bins to five and the histogram has exactly 5 xticks (excluding the borders), I don't understand why some bars sit in the middle of a tick, like the yellow one over the 0.6 tick. Since my number of bins matches the number of xticks, I thought that each set of four bars would be contained inside one interval, as happens with the first four bars, which lie completely inside the [0.0, 0.2] interval.
Thank you in advance.

The reason this is confusing is that you're squeezing four histograms onto one plot. To do this, matplotlib narrows the bars and puts a gap between the groups. In a standard histogram, the total area of all bars is 1 if normed; otherwise the bar heights sum to N, the number of points. Here's a simple example:
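import numpy as np
import matplotlib.pyplot as plt

a = np.random.rand(10)
bins = np.array([0, 0.5, 1.0])  # just two bins
# normed was removed in Matplotlib 3.1; use density=True on current versions
plt.hist(a, bins, normed=True)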
First, note that each bar covers the entire range of its bin: the first bar ranges from 0 to 0.5, and its height is determined by the number of points in that range.
Next, you can see that the total area of the two bars is 1 because normed = True: each bar is 0.5 wide and the heights are 1.2 and 0.8, so 0.5 × 1.2 + 0.5 × 0.8 = 1.
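The formula n/(len(x)*dbin) quoted in the question can be checked by hand with np.histogram, which plt.hist uses for binning. A minimal sketch, using ten hand-picked values (an assumption, chosen so that six fall below 0.5 and four above, which reproduces the heights 1.2 and 0.8):

```python
import numpy as np

# Hand-picked sample (assumption): 6 values in [0, 0.5), 4 values in [0.5, 1.0]
a = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.6, 0.7, 0.8, 0.9])
bins = np.array([0.0, 0.5, 1.0])

counts, edges = np.histogram(a, bins=bins)
widths = np.diff(edges)

# Density by hand: counts / (number of points * bin width)
density = counts / (counts.sum() * widths)

# The same values computed by NumPy itself
density_np, _ = np.histogram(a, bins=bins, density=True)

# Area of the bars = sum of height * width
area = np.sum(density * widths)
print(density, density_np, area)
```

The bar heights are larger than 1 here only because the bins are narrower than 1; the area is what is normalized.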
Let's plot the same thing again with another distribution so you can see the effect:
b = np.random.rand(10)
plt.hist([a, b], bins, normed=True)
Recall that the blue bars represent exactly the same data as in the first plot, but they're less than half the width now because they must make room for the green bars. You can see that now two bars plus some whitespace covers the range of each bin. So we must pretend that the width of each bar is actually the width of all bars plus the width of the whitespace gap when we are calculating the bin range and bar area.
Finally, notice that the xticks do not line up with the bin edges anywhere. If you wish, you can align them manually with:
plt.xticks(bins)
If you didn't create bins manually first, you can get the bin edges from the return value of plt.hist:
counts, bins, bars = plt.hist(...)
plt.xticks(bins)
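To see why the ticks and the bins disagree in the question's plot, it can help to print the bin edges. A minimal sketch, assuming two made-up data sets in place of the asker's arrays; as far as I know, with an integer bin count matplotlib spans the bins over the overall minimum and maximum of all data sets, which np.histogram_bin_edges on the concatenated data mimics:

```python
import numpy as np

# Made-up stand-ins for the flattened arrays in the question (assumption)
rng = np.random.default_rng(0)
sobel = rng.uniform(0.0, 0.9, size=1000)
prewitt = rng.uniform(0.1, 0.7, size=1000)

# Five equal-width bins between the overall minimum and maximum
combined = np.concatenate([sobel, prewitt])
edges = np.histogram_bin_edges(combined, bins=5)
print(edges)
```

Unless the data happen to span exactly 0 to 1, these edges do not fall on 0.2, 0.4, and so on, so a group of bars can straddle a default tick.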

Related

How can histogram bins be separated and how can the number of labels printed on x-axis be reduced?

I have a series of Counter() objects whose graphs I want to plot. The issue is that around 120 to 150 bars are plotted. I've tried playing with the width, but it didn't scale at all and all the x-axis labels were mashed together. I therefore decided to plot only 50 bars, but all the x-axis labels are still being printed. I've tried things like labels[:50], but it doesn't work, and I tried implementing the solution from another post to space out the bars, to no avail (matplotlib bar chart: space out bars).
Could someone tell me:
If I want to plot all 120 to 150 bars, how can the graph be properly scaled, if possible?
How can I print x-axis labels for only 50 bars instead of all the labels in my data set?
How can I space out the bars when dealing with Counter()?
The code for my plot function and the screenshots of the graphs are:
def plot_count(mycount):
    labels, values = zip(*mycount.items())
    indexes = np.arange(len(labels))
    print(labels)
    width = 1
    plt.bar(indexes[:50], values[:50], width)
    plt.xticks(indexes + width * 0.5, labels, rotation=45)
    plt.show()

How to plot for frequency only?

Question
How can I plot the following scenario, as shown in the attached image? The purpose is to visualise frequency allocation in a network.
Scenario
I have a list of tuples of frequency values like the one below, where the 1st value is the centre frequency, the 2nd is the total width, and the 3rd is the guard band:
frequencies = [('195.71250000', '59.00000000', '2.50000000'), ('195.78750000', '59.00000000', '2.50000000'), ('195.86250000', '59.00000000', '2.50000000')]
and the range of these values is:
range = [('191.32500000', '196.12500000')]
Note: These are dummy values; the actual data is much larger but follows the same general structure.
There are several ways to create this plot. One way is to use ax.vlines to plot the dashed lines for the frequencies and to use ax.bar for the rectangles representing the frequency ranges.
Here is an example where the frequencies are occupied at regular intervals within the range you have given (boundaries included) but with widths of randomly varying size. No guards are computed seeing as they should be automatically apparent thanks to the position of the frequencies and the widths, as far as I understand.
Also, the widths are much smaller compared to the sample data you have provided, else the bars will be very wide and will all overlap with one another, which would look very different from the image you have shared.
import numpy as np                 # v 1.19.2
import matplotlib.pyplot as plt    # v 3.3.2

# Create sample dataset
rng = np.random.default_rng(seed=1)  # random number generator
frequencies = np.arange(191.325, 196.125, step=0.3)
widths = rng.uniform(0.05, 0.25, size=frequencies.size)

# Create figure with single Axes and loop through frequencies and widths to plot
# vertical dashed lines for the frequencies and bars for the widths
fig, ax = plt.subplots(figsize=(10, 3))
for freq, width in zip(frequencies, widths):
    ax.vlines(x=freq, ymin=0, ymax=10, colors='tab:blue', linestyle='--', zorder=1)
    ax.bar(x=freq, height=6, width=width, color='tab:blue', zorder=2)

# Additional formatting
ax.set_xlabel('Frequency (THz)', labelpad=15, size=12)
ax.set_xticks(frequencies[::2])
ax.yaxis.set_visible(False)
for spine in ['top', 'left', 'right']:
    ax.spines[spine].set_visible(False)

plt.show()
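To feed the question's string tuples into a plot like this, they first have to be converted to numbers. A minimal sketch, assuming every tuple follows the (centre, width, guard) order described in the question; the names raw, values, centres, widths and guards are illustrative only:

```python
import numpy as np

# Dummy values copied from the question
raw = [('195.71250000', '59.00000000', '2.50000000'),
       ('195.78750000', '59.00000000', '2.50000000'),
       ('195.86250000', '59.00000000', '2.50000000')]

# Convert the string tuples to a float array of shape (n, 3)
values = np.array(raw).astype(float)

# Columns: centre frequency, total width, guard band (order taken from the question)
centres, widths, guards = values.T
print(centres)
```

centres and widths could then replace the generated sample data in the calls to ax.vlines and ax.bar, provided both are expressed in the same unit as the x-axis.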

Normed histogram y-axis larger than 1

Sometimes when I create a histogram, using say seaborn's distplot function with norm_hist = True, the y-axis stays below 1, as expected for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is that my data is concentrated closely around zero, so for the area (under the KDE, for example) to equal 1, the height of the histogram has to be larger than 1. But since probabilities can't be above 1, what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can be quite large even though their areas sum to one. The height of a bar times its width is the probability that a value would fall in that range. For the height to equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot on the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot on the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because of the bin width of 1.
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
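The relation height × width = probability can also be checked numerically without plotting. A minimal sketch with np.histogram, assuming a seeded generator and integer millimeter bin edges from -40 to 40 (both choices are mine, not part of the example above):

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility
a_m = rng.normal(0, 0.01, 100_000)        # values in meters
a_mm = a_m * 1000                         # the same values in millimeters

edges_mm = np.arange(-40, 41, 1)          # bins 1 mm wide
edges_m = edges_mm / 1000                 # the same bins, 0.001 m wide

dens_m, _ = np.histogram(a_m, bins=edges_m, density=True)
dens_mm, _ = np.histogram(a_mm, bins=edges_mm, density=True)

# Probability per bin = density * bin width
prob_m = dens_m * np.diff(edges_m)
prob_mm = dens_mm * np.diff(edges_mm)

print(dens_m.max(), dens_mm.max())        # densities differ by a factor of about 1000
print(prob_m.max(), prob_mm.max())        # probabilities agree
```

To show probability on the y-axis directly, the density has to be multiplied by the bin width as above, or the counts divided by the number of samples.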

Frequencies of values in histogram

This is my first post, so please bear with me.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt

sample_size = 100
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=sample_size)
ax1.hist(sample, bins=100)
ax1.set_title('n={}'.format(sample_size))
print(len(np.unique(sample)))  # outputs 100 as expected
My question is: if I give bins=100 and the number of samples is also 100, why doesn't it plot a bar for every single sample, and why does the plot contain frequencies greater than 1?
With default parameters, all bins get the same width. 100 bins means the width of each bin is 1/100th of the total width. The total width goes from the smallest to the largest sample.
Due to this choice of boundaries, at least one point will end up in the first bin and one in the last bin, but most will end up in the central bins, and many of the outermost bins stay empty.
Having all bins the same width is often desired. A histogram is meant to show in which regions there are more samples and where there are fewer, and whether there is just one peak or several. Generally, to convey interesting information about the data, the number of bins should be much smaller than the number of samples.
Here is a plot to illustrate what's happening. As 100 bins create a very crowded plot, the example uses just 20 samples and 20 bins. With so few samples, they will be spread out a bit more than with more samples. hist returns 3 arrays: one with the contents of each bin, one with the boundaries between the bins (this is one more than the number of bins) and one with the graphical objects (rectangular patches). The boundaries can be used to show their position.
import matplotlib.pyplot as plt
import numpy as np
N = 20
plt.figure()
ax1 = plt.subplot()
sample = np.random.normal(loc=0.0, scale=1.0, size=N)
bin_values, bin_bounds, _ = ax1.hist(sample, bins=N, label='Histogram')
ax1.set_title(f'{len(np.unique(sample))} samples')
ax1.plot(np.repeat(bin_bounds, 3), np.tile([0, -1, np.nan], len(bin_bounds)), label='Bin boundaries' )
ax1.scatter(sample, np.full_like(sample, -0.5), facecolor='none', edgecolor='crimson', label='Sample values')
ax1.axhline(0, color='black')
plt.legend()
plt.show()
Here is what 100 samples and 100 bins look like:
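The 100-sample, 100-bin case can also be inspected numerically with np.histogram, which returns the same counts and boundaries as hist without drawing anything. A minimal sketch, assuming an arbitrarily seeded generator:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
sample = rng.normal(loc=0.0, scale=1.0, size=100)

counts, edges = np.histogram(sample, bins=100)

print(len(edges))            # 101 boundaries for 100 bins
print(counts.sum())          # every sample lands in exactly one bin
print(counts[0], counts[-1]) # the extreme bins hold the minimum and the maximum
print((counts == 0).sum())   # number of empty bins
print(counts.max())          # largest number of samples sharing one bin
```

With as many bins as samples, every bin that holds more than one sample forces another bin to be empty, which is why the plot shows both gaps and bars taller than 1.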

What is y axis in seaborn distplot?

I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, the bin heights don't add up to 1, which means the y axis doesn't show probability but something else. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shows probability, as the bin heights sum to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bins sum to 0.6+0.6=1.2, which is already greater than 1, so the y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small list l with 3 possible values. So the number of bins is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
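For the small list l from the question, this can be verified with np.histogram. A minimal sketch: the density heights are 0.6, 0.3 and 0.6, and multiplying by the bin width of 2/3 recovers the probabilities 0.4, 0.2 and 0.4 that the weights approach gives directly:

```python
import numpy as np

l = [1, 3, 2, 1, 3]

# Density heights, as with norm_hist=True
density, edges = np.histogram(l, bins=3, density=True)
widths = np.diff(edges)

# Area of each bar = probability of falling in that bin
prob_from_density = density * widths

# The weights approach from the question: each sample contributes 1/len(l)
weights = np.ones(len(l)) / len(l)
prob_from_weights, _ = np.histogram(l, bins=3, weights=weights)

print(density)            # heights: 0.6, 0.3, 0.6
print(prob_from_density)  # areas:   0.4, 0.2, 0.4
print(prob_from_weights)  # same probabilities
```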
The x-axis is the value of the variable, just like in a histogram, but what exactly does the y-axis represent?
The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
Source: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
This code will help you make a similar plot:
sns.set_style("whitegrid")
# df_p is a pandas DataFrame with a 'Volume_Tonnes' column (not shown here);
# displot returns a FacetGrid rather than an Axes
ax = sns.displot(data=df_p,
                 x='Volume_Tonnes', kind='kde', fill=True, height=5, aspect=2)
# Here you can define the x limit
ax.set(xlim=(-50, 100))
ax.set(xlabel='Volume Tonnes', ylabel='Probability Density')
ax.fig.suptitle("Volume Tonnes Distribution",
                fontsize=20, fontdict={"weight": "bold"})
plt.show()
