Matplotlib: Probability Mass Graph - python

Why does the total probability exceed 1?
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows = 1, ncols = 1)
axes.hist(x = [0.1, 0.2, 0.3, 0.4], density = True)
figure.show()
Expected y-values: [0.25, 0.25, 0.25, 0.25]

The following is my understanding based on the documentation. I don't claim to be an expert in matplotlib, nor am I one of its authors. Your question made me think, so I read the documentation and took some logical steps to understand it. So this is not an expert opinion.
Since you have not passed bins information, matplotlib went ahead and created its own bins. In this case the bins are as below.
bins = [0.1 , 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.4 ]
You can see the bin width is 0.03.
Now according to the documentation.
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
To make the area sum to 1, it normalizes the counts so that when you multiply the normalized count in each bin by the bin width, the sum of those products becomes 1.
Your counts for X = [0.1, 0.2, 0.3, 0.4] are as below (one entry per bin, 10 bins in total):
OriginalCounts = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
As you can see, if you multiply the OriginalCounts array by the bin width and sum the results, you get 4 * 0.03 = 0.12, which is less than one.
So according to the documentation we need to divide the OriginalCounts array by a factor, which is (number of observations * bin width).
In this case the number of observations is 4 and the bin width is 0.03, so the factor is 4 * 0.03 = 0.12. Dividing each element of OriginalCounts by 0.12 gives the normalized histogram values array.
That means that the revised counts are as below
NormalizedCounts = [8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333]
Please note that now, if you sum the normalized counts multiplied by the bin width, the result is 1. You can quickly check this: 8.333333 * 4 * 0.03 = 0.9999999..., which is very close to 1.
These normalized counts are what is finally shown in the graph. This is the reason why the height of the bars in the histogram is close to 8.33 at four positions.
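If it helps, here is a quick check of those numbers with np.histogram (which applies the same density normalization as plt.hist):
import numpy as np

x = [0.1, 0.2, 0.3, 0.4]

# Raw counts with the default of 10 equal-width bins
counts, edges = np.histogram(x, bins=10)
print(counts)          # [1 0 0 1 0 0 1 0 0 1]
print(np.diff(edges))  # every bin is 0.03 wide

# Density: counts / (number of observations * bin width)
density, _ = np.histogram(x, bins=10, density=True)
print(density)         # approximately [8.33 0 0 8.33 0 0 8.33 0 0 8.33]

# The area under the histogram sums to 1
print(np.sum(density * np.diff(edges)))  # ~1.0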
Hope this helps.

Related

Normalize vector such that sum equals 1, while satisfying a lower bound

Given a lower bound of 0.025, I want a vector consisting of weights that sum up to 1 and satisfy this lower bound. I start from a vector of arbitrary length with values ranging from 0.025 (the lower bound) to 1.
For example,
[0.025, 0.8, 0.7]
Then a normalization where you divide by the sum of the numbers gives you roughly the following:
[0.016, 0.524, 0.459]
Now this does not satisfy the lower bound. Any ideas on how I can get this to work?
If you want the weights (values in an array) to sum up to 1, you can divide each value by the sum of all values (i.e. normalize by the sum). This procedure keeps the relative sizes of each pair of values: if one item is 5 times as large as another before the step, it will still be 5 times as large afterwards.
On the other hand you want all values to be larger than 0.025. Imagine one item is 50 times larger than another before normalization; if the smaller one has to end up at 0.025, the larger one would need to be 1.25, which already exceeds what the whole sum is allowed to be.
You can figure out that you cannot (given any array) just scale all values equally so that they sum up to 1 AND the smallest value is 0.025.
Then the question is what relation between the values do you want to keep in the procedure?
As a side note, you cannot have more than 40 items that are all bigger than 0.025 and still sum up to 1 (40 * 0.025 = 1). So "arbitrary length" cannot work either.
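A minimal numeric sketch of the points above, using the example vector from the question (plain sum-normalization only):
import numpy as np

lower_bound = 0.025
v = np.array([0.025, 0.8, 0.7])

# Plain normalization keeps the ratios, but the smallest weight drops below the bound
w = v / v.sum()
print(w)                       # approximately [0.016 0.525 0.459]
print(w.min() >= lower_bound)  # False

# More than 1 / 0.025 = 40 items cannot all stay >= 0.025 and still sum to 1
print(1 / lower_bound)         # 40.0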
Add the lower bound to the dividend and divisor:
I used numpy for readability:
import numpy as np
v = np.array([0.025, 0.8, 0.7])
v2 = (v + min(v)) / sum(v + min(v))  # shift every value up by the smallest value (here equal to the lower bound), then normalize
Output:
>>> v2
array([0.03125 , 0.515625, 0.453125])
>>> sum(v2)
1.0

Points outside histogram range are discarded from plot

I am plotting a histogram of values and I want all histograms to have the same range of values for the bins, so plots can be compared. To do so, I specify a vector x with the values and range of each bin.
data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
x = np.linspace(0, 0.2, 9)
plt.hist(data, x)
What I notice is that if I specify the range of x to be between 0 and 0.2, then values larger than 0.2 (0.32 in the example) are discarded from the plot.
Is there a way of accumulating all values greater than 0.2 in the last bin and all values lower than 0.0 in the first bin?
Of course I can do something like
data[data>0.2] = 0.2
data[data<0.0] = 0.0
But I'd prefer not to modify my original array and not have to make a copy of it unless there isn't another way.
You can pass the bins argument as an array with bin edges wherever you want; it does not have to be linearly spaced. This will make the bars different widths, though. For your particular case, you can use the .clip method of the data array.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
x = np.linspace(0, 0.2, 9)
plt.hist(data.clip(min=0, max=0.2), x)
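Note that ndarray.clip returns a new array, so the original data is left untouched. A quick check:
import numpy as np

data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
clipped = data.clip(min=0, max=0.2)  # returns a copy
print(clipped)  # [0.1  0.1  0.2  0.2  0.2  0.2 ]
print(data)     # original unchanged: [0.1  0.1  0.2  0.2  0.2  0.32]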

Merging similar columns in NumPy, probability vector

I have a numpy array of the following shape:
[image of the input array, e.g. [[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]] as used in the answer below]
It is a probability vector, where the second row corresponds to a value and the first row to the probability that this value is realized. (e.g. the probability of getting 1.0 is 20%)
When two values are close to each other, I want to merge their columns by adding up the probabilities. In this example I want to have:
[image of the merged array: [[0.7, 0.3], [0.0, 1.0]]]
My current solution involves 3 loops and is really slow for larger arrays. Does someone know an efficient way to program this in NumPy?
While it won't do exactly what you want, you could try to use np.histogram to tackle the problem.
For example, say you just want two "bins" like in your example; you could do
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
hist, bin_edges = np.histogram(x[1, :], bins=[0, 1.0, 1.5], weights=x[0, :])
and then stack your histogram with the leading bin edges to get your output
print(np.stack([hist, bin_edges[:-1]]))
This will print
[[0.7 0.3]
[0. 1. ]]
You can use the bins parameter to get your desired output. I hope this helps.
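If you need the second row of the result to keep the actual values instead of bin edges, a rough alternative sketch (assuming "close" here means equal after rounding to one decimal place) is to group the columns with np.unique and sum the probabilities with np.bincount:
import numpy as np

x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])

# Columns whose values agree after rounding are merged; their probabilities are summed
rounded = np.round(x[1, :], 1)
values, inverse = np.unique(rounded, return_inverse=True)
probs = np.bincount(inverse, weights=x[0, :])
print(np.stack([probs, values]))
# [[0.7 0.3]
#  [0.  1. ]]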

Difference between "counts" and "number of observations" in matplotlib histogram

The matplotlib.pyplot.hist() documentation describes the parameter "density" (its deprecated name was "normed") as:
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
By "the first element of the return tuple" it means the y-axis values. It says it manages to get the area under the histogram to be 1 by dividing the count by the number of observations times the bin width.
What is the difference between count and number of observations? In my head they are the same thing: the number of instances (or counts, or observations) whose value falls into a certain bin. However, that would mean the transformed count for each bin is just one over the bin width (since count / (count * bin_width) = 1 / bin_width), which does not make any sense.
Could someone clarify this for me? Thank you for your help and sorry for the probably stupid question.
I think the wording in the documentation is a bit confusing. The count is the number of entries in a given bin (the height of the bin) and the number of observations is the total number of events that go into the histogram.
The documentation makes the distinction about how they normalized because there are generally two ways to do the normalization:
count / number of observations - in this case if you add up all the entries of the output array you would get 1.
count / (number of observations * bin width) - in this case the integral of the output array is 1 so it is a true probability density. This is what matplotlib does, and they just want to be clear in this distinction.
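A tiny numeric sketch of the two options above (a made-up example with 10 observations and bin width 0.5):
import numpy as np

observations = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 1.4])
counts, edges = np.histogram(observations, bins=[0, 0.5, 1.0, 1.5])
width = np.diff(edges)  # 0.5 for every bin

per_bin_fraction = counts / len(observations)    # option 1: the values sum to 1
density = counts / (len(observations) * width)   # option 2: the area sums to 1
print(per_bin_fraction, per_bin_fraction.sum())  # [0.3 0.4 0.3], sums to 1 (up to float rounding)
print(density, (density * width).sum())          # [0.6 0.8 0.6], area is 1 (up to float rounding)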
The count of all observations is the number of observations. But with a histogram you're interested in the counts per bin. So for each bin you divide the count of that bin by the total number of observations times the bin width.
import numpy as np
observations = [1.2, 1.5, 1.7, 1.9, 2.2, 2.3, 3.6, 4.1, 4.2, 4.4]
bin_edges = [0,1,2,3,4,5]
counts, edges = np.histogram(observations, bins=bin_edges)
print(counts) # prints [0 4 2 1 3]
density, edges = np.histogram(observations, bins=bin_edges, density=True)
print(density) # prints [0. 0.4 0.2 0.1 0.3]
# calculate density manually according to formula
man_density = counts/(len(observations)*np.diff(edges))
print(man_density) # prints [0. 0.4 0.2 0.1 0.3]
# Check that density == manually calculated density
assert(np.all(man_density == density))

How are the "error bands" in Seaborn tsplot calculated?

I'm trying to understand how the error bands are calculated in the tsplot. Examples of the error bands are shown here.
When I plot something simple like
sns.tsplot(np.array([[0,1,0,1,0,1,0,1], [1,0,1,0,1,0,1,0], [.5,.5,.5,.5,.5,.5,.5,.5]]))
I get a vertical line at y=0.5 as expected. The top error band is also a vertical line at around y=0.665 and the bottom error band is a vertical line at around y=0.335. Can someone explain how these are derived?
EDIT: The question and this answer refer to old versions of Seaborn and are not relevant for new versions. See CGFoX's comment below.
I'm not a statistician, but I read through the seaborn code in order to see exactly what's happening. There are three steps:
Bootstrap resampling. Seaborn creates resampled versions of your data. Each of these is a 3x8 matrix like yours, but each row is randomly selected from the three rows of your input. For example, one might be:
[[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]]
and another might be:
[[ 1. 0. 1. 0. 1. 0. 1. 0. ]
[ 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
[ 0. 1. 0. 1. 0. 1. 0. 1. ]]
It creates n_boot of these (10000 by default).
Central tendency estimation. Seaborn runs a function on each of the columns of each of the 10000 resampled versions of your data. Because you didn't specify this argument (estimator), it feeds the columns to a mean function (numpy.mean with axis=0). Lots of your columns in your bootstrap iterations are going to have a mean of 0.5, because they will be things like [0, 0.5, 1], [0.5, 1, 0], [0.5, 0.5, 0.5], etc. but you will also have some [1,1,0] and even some [1,1,1] which will result in higher means.
Confidence interval determination. For each column, seaborn sorts the 10000 estimates of the mean (one from each resampled version of the data) from smallest to greatest, and picks the ones that mark the lower and upper ends of the CI. By default it uses a 68% CI, so if you line up all 10000 mean estimates, it will pick the 1600th and the 8400th (8400 - 1600 = 6800, or 68% of 10000).
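A rough NumPy sketch of those three steps (simplified, not seaborn's exact implementation; the interval is taken with np.percentile):
import numpy as np

data = np.array([[0, 1, 0, 1, 0, 1, 0, 1],
                 [1, 0, 1, 0, 1, 0, 1, 0],
                 [.5, .5, .5, .5, .5, .5, .5, .5]])

rng = np.random.default_rng(0)
n_boot, ci = 10000, 68

# 1. Bootstrap resampling: draw rows (units) with replacement
idx = rng.integers(0, data.shape[0], size=(n_boot, data.shape[0]))
resamples = data[idx]                # shape (n_boot, 3, 8)

# 2. Central tendency: mean of each column in each resampled data set
boot_means = resamples.mean(axis=1)  # shape (n_boot, 8)

# 3. Confidence interval: central 68% of the sorted means, per column
low, high = np.percentile(boot_means, [(100 - ci) / 2, 100 - (100 - ci) / 2], axis=0)
print(low)   # roughly 0.33 for every column
print(high)  # roughly 0.67 for every column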
A couple of notes:
There are actually only 3^3, or 27, possible resampled versions of your array, and if you use a function such as mean, where the order of the rows doesn't matter, there are only 10 distinct ones. So all 10000 bootstrap iterations will be identical to one of those 27 versions (or 10 versions in the unordered sense). This means that it's probably silly to do 10000 iterations in this case.
The means 0.3333... and 0.6666... that show up as your confidence intervals are the means for [1,1,0] and [1,0,0] or rearranged versions of those.
They show a bootstrap confidence interval, computed by resampling units (rows in the 2d array input form). By default it shows a 68 percent confidence interval, which is equivalent to a standard error, but this can be changed with the ci parameter.
