The matplotlib.pyplot.hist() documentation describes the parameter "density" (its deprecated name was "normed") as:
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
By "the first element of the return tuple" it means the y-axis values. It says the area under the histogram is made equal to 1 by dividing the count by the number of observations times the bin width.
What is the difference between count and number of observations? In my head they are the same thing: the number of instances (or counts, or observations) whose value falls into a certain bin. However, this would mean that the transformed count for each bin is just one over the bin width (since count / (count * bin_width) = 1/bin_width), which does not make any sense.
Could someone clarify this for me? Thank you for your help and sorry for the probably stupid question.
I think the wording in the documentation is a bit confusing. The count is the number of entries in a given bin (the height of the bin) and the number of observations is the total number of events that go into the histogram.
The documentation spells out how the normalization is done because there are generally two ways to do it:
count / number of observations - in this case, if you add up all the entries of the output array, you get 1.
count / (number of observations * bin width) - in this case, the integral of the output array is 1, so it is a true probability density. This is what matplotlib does, and they just want to be clear about this distinction.
The count over all bins equals the total number of observations, but with a histogram you're interested in the counts per bin. So for each bin you divide that bin's count by the total number of observations times the bin width.
import numpy as np
observations = [1.2, 1.5, 1.7, 1.9, 2.2, 2.3, 3.6, 4.1, 4.2, 4.4]
bin_edges = [0,1,2,3,4,5]
counts, edges = np.histogram(observations, bins=bin_edges)
print(counts) # prints [0 4 2 1 3]
density, edges = np.histogram(observations, bins=bin_edges, density=True)
print(density) # prints [0. 0.4 0.2 0.1 0.3]
# calculate density manually according to formula
man_density = counts/(len(observations)*np.diff(edges))
print(man_density) # prints [0. 0.4 0.2 0.1 0.3]
# Check that density == manually calculated density
# (np.allclose avoids exact floating-point comparison)
assert np.allclose(man_density, density)
Given a lower bound of 0.025, I want a vector of weights that sum to 1 and all satisfy this lower bound, starting from a vector of arbitrary length with values ranging from 0.025 (the lower bound) to 1.
For example,
[0.025, 0.8, 0.7]
Then a normalization where you divide by the sum of the numbers gives you roughly the following:
[0.016, 0.524, 0.459]
Now this does not satisfy the lower bound, any ideas on how I can get this to work?
If you want the weights (values in an array) to sum to 1, you can divide each value by the sum of all values (i.e. normalize by the sum). This procedure keeps the relative sizes of the values: if the second item was 5 times as large as the fourth item before the step, it still will be afterwards.
On the other hand, you want all values to be at least 0.025. Imagine one item is 50 times larger than another before normalization: if the smallest value has to end up at 0.025, the larger item would have to end up at 1.25, which already exceeds the required total of 1 on its own.
So you cannot, for an arbitrary array, scale all values equally so that they sum up to 1 AND the smallest value is at least 0.025.
Then the question is what relation between the values do you want to keep in the procedure?
As a side note, you cannot have more than 40 items that are all at least 0.025 and still sum to 1 (since 40 * 0.025 = 1). So "arbitrary length" cannot work either.
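To make that concrete, here is a small check (my own sketch, not part of the original answer) for whether a given array can be scaled by a single factor so that it sums to 1 while every weight stays at or above the bound:
import numpy as np

def scalable_to_simplex(v, lower=0.025):
    # True if dividing by the sum keeps every weight >= lower
    v = np.asarray(v, dtype=float)
    return (v / v.sum() >= lower).all()

print(scalable_to_simplex([0.025, 0.8, 0.7]))  # False: 0.025/1.525 < 0.025
print(scalable_to_simplex([0.5, 0.8, 0.7]))    # True: scaled weights are [0.25, 0.4, 0.35]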
Add the lower bound to the dividend and divisor:
I used numpy for readability:
import numpy as np
v = np.array([0.025, 0.8, 0.7])
# Shift every weight up by the smallest weight, then renormalize by the new sum
v2 = (v + min(v)) / sum(v + min(v))
Output:
>>> v2
array([0.03125 , 0.515625, 0.453125])
>>> sum(v2)
1.0
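Note that shifting by min(v) raises the smallest weight relative to the others, but it does not guarantee an arbitrary lower bound for every input, so it is worth verifying the result (a check I am adding, not part of the answer above):
assert (v2 >= 0.025).all()  # holds for this input; re-check for your own data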
The goal is to simulate the actual number of occurrences given theoretical probabilities.
For example, take a biased 6-faced die with the probability of landing on (1,2,3,4,5,6) being (0.1,0.2,0.15,0.25,0.1,0.2).
Roll the die 1000 times and output the simulated number of times each face comes up.
I know numpy.random.choice offers a function to generate each roll, but I need a summary of the number of landings of each face.
What is an optimal Python script for this?
Numpy can be used to do that easily and very efficiently:
import numpy as np

faces = np.arange(0, 6)  # faces numbered 0..5 (add 1 if you want labels 1..6)
faceProbs = [0.1, 0.2, 0.15, 0.25, 0.1, 0.2]  # Define the face probabilities
v = np.random.choice(faces, p=faceProbs, size=1000)  # Roll the die 1000 times
counts = np.bincount(v, minlength=6)  # Count the occurrences of each face
prob = counts / len(v)  # Compute the empirical probability of each face
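As a side note (an alternative I am suggesting, not part of the answer above): if you only need the per-face counts and not the individual rolls, np.random.multinomial produces them directly:
counts = np.random.multinomial(1000, faceProbs)  # array of six counts summing to 1000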
It can be done without NumPy too, using random.choices together with collections.Counter to summarize the rolls:
import random
from collections import Counter
rolls = random.choices([1, 2, 3, 4, 5, 6], weights=[0.1, 0.2, 0.15, 0.25, 0.1, 0.2], k=1000)
print(Counter(rolls))  # number of times each face came up
Why does the total probability exceed 1?
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows = 1, ncols = 1)
axes.hist(x = [0.1, 0.2, 0.3, 0.4], density = True)
figure.show()
Expected y-values: [0.25, 0.25, 0.25, 0.25]
Following is my understanding as per the documentation. I don't claim to be an expert in matplotlib, nor am I one of its authors. Your question made me think, so I read the documentation and took some logical steps to understand it. This is not an expert opinion.
Since you have not passed bins information, matplotlib went ahead and created its own bins. In this case the bins are as below.
bins = [0.1 , 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.4 ]
You can see the bin width is 0.03.
Now according to the documentation.
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
In order to make the area sum to 1, it normalizes the counts so that when you multiply each bin's normalized count by the bin width and add up the products, the result is 1.
Your counts are as below for X = [0.1,0.2,0.3,0.4]
OriginalCounts = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
As you can see, if you multiply the OriginalCounts array by the bin width and sum all the products, you get 4*0.03 = 0.12, which is less than one.
So according to the documentation we need to divide the OriginalCounts array by a factor, which is (number of observations * bin width).
In this case the number of observations is 4 and the bin width is 0.03, so 4*0.03 equals 0.12. Thus you divide each element of OriginalCounts by 0.12 to get the normalized histogram values.
That means that the revised counts are as below
NormalizedCounts = [8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333]
Please note that, now, if you sum the Normalized counts multiplied by bin width, it will be equal to 1. You can quickly check this: 8.333333*4*0.03=0.9999999.. which is very close to 1.
These normalized counts are what is finally shown in the graph. This is why the height of the bars in the histogram is close to 8.33 at four positions.
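You can verify this quickly with numpy.histogram, which applies the same normalization as axes.hist (a small check I am adding for illustration):
import numpy as np

x = [0.1, 0.2, 0.3, 0.4]
density, edges = np.histogram(x, bins=10, density=True)
print(density)                           # four entries of ~8.333 (= 1/0.12), the rest zeros
print(np.sum(density * np.diff(edges)))  # 1.0 -- the area under the histogram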
Hope this helps.
Suppose you have a number that you want to represent a total -- let's say it's 123,456,789.
Now, suppose you want to generate some numbers that add up to that number, but with fuzzy weights.
For instance, suppose I want to generate three numbers. The first should be around 60% of the total, but with some small level of variance. The second should be 30% of the total, again with some variance. And the third would end up being about 10%, depending on the other two.
I tried doing it this way:
import numpy as np

percentages = [0.6, 0.3]
total = 123456789
still_need = total
values = []
for i in range(2):
    x = int(total * (percentages[i] + np.random.normal(scale=0.05)))
    values.append(x)
    still_need = still_need - x
values.append(still_need)
But that doesn't seem very elegant.
Is there a better way?
A clean way to do it would be to draw from a multinomial distribution:
import numpy as np

total = 123456789
percentages = [0.6, 0.3, 0.1]
values = np.random.multinomial(total, percentages)
In this case, the multinomial distribution models rolling a 3-sided die 123456789 times, where the probability of each face turning up is [0.6, 0.3, 0.1]. Calling multinomial() is like running a single trial of this experiment. It returns 3 random integers that sum to 123456789, representing the number of times each face of the die turned up. If you want multiple draws, you can use the size parameter.
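For example (an illustrative call using the documented size argument):
draws = np.random.multinomial(total, percentages, size=5)  # a 5x3 array; each row sums to total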
I would like to pick a number randomly between 1-100 such that the probability of getting numbers 60-100 is higher than 1-59.
I would like to have the probability to be a left-skewed distribution for numbers 1-100. That is to say, it has a long tail and a peak.
Something along the lines:
pers = np.arange(1,101,1)
prob = <left-skewed distribution>
number = np.random.choice(pers, 1, p=prob)
I do not know how to generate a left-skewed discrete probability function. Any ideas? Thanks!
This is the answer you are looking for, using the SciPy function skewnorm. It can make a set of values skewed either to the left or to the right.
from scipy.stats import skewnorm
import matplotlib.pyplot as plt

numValues = 10000
maxValue = 100
skewness = -5  # Negative values are left-skewed, positive values are right-skewed.
random = skewnorm.rvs(a=skewness, loc=maxValue, size=numValues)  # Draw from the skew-normal distribution
random = random - min(random)  # Shift the set so the minimum value is equal to zero.
random = random / max(random)  # Standardize all the values between 0 and 1.
random = random * maxValue  # Multiply the standardized values by the maximum value.

# Plot histogram to check skewness
plt.hist(random, 30, density=True, color='red', alpha=0.1)
plt.show()
Please reference the documentation here:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skewnorm.html
The code generates a histogram of the resulting left-skewed distribution.
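If you specifically need a discrete probability vector to feed into np.random.choice, one approach (my own sketch using the same scipy.stats.skewnorm; the loc and scale values are illustrative choices, not from the answer above) is to evaluate the pdf on the integers 1-100 and normalize:
import numpy as np
from scipy.stats import skewnorm

pers = np.arange(1, 101)
prob = skewnorm.pdf(pers, a=-5, loc=80, scale=20)  # left-skewed shape peaking near 80
prob /= prob.sum()  # normalize so the probabilities sum to exactly 1
number = np.random.choice(pers, 1, p=prob)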
Like you described, just make sure your skewed distribution adds up to 1.0:
import numpy as np

pers = np.arange(1, 101, 1)
# Make each of the last 41 elements (60-100) 5x more likely
prob = np.array([1.0] * (len(pers) - 41) + [5.0] * 41)
# Normalize to 1.0
prob /= np.sum(prob)
number = np.random.choice(pers, 1, p=prob)
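A quick sanity check (my addition) that the values 60-100 really come up about five times as often per element:
samples = np.random.choice(pers, 100000, p=prob)
print((samples >= 60).mean())  # ~205/264 ~ 0.78, since 41 elements carry weight 5 vs 59 with weight 1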
The p argument of np.random.choice is the probability associated with each element in the array in the first argument. So something like:
np.random.choice(pers, 1, p=[0.01, 0.01, 0.01, 0.01, ..... , 0.02, 0.02])
Where 0.01 is the lower probability for 1-59 and 0.02 is the higher probability for 60-100 (the full vector must still sum to 1).
The SciPy documentation has some useful examples.
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.random.choice.html
EDIT:
You might also try this link and look for a distribution (about half way down the page) that fits the model you are looking for.
http://docs.scipy.org/doc/scipy/reference/stats.html