I'm trying to use scipy in order to calculate a probability, given a binomial distribution:
The problem: in an exam with 45 questions, each with 5 choices, what is the probability of getting more than half of the exam right (i.e., more than 22.5 questions) by choosing answers at random?
I've tried:
from scipy.stats import binom
n = 45
p = 0.20
mu = n * p
p_x = binom.pmf(1,n,p)
How do I calculate this with scipy?
Assuming there is exactly one correct choice for each question, the random variable X that counts the number of correctly answered questions when choosing at random is indeed binomially distributed with parameters n=45 and p=0.2. Hence, you want to calculate P(X >= 23) = P(X = 23) + ... + P(X = 45) = 1 - P(X <= 22), so there are two ways to compute it:
from scipy.stats import binom
n = 45
p = 0.2
# (1)
prob = sum(binom.pmf(k, n, p) for k in range(23, 45 + 1))
# (2)
prob = 1 - binom.cdf(22, n, p)
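As an aside (not one of the two options above), scipy also exposes the survival function, which computes the upper tail directly and avoids the subtraction from 1:
# (3) survival function: sf(k) = P(X > k) = 1 - cdf(k)
prob = binom.sf(22, n, p)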
I have two questions:
1- This code takes too long to execute. Any idea how I can make it faster?
With the code below I want to generate 100 random discrete values between 700 and 1200.
I chose the Weibull distribution because I wanted to generate failure rate data; please see the histogram below.
import random

nums = []
alpha = 0.6
beta = 0.4
while len(nums) != 100:
    temp = int(random.weibullvariate(alpha, beta))
    if 700 <= temp < 1200:
        nums.append(temp)
print(nums)

# plotting a graph
# plt.hist(nums, bins=200)
# plt.show()
I wanted to generate a histogram like this one:
Histogram
2- I have this function for the discrete Weibull distribution:
def DiscreteWeibull(q, b, x):
    return q**(x**b) - q**((x + 1)**b)
How can I generate random values that follow this distribution?
Since a Weibull random variable W with shape parameter K and scale parameter lambda can be generated from a Uniform(0, 1) variable U via the inverse transform W = lambda * (-ln U)^(1/K), we can 'cut' the distribution to a desired minimum and maximum value. We do this by inverting the equation, setting W to 700 or 1200, and finding the corresponding values between 0 and 1. Here's some sample code.
import math
import random
import matplotlib.pyplot as plt

def weibull_from_uniform(shape, scale, x):
    """Map a Uniform(0, 1) value x to a Weibull(shape, scale) value via the inverse CDF."""
    assert 0 <= x <= 1
    return scale * pow(-1 * math.log(x), 1.0 / shape)

scale_param = 0.6
shape_param = 0.4
min_value = 700.0
max_value = 1200.0

# uniform values corresponding to the desired Weibull bounds
lower_bound = math.exp(-1 * pow(min_value / scale_param, shape_param))
upper_bound = math.exp(-1 * pow(max_value / scale_param, shape_param))
if lower_bound > upper_bound:
    lower_bound, upper_bound = upper_bound, lower_bound

nums = []
while len(nums) < 100:
    nums.append(weibull_from_uniform(shape_param, scale_param, random.uniform(lower_bound, upper_bound)))
print(nums)
plt.hist(nums, bins=8)
plt.show()
This code gives a histogram very similar to the one you provided; the method will give values from the same distribution as your original method, just faster. Note that this direct approach only works when our shape parameter K <= 1, so that the density function is strictly decreasing. When K > 1, the Weibull density function increases to a mode, then decreases, so you may need to draw from two uniform intervals for particular min and max values (since inverting for W and U may give two answers).
Your question is not very clear on why you thought using this Weibull distribution was a good idea, nor what distribution you are looking to achieve.
Discrete uniform distribution
Here are two ways to achieve the discrete uniform distribution on [700, 1200).
1) With random
import random
nums = [random.randrange(700, 1200) for _ in range(100)]
2) With numpy
import numpy
nums = numpy.random.randint(700, 1200, 100)
Geometric distribution
You have edited your question with an example histogram, and the mention "I wanted to generate a histogram like this one". The histogram vaguely looks like a geometric distribution.
We can use numpy.random.geometric:
import numpy
n_samples = 100
p = 0.5
a, b = 50, 650
cap = 1200
nums = numpy.random.geometric(p, size = 2 * n_samples) * a + b
nums = nums[numpy.where(nums < cap)][:n_samples]
In my dataset, there are N people who are each assigned to one of 3 groups (groups = {A, B, C}). I want to find the probability that two random people, n_1 and n_2, belong to the same group.
I have data on each of these groups and how many people belong to them. Importantly, each group is of a different size.
import pandas as pd
import numpy as np
import math
data = {
"Group": ['A', 'B', 'C'],
"Count": [20, 10, 5],
}
df = pd.DataFrame(data)
Group Count
0 A 20
1 B 10
2 C 5
I think I know how to get the sample space, S but I am unsure how to get the numerator.
def nCk(n, k):
    f = math.factorial
    return f(n) / f(k) / f(n - k)
n = sum(df['Count'])
k = 2
s = nCk(n, k)
My discrete mathematics skills are a bit rusty so feel free to correct me. You have N people split into groups of sizes s_1, ..., s_n so that N = s_1 + ... + s_n.
1) The chance of one random person belonging to group i is s_i / N.
2) The chance of a second person also being in group i is (s_i - 1) / (N - 1).
3) The chance of both being in group i is s_i / N * (s_i - 1) / (N - 1).
4) The probability of them being together in any group is the sum of the probabilities in (3) across all groups.
Code:
import numpy as np
s = df['Count'].values
n = s.sum()
prob = np.sum(s/n * (s-1)/(n-1)) # 0.4117647058823529
We can generalize this solution to "the probability of k people all being in the same group":
k = 2
i = np.arange(k)[:, None]
tmp = (s-i) / (n-i)
prob = np.prod(tmp, axis=0).sum()
When k > s.max() (20 in this case), the answer is 0 because you cannot fit all of them in one group. When k > s.sum() (35 in this case), the result is nan.
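For example, rerunning the same lines with k = 3 gives the probability that three randomly chosen people all share a group (roughly 0.194 for these counts, by my own calculation):
k = 3
i = np.arange(k)[:, None]
tmp = (s-i) / (n-i)
prob = np.prod(tmp, axis=0).sum()  # ~0.194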
I will answer your problem using the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, where each draw is either a success or a failure. In contrast, the binomial distribution describes the probability of k successes in n draws with replacement.
So the total probability should be the probability of both belonging to A + probability of both belonging to B + probability of both belonging to C.
This means
P(A) = (nCk(20,2) * nCk(15,0)) / nCk(35,2)
P(B) = (nCk(10,2) * nCk(25,0)) / nCk(35,2)
P(C) = (nCk(5,2) * nCk(30,0)) / nCk(35,2)
In code terms:
import pandas as pd
import math

data = {
    "Group": ['A', 'B', 'C'],
    "Count": [20, 10, 5],
}
df = pd.DataFrame(data)

def nCk(n, k):
    f = math.factorial
    return f(n) / f(k) / f(n - k)

samples = 2
successes = 2
observations = df['Count'].sum()

# group sizes pulled out of the dataframe
count_a = int(df.loc[df['Group'] == 'A', 'Count'].iloc[0])
count_b = int(df.loc[df['Group'] == 'B', 'Count'].iloc[0])
count_c = int(df.loc[df['Group'] == 'C', 'Count'].iloc[0])

p_a = nCk(count_a, samples) * nCk(observations - count_a, samples - successes) / nCk(observations, samples)
p_b = nCk(count_b, samples) * nCk(observations - count_b, samples - successes) / nCk(observations, samples)
p_c = nCk(count_c, samples) * nCk(observations - count_c, samples - successes) / nCk(observations, samples)

proba = p_a + p_b + p_c
print(proba)
Output:
0.41176470588235287
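For reference (not part of the original answer), scipy also ships the hypergeometric distribution directly as scipy.stats.hypergeom; a minimal sketch of the same computation for the counts above:
from scipy.stats import hypergeom

counts = [20, 10, 5]
total = sum(counts)   # 35 people in total
draws = 2             # we pick two people
# for each group: P(both drawn people belong to that group), then sum over groups
proba = sum(hypergeom.pmf(2, total, c, draws) for c in counts)
print(proba)  # ~0.4118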
I don't want a and b to be random; I want them to be specific, so the user will input the two samples and the test is computed from them (variance, mean, size).
I tried it this way:
size = input('Size of sample: ')
N = size
source2 = input('Mean: ')
source3 = input('Distribution: ')
a = source2
print("a:",a)
# Gaussian distributed data, mean = 0 and variance = 1
b = source3
print("b:",b)
Import the packages

import numpy as np
from scipy import stats

Define 2 random distributions

# Sample size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
a = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
b = np.random.randn(N)

Calculate the Standard Deviation

# Calculate the variance to get the standard deviation.
# For an unbiased estimate we divide the variance by N-1, hence the parameter ddof=1.
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
# pooled standard deviation
s = np.sqrt((var_a + var_b)/2)
Calculate the t-statistic

t = (a.mean() - b.mean())/(s*np.sqrt(2/N))

Compare with the critical t-value

# Degrees of freedom
df = 2*N - 2
# p-value after comparison with the t statistic
p = 1 - stats.t.cdf(t, df=df)
print("t = " + str(t))
print("p = " + str(2*p))

Note that we multiply the p value by 2 because it's a two-tailed t-test. After comparing the t statistic with the critical t value (computed internally), we get a p value of about 0.0005, so we reject the null hypothesis: the means of the two distributions are different, and the difference is statistically significant.
Cross checking with the internal scipy function

t2, p2 = stats.ttest_ind(a, b)
print("t = " + str(t2))
# ttest_ind already returns the two-sided p value, so it is not doubled here
print("p = " + str(p2))
I need to know how to generate 1000 random numbers between 500 and 600 that have a mean of 550 and a standard deviation of 30 in Python.
import pylab
import random
xrandn = pylab.zeros(1000,float)
for j in range(500, 601):
    xrandn[j] = pylab.randn()
???????
You are looking for stats.truncnorm:
import scipy.stats as stats
a, b = 500, 600
mu, sigma = 550, 30
dist = stats.truncnorm((a - mu) / sigma, (b - mu) / sigma, loc=mu, scale=sigma)
values = dist.rvs(1000)
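A quick sanity check (a sketch; the exact numbers vary from run to run, and note that the sample standard deviation will typically come out below 30, since truncating to [500, 600] narrows the underlying normal):
print(values.mean(), values.std())  # mean close to 550, std below 30
print(values.min(), values.max())   # both inside [500, 600]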
There are other choices for your problem too. Wikipedia has a list of continuous distributions with bounded intervals, depending on the distribution you may be able to get your required characteristics with the right parameters. For example, if you want something like "a bounded Gaussian bell" (not truncated) you can pick the (scaled) beta distribution:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
def my_distribution(min_val, max_val, mean, std):
    scale = max_val - min_val
    location = min_val
    # Mean and standard deviation of the unscaled beta distribution
    unscaled_mean = (mean - min_val) / scale
    unscaled_var = (std / scale) ** 2
    # Computation of alpha and beta can be derived from mean and variance formulas
    t = unscaled_mean / (1 - unscaled_mean)
    beta = ((t / unscaled_var) - (t * t) - (2 * t) - 1) / ((t * t * t) + (3 * t * t) + (3 * t) + 1)
    alpha = beta * t
    # Not all parameters may produce a valid distribution
    if alpha <= 0 or beta <= 0:
        raise ValueError('Cannot create distribution for the given parameters.')
    # Make scaled beta distribution with computed parameters
    return scipy.stats.beta(alpha, beta, scale=scale, loc=location)
np.random.seed(100)
min_val = 1.5
max_val = 35
mean = 9.87
std = 3.1
my_dist = my_distribution(min_val, max_val, mean, std)
# Plot distribution PDF
x = np.linspace(min_val, max_val, 100)
plt.plot(x, my_dist.pdf(x))
# Stats
print('mean:', my_dist.mean(), 'std:', my_dist.std())
# Get a large sample to check bounds
sample = my_dist.rvs(size=100000)
print('min:', sample.min(), 'max:', sample.max())
Output:
mean: 9.87 std: 3.100000000000001
min: 1.9290674232087306 max: 25.03903889816994
Probability density function plot:
Note that not every possible combination of bounds, mean and standard deviation will produce a valid distribution in this case, though, and depending on the resulting values of alpha and beta the probability density function may look like an "inverted bell" instead (even though mean and standard deviation would still be correct).
I'm not exactly sure what the OP wanted, but if the goal is simply an array xrandn matching the bottom plot, here are the steps:
First, create a normally (Gaussian) distributed sample; the easiest way is probably numpy:
import numpy as np
random_nums = np.random.normal(loc=550, scale=30, size=1000)
And then you keep only the numbers within the desired range with a list comprehension:
random_nums_filtered = [i for i in random_nums if i>500 and i<600]
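Note that the filtering step will usually leave fewer than 1000 values. If exactly 1000 in-range numbers are needed, one option (a sketch, not part of the original answer) is to keep drawing until enough values fall inside the bounds:
import numpy as np

random_nums_filtered = []
while len(random_nums_filtered) < 1000:
    draw = np.random.normal(loc=550, scale=30, size=1000)
    random_nums_filtered.extend(draw[(draw > 500) & (draw < 600)].tolist())
random_nums_filtered = random_nums_filtered[:1000]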
I have another question that I was hoping someone could help me with.
I'm using the Jensen-Shannon divergence to measure the similarity between two probability distributions. The similarity scores appear plausible in the sense that they fall between 0 and 1 when the base-2 logarithm is used, with 0 meaning that the distributions are equal.
However, I'm not sure whether there is in fact an error somewhere and was wondering whether someone might be able to say 'yes it's correct' or 'no, you did something wrong'.
Here is the code:
from numpy import zeros, array
from math import sqrt, log

class JSD(object):
    def __init__(self):
        self.log2 = log(2)

    def KL_divergence(self, p, q):
        """ Compute KL divergence of two vectors, K(p || q)."""
        return sum(p[x] * log((p[x]) / (q[x])) for x in range(len(p)) if p[x] != 0.0 or p[x] != 0)

    def Jensen_Shannon_divergence(self, p, q):
        """ Returns the Jensen-Shannon divergence. """
        self.JSD = 0.0
        weight = 0.5
        average = zeros(len(p))  # Average
        for x in range(len(p)):
            average[x] = weight * p[x] + (1 - weight) * q[x]
        self.JSD = (weight * self.KL_divergence(array(p), average)) + ((1 - weight) * self.KL_divergence(array(q), average))
        return 1 - (self.JSD / sqrt(2 * self.log2))

if __name__ == '__main__':
    J = JSD()
    p = [1.0/10, 9.0/10, 0]
    q = [0, 1.0/10, 9.0/10]
    print(J.Jensen_Shannon_divergence(p, q))
The problem is that I feel that the scores are not high enough when comparing two text documents, for instance. However, this is purely a subjective feeling.
Any help is, as always, appreciated.
Note that the scipy entropy call below is the Kullback-Leibler divergence.
See: http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
#!/usr/bin/env python
from scipy.stats import entropy
from numpy.linalg import norm
import numpy as np

def JSD(P, Q):
    _P = P / norm(P, ord=1)
    _Q = Q / norm(Q, ord=1)
    _M = 0.5 * (_P + _Q)
    return 0.5 * (entropy(_P, _M) + entropy(_Q, _M))
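For instance, on the distributions from the question (scipy's entropy uses the natural logarithm by default, so the value is in nats; it matches the scipy.spatial result quoted further down):
print(JSD(np.array([1.0/10, 9.0/10, 0]), np.array([0, 1.0/10, 9.0/10])))
# ~0.5306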
Also note that the test case in the question looks off: the p distribution does not appear to sum to 1.0.
See: http://www.itl.nist.gov/div898/handbook/eda/section3/eda361.htm
Since the Jensen-Shannon distance (distance.jensenshannon) has been included in Scipy 1.2, the Jensen-Shannon divergence can be obtained as the square of the Jensen-Shannon distance:
from scipy.spatial import distance
distance.jensenshannon([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10]) ** 2
# 0.5306056938642212
Get some data for distributions with known divergence and compare your results against those known values.
BTW: the sum in KL_divergence may be rewritten using the zip built-in function like this:
sum(_p * log(_p / _q) for _p, _q in zip(p, q) if _p != 0)
This does away with lots of "noise" and is also much more "pythonic". The double comparison with 0.0 and 0 is not necessary.
A general version, for n probability distributions, in Python:
import numpy as np
from scipy.stats import entropy as H

def JSD(prob_distributions, weights, logbase=2):
    # left term: entropy of the mixture
    # (weights[:, None] weights each distribution, i.e. each row, as a whole)
    wprobs = weights[:, None] * prob_distributions
    mixture = wprobs.sum(axis=0)
    entropy_of_mixture = H(mixture, base=logbase)

    # right term: weighted sum of the individual entropies
    entropies = np.array([H(P_i, base=logbase) for P_i in prob_distributions])
    wentropies = weights * entropies
    sum_of_entropies = wentropies.sum()

    divergence = entropy_of_mixture - sum_of_entropies
    return divergence
# From the original example with three distributions:
P_1 = np.array([1/2, 1/2, 0])
P_2 = np.array([0, 1/10, 9/10])
P_3 = np.array([1/3, 1/3, 1/3])
prob_distributions = np.array([P_1, P_2, P_3])
n = len(prob_distributions)
weights = np.empty(n)
weights.fill(1/n)
print(JSD(prob_distributions, weights))
#0.546621319446
Explicitly following the math in the Wikipedia article:
import numpy as np

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A / B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))
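As a quick check of my own, calling this on the question's distributions gives roughly 0.77; this version uses log base 2 (bits), so it differs from the natural-log value of about 0.53 quoted above by a factor of ln 2. numpy will warn about taking the log of zero, but those nan terms are filtered out:
print(jsdiv([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10]))
# ~0.766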