Calculate probability 2 random people are in the same group? - python

In my dataset, there are N people who are each assigned to one of 3 groups (groups = {A, B, C}). I want to find the probability that two random people, n_1 and n_2, belong to the same group.
I have data on each of these groups and how many people belong to them. Importantly, each group is of a different size.
import pandas as pd
import numpy as np
import math
data = {
    "Group": ['A', 'B', 'C'],
    "Count": [20, 10, 5],
}
df = pd.DataFrame(data)
  Group  Count
0     A     20
1     B     10
2     C      5
I think I know how to get the sample space, S, but I am unsure how to get the numerator.
def nCk(n, k):
    f = math.factorial
    return f(n) / f(k) / f(n - k)

n = sum(df['Count'])
k = 2
s = nCk(n, k)

My discrete mathematics skills are a bit rusty so feel free to correct me. You have N people split into groups of sizes s_1, ..., s_n so that N = s_1 + ... + s_n.
1. The chance of one random person belonging to group i is s_i / N.
2. The chance of a second person also being in group i is (s_i - 1) / (N - 1).
3. The chance of both being in group i is s_i / N * (s_i - 1) / (N - 1).
4. The probability of them being together in any group is the sum of the probabilities in #3 across all groups.
Code:
import numpy as np
s = df['Count'].values
n = s.sum()
prob = np.sum(s/n * (s-1)/(n-1)) # 0.4117647058823529
We can generalize this solution to "the probability of k people all being in the same group":
k = 2
i = np.arange(k)[:, None]
tmp = (s-i) / (n-i)
prob = np.prod(tmp, axis=0).sum()
When k > s.max() (20 in this case), the answer is 0 because you cannot fit all of them in one group. When k > s.sum() (35 in this case), the result is nan.
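As a quick sanity check, a small Monte Carlo simulation (a sketch, assuming numpy is available) should land close to the analytical 0.4118:
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([20, 10, 5])
labels = np.repeat(np.arange(len(counts)), counts)  # one group label per person

trials = 100_000
hits = sum(len(set(rng.choice(labels, size=2, replace=False))) == 1 for _ in range(trials))
print(hits / trials)  # should be close to 0.4118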

I will answer your problem using the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, wherein each draw is either a success or a failure. In contrast, the binomial distribution describes the probability of k successes in n draws with replacement.
So the total probability should be the probability of both belonging to A + probability of both belonging to B + probability of both belonging to C.
This means
P(A) = (nCk(20,2) * nCk(15,0))/nCk(35,2)
P(B) = (nCk(10,2) * nCk(25,0))/nCk(35,2)
P(C) = (nCk(5,2) * nCk(5,0)) / nCk(35,2)
In code terms:
import pandas as pd
import numpy as np
import math
data = {
    "Group": ['A', 'B', 'C'],
    "Count": [20, 10, 5],
}
df = pd.DataFrame(data)

def nCk(n, k):
    f = math.factorial
    return f(n) / f(k) / f(n - k)

samples = 2
successes = 2
observations = df['Count'].sum()
counts = df.set_index('Group')['Count']  # scalar count per group

p_a = nCk(counts['A'], samples) * nCk(observations - counts['A'], samples - successes) / nCk(observations, samples)
p_b = nCk(counts['B'], samples) * nCk(observations - counts['B'], samples - successes) / nCk(observations, samples)
p_c = nCk(counts['C'], samples) * nCk(observations - counts['C'], samples - successes) / nCk(observations, samples)

proba = p_a + p_b + p_c
print(proba)
Output:
0.41176470588235287
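If scipy is available, the same sum of hypergeometric terms can be written more compactly with scipy.stats.hypergeom (a sketch, not part of the original answer):
from scipy.stats import hypergeom

counts = [20, 10, 5]
total = sum(counts)

# P(both draws come from the same group) = sum over groups of
# P(2 successes in 2 draws | that group's members count as "successes")
prob = sum(hypergeom.pmf(2, total, c, 2) for c in counts)
print(prob)  # ≈ 0.41176, matching the result above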

Related

Scipy - probability in binomial distribution

I'm trying to use scipy in order to calculate a probability, given a binomial distribution:
The probability: in an exam with 45 questions, each with 5 choices, what is the probability of randomly answering more than half of the exam (that is, more than 22.5 questions) correctly?
I've tried:
from scipy.stats import binom
n = 45
p = 0.20
mu = n * p
p_x = binom.pmf(1,n,p)
How do I calculate this with scipy?
Assuming there's exactly one correct choice for each question, the random variable X, which counts the number of correctly answered questions under random guessing, is indeed binomially distributed with parameters n=45 and p=0.2. Hence, you want to calculate P(X >= 23) = P(X = 23) + ... + P(X = 45) = 1 - P(X <= 22), so there are two ways to compute it:
from scipy.stats import binom
n = 45
p = 0.2
# (1)
prob = sum(binom.pmf(k, n, p) for k in range(23, 45 + 1))
# (2)
prob = 1 - binom.cdf(22, n, p)
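A third, equivalent option (assuming the same scipy version) is the survival function, which computes P(X > k) directly:
# (3)
prob = binom.sf(22, n, p)  # P(X > 22) = P(X >= 23)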

Python vs MATLAB performance on algorithm

I have a performance question about two bits of code. One is implemented in python and one in MATLAB. The code calculates the sample entropy of a time series (which sounds complicated but is basically a bunch of for loops).
I am running both implementations on relatively large time series (~95k+ samples, depending on the time series). The MATLAB implementation finishes the calculation in ~45 sec to 1 min. The python one basically never finishes. I threw tqdm over the python for loops and the upper loop was only moving at about ~1.85 s/it, which gives 50+ hours as an estimated completion time (I've let it run for 15+ mins and the iteration count was pretty consistent).
Example inputs and runtimes:
MATLAB (~ 52 sec):
a = rand(1, 95000)
sampenc(a, 4, 0.1 * std(a))
Python (currently 5 mins in with 49 hours estimated):
import numpy as np
a = np.random.rand(1, 95000)[0]
sample_entropy(a, 4, 0.1 * np.std(a))
Python Implementation:
# https://github.com/nikdon/pyEntropy
def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.
    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases
    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))
    Returns:
        Vector containing Sample Entropy (float)
    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy analysis
            of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)
    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros((sample_length, 1))  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) / 2
    B = np.vstack(([N], B[:sample_length - 1]))
    similarity_ratio = A / B
    se = -np.log(similarity_ratio)
    se = np.reshape(se, -1)
    return se
MATLAB Implementation:
function [e,A,B]=sampenc(y,M,r);
%function [e,A,B]=sampenc(y,M,r);
%
%Input
%
%y input data
%M maximum template length
%r matching tolerance
%
%Output
%
%e sample entropy estimates for m=0,1,...,M-1
%A number of matches for m=1,...,M
%B number of matches for m=0,...,M-1 excluding last point
n=length(y);
lastrun=zeros(1,n);
run=zeros(1,n);
A=zeros(M,1);
B=zeros(M,1);
p=zeros(M,1);
e=zeros(M,1);
for i=1:(n-1)
   nj=n-i;
   y1=y(i);
   for jj=1:nj
      j=jj+i;
      if abs(y(j)-y1)<r
         run(jj)=lastrun(jj)+1;
         M1=min(M,run(jj));
         for m=1:M1
            A(m)=A(m)+1;
            if j<n
               B(m)=B(m)+1;
            end
         end
      else
         run(jj)=0;
      end
   end
   for j=1:nj
      lastrun(j)=run(j);
   end
end
N=n*(n-1)/2;
B=[N;B(1:(M-1))];
p=A./B;
e=-log(p);
I've also tried a few other python implementations and all of them have the same slow result:
vectorized-sample-entropy
sampen
sampen2.py
Wikipedia sample entropy implementation
I don't think it's a hardware issue, as the MATLAB version runs relatively fast on the same machine.
As far as I can tell, implementation-wise both sets of code are the same. I have no idea why the python implementations are so slow. I would understand a difference of a few seconds but not such a large discrepancy. Let me know your thoughts on why this is or suggestions on how to improve the python versions.
BTW: I'm using Python 3.6.5 with numpy 1.14.5 and MATLAB R2018a.
As said in the comments, MATLAB uses a JIT compiler by default; Python doesn't. In Python you could use Numba to do much the same.
Your code with slight modifications
import numba as nb
import numpy as np
import time

@nb.jit(fastmath=True, error_model='numpy')
def sample_entropy(time_series, sample_length, tolerance=None):
    """Calculate and return Sample Entropy of the given time series.
    Distance between two vectors defined as Euclidean distance and can
    be changed in future releases
    Args:
        time_series: Vector or string of the sample data
        sample_length: Number of sequential points of the time series
        tolerance: Tolerance (default = 0.1...0.2 * std(time_series))
    Returns:
        Vector containing Sample Entropy (float)
    References:
        [1] http://en.wikipedia.org/wiki/Sample_Entropy
        [2] http://physionet.incor.usp.br/physiotools/sampen/
        [3] Madalena Costa, Ary Goldberger, CK Peng. Multiscale entropy analysis
            of biological signals
    """
    if tolerance is None:
        tolerance = 0.1 * np.std(time_series)
    n = len(time_series)
    prev = np.zeros(n)
    curr = np.zeros(n)
    A = np.zeros((sample_length))  # number of matches for m = [1,...,template_length - 1]
    B = np.zeros((sample_length))  # number of matches for m = [1,...,template_length]

    for i in range(n - 1):
        nj = n - i - 1
        ts1 = time_series[i]
        for jj in range(nj):
            j = jj + i + 1
            if abs(time_series[j] - ts1) < tolerance:  # distance between two vectors
                curr[jj] = prev[jj] + 1
                temp_ts_length = min(sample_length, curr[jj])
                for m in range(int(temp_ts_length)):
                    A[m] += 1
                    if j < n - 1:
                        B[m] += 1
            else:
                curr[jj] = 0
        for j in range(nj):
            prev[j] = curr[j]

    N = n * (n - 1) // 2
    B2 = np.empty(sample_length)
    B2[0] = N
    B2[1:] = B[:sample_length - 1]
    similarity_ratio = A / B2
    se = -np.log(similarity_ratio)
    return se
Timings
a = np.random.rand(1, 95000)[0] #Python
a = rand(1, 95000) #Matlab
Python 3.6, Numba 0.40dev, Matlab 2016b, Core i5-3210M
Python: 487s
Python+Numba: 12.2s
Matlab: 71.1s
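If you reproduce these timings, keep in mind that Numba compiles the function on its first call; a minimal sketch that times only the already compiled code:
a = np.random.rand(95000)
sample_entropy(a, 4, 0.1 * np.std(a))   # first call includes JIT compilation

t0 = time.time()
sample_entropy(a, 4, 0.1 * np.std(a))   # compiled code only
print(time.time() - t0)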

Rows and columns restrictions on python

I have a list of lists m which I need to modify
I need the sum of each row to be greater than A and the sum of each column to be less than B.
I have something like this
x = 5  # or other number, not relevant
rows = len(m)
cols = len(m[0])

for r in range(rows):
    while sum(m[r]) < A:
        c = randint(0, cols-1)
        m[r][c] += x

for c in range(cols):
    cant = sum([m[r][c] for r in range(rows)])
    while cant > B:
        r = randint(0, rows-1)
        if m[r][c] >= x:  # I don't want negatives
            m[r][c] -= x
My problem is that I need to satisfy both conditions, and done this way, after the second for loop I can no longer be sure that the first condition is still met.
Any suggestions on how to satisfy both conditions, ideally with good performance? I could definitely consider using numpy.
Edit (an example)
# input
m = [[0,0,0],
     [0,0,0]]
A = 20
B = 25

# one desired output (since it chooses random positions)
m = [[10,0,15],
     [15,0,5]]
I may need to add:
This is for generating the random initial population of a genetic algorithm; the restrictions make each member a feasible solution, and I would need to run this about 80 times to get different possible solutions.
Something like this should do the trick:
import numpy
from scipy.optimize import linprog

A = 10
B = 20
m = 2
n = m * m

# the coefficients of a linear function to minimize.
# setting this to all ones minimizes the sum of all variable
# values in the matrix, which solves the problem, but see below.
c = numpy.ones(n)

# the constraint matrix.
# This is matrix-multiplied with the current solution candidate
# to form the left hand side of a set of normalized
# linear inequality constraint equations, i.e.
#
# x_0 * A_ub[0][0] + x_1 * A_ub[0][1] <= b_0
# x_0 * A_ub[1][0] + x_1 * A_ub[1][1] <= b_1
# ...
A_ub = numpy.zeros((2 * m, n))

# row sums. Since the <= inequality is a fixed component,
# we just multiply everything by (-1), i.e. we demand that
# the negative sums are smaller than the negative limit -A.
#
# Assign row ranges all at once, because numpy can do this.
for r in range(0, m):
    A_ub[r][r * m:(r + 1) * m] = -1

# We want that the sum of the x in each (flattened)
# column is smaller than B
#
# The manual stepping for the column sums in row-major encoding
# is a little bit annoying here.
for r in range(0, m):
    for j in range(0, m):
        A_ub[r + m][r + m * j] = 1

# the actual upper limits for the normalized inequalities.
b_ub = [-A] * m + [B] * m

# hand the linear program to scipy
solution = linprog(c, A_ub=A_ub, b_ub=b_ub)

# bring the solution into the desired matrix form
print(numpy.reshape(solution.x, (m, m)))
Caveats
I use <=, not < as indicated in your question, because that's what linprog supports.
This minimizes the total sum of all values in the target vector. For your use case, you probably want to minimize the distance to the original sample, which the linear program cannot handle, since neither the squared error nor the absolute difference can be expressed as a linear combination (which is what c stands for). For that, you will probably need to go to a full minimize().
Still, this should give you a rough idea.
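To double-check a returned solution against both constraints, a small sketch (assuming the setup above and that solution.success is True):
x = numpy.reshape(solution.x, (m, m))
assert (x.sum(axis=1) >= A - 1e-9).all()  # every row sum >= A
assert (x.sum(axis=0) <= B + 1e-9).all()  # every column sum <= B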
A NumPy solution:
import numpy as np
val = B / len(m) # column sums <= B
assert val * len(m[0]) >= A # row sums >= A
# create array shaped like m, filled with val
arr = np.empty_like(m)
arr[:] = val
I chose to ignore the original content of m - it's all zero in your example anyway.
from random import *

m = [[0,0,0],
     [0,0,0]]
A = 20
B = 25
x = 1  # or other number, not relevant
rows = len(m)
cols = len(m[0])

def runner(list1, a1, b1, x1):
    list1_backup = [row[:] for row in list1]  # copy the rows so a retry starts from the original matrix
    rows = len(list1)
    cols = len(list1[0])
    for r in range(rows):
        while sum(list1[r]) <= a1:
            c = randint(0, cols-1)
            list1[r][c] += x1
    for c in range(cols):
        cant = sum([list1[r][c] for r in range(rows)])
        while cant >= b1:
            r = randint(0, rows-1)
            if list1[r][c] >= x1:  # I don't want negatives
                list1[r][c] -= x1
            cant = sum([list1[r][c] for r in range(rows)])  # refresh the column sum so the loop can terminate
    good_a_int = 0
    for r in range(rows):
        test1 = sum(list1[r]) > a1
        good_a_int += 0 if test1 else 1
    if good_a_int == 0:
        return list1
    else:
        return runner(list1=list1_backup, a1=a1, b1=b1, x1=x1)

m2 = runner(m, A, B, x)
for row in m2:
    print(','.join(map(lambda x: "{:>3}".format(x), row)))

Jensen-Shannon Divergence

I have another question that I was hoping someone could help me with.
I'm using the Jensen-Shannon-Divergence to measure the similarity between two probability distributions. The similarity scores appear to be correct in the sense that they fall between 1 and 0 given that one uses the base 2 logarithm, with 0 meaning that the distributions are equal.
However, I'm not sure whether there is in fact an error somewhere and was wondering whether someone might be able to say 'yes it's correct' or 'no, you did something wrong'.
Here is the code:
from numpy import zeros, array
from math import sqrt, log

class JSD(object):
    def __init__(self):
        self.log2 = log(2)

    def KL_divergence(self, p, q):
        """ Compute KL divergence of two vectors, K(p || q)."""
        return sum(p[x] * log((p[x]) / (q[x])) for x in range(len(p)) if p[x] != 0.0 or p[x] != 0)

    def Jensen_Shannon_divergence(self, p, q):
        """ Returns the Jensen-Shannon divergence. """
        self.JSD = 0.0
        weight = 0.5
        average = zeros(len(p))  # Average
        for x in range(len(p)):
            average[x] = weight * p[x] + (1 - weight) * q[x]
        self.JSD = (weight * self.KL_divergence(array(p), average)) + ((1 - weight) * self.KL_divergence(array(q), average))
        return 1 - (self.JSD / sqrt(2 * self.log2))

if __name__ == '__main__':
    J = JSD()
    p = [1.0/10, 9.0/10, 0]
    q = [0, 1.0/10, 9.0/10]
    print(J.Jensen_Shannon_divergence(p, q))
The problem is that I feel that the scores are not high enough when comparing two text documents, for instance. However, this is purely a subjective feeling.
Any help is, as always, appreciated.
Note that the scipy entropy call below is the Kullback-Leibler divergence.
See: http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
#!/usr/bin/env python
from scipy.stats import entropy
from numpy.linalg import norm
import numpy as np

def JSD(P, Q):
    _P = P / norm(P, ord=1)
    _Q = Q / norm(Q, ord=1)
    _M = 0.5 * (_P + _Q)
    return 0.5 * (entropy(_P, _M) + entropy(_Q, _M))
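For example, applied to the distributions from the question (a usage sketch; the value agrees with the scipy.spatial result shown further below):
p = np.array([1.0/10, 9.0/10, 0])
q = np.array([0, 1.0/10, 9.0/10])
print(JSD(p, q))  # ≈ 0.5306 (natural-log base)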
Also note that the test case in the question looks erroneous: the p distribution does not appear to sum to 1.0.
See: http://www.itl.nist.gov/div898/handbook/eda/section3/eda361.htm
Since the Jensen-Shannon distance (distance.jensenshannon) has been included in Scipy 1.2, the Jensen-Shannon divergence can be obtained as the square of the Jensen-Shannon distance:
from scipy.spatial import distance
distance.jensenshannon([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10]) ** 2
# 0.5306056938642212
Get some data for distributions with known divergence and compare your results against those known values.
BTW: the sum in KL_divergence may be rewritten using the zip built-in function like this:
sum(_p * log(_p / _q) for _p, _q in zip(p, q) if _p != 0)
This does away with lots of "noise" and is also much more "pythonic". The double comparison with 0.0 and 0 is not necessary.
A general version, for n probability distributions, in python
import numpy as np
from scipy.stats import entropy as H

def JSD(prob_distributions, weights, logbase=2):
    # left term: entropy of mixture
    wprobs = weights * prob_distributions
    mixture = wprobs.sum(axis=0)
    entropy_of_mixture = H(mixture, base=logbase)

    # right term: sum of entropies
    entropies = np.array([H(P_i, base=logbase) for P_i in prob_distributions])
    wentropies = weights * entropies
    sum_of_entropies = wentropies.sum()

    divergence = entropy_of_mixture - sum_of_entropies
    return divergence

# From the original example with three distributions:
P_1 = np.array([1/2, 1/2, 0])
P_2 = np.array([0, 1/10, 9/10])
P_3 = np.array([1/3, 1/3, 1/3])

prob_distributions = np.array([P_1, P_2, P_3])
n = len(prob_distributions)
weights = np.empty(n)
weights.fill(1/n)

print(JSD(prob_distributions, weights))
# 0.546621319446
Explicitly following the math in the Wikipedia article:
import numpy as np

def jsdiv(P, Q):
    """Compute the Jensen-Shannon divergence between two probability distributions.

    Input
    -----
    P, Q : array-like
        Probability distributions of equal length that sum to 1
    """
    def _kldiv(A, B):
        return np.sum([v for v in A * np.log2(A / B) if not np.isnan(v)])

    P = np.array(P)
    Q = np.array(Q)
    M = 0.5 * (P + Q)
    return 0.5 * (_kldiv(P, M) + _kldiv(Q, M))
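A usage sketch on the question's distributions; since this version uses base-2 logarithms, the result is the ≈0.5306 natural-log value above divided by ln 2:
print(jsdiv([1.0/10, 9.0/10, 0], [0, 1.0/10, 9.0/10]))  # ≈ 0.7655 bits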

Calculating Pearson correlation and significance in Python

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.
You can have a look at scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
An alternative is scipy's native linregress function, which calculates:
slope : slope of the regression line
intercept : intercept of the regression line
r-value : correlation coefficient
p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : Standard error of the estimate
And here is an example:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
will return you:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
def pearsonr(x, y):
    # Assume len(x) == len(y)
    n = len(x)
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi*xi for xi in x)
    sum_y_sq = sum(yi*yi for yi in y)
    psum = sum(xi*yi for xi, yi in zip(x, y))
    num = psum - (sum_x * sum_y/n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0: return 0
    return num / den
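As a quick check (a usage sketch, not part of the original answer), this returns essentially the same value as the reference implementations shown later in this thread:
print(pearsonr([1, 2, 3], [1, 5, 7]))  # ≈ 0.9819805060619657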
The following code is a straight-up interpretation of the definition:
import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    return diffprod / math.sqrt(xdiff2 * ydiff2)
Test:
print(pearson_def([1, 2, 3], [1, 5, 7]))
returns
0.981980506062
This agrees with Excel and an online calculator (0.981980506), SciPy (0.9819805060619657), and NumPy (0.98198050606196574).
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
EDIT: Fixed a bug pointed out by a commenter.
You can do this with pandas.DataFrame.corr, too:
import pandas as pd

a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
This gives
          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000
Rather than relying on numpy/scipy, I think my answer should be the easiest to code and to understand in terms of the steps for calculating the Pearson Correlation Coefficient (PCC).
import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
        sum += i
    return sum / len(x)

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
        sumv += (i - mean(x))**2
    return math.sqrt(sumv / (len(x) - 1))

# calculates the PCC using both the 2 functions above
def pearson(x, y):
    scorex = []
    scorey = []
    for i in x:
        scorex.append((i - mean(x)) / sampleStandardDeviation(x))
    for j in y:
        scorey.append((j - mean(y)) / sampleStandardDeviation(y))
    # multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i, j in zip(scorex, scorey)])) / (len(x) - 1)
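A short usage sketch; because the z-scores use the sample standard deviation and the sum is divided by len(x)-1, the result matches the other implementations in this thread:
print(pearson([1, 2, 3], [1, 5, 7]))  # ≈ 0.98198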
The significance of the PCC is basically to show you how strongly correlated the two variables/lists are.
It is important to note that the PCC value ranges from -1 to 1.
A value between 0 and 1 denotes a positive correlation.
A value of 0 means there is no correlation at all.
A value between -1 and 0 denotes a negative correlation.
Pearson coefficient calculation using pandas in python:
I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export the data set, save it, and add new data outside the python console for later analysis. This code is simpler and contains fewer lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis.
Example:
data = {'list 1': [2, 4, 6, 8], 'list 2': [4, 16, 36, 64]}

import pandas as pd  # convert your lists into pandas dataframes
df = pd.DataFrame(data, columns=['list 1', 'list 2'])

from scipy import stats  # for the built-in method to get the PCC
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"])  # define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value)  # results
However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.
Hmm, many of these responses have long and hard to read code...
I'd suggest using numpy with its nifty features when working with arrays:
import numpy as np
def pcc(X, Y):
''' Compute Pearson Correlation Coefficient. '''
# Normalise X and Y
X -= X.mean(0)
Y -= Y.mean(0)
# Standardise X and Y
X /= X.std(0)
Y /= Y.std(0)
# Compute mean product
return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)
Here's a variant of mkh's answer that runs much faster than it (and than scipy.stats.pearsonr), using numba.
import numba

@numba.jit
def corr(data1, data2):
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)
This is an implementation of the Pearson correlation function using numpy:
def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()
    # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean() - mean1*mean2) / (std1*std2)
    return corr
Here is an implementation of Pearson correlation based on sparse vectors. The vectors here are expressed as lists of (index, value) tuples. The two sparse vectors can be of different lengths, but the overall vector size has to be the same. This is useful for text mining applications where the vector size is extremely large (most features are bags of words), and hence calculations are usually performed using sparse vectors.
from math import sqrt

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")

    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)

    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset

    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))

    covariance_a_b = 0

    # calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        indexed_feature_dict[tuple[0]] = tuple[1]

    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)

    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)

    # adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b

    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
Unit tests:
def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)

    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:
import numpy as np

def manual_pearson(a, b):
    """
    Accepts two arrays of equal length, and computes correlation coefficient.
    Numerator is the sum of product of (a - a_avg) and (b - b_avg),
    while denominator is the product of a_std and b_std multiplied by
    length of array.
    """
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a - a_avg, b - b_avg))
    p_coef = numerator / denominator
    return p_coef
Starting in Python 3.10, Pearson's correlation coefficient (statistics.correlation) is available directly in the standard library:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!
It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.
from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)
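A usage sketch with made-up data (the variable names and coefficients below are purely illustrative):
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=30)
Y = 0.5 * X + rng.normal(size=30)

r, p = probabilityNotCorrelated(X, Y, direction=1)  # one-tailed test for a positive correlation
print(r, p)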
You can take a look at this article. This is a well-documented example for calculating correlation based on historical forex currency pairs data from multiple files using pandas library (for Python), and then generating a heatmap plot using seaborn library.
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
Calculating Correlation:
Correlation measures how strongly two different variables move together.
Using pearson correlation
from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column
Using Spearman correlation
from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column
Using Kendall correlation
kendall_correlation=final_data.corr(method='kendall')
kendall_correlation
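Since final_data is not defined in the snippets above, here is a small self-contained sketch with a made-up DataFrame showing all three methods side by side:
import pandas as pd

# hypothetical data frame standing in for final_data
final_data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 1, 4, 3, 5],
    "z": [5, 4, 3, 2, 1],
})

for method in ("pearson", "spearman", "kendall"):
    print(method)
    print(final_data.corr(method=method))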
def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules.
    It is assumed that the predictions are not constant.
    Returns the average of each sample's Pearson correlation coefficient"""
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0); y2 /= y2.std(axis=0)
    y1 = y_true.copy()
    y1 -= y1.mean(axis=0); y1 /= y1.std(axis=0)
    c = (y1 * y2).mean().mean()  # correlation for rescaled matrices is just the matrix product and average
    return c
def pearson(x, y):
    n = len(x)
    vals = range(n)

    sumx = sum([float(x[i]) for i in vals])
    sumy = sum([float(y[i]) for i in vals])

    sumxSq = sum([x[i]**2.0 for i in vals])
    sumySq = sum([y[i]**2.0 for i in vals])

    pSum = sum([x[i]*y[i] for i in vals])

    # Calculating Pearson correlation
    num = pSum - (sumx*sumy/n)
    den = ((sumxSq - pow(sumx, 2)/n) * (sumySq - pow(sumy, 2)/n))**.5
    if den == 0: return 0
    r = num/den
    return r
