Python: Indices of identical tuple elements are also identical? - python

I'm just starting out, so I apologize if this is a stupid question. I'm on Python 2.7, if that matters. I'm writing a program that evaluates a polynomial at some x. The coefficients are the elements of a tuple, and each coefficient's index is the power of x it multiplies. It runs fine when all the coefficients are different; the problem appears when any of the coefficients are the same. The code is below:
def evaluate_poly(poly, x):
    """polynomial coefficients represented by elements of tuple.
    each coefficient evaluated at x ** index of coefficient"""
    poly_sum = 0.0
    for coefficient in poly:
        val = coefficient * (x ** poly.index(coefficient))
        poly_sum += val
    return poly_sum
poly = (1, 2, 3)
x = 5
print evaluate_poly(poly, x)
##for coefficient in poly:
##    print poly.index(coefficient)
This prints 86.0, as you would expect.
The commented-out print statement prints the index of each element in poly. When the elements are all different, as in (1, 2, 3), it prints what you would expect (0, 1, 2), but if any of the elements are the same, as in (1, 1, 2), the repeated elements get the same index (0, 0, 2), so I'm really only able to evaluate polynomials where all the coefficients are different. What am I doing wrong here? I figure it has something to do with
poly.index(coefficient)
but I can't figure out exactly why. Thanks in advance.

Use enumerate. index returns the position of the first occurrence of a value, so it gives the wrong answer for repeated elements: in your code, poly.index(1) with poly = (1, 1, 2) returns 0 every time.
Using enumerate gives you the actual index of every element, and does so more efficiently, since index has to search the tuple on every call:
def evaluate_poly(poly, x):
    """polynomial coefficients represented by elements of tuple.
    each coefficient evaluated at x ** index of coefficient"""
    poly_sum = 0.0
    # ind is each index, coefficient is each element
    for ind, coefficient in enumerate(poly):
        # no need for val just += coefficient * (x ** ind)
        poly_sum += coefficient * (x ** ind)
    return poly_sum
If you print(list(enumerate(poly))) with poly = (1, 1, 3), you will see each index paired with its element:
[(0, 1), (1, 1), (2, 3)]
So on each pass through the loop, ind is the index of the current coefficient in your poly tuple.
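A small sketch of the difference, using the (1, 1, 2) tuple from the question. The variable names by_index and by_enumerate are only illustrative:

```python
poly = (1, 1, 2)

# tuple.index searches from the left and stops at the first match,
# so both 1s report position 0.
by_index = [poly.index(c) for c in poly]

# enumerate pairs every element with its own position.
by_enumerate = [i for i, _ in enumerate(poly)]

print(by_index)      # [0, 0, 2]
print(by_enumerate)  # [0, 1, 2]
```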
You can also just return the sum of a generator expression:
def evaluate_poly(poly, x):
    """polynomial coefficients represented by elements of tuple.
    each coefficient evaluated at x ** index of coefficient"""
    return sum((coefficient * (x ** ind) for ind, coefficient in enumerate(poly)), 0.0)
Using 0.0 as the start value means a float is returned rather than an int. You could also wrap the call as float(sum(...)), but I think it is simpler to pass the start value as a float.

Try this one:
def evaluate_poly(poly, x):
    '''
    polynomial coefficients represented by elements of tuple.
    each coefficient evaluated at x ** index of coefficient
    '''
    poly_sum = 0.0
    for ind, coefficient in enumerate(poly):
        print ind
        val = coefficient * (x ** ind)
        poly_sum += val
    return poly_sum
poly = (1, 2, 3)
x = 5
print evaluate_poly(poly, x)
It works with poly = (1, 1, 3) too!
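To see how much the two approaches differ in the result, here is a side-by-side sketch. The function names evaluate_with_index and evaluate_with_enumerate are made up for the comparison, and the code is written so that it runs on Python 3 as well:

```python
def evaluate_with_index(poly, x):
    # buggy: poly.index(c) is the position of the FIRST element equal to c
    return sum((c * x ** poly.index(c) for c in poly), 0.0)

def evaluate_with_enumerate(poly, x):
    # correct: every coefficient is paired with its own position
    return sum((c * x ** i for i, c in enumerate(poly)), 0.0)

print(evaluate_with_index((1, 2, 3), 5))      # 86.0
print(evaluate_with_enumerate((1, 2, 3), 5))  # 86.0
print(evaluate_with_index((1, 1, 3), 5))      # 77.0, the second 1 is treated as the constant term
print(evaluate_with_enumerate((1, 1, 3), 5))  # 81.0
```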

Related

In Python, how do I add an incrementing element of a polynomial?

I construct a Newton polynomial based on a given simple sine function. Implemented intermediate calculations, but stopped at the final stage - to obtain the formula of the polynomial. Recursion may help here, but it's inaccurate. Here is the formula of the polynomial
The formula iterates over the values from the table below: we go through the column of x's and the first line of the calculated deltas (we go up to the delta, which degree of the polynomial we get). For example, if the degree is 2, then we will take 2 deltas in the first row and values up to 2.512 in the column of x (9 brackets with x differences will be in the last block of the polynomial)
The formula contains a set of constant blocks through which the values are iterated, but I have hit a snag with the element (x - x_0)**[n], where n is the degree of the polynomial that the user sets. Here [n] means that the expression in parentheses is expanded:
I use the sympy library for symbolic calculations: x in the formula of the future polynomial should remain x (as a symbol, not its value). How can I implement the repeating part of the block, which gains a new bracket each time the degree of the polynomial grows?
Code:
import numpy as np
from sympy import *
import pandas as pd
from scipy.special import factorial
def func(x):
    return np.sin(x)
def poly(order):
    # building columns X and Y:
    x_i_list = [round( (0.1*np.pi*i), 4 ) for i in range(0, 11)]
    y_i_list = []
    for x in x_i_list:
        y_i = round( (func(x)), 4 )
        y_i_list.append(y_i)
    # we get deltas:
    n=order
    if n < len(y_i_list):
        result = [ np.diff(y_i_list, n=d) for d in np.arange(1, len(y_i_list)) ]
        print(result)
    else:
        print(f'Determine the order of the polynomial less than {len(y_i_list)}')
    # We determine the index in the x column based on the degree of the polynomial:
    delta_index=len(result[order-1])-1
    x_index = delta_index
    h = (x_i_list[x_index] - x_i_list[0]) / n # calculate h
    b=x_i_list[x_index]
    a=x_i_list[0]
    y_0=x_i_list[0]
    string_one = [] # list with deltas of the first row (including the degree column of the polynomial)
    for elen in result:
        string_one.append(round(elen[0], 4))
    # creating a list for the subsequent passage through the x's
    x_col_list = []
    for col in x_i_list:
        if col <= x_i_list[x_index]:
            x_col_list.append(col)
    x = Symbol('x') # for symbolic representation of x's
    # we go along the deltas of the first line:
    for delta in string_one:
        # we go along the column of x's
        for arg in x_col_list:
            for n in range(1, order+1):
                polynom = ( delta/(factorial(n)*h**n) )*(x - arg) # Here I stopped
I guess you're looking for something like this:
In [52]: from sympy import symbols, prod
In [53]: x = symbols('x')
In [54]: nums = [1, 2, 3, 4]
In [55]: prod((x-n) for n in nums)
Out[55]: (x - 4)⋅(x - 3)⋅(x - 2)⋅(x - 1)
EDIT: Actually it's more efficient to do this with Mul rather than prod:
In [134]: Mul(*((x-n) for n in nums))
Out[134]: (x - 4)⋅(x - 3)⋅(x - 2)⋅(x - 1)
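Building on the Mul idea, here is a rough sketch of how the growing product of brackets could be combined with the first-row differences. It assumes the formula in the question is the standard Newton forward-difference polynomial on equally spaced nodes; the function name newton_forward and its arguments xs, ys, order are hypothetical, not taken from the question's code:

```python
from sympy import Integer, Mul, Symbol, expand, factorial

def newton_forward(xs, ys, order):
    """Sketch: Newton forward-difference polynomial on equally spaced nodes xs."""
    x = Symbol("x")
    h = xs[1] - xs[0]                      # assumes constant spacing
    # first row of the difference table: y_0, delta y_0, delta^2 y_0, ...
    diffs = list(ys)
    first_row = [diffs[0]]
    for _ in range(order):
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        first_row.append(diffs[0])
    poly = Integer(0)
    for k, delta in enumerate(first_row):
        # (x - x_0)(x - x_1)...(x - x_{k-1}); Mul() with no factors is 1
        bracket = Mul(*[(x - xs[j]) for j in range(k)])
        poly += delta / (factorial(k) * h**k) * bracket
    return x, poly

x, p = newton_forward([0, 1, 2], [0, 1, 4], 2)   # samples of x**2
print(expand(p))                                  # x**2
```

Each pass through the loop adds one more bracket to the product, which is the part the question was stuck on; x stays symbolic throughout.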

Implementing negative log-likelihood function in python

I'm having some difficulty implementing a negative log-likelihood function in Python.
My negative log-likelihood function is given as:
This is my implementation, but I keep getting the error: ValueError: shapes (31,1) and (2458,1) not aligned: 1 (dim 1) != 2458 (dim 0)
def negative_loglikelihood(X, y, theta):
    J = np.sum(-y @ X @ theta) + np.sum(np.exp(X @ theta))+ np.sum(np.log(y))
    return J
X is a dataframe of size (2458, 31), y is a dataframe of size (2458, 1), and theta is a dataframe of size (31, 1).
I cannot figure out what I am missing. Is my implementation incorrect somehow? Any help would be much appreciated. Thanks.
You cannot use matrix multiplication here; what you want is to multiply elements with the same index together, i.e. element-wise multiplication. The correct operator for this is *.
Moreover, you must transpose theta so that numpy can broadcast its dimension of size 1 to 2458 (likewise for y, whose dimension of size 1 is broadcast to 31).
x = np.random.rand(2458, 31)
y = np.random.rand(2458, 1)
theta = np.random.rand(31, 1)
def negative_loglikelihood(x, y, theta):
    J = np.sum(-y * x * theta.T) + np.sum(np.exp(x * theta.T))+ np.sum(np.log(y))
    return J
negative_loglikelihood(x, y, theta)
>>> 88707.699
EDIT: your formula includes a y! inside the logarithm; you should also update your code to match.
If you look at your equation, the term y_i x_i θ is summed over i = 1 to M, so y and x must be indexed by the same i; otherwise, apply a separate function to each.
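A hedged sketch of one way to put these remarks together. The formula image is not reproduced here, so this assumes it is the Poisson-regression negative log-likelihood, sum over i of [-y_i (x_i · θ) + exp(x_i · θ) + log(y_i!)], which fits the y! mentioned above. One caveat about the element-wise version: np.exp(x * theta.T) exponentiates every product x_ij θ_j separately, which is not the same as the exponential of the row sum x_i · θ, so this sketch forms the row-wise product X @ theta first and only then combines it element-wise with y:

```python
import math
import numpy as np

def negative_loglikelihood(X, y, theta):
    """Sketch of a Poisson NLL; assumes X is (M, p), y is (M, 1), theta is (p, 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    theta = np.asarray(theta, dtype=float).reshape(-1, 1)
    eta = X @ theta                          # (M, 1): one linear predictor per row
    # log(y_i!) via the log-gamma function, since y! = gamma(y + 1)
    log_y_fact = np.array([[math.lgamma(v + 1.0)] for v in y.ravel()])
    return float(np.sum(-y * eta + np.exp(eta) + log_y_fact))
```

With theta set to zeros, every exp term is 1, so the result reduces to M plus the sum of log(y_i!), which makes a convenient sanity check.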

How can I simplify this more?

I am trying to apply numpy to this code I wrote for trapezium rule integration:
def integral(a,b,n):
    delta = (b-a)/float(n)
    s = 0.0
    s+= np.sin(a)/(a*2)
    for i in range(1,n):
        s +=np.sin(a + i*delta)/(a + i*delta)
    s += np.sin(b)/(b*2.0)
    return s * delta
I am trying to get the return value from the new function something like this:
return delta *((2 *np.sin(x[1:-1])) +np.sin(x[0])+np.sin(x[-1]) )/2*x
I have been trying for a long time to make a breakthrough, but all my attempts have failed.
One thing I tried, and do not understand, is why the following code gives a "too many indices for array" error:
def integral(a,b,n):
    d = (b-a)/float(n)
    x = np.arange(a,b,d)
    J = np.where(x[:,1] < np.sin(x[:,0])/x[:,0])[0]
Every hint/advice is very much appreciated.
You forgot to sum over sin(x):
>>> def integral(a, b, n):
...     x, delta = np.linspace(a, b, n+1, retstep=True)
...     y = np.sin(x)
...     y[0] /= 2
...     y[-1] /= 2
...     return delta * y.sum()
...
>>> integral(0, np.pi / 2, 10000)
0.9999999979438324
>>> integral(0, 2 * np.pi, 10000)
0.0
>>> from scipy.integrate import quad
>>> quad(np.sin, 0, np.pi / 2)
(0.9999999999999999, 1.1102230246251564e-14)
>>> quad(np.sin, 0, 2 * np.pi)
(2.221501482512777e-16, 4.3998892617845996e-14)
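The loop in the question actually integrates sin(x)/x rather than sin(x), so here is a sketch of the same vectorized idea with the integrand passed in as a parameter. The names trapezium and sin_over_x are illustrative only. As for the "too many indices for array" error: np.arange returns a 1-D array, so x[:, 1] asks for a second axis that does not exist.

```python
import numpy as np

def trapezium(f, a, b, n):
    """Composite trapezium rule for f on [a, b] with n equal strips."""
    x, delta = np.linspace(a, b, n + 1, retstep=True)
    y = f(x)
    # interior points have weight 1, the two end points weight 1/2
    return delta * (y.sum() - 0.5 * (y[0] + y[-1]))

def sin_over_x(x):
    # undefined at x == 0, so keep 0 out of [a, b]
    return np.sin(x) / x

approx = trapezium(sin_over_x, 1.0, 2.0, 1000)
print(approx)
```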
I tried this meanwhile, too.
import numpy as np
def T_n(a, b, n, fun):
    delta = (b - a)/float(n) # delta formula
    x_i = lambda a,i,delta: a + i * delta # calculate x_i
    return 0.5 * delta * \
        (2 * sum(fun(x_i(a, np.arange(0, n + 1), delta))) \
         - fun(x_i(a, 0, delta)) \
         - fun(x_i(a, n, delta)))
Reconstructed the code using formulas at bottom of this page
https://matheguru.com/integralrechnung/trapezregel.html
The sum over range(0, n+1), which gives [0, 1, ..., n], is implemented using numpy. In plain Python you would usually collect the values in a for loop, but numpy's vectorized behaviour can be used here instead.
np.arange(0, n+1) gives np.array([0, 1, ..., n]).
When that array is passed to the function (abstracted here as fun), the function is evaluated at every point from x_0 to x_n and the results are collected in a numpy array. So fun(x_i(...)) returns a numpy array of the function applied to x_0 through x_n, and sum() adds up this array.
The entire sum() is multiplied by 2, and the function values at x_0 and x_n are subtracted afterwards (since in the trapezoid formula only the middle summands, not the first and the last, are multiplied by 2). This was kind of a hack.
The linked German page uses as a function fun(x) = x ^ 2 + 3
which can be nicely defined on the fly by using a lambda expression:
fun = lambda x: x ** 2 + 3
a = -2
b = 3
n = 6
You could instead use a normal function definition, too: def fun(x): return x ** 2 + 3.
So I tested by typing the command:
T_n(a, b, n, fun)
Which correctly returned:
## Out[172]: 27.24537037037037
For your case, just assign np.sin to fun and pass your values for a, b, and n in this function call.
Like:
fun = np.sin # by that, everywhere `fun` appears in the function,
# it will behave as if `np.sin` stood there - this is possible
# because Python treats its functions as first-class citizens
a = #your value
b = #your value
n = #your value
Finally, you can call:
T_n(a, b, n, fun)
And it will work!

Finding a way to replace a column in an np.ogrid with a different formula for a specific value of the iterable

a,b=np.ogrid[0:n:1,0:n:1]
A=np.exp(1j*(np.pi/3)*np.abs(a-b))
a,b=np.diag_indices_from(A)
A[a,b]=1-1j/np.sqrt(3)
is my basis. It produces a grid which acts as an n*n matrix.
My issue is I need to replace a column in the grid, say for example where b=17.
I need for this column to be:
A=np.exp(1j*(np.pi/3)*np.abs(a-17+geo_mean(x)))
except for where a=b where it needs to stay as:
A[a,b]=1-1j/np.sqrt(3)
geo_mean(x) is just a geometric average of 50 values determined from a pseudo random number generator, defined in my code as:
x=[random.uniform(0,0.5) for p in range(0,50)]
def geo_mean(iterable):
    a = np.array(iterable)
    return a.prod()**(1.0/len(a))
So how do I go about replacing a column to include the geo_mean in the exponent formula, without changing the diagonal value?
Let's start by saying that diag_indices_from() is kind of useless here since we already know that diagonal elements are those that have equal indices i and j and run up to value n. Therefore, let's simplify the code a little bit at the beginning:
a, b = np.ogrid[0:n:1, 0:n:1]
A = np.exp(1j * (np.pi / 3) * np.abs(a - b))
diag = np.arange(n)
A[diag, diag] = 1 - 1j / np.sqrt(3)
Now, let's say you would like to set the column k values, except for the diagonal element, to
np.exp(1j * (np.pi/3) * np.abs(a - 17 + geo_mean(x)))
(I guess a in the above formula is row index).
This can be done using integer indices, especially since they are almost computed already: we have diag, and we just need to remove from it the index of the diagonal element that must be kept unchanged:
r = np.delete(diag, k)
Then
x = np.random.uniform(0, 0.5, (r.size, 50))
A[r, k] = np.exp(1j * (np.pi/3) * np.abs(r - k + geo_mean(x)))
However, for the above to work, you need to rewrite your geo_mean() function in such a way that it works with 2D input arrays (I will also add some checks and conversions to make it backward compatible):
def geo_mean(x):
    x = np.asarray(x)
    dim = len(x.shape)
    x = np.atleast_2d(x)
    v = np.prod(x, axis=1) ** (1.0 / x.shape[1])
    return v[0] if dim == 1 else v
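Put together, the pieces above might look like the following sketch. The values n = 5 and k = 2 and the seeded generator rng are placeholders chosen only to make the example small and repeatable:

```python
import numpy as np

def geo_mean(x):
    # row-wise geometric mean; a 1-D input yields a scalar
    x = np.asarray(x, dtype=float)
    dim = x.ndim
    x = np.atleast_2d(x)
    v = np.prod(x, axis=1) ** (1.0 / x.shape[1])
    return v[0] if dim == 1 else v

n, k = 5, 2                                   # hypothetical size and column
a, b = np.ogrid[0:n, 0:n]
A = np.exp(1j * (np.pi / 3) * np.abs(a - b))
diag = np.arange(n)
A[diag, diag] = 1 - 1j / np.sqrt(3)

r = np.delete(diag, k)                        # all row indices except the diagonal one
rng = np.random.default_rng(0)
x = rng.uniform(0, 0.5, (r.size, 50))         # 50 random values per replaced row
A[r, k] = np.exp(1j * (np.pi / 3) * np.abs(r - k + geo_mean(x)))
```

Because r leaves out row k, the assignment never touches A[k, k], so the diagonal value survives.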

Calculating Pearson correlation and significance in Python

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.
You can have a look at scipy.stats:
from pydoc import help
from scipy.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
    Calculates a Pearson correlation coefficient and the p-value for testing
    non-correlation.

    The Pearson correlation coefficient measures the linear relationship
    between two datasets. Strictly speaking, Pearson's correlation requires
    that each dataset be normally distributed. Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear
    relationship. Positive correlations imply that as x increases, so does
    y. Negative correlations imply that as x increases, y decreases.

    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets. The p-values are not entirely
    reliable but are probably reasonable for datasets larger than 500 or so.

    Parameters
    ----------
    x : 1D array
    y : 1D array the same length as x

    Returns
    -------
    (Pearson's correlation coefficient,
     2-tailed p-value)

    References
    ----------
    http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
An alternative is scipy's linregress function, which calculates:
slope : slope of the regression line
intercept : intercept of the regression line
r-value : correlation coefficient
p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : Standard error of the estimate
And here is an example:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
will return you:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
def pearsonr(x, y):
    # Assume len(x) == len(y)
    n = len(x)
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi*xi for xi in x)
    sum_y_sq = sum(yi*yi for yi in y)
    psum = sum(xi*yi for xi, yi in zip(x, y))
    num = psum - (sum_x * sum_y/n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0: return 0
    return num / den
The following code is a straight-up interpretation of the definition:
import math
def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)
def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    return diffprod / math.sqrt(xdiff2 * ydiff2)
Test:
print pearson_def([1,2,3], [1,5,7])
returns
0.981980506062
This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
EDIT: Fixed a bug pointed out by a commenter.
You can do this with pandas.DataFrame.corr, too:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
This gives
0 1 2
0 1.000000 0.745601 0.916579
1 0.745601 1.000000 0.544248
2 0.916579 0.544248 1.000000
Rather than rely on numpy/scipy, I think my answer is the easiest to code, and it makes the steps in calculating the Pearson Correlation Coefficient (PCC) easy to follow.
import math
# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
        sum += i
    return sum / len(x)
# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
        sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))
# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []
    for i in x:
        scorex.append((i - mean(x))/sampleStandardDeviation(x))
    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))
    # multiplies both lists together into 1 list (hence zip) and sums the whole list
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
The significance of PCC is basically to show you how strongly correlated the two variables/lists are.
It is important to note that the PCC value ranges from -1 to 1.
A value between 0 to 1 denotes a positive correlation.
Value of 0 = highest variation (no correlation whatsoever).
A value between -1 to 0 denotes a negative correlation.
Pearson coefficient calculation using pandas in python:
I would suggest trying this approach since your data contains lists. It will be easy to interact with your data and manipulate it from the console since you can visualise your data structure and update it as you wish. You can also export the data set and save it and add new data out of the python console for later analysis. This code is simpler and contains less lines of code. I am assuming you need a few quick lines of code to screen your data for further analysis
Example:
data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}
import pandas as pd # to convert your lists into pandas DataFrames
df = pd.DataFrame(data, columns = ['list 1','list 2'])
from scipy import stats # For in-built method to get PCC
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results
However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.
Hmm, many of these responses have long and hard to read code...
I'd suggest using numpy with its nifty features when working with arrays:
import numpy as np
def pcc(X, Y):
    ''' Compute Pearson Correlation Coefficient. '''
    # Normalise X and Y (note: this modifies the input arrays in place)
    X -= X.mean(0)
    Y -= Y.mean(0)
    # Standardise X and Y
    X /= X.std(0)
    Y /= Y.std(0)
    # Compute mean product
    return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)
Here's a variant on mkh's answer that runs much faster than it, and scipy.stats.pearsonr, using numba.
import numba
@numba.jit
def corr(data1, data2):
    M = data1.size
    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M
    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])
    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M
    return (cross_mean - mean1 * mean2) / (std1 * std2)
This is an implementation of the Pearson correlation function using numpy:
def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()
    # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr
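The commented-out line and the active line in the function above are two algebraically equivalent ways of writing the covariance. A plain-Python sketch of both follows; the function names corr_centered and corr_raw_moment are invented for the comparison. On small, well-scaled data they agree, although the raw-moment form subtracts two potentially large numbers and can lose precision when the means are large compared with the spread:

```python
import math

def corr_centered(xs, ys):
    # covariance as the mean of (x - mean_x) * (y - mean_y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy)) / n
    sx = math.sqrt(sum(a * a for a in dx) / n)
    sy = math.sqrt(sum(b * b for b in dy) / n)
    return cov / (sx * sy)

def corr_raw_moment(xs, ys):
    # covariance as mean(x * y) - mean_x * mean_y
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - mx * my) / (sx * sy)

print(corr_centered([1, 2, 3], [1, 5, 7]))
print(corr_raw_moment([1, 2, 3], [1, 5, 7]))
```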
Here is an implementation for pearson correlation based on sparse vector. The vectors here are expressed as a list of tuples expressed as (index, value). The two sparse vectors can be of different length but over all vector size will have to be same. This is useful for text mining applications where the vector size is extremely large due to most features being bag of words and hence calculations are usually performed using sparse vectors.
from math import sqrt  # needed at module level for sqrt below

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")
    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)
    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset
    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))
    covariance_a_b = 0
    # calculate covariance for the sparse vectors
    for tuple in first_feature_vector:
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        indexed_feature_dict[tuple[0]] = tuple[1]
    count_of_features = 0
    for tuple in second_feature_vector:
        count_of_features += 1
        if len(tuple) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (tuple,))
        if tuple[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b))
            del (indexed_feature_dict[tuple[0]])
        else:
            covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b)
    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)
    # adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b
    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
Unit tests:
def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None)
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
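For comparison, the same idea can be sketched more compactly with dictionaries. This is an alternative formulation, not the answer's code; the name sparse_pearson is hypothetical, and it raises an error for a constant vector instead of returning -1:

```python
import math

def sparse_pearson(a, b, size):
    """Pearson r for two sparse vectors given as (index, value) pairs."""
    da, db = dict(a), dict(b)
    mean_a = sum(da.values()) / size
    mean_b = sum(db.values()) / size
    cov = var_a = var_b = 0.0
    touched = set(da) | set(db)              # indices present in either vector
    for i in touched:
        xa = da.get(i, 0.0) - mean_a
        xb = db.get(i, 0.0) - mean_b
        cov += xa * xb
        var_a += xa * xa
        var_b += xb * xb
    zeros = size - len(touched)              # positions where both vectors are 0
    cov += zeros * mean_a * mean_b           # (0 - mean_a) * (0 - mean_b) each
    var_a += zeros * mean_a ** 2
    var_b += zeros * mean_b ** 2
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_a * var_b)

print(sparse_pearson([(1, 1), (2, 2), (3, 3)], [(1, 1), (2, 5), (3, 7)], 3))
```

The implicit zeros never have to be stored: they all contribute the same amount, so they are added in one multiplication.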
I have a very simple and easy to understand solution for this. For two arrays of equal length, Pearson coefficient can be easily computed as follows:
def manual_pearson(a,b):
    """
    Accepts two arrays of equal length, and computes correlation coefficient.
    Numerator is the sum of product of (a - a_avg) and (b - b_avg),
    while denominator is the product of a_std and b_std multiplied by
    length of array.
    """
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
    p_coef = numerator/denominator
    return p_coef
Starting in Python 3.10, the Pearson’s correlation coefficient (statistics.correlation) is directly available in the standard library:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!
It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.
from scipy import stats

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
# (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
# if positive, p is the probability that there is no positive correlation in
# the population sampled by X and Y
# if negative, p is the probability that there is no negative correlation
# if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)
    if not direction:
        return (corr, prb_2_tail)
    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)
    return (corr, 1 - prb_1_tail)
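The two ingredients behind this function can be sketched without scipy. The helper names t_statistic and one_tailed are made up here, and the t formula is the textbook one for testing r against zero with n - 2 degrees of freedom; whether scipy computes its p-value in exactly this way is not something this sketch relies on:

```python
import math

def t_statistic(r, n):
    """t value for testing r against zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def one_tailed(r, p_two_tailed, direction):
    """Convert a two-tailed p-value to a one-tailed one for the given direction."""
    half = p_two_tailed / 2.0
    # if r points the way we are testing, the p-value halves;
    # otherwise the complement of that half is returned
    return half if r * direction > 0 else 1.0 - half

print(t_statistic(0.5, 27))
print(one_tailed(0.5, 0.2, 1))    # 0.1
```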
You can take a look at this article. This is a well-documented example for calculating correlation based on historical forex currency pairs data from multiple files using pandas library (for Python), and then generating a heatmap plot using seaborn library.
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
Calculating Correlation:
Correlation - measures similarity of two different variables
Using pearson correlation
from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column
Using Spearman correlation
from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column
Using Kendall correlation
kendall_correlation=final_data.corr(method='kendall')
kendall_correlation
def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules.
    It is assumed that the predictions are not constant.
    Returns the average of each sample's Pearson correlation coefficient"""
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0); y2 /= y2.std(axis=0)
    y1 = y_true.copy();
    y1 -= y1.mean(axis=0); y1 /= y1.std(axis=0)
    c = (y1*y2).mean().mean()# Correlation for rescaled matrices is just matrix product and average
    return c
def pearson(x,y):
    n=len(x)
    vals=range(n)
    sumx=sum([float(x[i]) for i in vals])
    sumy=sum([float(y[i]) for i in vals])
    sumxSq=sum([x[i]**2.0 for i in vals])
    sumySq=sum([y[i]**2.0 for i in vals])
    pSum=sum([x[i]*y[i] for i in vals])
    # Calculating Pearson correlation
    num=pSum-(sumx*sumy/n)
    den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
    if den==0: return 0
    r=num/den
    return r
