Calculating Pearson correlation and significance in Python


I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

You can have a look at scipy.stats:
from pydoc import help
from scipy.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]
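As a minimal sketch of what corrcoef returns: it builds the full 2x2 correlation matrix for the two inputs, so the off-diagonal entry [0, 1] is the coefficient between them. The sample lists are only illustrative (they reuse the [1,2,3] / [1,5,7] pair tested elsewhere on this page). Note that corrcoef does not return a p-value.

```python
import numpy as np

list1 = [1, 2, 3]
list2 = [1, 5, 7]

matrix = np.corrcoef(list1, list2)   # 2x2 symmetric matrix, ones on the diagonal
r = matrix[0, 1]                     # correlation between list1 and list2

print(matrix.shape)
print(r)
```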

An alternative is scipy.stats.linregress, which returns:
slope : slope of the regression line
intercept : intercept of the regression line
r-value : correlation coefficient
p-value : two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero
stderr : standard error of the estimated slope
And here is an example:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
which returns:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

If you don't feel like installing scipy, I've used this quick hack, slightly modified from Programming Collective Intelligence:
def pearsonr(x, y):
    # Assume len(x) == len(y)
    n = len(x)
    sum_x = float(sum(x))
    sum_y = float(sum(y))
    sum_x_sq = sum(xi*xi for xi in x)
    sum_y_sq = sum(yi*yi for yi in y)
    psum = sum(xi*yi for xi, yi in zip(x, y))
    num = psum - (sum_x * sum_y/n)
    den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
    if den == 0: return 0
    return num / den
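The hack above returns only the coefficient, not its significance. One dependency-free way to estimate a p-value is a permutation test, sketched below. This is a different technique from the t-based p-value that scipy reports, and the names pearson_r and permutation_p_value, the permutation count, and the seed are all assumptions made for illustration.

```python
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value estimated by shuffling y (hypothetical helper)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # count shuffles whose correlation is at least as extreme as the observed one
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)

x = list(range(10))
y = [2 * v + 1 for v in x]   # perfectly linear, so r should be 1
r = pearson_r(x, y)
p = permutation_p_value(x, y)
print(r, p)
```

The estimate has a resolution of about 1 / n_permutations, so very small p-values need more shuffles.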

The following code is a straight-up interpretation of the definition:
import math
def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)
def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    return diffprod / math.sqrt(xdiff2 * ydiff2)
Test:
print(round(pearson_def([1,2,3], [1,5,7]), 12))
returns
0.981980506062
This agrees with Excel, this calculator, SciPy (also NumPy), which return 0.981980506 and 0.9819805060619657, and 0.98198050606196574, respectively.
R:
> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805
EDIT: Fixed a bug pointed out by a commenter.

You can do this with pandas.DataFrame.corr, too:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
This gives
          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

Rather than relying on numpy/scipy, I think this answer is the easiest to code, and it makes the steps in calculating the Pearson Correlation Coefficient (PCC) easy to follow.
import math
# calculates the mean
def mean(x):
    total = 0.0
    for i in x:
        total += i
    return total / len(x)
# calculates the sample standard deviation
def sampleStandardDeviation(x):
    m = mean(x)
    sumv = 0.0
    for i in x:
        sumv += (i - m)**2
    return math.sqrt(sumv/(len(x)-1))
# calculates the PCC using both the 2 functions above
def pearson(x,y):
    # compute the means and standard deviations once, not once per element
    mean_x, mean_y = mean(x), mean(y)
    sd_x, sd_y = sampleStandardDeviation(x), sampleStandardDeviation(y)
    scorex = [(i - mean_x)/sd_x for i in x]
    scorey = [(j - mean_y)/sd_y for j in y]
    # multiplies both lists together into 1 list (hence zip) and sums the whole list
    return sum(i*j for i,j in zip(scorex,scorey))/(len(x)-1)
The PCC shows how strongly, and in which direction, the two variables/lists are linearly related. (This is the strength of the correlation, not its statistical significance, which is given by a p-value.)
It is important to note that the PCC value ranges from -1 to 1.
A value between 0 and 1 denotes a positive correlation.
A value of 0 denotes no linear correlation.
A value between -1 and 0 denotes a negative correlation.
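To illustrate these ranges, here is a small dependency-free sketch. The helper name pearson_r and the data are made up for illustration. The last case shows that a coefficient of 0 only rules out a linear relationship: y is fully determined by x there, yet the coefficient is 0.

```python
def pearson_r(x, y):
    """Pearson correlation from the textbook definition (inputs must not be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [-2, -1, 0, 1, 2]
increasing = [2 * v + 3 for v in x]   # exact increasing line
decreasing = [-v for v in x]          # exact decreasing line
parabola = [v * v for v in x]         # y depends on x, but not linearly

r_up = pearson_r(x, increasing)
r_down = pearson_r(x, decreasing)
r_curve = pearson_r(x, parabola)
print(r_up, r_down, r_curve)
```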

Pearson coefficient calculation using pandas in python:
I would suggest trying this approach since your data is in lists. With a DataFrame it is easy to interact with your data and manipulate it from the console, since you can visualise your data structure and update it as you wish. You can also export and save the data set, and add new data outside the Python console for later analysis. This code is simpler and contains fewer lines. I am assuming you need a few quick lines of code to screen your data for further analysis.
Example:
data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]}
import pandas as pd # to convert your lists into a pandas DataFrame
df = pd.DataFrame(data, columns = ['list 1','list 2'])
from scipy import stats # For in-built method to get PCC
pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on
print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results
However, you did not post your data for me to see the size of the data set or the transformations that might be needed before the analysis.

Hmm, many of these responses have long and hard to read code...
I'd suggest using numpy with its nifty features when working with arrays:
import numpy as np
def pcc(X, Y):
    ''' Compute Pearson Correlation Coefficient. '''
    # Centre X and Y (new arrays, so the caller's data is not modified)
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    # Standardise X and Y
    X = X / X.std(0)
    Y = Y / Y.std(0)
    # Compute mean product
    return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in range(100)])
Y = np.array([random() for x in range(100)])
pcc(X, Y)

Here's a variant on mkh's answer that runs much faster than it, and scipy.stats.pearsonr, using numba.
import numba
@numba.jit
def corr(data1, data2):
    M = data1.size
    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M
    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])
    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M
    return (cross_mean - mean1 * mean2) / (std1 * std2)

This is an implementation of the Pearson correlation function using numpy:
def corr(data1, data2):
    "data1 & data2 should be numpy arrays."
    mean1 = data1.mean()
    mean2 = data2.mean()
    std1 = data1.std()
    std2 = data2.std()
    # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2)
    corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2)
    return corr
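The answers on this page mix two normalisations: numpy's std() defaults to the population standard deviation (divide by n), while other answers divide by n - 1. As a hedged sketch, the standard-library comparison below suggests that the choice does not change the coefficient as long as the covariance and both standard deviations use the same one. The function names are made up for this illustration.

```python
from statistics import mean, pstdev, stdev

def pearson_population(x, y):
    """Mean of centred products divided by population standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def pearson_sample(x, y):
    """Same formula with the n - 1 (sample) normalisation throughout."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

x = [1.0, 2.0, 3.0]
y = [1.0, 5.0, 7.0]
r_pop = pearson_population(x, y)
r_samp = pearson_sample(x, y)
print(r_pop, r_samp)
```

Mixing the two (for example a population covariance with sample standard deviations) would scale the result by (n - 1) / n, so it is worth checking which convention a helper uses.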

Here is an implementation of Pearson correlation for sparse vectors. Each vector is expressed as a list of (index, value) tuples. The two sparse vectors can have different lengths, but the overall vector size must be the same. This is useful for text mining applications, where the vector size is extremely large because most features are bag-of-words counts, so calculations are usually performed on sparse vectors.
from math import sqrt
def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0):
    indexed_feature_dict = {}
    if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0:
        raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation")
    sum_a = sum(value for index, value in first_feature_vector)
    sum_b = sum(value for index, value in second_feature_vector)
    avg_a = float(sum_a) / length_of_featureset
    avg_b = float(sum_b) / length_of_featureset
    mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + ((
        length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2)))
    mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + ((
        length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2)))
    covariance_a_b = 0
    # calculate covariance for the sparse vectors
    for item in first_feature_vector:
        if len(item) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (item,))
        indexed_feature_dict[item[0]] = item[1]
    count_of_features = 0
    for item in second_feature_vector:
        count_of_features += 1
        if len(item) != 2:
            raise ValueError("Invalid feature frequency tuple in featureVector: %s" % (item,))
        if item[0] in indexed_feature_dict:
            covariance_a_b += ((indexed_feature_dict[item[0]] - avg_a) * (item[1] - avg_b))
            del (indexed_feature_dict[item[0]])
        else:
            covariance_a_b += (0 - avg_a) * (item[1] - avg_b)
    for index in indexed_feature_dict:
        count_of_features += 1
        covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b)
    # adjust covariance with rest of vector with 0 value
    covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b
    if mean_sq_error_a == 0 or mean_sq_error_b == 0:
        return -1
    else:
        return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
Unit tests:
def test_get_get_pearson_corelation(self):
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3)
    vector_a = [(1, 1), (2, 2), (3, 3)]
    vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)]
    self.assertAlmostEqual(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3)
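For comparison, here is a shorter sketch of the same idea using dictionaries. It is not the method above: it swaps the centred loops for the sum-of-products formula, which only needs the non-zero entries plus the full vector length. The function name sparse_pearson and the {index: value} representation are assumptions for illustration; the expected values come from the unit tests above.

```python
from math import sqrt

def sparse_pearson(vec_a, vec_b, size):
    """Pearson correlation of two sparse vectors given as {index: value} dicts.

    Indices missing from a dict count as zeros; `size` is the full vector length.
    """
    sum_a = sum(vec_a.values())
    sum_b = sum(vec_b.values())
    sum_a_sq = sum(v * v for v in vec_a.values())
    sum_b_sq = sum(v * v for v in vec_b.values())
    # only indices present in both vectors contribute to the dot product
    dot = sum(v * vec_b[i] for i, v in vec_a.items() if i in vec_b)
    num = dot - sum_a * sum_b / size
    den = sqrt((sum_a_sq - sum_a ** 2 / size) * (sum_b_sq - sum_b ** 2 / size))
    if den == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return num / den

a = {1: 1, 2: 2, 3: 3}
b = {1: 1, 2: 5, 3: 7}
b_longer = {1: 1, 2: 5, 3: 7, 4: 14}
r_dense = sparse_pearson(a, b, 3)
r_padded = sparse_pearson(a, b_longer, 5)
print(r_dense, r_padded)
```

Raising an error for a constant vector is a design choice of this sketch; the method above returns -1 in that case.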

I have a very simple and easy to understand solution for this. For two arrays of equal length, the Pearson coefficient can be easily computed as follows:
import numpy as np
def manual_pearson(a,b):
    """
    Accepts two arrays of equal length, and computes correlation coefficient.
    Numerator is the sum of product of (a - a_avg) and (b - b_avg),
    while denominator is the product of a_std and b_std multiplied by
    length of array.
    """
    # convert to float arrays so that plain lists work too
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a_avg, b_avg = np.average(a), np.average(b)
    a_stdev, b_stdev = np.std(a), np.std(b)
    n = len(a)
    denominator = a_stdev * b_stdev * n
    numerator = np.sum(np.multiply(a-a_avg, b-b_avg))
    p_coef = numerator/denominator
    return p_coef

Starting in Python 3.10, Pearson’s correlation coefficient (statistics.correlation) is directly available in the standard library:
from statistics import correlation
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

You may wonder how to interpret your probability in the context of looking for a correlation in a particular direction (negative or positive correlation.) Here is a function I wrote to help with that. It might even be right!
It's based on info I gleaned from http://www.vassarstats.net/rsig.html and http://en.wikipedia.org/wiki/Student%27s_t_distribution, thanks to other answers posted here.
from scipy import stats
# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#   (r, p),
# where r is the Pearson correlation coefficient, and p is the p-value for
# the null hypothesis of no correlation in the given direction.
#
# direction:
#   if positive, p is the one-tailed p-value for a positive correlation in
#     the population sampled by X and Y
#   if negative, p is the one-tailed p-value for a negative correlation
#   if 0, p is the two-tailed p-value (correlation in either direction)
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)
    if not direction:
        return (corr, prb_2_tail)
    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)
    return (corr, 1 - prb_1_tail)
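The halving step above relies on the t-test behind the p-value being symmetric. As a hedged sketch, the tails can also be computed directly from r and the sample size using t = r * sqrt((n - 2) / (1 - r^2)) and Student's t distribution with n - 2 degrees of freedom. The function name pearson_p_values is made up for this illustration; the r value and sample size are taken from the linregress example earlier on this page.

```python
from math import sqrt
from scipy import stats

def pearson_p_values(r, n):
    """Return (two_sided, greater, less) p-values for a sample correlation r from n pairs.

    Assumes |r| < 1 and n > 2; r = +/-1 would divide by zero.
    """
    dof = n - 2
    t = r * sqrt(dof / (1.0 - r * r))
    p_greater = stats.t.sf(t, dof)              # alternative: positive correlation
    p_less = stats.t.cdf(t, dof)                # alternative: negative correlation
    p_two_sided = 2 * stats.t.sf(abs(t), dof)   # alternative: any correlation
    return p_two_sided, p_greater, p_less

r = 0.14499815458068521   # rvalue reported by linregress(a, b) above
n = 10                    # number of pairs in that example
two_sided, greater, less = pearson_p_values(r, n)
print(two_sided, greater, less)
```

Recent SciPy versions also accept an alternative argument in stats.pearsonr for one-sided tests, which may be simpler if your installed version supports it.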

You can take a look at this article. This is a well-documented example for calculating correlation based on historical forex currency pairs data from multiple files using pandas library (for Python), and then generating a heatmap plot using seaborn library.
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

Calculating Correlation:
Correlation - measures similarity of two different variables
Using Pearson correlation
# final_data is a DataFrame with n columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# shows the n x n matrix of pairwise column correlations
Using Spearman correlation
# final_data is a DataFrame with n columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# shows the n x n matrix of pairwise column correlations
Using Kendall correlation
kendall_correlation = final_data.corr(method='kendall')
kendall_correlation

def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules.
    It is assumed that the predictions are not constant.
    Returns the average of each sample's Pearson correlation coefficient"""
    y2 = y_pred.copy()
    y2 -= y2.mean(axis=0); y2 /= y2.std(axis=0)
    y1 = y_true.copy()
    y1 -= y1.mean(axis=0); y1 /= y1.std(axis=0)
    c = (y1*y2).mean().mean()  # Correlation for rescaled matrices is just matrix product and average
    return c

def pearson(x,y):
    n=len(x)
    vals=range(n)
    sumx=sum([float(x[i]) for i in vals])
    sumy=sum([float(y[i]) for i in vals])
    sumxSq=sum([x[i]**2.0 for i in vals])
    sumySq=sum([y[i]**2.0 for i in vals])
    pSum=sum([x[i]*y[i] for i in vals])
    # Calculating Pearson correlation
    num=pSum-(sumx*sumy/n)
    den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5
    if den==0: return 0
    r=num/den
    return r

Related

In Python, how do I add an incrementing element of a polynomial?

I am constructing a Newton polynomial for a given simple sine function. I have implemented the intermediate calculations, but I am stuck at the final stage: obtaining the formula of the polynomial. Recursion may help here, but I am not sure. Here is the formula of the polynomial
The formula iterates over the values from the table below: we go through the column of x's and the first row of the calculated deltas (up to the delta whose order equals the degree of the polynomial). For example, if the degree is 2, then we will take 2 deltas in the first row and values up to 2.512 in the column of x (9 brackets with x differences will be in the last block of the polynomial)
In the formula, there is a set of constant blocks where values are iterated through, but I have a snag in the element (x - x_0)**[n]. Here n is the degree of the polynomial that the user sets, and [n] means that the expression in the parentheses is expanded:
I use the sympy library for symbolic calculations: x in the formula of the future polynomial should remain x (as a symbol, not its value). How can I implement the part of the block that repeats in the polynomial and grows by a new bracket with each degree of the polynomial?
Code:
import numpy as np
from sympy import *
import pandas as pd
from scipy.special import factorial
def func(x):
    return np.sin(x)
def poly(order):
    # building columns X and Y:
    x_i_list = [round( (0.1*np.pi*i), 4 ) for i in range(0, 11)]
    y_i_list = []
    for x in x_i_list:
        y_i = round( (func(x)), 4 )
        y_i_list.append(y_i)
    # we get deltas:
    n=order
    if n < len(y_i_list):
        result = [ np.diff(y_i_list, n=d) for d in np.arange(1, len(y_i_list)) ]
        print(result)
    else:
        print(f'Determine the order of the polynomial less than {len(y_i_list)}')
    # We determine the index in the x column based on the degree of the polynomial:
    delta_index=len(result[order-1])-1
    x_index = delta_index
    h = (x_i_list[x_index] - x_i_list[0]) / n # calculate h
    b=x_i_list[x_index]
    a=x_i_list[0]
    y_0=x_i_list[0]
    string_one = [] # list with deltas of the first row (including the degree column of the polynomial)
    for elen in result:
        string_one.append(round(elen[0], 4))
    # creating a list for the subsequent passage through the x's
    x_col_list = []
    for col in x_i_list:
        if col <= x_i_list[x_index]:
            x_col_list.append(col)
    x = Symbol('x') # for symbolic representation of x's
    # we go along the deltas of the first line:
    for delta in string_one:
        # we go along the column of x's
        for arg in x_col_list:
            for n in range(1, order+1):
                polynom = ( delta/(factorial(n)*h**n) )*(x - arg) # Here I stopped

I guess you're looking for something like this:
In [52]: from sympy import symbols, prod
In [53]: x = symbols('x')
In [54]: nums = [1, 2, 3, 4]
In [55]: prod((x-n) for n in nums)
Out[55]: (x - 4)⋅(x - 3)⋅(x - 2)⋅(x - 1)
EDIT: Actually it's more efficient to do this with Mul (from sympy import Mul) rather than prod:
In [134]: Mul(*((x-n) for n in nums))
Out[134]: (x - 4)⋅(x - 3)⋅(x - 2)⋅(x - 1)
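To connect this back to the question, here is a sketch of how the growing product can be accumulated term by term in a Newton forward-difference polynomial, P(x) = y_0 + sum over n of (delta^n y_0 / (n! h^n)) * (x - x_0)...(x - x_{n-1}). It is a plain numeric version, not the asker's code, and the helper names are hypothetical. Because it only uses arithmetic operators, passing a sympy Symbol as x should give a symbolic expression instead of a number, although that variant is not shown here.

```python
from math import factorial

def forward_differences(ys, order):
    """Return [delta^1 y_0, ..., delta^order y_0] for equally spaced samples."""
    deltas = []
    row = list(ys)
    for _ in range(order):
        row = [b - a for a, b in zip(row, row[1:])]
        deltas.append(row[0])
    return deltas

def newton_forward(xs, ys, order, x):
    """Evaluate the Newton forward-difference polynomial of the given order at x."""
    h = xs[1] - xs[0]     # assumes equally spaced nodes
    result = ys[0]
    product = 1           # running product (x - x_0)(x - x_1)...
    for n, delta in enumerate(forward_differences(ys, order), start=1):
        product = product * (x - xs[n - 1])   # gains one new bracket per term
        result = result + delta / (factorial(n) * h ** n) * product
    return result

xs = [0.0, 1.0, 2.0, 3.0]
ys = [v * v for v in xs]              # samples of y = x^2
value = newton_forward(xs, ys, 2, 1.5)
print(value)
```

Keeping a running product avoids rebuilding all brackets for every term, which is the part the nested loops in the question were missing.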

Calculate probability 2 random people are in the same group?

In my dataset, there are N people who are each assigned to one of 3 groups (groups = {A, B, C}). I want to find the probability that two random people, n_1 and n_2, belong to the same group.
I have data on each of these groups and how many people belong to them. Importantly, each group is of a different size.
import pandas as pd
import numpy as np
import math
data = {
"Group": ['A', 'B', 'C'],
"Count": [20, 10, 5],
}
df = pd.DataFrame(data)
  Group  Count
0     A     20
1     B     10
2     C      5
I think I know how to get the sample space, S but I am unsure how to get the numerator.
def nCk(n,k):
    f = math.factorial
    return f(n) / f(k) / f(n-k)
n = sum(df['Count'])
k = 2
s = nCk(n, k)

My discrete mathematics skills are a bit rusty so feel free to correct me. You have N people split into groups of sizes s_1, ..., s_n so that N = s_1 + ... + s_n.
1. The chance of one random person belonging to group i is s_i / N
2. The chance of a second person being in group i is (s_i - 1) / (N - 1)
3. The chance of both being in group i is s_i / N * (s_i - 1) / (N - 1)
4. The probability of them being together in any group is the sum of the probabilities in #3 across all groups.
Code:
import numpy as np
s = df['Count'].values
n = s.sum()
prob = np.sum(s/n * (s-1)/(n-1)) # 0.4117647058823529
We can generalize this solution to "the probability of k people all being in the same group":
k = 2
i = np.arange(k)[:, None]
tmp = (s-i) / (n-i)
prob = np.prod(tmp, axis=0).sum()
When k > s.max() (20 in this case), the answer is 0 because you cannot fit all of them in one group. When k > s.sum() (35 in this case), the result is nan.
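The same generalisation can be written with binomial coefficients: the number of ways to pick k people from one group, summed over groups, divided by the number of ways to pick k people overall. The sketch below assumes Python 3.8+ (for math.comb) and uses a made-up function name. It returns an exact Fraction, and it raises an error instead of producing nan when k exceeds the population.

```python
from fractions import Fraction
from math import comb

def prob_all_same_group(sizes, k):
    """Probability that k people drawn without replacement share one group."""
    total = sum(sizes)
    if k > total:
        raise ValueError("cannot draw more people than exist")
    # comb(size, k) is 0 when k > size, so groups that are too small drop out
    favourable = sum(comb(size, k) for size in sizes)
    return Fraction(favourable, comb(total, k))

p2 = prob_all_same_group([20, 10, 5], 2)
print(p2, float(p2))
```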

I will answer your problem using the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of k successes (random draws for which the object drawn has a specified feature) in n draws, without replacement, from a finite population of size N that contains exactly K objects with that feature, wherein each draw is either a success or a failure. In contrast, the binomial distribution describes the probability of k successes in n draws with replacement.
So the total probability should be the probability of both belonging to A + probability of both belonging to B + probability of both belonging to C.
This means
P(A) = (nCk(20,2) * nCk(15,0))/nCk(35,2)
P(B) = (nCk(10,2) * nCk(25,0))/nCk(35,2)
P(C) = (nCk(5,2) * nCk(30,0)) / nCk(35,2)
In code terms:
import pandas as pd
import numpy as np
import math
data = {
"Group": ['A', 'B', 'C'],
"Count": [20, 10, 5],
}
df = pd.DataFrame(data)
def nCk(n,k):
    f = math.factorial
    return f(n) / f(k) / f(n-k)
samples = 2
successes = 2
observations = int(df['Count'].sum())
counts = df.set_index('Group')['Count']
def p_same_group(group):
    # hypergeometric probability that both sampled people come from `group`
    count = int(counts[group])
    return (nCk(count, samples) * nCk(observations - count, samples - successes)) / nCk(observations, samples)
p_a = p_same_group('A')
p_b = p_same_group('B')
p_c = p_same_group('C')
proba = p_a + p_b + p_c
print(proba)
Output:
0.41176470588235287
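A quick way to sanity-check a result like this is a simulation. The sketch below draws two distinct people at random many times and counts how often they share a group. The function name, trial count, and seed are arbitrary choices for illustration, and the estimate should only be expected to land near 0.41, not to match it exactly.

```python
import random

def simulate_same_group(counts, trials=100_000, seed=0):
    """Estimate the probability that two randomly drawn people share a group."""
    rng = random.Random(seed)
    # one label per person, e.g. 20 x "A", 10 x "B", 5 x "C"
    population = [group for group, count in counts.items() for _ in range(count)]
    same = 0
    for _ in range(trials):
        first, second = rng.sample(population, 2)   # draw without replacement
        if first == second:
            same += 1
    return same / trials

estimate = simulate_same_group({"A": 20, "B": 10, "C": 5})
print(estimate)
```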

How to perform cubic spline interpolation in python?

I have two lists to describe the function y(x):
x = [0,1,2,3,4,5]
y = [12,14,22,39,58,77]
I would like to perform cubic spline interpolation so that given some value u in the domain of x, e.g.
u = 1.25
I can find y(u).
I found this in SciPy but I am not sure how to use it.

Short answer:
from scipy import interpolate
def f(x):
    x_points = [ 0, 1, 2, 3, 4, 5]
    y_points = [12,14,22,39,58,77]
    tck = interpolate.splrep(x_points, y_points)
    return interpolate.splev(x, tck)
print(f(1.25))
Long answer:
scipy separates the steps involved in spline interpolation into two operations, most likely for computational efficiency.
1. The coefficients describing the spline curve are computed using splrep(). splrep returns a tuple (t, c, k) containing the knot vector, the B-spline coefficients, and the degree of the spline.
2. These coefficients are passed into splev() to actually evaluate the spline at the desired point x (in this example 1.25). x can also be an array. Calling f([1.0, 1.25, 1.5]) returns the interpolated points at 1, 1.25, and 1.5, respectively.
This approach is admittedly inconvenient for single evaluations, but since the most common use case is to start with a handful of function evaluation points, then to repeatedly use the spline to find interpolated values, it is usually quite useful in practice.
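To make the two-step split concrete, the sketch below fits the spline once and reuses the stored coefficients for every evaluation, instead of calling splrep inside f on each call. It uses the same data as the question. The variable names single and several are arbitrary.

```python
from scipy import interpolate

x_points = [0, 1, 2, 3, 4, 5]
y_points = [12, 14, 22, 39, 58, 77]

# Step 1: fit once. tck is the (knots, coefficients, degree) tuple.
tck = interpolate.splrep(x_points, y_points)

def f(u):
    # Step 2: evaluate as often as needed, reusing the stored tck
    return interpolate.splev(u, tck)

single = float(f(1.25))
several = f([1.0, 1.25, 1.5])
print(single)
print(several)
```

With the default settings the spline interpolates the data, so evaluating at one of the original x values should return the corresponding y value up to rounding.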

In case scipy is not installed:
import numpy as np
from math import sqrt
def cubic_interp1d(x0, x, y):
    """
    Interpolate a 1-D function using cubic splines.
      x0 : a float or an 1d-array
      x : (N,) array_like
          A 1-D array of real/complex values.
      y : (N,) array_like
          A 1-D array of real values. The length of y along the
          interpolation axis must be equal to the length of x.
    Implement a trick to generate at first step the cholesky matrix L of
    the tridiagonal matrix A (thus L is a bidiagonal matrix that
    can be solved in two distinct loops).
    additional ref: www.math.uh.edu/~jingqiu/math4364/spline.pdf
    """
    # np.asfarray was removed in NumPy 2.0; asarray with dtype=float is equivalent here
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # remove non finite values
    # indexes = np.isfinite(x)
    # x = x[indexes]
    # y = y[indexes]
    # check if sorted
    if np.any(np.diff(x) < 0):
        indexes = np.argsort(x)
        x = x[indexes]
        y = y[indexes]
    size = len(x)
    xdiff = np.diff(x)
    ydiff = np.diff(y)
    # allocate buffer matrices
    Li = np.empty(size)
    Li_1 = np.empty(size-1)
    z = np.empty(size)
    # fill diagonals Li and Li-1 and solve [L][y] = [B]
    Li[0] = sqrt(2*xdiff[0])
    Li_1[0] = 0.0
    B0 = 0.0 # natural boundary
    z[0] = B0 / Li[0]
    for i in range(1, size-1, 1):
        Li_1[i] = xdiff[i-1] / Li[i-1]
        Li[i] = sqrt(2*(xdiff[i-1]+xdiff[i]) - Li_1[i-1] * Li_1[i-1])
        Bi = 6*(ydiff[i]/xdiff[i] - ydiff[i-1]/xdiff[i-1])
        z[i] = (Bi - Li_1[i-1]*z[i-1])/Li[i]
    i = size - 1
    Li_1[i-1] = xdiff[-1] / Li[i-1]
    Li[i] = sqrt(2*xdiff[-1] - Li_1[i-1] * Li_1[i-1])
    Bi = 0.0 # natural boundary
    z[i] = (Bi - Li_1[i-1]*z[i-1])/Li[i]
    # solve [L.T][x] = [y]
    i = size-1
    z[i] = z[i] / Li[i]
    for i in range(size-2, -1, -1):
        z[i] = (z[i] - Li_1[i-1]*z[i+1])/Li[i]
    # find index (clip returns a new value, so a scalar x0 works as well)
    index = x.searchsorted(x0)
    index = np.clip(index, 1, size-1)
    xi1, xi0 = x[index], x[index-1]
    yi1, yi0 = y[index], y[index-1]
    zi1, zi0 = z[index], z[index-1]
    hi1 = xi1 - xi0
    # calculate cubic
    f0 = zi0/(6*hi1)*(xi1-x0)**3 + \
         zi1/(6*hi1)*(x0-xi0)**3 + \
         (yi1/hi1 - zi1*hi1/6)*(x0-xi0) + \
         (yi0/hi1 - zi0*hi1/6)*(xi1-x0)
    return f0
if __name__ == '__main__':
    import matplotlib.pyplot as plt
    x = np.linspace(0, 10, 11)
    y = np.sin(x)
    plt.scatter(x, y)
    x_new = np.linspace(0, 10, 201)
    plt.plot(x_new, cubic_interp1d(x_new, x, y))
    plt.show()

If you have scipy version >= 0.18.0 installed, you can use the CubicSpline class from scipy.interpolate for cubic spline interpolation.
You can check the scipy version by running the following commands in python:
#!/usr/bin/env python3
import scipy
print(scipy.__version__)
If your scipy version is >= 0.18.0, you can run the following example code for cubic spline interpolation:
#!/usr/bin/env python3
import numpy as np
from scipy.interpolate import CubicSpline
# calculate 5 natural cubic spline polynomials for 6 points
# (x,y) = (0,12) (1,14) (2,22) (3,39) (4,58) (5,77)
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([12,14,22,39,58,77])
# calculate natural cubic spline polynomials
cs = CubicSpline(x,y,bc_type='natural')
# show values of interpolation function at x=1.25
print('S(1.25) = ', cs(1.25))
## Additional - find polynomial coefficients for different x regions
# if you want to print polynomial coefficients in form
# S0(0<=x<=1) = a0 + b0(x-x0) + c0(x-x0)^2 + d0(x-x0)^3
# S1(1< x<=2) = a1 + b1(x-x1) + c1(x-x1)^2 + d1(x-x1)^3
# ...
# S4(4< x<=5) = a4 + b4(x-x4) + c4(x-x4)^2 + d4(x-x4)^3
# x0 = 0; x1 = 1; x4 = 4; (start of x region interval)
# show values of a0, b0, c0, d0, a1, b1, c1, d1 ...
cs.c
# Polynomial coefficients for 0 <= x <= 1
a0 = cs.c.item(3,0)
b0 = cs.c.item(2,0)
c0 = cs.c.item(1,0)
d0 = cs.c.item(0,0)
# Polynomial coefficients for 1 < x <= 2
a1 = cs.c.item(3,1)
b1 = cs.c.item(2,1)
c1 = cs.c.item(1,1)
d1 = cs.c.item(0,1)
# ...
# Polynomial coefficients for 4 < x <= 5
a4 = cs.c.item(3,4)
b4 = cs.c.item(2,4)
c4 = cs.c.item(1,4)
d4 = cs.c.item(0,4)
# Print polynomial equations for different x regions
print('S0(0<=x<=1) = ', a0, ' + ', b0, '(x-0) + ', c0, '(x-0)^2 + ', d0, '(x-0)^3')
print('S1(1< x<=2) = ', a1, ' + ', b1, '(x-1) + ', c1, '(x-1)^2 + ', d1, '(x-1)^3')
print('...')
print('S4(4< x<=5) = ', a4, ' + ', b4, '(x-4) + ', c4, '(x-4)^2 + ', d4, '(x-4)^3')
# So we can calculate S(1.25) by using equation S1(1< x<=2)
print('S(1.25) = ', a1 + b1*0.25 + c1*(0.25**2) + d1*(0.25**3))
# Cubic spline interpolation calculus example
# https://www.youtube.com/watch?v=gT7F3TWihvk

Just putting this here if you want a dependency-free solution.
Code taken from an answer above: https://stackoverflow.com/a/48085583/36061
from math import sqrt
def my_cubic_interp1d(x0, x, y):
    """
    Interpolate a 1-D function using cubic splines.
      x0 : a 1d-array of floats to interpolate at
      x : a 1-D array of floats sorted in increasing order
      y : A 1-D array of floats. The length of y along the
          interpolation axis must be equal to the length of x.
    Implement a trick to generate at first step the cholesky matrix L of
    the tridiagonal matrix A (thus L is a bidiagonal matrix that
    can be solved in two distinct loops).
    additional ref: www.math.uh.edu/~jingqiu/math4364/spline.pdf
    # original function code at: https://stackoverflow.com/a/48085583/36061
    This function is licenced under: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
    https://creativecommons.org/licenses/by-sa/3.0/
    Original Author raphael valentin
    Date 3 Jan 2018
    Modifications made to remove numpy dependencies:
        -all sub-functions by MR
    This function, and all sub-functions, are licenced under: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
    Mod author: Matthew Rowles
    Date 3 May 2021
    """
    def diff(lst):
        """
        numpy.diff with default settings
        """
        size = len(lst)-1
        r = [0]*size
        for i in range(size):
            r[i] = lst[i+1] - lst[i]
        return r
    def list_searchsorted(listToInsert, insertInto):
        """
        numpy.searchsorted with default settings
        """
        def float_searchsorted(floatToInsert, insertInto):
            for i in range(len(insertInto)):
                if floatToInsert <= insertInto[i]:
                    return i
            return len(insertInto)
        return [float_searchsorted(i, insertInto) for i in listToInsert]
    def clip(lst, min_val, max_val, inPlace = False):
        """
        numpy.clip
        """
        if not inPlace:
            lst = lst[:]
        for i in range(len(lst)):
            if lst[i] < min_val:
                lst[i] = min_val
            elif lst[i] > max_val:
                lst[i] = max_val
        return lst
    def subtract(a,b):
        """
        returns a - b
        """
        return a - b
    size = len(x)
    xdiff = diff(x)
    ydiff = diff(y)
    # allocate buffer matrices
    Li = [0]*size
    Li_1 = [0]*(size-1)
    z = [0]*(size)
    # fill diagonals Li and Li-1 and solve [L][y] = [B]
    Li[0] = sqrt(2*xdiff[0])
    Li_1[0] = 0.0
    B0 = 0.0 # natural boundary
    z[0] = B0 / Li[0]
    for i in range(1, size-1, 1):
        Li_1[i] = xdiff[i-1] / Li[i-1]
        Li[i] = sqrt(2*(xdiff[i-1]+xdiff[i]) - Li_1[i-1] * Li_1[i-1])
        Bi = 6*(ydiff[i]/xdiff[i] - ydiff[i-1]/xdiff[i-1])
        z[i] = (Bi - Li_1[i-1]*z[i-1])/Li[i]
    i = size - 1
    Li_1[i-1] = xdiff[-1] / Li[i-1]
    Li[i] = sqrt(2*xdiff[-1] - Li_1[i-1] * Li_1[i-1])
    Bi = 0.0 # natural boundary
    z[i] = (Bi - Li_1[i-1]*z[i-1])/Li[i]
    # solve [L.T][x] = [y]
    i = size-1
    z[i] = z[i] / Li[i]
    for i in range(size-2, -1, -1):
        z[i] = (z[i] - Li_1[i-1]*z[i+1])/Li[i]
    # find index
    index = list_searchsorted(x0,x)
    index = clip(index, 1, size-1)
    xi1 = [x[num] for num in index]
    xi0 = [x[num-1] for num in index]
    yi1 = [y[num] for num in index]
    yi0 = [y[num-1] for num in index]
    zi1 = [z[num] for num in index]
    zi0 = [z[num-1] for num in index]
    hi1 = list( map(subtract, xi1, xi0) )
    # calculate cubic - all element-wise multiplication
    f0 = [0]*len(hi1)
    for j in range(len(f0)):
        f0[j] = zi0[j]/(6*hi1[j])*(xi1[j]-x0[j])**3 + \
                zi1[j]/(6*hi1[j])*(x0[j]-xi0[j])**3 + \
                (yi1[j]/hi1[j] - zi1[j]*hi1[j]/6)*(x0[j]-xi0[j]) + \
                (yi0[j]/hi1[j] - zi0[j]*hi1[j]/6)*(xi1[j]-x0[j])
    return f0

Minimal python3 code:
from scipy import interpolate
if __name__ == '__main__':
    x = [ 0, 1, 2, 3, 4, 5]
    y = [12,14,22,39,58,77]
    # tck : tuple (t,c,k) a tuple containing the vector of knots,
    # the B-spline coefficients, and the degree of the spline.
    tck = interpolate.splrep(x, y)
    print(interpolate.splev(1.25, tck)) # Prints 15.203125000000002
    # print(interpolate.splev(...other_value_here..., tck))
Based on comment of cwhy and answer by youngmit

In my previous post, I wrote code based on a Cholesky decomposition to solve the matrix generated by the cubic spline algorithm. Unfortunately, due to the square root function, it may perform badly on some sets of points (typically a non-uniform set of points).
In the same spirit as before, there is another idea: use the Thomas algorithm (TDMA) (see https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm) to partially solve the tridiagonal matrix during its definition loop. TDMA requires the matrix to be diagonally dominant. In our case this holds, since |bi| > |ai| + |ci| with ai = h[i], bi = 2*(h[i]+h[i+1]), ci = h[i+1], and h[i] always positive (see https://www.cfd-online.com/Wiki/Tridiagonal_matrix_algorithm_-_TDMA_(Thomas_algorithm)).
I refer again to the document from jingqiu (see my previous post; unfortunately the link is broken, but it is still possible to find it in web caches).
An optimized version of the TDMA solver can be described as follows:
import numpy as np
def TDMAsolver(a,b,c,d):
    """ This function is licenced under: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
    https://creativecommons.org/licenses/by-sa/3.0/
    Author raphael valentin
    Date 25 Mar 2022
    ref. https://www.cfd-online.com/Wiki/Tridiagonal_matrix_algorithm_-_TDMA_(Thomas_algorithm)
    """
    n = len(d)
    w = np.empty(n-1,float)
    g = np.empty(n, float)
    w[0] = c[0]/b[0]
    g[0] = d[0]/b[0]
    for i in range(1, n-1):
        m = b[i] - a[i-1]*w[i-1]
        w[i] = c[i] / m
        g[i] = (d[i] - a[i-1]*g[i-1]) / m
    g[n-1] = (d[n-1] - a[n-2]*g[n-2]) / (b[n-1] - a[n-2]*w[n-2])
    for i in range(n-2, -1, -1):
        g[i] = g[i] - w[i]*g[i+1]
    return g
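As a small check of the forward-elimination and back-substitution idea, here is a dependency-free sketch of the Thomas algorithm applied to a 3x3 diagonally dominant system whose solution is known. The function name thomas_solve and the test system are made up for illustration; a and c hold the sub- and super-diagonals (length n - 1) and b the main diagonal.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b, super-diagonal c."""
    n = len(d)
    cp = [0.0] * n   # modified super-diagonal
    dp = [0.0] * n   # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # forward elimination
    for i in range(1, n):
        m = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    # back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# | 2 1 0 |   | 1 |   | 4 |
# | 1 2 1 | * | 2 | = | 8 |
# | 0 1 2 |   | 3 |   | 8 |
solution = thomas_solve([1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0], [4.0, 8.0, 8.0])
print(solution)
```

Both loops are single passes over the data, which is where the O(n) cost comes from.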
Since each individual ai, bi, ci, di is known explicitly, it becomes easy to fold the definition of the natural cubic spline interpolator into these two loops.
def cubic_interpolate(x0, x, y):
    """ Natural cubic spline interpolate function
    This function is licenced under: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
    https://creativecommons.org/licenses/by-sa/3.0/
    Author raphael valentin
    Date 25 Mar 2022
    """
    xdiff = np.diff(x)
    dydx = np.diff(y) / xdiff  # not in-place, so integer y also works
    n = size = len(x)
    w = np.empty(n-1, float)
    z = np.empty(n, float)
    w[0] = 0.
    z[0] = 0.
    for i in range(1, n-1):
        m = xdiff[i-1] * (2 - w[i-1]) + 2 * xdiff[i]
        w[i] = xdiff[i] / m
        z[i] = (6*(dydx[i] - dydx[i-1]) - xdiff[i-1]*z[i-1]) / m
    z[-1] = 0.
    for i in range(n-2, -1, -1):
        z[i] = z[i] - w[i]*z[i+1]
    # find index (requires x to be a numpy array sorted in increasing order)
    index = x.searchsorted(x0)
    index = np.clip(index, 1, size-1)
    xi1, xi0 = x[index], x[index-1]
    yi1, yi0 = y[index], y[index-1]
    zi1, zi0 = z[index], z[index-1]
    hi1 = xi1 - xi0
    # calculate cubic
    f0 = zi0/(6*hi1)*(xi1-x0)**3 + \
         zi1/(6*hi1)*(x0-xi0)**3 + \
         (yi1/hi1 - zi1*hi1/6)*(x0-xi0) + \
         (yi0/hi1 - zi0*hi1/6)*(xi1-x0)
    return f0
This function gives the same results as the CubicSpline class from scipy.interpolate with bc_type='natural', as the plot produced by the example below shows.
The first and second analytical derivatives can be implemented as well:
f1p = -zi0/(2*hi1)*(xi1-x0)**2 + zi1/(2*hi1)*(x0-xi0)**2 + (yi1/hi1 - zi1*hi1/6) - (yi0/hi1 - zi0*hi1/6)
f2p = zi0/hi1 * (xi1-x0) + zi1/hi1 * (x0-xi0)
It is then easy to verify that f2p[0] and f2p[-1] are equal to 0 when x0 spans the full range of x, which confirms that the interpolator yields a natural spline.
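The derivative expressions can be cross-checked numerically. The sketch below is only an illustration: it takes the second derivatives at the knots from SciPy's natural CubicSpline (rather than from the solver above), evaluates the piecewise formulas obtained by differentiating the cubic term by term (note the minus sign in front of the last term of the first derivative), and compares them with SciPy's own derivatives. The names `spline`, `z`, `idx` are illustrative choices.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([-8, -4.19, -3.54, -3.31, -2.56, -2.31, -1.66, -0.96, -0.22, 0.62, 1.21, 3])
y = np.array([-0.01, 0.01, 0.03, 0.04, 0.07, 0.09, 0.16, 0.28, 0.45, 0.65, 0.77, 1])

spline = CubicSpline(x, y, bc_type='natural')
z = spline(x, 2)                      # second derivatives at the knots

x0 = np.linspace(x[0], x[-1], 200)    # evaluation points covering the full range
idx = np.clip(np.searchsorted(x, x0), 1, len(x) - 1)
xi1, xi0 = x[idx], x[idx - 1]
yi1, yi0 = y[idx], y[idx - 1]
zi1, zi0 = z[idx], z[idx - 1]
hi1 = xi1 - xi0

# First derivative of the cubic on each interval.
f1p = (-zi0 / (2 * hi1) * (xi1 - x0) ** 2
       + zi1 / (2 * hi1) * (x0 - xi0) ** 2
       + (yi1 / hi1 - zi1 * hi1 / 6)
       - (yi0 / hi1 - zi0 * hi1 / 6))
# Second derivative: linear interpolation of z between the knots.
f2p = zi0 / hi1 * (xi1 - x0) + zi1 / hi1 * (x0 - xi0)

print(np.allclose(f1p, spline(x0, 1), rtol=1e-6, atol=1e-9))
print(np.allclose(f2p, spline(x0, 2), rtol=1e-6, atol=1e-9))
print(f2p[0], f2p[-1])                # both should be (numerically) zero
```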
An additional reference concerning natural spline:
https://faculty.ksu.edu.sa/sites/default/files/numerical_analysis_9th.pdf#page=167
An example of use:
import matplotlib.pyplot as plt
import numpy as np
x = [-8,-4.19,-3.54,-3.31,-2.56,-2.31,-1.66,-0.96,-0.22,0.62,1.21,3]
y = [-0.01,0.01,0.03,0.04,0.07,0.09,0.16,0.28,0.45,0.65,0.77,1]
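# np.asfarray was removed in NumPy 2.0; np.asarray with a float dtype is equivalent
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)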
plt.scatter(x, y)
x_new= np.linspace(min(x), max(x), 10000)
y_new = cubic_interpolate(x_new, x, y)
plt.plot(x_new, y_new)
from scipy.interpolate import CubicSpline
f = CubicSpline(x, y, bc_type='natural')
plt.plot(x_new, f(x_new), label='ref')
plt.legend()
plt.show()
In conclusion, this updated algorithm should perform interpolation with better stability and run faster than the previous code (it is O(n)). Combined with numba or cython, it should be faster still. Finally, it is completely independent of SciPy.
Importantly, as with most algorithms, it is sometimes useful to normalize the data (e.g. when the values are very large or very small) to get the best results. Also, this code does not check for NaN values or verify that the data are ordered.
Anyway, this update was a good learning experience for me, and I hope it can help someone. Let me know if you find something strange.
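Since the function above does not validate its input, a small pre-processing step can help. The helper below is a hypothetical sketch (the name `prepare_spline_input` and its behaviour are my own assumptions, not part of the original code): it drops non-finite samples, sorts by x, and rejects duplicate x values, which would otherwise give zero-length intervals.

```python
import numpy as np

def prepare_spline_input(x, y):
    """Return float arrays sorted by x, with NaN/inf samples removed."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D arrays of the same length")
    keep = np.isfinite(x) & np.isfinite(y)   # drop NaN and infinite samples
    x, y = x[keep], y[keep]
    order = np.argsort(x, kind="stable")     # ascending x
    x, y = x[order], y[order]
    if np.any(np.diff(x) <= 0):
        raise ValueError("x values must be distinct")
    return x, y

xs, ys = prepare_spline_input([3, 1, np.nan, 2], [30, 10, 0, 20])
print(xs, ys)
```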

If you want the interpolated value at a single point:
from scipy.interpolate import CubicSpline
import numpy as np
x = [-5,-4.19,-3.54,-3.31,-2.56,-2.31,-1.66,-0.96,-0.22,0.62,1.21,3]
y = [-0.01,0.01,0.03,0.04,0.07,0.09,0.16,0.28,0.45,0.65,0.77,1]
value = 2
# sort into ascending order if needed (CubicSpline requires strictly increasing x)
if np.any(np.diff(x) < 0):
    indexes = np.argsort(x).astype(int)
    x = np.array(x)[indexes]
    y = np.array(y)[indexes]
f = CubicSpline(x, y, bc_type='natural')
specificVal = f(value).item(0) #f(value) is numpy.ndarray!!
print(specificVal)
If you want to plot the interpolated function:
The third parameter of np.linspace sets the number of sample points; increasing it gives a smoother curve.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
x = [-5,-4.19,-3.54,-3.31,-2.56,-2.31,-1.66,-0.96,-0.22,0.62,1.21,3]
y = [-0.01,0.01,0.03,0.04,0.07,0.09,0.16,0.28,0.45,0.65,0.77,1]
# sort into ascending order if needed (CubicSpline requires strictly increasing x)
if np.any(np.diff(x) < 0):
    indexes = np.argsort(x).astype(int)
    x = np.array(x)[indexes]
    y = np.array(y)[indexes]
f = CubicSpline(x, y, bc_type='natural')
x_new = np.linspace(min(x), max(x), 100)
y_new = f(x_new)
plt.plot(x_new, y_new)
plt.scatter(x, y)
plt.title('Cubic Spline Interpolation')
plt.show()

Yes, as others have already noted, it should be as simple as
>>> from scipy.interpolate import CubicSpline
>>> CubicSpline(x,y)(u)
array(15.203125)
(you can, for example, convert it to float to get the value from a 0d NumPy array)
What has not been described yet is boundary conditions: the default ‘not-a-knot’ boundary conditions work best if you have zero knowledge about the data you’re going to interpolate.
If you see the following ‘features’ on the plot, you can fine-tune the boundary conditions to get a better result:
the first derivative vanishes at boundaries => bc_type='clamped'
the second derivative vanishes at boundaries => bc_type='natural'
the function is periodic => bc_type='periodic'
See my article for more details and an interactive demo.
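To make the effect of each option concrete, here is a small sketch that fits the same samples with three boundary conditions and inspects the derivatives at the end points. The sample data (a coarse sine wave) is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 2 * np.pi, 9)
y = np.sin(x)
y[-1] = y[0]          # 'periodic' requires identical first and last values

natural = CubicSpline(x, y, bc_type='natural')
clamped = CubicSpline(x, y, bc_type='clamped')
periodic = CubicSpline(x, y, bc_type='periodic')

ends = x[[0, -1]]
print(natural(ends, 2))   # second derivative at both ends: ~0
print(clamped(ends, 1))   # first derivative at both ends: ~0
print(periodic(x[0], 1), periodic(x[-1], 1))   # first derivatives match
```

All three splines still pass through the data points; only the behaviour near the boundaries differs.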

Least squares regression on 2d array

The numpy.linalg.lstsq(a,b) function accepts an array a with size nx2 and a 1-dimensional array b which is the dependent variable.
How would I go about doing a least squares regression where the data points are presented as a 2d array generated from an image file? The array looks something like this:
[[0, 0, 0, 0, e]
[0, 0, c, d, 0]
[b, a, f, 0, 0]]
where a, b, c, d, e, f are positive integer values.
I want to fit a line to these points. Can I use np.linalg.lstsq (and if so, how) or is there something which may make more sense (and if so, how)?
Thanks very much.

A while ago I came across a similar Python program (written for Python 2: it uses print statements and time.clock), from:
# Prac 2 for Monte Carlo methods in a nutshell
# Richard Chopping, ANU RSES and Geoscience Australia, October 2012
# Useage
# python prac_q2.py [number of bootstrap runs]
# e.g. python prac_q2.py 10000
# would execute this and perform 10 000 bootstrap runs.
# Default is 100 runs.
# sys cause I need to access the arguments the script was called with
import sys
# math cause it's handy for scalar maths
import math
# time cause I want to benchmark how long things take
import time
# numpy cause it gives us awesome array / matrix manipulation stuff
import numpy
# scipy just in case
import scipy
# scipy.stats to make life simpler statistcally speaking
import scipy.stats as stats
def main():
print "Prac 2 solution: no graphs"
true_model = numpy.array([17.0, 10.0, 1.96])
# Here's a nifty way to write out numpy arrays.
# Unlike the data table in the prac handouts, I've got time first
# and height second.
# You can mix up the order but you need to change a lot of calculations
# to deal with this change.
data = numpy.array([[1.0, 26.94],
[2.0, 33.45],
[3.0, 40.72],
[4.0, 42.32],
[5.0, 44.30],
[6.0, 47.19],
[7.0, 43.33],
[8.0, 40.13]])
# Perform the least squares regression to find the best fit solution
best_fit = regression(data)
# Nifty way to get out elements from an array
m1,m2,m3 = best_fit
print "Best fit solution:"
print "m1 is", m1, "and m2 is", m2, "and m3 is", m3
# Calculate residuals from the best fit solution
best_fit_resid = residuals(data, best_fit)
print "The residuals from the best fit solution are:"
print best_fit_resid
print ""
# Bootstrap part
# --------------
# Number of bootstraps to run. 100 is a minimum and our default number.
num_booties = 100
# If we have an argument to the python script, use this as the
# number of bootstrap runs
if len(sys.argv) > 1:
num_booties = int(sys.argv[1])
# preallocate an array to store the results.
ensemble = numpy.zeros((num_booties, 3))
print "Starting up the bootstrap routine"
# How to do timing within a Python script - here I start a stopwatch running
start_time = time.clock()
for index in range(num_booties):
# Print every 10 % so we know where we're up to in long runs
if print_progress(index, num_booties):
percent = (float(index) / float(num_booties)) * 100.0
print "Have completed", percent, "percent"
# For each iteration of the bootstrap algorithm,
# first calculate mixed up residuals...
resamp_resid = resamp_with_replace(best_fit_resid)
# ... then generate new data...
new_data = calc_new_data(data, best_fit, resamp_resid)
# ... then perform another regression to generate a new set of m1, m2, m3
bootstrap_model = regression(new_data)
ensemble[index] = (bootstrap_model[0], bootstrap_model[1], bootstrap_model[2])
# Done with the loop
# Calculate the time the run took - what's the current time, minus when we started.
loop_time = time.clock() - start_time
print ""
print "Ensemble calculated based on", num_booties, "bootstrap runs."
print "Bootstrap runs took", loop_time, "seconds."
print ""
# Stats on the ensemble time
# --------------------------
B = num_booties
# Mean is pretty simple, 1.0/B to force it to use floating points
# This gives us an array of the means of the 3 model parameters
mean = 1.0/B * numpy.sum(ensemble, axis=0)
print "Mean is ([m1 m2 m3]):", mean
# Variance
var2 = 1.0/B * numpy.sum(((ensemble - mean)**2), axis=0)
print "Variance is ([m1 m2 m3]):", var2
# Bias
bias = mean - best_fit
print "Bias is ([m1 m2 m3]):", bias
bias_corr = best_fit - bias
print "Bias corrected solution is ([m1 m2 m3]):", bias_corr
print "The original solution was ([m1 m2 m3]):", best_fit
print "And the true solution is ([m1 m2 m3]):", true_model
print ""
# Confidence intervals
# ---------------------
# Sort column 1 to calculate confidence intervals
# Sorting in numpy sucks.
# Need to declare what the fields are (so it knows how to sort it)
# f8 => numpy's floating point number
# Then need to delcare what we sort it on
# Here we sort on the first column, then the second, then the third.
# f0,f1,f2 field 0, then field 1, then field 2.
# Then we make sure we sort it by column (axis = 0)
# Then we take a view of that data as a float64 so it works properly
sorted_m1 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f0','f1','f2'], axis=0).view(numpy.float64)
# stats is my name for scipy.stats
# This has a wonderful function that calculates percentiles, including performing interpolation
# (important for low numbers of bootstrap runs)
m1_perc0p5 = stats.scoreatpercentile(sorted_m1,0.5)[0]
m1_perc2p5 = stats.scoreatpercentile(sorted_m1,2.5)[0]
m1_perc16 = stats.scoreatpercentile(sorted_m1,16)[0]
m1_perc84 = stats.scoreatpercentile(sorted_m1,84)[0]
m1_perc97p5 = stats.scoreatpercentile(sorted_m1,97.5)[0]
m1_perc99p5 = stats.scoreatpercentile(sorted_m1,99.5)[0]
print "m1 68% confidence interval is from", m1_perc16, "to", m1_perc84
print "m1 95% confidence interval is from", m1_perc2p5, "to", m1_perc97p5
print "m1 99% confidence interval is from", m1_perc0p5, "to", m1_perc99p5
print ""
# Now column 2, sort it...
sorted_m2 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f1','f0','f2'], axis=0).view(numpy.float64)
# ... and do stats.
m2_perc0p5 = stats.scoreatpercentile(sorted_m2,0.5)[1]
m2_perc2p5 = stats.scoreatpercentile(sorted_m2,2.5)[1]
m2_perc16 = stats.scoreatpercentile(sorted_m2,16)[1]
m2_perc84 = stats.scoreatpercentile(sorted_m2,84)[1]
m2_perc97p5 = stats.scoreatpercentile(sorted_m2,97.5)[1]
m2_perc99p5 = stats.scoreatpercentile(sorted_m2,99.5)[1]
print "m2 68% confidence interval is from", m2_perc16, "to", m2_perc84
print "m2 95% confidence interval is from", m2_perc2p5, "to", m2_perc97p5
print "m2 99% confidence interval is from", m2_perc0p5, "to", m2_perc99p5
print ""
# and finally column 3, again, sort it..
sorted_m3 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f2','f1','f0'], axis=0).view(numpy.float64)
# ... and do stats.
m3_perc0p5 = stats.scoreatpercentile(sorted_m3,0.5)[2]
m3_perc2p5 = stats.scoreatpercentile(sorted_m3,2.5)[2]
m3_perc16 = stats.scoreatpercentile(sorted_m3,16)[2]
m3_perc84 = stats.scoreatpercentile(sorted_m3,84)[2]
m3_perc97p5 = stats.scoreatpercentile(sorted_m3,97.5)[2]
m3_perc99p5 = stats.scoreatpercentile(sorted_m3,99.5)[2]
print "m3 68% confidence interval is from", m3_perc16, "to", m3_perc84
print "m3 95% confidence interval is from", m3_perc2p5, "to", m3_perc97p5
print "m3 99% confidence interval is from", m3_perc0p5, "to", m3_perc99p5
print ""
# End of the main function
#
#
# Helper functions go down here
#
#
# regression
# This takes a 2D numpy array and performs a least-squares regression
# using the formula on the practical sheet, page 3
# Stored in the top are the real values
# Returns an array of m1, m2 and m3.
def regression(data):
# While testing, just return the real values
# real_values = numpy.array([17.0, 10.0, 1.96])
# Creating the G matrix
# ---------------------
# Because I'm using numpy arrays here, we need
# to learn some notation.
# data[:,0] is the FIRST column
# Length of this = number of time samples in data
N = len(data[:,0])
# numpy.sum adds up all data in a row or column.
# Axis = 0 implies add up each column. [0] at end
# returns the sum of the first column
# This is the sum of Ti for i = 1..N
sum_Ti = numpy.sum(data, axis=0)[0]
# numpy.power takes each element of an array and raises them to a given power
# In this one call we also take the sum of the columns (as above) after they have
# been squared, and then just take the t column
sum_Ti2 = numpy.sum(numpy.power(data, 2), axis=0)[0]
# Now we need to get the cube of Ti, then sum that result
sum_Ti3 = numpy.sum(numpy.power(data, 3), axis=0)[0]
# Finally we need the quartic of Ti, then sum that result
sum_Ti4 = numpy.sum(numpy.power(data, 4), axis=0)[0]
# Now we can construct the G matrix
G = numpy.array([[N, sum_Ti, -0.5 * sum_Ti2],
[sum_Ti, sum_Ti2, -0.5 * sum_Ti3],
[-0.5 * sum_Ti2, -0.5 * sum_Ti3, 0.25 * sum_Ti4]])
# We also need to take the inverse of the G matrix
G_inv = numpy.linalg.inv(G)
# Creating the d matrix
# ---------------------
# Hello numpy.sum, my old friend...
sum_Yi = numpy.sum(data, axis=0)[1]
# numpy.prod multiplies the values in an array.
# We need to do the products along axis 1 (i.e. row by row)
# Then sum all the elements
sum_TiYi = numpy.sum(numpy.prod(data, axis=1))
# The final element we need is a bit tricky.
# We need the product as above
TiYi = numpy.prod(data, axis=1)
# Then we get tricky. * works how we need it here,
# remember that the Ti column is referenced by data[:,0] as above
Ti2Yi = TiYi * data[:,0]
# Then we sum
sum_Ti2Yi = numpy.sum(Ti2Yi)
#With all the elements, we make the d matrix
d = numpy.array([sum_Yi,
sum_TiYi,
-0.5 * sum_Ti2Yi])
# Do the linear algebra stuff
# To multiple numpy arrays in a matrix style,
# we need to use numpy.dot()
# Not the most useful notation, but there you go.
# To help out the Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
result = G_inv.dot(d)
#Return this result
return result
# residuals:
# Takes in a data array, and an array of best fit paramers
# calculates the difference between the observed and predicted data
# and returns an array
def residuals(data, best_fit):
# Extract ti from the data array
ti = data[:,0]
# We also need an array of the square of ti
ti2 = numpy.power(ti, 2)
# Extract yi
yi = data[:,1]
# Calculate residual (data minus predicted)
result = yi - best_fit[0] - (best_fit[1] * ti) + (0.5 * best_fit[2] * ti2)
return result
# resamp_with_replace:
# Perform a dataset resampling with replacement on parameter set.
# Uses numpy.random to generate the random numbers to pick the indices to look up.
# So for item 0, ... N, we look up a random index from the set and put that in
# our resampled data.
def resamp_with_replace(set):
# How many things do we need to do this for?
N = len(set)
# Preallocate our result array
result = numpy.zeros(N)
# Generate N random integers between 0 and N-1 (the upper bound of randint is exclusive)
indices = numpy.random.randint(0, N, N)
# For i from the set 0...N-1 (that's what the range() command gives us),
# our result for that i is given by the index we randomly generated above
for i in range(N):
result[i] = set[indices[i]]
return result
# calc_new_data:
# Given a set of resampled residuals, use the model parameters to derive
# new data. This is used for bootstrapping the residuals.
# true_data is a numpy array of rows of ti, yi. We only need the ti column though.
# model is an array of three parameters, corresponding to m1, m2, m3.
# residuals are an array of our resudials
def calc_new_data(true_data, model, residuals):
# Extract the time information from the new data array
ti = true_data[:,0]
# Calculate new data using array maths
# This goes through and does the sums etc for each element of the array
# Nice and compact way to represent it.
y_new = residuals + model[0] + (model[1] * ti) - (0.5 * model[2] * ti**2)
# Our result needs to be an array of ti, y_new, so we need to combine them using
# the numpy.column_stack routine
result = numpy.column_stack((ti, y_new))
# Return this combined array
return result
# print_progress:
# Just a quick thing that returns true if we want to print for this index
# and false otherwise
def print_progress(index, total):
index = float(index)
total = float(total)
result = False
# Floating point maths is irritating
# We want to print at the start, every 10%, and at the end.
# This works up to index = 100,000
# Would also be lovely if Python had a switch statement
if (((index / total) * 100) <= 0.00001):
result = True
elif (((index / total) * 100) >= 9.99999) and (((index / total) * 100) <= 10.00001):
result = True
elif (((index / total) * 100) >= 19.99999) and (((index / total) * 100) <= 20.00001):
result = True
elif (((index / total) * 100) >= 29.99999) and (((index / total) * 100) <= 30.00001):
result = True
elif (((index / total) * 100) >= 39.99999) and (((index / total) * 100) <= 40.00001):
result = True
elif (((index / total) * 100) >= 49.99999) and (((index / total) * 100) <= 50.00001):
result = True
elif (((index / total) * 100) >= 59.99999) and (((index / total) * 100) <= 60.00001):
result = True
elif (((index / total) * 100) >= 69.99999) and (((index / total) * 100) <= 70.00001):
result = True
elif (((index / total) * 100) >= 79.99999) and (((index / total) * 100) <= 80.00001):
result = True
elif (((index / total) * 100) >= 89.99999) and (((index / total) * 100) <= 90.00001):
result = True
elif ((((index+1) / total) * 100) > 99.99999):
result = True
else:
result = False
return result
#
#
# End of helper functions
#
#
# So we can easily execute our script
if __name__ == "__main__":
main()
I guess you can take a look, here is link to complete information
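For readers on Python 3, the core of the script (fit, residuals, bootstrap of the residuals, percentile intervals) can be condensed considerably. The following is only a sketch under my own assumptions: it uses numpy.linalg.lstsq on the design matrix instead of inverting the normal-equations matrix by hand, a seeded numpy.random.default_rng for reproducibility, and numpy.percentile for the intervals. The data and the model y = m1 + m2*t - 0.5*m3*t**2 are taken from the script above.

```python
import numpy as np

# Time and height data from the script above.
t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([26.94, 33.45, 40.72, 42.32, 44.30, 47.19, 43.33, 40.13])

# Design matrix for y = m1 + m2*t - 0.5*m3*t**2.
G = np.column_stack((np.ones_like(t), t, -0.5 * t**2))

def fit(y_obs):
    """Least-squares estimate of (m1, m2, m3)."""
    coeffs, *_ = np.linalg.lstsq(G, y_obs, rcond=None)
    return coeffs

best_fit = fit(y)
resid = y - G @ best_fit              # observed minus predicted

rng = np.random.default_rng(0)        # fixed seed, chosen arbitrarily
n_boot = 1000
ensemble = np.empty((n_boot, 3))
for b in range(n_boot):
    # Resample residuals with replacement and refit on the synthetic data.
    resampled = rng.choice(resid, size=resid.size, replace=True)
    ensemble[b] = fit(G @ best_fit + resampled)

# 95% percentile interval for each parameter (rows: lower and upper bound).
low, high = np.percentile(ensemble, [2.5, 97.5], axis=0)
print("best fit:", best_fit)
print("95% interval lower:", low)
print("95% interval upper:", high)
```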

Use scikit-learn instead of NumPy (scikit-learn is built on top of NumPy and is more convenient for this kind of calculation):
>>> from sklearn import linear_model
>>> clf = linear_model.LinearRegression()
>>> clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)
>>> clf.coef_
array([ 0.5, 0.5])
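For the image-like array in the question, the points first have to be extracted from the 2D array. One possible approach is sketched below, under two assumptions that the question does not confirm: every non-zero pixel is a data point at (column, row), and the pixel value acts as a weight. The numeric values standing in for a..f are invented for the example.

```python
import numpy as np

# Hypothetical values for a, b, c, d, e, f in the question's layout.
image = np.array([[0, 0, 0, 0, 5],
                  [0, 0, 3, 4, 0],
                  [2, 1, 6, 0, 0]], dtype=float)

rows, cols = np.nonzero(image)        # coordinates of the non-zero pixels
weights = image[rows, cols]           # pixel values used as weights (assumption)

# Weighted least squares for row = slope * col + intercept:
# scale each equation by the square root of its weight.
A = np.column_stack((cols, np.ones(cols.size)))
sw = np.sqrt(weights)
(slope, intercept), *_ = np.linalg.lstsq(A * sw[:, None], rows * sw, rcond=None)
print(slope, intercept)
```

np.polyfit(cols, rows, 1, w=sw) solves the same weighted problem; drop the weights if all pixels should count equally.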

Specify the shift for numpy.correlate

Is it possible to specify the maximum shift, expressed by the variable k, for the cross-correlation of two 1D arrays? With numpy.correlate and its mode parameter set to 'full', I get cross-correlation coefficients for every shift over the whole length of the arrays (assuming both arrays have the same size). The example below shows exactly what I mean:
import numpy as np
# Define signal 1.
signal_1 = np.array([1, 2 ,3])
# Define signal 2.
signal_2 = np.array([1, 2, 3])
# Other definitions.
Xi = signal_1
Yi = signal_2
N = np.size(Xi)
k = 3
Xs = np.average(Xi)
Ys = np.average(Yi)
# Cross-covariance coefficient function.
def crossCovariance(Xi, Yi, N, k, Xs, Ys, forCorrelation = False):
autoCov = 0
for i in np.arange(0, N-k):
autoCov += ((Xi[i+k])-Xs)*(Yi[i]-Ys)
if forCorrelation == True:
return autoCov/N
else:
return (1/(N-1))*autoCov
# Expected value function.
def E(X, P):
expectedValue = 0
for i in np.arange(0, np.size(X)):
expectedValue += X[i] * (P[i] / np.size(X))
return expectedValue
# Cross-correlation coefficient function.
def crossCorrelation(Xi, Yi, k):
# Calculate the covariance coefficient.
cov = crossCovariance(Xi, Yi, N, k, Xs, Ys, forCorrelation = True)
# Calculate standard deviations.
EX = E(Xi, np.ones(np.size(Xi)))
SDX = (E((Xi - EX) ** 2, np.ones(np.size(Xi)))) ** (1/2)
EY = E(Yi, np.ones(np.size(Yi)))
SDY = (E((Yi - EY) ** 2, np.ones(np.size(Yi)))) ** (1/2)
# Calculate correlation coefficient.
return cov / (SDX * SDY)
# Express cross-covariance or cross-correlation function in a form of a 1D vector.
def array(k, norm = True):
# If norm = True, return array of autocorrelation coefficients.
# If norm = False, return array of autocovariance coefficients.
vector = np.array([])
shifts = np.abs(np.arange(-k, k+1, 1))
for i in shifts:
if norm == True:
vector = np.append(crossCorrelation(Xi, Yi, i), vector)
else:
vector = np.append(crossCovariance(Xi, Yi, N, i, Xs, Ys), vector)
return vector
In my example, calling the method array(k, norm = True) for different values of k gives the results shown below:
k = 3, [ 0. -0.5 0. 1. 0. -0.5 0. ]
k = 2, [-0.5 0. 1. 0. -0.5]
k = 1, [ 0. 1. 0.]
k = 0, [ 1.]
My approach is good for learning purposes, but I need to move to native numpy functions to speed up my analysis. How can one specify the k shift value when using the native numpy.correlate function? PS: the k parameter specifies the "time" shift between the two arrays. Thank you in advance.

Whilst I'm not aware of any built-in function for computing the cross-correlation for a particular range of signal lags, you can speed your version up a lot by vectorization, i.e. performing operations on arrays rather than single elements in an array.
This version uses only a single Python loop over the lags:
import numpy as np

def xcorr(x, y, k, normalize=True):
    n = x.shape[0]
    # initialize the output array
    out = np.empty((2 * k) + 1, dtype=np.double)
    lags = np.arange(-k, k + 1)
    # pre-compute E(x), E(y)
    mu_x = x.mean()
    mu_y = y.mean()
    # loop over lags
    for ii, lag in enumerate(lags):
        # use slice indexing to get 'shifted' views of the two input signals
        if lag < 0:
            xi = x[:lag]
            yi = y[-lag:]
        elif lag > 0:
            xi = x[:-lag]
            yi = y[lag:]
        else:
            xi = x
            yi = y
        # x - mu_x; y - mu_y
        xdiff = xi - mu_x
        ydiff = yi - mu_y
        # E[(x - mu_x) * (y - mu_y)]
        out[ii] = xdiff.dot(ydiff) / n
        # NB: xdiff.dot(ydiff) == (xdiff * ydiff).sum()
    if normalize:
        # E[(x - mu_x) * (y - mu_y)] / (sigma_x * sigma_y)
        out /= np.std(x) * np.std(y)
    return lags, out
Some more general points of advice:
As I mentioned in the comments, you should try to give your functions names that are informative, and that aren't likely to conflict with other things in your namespace (e.g. array vs np.array).
It's much better to make your functions self-contained. In your version, N, k, Xs and Ys are defined outside the main function. In this situation you might accidentally modify or overwrite one of these variables, and it can get tricky to debug errors caused by this sort of thing.
Appending to numpy arrays (e.g. using np.append or np.concatenate) is slow, so avoid it whenever you can. If, as in this case, you know the size of the output ahead of time, it's much faster to pre-allocate the output array (e.g. using np.empty or np.zeros), then fill in the elements. If you absolutely have to do concatenation, it's often faster to append to a normal Python list, then convert it to a numpy array at the end.
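If you would rather stay with numpy.correlate itself, another option is to compute the 'full' output once and keep only the central 2k+1 entries. The function below is a sketch, not a built-in: the name `xcorr_maxlag` is made up, it assumes both signals have the same length, and it uses the same normalization as the question (division by N and by the product of the population standard deviations).

```python
import numpy as np

def xcorr_maxlag(x, y, k, normalize=True):
    """Cross-covariance or cross-correlation for lags -k..k via np.correlate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    if y.shape[0] != n:
        raise ValueError("x and y must have the same length")
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < len(x)")
    # 'full' mode returns 2n-1 values; the zero-lag term sits at index n-1.
    full = np.correlate(x - x.mean(), y - y.mean(), mode="full") / n
    mid = n - 1
    out = full[mid - k: mid + k + 1]
    if normalize:
        out = out / (x.std() * y.std())
    return np.arange(-k, k + 1), out

lags, coeffs = xcorr_maxlag([1, 2, 3], [1, 2, 3], 2)
print(lags, coeffs)
```

This still computes every lag internally, so for very long signals and small k the explicit loop over lags may be cheaper.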

It's available by specifying maxlags:
import matplotlib.pyplot as plt
xcorr = plt.xcorr(signal_1, signal_2, maxlags=1)
Documentation can be found here. This implementation is based on np.correlate.

Calculating Pearson correlation and significance in Python - python

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.

The Pearson correlation can be calculated with numpy's corrcoef.
import numpy
numpy.corrcoef(list1, list2)[0, 1]

You can do this with pandas.DataFrame.corr, too:
import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
This gives
          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

Starting in Python 3.10, Pearson's correlation coefficient (statistics.correlation) is directly available in the standard library:
from statistics import correlation
# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852
