I have two numpy arrays
X.shape = (100, 10)
Y.shape = (100, 10)
I want to find the Pearson correlations between the columns of X and Y, i.e.
from scipy.stats.stats import pearsonr
def corr(X, Y):
    return np.array([pearsonr(x, y)[0] for x, y in zip(X.T, Y.T)])
corr( X, Y ).shape = (10, )
Is there a function for this? So far, all the functions I can find calculate correlation matrices. There is a pairwise correlation function in Matlab, so I'm pretty sure someone must have written one for Python.
I don't like the example function above because it seems slow.
If columns are variables and rows are observations in X, Y (and you would like to find column-wise correlations between X and Y):
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
pearson_r = np.dot(X.T, Y) / X.shape[0]
To find the p-values, convert pearson_r to t statistics:
t = pearson_r * np.sqrt(X.shape[0] - 2) / np.sqrt(1 - pearson_r ** 2)
and the two-sided p-value is 2 × P(T > |t|), where T follows a t distribution with X.shape[0] - 2 degrees of freedom.
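For example, a minimal sketch (reusing X and pearson_r from above) that turns the correlations into two-sided p-values via scipy.stats.t:

# Sketch: two-sided p-values from the vectorized correlations.
import numpy as np
from scipy.stats import t as t_dist

n = X.shape[0]                                        # number of observations (rows)
t_stat = pearson_r * np.sqrt(n - 2) / np.sqrt(1 - pearson_r ** 2)
p_values = 2 * t_dist.sf(np.abs(t_stat), df=n - 2)    # 2 * P(T > |t|)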
I adapted the following from scipy.stats.pearsonr:
import numpy as np
from scipy.stats import pearsonr

x = np.random.rand(100, 10)
y = np.random.rand(100, 10)
def corr(X, Y):
    return np.array([pearsonr(x, y)[0] for x, y in zip(X.T, Y.T)])

def pair_pearsonr(x, y, axis=0):
    mx = np.mean(x, axis=axis, keepdims=True)
    my = np.mean(y, axis=axis, keepdims=True)
    xm, ym = x - mx, y - my
    r_num = np.add.reduce(xm * ym, axis=axis)
    r_den = np.sqrt((xm * xm).sum(axis=axis) * (ym * ym).sum(axis=axis))
    r = r_num / r_den
    return r
np.allclose(pair_pearsonr(x, y, axis=0), corr(x, y))
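Since the motivation was speed, a quick timing comparison can confirm the gain (a sketch; the numbers depend on the machine and array sizes):

# Rough timing sketch: compare the loop-based corr with the vectorized pair_pearsonr.
import timeit
print(timeit.timeit(lambda: corr(x, y), number=100))
print(timeit.timeit(lambda: pair_pearsonr(x, y, axis=0), number=100))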
I am trying to use scipy.odr to create a regression of data with uncertainty in both the x-data and the y-data. I have read that I need to use the attribute output.sd_beta. However, I seem to get some weird results.
The following code shows that sd_beta returns zero if the uncertainty of the data is zero, even though the data is noisy. However, np.sqrt(np.diag(regression.cov_beta)) does the exact opposite. I think I have to add the uncertainty from a noisy signal to the uncertainty of the data as follows: uPopt = np.sqrt(np.diag(regression.cov_beta)) + output.sd_beta, but I am unsure. Can anyone please confirm or deny my gut feeling?
import numpy as np
import scipy.odr as odr
def lin(B, x):
    b = B[0]
    return b + 0 * x

def odrWrapper(description, x, y, sx, sy):
    # Function to create a regression using ODR and print the output
    data = odr.RealData(x, y, sx, sy)
    regression = odr.ODR(data, odr.Model(lin), beta0=[1])
    regression = regression.run()
    popt = regression.beta
    cov_beta = np.sqrt(np.diag(regression.cov_beta))
    sd_beta = regression.sd_beta
    print(description, popt, sd_beta, cov_beta)
# constants
b = 50
n = 10000
noiseScale = 10
uncert = 1
np.random.seed(0)
# no noise, no uncertainty
x = np.linspace(0, 100, n)
y = np.ones(n) * b
sx = [1e-10] * n  # very small value, as the uncertainty cannot be zero
sy = [1e-10] * n  # very small value, as the uncertainty cannot be zero
odrWrapper('No noise, no uncertainty: ', x, y, sx, sy)
>> No noise, no uncertainty: [50.] [0.] [1.e-12]
# noise but no uncertainty
x = np.linspace(0, 100, n)
y = np.ones(n) * b
y += noiseScale * (2 * np.random.rand(n) - 1)
sx = [1e-10] * n
sy = [1e-10] * n
odrWrapper('Noise but no uncertainty: ', x, y, sx, sy)
>> Noise but no uncertainty: [49.92917783] [0.05792112] [1.e-12]
# no noise but uncertainty
x = np.linspace(0, 100, n)
y = np.ones(n) * b
sx = [1e-10] * n
sy = [uncert] * n
odrWrapper('No noise but uncertainty: ', x, y, sx, sy)
>> No noise but uncertainty: [50.] [0.] [0.01]
# noise and uncertainty
x = np.linspace(0, 100, n)
y = np.ones(n) * b
y += noiseScale * (2 * np.random.rand(n) - 1)
sx = [1e-10] * n
sy = [1] * n
odrWrapper('Noise and uncertainty: ', x, y, sx, sy)
>> Noise and uncertainty: [49.90479242] [0.05826096] [0.01]
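One relationship worth checking (this is my assumption based on how scipy.odr reports the residual variance, not a confirmed answer): sd_beta should equal sqrt(diag(cov_beta) * res_var), i.e. cov_beta scaled by the fit's residual variance, which would explain why the two quantities react differently to noise and to the stated uncertainties. A sketch to test it, reusing x, y, sx, sy from the last case above:

# Sketch (assumption to verify): does sd_beta == sqrt(diag(cov_beta) * res_var)?
data = odr.RealData(x, y, sx, sy)
output = odr.ODR(data, odr.Model(lin), beta0=[1]).run()
print(output.sd_beta)
print(np.sqrt(np.diag(output.cov_beta) * output.res_var))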
I'd like to find the best fit (least-squares solution) for the a coefficients in an equation similar to this one:
b = f(x,y,z) = (a0 + a1*x + a2*y + a3*z + a4*x*y + a5*x*z + a6*y*z + a7*x*y*z)
x, y, and z are small arrays with a length of about 20. The example shown uses terms up to x**k with k=1; I'm looking for a solution up to k=3.
I have found this solution for a 2D fit: Equivalent of `polyfit` for a 2D polynomial in Python.
Now I'm looking for a similar solution, but in 3D.
You're right, a similar technique works:
import numpy as np
x, y, z = np.random.randn(3, 20)
grid = np.meshgrid(x, y, z, indexing='ij')
x, y, z = np.stack(grid).reshape(3, -1)
b = np.random.randn(*x.shape).reshape(-1)
A = np.stack([np.ones_like(x, dtype=x.dtype), x, y, z, x * y, x * z, y * z, x * y * z], axis=1)
coeff, r, rank, s = np.linalg.lstsq(A, b, rcond=None)
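For the higher orders mentioned in the question (up to k = 3), one option is to build the design matrix programmatically, one column per combination of exponents. This is a sketch, not part of the answer above; it reuses the flattened x, y, z, b from the snippet:

# Sketch: generalized design matrix with all terms x**i * y**j * z**m for 0 <= i, j, m <= k.
from itertools import product

def design_matrix(x, y, z, k=3):
    return np.stack([x**i * y**j * z**m
                     for i, j, m in product(range(k + 1), repeat=3)], axis=1)

A3 = design_matrix(x, y, z, k=3)
coeff3, *_ = np.linalg.lstsq(A3, b, rcond=None)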
I am trying to implement gradient descent in Python. Though my code returns a result, I think the results I am getting are completely wrong.
Here is the code I have written:
import numpy as np
import pandas
dataset = pandas.read_csv(r'D:\ML Data\house-prices-advanced-regression-techniques\train.csv')
X = np.empty((0, 1),int)
Y = np.empty((0, 1), int)
for i in range(dataset.shape[0]):
    X = np.append(X, dataset.at[i, 'LotArea'])
    Y = np.append(Y, dataset.at[i, 'SalePrice'])
X = np.c_[np.ones(len(X)), X]
Y = Y.reshape(len(Y), 1)
def gradient_descent(X, Y, theta, iterations=100, learningRate=0.000001):
    m = len(X)
    for i in range(iterations):
        prediction = np.dot(X, theta)
        theta = theta - (1/m) * learningRate * (X.T.dot(prediction - Y))
    return theta
theta = np.random.randn(2,1)
theta = gradient_descent(X, Y, theta)
print('theta',theta)
The result I get after running this program is:
theta [[-5.23237458e+228]
[-1.04560188e+233]]
These are very high values. Can someone point out the mistake I have made in the implementation?
Also, a second problem is that I have to set the learning rate very low (in this case 0.000001) for it to work; otherwise the program throws an error.
Please help me diagnose the problem.
Try reducing the learning rate with each iteration; otherwise it won't be able to reach the optimum. Try this:
import numpy as np
import pandas
dataset = pandas.read_csv('start.csv')
X = np.empty((0, 1),int)
Y = np.empty((0, 1), int)
for i in range(dataset.shape[0]):
    X = np.append(X, dataset.at[i, 'R&D Spend'])
    Y = np.append(Y, dataset.at[i, 'Profit'])
X = np.c_[np.ones(len(X)), X]
Y = Y.reshape(len(Y), 1)
def gradient_descent(X, Y, theta, iterations=50, learningRate=0.01):
    m = len(X)
    for i in range(iterations):
        prediction = np.dot(X, theta)
        theta = theta - (1/m) * learningRate * (X.T.dot(prediction - Y))
        learningRate /= 10
    return theta
theta = np.random.randn(2,1)
theta = gradient_descent(X, Y, theta)
print('theta',theta)
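A side note, not part of the answer above: with a raw feature as large as LotArea, the gradient explodes unless the learning rate is tiny. Standardizing the feature first (a common remedy) usually lets an ordinary learning rate converge. A minimal sketch, assuming X and Y as built in the question (a column of ones followed by the raw feature):

# Sketch (not the original answer): standardize the feature column so a
# normal learning rate converges without divergence.
X_scaled = X.astype(float).copy()
X_scaled[:, 1] = (X_scaled[:, 1] - X_scaled[:, 1].mean()) / X_scaled[:, 1].std()

theta = np.random.randn(2, 1)
m = len(X_scaled)
for _ in range(1000):
    prediction = np.dot(X_scaled, theta)
    theta = theta - (0.01 / m) * X_scaled.T.dot(prediction - Y)
print('theta', theta)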
I'd like to find a least-squares solution for the a coefficients in
z = (a0 + a1*x + a2*y + a3*x**2 + a4*x**2*y + a5*x**2*y**2 + a6*y**2 +
a7*x*y**2 + a8*x*y)
given arrays x, y, and z of length 20. Basically I'm looking for the equivalent of numpy.polyfit but for a 2D polynomial.
This question is similar, but the solution is provided via MATLAB.
Here is an example showing how you can use numpy.linalg.lstsq for this task:
import numpy as np
x = np.linspace(0, 1, 20)
y = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, y, copy=False)
Z = X**2 + Y**2 + np.random.rand(*X.shape)*0.01
X = X.flatten()
Y = Y.flatten()
A = np.array([X*0+1, X, Y, X**2, X**2*Y, X**2*Y**2, Y**2, X*Y**2, X*Y]).T
B = Z.flatten()
coeff, r, rank, s = np.linalg.lstsq(A, B, rcond=None)
The fitted coefficients coeff are:
array([ 0.00423365, 0.00224748, 0.00193344, 0.9982576 , -0.00594063,
0.00834339, 0.99803901, -0.00536561, 0.00286598])
Note that coeff[3] and coeff[6] correspond to X**2 and Y**2, respectively, and they are close to 1 because the example data was created with Z = X**2 + Y**2 + small_random_component.
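As a quick sanity check (a sketch, not in the original answer), the fitted surface can be reconstructed from A and coeff and compared with the data:

# Sketch: evaluate the fit on the same grid; the largest residual should be
# on the order of the 0.01 noise added to Z.
Z_fit = A.dot(coeff)
print(np.abs(Z_fit - B).max())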
Based on the answers from #Saullo and #Francisco I have made a function which I have found helpful:
def polyfit2d(x, y, z, kx=3, ky=3, order=None):
    '''
    Two dimensional polynomial fitting by least squares.
    Fits the functional form f(x,y) = z.

    Notes
    -----
    Resultant fit can be plotted with:
    np.polynomial.polynomial.polygrid2d(x, y, soln.reshape((kx+1, ky+1)))

    Parameters
    ----------
    x, y: array-like, 1d
        x and y coordinates.
    z: np.ndarray, 2d
        Surface to fit.
    kx, ky: int, default is 3
        Polynomial order in x and y, respectively.
    order: int or None, default is None
        If None, all coefficients up to maximum kx, ky, i.e. up to and including x^kx*y^ky, are considered.
        If int, coefficients up to a maximum of kx+ky <= order are considered.

    Returns
    -------
    Return parameters from np.linalg.lstsq.

    soln: np.ndarray
        Array of polynomial coefficients.
    residuals: np.ndarray
    rank: int
    s: np.ndarray
    '''
    # grid coords
    x, y = np.meshgrid(x, y)
    # coefficient array, up to x^kx, y^ky
    coeffs = np.ones((kx+1, ky+1))
    # solve array
    a = np.zeros((coeffs.size, x.size))
    # for each coefficient produce array x^i, y^j
    for index, (j, i) in enumerate(np.ndindex(coeffs.shape)):
        # do not include powers greater than order
        if order is not None and i + j > order:
            arr = np.zeros_like(x)
        else:
            arr = coeffs[i, j] * x**i * y**j
        a[index] = arr.ravel()
    # do leastsq fitting and return leastsq result
    return np.linalg.lstsq(a.T, np.ravel(z), rcond=None)
And the resultant fit can be visualised with:
fitted_surf = np.polynomial.polynomial.polyval2d(x, y, soln.reshape((kx+1,ky+1)))
plt.matshow(fitted_surf)
Excellent answer by Saullo Castro. Just to add the code to reconstruct the function using the least-squares solution for the a coefficients:
def poly2Dreco(X, Y, c):
    return (c[0] + X*c[1] + Y*c[2] + X**2*c[3] + X**2*Y*c[4] + X**2*Y**2*c[5] +
            Y**2*c[6] + X*Y**2*c[7] + X*Y*c[8])
You can also use scikit-learn for this.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
x = np.linspace(0, 1, 20)
y = np.linspace(0, 1, 20)
X, Y = np.meshgrid(x, y, copy=False)
X = X.flatten()
Y = Y.flatten()
# Generate noisy data
np.random.seed(0)
Z = X**2 + Y**2 + np.random.randn(*X.shape)*0.01
# Process 2D inputs
poly = PolynomialFeatures(degree=2)
input_pts = np.stack([X, Y]).T
assert(input_pts.shape == (400, 2))
in_features = poly.fit_transform(input_pts)
# Linear regression
model = LinearRegression()
model.fit(in_features, Z)
# Display coefficients
print(dict(zip(poly.get_feature_names_out(), model.coef_.round(4))))
# Check fit
print(f"R-squared: {model.score(poly.transform(input_pts), Z):.3f}")
# Make predictions
Z_predicted = model.predict(poly.transform(input_pts))
Out:
{'1': 0.0, 'x0': 0.003, 'x1': -0.0074, 'x0^2': 0.9974, 'x0 x1': 0.0047, 'x1^2': 1.0014}
R-squared: 1.000
Note that if kx != ky the code will fail because the j and i indices are inverted in the loop.
You get (j,i) from enumerate(np.ndindex(coeffs.shape)), but then you address elements in coeffs as coeffs[i,j]. Since the shape of the coefficient matrix is given by the maximum polynomial order that you are asking to use, the matrix will be rectangular if kx != ky and you will exceed one of its dimensions.
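A minimal sketch of one way to fix the loop (an assumption about intent: the first ndindex axis should be the x power so it matches the shape of coeffs), replacing the loop inside polyfit2d:

# Sketch: unpack the indices in the same order they are used, so the code
# also works when kx != ky.
for index, (i, j) in enumerate(np.ndindex(coeffs.shape)):
    # do not include powers greater than order
    if order is not None and i + j > order:
        arr = np.zeros_like(x)
    else:
        arr = coeffs[i, j] * x**i * y**j
    a[index] = arr.ravel()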
I have a set of x, y points and I'd like to find the line of best fit such that the line is below all points using SciPy. I'm trying to use leastsq for this, but I'm unsure how to adjust the line to be below all points instead of the line of best fit. The coefficients for the line of best fit can be produced via:
import numpy as np
from scipy.optimize import leastsq

def linreg(x, y):
    fit = lambda params, x: params[0] * x - params[1]
    err = lambda p, x, y: (y - fit(p, x))**2
    # initial slope/intercept
    init_p = np.array((1, 0))
    p, _ = leastsq(err, init_p.copy(), args=(x, y))
    return p

xs = np.array([1, 2, 3, 4, 5])
ys = np.array([10, 20, 30, 40, 50])
print(linreg(xs, ys))
The output is the coefficients for the line of best fit:
array([ 9.99999997e+00, -1.68071668e-15])
How can I get the coefficients of the line of best fit that is below all points?
A possible algorithm is as follows:
Move the axes to have all the data on the positive half of the x axis.
If the fit is of the form y = a * x + b, then for a given b the best fit for a will be the minimum of the slopes joining the point (0, b) with each of the (x, y) points.
You can then calculate a fit error, which is a function of only b, and use scipy.optimize.minimize to find the best value for b.
All that's left is computing a for that b and calculating b for the original position of the axes.
The following does that most of the time, except when the minimization fails with some mysterious error:
from __future__ import division
import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt
def fit_below(x, y):
    idx = np.argsort(x)
    x = x[idx]
    y = y[idx]
    x0, y0 = x[0] - 1, y[0]
    x -= x0
    y -= y0

    def error_function_2(b, x, y):
        a = np.min((y - b) / x)
        return np.sum((y - a * x - b)**2)

    b = scipy.optimize.minimize(error_function_2, [0], args=(x, y)).x[0]
    a = np.min((y - b) / x)
    return a, b - a * x0 + y0
x = np.arange(10).astype(float)
y = x * 2 + 3 + 3 * np.random.rand(len(x))
a, b = fit_below(x, y)
plt.plot(x, y, 'o')
plt.plot(x, a*x + b, '-')
plt.show()
And as TheodrosZelleke wisely predicted, it goes through two points that are part of the convex hull.
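If the scipy.optimize.minimize call keeps failing, an alternative formulation (a sketch of a different approach, not the answer above) is a linear program: enforce a*x + b <= y as hard constraints and minimize the total vertical gap (an L1 rather than least-squares criterion), which scipy.optimize.linprog handles directly.

# Sketch: line below all points via linear programming.
# Variables are (a, b); constraints a*x_i + b <= y_i for every point;
# objective minimizes sum_i (y_i - a*x_i - b), i.e. maximizes a*sum(x) + n*b.
import numpy as np
from scipy.optimize import linprog

def fit_below_lp(x, y):
    n = len(x)
    c = [-np.sum(x), -n]                     # minimize -(a*sum(x) + n*b)
    A_ub = np.column_stack([x, np.ones(n)])  # rows encode a*x_i + b <= y_i
    res = linprog(c, A_ub=A_ub, b_ub=y, bounds=[(None, None), (None, None)])
    return res.x                             # (a, b)

a_lp, b_lp = fit_below_lp(x, y)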