Given a tall m by n matrix X, I need to calculate s = 1 + x (X.T X)^{-1} x.T. Here, x is a row vector and s is a scalar. Is there an efficient (or recommended) way to compute this in Python?
Needless to say, X.T X will be symmetric positive definite.
My attempt:
If we consider the QR decomposition of X, i.e., X = QR, where Q is orthogonal, R is upper triangular, then X.T X = R.T R.
QR decomposition can be easily obtained using numpy.linalg.qr, that is
Q,R = numpy.linalg.qr(X)
But then again, is there a particularly efficient way to calculate inv(R.T R)?
If you are doing the QR factorization of X, resulting in X.T X = R.T R, you may avoid using np.linalg.inv (and np.linalg.solve) by using forward and backward substitution instead (R.T is lower triangular!) with scipy.linalg.solve_triangular:
import numpy as np
import scipy.linalg as LA
Q,R = np.linalg.qr(X)
# solve R.T R z = x by introducing y = R z,
# with step (a) then step (b)
# step (a) solve R.T y = x
y = LA.solve_triangular(R,x,trans='T')
# step (b) solve R z = y
z = LA.solve_triangular(R,y)
s = 1 + x @ z
where @ is the Python 3 matrix multiplication operator.
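As a sanity check (a minimal sketch with made-up test data, not part of the original answer), the triangular-solve route can be compared against the explicit-inverse formula:
import numpy as np
import scipy.linalg as LA
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))    # tall m x n matrix
x = rng.standard_normal((1, 5))     # row vector
Q, R = np.linalg.qr(X)
y = LA.solve_triangular(R, x.T, trans='T')   # step (a): R.T y = x.T
z = LA.solve_triangular(R, y)                # step (b): R z = y
s = 1 + (x @ z).item()
# reference value via the explicit inverse (slower and less stable)
s_ref = 1 + (x @ np.linalg.inv(X.T @ X) @ x.T).item()
assert np.isclose(s, s_ref)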
I want to calculate the divergence of a given vector with sympy. Is there any function in Python for this? I looked in einsteinpy's functions, but I haven't found anything that helps.
Basically I want to calculate \nabla_\mu (n v^\mu) = 0 from a given vector v, with n a constant number.
\nabla_\mu (n v^\mu) = 0 represents a divergence where \mu takes the derivative, with respect to x, y or z, of the corresponding component of the vector. For example:
\nabla_\mu (n v^\mu) = \partial_x (u^x) + \partial_y(u^y) + \partial_z(u^z)
u can be something like (2x,4y,6z)
I appreciate any help.
As shown by @mikuszefski, you can use the sympy.vector module, which provides an implementation of the divergence operator in a space.
Another way to do what you want is to use the function derive_by_array to get a tensor and then perform an Einstein contraction.
import sympy as sp
x, y, z = sp.symbols("x y z") # dim = 3
# Now the functions that you want:
u, v, w = 2*x, 4*y, 6*z
# In a more general way, you can do:
u = sp.Function("u")(x, y, z)
v = sp.Function("v")(x, y, z)
w = sp.Function("w")(x, y, z)
U = sp.Array([u, v, w]) # U is a vector of dim = 3 (or sympy.Array)
X = sp.Array([x, y, z]) # X is a vector of dim = 3 (or sympy.Array)
dUdX = sp.derive_by_array(U, X) # dUdX is a tensor of dim = 3 and order = 2
# First way:
divU = sp.trace(sp.Matrix(sp.derive_by_array(U, X))) # Limited
# Second way:
divU = sp.tensorcontraction(sp.derive_by_array(U, X), (0, 1)) # More general
This solution works fine when dim = 2, for example, but you must have len(X) == len(U).
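As a quick check (a minimal sketch) with the concrete vector from the question, u = (2x, 4y, 6z), the contraction returns 2 + 4 + 6:
import sympy as sp
x, y, z = sp.symbols("x y z")
U = sp.Array([2*x, 4*y, 6*z])
X = sp.Array([x, y, z])
divU = sp.tensorcontraction(sp.derive_by_array(U, X), (0, 1))
print(divU)  # 12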
I have a 100 by 1 response variable Y, and a 100 by 3 predictor matrix X. I want to calculate the regression coefficient (X'X)^{-1}X'Y.
Currently I'm doing it as follows:
invXpX=inv(np.dot(np.transpose(X),X))
XpY=np.dot(np.transpose(X),Y)
betahat=np.dot(invXpX,XpY)
This looks pretty cumbersome, while in MATLAB we could do it just like the original math formula: inv(X'*X)*X'*Y. Is there an easier way to calculate this regression coefficient in python?
Thanks!
Yes, it can be written more compactly, but note that this will not always improve your code or its readability.
The transpose of a numpy array can be found using the .T attribute. If you use numpy matrix instead of numpy arrays you can also use .I for the inverse, but I would recommend that you use ndarray. For matrix multiplication you can use @. Note that np.dot(X, Y) is equivalent to X.dot(Y) when X and Y are numpy arrays.
import numpy as np
# Simulate data using a quadratic equation with coefficients y=ax^2+bx+c
a, b, c = 1, 2, 3
x = np.arange(100)
# Add random component to y values for estimation
y = a*x**2 + b*x + c + np.random.randn(100)
# Get X matrix [100x3]
X = np.vstack([x**2, x, np.ones(x.shape)]).T
# Estimate coefficients a, b, c
x_hat = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
>>> array([0.99998334, 2.00246583, 2.95697339])
x_hat = np.linalg.inv(X.T @ X) @ X.T @ y
>>> array([0.99998334, 2.00246583, 2.95697339])
# Use matrix:
X_mat = np.matrix(X)
x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
>>> matrix([[0.99998334, 2.00246583, 2.95697339]])
# without noise:
y = a*x**2 + b*x + c
x_hat = (X_mat.T @ X_mat).I @ X_mat.T @ y
>>> matrix([[1., 2., 3.]])
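For reference (a sketch, not part of the original answer), np.linalg.solve or np.linalg.lstsq avoids forming the explicit inverse and is generally more numerically stable:
# X and y as defined above
x_hat = np.linalg.solve(X.T @ X, X.T @ y)      # solve (X.T X) beta = X.T y directly
x_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # or use least squares on X itself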
You can try this:
np.linalg.inv(X.T @ X) @ (X.T @ Y)
This question is intended to be a canonical duplicate target
Given two arrays X and Y of shapes (i, n) and (j, n), representing lists of n-dimensional coordinates,
def test_data(n, i, j, r = 100):
    X = np.random.rand(i, n) * r - r / 2
    Y = np.random.rand(j, n) * r - r / 2
    return X, Y
X, Y = test_data(3, 1000, 1000)
what are the fastest ways to find:
The distance D with shape (i,j) between every point in X and every point in Y
The indices k_i and distance k_d of the k nearest neighbors against all points in X for every point in Y
The indices r_i, r_j and distance r_d of every point in X within distance r of every point j in Y
Given the following sets of restrictions:
Only using numpy
Using any python package
Including the special case:
Y is X
In all cases distance primarily means Euclidean distance, but feel free to highlight methods that allow other distance calculations.
#1. All Distances
Only using numpy
The naive method is:
D = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :])**2, axis = -1))
However, this takes up a lot of memory, creating an (i, j, n)-shaped intermediate array, and is very slow.
Fortunately, thanks to a trick from @Divakar (eucl_dist package, wiki), we can use a bit of algebra and np.einsum to decompose as such: (X - Y)**2 = X**2 - 2*X*Y + Y**2
D = np.sqrt(                                  # (X - Y) ** 2
    np.einsum('ij, ij ->i', X, X)[:, None] +  # = X ** 2
    np.einsum('ij, ij ->i', Y, Y) -           # + Y ** 2
    2 * X.dot(Y.T))                           # - 2 * X * Y
Y is X
Similar to above:
XX = np.einsum('ij, ij ->i', X, X)
D = np.sqrt(XX[:, None] + XX - 2 * X.dot(X.T))
Beware that floating-point imprecision can make the diagonal terms deviate very slightly from zero with this method. If you need to make sure they are zero, you'll need to explicitly set it:
np.einsum('ii->i', D)[:] = 0
Any Package
scipy.spatial.distance.cdist is the most intuitive builtin function for this, and far faster than bare numpy
from scipy.spatial.distance import cdist
D = cdist(X, Y)
cdist can also deal with many, many distance measures as well as user-defined distance measures (although these are not optimized). Check the documentation linked above for details.
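For example (a minimal sketch), a built-in metric name or a Python callable can be passed via the metric argument:
import numpy as np
from scipy.spatial.distance import cdist
D_manhattan = cdist(X, Y, metric='cityblock')                       # built-in L1 distance
D_chebyshev = cdist(X, Y, metric=lambda a, b: np.abs(a - b).max())  # user-defined callable (slower)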
Y is X
For self-referring distances, scipy.spatial.distance.pdist works similarly to cdist, but returns a 1-D condensed distance array, saving space on the symmetric distance matrix by storing each term only once. You can convert this to a square matrix using squareform:
from scipy.spatial.distance import pdist, squareform
D_cond = pdist(X)
D = squareform(D_cond)
#2. K Nearest Neighbors (KNN)
Only using numpy
We could use np.argpartition to get the k-nearest indices and use those to get the corresponding distance values. So, with D as the array holding the distance values obtained above, we would have -
if k == 1:
    k_i = D.argmin(0)[None]    # keep a leading axis so take_along_axis works
else:
    k_i = D.argpartition(k, axis = 0)[:k]
k_d = np.take_along_axis(D, k_i, axis = 0)
However we can speed this up a bit by not taking the square roots until we have reduced our dataset. np.sqrt is the slowest part of calculating the Euclidean norm, so we don't want to do that until the end.
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]    # keep a leading axis so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d = np.sqrt(np.take_along_axis(D_sq, k_i, axis = 0))
Now, np.argpartition performs an indirect partition and doesn't necessarily return the elements in sorted order; it only guarantees that the first k elements are the k smallest. So, for a sorted output, we need to use argsort on the output from the previous step -
sorted_idx = k_d.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
k_d_sorted = np.take_along_axis(k_d, sorted_idx, axis = 0)
If you only need k_i, you never need the square root at all:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
if k == 1:
    k_i = D_sq.argmin(0)[None]    # keep a leading axis so take_along_axis works
else:
    k_i = D_sq.argpartition(k, axis = 0)[:k]
k_d_sq = np.take_along_axis(D_sq, k_i, axis = 0)
sorted_idx = k_d_sq.argsort(axis = 0)
k_i_sorted = np.take_along_axis(k_i, sorted_idx, axis = 0)
Y is X
In the above code, replace:
D_sq = np.einsum('ij, ij ->i', X, X)[:, None] + \
       np.einsum('ij, ij ->i', Y, Y) - 2 * X.dot(Y.T)
with:
XX = np.einsum('ij, ij ->i', X, X)
D_sq = XX[:, None] + XX - 2 * X.dot(X.T)
Any Package
A KD-tree is a much faster method for finding neighbors and constrained distances. Be aware that while a KDTree is usually much faster than the brute-force solutions above in 3D (as long as you have more than 8 points), in n dimensions a KDTree only scales well if you have more than 2**n points. For a discussion and more advanced methods for high dimensions, see here.
The most recommended method for implementing KDTree is to use scipy's scipy.spatial.KDTree or scipy.spatial.cKDTree
from scipy.spatial import KDTree
X_tree = KDTree(X)
k_d, k_i = X_tree.query(Y, k = k)
Unfortunately scipy's KDTree implementation is slow and has a tendency to segfault for larger data sets. As pointed out by @HansMusgrave here, pykdtree improves performance a lot, but is not as common an include as scipy and currently only handles Euclidean distance (while the KDTree in scipy can handle Minkowski p-norms of any order).
Y is X
Use instead:
k_d, k_i = X_tree.query(X, k = k)
Arbitrary metrics
A BallTree has similar algorithmic properties to a KDTree. I'm not aware of a parallel/vectorized/fast BallTree in Python, but using sklearn we can still have reasonable KNN queries for user-defined metrics. If available, builtin metrics will be much faster.
import sklearn.neighbors

def d(a, b):
    return max(np.abs(a - b))

tree = sklearn.neighbors.BallTree(X, metric=d)
k_d, k_i = tree.query(Y)
This answer will be wrong if d() is not a metric. The only reason a BallTree is faster than brute force is because the properties of a metric allow it to rule out some solutions. For truly arbitrary functions, brute force is actually necessary.
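For this particular d (the Chebyshev / L-infinity distance), sklearn also ships an equivalent built-in metric, which is much faster than a Python callable (a minimal sketch, assuming k is defined as above):
from sklearn.neighbors import BallTree
tree = BallTree(X, metric='chebyshev')   # built-in equivalent of max(|a - b|)
k_d, k_i = tree.query(Y, k = k)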
#3. Radius search
Only using numpy
The simplest method is just to use boolean indexing:
mask = D_sq < r**2
r_i, r_j = np.where(mask)
r_d = np.sqrt(D_sq[mask])
Any Package
Similar to above, you can use scipy.spatial.KDTree.query_ball_point
r_ij = X_tree.query_ball_point(Y, r = r)
or scipy.spatial.KDTree.query_ball_tree
Y_tree = KDTree(Y)
r_ij = X_tree.query_ball_tree(Y_tree, r = r)
Unfortunately r_ij ends up being a list of index arrays that are a bit difficult to untangle for later use.
Much easier is to use cKDTree's sparse_distance_matrix, which can output a coo_matrix
from scipy.spatial import cKDTree
X_cTree = cKDTree(X)
Y_cTree = cKDTree(Y)
D_coo = X_cTree.sparse_distance_matrix(Y_cTree, max_distance = r, output_type = 'coo_matrix')
r_i = D_coo.row
r_j = D_coo.col
r_d = D_coo.data
This is an extraordinarily flexible format for the distance matrix, as it remains an actual (sparse) matrix and, if converted to csr format, can also be used for many vectorized operations.
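For instance (a minimal sketch), after converting to csr format you can work on the sparse matrix row-wise, e.g. counting how many points of Y fall within r of each point of X:
import numpy as np
D_csr = D_coo.tocsr()
neighbor_counts = np.diff(D_csr.indptr)   # stored entries per row = Y-points within r of each X-point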
I'm trying to create a matrix with values based on x,y values I have stored in a tuple. I use a loop to iterate over the tuple and perform a simple calculation on the data:
import numpy as np
# Trying to fit quadratic equation to the measured dots
N = 6
num_of_params = 3
# x values
x = (1,4,3,5,2,6)
# y values
y = (3.96, 24.96,14.15,39.8,7.07,59.4)
# X is a matrix N * 3 with the x values to the power of {0,1,2}
X = np.zeros((N,3))
Y = np.zeros((N,1))
print X,"\n\n",Y
for i in range(len(x)):
    for p in range(num_of_params):
        X[i][p] = x[i]**(num_of_params - p - 1)
    Y[i] = y[i]
print "\n\n"
print X,"\n\n",Y
Can this be achieved in an easier way? I'm looking for some way to initialize the matrix, something like X = np.zeros((N,3), read_values_from = x).
Is it possible? Is there another simple way?
Python 2.7
Extend the array version of x to 2D with a singleton dimension (a dimension of length 1) along the second axis using np.newaxis/None. This lets us leverage NumPy broadcasting to get the 2D output in a vectorized manner. The same idea applies to y.
Hence, the implementation would be -
X = np.asarray(x)[:,None]**(num_of_params - np.arange(num_of_params) - 1)
Y = np.asarray(y)[:,None]
Or use the built-in outer method of np.power to get X, which takes care of the array conversion under the hood -
X = np.power.outer(x, num_of_params - np.arange(num_of_params) - 1)
Alternatively, for Y, use np.expand_dims -
Y = np.expand_dims(y,1)
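Incidentally (not part of the original answer), NumPy's built-in np.vander produces exactly this power matrix:
X = np.vander(x, num_of_params)   # columns are x**2, x**1, x**0 (decreasing powers)
Y = np.asarray(y)[:,None]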
I've written some beginner code to calculate the coefficients of a simple linear model using the normal equation.
# Modules
import numpy as np
# Loading data set
X, y = np.loadtxt('ex1data3.txt', delimiter=',', unpack=True)
data = np.genfromtxt('ex1data3.txt', delimiter=',')
def normalEquation(X, y):
    m = int(np.size(data[:, 1]))
    # This is the feature / parameter (2x2) vector that will
    # contain my minimized values
    theta = []
    # I create a bias_vector to add to my newly created X vector
    bias_vector = np.ones((m, 1))
    # I need to reshape my original X(m,) vector so that I can
    # manipulate it with my bias_vector; they need to share the same
    # dimensions.
    X = np.reshape(X, (m, 1))
    # I combine these two vectors together to get a (m, 2) matrix
    X = np.append(bias_vector, X, axis=1)
    # Normal Equation:
    # theta = inv(X^T * X) * X^T * y
    # For convenience I create a new, transposed X matrix
    X_transpose = np.transpose(X)
    # Calculating theta
    theta = np.linalg.inv(X_transpose.dot(X))
    theta = theta.dot(X_transpose)
    theta = theta.dot(y)
    return theta
p = normalEquation(X, y)
print(p)
Using the small data set found here:
http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
I get the coefficients [-0.34390603; 0.2124426] using the above code instead of [24.9660; 3.3058]. Could anyone help clarify where I am going wrong?
You can implement normal equation like below:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
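# theta_best should come out close to [[4.], [3.]], the true intercept and slope, up to the noise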
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict
This assumes X is an m by (n+1) matrix where x_0 is always 1, and y is an m-dimensional vector.
import numpy as np
step1 = np.dot(X.T, X)
step2 = np.linalg.pinv(step1)
step3 = np.dot(step2, X.T)
theta = np.dot(step3, y) # if y is m x 1. If 1xm, then use y.T
Your implementation is correct. You've only swapped X and y (look closely at how they define x and y), which is why you get a different result.
The call normalEquation(y, X) gives [ 24.96601443 3.30576144] as it should.
Here is the normal equation in one line:
theta = np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,Y))