Least-Squares Regression of Matrices with Numpy

I'm looking to calculate a least-squares linear regression between an N-by-M matrix and a set of known, ground-truth solutions in an N-by-1 matrix. From there, I'd like to get the slope, intercept, and residual value of each regression. The basic idea is that I know the actual value that should be predicted for each sample (each row of N), and I'd like to determine which set of predicted values (each column of M) is most accurate using the residuals.
I don't describe matrices well, so here's a drawing:
(N,M) matrix with predicted values for each row N, in each column of M...
## NOTE: values of M and N are not actually 4 and 3, just examples

                       4 columns in "M"
[1, 1.1, 0.8, 1.3]
[2, 1.9, 2.2, 1.7]     3 rows in "N"
[3, 3.1, 2.8, 3.3]

(1,N) matrix with actual values of N
[1]
[2]     actual value of each sample N, in a single column
[3]
So again, for clarity's sake, I'm looking to calculate the lstsq regression between each column of the (N,M) matrix and the (1,N) matrix.
For instance, the regression between
[1]  and  [1]
[2]       [2]
[3]       [3]
then the regression between
[1]  and  [1.1]
[2]       [1.9]
[3]       [3.1]
and so on, outputting the slope, intercept, and standard error (average residual) for each regression calculated.
So far, in the numpy/scipy documentation and around the 'net, I've only found examples that compute one column at a time. I had thought numpy was able to compute a regression for each column in a set using the standard
np.linalg.lstsq(arrayA,arrayB)
But that returns the error
ValueError: array dimensions must agree except for d_0
Do I need to split the columns into their own arrays, then compute one at a time?
Is there a parameter or matrix operation I need to use to have numpy calculate the regressions on each column independently?
I feel like this should be simpler. I've looked all over, and I can't seem to find anyone doing something similar.

Maybe you switched A and b?
The following works for me:
A = np.random.rand(4) + np.arange(3)[:, None]
# A is now a (3, 4) array
b = np.arange(3)   # shape (3,)
np.linalg.lstsq(A, b)
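For reference, here is a hedged sketch of unpacking what that call returns (passing rcond=None avoids the default-change warning in newer NumPy versions); the shapes in the comments follow the (3, 4) example above:
import numpy as np

A = np.random.rand(4) + np.arange(3)[:, None]   # shape (3, 4)
b = np.arange(3)                                # shape (3,)

# lstsq returns the coefficient vector, the sum of squared residuals,
# the rank of A, and the singular values of A.
coeffs, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(coeffs.shape)   # (4,) -- one coefficient per column of A
print(residuals)      # empty here: A has more columns than rows, so the fit is exact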

The 0th dimension of arrayB must be the same as the 0th dimension of arrayA (ref: the official documentation of np.linalg.lstsq). You need matrices with dimensions (N, M) and (N, 1) or (N, M) and (N) instead of the (N,M) and (1,N) matrices you're using now.
Note that (N, 1)- and (N,)-shaped b arrays give identical results; only the shapes of the returned arrays differ.
I get a slightly different exception from yours, but that may be due to different versions (I am using Python 2.7, NumPy 1.6 on Windows):
>>> A = np.arange(12).reshape(3, 4)
>>> b = np.arange(3).reshape(1, 3)
>>> np.linalg.lstsq(A,b)
# This gives "LinAlgError: Incompatible dimensions" exception
>>> np.linalg.lstsq(A,b.T)
# This works, note that I am using the transpose of b here
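Note that a single lstsq call like the ones above performs one joint fit over all columns of A. If what you ultimately want is an independent slope, intercept, and residual for each column (as the question describes), one approach is to loop over the columns and fit a straight line to each, e.g. with np.linalg.lstsq on a two-column design matrix. A hedged sketch; the variable names are illustrative, not from the original post:
import numpy as np

# Example data matching the shapes in the question: predictions is (N, M),
# actual is (N,).
predictions = np.array([[1.0, 1.1, 0.8, 1.3],
                        [2.0, 1.9, 2.2, 1.7],
                        [3.0, 3.1, 2.8, 3.3]])
actual = np.array([1.0, 2.0, 3.0])

results = []
for col in predictions.T:                      # iterate over the M columns
    # Design matrix [x, 1] for a straight-line fit: actual ~ slope*col + intercept
    design = np.vstack([col, np.ones_like(col)]).T
    coeffs, residuals, rank, sv = np.linalg.lstsq(design, actual, rcond=None)
    slope, intercept = coeffs
    ss_res = residuals[0] if residuals.size else 0.0   # sum of squared residuals
    results.append((slope, intercept, ss_res))

for i, (m, c, r) in enumerate(results):
    print(f"column {i}: slope={m:.3f}, intercept={c:.3f}, SSR={r:.5f}")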

Python NumPy Array and Matrix Multiplication

I'm going through the "Make Your Own Neural Network" book and working through the examples to implement my first NN. I understood the basic concepts, and in particular this equation, where the output is calculated by taking the matrix dot product of the weights and inputs:
X = W * I
Where X is the output before applying the Sigmoid, W the link weights and I the inputs.
Now in the book, they have a function that takes this input as an array and then converts that array to a two-dimensional one. My understanding is that the value of X is calculated like this, based on:
W = [0.1, 0.2, 0.3
     0.4, 0.5, 0.6
     0.7, 0.8, 0.9]

I = [1
     2
     3]
So if I now pass in an array for my inputs like [1, 2, 3], why is it that I need to do the following to convert it to a 2-D array, as is done in the book:
inputs = numpy.array(inputs, ndmin=2).T
Any ideas?
Your input here is a one-dimensional list (or a one-dimensional array):
I = [1, 2, 3]
The idea behind this one-dimensional array is the following: if these numbers represent the width in centimetres of a flower petal, its length, and its weight in grams: your flower petal will have a width of 1cm, a length of 2cm, and a weight of 3g.
Converting your input I to a 2-D array is necessary here for two reasons:
first, by default, converting this list to a NumPy array using numpy.array(inputs) yields an array of shape (3,), with no second dimension. Setting ndmin=2 forces the shape to (1, 3), and the .T then turns it into (3, 1), which avoids NumPy shape problems later, for instance when doing matrix multiplication.
secondly, and perhaps more importantly, as I said in my comment, data in neural networks are conventionally stored in arrays this way, with each row of the array representing a different feature (so there is a distinct row for each feature). In other words, it's just a conventional way to make sure you're not confusing apples and pears (in this case, length and weight).
So when you do inputs = numpy.array(inputs, ndmin=2).T, you end up with:
array([[1],   # width
       [2],   # length
       [3]])  # weight
and not:
array([1, 2, 3])
Hope it made things a bit clearer!
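To make the shape bookkeeping concrete, here is a small sketch (the variable names are mine) of what each step produces and why the column form lines up with W * I:
import numpy as np

inputs = [1, 2, 3]

a = np.array(inputs)              # shape (3,)   one-dimensional
b = np.array(inputs, ndmin=2)     # shape (1, 3) a row vector
c = np.array(inputs, ndmin=2).T   # shape (3, 1) the column vector the book uses

print(a.shape, b.shape, c.shape)  # (3,) (1, 3) (3, 1)

W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9]])

X = W @ c                         # (3, 3) @ (3, 1) -> (3, 1) column of outputs
print(X)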

Transpose input matrix before LinearRegression in sklearn

Here is my python program:
import numpy as np
from sklearn import linear_model
X=np.array([[1, 2, 4]]).T**2
y=np.array([1, 4, 16])
model=linear_model.LinearRegression()
model.fit(X,y)
print('Coefficients: \n', model.coef_)
As a result I have:
Coefficients:
[1.]
It is the first program I'm testing with sklearn.
My question is: why do I have to use the transpose .T**2 in the third instruction?
Without
.T**2
I get these errors: https://imgur.com/a/XWzJx0f
I use http://jupyter.org/try
As the documentation says, you have to pass a matrix with n_samples (3) rows and n_features (1) columns. So your input X in the form [[1, 2, 4]] needs the inner vector in a vertical position.
After .T**2:
array([[ 1],
       [ 4],
       [16]])
This is what happens under the hood: https://machinelearningmastery.com/solve-linear-regression-using-linear-algebra/
You have to match X and y in the same dimension (same number of training samples).
If you do not use the transpose, you have 1 training sample [1, 2, 4] but 3 labels, which does not match.
If you use the transpose, you have 3 samples [1], [2], [4], which matches the 3 labels.
The **2 does not matter here.
The initial shape of matrix X is (1, 3). You need to pass the matrix in the form (3, 1), as the documentation says and as mentioned in the answer by Alessandro.
The **2 part just squares each element of matrix X. You can run it without that part; the coefficient will then differ. Currently, after squaring, the (X, y) pairs are (1, 1), (4, 4) and (16, 16), so the coefficient (the slope m in the equation y = mx + c, if you plot these on a graph) is 1. If you don't square, the coefficient will differ accordingly.
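A minimal sketch separating the two steps, the transpose (which fixes the shape) and the squaring (which only changes the values); the variable names are illustrative:
import numpy as np
from sklearn import linear_model

x = np.array([1, 2, 4])

# scikit-learn wants X as (n_samples, n_features); reshape(-1, 1) gives (3, 1),
# the same shape as np.array([[1, 2, 4]]).T in the question.
X = x.reshape(-1, 1)
y = np.array([1, 4, 16])

model = linear_model.LinearRegression()
model.fit(X, y)                        # works: 3 samples, 1 feature, 3 labels
print(model.coef_, model.intercept_)   # fit of y against x itself

model.fit(X**2, y)                     # squaring first makes y exactly linear in the feature
print(model.coef_, model.intercept_)   # coefficient ~1.0, intercept ~0.0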

numpy vector multiplication speed?

I'm new to numpy and found some behaviour that seems strange to me.
I'm implementing the logistic regression cost function; here I have two column vectors with the same dimensions and the same dtype (float). y contains a bunch of zeros and ones, and a contains floats in the range (-1, 1).
At some point I need their dot product, so I transpose one and multiply them:
x = y.T @ a
But when I use
x = y @ a.T
performance occasionally decreases by about a factor of three, while the results appear to be the same.
Why is this? Aren't the operations the same?
Thanks.
The performance decreases, and you get a very different answer!
For vector multiplication (unlike scalar multiplication), a @ b != b @ a. In your case (assuming column vectors), a.T @ b is a single number, but a @ b.T is a full-blown matrix! So if your vectors are both of shape (n, 1), the last operation results in an (n, n) matrix, which may be pretty huge. Of course, it takes far more time to compute such a matrix (multiplying and adding a whole lot of numbers to produce a whole lot of numbers) than to produce one single number.
That's how matrix (or vector) multiplication works.
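A quick way to see what this means in practice is to compare the shapes of the two products (the sizes below are arbitrary):
import numpy as np

n = 5000
y = np.random.rand(n, 1)   # column vector
a = np.random.rand(n, 1)   # column vector

inner = y.T @ a            # (1, n) @ (n, 1) -> (1, 1): a single number
outer = y @ a.T            # (n, 1) @ (1, n) -> (n, n): n*n numbers

print(inner.shape, outer.shape)   # (1, 1) (5000, 5000)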

numpy covariance between each column of a matrix and a vector

Based on this post, I can get the covariance between two vectors using np.cov((x, y), rowvar=0). I have an MxN matrix and an Mx1 vector. I want to find the covariance between each column of the matrix and the given vector. I know that I can write this with a for loop. I was wondering if I can somehow use np.cov() to get the result directly.
As Warren Weckesser said, numpy.cov(X, Y) is a poor fit for the job because it simply joins the arrays into one M by (N+1) array and finds the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
A = np.sqrt(np.arange(12).reshape(3, 4)) # some 3 by 4 array
b = np.array([[2], [4], [5]]) # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0]-1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.
For comparison, the same computation using np.cov:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you want. The rest is the covariance of A with itself, or of b with itself.
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: numpy.var does not apply it by default; to get it, pass the ddof=1 parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as from the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]
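Putting the pieces together, a short consistency check (same A and b as above):
import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])

# Covariance and correlation of b with each column of A, as derived above.
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))

# Cross-check against the np.cov / np.corrcoef versions.
assert np.allclose(cov, np.cov(A, b, rowvar=False)[-1, :-1])
assert np.allclose(corr, np.corrcoef(A, b, rowvar=False)[-1, :-1])
print(cov)
print(corr)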

How to compare a search query to SVD resulting 'w' matrix

I am working on developing a search algorithm and I am struggling to understand how to actually use the results of a singular value decomposition ( u,w,vt = svd(a) ) reduction on a term-document matrix.
For example, say I have an M x N matrix as follows, where each column represents a document vector (the count of each term in that document):
a = [[ 0, 0, 1 ],
     [ 0, 1, 2 ],
     [ 1, 1, 1 ],
     [ 0, 2, 3 ]]
Now, I could run a tf-idf function on this matrix to generate a score for each term/document value, but for the sake of clarity, I will ignore that.
SVD Results
Upon running SVD on this matrix, I end up with the following vector 'w' (the diagonal of the singular value matrix):
import svd
u,w,vt = svd.svd(a)
print w
# [4.545183973611469, 1.0343228430392626, 0.5210363733873331]
I understand more or less what this represents (thanks to a lot of reading, and particularly this article: https://simonpaarlberg.com/post/latent-semantic-analyses/), however I can't figure out how to relate this resulting 'approximation' back to the original documents. What do these weights represent? How can I use this result in my code to find documents related to a term query?
Basically... How do I use this?
The rank-r SVD reduces a rank-R MxN matrix A into r orthogonal rank-1 MxN matrices (u_n * s_n * v_n'). If you use these singular values and vectors to reconstruct the original matrix, you will obtain the best rank-r approximation of A.
Instead of storing the full matrix A, you just store the u_n, s_n, and v_n. (A is MxN, but U is Mxr, S can be stored in one dimension as r elements, and V' is rxN).
To approximate A * x, you simply compute (U * (S * (V' * x))) [Mxr x rxr x rxN x Nx1]. You can speed this up further by storing (U * S) instead of U and S separately.
So what do the singular values represent? In a way, they represent the energy of each rank-1 matrix. The higher a singular value is, the more its associated rank-1 matrix contributes to the original matrix, and the worse your reconstruction will be if that component is left out when you truncate.
Note that this procedure is closely related to Principal Component Analysis, which is performed on covariance matrices and is commonly used in machine learning to reduce the dimensionality of measured N-dimensional variables.
Additionally, it should be noted that the SVD is useful for many other applications in signal processing.
More information is in the Wikipedia article on singular value decomposition.
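As a hedged, concrete illustration of the rank-r reconstruction and the U * (S * (V' * x)) trick, using numpy.linalg.svd rather than the svd module from the question (the truncation rank and the query vector x are made up for the example):
import numpy as np

# The term-document matrix from the question (4 terms x 3 documents).
a = np.array([[0, 0, 1],
              [0, 1, 2],
              [1, 1, 1],
              [0, 2, 3]], dtype=float)

U, s, Vt = np.linalg.svd(a, full_matrices=False)
print(s)                               # the singular values ('w' in the question)

r = 2                                  # keep the r strongest components
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

# Best rank-r approximation of a (in the least-squares sense).
a_approx = U_r @ np.diag(s_r) @ Vt_r

# Approximate a @ x without rebuilding the full matrix.
x = np.array([1.0, 0.0, 1.0])          # an example query/weight vector of length N
y = U_r @ (s_r * (Vt_r @ x))           # equals a_approx @ x
print(np.allclose(y, a_approx @ x))    # True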
