I'm trying to understand logistic and linear regression and was able to follow the theory behind them (I'm doing the Andrew Ng course).
We have X -> the given features -> a matrix of shape (m, n+1), where m is the number of cases and n the number of given features (excluding x0).
We have y -> the label to predict -> a matrix of shape (m, 1).
Now, while implementing it from scratch in Python, I'm confused as to why we use the transpose of theta in the sigmoid function.
We also use theta transpose X for linear regression.
We don't seem to have to perform matrix multiplication anywhere while coding; it looks like straight element-to-element coding. So what is the need for the transpose? Or is my understanding wrong, and we do need matrix multiplication during implementation?
My main concern is that I'm very confused about where we do matrix multiplication and where we do element-wise multiplication in logistic and linear regression.
You are a bit off topic for this area, but the piece you appear to be hung up on is the treatment of x and Theta.
In the use cases you describe, x is a vector of inputs, or the "feature vector", and Theta is the vector of coefficients. Both are usually expressed as column vectors and, of course, must be of the same dimension.
So to "make a prediction" you need the inner product of these two, and the output needs to be a scalar (by definition of the inner product), so you need to transpose the theta vector in order to properly express that operation, which is a matrix multiplication of two vectors. Make sense?
For matrix multiplication, the number of columns in the first factor must equal the number of rows in the second. Since one of the factors you're multiplying has either one column or one row, it may not look like matrix multiplication because of its simplicity, but it still is matrix multiplication.
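To make that concrete, here is a minimal numpy sketch (the values are made up purely for illustration): with theta and x stored as (n+1, 1) column vectors, theta.T @ x is a (1, n+1) by (n+1, 1) matrix multiplication whose result is effectively the scalar prediction.

import numpy as np

theta = np.array([[0.5], [1.0], [-2.0]])   # (3, 1) column vector of coefficients
x = np.array([[1.0], [3.0], [0.7]])        # (3, 1) column vector, x0 = 1

z = theta.T @ x            # (1, 3) @ (3, 1) -> (1, 1)
print(z.shape)             # (1, 1)
print(z.item())            # 0.5*1 + 1.0*3 - 2.0*0.7 = 2.1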
Let me provide an example.
Let A be an (m, n) matrix.
We can perform scalar multiplication with some fixed real number a.
If we want to multiply A by some vector x, we need to meet some restrictions. Here it is common to mistake the dot product for matrix multiplication, but they serve completely different purposes.
So the restriction for multiplying an (m, n) matrix A by a vector x is that x must have the same number of entries as A has columns. To satisfy this in your example, one of the factors needed to be transposed.
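As a rough numpy sketch of that restriction (the shapes here are just illustrative): stacking all m examples as rows of X with shape (m, n+1) and keeping theta as an (n+1, 1) column vector, X @ theta is a valid matrix product giving the (m, 1) vector of hypotheses, while the sigmoid is then applied element-wise.

import numpy as np

m, n = 5, 2
X = np.hstack([np.ones((m, 1)), np.random.rand(m, n)])  # (m, n+1), first column is x0 = 1
theta = np.zeros((n + 1, 1))                            # (n+1, 1)

h = X @ theta                    # matrix multiplication: (m, n+1) @ (n+1, 1) -> (m, 1)
g = 1 / (1 + np.exp(-h))         # sigmoid, applied element-wise to each entry of h
print(h.shape, g.shape)          # (5, 1) (5, 1)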
Related
I have a basis set of square matrices and a data set that I need to find the coefficients for, given that my data is a linear sum of the basis set.
def basis(a, b, c):
    return a * gam1 + b * gam2 + c * kapp + allsky
So data = basis(a, b, c) and I need the best-fit (least-squares) values of the coefficients a, b and c. The basis matrices and the data matrix are all square 89x89 matrices. I have tried using np.linalg.lstsq, however since my A matrix would need to be a matrix of the 4 basis matrices, the array dimension becomes 4 and it throws an error stating the array dimension must be 2. Any help is appreciated.
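One common way to set up this kind of fit (a sketch, assuming the problem really is linear in a, b, c; the names gam1, gam2, kapp, allsky and data come from the question, and the dummy values below exist only to make it runnable): flatten each 89x89 matrix into a column of length 89*89, so the design matrix is 2-D and np.linalg.lstsq accepts it.

import numpy as np

rng = np.random.default_rng(0)
gam1, gam2, kapp, allsky = rng.normal(size=(4, 89, 89))   # stand-ins for the real matrices
data = 2.0 * gam1 - 0.5 * gam2 + 3.0 * kapp + allsky      # fabricated data with known coefficients

# flatten each basis matrix into one column -> A has shape (89*89, 3)
A = np.column_stack([gam1.ravel(), gam2.ravel(), kapp.ravel()])
rhs = (data - allsky).ravel()                             # move the constant term to the right-hand side

coeffs, *_ = np.linalg.lstsq(A, rhs, rcond=None)
print(coeffs)                                             # ~ [2.0, -0.5, 3.0]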
I'm new to numpy and found this strange (at least to me) behavior.
I'm implementing the logistic regression cost function; here I have 2 column vectors with the same dimension and the same dtype (float). y contains a bunch of zeros and ones, and a contains floats in the range (-1, 1).
At some point I need their dot product, so I transpose one and multiply them:
x = y.T @ a
But when I use
x = y @ a.T
the performance occasionally decreases by about 3 times, while the results are the same.
Why is this so? Aren't the operations the same?
Thanks.
The performance decreases, and you get a very different answer!
For vector multiplication (unlike number multiplication) a @ b != b @ a. In your case (assuming column vectors), a.T @ b is a number, but a @ b.T is a full-blown matrix! So, if your vectors are both of shape (y, 1), the last operation will result in a (y, y) matrix, which may be pretty huge. Of course, it'll take way more time to compute such a matrix (a.k.a. add a whole lot of numbers and produce a whole lot of numbers) than to add a bunch of numbers and produce one single number.
That's how matrix (or vector) multiplication works.
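A small numpy sketch of the difference (the vector length is arbitrary):

import numpy as np

n = 5000
y = np.random.randint(0, 2, size=(n, 1)).astype(float)  # column vector of zeros and ones
a = np.random.uniform(-1, 1, size=(n, 1))               # column vector of floats in (-1, 1)

inner = y.T @ a     # (1, n) @ (n, 1) -> (1, 1): a single number
outer = y @ a.T     # (n, 1) @ (1, n) -> (n, n): n*n numbers, far more work
print(inner.shape, outer.shape)   # (1, 1) (5000, 5000)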
I'm currently learning the basics of data science online. In one of the sessions on multiple linear regression using Python, the tutor executed the step below to add an array of ones to the matrix of features; I did not understand why it is being added. On online forums, it is mentioned that it is added so that the model (equation) has a constant offset. But why 1 and not any other value? Does the number of independent variables (3) have any impact on this value?
X -> Matrix of features; number of rows in the data set: 50; number of independent variables: 3
X = np.append(arr=np.ones([50, 1]).astype(int), values=X, axis=1)
To better explain, let's imagine you have only 1 feature and, say, 3 training examples.
Then, your parameters are Theta = [theta_0, theta_1]^T, a (2, 1) column vector.
And your input variables are X = [x^(1), x^(2), x^(3)]^T, a (3, 1) column vector.
If you want to realize a linear regression, you must compute, for each training example i: theta_0 + theta_1 * x^(i).
And if you need to vectorize the calculation (for efficiency and code readability), you want to compute the matrix product X * Theta.
However, by definition of the matrix product, the number of columns of matrix X must be the same as the number of rows of matrix Theta. Thus, to make the product defined while leaving the result unchanged, you add a column of ones to the left of matrix X, so that row i of X becomes [1, x^(i)].
Then, the result for each sample i is: 1 * theta_0 + x^(i) * theta_1 = theta_0 + theta_1 * x^(i).
TLDR: You need to append a column of ones to X for the matrix product X * Theta to be defined. If you were adding any other constant c instead of 1, then your constant offset theta_0 would be multiplied by that coefficient c.
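A small numpy sketch of this (the values are made up): after prepending the column of ones, X @ theta reproduces theta_0 + theta_1 * x for every example.

import numpy as np

x = np.array([[2.0], [5.0], [7.0]])    # 3 training examples, 1 feature
theta = np.array([[10.0], [3.0]])      # [theta_0, theta_1]^T

X = np.hstack([np.ones((3, 1)), x])    # prepend the column of ones -> shape (3, 2)
print(X @ theta)                       # [[16.], [25.], [31.]], i.e. theta_0 + theta_1 * x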
I think the cost function is the summation of errors between the predicted label and the actual label, which we want to minimise. The function given above is the hypothesis function, not the cost.
The tutorial on MNIST for ML Beginners, in Implementing the Regression, shows how to write the regression on a single line, followed by an explanation that mentions the use of a trick (emphasis mine):
y = tf.nn.softmax(tf.matmul(x, W) + b)
First, we multiply x by W with the expression tf.matmul(x, W). This is flipped from when we multiplied them in our equation, where we had Wx, as a small trick to deal with x being a 2D tensor with multiple inputs.
What is the trick here, and why are we using it?
Well, there's no trick here. That line simply flips the multiplication order used in the earlier equation:
# For a single example the equation is written with W on the left:
y = Wx + b
# If you want to use a batch of examples, you change the order of multiplication
# instead of inserting another transpose op:
y = xW + b
# hence
y = tf.matmul(x, W)
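A quick shape check (a sketch with arbitrary sizes) showing why the batched form uses tf.matmul(x, W): each example is a row of x, so the product must have W on the right.

import tensorflow as tf

batch, n_in, n_out = 32, 784, 10
x = tf.random.normal([batch, n_in])       # one example per row
W = tf.Variable(tf.zeros([n_in, n_out]))
b = tf.Variable(tf.zeros([n_out]))

y = tf.nn.softmax(tf.matmul(x, W) + b)    # (batch, n_in) @ (n_in, n_out) -> (batch, n_out)
print(y.shape)                            # (32, 10)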
Ok, I think the main point is that if you train in batches (i.e. train with several instances of the training set at once), TensorFlow always assumes that the zeroth dimension of x indicates the number of events per batch.
Suppose you want to map a training instance of dimension M to a target instance of dimension N. You would typically do this by multiplying x (a column vector) by an NxM matrix (and, optionally, adding a bias of dimension N, also a column vector), i.e.
y = W*x + b, where y is also a column vector.
This is perfectly alright seen from the perspective of linear algebra. But now comes the point with the training in batches, i.e. training with several training instances at once.
To get to understand this, it might be helpful to not view x (and y) as vectors of dimension M (and N), but as matrices with the dimensions Mx1 (and Nx1 for y).
Since TensorFlow assumes that the different training instances constituting a batch are aligned along the zeroth dimension, we get into trouble here since the zeroth dimension is occupied by the different elements of one single instance.
The trick is then to transpose the above equation (remember that transposition of a product also switches the order of the two transposed objects):
y^T = x^T * W^T + b^T
This is pretty much what has been described in short within the tutorial.
Note that y^T is now a matrix of dimension 1xN (practically a row vector), while x^T is a matrix of dimension 1xM (also a row vector). W^T is a matrix of dimension MxN. In the tutorial, they did not write x^T or y^T, but simply defined the placeholders according to this transposed equation. The only point that is not clear to me is why they did not define b the "transposed way". I assume that the + operator automatically broadcasts b so that the dimensions come out right.
The rest is now pretty easy: if you have batches of more than 1 instance, you just "stack" multiple of the x (1xM) matrices, say into a matrix of dimensions (AxM) (where A is the batch size). b will hopefully be broadcast automatically over this number of events (that is, to a matrix of dimension (AxN)). If you then use
y^T = x^T * W^T + b^T,
you will get a (AxN) matrix of the targets for each element of the batch.
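A minimal numpy sketch of that batched bookkeeping (sizes are arbitrary): b stays a plain length-N vector and broadcasting, rather than an explicit transpose, handles the addition across the batch.

import numpy as np

A, M, N = 4, 3, 2              # batch size, input dimension, output dimension
xT = np.random.rand(A, M)      # A stacked row vectors x^T
WT = np.random.rand(M, N)      # W^T
b = np.random.rand(N)          # length-N bias vector

yT = xT @ WT + b               # (A, M) @ (M, N) -> (A, N); b broadcasts over the A rows
print(yT.shape)                # (4, 2)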
I am trying to make a hack of tf.gradient in tensorflow that would give, for a tensor y of rank (M,N) and a tensor x of rank (Q,P) a gradient tensor of rank (M,N,Q,P) as one would naturally expect.
As pointed out in multiple questions on this site*, what one gets is a rank (Q,P) tensor which is the gradient of the sum of the elements of y. Now what I can't figure out, looking into the TensorFlow code, is where that sum over the elements of y is made. Is it at the beginning or at the end? Could someone help me pinpoint the lines of code where that is done?
*
Tensorflow gradients: without automatic implicit sum
TensorFlow: Compute Hessian matrix (and higher order derivatives)
Unaggregated gradients / gradients per example in tensorflow
Separate gradients in tf.gradients
I've answered it here, but I'm guessing it's not very useful because you can't use this knowledge to differentiate with respect to a non-scalar y. The scalar-y assumption is central to the design of the reverse AD algorithm, and there's not a single place you can modify to support non-scalar ys. Since this confusion keeps coming up, let me go into a bit more detail as to why it's non-trivial:
First of all, how reverse AD works -- suppose we have a function f that is a composition of component functions f_i. Each component function takes a vector of length n and produces a vector of length n.
Its derivative can then be expressed as a sequence of matrix multiplications: if f = f_k ∘ ... ∘ f_1 (with f_1 applied first), then df/dx = J_k * J_{k-1} * ... * J_1, where J_i is the Jacobian of the component function f_i.
When differentiating, function composition becomes matrix multiplication of corresponding component function Jacobians.
Note that this involves matrix/matrix products, which proves to be too expensive for neural networks. I.e., AlexNet contains 8k activations in its convnet->fc transition layer; doing matrix multiplies where each matrix is 8k x 8k would take too long. The trick that makes it efficient is to assume that the last function in the chain produces a scalar. Then its Jacobian is a vector, and the whole thing can be rewritten in terms of vector-matrix multiplies instead of matrix-matrix multiplies.
This product can be computed efficiently by multiplying from left to right, so every step is a vector-matrix multiply against an nxn matrix instead of an nxn matrix-matrix multiply.
You can make it even more efficient by never forming those nxn derivative matrices in the first place, and instead associating each component function with an op that does the vector x Jacobian-matrix product implicitly. That's what TensorFlow's tf.RegisterGradient does: the "grad" registered for an op plays that role for its component function.
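A toy numpy sketch of that left-to-right accumulation (random Jacobians, purely to show the shapes involved): starting from the 1xn Jacobian of the final scalar-valued function, every step is a vector-matrix product.

import numpy as np

n, depth = 100, 5
jacobians = [np.random.rand(n, n) for _ in range(depth)]  # J_1 ... J_depth of the intermediate functions
v = np.random.rand(1, n)                                  # Jacobian of the final scalar-valued function

grad = v
for J in reversed(jacobians):
    grad = grad @ J            # (1, n) @ (n, n) -> (1, n); never an (n, n) @ (n, n) product

print(grad.shape)              # (1, 100)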
Now, this is done for vector-valued functions; what if your functions are matrix-valued? This is a typical situation we deal with in neural networks. I.e., in a layer that does a matrix multiply, the matrix you multiply by is an unknown and it is matrix-valued. In that case, the last derivative has rank 2, and the remaining derivatives have rank 3.
Now to apply the chain rule you'd have to deal with extra notation, because "x" in the chain rule now means matrix multiplication generalized to tensors of rank 3.
However, note that we never have to do that multiplication explicitly, since we are using a grad operator. So in practice this operator takes values of rank 2 and produces values of rank 2.
So in all of this there's an assumption that the final target is a scalar, which allows fully connected layers to be differentiated by passing matrices around.
If you want to extend this to support a non-scalar y, you would need to modify the reverse AD algorithm to propagate more info. I.e., for fully connected feed-forward nets you would propagate rank-3 tensors around instead of matrices.
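To see the implicit sum concretely, here is a small TF2 check (the shapes are arbitrary): the gradient of a non-scalar y with respect to x equals the gradient of tf.reduce_sum(y), while tape.jacobian returns the full (M, N, Q, P) object the question asks for.

import tensorflow as tf

x = tf.Variable(tf.random.normal([4, 3]))      # shape (Q, P) = (4, 3)
with tf.GradientTape(persistent=True) as tape:
    y = tf.sin(x) @ tf.ones([3, 2])            # shape (M, N) = (4, 2)
    s = tf.reduce_sum(y)

print(tape.gradient(y, x).shape)               # (4, 3): implicitly the gradient of sum(y)
print(tape.gradient(s, x).shape)               # (4, 3): same values as above
print(tape.jacobian(y, x).shape)               # (4, 2, 4, 3): the full rank-4 gradient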
With the jacobian function in Tensorflow 2, this is a straightforward task.
# assuming `layer` is a Keras layer and `x` an input batch, both defined earlier
with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = layer(x)
        loss = tf.reduce_mean(y ** 2)
    first_order_gradient = tape2.gradient(loss, layer.trainable_weights)
hessian = tape1.jacobian(first_order_gradient, layer.trainable_weights)
https://www.tensorflow.org/guide/advanced_autodiff#hessian