I'm new to numpy and have run into behavior that seems strange to me.
I'm implementing the logistic regression cost function. I have 2 column vectors with the same dimension and the same type (dfloat). y contains a bunch of zeros and ones, and a contains floats in the range (-1, 1).
At some point I need the dot product, so I transpose one and multiply them:
x = y.T @ a
But when I use
x = y @ a.T
performance occasionally drops by about a factor of 3, while the results are the same.
Why is this so? Aren't the operations the same?
Thanks.
The performance decreases, and you get a very different answer!
For vector multiplication (unlike number multiplication) a @ b != b @ a. In your case (assuming column vectors), a.T @ b is a single number (a 1x1 array), but a @ b.T is a full-blown matrix! So, if your vectors are both of shape (n, 1), the last operation will result in an (n, n) matrix, which may be pretty huge. Of course, it takes far more time to compute such a matrix (i.e. to produce n*n numbers) than to multiply pairs of numbers, add them up, and produce one single number.
That's how matrix (or vector) multiplication works.
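A minimal sketch of the difference, assuming column vectors of shape (n, 1) filled with random data (the names n and rng and the values are illustrative, not from the question):

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(n, 1)).astype(float)  # column vector of zeros and ones
a = rng.uniform(-1, 1, size=(n, 1))                 # column vector of floats in (-1, 1)

inner = y.T @ a   # (1, n) @ (n, 1) -> shape (1, 1): one number
outer = y @ a.T   # (n, 1) @ (1, n) -> shape (n, n): n*n numbers

print(inner.shape, outer.shape)
# The inner product is just the sum of the elementwise products
print(np.allclose(inner[0, 0], np.sum(y * a)))
```

Printing the shapes is usually the quickest way to notice that the two expressions are not the same operation.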
I'm trying to understand logistic and linear regression and was able to understand the theory behind them (I'm doing Andrew Ng's course).
We have X, the given features: a matrix of shape (m, n+1), where m is the number of cases and n is the number of features (excluding x0).
We have y, the label to predict: a matrix of shape (m, 1).
Now, while implementing it from scratch in Python, I'm confused as to why we use the transpose of theta in the sigmoid function.
We also use theta transpose X for linear regression.
We do not seem to perform matrix multiplication anywhere while coding; it's straight element-by-element code. So what is the need for the transpose? Or is my understanding wrong, and we do need matrix multiplication in the implementation?
My main concern is that I'm very confused about where we do matrix multiplication and where we do element-wise multiplication in logistic and linear regression.
You are a bit off topic for this area, but the piece you appear to be hung up on is the treatment of x and Theta.
In the use cases you describe, x is a vector of inputs, or the "feature vector". The Theta vector is the vector of coefficients. Both are usually expressed as column vectors and of course, must be of the same dimension.
So to "make a prediction" you need the inner product of these two, and the output needs to be a scalar (by definition for inner product) so you need to transpose the theta vector in order to properly express that operation, which is a matrix multiplication of two vectors. Make sense?
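As a rough sketch of where each kind of multiplication tends to show up, assuming the shapes from the question (X is (m, n+1), theta and each x are column vectors) and made-up random data; the helper sigmoid and all variable names here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, n = 5, 3
rng = np.random.default_rng(0)
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # (m, n+1), first column is x0 = 1
theta = rng.normal(size=(n + 1, 1))                         # (n+1, 1) coefficients
y = rng.integers(0, 2, size=(m, 1)).astype(float)           # (m, 1) labels

# Matrix multiplication: the inner product theta^T x for one example ...
x = X[0].reshape(-1, 1)      # one example as a column vector, (n+1, 1)
z_single = theta.T @ x       # (1, n+1) @ (n+1, 1) -> (1, 1)

# ... and the same inner product for all m examples at once
z_all = X @ theta            # (m, n+1) @ (n+1, 1) -> (m, 1)
h = sigmoid(z_all)           # sigmoid is applied elementwise

# Elementwise multiplication: each label is paired with its own prediction
cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
print(z_single.shape, z_all.shape, cost)
```

In this sketch the transpose only appears in the single-example form; stacking the examples as rows of X makes X @ theta compute every inner product in one call.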
For matrix multiplication, the number of columns in the first element must equal the number of rows in the second element. Since one of the elements you're multiplying has either one column or one row, it does not appear to be matrix multiplication due to its simplicity. But it still is matrix multiplication.
Let me provide an example.
Let A be an (m, n) matrix.
We can perform scalar multiplication, for some fixed a in the real numbers.
If we want to multiply A by some vector x, we need to meet some restrictions. Here it is common to mistake the dot product for matrix multiplication, but they serve completely different purposes.
So our restriction for multiplying an (m, n) matrix A by a vector x is that x has the same number of entries as A has columns. To do this in your example, one of the elements needed to be transposed.
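The shape rule can be checked directly in numpy. This is only an illustrative sketch with a made-up 2x3 matrix; the variable mismatch_raised is a hypothetical flag used to record the failure:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # (m, n) = (2, 3)
x = np.array([1.0, 0.0, -1.0])     # n = 3 entries, matches the columns of A

scaled = 2.0 * A                   # scalar multiplication: every entry is scaled
product = A @ x                    # matrix-vector product, shape (m,) = (2,)

mismatch_raised = False
try:
    A @ np.ones(2)                 # 2 entries, but A has 3 columns
except ValueError:
    mismatch_raised = True         # numpy refuses the mismatched shapes

print(product, mismatch_raised)
```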
I'm going to be doing some geometric calculations involving 2-D and 3D points using numpy.
What is the canonical representation of a 2-D or 3-D point? Please assume minimal familiarity with numpy, data shapes, etc.
The representation of a single point in Cartesian space is somewhat trivial. You could even use flat tuples or lists to represent them and matrix operations would still work, but if you want to add or scale them (which is fundamentally what linear spaces are for) you have to use arrays. I don't see a reason not to use a 1d array with shape (d,) in d dimensions: you can use those both as column and row vectors on either side of a matrix using the @ matmul operator:
import numpy as np
rot90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]]) # rotate 90 degrees around z
inp = np.array([1, 0, 0]) # x
# rotate:
inp_rot = rot90 @ inp  # y
# inverse transform:
inp_invrot = inp @ rot90  # -y
A much better question is how to represent collections of points in Cartesian space. If you have N points you will probably want to use a 2d array. But which shape should it be, (N, d) or (d, N)? The answer depends on your use case but without further input you'll want to choose (N, d).
Arrays in numpy are "C-contiguous" by default, which is also called row-major memory layout. This means that on creation an array occupies a contiguous block of memory by default, and items are laid out in memory row after row, with these indices as an example:
>>> np.arange(2*3).reshape(2, 3)
array([[0, 1, 2],
       [3, 4, 5]])
One of the reasons we use numpy is that a contiguous block of memory for a given type occupies much less space than a native python container of the same size, at least for large datasets. The other reason is that we can use vectorized operations that work on slices of the input "simultaneously". The quotes are there because fundamentally the hands of the CPU are bound, but it turns out that you can achieve quite some speedup by making good use of CPU caches. And this is where memory layout comes into play: by using operations on an array that access elements close in memory you have a higher chance of making use of caching, and the reduced communication between RAM and CPU will lead to shorter runtimes.
The problem is not trivial, because vectorizing along larger non-contiguous dimensions might end up faster than vectorizing along smaller contiguous ones. However, without any additional information it's a good rule of thumb to put those dimensions last where you are likely to perform vectorized operations and reductions such as .mean() or .sum(). In case of N points in d-dimensional space it's quite likely that you will want to handle each point separately. Loops in matrix multiplications and things like scalar products and vector norms will all want you to work with one component after the other for a given point.
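A small sketch of what working with an (N, d) collection can look like, reusing the rotation matrix from above with three made-up points (the array points and its values are illustrative assumptions):

```python
import numpy as np

rot90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])  # rotate 90 degrees around z
points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [1.0, 1.0, 1.0]])                   # shape (N, d) = (3, 3)

norms = np.linalg.norm(points, axis=-1)  # reduce over the last (Cartesian) axis -> (N,)
centroid = points.mean(axis=0)           # reduce over the batch axis -> (d,)
rotated = points @ rot90.T               # row i is rot90 @ points[i]; shape (N, d)

print(norms)
print(rotated)
```

Note the transpose in points @ rot90.T: with points stored as rows, the matrix acts from the right, which is the same reordering that shows up whenever a single-vector formula is applied to a batch.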
This is why you will see numpy and scipy functions usually assume arrays of shape (N, d): the inner dimension is second and the "batch" index is first. Consider for example numpy.linalg.eig:
Parameters:
a : (…, M, M) array
Matrices for which the eigenvalues and right eigenvectors will be computed
Returns:
w : (…, M) array
The eigenvalues, each repeated according to its multiplicity. The eigenvalues
are not necessarily ordered. The resulting array will be of complex type,
unless the imaginary part is zero in which case it will be cast to a real
type. When a is real the resulting eigenvalues will be real (0 imaginary
part) or occur in conjugate pairs
[...]
It treats multidimensional arrays as batches of matrices, where the last two indices correspond to the Cartesian indices. Similarly the returned eigenvalues and eigenvectors have batch indices first and vector space indices last.
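For instance, a stack of two made-up 2x2 diagonal matrices (chosen only so the eigenvalues are easy to read off) can be passed in one call:

```python
import numpy as np

# A batch of two 2x2 matrices: shape (2, 2, 2), batch index first
batch = np.stack([np.diag([1.0, 2.0]), np.diag([3.0, 4.0])])

w, v = np.linalg.eig(batch)
print(w.shape)  # one row of eigenvalues per matrix in the batch
print(v.shape)  # one matrix of eigenvectors per matrix in the batch

# Each matrix satisfies A @ V == V * w, checked here for the whole batch at once
print(np.allclose(batch @ v, v * w[:, None, :]))
```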
A more direct example is scipy.spatial.distance.pdist which computes the distance between pairs of points in a collection:
Parameters
X : ndarray
An m by n array of m original observations in an n-dimensional space.
[...]
Again you can see the convention that Cartesian indices are last. The same goes for scipy.interpolate.griddata and probably a bunch of other functions.
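To show why the (N, d) layout is convenient for this kind of computation, here is a numpy-only sketch of all pairwise Euclidean distances using broadcasting. It is not how scipy implements pdist, and it returns a full (N, N) matrix rather than pdist's condensed form; the points are made up:

```python
import numpy as np

points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])   # shape (N, d)

# (N, 1, d) - (1, N, d) broadcasts to (N, N, d): every pair of points
diff = points[:, None, :] - points[None, :, :]

# Reduce over the last (Cartesian) axis to get one distance per pair
dist = np.sqrt((diff ** 2).sum(axis=-1))   # shape (N, N)

print(dist)
```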
So if you have a good reason to use either representation: do that. But if you don't have a good indicator (such as the results of profiling both representations) you should stick with the "batch of vectors/matrices" approach usually employed by numpy and scipy (shape (N, d)), because you might even end up using some of these functions, for which your representation will then be native.
Represent them in your source code as tuples or lists, e.g. (1, 0) or [1, 0, 1].
As per this example from scipy:
>>> from scipy.spatial import distance
>>> distance.euclidean([1, 0, 0], [0, 1, 0])
1.4142135623730951
The tutorial on MNIST for ML Beginners, in Implementing the Regression, shows how to make the regression on a single line, followed by an explanation that mentions the use of a trick (emphasis mine):
y = tf.nn.softmax(tf.matmul(x, W) + b)
First, we multiply x by W with the expression tf.matmul(x, W). This is flipped from when we multiplied them in our equation, where we had Wx, as a small trick to deal with x being a 2D tensor with multiple inputs.
What is the trick here, and why are we using it?
Well, there's no trick here. That line just refers to the order of multiplication in the earlier equation:
# With this order of W and x, the equation handles a single example
y = Wx + b
# To handle a batch of examples, change the order of multiplication instead of adding another transpose op
y = xW + b
# hence
y = tf.matmul(x, W)
Ok, I think the main point is that if you train in batches (i.e. train with several instances of the training set at once), TensorFlow always assumes that the zeroth dimension of x indicates the number of events per batch.
Suppose you want to map a training instance of dimension M to a target instance of dimension N. You would typically do this by multiplying x (a column vector) with an NxM matrix (and, optionally, adding a bias of dimension N, also a column vector), i.e.
y = W*x + b, where y is also a column vector.
This is perfectly alright seen from the perspective of linear algebra. But now comes the point with the training in batches, i.e. training with several training instances at once.
To get to understand this, it might be helpful to not view x (and y) as vectors of dimension M (and N), but as matrices with the dimensions Mx1 (and Nx1 for y).
Since TensorFlow assumes that the different training instances constituting a batch are aligned along the zeroth dimension, we get into trouble here since the zeroth dimension is occupied by the different elements of one single instance.
The trick is then to transpose the above equation (remember that transposition of a product also switches the order of the two transposed objects):
y^T = x^T * W^T + b^T
This is pretty much what has been described in short within the tutorial.
Note that y^T is now a matrix of dimension 1xN (practically a row vector), while x^T is a matrix of dimension 1xM (also a row vector). W^T is a matrix of dimension MxN. In the tutorial, they did not write x^T or y^T, but simply defined the placeholders according to this transposed equation. The only point that is not clear to me is why they did not define b the "transposed way". I assume that the + operator automatically transposes b if it is necessary in order to get the correct dimensions.
The rest is now pretty easy: if you have batches larger than 1 instance, you just "stack" several of the x (1xM) matrices, say into a matrix of dimensions AxM (where A is the batch size). b will hopefully be broadcast automatically to this number of events (that is, to a matrix of dimension AxN). If you then use
y^T = x^T * W^T + b^T,
you will get a (AxN) matrix of the targets for each element of the batch.
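The same bookkeeping can be checked in plain numpy, used here as a stand-in for TensorFlow. The sizes M, N, A and the random data are illustrative assumptions, and W is stored as NxM as in the text (so the product uses W.T, whereas the tutorial stores the already transposed MxN matrix):

```python
import numpy as np

M, N, A = 4, 3, 5                     # input dim, output dim, batch size
rng = np.random.default_rng(0)
W = rng.normal(size=(N, M))           # NxM weight matrix
b = rng.normal(size=N)                # bias of dimension N

# Single instance, column-vector convention: y = W x + b
x_col = rng.normal(size=(M, 1))
y_col = W @ x_col + b.reshape(N, 1)   # (N, 1)

# Batch convention: instances are rows, so y^T = x^T W^T + b^T
x_batch = rng.normal(size=(A, M))
x_batch[0] = x_col[:, 0]              # put the single instance in row 0
y_batch = x_batch @ W.T + b           # (A, N); b is broadcast over the rows

print(y_batch.shape)
print(np.allclose(y_batch[0], y_col[:, 0]))
```

In numpy a 1-D b of length N is broadcast across the A rows without any explicit transpose, which is consistent with the guess above about the + operator.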
Given this...
I have to explain what this code does, knowing that it performs the vectorized evaluation of F, using broadcasting and element wise operations concepts...
def F(x_pos, alpha):
    D = x_pos.reshape(1,-1) - x_pos.reshape(-1,1)
    return (1./alpha) * (alpha.reshape(1,-1) * R(D)).sum(axis=1)
My explanation is:
In the first line, the function F receives x_pos and alpha as parameters (both numpy arrays). In the second line, the matrix D is calculated by means of broadcasting (basic operations such as addition on numpy arrays are performed elementwise, i.e. element by element, but this is also possible with arrays of different sizes if numpy can transform them into others of the same size; this conversion is called broadcasting): an array of order Nx1 is subtracted from another of order 1xN, resulting in the matrix D of order NxN containing x_j - x_1, x_j - x_2, etc. as elements. Finally, in the last line the reciprocal of alpha is calculated (which clearly is an array), and each of its elements is multiplied by the sum of the R evaluation of each cell of the matrix D multiplied by alpha_j horizontally (due to axis=1 in the argument).
Questions:
Considering I'm new to Python, is my explanation OK?
Does the code have an error or not? I don't see where the condition "j must be different from 1, 2, ..., n" in each sum is taken into account in the code. If it is in fact wrong, how can I fix the code so that it does exactly what is stated in the image?
A few comments, improvements, and fixes could be suggested here.
1] The first step could alternatively be done by just introducing a new axis and subtracting, like so -
D = x_pos - x_pos[:,None]
With the operands in this order, D[i, j] = x_j - x_i, as in the original code. In my opinion, this is a cleaner option. The performance benefit might be just marginal.
2] In the second line, I think it needs a fix, as we need to avoid computations for the diagonal elements of R(D). So, if I got that correctly, the corrected code would be -
vals = R(D)
np.fill_diagonal(vals,0)
out = (1./alpha) * (alpha.reshape(1,-1) * vals).sum(axis=1)
Now, let's make the code a bit more idiomatic/cleaner.
At that line, we could write : (alpha * vals) instead of alpha.reshape(1,-1) * vals. This is because the shapes are already aligned for broadcasting as shown in a schematic diagram below -
alpha : n
vals : n x n
Thus, alpha would be automatically extended to 2D with its elements broadcasted along the first axis for the length of vals and then elementwise multiplications being generated with it. Again, this is meant as a cleaner code.
There's a further performance improvement possible here: (alpha.reshape(1,-1) * vals).sum(axis=1) can be replaced with a matrix multiplication using np.dot, as vals.dot(alpha). (alpha.dot(vals) would sum over the other axis, so it gives the same result only when vals is symmetric.) The benefit on performance should be noticeable with this step.
So, the second step reduces to -
out = (1./alpha) * vals.dot(alpha)
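Putting the pieces together, here is a sketch that compares the vectorized version against a plain double loop. Two things are assumptions, since the formula in the question is only available as an image: the target is taken to be F_i = (1/alpha_i) * sum over j != i of alpha_j * R(x_j - x_i), and R is a hypothetical stand-in (np.exp), chosen because it is neither even nor zero at 0, so mistakes in the sign of D, the diagonal, or the summation axis would show up:

```python
import numpy as np

def R(d):
    # Hypothetical stand-in; the real R is not given in the post
    return np.exp(d)

def F(x_pos, alpha):
    D = x_pos - x_pos[:, None]   # D[i, j] = x_j - x_i, via broadcasting
    vals = R(D)
    np.fill_diagonal(vals, 0)    # drop the j == i terms
    return (1.0 / alpha) * vals.dot(alpha)   # row i: sum_j alpha_j * vals[i, j]

def F_loop(x_pos, alpha):
    # Reference implementation of the assumed formula, written as explicit loops
    n = len(x_pos)
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if j != i:
                out[i] += alpha[j] * R(x_pos[j] - x_pos[i])
        out[i] /= alpha[i]
    return out

x_pos = np.array([0.0, 0.5, 2.0])
alpha = np.array([1.0, 2.0, 4.0])
print(np.allclose(F(x_pos, alpha), F_loop(x_pos, alpha)))
```

Checking a vectorized rewrite against a slow loop on a tiny input like this is a cheap way to catch axis and sign mix-ups.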
I have N pairs of portfolio weights stored in a numpy array and would like to calculate portfolio risk, which is w * E * w_T, where w_T is the transpose of the weights. The way I came up with is to loop through each weight pair and apply the matrix multiplication. Is there a vectorized approach to this, such that given a weight pair (or, if possible, N weights that all sum to 1) I apply a single covariance matrix to each row to get the risk (i.e. without a loop)?
import numpy as np
w = np.array([[0.2,0.8],[0.5,0.5]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
w1 = w[0].reshape([1,2]) # each row in w
#portfolio risk
np.dot(np.dot(w1,covar),w1.T)
@Adam's answer is valid, but for big arrays it can result in very big temporary arrays (NxN) and unnecessary computations (computing the off-diagonal elements).
Here's a similar, yet much more efficient solution:
(I added another weight-pair, to distinguish between the different dimensions of the problem)
w = np.array([[0.2,0.8],[0.5,0.5], [0.33, 0.67]])
covar = np.array([0.000046,0.000017,0.000017,0.000032]).reshape([2,2])
(np.dot(w, covar) * w).sum(axis=-1)
=> array([ 2.77600000e-05, 2.80000000e-05, 2.68916000e-05])
By using plain multiplication in the second step, I'm avoiding the unnecessary computation of the off-diagonal elements.
EDIT: explaining the temporary arrays
# first multiplication (in both solutions)
np.dot(w, covar).shape
(3, 2)
# second, my solution
(np.dot(w, covar) * w).shape
(3, 2)
# second, Adam's solution
np.dot(np.dot(w,covar),w.T).shape
(3, 3)
Now, if you have N sets of weights you want to compute risk for (in this example N=3), and M instruments in your portfolio (here M=2), and N>>M, you get a much bigger array with Adam's solution (NxN). Not only will it consume more memory, but the computation that populates the off-diagonal elements is expensive (matrix multiplication) and unnecessary.
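As a quick check, using the same w and covar as above, the row-wise version, the diagonal of the full product, and an np.einsum formulation (an alternative not mentioned in the answers) can be compared directly:

```python
import numpy as np

w = np.array([[0.2, 0.8], [0.5, 0.5], [0.33, 0.67]])   # N=3 weight sets, M=2 instruments
covar = np.array([0.000046, 0.000017, 0.000017, 0.000032]).reshape([2, 2])

# Row-wise: only the N needed values are computed, temporaries are (N, M)
risk_rowwise = (np.dot(w, covar) * w).sum(axis=-1)

# Full product: builds an (N, N) matrix and keeps only its diagonal
risk_diag = np.diagonal(np.dot(np.dot(w, covar), w.T))

# einsum: sum over j and k of w[i, j] * covar[j, k] * w[i, k] for each row i
risk_einsum = np.einsum("ij,jk,ik->i", w, covar, w)

print(np.allclose(risk_rowwise, risk_diag), np.allclose(risk_rowwise, risk_einsum))
```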
It seems like your code is already set up for a vectorized approach, but you are only dealing with one row at a time. Grabbing the diagonals from the result when using your full weight matrix should give you what you want.
# portfolio risk
np.diagonal(np.dot(np.dot(w,covar),w.T))