I'm going to be doing some geometric calculations involving 2-D and 3D points using numpy.
What is the canonical representation of a 2-D or 3-D point? Please assume minimal familiarity with numpy, data shapes, etc.
The representation of a single point in Cartesian space is somewhat trivial. You could even use flat tuples or lists to represent them and matrix operations would still work, but if you want to add or scale them (which is fundamentally what linear spaces are for) you have to use arrays. I don't see a reason not to use a 1d array with shape (d,) in d dimensions: you can use those both as column and row vectors on either side of a matrix using the @ matmul operator:
import numpy as np
rot90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]]) # rotate 90 degrees around z
inp = np.array([1, 0, 0]) # x
# rotate:
inp_rot = rot90 @ inp  # y
# inverse transform:
inp_invrot = inp @ rot90  # -y
A much better question is how to represent collections of points in Cartesian space. If you have N points you will probably want to use a 2d array. But which shape should it be, (N, d) or (d, N)? The answer depends on your use case but without further input you'll want to choose (N, d).
Arrays in numpy are "C-contiguous" by default, which is also called row-major memory layout. This means that on creation an array occupies a contiguous block of memory by default, and items are laid out in memory row after row, with these indices as an example:
>>> np.arange(2*3).reshape(2, 3)
array([[0, 1, 2],
       [3, 4, 5]])
One of the reasons we use numpy is that a contiguous block of memory for a given type occupies much less space than a native python container of the same size, at least for large datasets. The other reason is that we can use vectorized operations that work on slices of the input "simultaneously". The quotes are there because fundamentally the hands of the CPU are bound, but it turns out that you can achieve quite some speedup by making good use of CPU caches. And this is where memory layout comes into play: by using operations on an array that access elements close in memory you have a higher chance of making use of caching, and the reduced communication between RAM and CPU will lead to shorter runtimes.
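As a small sketch of what row-major layout means in practice, you can inspect an array's flags and strides. The explicit int64 dtype is an assumption made here so that the stride values are the same on every platform:

```python
import numpy as np

# Explicit 8-byte integers so the strides below do not depend on the platform.
a = np.arange(2 * 3, dtype=np.int64).reshape(2, 3)

c_contiguous = a.flags["C_CONTIGUOUS"]    # rows are stored one after another
strides = a.strides                       # bytes to step along (row, column)
flat = a.ravel()                          # memory order of the elements
t_contiguous = a.T.flags["C_CONTIGUOUS"]  # the transpose is a view, not row-major

print(c_contiguous)   # True
print(strides)        # (24, 8): next row is 3 items away, next column is 1 item away
print(flat)           # [0 1 2 3 4 5]
print(t_contiguous)   # False
```

Stepping along the last axis moves 8 bytes, stepping along the first moves a whole row, which is why operations along the last axis touch neighbouring memory.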
The problem is not trivial, because vectorizing along larger non-contiguous dimensions might end up faster than vectorizing along smaller contiguous ones. However, without any additional information it's a good rule of thumb to put those dimensions last where you are likely to perform vectorized operations and reductions such as .mean() or .sum(). In case of N points in d-dimensional space it's quite likely that you will want to handle each point separately. Loops in matrix multiplications and things like scalar products and vector norms will all want you to work with one component after the other for a given point.
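To make the (N, d) convention concrete, here is a minimal sketch with three made-up 3-D points. Per-point quantities reduce over the last axis, while whole-collection quantities reduce over the first:

```python
import numpy as np

# Three illustrative points in 3-D, one point per row: shape (N, d) = (3, 3).
points = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 4.0, 0.0]])

# Rotation by 90 degrees around z, as in the single-point example.
rot90 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])

norms = np.linalg.norm(points, axis=-1)  # one norm per point, shape (N,)
centroid = points.mean(axis=0)           # average over points, shape (d,)
rotated = points @ rot90.T               # rotates every row, shape (N, d)

print(norms)     # [1. 2. 5.]
print(rotated)
```

Multiplying by rot90.T on the right applies rot90 to each row, so the batch index stays first and no loop over points is needed.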
This is why you will see numpy and scipy functions usually assume arrays of shape (N, d): the inner dimension is second and the "batch" index is first. Consider for example numpy.linalg.eig:
Parameters:
a : (…, M, M) array
Matrices for which the eigenvalues and right eigenvectors will be computed
Returns:
w : (…, M) array
The eigenvalues, each repeated according to its multiplicity. The eigenvalues
are not necessarily ordered. The resulting array will be of complex type,
unless the imaginary part is zero in which case it will be cast to a real
type. When a is real the resulting eigenvalues will be real (0 imaginary
part) or occur in conjugate pairs
[...]
It treats multidimensional arrays as batches of matrices, where the last two indices correspond to the Cartesian indices. Similarly the returned eigenvalues and eigenvectors have batch indices first and vector space indices last.
A more direct example is scipy.spatial.distance.pdist which computes the distance between pairs of points in a collection:
Parameters
X : ndarray
An m by n array of m original observations in an n-dimensional space.
[...]
Again you can see the convention that Cartesian indices are last. The same goes for scipy.interpolate.griddata and probably a bunch of other functions.
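For instance, a short sketch of pdist on an (N, d) array; the three 2-D points are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three illustrative points in 2-D, one observation per row: shape (3, 2).
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [0.0, 4.0]])

condensed = pdist(points)           # distances for pairs (0,1), (0,2), (1,2)
full = squareform(condensed)        # symmetric (N, N) distance matrix

print(condensed)    # [5. 4. 3.]
print(full.shape)   # (3, 3)
```

Because the points are already stored as (N, d), the array can be passed in directly without transposing.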
So if you have a good reason to use either representation: do that. But if you don't have a good indicator (such as the results of profiling both representations) you should stick with the "batch of vectors/matrices" approach usually employed by numpy and scipy (shape (N, d)), because you might even end up using some of these functions, for which your representation will then be native.
Represent them in your source code as tuples or lists, e.g. (1, 0) or [1, 0, 1].
As per this example from scipy:
>>> from scipy.spatial import distance
>>> distance.euclidean([1, 0, 0], [0, 1, 0])
1.4142135623730951
I'm new to numpy and found some behavior that seems strange to me.
I'm implementing a logistic regression cost function. I have two column vectors with the same dimension and the same type (dfloat). y contains a bunch of zeros and ones, and a contains floats in the range (-1, 1).
At some point I need the dot product, so I transpose one and multiply them:
x = y.T @ a
But when I use
x = y @ a.T
performance occasionally decreases about 3 times, while the results are the same.
Why is this so? Aren't the operations the same?
Thanks.
The performance decreases, and you get a very different answer!
For vector multiplication (unlike number multiplication) a @ b != b @ a. In your case (assuming column vectors), a.T @ b is a single number (a 1x1 array), but a @ b.T is a full-blown matrix! So, if your vectors are both of shape (y, 1), the last operation will result in a (y, y) matrix, which may be pretty huge. Of course, it'll take way more time to compute such a matrix (i.e. multiply a whole lot of numbers and produce a whole lot of numbers) than to multiply and add a bunch of numbers and produce one single number.
That's how matrix (or vector) multiplication works.
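A quick way to see the difference is to compare the shapes of the two products. This is only a sketch: the size n and the random data are assumptions chosen to resemble the question's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # illustrative size

# Column vectors of shape (n, 1), similar to the question's y and a.
y = rng.integers(0, 2, size=(n, 1)).astype(float)
a = rng.uniform(-1, 1, size=(n, 1))

inner = y.T @ a   # (1, n) @ (n, 1) -> shape (1, 1)
outer = y @ a.T   # (n, 1) @ (1, n) -> shape (n, n)

print(inner.shape, outer.shape)   # (1, 1) (1000, 1000)
```

The inner product needs n multiplications, while the outer product needs n * n and has to allocate an (n, n) result. The inner product equals the trace of the outer product, which is about the only thing the two share.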
I have a sparse matrix and another vector and I want to multiply the matrix and vector so that each column of the vector where it's equal to zero it'll zero the entire column of the sparse matrix.
How can I achieve that?
You didn't mention anything about how the array and matrix are defined, it can be assumed that those are numpy matrix and array...
Do you mean something like the following?
import numpy as np
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 2, 0], [0, 0, 3], [4, 0, 5]])
v = np.array([1, 0, 1])
print(A.dot(v))
If so, take a look here:
https://docs.scipy.org/doc/scipy/reference/sparse.html
The main problem is the size of your problem and the fact you're using Python which is on the order of 10-100x slower for matrix multiplication than some other languages. Unless you use something like Cython I don't see you getting an improvement.
If you don't like the speed of matrix multiplication, then you have to consider modification of the matrix attributes directly. But depending on the format that may be slower.
To zero-out columns of a csr, you can find the relevant nonzero elements, and set the data values to zero. Then run the eliminate_zeros method to remove those elements from the sparsity structure.
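The CSR approach just described might look roughly like this. It is a sketch reusing the small A and v from the earlier answer, and it relies on the fact that a CSR matrix stores the column index of every stored element in its indices attribute:

```python
import numpy as np
from scipy.sparse import csr_matrix

A = csr_matrix(np.array([[1, 2, 0], [0, 0, 3], [4, 0, 5]]))
v = np.array([1, 0, 1])  # column 1 should be zeroed

# A.indices holds the column of each stored value, so v[A.indices]
# looks up the vector entry that belongs to each stored value.
mask = v[A.indices] == 0
A.data[mask] = 0

# Drop the explicit zeros from the sparsity structure.
A.eliminate_zeros()

print(A.toarray())
# [[1 0 0]
#  [0 0 3]
#  [4 0 5]]
```

Whether this beats multiplying by a diagonal matrix depends on the matrix size and sparsity, so it is worth timing both on real data.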
Setting columns of a csc format may be simpler - find the relevant value in the indptr. At least the elements that you want to remove will be clustered together. I won't go into the details.
Zeroing rows of a lil format should be fairly easy - replace the relevant lists with [].
Anyway, with some familiarity with the formats it should be possible to work out alternatives to matrix multiplication. But without doing so, and doing some timings, I couldn't say which ones are faster.
For example, I have an equation for a projection matrix which works for 1-dimensional vectors:
$$P = \frac{a a^T}{a^T a}$$
where P is the projection matrix and T denotes the transpose.
We know that we can't simplify this fraction more (by cancelling terms) since denominator is a dot product (thus 0 dimensional scalar, number) and numerator is a matrix (column multiplied by row is a matrix).
I'm not sure how could I define function for this equation in numpy, considering that the current function that I'm using does not differentiate between these terms, multiplication is treated as it has commutative property. I'm using numpy.multiply method:
>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> a*a.T
array([1, 4, 9])
>>> a.T*a
array([1, 4, 9])
As you see, both of them output vectors.
I've also tried using numpy.matmul method:
>>> np.matmul(a, a.T)
14
>>> np.matmul(a.T, a)
14
which gives dot product for both of the function calls.
I also did try numpy.dot but it obviously doesn't work for numerator terms.
From my understanding, the first function call should output matrix (since column is multiplied by row) and the second function call should output a scalar in a proper case.
Am I mistaken? Is there any method that differentiates between a multiplied by a transpose and a transpose multiplied by a?
Thank you!
Note that 1-dimensional numpy arrays are not column vectors (and operations such as transposition do not make sense). If you want to obtain a column vector you should define your array as a 2-dimensional array (with the second dimension size equal to 1).
However, you don't need to define a column vector, as numpy offers functions to do what you want by manipulating a 1D array as follows:
P = np.outer(a,a)/np.inner(a,a)
Stelios' answer is the best, no doubt, but for completeness you can use the @ operator with 2-d arrays:
a = np.array([1,4,9])[np.newaxis]
P = (a.T @ a) / (a @ a.T)
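As a sanity check, here is a small sketch showing that the two formulations agree and that the result behaves like a projection. The vector values are taken from the question's output and are otherwise arbitrary:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# 1-D formulation: outer product over inner product.
P_outer = np.outer(a, a) / np.inner(a, a)

# 2-D formulation: explicit column vector of shape (3, 1).
col = a[:, np.newaxis]
P_matmul = (col @ col.T) / (col.T @ col)

print(np.allclose(P_outer, P_matmul))          # True
print(np.allclose(P_outer @ P_outer, P_outer)) # True: projecting twice changes nothing
print(np.allclose(P_outer @ a, a))             # True: a is mapped onto itself
```

The division in the 2-D version works because the (1, 1) denominator broadcasts against the (3, 3) numerator.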
I'm currently looking into (manual & high dimensional) feature extraction on very large datasets. I am encoding n² edges in a graph in its simplest form i -> j.
I'm taking advantage of that features are independent of the i -> j relationships, and can simply be encoded with something ala encode(i, target=False), encode(j, target=True). This way I can encode a single graph in linear time (n time as opposed to n²).
This data is encoded into a tensor of the shape:
# E :: (n, 2, d)
with d being a feature dimension. Indexing into an edge is therefore achieved by:
edge_ij = np.concatenate([E[source_node, 0, :], E[target_node, 1, :]], axis=-1)
My challenge is now that I'd like to interface into this ndarray as if it was of shape E' :: (n,n,d*2) ultimately so that I can utilize it to index into a weight vector W and compute a score, ala:
graph_features = W[E']
graph_scores = graph_features.sum(axis=-1)
There are more computations, which I'd like to do with the resulting graph scores, but this is solved if this is solved.
All my approaches have resulted in a lot of unnecessary array allocations, which I need to avoid to make my experiments feasible.
Is it perhaps possible to create some sort of memoryview? (cython is within reach)
Any ideas?
The core of your question seems to be how to view an array with shape (n, 2, d) as one with shape (n, n, d*2). Since changing from (2, d) to (d*2) is trivial with reshape() I will ignore that part and focus on viewing a 1D array as a 2D square one.
You can use stride_tricks.broadcast_to():
line = np.arange(10)
square = np.lib.stride_tricks.broadcast_to(line, (10, 10))
This makes square a view of line (meaning it does not require extra memory beyond some constant overhead), with all the values repeated 10 times.
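A short sketch to confirm that the result really is a view. It uses np.broadcast_to, the public name for the same function:

```python
import numpy as np

line = np.arange(10)
square = np.broadcast_to(line, (10, 10))

shares = np.shares_memory(line, square)  # no copy of the data was made
row_stride = square.strides[0]           # 0 bytes: every row is the same memory
writeable = square.flags.writeable       # broadcast views are read-only

print(square.shape)   # (10, 10)
print(shares)         # True
print(row_stride)     # 0
print(writeable)      # False
```

Since the view is read-only, assigning into square raises a ValueError. Any operation that needs a real (10, 10) array will make a copy.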
My understanding is that 1-D arrays in numpy can be interpreted as either a column-oriented vector or a row-oriented vector. For instance, a 1-D array with shape (8,) can be viewed as a 2-D array of shape (1,8) or shape (8,1) depending on context.
The problem I'm having is that the functions I write to manipulate arrays tend to generalize well in the 2-D case to handle both vectors and matrices, but not so well in the 1-D case.
As such, my functions end up doing something like this:
if arr.ndim == 1:
    # Do it this way
else:
    # Do it that way
Or even this:
# Reshape the 1-D array to a 2-D array
if arr.ndim == 1:
    arr = arr.reshape((1, arr.shape[0]))
# ... Do it the 2-D way ...
That is, I find I can generalize code to handle 2-D cases (r,1), (1,c), (r,c), but not the 1-D cases without branching or reshaping.
It gets even uglier when the function operates on multiple arrays as I would check and convert each argument.
So my question is: am I missing some better idiom? Is the pattern I've described above common to numpy code?
Also, as a related matter of API design principles, if the caller passes a 1-D array to some function that returns a new array, and the return value is also a vector, is it common practice to reshape a 2-D vector (r,1) or (1,c) back to a 1-D array or simply document that the function returns a 2-D array regardless?
Thanks
I think in general NumPy functions that require an array of shape (r,c) make no special allowance for 1-D arrays. Instead, they expect the user to either pass an array of shape (r,c) exactly, or for the user to pass a 1-D array that broadcasts up to shape (r,c).
If you pass such a function a 1-D array of shape (c,) it will broadcast to shape (1,c), since broadcasting adds new axes on the left. It can also broadcast to shape (r,c) for an arbitrary r (depending on what other array it is being combined with).
On the other hand, if you have a 1-D array, x, of shape (r,) and you need it to broadcast up to shape (r,c), then NumPy expects the user to pass an array of shape (r,1) since broadcasting will not add the new axes on the right for you.
To do that, the user must pass x[:,np.newaxis] instead of just x.
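These broadcasting rules can be sketched with a small example; the shapes (3, 4), (4,) and (3,) are arbitrary choices for illustration:

```python
import numpy as np

M = np.zeros((3, 4))   # shape (r, c)
row = np.arange(4)     # shape (c,): treated as (1, c), then stretched to (r, c)
col = np.arange(3)     # shape (r,): does NOT line up with the last axis

added_row = M + row    # works, shape (3, 4)

try:
    M + col            # (3, 4) with (3,) cannot be broadcast
    failed = False
except ValueError:
    failed = True

added_col = M + col[:, np.newaxis]   # (3, 1) broadcasts to (3, 4)

print(added_row.shape, failed, added_col.shape)   # (3, 4) True (3, 4)
```

New axes are only ever added on the left, so the caller has to add a trailing axis explicitly.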
Regarding return values: I think it better to always return a 2-D array. If the user knows the output will be of shape (1,c), and wants a 1-D array, let her slice off the 1-D array x[0] herself.
By making the return value always the same shape, it will be easier to understand code that uses this function, since it is not always immediately apparent what the shape of the inputs are.
Also, broadcasting blurs the distinction between a 1-D array of shape (c,) and a 2-D array of shape (r,c). If your function returns a 1-D array when fed 1-D input, and a 2-D array when fed 2-D input, then your function makes the distinction strict instead of blurred. Stylistically, this reminds me of checking if isinstance(obj,type), which goes against the grain of duck-typing. Don't do it if you don't have to.
unutbu's explanation is good, but I disagree on the return dimension.
The function internal pattern depends on the type of function.
Reduce operations with an axis argument can often be written so that the number of dimensions doesn't matter.
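As a sketch of that idea, a function that reduces over the last axis works unchanged for a single vector and for a batch of vectors. The name normalize is hypothetical and used only for illustration:

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # keepdims=True keeps the reduced axis so the division broadcasts
    # for 1-D input of shape (d,) as well as 2-D input of shape (N, d).
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

single = normalize([3, 4])              # shape (2,)
batch = normalize([[3, 4], [0, 2]])     # shape (2, 2)

print(single)   # [0.6 0.8]
print(batch)
```

No branching on ndim is needed, and the output has the same number of dimensions as the input.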
Numpy has also an atleast_2d (and atleast_1d) function that is also commonly used if you need an explicit 2d array. In statistics, I sometimes use a function like atleast_2d_cols, that reshapes 1d (r,) to 2d (r,1) for code that expects 2d, or if the input array is 1d, then the interpretation and linear algebra requires a column vector. (reshaping is cheap so this is not a problem)
In a third case, I might have different code paths if the lower dimensional case can be done cheaper or simpler than the higher dimensional case. (example: if 2d requires several dot products.)
return dimension
I think not following the numpy convention with the return dimension can be very confusing to users for general functions. (topic specific functions can be different.)
For example, reduce operations lose one dimension.
For many other functions the output dimension matches the input dimension. I think a 1d input should have a 1d output and not an extra redundant dimension. Except for functions in linalg, I don't remember any functions that would return a redundant extra dimension. (The scalar versus 1-element array case is not always consistent.)
Stylistically this reminds me of an isinstance check:
Try without it if you allow for example for numpy matrices and masked arrays. You will get funny results that are not easy to debug. Although, for most numpy and scipy functions the user has to know whether the array type will work with them, since there are few isinstance checks and asarray might not always do the right thing.
As a user, I always know what kind of "array_like" I have, a list, tuple or which array subclass, especially when I use multiplication.
np.array(np.eye(3).tolist()*3)
np.matrix(range(3)) * np.eye(3)
np.arange(3) * np.eye(3)
another example: What does this do?
>>> x = np.array(tuple(range(3)), [('',int)]*3)
>>> x
array((0, 1, 2),
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
>>> x * np.eye(3)
This question has already very good answers. Here I just want to add what I usually do (which somehow summarizes responses by others) when I want to write functions that accept a wide range of inputs while the operations I do on them require a 2d row or column vector.
1. If I know the input is always 1d (array or list):
   a. if I need a row: x = np.asarray(x)[None,:]
   b. if I need a column: x = np.asarray(x)[:,None]
2. If the input can be either 2d (array or list) with the right shape or 1d (which needs to be converted to 2d row/column):
   a. if I need a row: x = np.atleast_2d(x)
   b. if I need a column: x = np.atleast_2d(np.asarray(x).T).T or x = np.reshape(x, (len(x),-1)) (the latter seems faster)
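Wrapped into two small helpers, the second case could look like this. The names as_row and as_column are hypothetical, not part of numpy:

```python
import numpy as np

def as_row(x):
    """Return x as a 2-D row: 1-D input of length c becomes shape (1, c)."""
    return np.atleast_2d(x)

def as_column(x):
    """Return x as a 2-D column: 1-D input of length r becomes shape (r, 1)."""
    x = np.asarray(x)
    return np.reshape(x, (len(x), -1))

print(as_row([1, 2, 3]).shape)             # (1, 3)
print(as_column([1, 2, 3]).shape)          # (3, 1)
print(as_column([[1], [2], [3]]).shape)    # (3, 1): 2-D input is left as it is
```

Both helpers return views or cheap reshapes for array input, so calling them at the top of a function costs very little.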
This is a good use for decorators
def atmost_2d(func):
    def wrapr(x):
        return func(np.atleast_2d(x)).squeeze()
    return wrapr
For example, this function will pick out the last column of its input.
@atmost_2d
def g(x):
    return x[:,-1]
It works for:
1d:
In [46]: b
Out[46]: array([0, 1, 2, 3, 4, 5])
In [47]: g(b)
Out[47]: array(5)
2d:
In [49]: A
Out[49]:
array([[0, 1],
       [2, 3],
       [4, 5]])
In [50]: g(A)
Out[50]: array([1, 3, 5])
0d:
In [51]: g(99)
Out[51]: array(99)
This answer builds on the previous two.