I want to view a numpy matrix by specifying row and column numbers. For example, rows 0 and 2 and columns 0 and 2 of a 3×3 matrix:
M = np.array(range(9)).reshape((3,3))
M[:,[0,2]][[0,2],:]
But I know this is not a view; a new matrix is created because of the chained fancy indexing. Is it possible to get such a view?
I think it is strange that I can do
M[:2,:2]
to get a view of the matrix, but cannot use
M[[0,1],[0,1]]
to achieve the same view.
EDIT: provide one more example. If I have a matrix
M = np.array(range(16)).reshape((4,4))
How do I get rows [1,2,3] and columns [0,2,3] with a single step of indexing? This will do it in 2 steps:
M[[1,2,3],:][:,[0,2,3]]
How do I get rows [1,2,3] and columns [0,2,3] with a single step of indexing?
You could use np.ix_ instead, but it is neither less typing nor faster. In fact, it's slower:
%timeit M[np.ix_([1,2,3],[0,2,3])]
100000 loops, best of 3: 17.8 µs per loop
%timeit M[[1,2,3],:][:, [0,2,3]]
100000 loops, best of 3: 10.9 µs per loop
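If you want to double-check that the two spellings agree, here is a quick sketch (np.arange replaces the question's range call):

```python
import numpy as np

M = np.arange(16).reshape(4, 4)

# Both spellings select rows [1, 2, 3] and columns [0, 2, 3];
# np.ix_ does it in a single indexing operation, while chained
# indexing creates two intermediate copies.
a = M[np.ix_([1, 2, 3], [0, 2, 3])]
b = M[[1, 2, 3], :][:, [0, 2, 3]]
print(np.array_equal(a, b))   # True
```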
How to force a view (if possible)?
You can use numpy.lib.stride_tricks.as_strided to ask for a tailored view of an array.
Here is an example of its use from scipy-lectures.
This would allow you to get a view instead of a copy in your very first example:
from numpy.lib.stride_tricks import as_strided
M = np.array(range(9)).reshape((3,3))
sub_1 = M[:,[0,2]][[0,2],:]
sub_2 = as_strided(M, shape=(2, 2), strides=(48,16))
print(sub_1)
print()
print(sub_2)
[[0 2]
[6 8]]
[[0 2]
[6 8]]
# change the initial array
M[0,0] = -1
print(sub_1)
print()
print(sub_2)
[[0 2]
[6 8]]
[[-1 2]
[ 6 8]]
As you can see, sub_2 is indeed a view, since it reflects changes made to the initial array M.
The strides argument passed to as_strided specifies the byte-sizes to "walk" in each dimension:
The datatype of the initial array M is numpy.int64 (on my machine), so one int takes 8 bytes in memory. Since NumPy arranges arrays in C-style (row-major) order by default, one row of M is contiguous in memory and takes 24 bytes. Since you want every other row, you specify 48 bytes as the stride in the first dimension. For the second dimension you also want every other element; the elements of a row sit next to each other in memory, so you specify 16 bytes as the stride.
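The hard-coded 48 and 16 only hold for 8-byte integers; a sketch of a more portable variant that derives the strides from M itself:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

M = np.arange(9).reshape(3, 3)

# Taking every other row and every other column means doubling
# each of M's own strides, whatever the itemsize happens to be.
row_stride, col_stride = M.strides   # e.g. (24, 8) for int64 in C order
sub = as_strided(M, shape=(2, 2), strides=(2 * row_stride, 2 * col_stride))
print(sub)

# sub is a view: a change to M shows up in sub.
M[0, 0] = -1
print(sub[0, 0])   # -1
```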
For your latter example NumPy is not able to return a view, because the requested indices are too irregular to be described through a shape and strides.
For your second example:
import numpy as np
M = np.array(range(16)).reshape((4,4))
print(M[np.meshgrid([1,2,3],[0,2,3])].transpose())
the .transpose() is necessary because of meshgrid's default order of indexing. According to the NumPy documentation there is a newer indexing option, so that M[np.meshgrid([1,2,3],[0,2,3], indexing='ij')] should work, but I don't have the latest NumPy version and can't test it.
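For anyone on a NumPy recent enough to have indexing='ij' (1.7 and later), the transpose-free variant can be sketched like this; unpacking the two grids also sidesteps newer NumPy's requirement that multi-axis indices be a tuple rather than a list:

```python
import numpy as np

M = np.arange(16).reshape(4, 4)

# With indexing='ij', meshgrid builds the row/column index grids in
# matrix order, so no transpose is needed afterwards.
rows, cols = np.meshgrid([1, 2, 3], [0, 2, 3], indexing='ij')
print(M[rows, cols])
```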
M[[0,1],[0,1]] returns the elements at (0,0) and (1,1) of the matrix.
Slicing a numpy array gives a view of the array, but your code M[:2, :2] gets the submatrix with rows 0,1 and columns 0,1 of M; you need a step of 2, i.e. ::2:
In [1710]: M
Out[1710]:
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
In [1711]: M[:2, :2]
Out[1711]:
array([[0, 1],
[3, 4]])
In [1712]: M[::2, ::2]
Out[1712]:
array([[0, 2],
[6, 8]])
To understand this behavior of numpy, you need to read up on numpy array striding. The great power of numpy lies in providing a uniform interface for the whole numpy/scipy ecosystem to grow around. That interface is the ndarray, which provides a simple yet general method for storing numerical data.
'Simple' and 'general' are value judgements of course, but a balance has been struck by settling on strided arrays to form this interface. Every numpy array has a set of strides that tells you how to find any given element in the array, as a simple inner product between strides and indices.
Of course one could imagine an alternative numpy which had different code paths for all kinds of other data representations; much in the same way as one could imagine the pyramids of Giza, except ten times bigger. Easy to imagine; but building it is a little more work.
What is impossible, however, is to index an array as arr[[2,0,1]] and represent the result as a strided view on the same piece of memory; those indices have no constant step. arr[[1,0]], on the other hand, could be represented as a view, but returning a view or a copy depending on the content of the indices you are indexing with would mean a performance hit for what should be a simple operation, and it would make for rather funny semantics as well.
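np.shares_memory makes the view-versus-copy distinction easy to check; a minimal sketch:

```python
import numpy as np

arr = np.arange(6)

# A basic slice has a fixed start and step, so it can be described
# by strides alone and comes back as a view on the same memory.
print(np.shares_memory(arr, arr[::2]))        # True

# An arbitrary index list like [2, 0, 1] has no constant step,
# so fancy indexing must return a copy.
print(np.shares_memory(arr, arr[[2, 0, 1]]))  # False
```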
Related
I am trying to randomly select a set of integers in numpy and am encountering a strange error. If I define a numpy array with two sets of different sizes, np.random.choice chooses between them without issue:
Set1 = np.array([[1, 3, 5], [2, 4]])
In: np.random.choice(Set1)
Out: [2, 4]
However, once the sublists in the numpy array are the same size, I get a ValueError:
Set2 = np.array([[1, 3, 5], [2, 4, 6]])
In: np.random.choice(Set2)
ValueError: a must be 1-dimensional
Could be user error, but I've checked several times and the only difference is the size of the sets. I realize I can do something like:
Chosen = np.random.choice(N, k)
Selection = Set[Chosen]
where N is the number of sets and k is the number of samples. But I'm wondering whether there is a better way, and specifically what I am doing wrong that raises a ValueError when the sets are the same size.
Printout of Set1 and Set2 for reference:
In: Set1
Out: array([list([1, 3, 5]), list([2, 4])], dtype=object)
In: type(Set1)
Out: numpy.ndarray
In: Set2
Out:
array([[1, 3, 5],
[2, 4, 6]])
In: type(Set2)
Out: numpy.ndarray
Your issue is caused by a misunderstanding of how numpy arrays work. The first example cannot really be turned into an array, because numpy does not support ragged arrays: you end up with an array of object references that point to two Python lists. The second example is a proper 2×N numerical array. I can think of two types of solutions here.
The obvious approach (which would work in both cases, by the way), would be to choose the index instead of the sublist. Since you are sampling with replacement, you can just generate the index and use it directly:
Set[np.random.randint(N, size=k)]
This is the same as
Set[np.random.choice(N, k)]
If you want to choose without replacement, your best bet is to use np.random.choice with replace=False. This is similar to, but less efficient than, shuffling. In either case, you can write a one-liner for the index:
Set[np.random.choice(N, k, replace=False)]
Or:
index = np.arange(Set.shape[0])
np.random.shuffle(index)
Set[index[:k]]
The nice thing about np.random.shuffle, though, is that you can apply it to Set directly, whether it is a one- or many-dimensional array. Shuffling will always happen along the first axis, so you can just take the top k elements afterwards:
np.random.shuffle(Set)
Set[:k]
np.random.shuffle works only in place, so you have to write it out the long way. It's also less efficient for large arrays, since you have to create the entire index range up front, no matter how small k is.
The other solution is to turn the second example into an array of list objects like the first one. I do not recommend this unless the only reason you are using numpy is the choice function. In fact I wouldn't recommend it at all, since you can, and probably should, use Python's standard random module at this point. Disclaimers aside, you can coerce the datatype of the second array to object. This removes any benefit of using numpy, and it can't be done directly: simply passing dtype=object would still create a 2D array, just one that stores references to Python int objects instead of primitives. You have to do something like this:
N = 2  # number of sublists
Set = np.zeros(N, dtype=object)
Set[:] = [[1, 2, 3], [2, 4]]
You will now get an object essentially equivalent to the one in the first example, and can therefore apply np.random.choice directly.
Note
I show the legacy np.random methods here because of personal inertia if nothing else. The correct way, as suggested in the documentation I link to, is to use the new Generator API. This is especially true for the choice method, which is much more efficient in the new implementation. The usage is not any more difficult:
Set[np.random.default_rng().choice(N, k, replace=False)]
There are additional advantages, like the fact that you can now choose directly, even from a multidimensional array:
np.random.default_rng().choice(Set2, k, replace=False)
The same goes for shuffle, which, like choice, now allows you to select the axis you want to rearrange:
np.random.default_rng().shuffle(Set)
Set[:k]
I'm just learning Python, and have decided to do so by recoding and improving an old Java-based school AI project.
My project involves a mathematical operation that is basically a discrete convolution, but without one of the functions being time-reversed.
So, while in my original Java project I wrote all the code to do the operation myself, since I'm now working in Python with its great math libraries like numpy and scipy, I figured I could just use an existing convolution function such as scipy.convolve. However, this would require me to pre-reverse one of the two arrays, so that when scipy.convolve runs and reverses one of the arrays to perform the convolution, it is really un-reversing that array. (I also still don't know how to be sure to pre-reverse the right one of the two arrays so that they still slide past each other forwards rather than backwards, but I assume I should ask that as a separate question.)
Unlike my Java code, which only handled one-dimensional data, I want to extend this project to multidimensional data. I have learned that for a numpy array of known dimension, such as a three-dimensional array a, I can fully reverse the array (or rather get back a reversed view, which is much faster) with
a = a[::-1, ::-1, ::-1]
However, this requires a ::-1 for every dimension. How can I perform the same reversal within a method, for an array of arbitrary dimension, with the same result as the above code?
You can use np.flip. From the documentation:
numpy.flip(m, axis=None)
Reverse the order of elements in an array along the given axis.
The shape of the array is preserved, but the elements are reordered.
Note: flip(m) corresponds to m[::-1,::-1,...,::-1] with ::-1 at all positions.
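A minimal check that np.flip (with the default axis=None, which needs NumPy 1.15 or later) matches the explicit slicing:

```python
import numpy as np

a = np.arange(8).reshape(2, 4)

# flip with axis=None reverses along every axis, matching
# a[::-1, ::-1]; like slicing, it returns a view, not a copy.
print(np.array_equal(np.flip(a), a[::-1, ::-1]))   # True
print(np.flip(a))
```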
This is a possible solution:
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
This extends to an arbitrary number of axes. Verification:
a = np.arange(8).reshape(2, 4)
slices = tuple([slice(-1, -n-1, -1) for n in a.shape])
result = a[slices]
yields:
>>> a
array([[0, 1, 2, 3],
[4, 5, 6, 7]])
>>> result
array([[7, 6, 5, 4],
[3, 2, 1, 0]])
I have a 1-dimensional numpy array and want to store sparse updates of it.
Say I have an array of length 500000 and want to do 100 updates of 100 elements each. The updates either add to or just change the values (I do not think it matters).
What is the best way to do it using numpy?
I wanted to just store two arrays: indices, values_to_add and therefore have two objects: one stores dense matrix and other just keeps indices and values to add, and I can just do something like this with the dense matrix:
dense_matrix[indices] += values_to_add
And if I have multiple updates, I just concatenate them.
But this numpy syntax doesn't work well with repeated indices: all but one of the repeated updates are just ignored.
Merging a pair of updates whenever an index repeats is O(n). I thought of using a dict instead of arrays to store the updates, which looks fine from a complexity point of view, but it doesn't look like good numpy style.
What is the most expressive way to achieve this? I know about scipy sparse objects, but (1) I want pure numpy because (2) I want to understand the most efficient way to implement it.
If you have repeated indices you can use np.add.at. From the documentation of ufunc.at:
Performs unbuffered in place operation on operand ‘a’ for elements
specified by ‘indices’. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once.
Code
a = np.arange(10)
indices = [0, 2, 2]
np.add.at(a, indices, [-44, -55, -55])
print(a)
Output
[ -44 1 -108 3 4 5 6 7 8 9]
I was growing confused during the development of a small Python script involving matrix operations, so I fired up a shell to play around with a toy example and develop a better understanding of matrix indexing in Numpy.
This is what I did:
>>> import numpy as np
>>> A = np.matrix([1,2,3])
>>> A
matrix([[1, 2, 3]])
>>> A[0]
matrix([[1, 2, 3]])
>>> A[0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0]
matrix([[1, 2, 3]])
>>> A[0][0][0][0]
matrix([[1, 2, 3]])
As you can imagine, this has not helped me develop a better understanding of matrix indexing in Numpy. This behavior would make sense for something that I would describe as "An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
PS: I know how to obtain the first entry of the matrix. What I am interested in is the logic behind this design decision.
EDIT: I'm not asking how to access a matrix element, or why a matrix row behaves like a matrix. I'm asking for a definition of the behavior of a matrix when indexed with a single number. It's an action typical of arrays, but the resulting behavior is nothing like the one you would expect from an array. I would like to know how this is implemented and what's the logic behind the design decision.
Look at the shape after indexing:
In [295]: A=np.matrix([1,2,3])
In [296]: A.shape
Out[296]: (1, 3)
In [297]: A[0]
Out[297]: matrix([[1, 2, 3]])
In [298]: A[0].shape
Out[298]: (1, 3)
The key to this behavior is that np.matrix is always 2d. So even if you select one row (A[0,:]), the result is still 2d, shape (1,3). So you can string along as many [0] as you like, and nothing new happens.
What are you trying to accomplish with A[0][0]? The same as A[0,0]?
For the base np.ndarray class these are equivalent.
Note that the Python interpreter translates indexing into __getitem__ calls.
A.__getitem__(0).__getitem__(0)
A.__getitem__((0,0))
[0][0] is 2 indexing operations, not one. So the effect of the second [0] depends on what the first produces.
For an array A[0,0] is equivalent to A[0,:][0]. But for a matrix, you need to do:
In [299]: A[0,:][:,0]
Out[299]: matrix([[1]]) # still 2d
=============================
"An array of itself", but I doubt anyone in their right mind would choose that as a model for matrices in a scientific library.
What is, then, the logic to the output I obtained? Why would the first element of a matrix object be itself?
In addition, A[0,:] is not the same as A[0]
In light of these comments let me suggest some clarifications.
A[0] does not mean 'return the 1st element'; it means select along the 1st axis. For a 1d array that is the 1st item. For a 2d array it is the 1st row; for an ndarray that row is a 1d array, but for a matrix it is another matrix. So for a 2d array or matrix, A[i] is the same thing as A[i,:].
A[0] does not just return itself. It returns a new matrix. Different id:
In [303]: id(A)
Out[303]: 2994367932
In [304]: id(A[0])
Out[304]: 2994532108
It may have the same data, shape and strides, but it is a new object, just as unique as the ith row of a many-row matrix.
Most of the unique matrix activity is defined in: numpy/matrixlib/defmatrix.py. I was going to suggest looking at the matrix.__getitem__ method, but most of the action is performed in np.ndarray.__getitem__.
The np.matrix class was added to numpy as a convenience for old-school MATLAB programmers. numpy arrays can have almost any number of dimensions: 0, 1, .... MATLAB allowed only 2, though a release around 2000 generalized that to 2 or more.
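The contrast between the two classes can be sketched in a few lines:

```python
import numpy as np

arr = np.array([[1, 2, 3]])
mat = np.matrix([[1, 2, 3]])

# Indexing the first axis of a 2-d ndarray drops a dimension...
print(arr[0].shape)   # (3,)
# ...but np.matrix results are always kept 2-d, so the "row" is
# again a (1, 3) matrix, and further [0]s change nothing.
print(mat[0].shape)   # (1, 3)
# To reach a scalar, index both axes in a single operation:
print(mat[0, 0])      # 1
```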
Imagine you have the following
>> A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
If you want to get the values of the second column, use the following:
>> A.T[1]
array([ 2, 6, 10])
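Note that A[:, 1] reads the same column directly, without the transpose; a quick sketch comparing the two:

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# A.T[1] and the more direct A[:, 1] return the same column
# as a 1-d array.
print(A[:, 1])                             # [ 2  6 10]
print(np.array_equal(A.T[1], A[:, 1]))     # True
```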
For example, I want a 2-row matrix, with a first row of length 1 and a second row of length 2. I could do
list1 = np.array([1])
list2 = np.array([2,3])
matrix = []
matrix.append(list1)
matrix.append(list2)
matrix = np.array(matrix)
I wonder if I could declare a matrix of this shape directly in the beginning of a program without going through the above procedure?
A matrix is by definition a rectangular array of numbers, and NumPy does not support arrays that do not have a rectangular shape. What your code produces instead is an object array containing references to your two separate arrays:
array([array([1]), array([2, 3])], dtype=object)
I don't really see what the purpose of this shape could be, and would advise you to simply use nested lists for whatever you are doing with it. Should you have found some use for this structure with NumPy, however, you can produce it much more idiomatically like this (recent NumPy versions require an explicit dtype=object for ragged input):
>>> np.array([list1, list2], dtype=object)
array([array([1]), array([2, 3])], dtype=object)