Dealing with dimension collapse in python arrays

A recurring error I run into when using NumPy is that an attempt to index an array fails because one of the dimensions of the array was a singleton, and thus that dimension got wiped out and can't be indexed. This is especially problematic in functions designed to operate on arrays of arbitrary size. I'm looking for the cheapest, most universal way to avoid this error.
Here's an example:
import numpy as np
f = (lambda t, u, i=0: t[:,i]*u[::-1])
a = np.eye(3)
b = np.array([1,2,3])
f(a,b)
f(a[:,0],b[1])
The first call works as expected. The second call fails in two ways: 1) t can't be indexed by [:,0] because it has shape (3,), and 2) u can't be indexed at all because it's a scalar.
Here are the fixes that occur to me:
1) Use np.atleast_1d and np.atleast_2d etc. (possibly with conditionals to make sure that the dimensions are in the right order) inside f to make sure that all parameters have the dimensions they need. This precludes use of lambdas, and can take a few lines that I would rather not need.
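For illustration, a minimal sketch of option 1 (assuming the 1-D input should be treated as a column; note that np.atleast_2d prepends the new axis, so a transpose is needed):
import numpy as np

def f(t, u, i=0):
    # Promote the inputs so the indexing below always works. np.atleast_2d
    # prepends the new axis, so transpose 1-D input to make it a column.
    t = np.asarray(t)
    if t.ndim < 2:
        t = np.atleast_2d(t).T
    u = np.atleast_1d(u)
    return t[:, i] * u[::-1]

a = np.eye(3)
b = np.array([1, 2, 3])
f(a, b)           # works as before
f(a[:, 0], b[1])  # now also works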
2) Instead of writing f(a[:,0],b[1]) above, use f(a[:,[0]],b[[1]]). This is fine, but I always have to remember to put in the extra brackets, and if the index is stored in a variable you might not know if you should put the extra brackets in or not. E.g.:
idx = 1
f(a[:,[0]],b[[idx]])
idx = [2,0,1]
f(a[:,[0]],b[idx])
In this case, you would seem to have to call np.atleast_1d on idx first, which may be even more cumbersome than putting np.atleast_1d in the function.
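Continuing the session above, that would look like this:
idx = 1
f(a[:, [0]], b[np.atleast_1d(idx)])  # scalar index promoted to shape (1,)

idx = [2, 0, 1]
f(a[:, [0]], b[np.atleast_1d(idx)])  # already 1-D, passed through unchanged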
3) In some cases I can get away with just not putting in an index. E.g.:
f = lambda t, u: t[0]*u
f(a,b)
f(a[:,0],b[0])
This works, and is apparently the slickest solution when it applies. But it doesn't help in every case (in particular, your dimensions have to be in the right order to begin with).
So, are there better approaches than the above?

There are lots of ways to avoid this behaviour.
First, whenever you index into a dimension of an np.ndarray with a slice rather than an integer, the number of dimensions of the output will be the same as that of the input:
import numpy as np
x = np.arange(12).reshape(3, 4)
print(x[:, 0].shape)   # integer indexing
# (3,)
print(x[:, 0:1].shape) # slice
# (3, 1)
This is my preferred way of avoiding the problem, since it generalizes very easily from single-element to multi-element selections (e.g. x[:, i:i+1] vs x[:, i:i+n]).
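For instance, with x as above:
i, n = 1, 2
print(x[:, i:i+1].shape)  # (3, 1): one column, dimension kept
print(x[:, i:i+n].shape)  # (3, 2): the same pattern selects n columns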
As you've already touched on, you can also avoid dimension loss by using any sequence of integers to index into a dimension:
print(x[:, [0]].shape)            # list
# (3, 1)
print(x[:, (0,)].shape)           # tuple
# (3, 1)
print(x[:, np.array((0,))].shape) # array
# (3, 1)
If you choose to stick with integer indices, you can always insert a new singleton dimension using np.newaxis (or equivalently, None):
print(x[:, 0][:, np.newaxis].shape)
# (3, 1)
print(x[:, 0][:, None].shape)
# (3, 1)
Or else you could manually reshape it to the correct size (here using -1 to infer the size of the first dimension automatically):
print(x[:, 0].reshape(-1, 1).shape)
# (3, 1)
Finally, you can use an np.matrix rather than an np.ndarray. np.matrix behaves more like a MATLAB matrix, where singleton dimensions are left in whenever you index with an integer:
y = np.matrix(x)
print(y[:, 0].shape)
# (3, 1)
However, you should be aware that there are a number of other important differences between np.matrix and np.ndarray: for example, the * operator performs elementwise multiplication on arrays but matrix multiplication on matrices. In most circumstances it's best to stick to np.ndarray (and recent NumPy releases discourage np.matrix altogether).
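A quick illustration of that operator difference:
import numpy as np

a = np.arange(4).reshape(2, 2)
m = np.matrix(a)

print(a * a)  # elementwise:    [[0 1], [4 9]]
print(m * m)  # matrix product: [[2 3], [6 11]]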

Related

Is there an alternative, vectorized way to write the to_array function?

Suppose we have a ragged, nested sequence like the following:
import numpy as np
x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]
and want to create a full numpy array that broadcasts the ragged sub-sequences (to match the largest sub-sequence shape, here (10, 20)) where necessary. First, we might try to use np.array(a), which yields the warning:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
By changing to np.array(a, dtype=object), we do get an array. However, this is an array of objects rather than floats, and retains the ragged subsequences, which have not been broadcasted as desired. To fix this, I created a new function to_array which takes a (possibly ragged, nested) sequence and a shape and returns a full numpy array of that shape:
def to_array(a, shape):
    a = np.array(a, dtype=object)
    b = np.empty(shape)
    for index in np.ndindex(a.shape):
        b[index] = a[index]
    return b
b = np.array(a, dtype=object)
c = to_array(a, (2, 2, 10, 20))
print(b.shape, b.dtype) # prints (2, 2) object
print(c.shape, c.dtype) # prints (2, 2, 10, 20) float64
Note that c, not b, is the desired result. However, to_array relies on a for loop over np.ndindex, and Python for loops are slow for big arrays.
Is there an alternative, vectorized way to write the to_array function?
Given the target shape, a few iterations don't seem overly expensive (here A is the object array, i.e. np.array(a, dtype=object)):
In [35]: C = np.empty((A.shape+x.shape), x.dtype)
In [36]: for idx in np.ndindex(A.shape):
    ...:     C[idx] = A[idx]
    ...:
Alternatively you could replace the 0 and 1 with the appropriate (10,20) arrays. Here you've already created those, x and y:
In [37]: D = np.array([[y,x],[y,x]])
In [38]: np.allclose(C,D)
Out[38]: True
In general a few iterations on a complex task are fine. Keep in mind that (many) operations on an object dtype array are actually slower than operations on an equivalent list; it's the whole-array compiled operations on a numeric array that are relatively fast, and that's not what you have here.
But
C[0,0,:,:] = 0
uses broadcasting: all (10,20) values of C[0,0] are filled with the scalar 0.
C[0,1,:,:] = x
is a different kind of broadcasting, where the RHS shape already matches the LHS. It's unreasonable to expect numpy to handle both cases with one broadcasting operation.
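If you'd rather avoid the explicit ndindex loop, one hedged alternative is to broadcast each leaf to the target shape and stack; this still loops in Python, but only over the outer ragged entries and without preallocating a buffer:
import numpy as np

x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]

# Broadcast every leaf (scalar or array) to x.shape, then stack twice.
c = np.stack([np.stack([np.broadcast_to(np.asarray(e, dtype=float), x.shape)
                        for e in row])
              for row in a])
print(c.shape, c.dtype)  # (2, 2, 10, 20) float64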

Selecting last column of a Numpy array while maintaining the number of dimensions? [duplicate]

I'm using numpy and want to index a row without losing the dimension information.
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10,:]
xslice.shape # >> (10,)
In this example xslice is now 1 dimension, but I want it to be (1,10).
In R, I would use X[10,:,drop=F]. Is there something similar in numpy? I couldn't find it in the documentation and didn't see a similar question asked.
Thanks!
Another solution is to do
X[[10],:]
or
I = np.array([10])
X[I,:]
The dimensionality of an array is preserved when indexing is performed by a list (or an array) of indexes. This is nice because it leaves you with the choice between keeping the dimension and squeezing.
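For example:
import numpy as np
X = np.zeros((100, 10))
print(X[[10], :].shape)  # (1, 10): list indexing keeps the dimension
print(X[10, :].shape)    # (10,):  integer indexing drops it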
It's probably easiest to do x[None, 10, :] or equivalently (but more readable) x[np.newaxis, 10, :]. None or np.newaxis increases the dimension of the array by 1, so that you're back to the original after the slicing eliminates a dimension.
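Concretely:
import numpy as np
X = np.zeros((100, 10))
print(X[None, 10, :].shape)        # (1, 10)
print(X[np.newaxis, 10, :].shape)  # (1, 10)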
As far as why it's not the default, personally, I find that constantly having arrays with singleton dimensions gets annoying very quickly. I'd guess the numpy devs felt the same way.
Also, numpy handles broadcasting very well, so there's usually little reason to retain the dimension of the array the slice came from. If you did, then things like:
a = np.zeros((100,100,10))
b = np.zeros((100,10))
a[0,:,:] = b
either wouldn't work or would be much more difficult to implement.
(Or at least that's my guess at the numpy dev's reasoning behind dropping dimension info when slicing)
I found a few reasonable solutions.
1) use numpy.take(X,[10],0)
2) use this strange indexing X[10:11:, :]
Ideally, this should be the default. I never understood why dimensions are ever dropped. But that's a discussion for numpy...
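Both preserve the dimension:
import numpy as np
X = np.zeros((100, 10))
print(np.take(X, [10], axis=0).shape)  # (1, 10)
print(X[10:11:, :].shape)              # (1, 10)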
Here's an alternative I like better. Instead of indexing with a single number, index with a range. That is, use X[10:11,:]. (Note that 10:11 does not include 11).
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10:11,:]
xslice.shape # >> (1,10)
This makes it easy to understand with more dimensions too, no None juggling and figuring out which axis to use which index. Also no need to do extra bookkeeping regarding array size, just i:i+1 for any i that you would have used in regular indexing.
b = np.ones((2, 3, 4))
b.shape # >> (2, 3, 4)
b[1:2,:,:].shape # >> (1, 3, 4)
b[:, 2:3, :].shape # >> (2, 1, 4)
To add to the solution involving indexing by lists or arrays by gnebehay, it is also possible to use tuples:
X[(10,),:]
The dimension-dropping behavior is especially annoying if you're indexing by an array that might be length 1 at runtime. For that case, there's np.ix_:
some_array[np.ix_(row_index,column_index)]
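For example (the index names here are illustrative):
import numpy as np
X = np.arange(12).reshape(3, 4)
row_index = np.array([1])        # might be length 1 at runtime
column_index = np.array([0, 2])
print(X[np.ix_(row_index, column_index)].shape)  # (1, 2): dimensions preserved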
I've been using np.reshape to achieve the same as shown below
import numpy as np
X = np.zeros((100,10))
X.shape # >> (100, 10)
xslice = X[10,:].reshape(1, -1)
xslice.shape # >> (1, 10)

Iterating through a subset of dimensions

I would like to iterate through a subset of dimensions of a numpy array and compare the resulting array elements (which are arrays of the remaining dimension(s)).
The code below does this:
import numpy
def min(h,m):  # note: this shadows the built-in min
    return h*60+m
exclude_times_default=[min(3,00),min(6,55)]
d=exclude_times_default
exclude_times_wkend=[min(3,00),min(9,00)]
w=exclude_times_wkend
exclude_times=numpy.array([[[min(3,00),min(6,20)],d,d,d,d,d,[min(3,00),min(6,20)],d,d,[min(3,00),min(6,20)]],
[d,d,d,d,[min(3,00),min(9,30)],[min(3,00),min(9,30)],d,d,d,d],
[[min(20,00),min(7,15)],[min(3,00),min(23,15)],[min(3,00),min(7,15)],[min(3,00),min(7,15)],[min(3,00),min(23,15)],[min(3,00),min(23,15)],d,d,d,d]])
num_level=exclude_times.shape[0]
num_wind=exclude_times.shape[1]
for level in range(num_level):
    for window in range(num_wind):
        if (exclude_times[level,window,:]==d).all():
            print("Default")
            exclude_times[level][window]=w
            print(level,window,exclude_times[level][window])
The solution does not look very elegant to me, just wondering if there are more elegant solutions.
You can get a 2D mask pinpointing all the window/level combinations set to default like this:
mask = (exclude_times == d[None, None, :]).all(axis=-1)
(For this to work, d needs to be a NumPy array rather than a list, e.g. d = np.asarray(exclude_times_default).) The expression d[None, None, :] introduces two new axes into a view of d to make it broadcast to the shape of exclude_times properly. Another way to do that would be with an explicit reshape: np.reshape(d, (1, 1, -1)) or d.reshape(1, 1, -1). There are many other ways as well.
The .all(axis=-1) operation reduces the 3D boolean mask along the last axis, giving you a 2D mask indexed by level and window.
To count the number of default entries, use np.count_nonzero:
nnz = np.count_nonzero(mask)
To count the defaults for each window:
np.count_nonzero(mask, axis=0)
To count the defaults for each level:
np.count_nonzero(mask, axis=1)
Remember, the axis parameter is the one you reduce, not the one(s) you keep.
Assigning w to the default elements is a bit more complex. The problem is that exclude_times[mask[:, :, None]] is a copy of the original data, and doesn't preserve the shape of the original at all.
You have to do a couple of extra steps to reshape correctly:
exclude_times[mask[:, :, None]] = np.broadcast_to(w[None, :], (nnz, 2)).ravel()
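Note that w, like d, must be an array for w[None, :] to work, and current NumPy versions reject a boolean index like mask[:, :, None] whose shape doesn't match the indexed dimensions. A simpler sketch that works on current versions: a 2D boolean mask applied to the 3D array already selects rows of shape (nnz, 2), so you can assign w directly and let broadcasting handle the rest:
import numpy as np

d = np.asarray(exclude_times_default)
w = np.asarray(exclude_times_wkend)

# d (shape (2,)) broadcasts against the last axis of exclude_times (3, 10, 2)
mask = (exclude_times == d).all(axis=-1)  # shape (3, 10)
exclude_times[mask] = w                   # selected rows are (nnz, 2); w broadcasts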

Numpy [...,None]

I have found myself needing to add features to existing numpy arrays, which has led to a question about what the last portion of the following code is actually doing:
np.ones(shape=feature_set.shape)[...,None]
Set-up
As an example, let's say I wish to solve for linear regression parameter estimates by using numpy and solving the normal equations, (X^T X) b = X^T y.
Assume I have a feature set shape (50,1), a target variable of shape (50,), and I wish to use the shape of my target variable to add a column for intercept values.
It would look something like this:
# Create random target & feature set
y_train = np.random.randint(0,100, size = (50,))
feature_set = np.random.randint(0,100,size=(50,1))
# Build a set of 1s after shape of target variable
int_train = np.ones(shape=y_train.shape)[...,None]
# Able to then add int_train to feature set
X = np.concatenate((int_train, feature_set),1)
What I Think I Know
I see the difference in output when I include [...,None] vs when I leave it off: the version without it raises an error about the input arrays needing the same number of dimensions, and eventually I stumbled on the solution of using [...,None].
Main Question
While I see the output of [...,None] gives me what I want, I am struggling to find any information on what it is actually supposed to do. Can anybody walk me through what this code actually means, what the None argument is doing, etc?
Thank you!
The index expression [..., None] combines two "shortcuts":
The ellipsis literal component:
The dots (...) represent as many colons as needed to produce a complete indexing tuple. For example, if x is a rank 5 array (i.e., it has 5 axes), then
x[1,2,...] is equivalent to x[1,2,:,:,:],
x[...,3] to x[:,:,:,:,3] and
x[4,...,5,:] to x[4,:,:,5,:].
(Source)
The None component:
numpy.newaxis
The newaxis object can be used in all slicing operations to create an axis of length one. newaxis is an alias for ‘None’, and ‘None’ can be used in place of this with the same result.
(Source)
So, arr[..., None] takes an array of dimension N and "adds" a dimension "at the end" for a resulting array of dimension N+1.
Example:
import numpy as np
x = np.array([[1,2,3],[4,5,6]])
print(x.shape) # (2, 3)
y = x[...,None]
print(y.shape) # (2, 3, 1)
z = x[:,:,np.newaxis]
print(z.shape) # (2, 3, 1)
a = np.expand_dims(x, axis=-1)
print(a.shape) # (2, 3, 1)
print((y == z).all()) # True
print((y == a).all()) # True
Consider this code:
np.ones(shape=(2,3))[...,None].shape
As you see, the None changes the (2,3) matrix into a (2,3,1) tensor; in effect, it appends the new axis as the LAST dimension.
If you use
np.ones(shape=(2,3))[None, ...].shape
it instead prepends the new axis as the FIRST dimension of the tensor.
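Both placements at a glance:
import numpy as np
print(np.ones((2, 3))[..., None].shape)  # (2, 3, 1): new axis last
print(np.ones((2, 3))[None, ...].shape)  # (1, 2, 3): new axis first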
