Delete element from multi-dimensional numpy array by value - python

Given a numpy array
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
what's the fastest way to delete all elements of value -1 to get an array of the form
np.array([[0, 0], [1, 0, 0], [1, 0]])

Another method you might consider:
def iterative_numpy(a):
    mask = a != -1
    out = np.array([a[i, mask[i]] for i in xrange(a.shape[0])])
    return out
Divakar's method loop_compr_based calculates sums along the rows of mask and a cumulative sum of that result. This method avoids such summations but still has to iterate through the rows of a. It also returns an array of arrays, with the annoyance that out has to be indexed with the syntax out[1][2] rather than out[1,2]. Comparing the times with random integer matrices:
In [4]: a = np.random.random_integers(-1,1, size = (3,30))
In [5]: %timeit iterative_numpy(a)
100000 loops, best of 3: 11.1 us per loop
In [6]: %timeit loop_compr_based(a)
10000 loops, best of 3: 20.2 us per loop
In [7]: a = np.random.random_integers(-1,1, size = (30,3))
In [8]: %timeit iterative_numpy(a)
10000 loops, best of 3: 59.5 us per loop
In [9]: %timeit loop_compr_based(a)
10000 loops, best of 3: 30.8 us per loop
In [10]: a = np.random.random_integers(-1,1, size = (30,30))
In [11]: %timeit iterative_numpy(a)
10000 loops, best of 3: 64.6 us per loop
In [12]: %timeit loop_compr_based(a)
10000 loops, best of 3: 36 us per loop
When there are more columns than rows, iterative_numpy wins out. When there are more rows than columns, loop_compr_based wins, though transposing a first will improve the performance of both methods. When the dimensions are comparable, loop_compr_based is best.
Important Side Discussion
Outside of the implementation, it's important to note that any numpy array with a non-uniform shape is not an actual array, in the sense that the values do not occupy a contiguous section of memory; furthermore, the usual array operations will not work as expected.
As an example:
>>> a = np.array([[1,2,3],[1,2],[1]])
>>> a*2
array([[1, 2, 3, 1, 2, 3], [1, 2, 1, 2], [1, 1]], dtype=object)
Notice that numpy actually informs us that this is not the usual numpy array with the note dtype=object.
Thus it might be best to just make a list of numpy arrays and use them accordingly.
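A minimal sketch of that list-of-arrays approach (the variable names here are illustrative):
import numpy as np

a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
# a plain Python list of 1-D arrays instead of a ragged object array
rows = [row[row != -1] for row in a]
# rows -> [array([0, 0]), array([1, 0, 0]), array([1, 0])]
# ordinary elementwise operations still work per row:
doubled = [r * 2 for r in rows]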

Approach #1 : Using NumPy splitting of array -
def split_based(a, val):
    mask = a != val
    p = np.split(a[mask], mask.sum(1)[:-1].cumsum())
    out = np.array(list(map(list, p)))
    return out
Approach #2 : Using loop comprehension, but minimal work within the loop -
def loop_compr_based(a, val):
    mask = a != val
    stop = mask.sum(1).cumsum()
    start = np.append(0, stop[:-1])
    am = a[mask].tolist()
    out = np.array([am[start[i]:stop[i]] for i in range(len(start))])
    return out
Sample run -
In [391]: a
Out[391]:
array([[ 0, -1,  0],
       [ 1,  0,  0],
       [ 1,  0, -1],
       [-1, -1,  8],
       [ 3,  7,  2]])
In [392]: split_based(a, val=-1)
Out[392]: array([[0, 0], [1, 0, 0], [1, 0], [8], [3, 7, 2]], dtype=object)
In [393]: loop_compr_based(a, val=-1)
Out[393]: array([[0, 0], [1, 0, 0], [1, 0], [8], [3, 7, 2]], dtype=object)
Runtime test -
In [387]: a = np.random.randint(-2,10,(1000,1000))
In [388]: %timeit split_based(a, val=-1)
10 loops, best of 3: 161 ms per loop
In [389]: %timeit loop_compr_based(a, val=-1)
10 loops, best of 3: 29 ms per loop

Use indexes = np.where(a == -1) to get the indices of the matching elements (see: Find indices of elements equal to zero from numpy array). Then delete those elements by index with np.delete(your_array, indexes) (see: How to remove specific elements in a numpy array).
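A sketch of this route; note that np.delete operates on the flattened array, so the ragged 2-D result from the question cannot be preserved (np.ravel_multi_index is used here to turn the where-style tuple into flat indices):
import numpy as np

a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
indexes = np.where(a == -1)                  # tuple: (row_indices, col_indices)
# np.delete works on the flattened array, so convert to flat indices first
flat_idx = np.ravel_multi_index(indexes, a.shape)
out = np.delete(a, flat_idx)
# out -> array([0, 0, 1, 0, 0, 1, 0]); the 2-D structure is lost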

How about this?
print([[y for y in x if y > -1] for x in a])
[[0, 0], [1, 0, 0], [1, 0]]

For almost everything you might want to do with such an array, you can use a masked array
a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
b=np.ma.masked_equal(a,-1)
b
Out[5]:
masked_array(data =
 [[0 -- 0]
  [1 0 0]
  [1 0 --]],
             mask =
 [[False  True False]
  [False False False]
  [False False  True]],
       fill_value = -1)
If you really want the ragged array, it can be built by applying .compressed() line by line:
c=np.array([b[i].compressed() for i in range(b.shape[0])])
c
Out[10]: array([array([0, 0]), array([1, 0, 0]), array([1, 0])], dtype=object)
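The payoff of staying with the masked array is that reductions simply skip masked entries, so the ragged form is often unnecessary. For example:
import numpy as np

a = np.array([[0, -1, 0], [1, 0, 0], [1, 0, -1]])
b = np.ma.masked_equal(a, -1)
# reductions ignore the masked entries, so each row acts as if ragged
print(b.sum(axis=1))    # [0 1 1]
print(b.mean(axis=1))   # [0.0 0.3333333333333333 0.5]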

Number of unique elements per row in a NumPy array

For example, for
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
I want to get
[2, 2, 3]
Is there a way to do this without for loops or using np.vectorize?
Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
Approach #1
One vectorized approach with sorting -
In [8]: b = np.sort(a,axis=1)
In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
Approach #2
Another method, for ints that aren't very large, would be to offset each row by an amount that keeps its elements from colliding with those of any other row, then do binned summation with np.bincount and count the number of non-zero bins per row -
n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
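To make the offset trick concrete, here is how it plays out on the question's sample array (a small walk-through added for illustration):
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
n = a.max() + 1                                  # 5 bins per row
a_off = a + np.arange(a.shape[0])[:, None] * n
# a_off -> [[ 1,  0,  0],
#           [ 6,  5,  5],
#           [12, 13, 14]]
# each row now owns a disjoint range of bin ids, so one global bincount
# yields one histogram per row after reshaping
counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
print((counts != 0).sum(1))                      # [2 2 3]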
Runtime test
Approaches as funcs -
import pandas as pd
from toolz import compose

def sorting(a):
    b = np.sort(a, axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)
Case #1 : Square shaped one
In [164]: np.random.seed(0)
In [165]: a = np.random.randint(0,5,(10000,10000))
In [166]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop
Case #2 : Large number of rows
In [167]: np.random.seed(0)
In [168]: a = np.random.randint(0,5,(1000000,10))
In [169]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop
Extending to number of unique elements per column
To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -
def nunique_percol_sort(a):
    b = np.sort(a, axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
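As a quick sanity check on the question's sample array (assuming the functions above are in scope):
import numpy as np

a = np.array([[1, 0, 0],
              [1, 0, 0],
              [2, 3, 4]])
print(nunique_percol_sort(a))  # [2 2 2] -> unique counts down each column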
Generic ndarray with generic axis
Let's see how we can extend to ndarray of generic dimensions and get those number of unique counts along a generic axis. We will make use of np.diff with its axis param to get those consecutive differences and hence make it generic, like so -
def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis)+1
Sample runs -
In [77]: a
Out[77]:
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])
In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])
In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
If you are working with floating point numbers and want to base uniqueness on some tolerance value rather than an absolute match, we can use np.isclose. Two such options would be -
(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
For a custom tolerance value, pass it to np.isclose (via its atol/rtol parameters).
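For instance, a tolerance-based count might look like this (the sample values and atol here are illustrative):
import numpy as np

a = np.array([[1.0, 1.0 + 1e-9, 2.0],
              [3.0, 3.1,        3.2]])
d = np.diff(np.sort(a, axis=1), axis=1)
# values within atol of each other count as the same element
print((~np.isclose(d, 0, atol=1e-6)).sum(axis=1) + 1)   # [2 3]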
This solution via np.apply_along_axis isn't vectorised and involves a Python-level loop, but it is relatively intuitive, combining len with np.unique.
import numpy as np
from toolz import compose
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.apply_along_axis(compose(len, np.unique), 1, a) # [2, 2, 3]
A one-liner using sort:
In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
Are you open to considering pandas? DataFrames have a dedicated method for this:
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3

Padding a 2D numpy array with varying row lengths to a uniform size [duplicate]

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
       [1, 2]])
Note: For Python 2, it is itertools.izip_longest.
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: out
Out[28]:
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])
*Please note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding and should have minimal effect on the total runtime.
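To see why the mask lines up with np.concatenate(v), here is the broadcast on the small sample (an illustrative aside, not part of the original answer):
import numpy as np

v = [[1], [1, 2]]
lens = np.array([len(item) for item in v])     # [1 2]
mask = lens[:, None] > np.arange(lens.max())
# mask -> [[ True, False],
#          [ True,  True]]
# row i has lens[i] leading True slots; assigning np.concatenate(v)
# through the mask fills exactly those slots, row by row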
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, alongside the boolean-indexing based one from this post, on a relatively larger dataset with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be taken on a case-by-case basis.
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# array([[1, 0],
#        [1, 2]], dtype=int32)
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
       [1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack(np.insert(v, range(1, len(v)+1), [[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)
You can convert to a pandas DataFrame first, and after that convert it to a numpy array:
ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
#    0  1    2    3
# 0  1  2  3.0  NaN
# 1  4  5  NaN  NaN
# 2  6  7  8.0  9.0
npl = df.to_numpy()
print(npl)
# [[ 1.  2.  3. nan]
#  [ 4.  5. nan nan]
#  [ 6.  7.  8.  9.]]
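If the zero-filled int32 array from the question is the goal, filling the NaNs before converting works, along the same lines as the earlier DataFrame answer:
import numpy as np
import pandas as pd

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
# fill the NaNs before converting to get an integer dtype
npl = pd.DataFrame(ll).fillna(0).to_numpy().astype(np.int32)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]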
I was having a numpy broadcast error with Alexander's answer, so I added a small variation using numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
If you want to extend the same logic to deeper levels (list of lists of lists, ...), you can use tensorflow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates an array padded with 0.
Or a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)

Find the index of the n smallest values in a 3 dimensional numpy array

Given a 3 dimensional numpy array, how do you find the indices of the top n smallest values? The index of the minimum value can be found as:
i,j,k = np.where(my_array == my_array.min())
Here's one approach for generic n-dims and generic N smallest numbers -
def smallestN_indices(a, N):
    idx = a.ravel().argsort()[:N]
    return np.stack(np.unravel_index(idx, a.shape)).T
Each row of the 2D output array holds the indexing tuple that corresponds to one of the smallest array values.
We can also use argpartition, but that might not maintain the order. So, we need a bit of additional work with argsort there -
def smallestN_indices_argpartition(a, N, maintain_order=False):
    idx = np.argpartition(a.ravel(), N)[:N]
    if maintain_order:
        idx = idx[a.ravel()[idx].argsort()]
    return np.stack(np.unravel_index(idx, a.shape)).T
Sample run -
In [141]: np.random.seed(1234)
...: a = np.random.randint(111,999,(2,5,4,3))
...:
In [142]: smallestN_indices(a, N=3)
Out[142]:
array([[0, 3, 2, 0],
       [1, 2, 3, 0],
       [1, 2, 2, 1]])
In [143]: smallestN_indices_argpartition(a, N=3)
Out[143]:
array([[1, 2, 3, 0],
       [0, 3, 2, 0],
       [1, 2, 2, 1]])
In [144]: smallestN_indices_argpartition(a, N=3, maintain_order=True)
Out[144]:
array([[0, 3, 2, 0],
       [1, 2, 3, 0],
       [1, 2, 2, 1]])
Runtime test -
In [145]: a = np.random.randint(111,999,(20,50,40,30))
In [146]: %timeit smallestN_indices(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3, maintain_order=True)
...:
10 loops, best of 3: 97.6 ms per loop
100 loops, best of 3: 8.32 ms per loop
100 loops, best of 3: 8.34 ms per loop
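One way to recover the actual values from the returned index rows (a usage sketch, assuming the functions above are defined):
import numpy as np

np.random.seed(1234)
a = np.random.randint(111, 999, (2, 5, 4, 3))
idx = smallestN_indices(a, N=3)
# each row of idx is one index tuple into a
vals = [a[tuple(row)] for row in idx]
# or, in one fancy-indexing shot:
vals = a[tuple(idx.T)]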

Mask numpy array based on index

How do I mask an array based on the actual index values?
That is, if I have a 10 x 10 x 30 matrix and I want to mask the array when the first and second index equal each other.
For example, [1, 1 , :] should be masked because 1 and 1 equal each other but [1, 2, :] should not because they do not.
I'm only asking this with the third dimension because it resembles my current problem and may complicate things. But my main question is, how to mask arrays based on the value of the indices?
In general, to access the value of the indices, you can use np.meshgrid:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i == j)
The advantage of this method is that it works for arbitrary boolean functions of i, j, and k. It is a bit slower than using the identity special case.
In [56]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
....: i == j
10000 loops, best of 3: 96.8 µs per loop
As @Jaime points out, meshgrid supports a sparse option, which avoids most of the duplication, but requires a bit more care in some cases because the outputs are not expanded to the full shape. It will save memory and speed things up a little. For example,
In [77]: x = np.arange(5)
In [78]: np.meshgrid(x, x)
Out[78]:
[array([[0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4]]),
 array([[0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4]])]
In [79]: np.meshgrid(x, x, sparse=True)
Out[79]:
[array([[0, 1, 2, 3, 4]]),
 array([[0],
        [1],
        [2],
        [3],
        [4]])]
So, you can use the sparse version as he says, but you must force the broadcasting as such:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
m.mask = np.repeat(i==j, k.size, axis=2)
And the speedup:
In [84]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
....: np.repeat(i==j, k.size, axis=2)
10000 loops, best of 3: 73.9 µs per loop
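On NumPy 1.10+, np.broadcast_to offers an alternative to np.repeat that expands the sparse comparison to the full shape without copying; a sketch under that assumption:
import numpy as np

m = np.ma.masked_array(np.zeros((10, 10, 30)))
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
# (i == j) has shape (10, 10, 1); broadcast_to makes a zero-copy,
# read-only view of shape (10, 10, 30); add .copy() if the mask
# will be modified in place later
m.mask = np.broadcast_to(i == j, m.shape)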
In your special case of wanting to mask the diagonals, you can use the np.identity() function, which returns ones along the diagonal. Since you have the third dimension, we have to add that third dimension to the identity matrix:
m.mask = np.identity(10)[...,None]*np.ones((1,1,30))
There might be a better way of constructing that array, but it is basically stacking 30 copies of the np.identity(10) array. For example, this is equivalent:
np.dstack((np.identity(10),)*30)
but slower:
In [30]: timeit np.identity(10)[...,None]*np.ones((1,1,30))
10000 loops, best of 3: 40.7 µs per loop
In [31]: timeit np.dstack((np.identity(10),)*30)
1000 loops, best of 3: 219 µs per loop
And @Ophion's suggestions:
In [33]: timeit np.tile(np.identity(10)[...,None], 30)
10000 loops, best of 3: 63.2 µs per loop
In [71]: timeit np.repeat(np.identity(10)[...,None], 30, axis=2)
10000 loops, best of 3: 45.3 µs per loop
