Mask numpy array based on index - python

How do I mask an array based on the actual index values?
That is, if I have a 10 x 10 x 30 matrix and I want to mask the array when the first and second index equal each other.
For example, [1, 1 , :] should be masked because 1 and 1 equal each other but [1, 2, :] should not because they do not.
I'm only asking this with the third dimension because it resembles my current problem and may complicate things. But my main question is: how do I mask arrays based on the values of the indices?

In general, to access the value of the indices, you can use np.meshgrid:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
m.mask = (i == j)
The advantage of this method is that it works for arbitrary boolean functions on i, j, and k. It is a bit slower than the identity-based special case shown below.
In [56]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')
....: i == j
10000 loops, best of 3: 96.8 µs per loop
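For reference, here is a minimal self-contained sketch of this approach end to end; the contents of m are a placeholder, since the question's actual data was never shown:
import numpy as np

# Hypothetical 10 x 10 x 30 data, wrapped as a masked array (placeholder values)
m = np.ma.masked_array(np.random.rand(10, 10, 30))

# i, j, k each have shape (10, 10, 30) and hold the index along their axis
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij')

# Mask every element whose first two indices are equal
m.mask = (i == j)

print(m[1, 1, :].count())  # 0  -> all 30 entries along [1, 1, :] are masked
print(m[1, 2, :].count())  # 30 -> nothing masked along [1, 2, :]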
As @Jaime points out, meshgrid supports a sparse option, which doesn't do so much duplication, but requires a bit more care in some cases because the results don't automatically expand to the full array shape. It will save memory and speed things up a little. For example,
In [77]: x = np.arange(5)
In [78]: np.meshgrid(x, x)
Out[78]:
[array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]]),
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2],
[3, 3, 3, 3, 3],
[4, 4, 4, 4, 4]])]
In [79]: np.meshgrid(x, x, sparse=True)
Out[79]:
[array([[0, 1, 2, 3, 4]]),
array([[0],
[1],
[2],
[3],
[4]])]
So you can use the sparse version as he says, but you must force the broadcasting explicitly, as follows:
i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
m.mask = np.repeat(i==j, k.size, axis=2)
And the speedup:
In [84]: %%timeit
....: i, j, k = np.meshgrid(*map(np.arange, m.shape), indexing='ij', sparse=True)
....: np.repeat(i==j, k.size, axis=2)
10000 loops, best of 3: 73.9 µs per loop

In your special case of wanting to mask the diagonal, you can use the np.identity() function, which returns ones along the diagonal. Since you have the third dimension, we have to add that third dimension to the identity matrix:
m.mask = np.identity(10)[...,None]*np.ones((1,1,30))
There might be a better way of constructing that array, but it is basically stacking 30 copies of the np.identity(10) array. For example, this is equivalent:
np.dstack((np.identity(10),)*30)
but slower:
In [30]: timeit np.identity(10)[...,None]*np.ones((1,1,30))
10000 loops, best of 3: 40.7 µs per loop
In [31]: timeit np.dstack((np.identity(10),)*30)
1000 loops, best of 3: 219 µs per loop
And @Ophion's suggestions:
In [33]: timeit np.tile(np.identity(10)[...,None], 30)
10000 loops, best of 3: 63.2 µs per loop
In [71]: timeit np.repeat(np.identity(10)[...,None], 30, axis=2)
10000 loops, best of 3: 45.3 µs per loop

Number of unique elements per row in a NumPy array

For example, for
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
I want to get
[2, 2, 3]
Is there a way to do this without for loops or using np.vectorize?
Edit: Actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The ultimate goal is to determine the percentage of rows that have duplicates. This was a homework problem which I already solved (with a for loop), but I was just wondering if there was a better way to do it with numpy.
Approach #1
One vectorized approach with sorting -
In [8]: b = np.sort(a,axis=1)
In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
Approach #2
Another method, for ints that aren't very large, would be to offset each row so that its elements can't collide with those of any other row, then do a binned summation with np.bincount and count the number of non-zero bins per row -
n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
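To make the offsetting idea concrete, here is the same computation unrolled on the sample array from the question, with the intermediate values shown as comments (my own annotations):
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                                 # 5: row r is shifted into the value range [r*n, (r+1)*n)
a_off = a + np.arange(a.shape[0])[:, None] * n  # [[ 1  0  0] [ 6  5  5] [12 13 14]]
M = a.shape[0] * n                              # 15 bins in total, n per row

counts = np.bincount(a_off.ravel(), minlength=M).reshape(-1, n)
# [[2 1 0 0 0]
#  [2 1 0 0 0]
#  [0 0 1 1 1]]
out = (counts != 0).sum(1)                      # array([2, 2, 3])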
Runtime test
Approaches as funcs -
def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)
Case #1 : Square shaped one
In [164]: np.random.seed(0)
In [165]: a = np.random.randint(0,5,(10000,10000))
In [166]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop
Case #2 : Large number of rows
In [167]: np.random.seed(0)
In [168]: a = np.random.randint(0,5,(1000000,10))
In [169]: %timeit numpy_apply(a)
...: %timeit sorting(a)
...: %timeit bincount(a)
...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop
Extending to number of unique elements per column
To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so -
def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
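As a quick check on the sample array from the question (my own run, using numpy as np and the two functions just defined):
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

print(nunique_percol_sort(a))      # [2 2 2]
print(nunique_percol_bincount(a))  # [2 2 2] -- the columns are (1,1,2), (0,0,3) and (0,0,4)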
Generic ndarray with generic axis
Let's see how we can extend this to an ndarray of generic dimensions and get the number of unique elements along a generic axis. We will make use of np.diff with its axis param to get the consecutive differences, like so -
def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1
Sample runs -
In [77]: a
Out[77]:
array([[1, 0, 2, 2, 0],
[1, 0, 1, 2, 0],
[0, 0, 0, 0, 2],
[1, 2, 1, 0, 1],
[2, 0, 1, 0, 0]])
In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])
In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
If you are working with floating point numbers and want to base the uniqueness check on some tolerance value rather than an absolute match, we can use np.isclose. Two such options would be -
(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)
For a custom tolerance value, pass it to np.isclose via its rtol/atol arguments.
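For example, a small sketch that wraps the first option into a function with an explicit tolerance (the function name nunique_tol and the atol default are illustrative, not part of the original answer):
import numpy as np

def nunique_tol(a, axis, atol=1e-8):
    # Sort along the axis, then count a new unique value whenever the
    # consecutive difference is not within the tolerance of zero
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    return (~np.isclose(d, 0, atol=atol)).sum(axis=axis) + 1

a = np.array([[1.0, 1.0 + 1e-12, 2.0],
              [3.0, 4.0, 5.0]])
print(nunique_tol(a, axis=1))  # [2 3] -> 1.0 and 1.0 + 1e-12 count as one value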
This solution via np.apply_along_axis isn't vectorised and involves a Python-level loop. But it is relatively intuitive using len + np.unique functions.
import numpy as np
from toolz import compose
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.apply_along_axis(compose(len, np.unique), 1, a) # [2, 2, 3]
A one-liner using sort:
In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])
Are you open to considering pandas? DataFrames have a dedicated method for this:
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3

Can NumPy take care that an array is (nonstrictly) increasing along one axis?

Is there a function in numpy to guarantee or rather fix an array such that it is (nonstrictly) increasing along one particular axis?
For example, I have the following 2D array:
X = array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
the output of np.foobar(X) should return
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
Does foobar exist or do I need to do that manually by using something like np.diff and some smart indexing?
Use np.maximum.accumulate for a running (accumulated) max value along that axis to enforce the non-strictly increasing requirement -
np.maximum.accumulate(X,axis=1)
Sample run -
In [233]: X
Out[233]:
array([[1, 2, 1, 4, 5],
[0, 3, 1, 5, 4]])
In [234]: np.maximum.accumulate(X,axis=1)
Out[234]:
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
For memory efficiency, we can assign the result back to the input for in-place changes with its out argument.
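For instance, a minimal sketch of the in-place variant (using the sample X from the question):
import numpy as np

X = np.array([[1, 2, 1, 4, 5],
              [0, 3, 1, 5, 4]])

# Write the running maximum back into X instead of allocating a new array
np.maximum.accumulate(X, axis=1, out=X)
print(X)
# [[1 2 2 4 5]
#  [0 3 3 5 5]]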
Runtime tests
Case #1 : Array as input
In [254]: X = np.random.rand(1000,1000)
In [255]: %timeit np.maximum.accumulate(X,axis=1)
1000 loops, best of 3: 1.69 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [256]: %timeit pd.DataFrame(X).cummax(axis=1).values
100 loops, best of 3: 4.81 ms per loop
Case #2 : Dataframe as input
In [257]: df = pd.DataFrame(np.random.rand(1000,1000))
In [258]: %timeit np.maximum.accumulate(df.values,axis=1)
1000 loops, best of 3: 1.68 ms per loop
# @cᴏʟᴅsᴘᴇᴇᴅ's pandas soln using df.cummax
In [259]: %timeit df.cummax(axis=1)
100 loops, best of 3: 4.68 ms per loop
pandas offers you the df.cummax function:
import pandas as pd
pd.DataFrame(X).cummax(axis=1).values
array([[1, 2, 2, 4, 5],
[0, 3, 3, 5, 5]])
It's useful to know that there's a first-class function on hand in case your data is already loaded into a DataFrame.

Padding a 2D numpy array with varying row lengths to the same size [duplicate]

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.
v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)
Trying to force another type will cause an exception:
np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.
What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?
From my sample sequence v, I would like to get something like this, if 0 is the placeholder
array([[1, 0], [1, 2]], dtype=int32)
You can use itertools.zip_longest:
import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out:
array([[1, 0],
[1, 2]])
Note: For Python 2, it is itertools.izip_longest.
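To see what zip_longest produces before the transpose, here is a small illustration (my own annotation): zip_longest pairs up the i-th elements of every row, effectively building the columns, so the result has to be transposed back.
import itertools
import numpy as np

v = [[1], [1, 2]]

cols = list(itertools.zip_longest(*v, fillvalue=0))  # [(1, 1), (0, 2)] -- column-wise tuples
print(np.array(cols).T)
# [[1 0]
#  [1 2]]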
Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -
def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape,dtype=int)
    out[mask] = np.concatenate(v)
    return out
Sample run
In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]
In [28]: out
Out[28]:
array([[1, 0, 0, 0, 0],
[1, 2, 0, 0, 0],
[3, 6, 7, 8, 9],
[4, 0, 0, 0, 0]])
*Please note that this is coined as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not very computationally demanding and should have minimal effect on the total runtime.
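To make the mask construction concrete, here is the same logic unrolled on the sample v above, with the intermediate values shown as comments (my own annotations):
import numpy as np

v = [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

lens = np.array([len(item) for item in v])    # [1 2 5 1]
mask = lens[:, None] > np.arange(lens.max())  # True wherever a real value belongs
# [[ True False False False False]
#  [ True  True False False False]
#  [ True  True  True  True  True]
#  [ True False False False False]]

out = np.zeros(mask.shape, dtype=int)
out[mask] = np.concatenate(v)  # boolean assignment fills the True slots row by row, in order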
Runtime test
In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, along with the boolean-indexing based one from this post, on a relatively larger dataset with three levels of size variation across the list elements.
Case #1 : Larger size variation
In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]
In [45]: v = v*1000
In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop
In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop
In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop
Case #2 : Lesser size variation
In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]
In [50]: v = v*1000
In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop
In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop
In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop
Case #3 : Larger number of elements (100 max) per list element
In [139]: # Setup inputs
...: N = 10000 # Number of elems in list
...: maxn = 100 # Max. size of a list element
...: lens = np.random.randint(0,maxn,(N))
...: v = [list(np.random.randint(0,9,(L))) for L in lens]
...:
In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop
In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop
In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop
To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; it would have to be taken on a case-by-case basis.
Pandas and its DataFrames deal beautifully with missing data.
import numpy as np
import pandas as pd
v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))
# array([[1, 0],
# [1, 2]], dtype=int32)
max_len = max(len(sub_list) for sub_list in v)
result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])
>>> result
array([[1, 0],
[1, 2]])
>>> type(result)
numpy.ndarray
Here is a general way:
>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack(np.insert(v, range(1, len(v)+1),[[0]*(max_len-len(i)) for i in v])).astype('int32').reshape(len(v), max_len)
array([[ 1, 0, 0, 0],
[ 2, 3, 4, 0],
[ 5, 6, 0, 0],
[ 7, 8, 9, 10],
[11, 12, 0, 0]], dtype=int32)
You can convert to a pandas DataFrame first, and after that convert it to a numpy array:
ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)
print(df)
# 0 1 2 3
# 0 1 2 3.0 NaN
# 1 4 5 NaN NaN
# 2 6 7 8.0 9.0
npl = df.to_numpy()
print(npl)
# [[ 1. 2. 3. nan]
# [ 4. 5. nan nan]
# [ 6. 7. 8. 9.]]
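If an integer array with 0 as the placeholder is wanted (as in the question), this can be combined with fillna; a small sketch:
import numpy as np
import pandas as pd

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
df = pd.DataFrame(ll)

# fillna(0) replaces the NaN placeholders, then cast to the desired integer dtype
npl_int = df.fillna(0).to_numpy(dtype=np.int32)
print(npl_int)
# [[1 2 3 0]
#  [4 5 0 0]
#  [6 7 8 9]]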
I was having a numpy broadcast error with Alexander's answer so I added a small variation with numpy.pad:
pad = len(max(X, key=len))
result = np.array([np.pad(i, (0, pad-len(i)), 'constant') for i in X])
If you want to extend the same logic to deeper levels (lists of lists of lists, ...), you can use TensorFlow ragged tensors and convert them to tensors/arrays. For example:
import tensorflow as tf
v = [[1], [1, 2]]
padded_v = tf.ragged.constant(v).to_tensor(0)
This creates an array padded with 0.
or a deeper example:
w = [[[1]], [[2],[1, 2]]]
padded_w = tf.ragged.constant(w).to_tensor(0)

Find the index of the n smallest values in a 3 dimensional numpy array

Given a 3 dimensional numpy array, how do I find the indices of the n smallest values? The index of the minimum value can be found as:
i,j,k = np.where(my_array == my_array.min())
Here's one approach for generic n-dims and generic N smallest numbers -
def smallestN_indices(a, N):
    idx = a.ravel().argsort()[:N]
    return np.stack(np.unravel_index(idx, a.shape)).T
Each row of the 2D output array would hold the indexing tuple that corresponds to one of the smallest array values.
We can also use argpartition, but that might not maintain the order. So, we need a bit more additional work with argsort there -
def smallestN_indices_argpartition(a, N, maintain_order=False):
    idx = np.argpartition(a.ravel(),N)[:N]
    if maintain_order:
        idx = idx[a.ravel()[idx].argsort()]
    return np.stack(np.unravel_index(idx, a.shape)).T
Sample run -
In [141]: np.random.seed(1234)
...: a = np.random.randint(111,999,(2,5,4,3))
...:
In [142]: smallestN_indices(a, N=3)
Out[142]:
array([[0, 3, 2, 0],
[1, 2, 3, 0],
[1, 2, 2, 1]])
In [143]: smallestN_indices_argpartition(a, N=3)
Out[143]:
array([[1, 2, 3, 0],
[0, 3, 2, 0],
[1, 2, 2, 1]])
In [144]: smallestN_indices_argpartition(a, N=3, maintain_order=True)
Out[144]:
array([[0, 3, 2, 0],
[1, 2, 3, 0],
[1, 2, 2, 1]])
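One usage note (a hedged sketch, not part of the original answer): each returned row can be turned into an indexing tuple to pull the corresponding values back out of a, which gives the N smallest values themselves in ascending order.
import numpy as np

np.random.seed(1234)
a = np.random.randint(111, 999, (2, 5, 4, 3))

idx = smallestN_indices(a, N=3)  # function defined above
vals = a[tuple(idx.T)]           # shape (3,), the three smallest entries of a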
Runtime test -
In [145]: a = np.random.randint(111,999,(20,50,40,30))
In [146]: %timeit smallestN_indices(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3)
...: %timeit smallestN_indices_argpartition(a, N=3, maintain_order=True)
...:
10 loops, best of 3: 97.6 ms per loop
100 loops, best of 3: 8.32 ms per loop
100 loops, best of 3: 8.34 ms per loop

dot product of two 1D vectors in numpy

I'm working with numpy in python to calculate a vector multiplication.
I have a vector x of dimensions n x 1 and I want to calculate x*x_transpose.
This gives me problems because x.T or x.transpose() doesn't affect a 1 dimensional vector (numpy represents vertical and horizontal vectors the same way).
But how do I calculate a (n x 1) x (1 x n) vector multiplication in numpy?
numpy.dot(x,x.T) gives a scalar, not a 2D matrix as I want.
You are essentially computing an Outer Product.
You can use np.outer.
In [15]: a=[1,2,3]
In [16]: np.outer(a,a)
Out[16]:
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
While np.outer is the simplest way to do this, I thought I'd just mention how you might manipulate the (N,)-shaped array to do this:
In [17]: a = np.arange(4)
In [18]: np.dot(a[:,None], a[None,:])
Out[18]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
In [19]: np.outer(a,a)
Out[19]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
Where you could alternatively replace None with np.newaxis.
Another more exotic way to do this is with np.einsum:
In [20]: np.einsum('i,j', a, a)
Out[20]:
array([[0, 0, 0, 0],
[0, 1, 2, 3],
[0, 2, 4, 6],
[0, 3, 6, 9]])
And just for fun, some timings, which are likely going to vary based on hardware and numpy version/compilation:
Small-ish vector
In [36]: a = np.arange(5, dtype=np.float64)
In [37]: %timeit np.outer(a,a)
100000 loops, best of 3: 17.7 µs per loop
In [38]: %timeit np.dot(a[:,None],a[None,:])
100000 loops, best of 3: 11 µs per loop
In [39]: %timeit np.einsum('i,j', a, a)
1 loops, best of 3: 11.9 µs per loop
In [40]: %timeit a[:, None] * a
100000 loops, best of 3: 9.68 µs per loop
And something a little larger
In [42]: a = np.arange(500, dtype=np.float64)
In [43]: %timeit np.outer(a,a)
1000 loops, best of 3: 605 µs per loop
In [44]: %timeit np.dot(a[:,None],a[None,:])
1000 loops, best of 3: 1.29 ms per loop
In [45]: %timeit np.einsum('i,j', a, a)
1000 loops, best of 3: 359 µs per loop
In [46]: %timeit a[:, None] * a
1000 loops, best of 3: 597 µs per loop
If you want an inner product, use numpy.dot(x, x); for the outer product, use numpy.outer(x, x).
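A minimal sketch of the difference:
import numpy as np

x = np.array([1, 2, 3])

print(np.dot(x, x))    # 14 -> inner product, a scalar
print(np.outer(x, x))  # 3 x 3 outer product matrix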
Another alternative is to define the row / column vector with 2-dimensions, e.g.
a = np.array([1, 2, 3], ndmin=2)
np.dot(a.T, a)
array([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Another alternative is to use numpy.matrix:
>>> a = np.matrix([1,2,3])
>>> a
matrix([[1, 2, 3]])
>>> a.T * a
matrix([[1, 2, 3],
[2, 4, 6],
[3, 6, 9]])
Generally use of numpy.arrays is preferred. However, using numpy.matrices can be more readable for long expressions.
