Suppose I have a numpy array
a = np.array([0, 8, 25, 78, 68, 98, 1])
and a mask array b = [0, 1, 1, 0, 1, 0, 0]
Is there an easy way to get the following array:
[8, 25, 68] - that is, the elements at indices 1, 2 and 4 of the original array, exactly the positions where b is 1. This sounds like a mask to me.
The most obvious way I have tried is a[b], but this does not yield the desired result: since b is an integer array, numpy performs fancy (integer) indexing rather than masking.
After this I tried to look into masked operations in numpy, but that seems to lead in the wrong direction.
If a and b are both numpy arrays and b is strictly 1's and 0's:
>>> a[b.astype(bool)]
array([ 8, 25, 68])
It should be noted that this is only noticeably faster for extremely small cases, and is much more limited in scope than @falsetru's answer:
a = np.random.randint(0,2,5)
%timeit a[a==1]
100000 loops, best of 3: 4.39 µs per loop
%timeit a[a.astype(bool)]
100000 loops, best of 3: 2.44 µs per loop
For the larger case:
a = np.random.randint(0,2,5_000_000)
%timeit a[a==1]
10 loops, best of 3: 59.6 ms per loop
%timeit a[a.astype(bool)]
10 loops, best of 3: 56 ms per loop
>>> a = np.array([0, 8, 25, 78, 68, 98, 1])
>>> b = np.array([0, 1, 1, 0, 1, 0, 0])
>>> a[b == 1]
array([ 8, 25, 68])
Alternative using itertools.compress:
>>> import itertools
>>> list(itertools.compress(a, b))
[8, 25, 68]
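If you need a NumPy array back rather than a Python list, a small sketch using np.fromiter:
>>> np.fromiter(itertools.compress(a, b), dtype=a.dtype)
array([ 8, 25, 68])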
Given a list of numbers, like this:
lst = [0, 10, 15, 17]
I'd like a list that has elements from i -> i + 3 for all i in lst. If there are overlapping ranges, I'd like them merged.
So, for the example above, we first get:
[0, 1, 2, 3, 10, 11, 12, 13, 15, 16, 17, 18, 17, 18, 19, 20]
But for the last 2 groups, the ranges overlap, so upon merging them, you have:
[0, 1, 2, 3, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20]
This is my desired output.
This is what I've thought of:
from collections import OrderedDict
res = list(OrderedDict.fromkeys([y for x in lst for y in range(x, x + 4)]).keys())
print(res)  # [0, 1, 2, 3, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20]
However, this is slow (10000 loops, best of 3: 56 µs per loop). I'd like a numpy solution if possible, or a python solution that's faster than this.
Approach #1 : One approach based on broadcasted summation and then using np.unique to get unique numbers -
np.unique(np.asarray(lst)[:,None] + np.arange(4))
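A quick check on the question's sample input; note that np.unique also sorts, which is what merges the overlapping ranges into one increasing sequence:
lst = [0, 10, 15, 17]
print(np.unique(np.asarray(lst)[:, None] + np.arange(4)))
# [ 0  1  2  3 10 11 12 13 15 16 17 18 19 20]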
Approach #2 : Another based on broadcasted summation and then masking -
def mask_app(lst, interval_len=4):
    arr = np.array(lst)
    r = np.arange(interval_len)
    ranged_vals = arr[:,None] + r
    a_diff = arr[1:] - arr[:-1]
    # keep a value only while it is still below the next interval's start;
    # the last row has no successor, so it is always fully valid
    valid_mask = np.vstack((a_diff[:,None] > r, np.ones(interval_len,dtype=bool)))
    return ranged_vals[valid_mask]
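A quick usage example; note the masking logic assumes lst is sorted, as in the question and in the timing setup below:
print(mask_app([0, 10, 15, 17]))
# [ 0  1  2  3 10 11 12 13 15 16 17 18 19 20]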
Runtime test
Original approach -
from collections import OrderedDict
def org_app(lst):
    return list(OrderedDict.fromkeys([y for x in lst for y in range(x, x + 4)]).keys())
Timings -
In [409]: n = 10000
In [410]: lst = np.unique(np.random.randint(0,4*n,(n))).tolist()
In [411]: %timeit org_app(lst)
...: %timeit np.unique(np.asarray(lst)[:,None] + np.arange(4))
...: %timeit mask_app(lst, interval_len = 4)
...:
10 loops, best of 3: 32.7 ms per loop
1000 loops, best of 3: 1.03 ms per loop
1000 loops, best of 3: 671 µs per loop
In [412]: n = 100000
In [413]: lst = np.unique(np.random.randint(0,4*n,(n))).tolist()
In [414]: %timeit org_app(lst)
...: %timeit np.unique(np.asarray(lst)[:,None] + np.arange(4))
...: %timeit mask_app(lst, interval_len = 4)
...:
1 loop, best of 3: 350 ms per loop
100 loops, best of 3: 14.7 ms per loop
100 loops, best of 3: 9.73 ms per loop
The bottleneck with the two posted approaches seems to be the conversion from list to array, though that cost pays off well afterwards. Just to give a sense of the time spent on the conversion for the last dataset -
In [415]: %timeit np.array(lst)
100 loops, best of 3: 5.6 ms per loop
I am trying to calculate the signal energy of my pandas.DataFrame, following this formula for a discrete-time signal. I tried apply and applymap, and also reduce, as suggested here: How do I columnwise reduce a pandas dataframe?. But everything I tried applied the operation to each element, not to the whole column.
This is not a signal-processing-specific question; it's just an example of how to apply a "summarizing" (I don't know the right term for this) function to columns.
My workaround was to get the raw numpy.array data and do my calculations on that. But I am pretty sure there is a pandatic way to do this (and surely a more numpyic way).
import pandas as pd
import numpy as np
d = np.array([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
[0, -1, 2, -3, 4, -5, 6, -7, 8, -9],
[0, 1, -2, 3, -4, 5, -6, 7, -8, 9]]).transpose()
df = pd.DataFrame(d)
energies = []
# a is the same as d
a = df.to_numpy()  # df.as_matrix() was removed in newer pandas versions
assert np.array_equal(a, d)
for column in range(a.shape[1]):
    energies.append(sum(a[:,column] ** 2))
print(energies) # [40, 285, 285]
Thanks in advance!
You could do the following for dataframe output -
(df**2).sum(axis=0) # Or (df**2).sum(0)
For performance, we could work with array extracted from the dataframe -
(df.values**2).sum(axis=0) # Or (df.values**2).sum(0)
For further performance boost, there's np.einsum -
a = df.values
out = np.einsum('ij,ij->j',a,a)
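As a quick sanity check on the question's example data, all three variants reproduce the loop-based result:
print((df**2).sum(0).tolist())               # [40, 285, 285]
print((df.values**2).sum(0).tolist())        # [40, 285, 285]
print(np.einsum('ij,ij->j', a, a).tolist())  # [40, 285, 285]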
Runtime test -
In [31]: df = pd.DataFrame(np.random.randint(0,9,(1000,30)))
In [32]: %timeit (df**2).sum(0)
1000 loops, best of 3: 518 µs per loop
In [33]: %timeit (df.values**2).sum(0)
10000 loops, best of 3: 40.2 µs per loop
In [34]: def einsum_based(df):
    ...:     a = df.values
    ...:     return np.einsum('ij,ij->j',a,a)
    ...:
In [35]: %timeit einsum_based(df)
10000 loops, best of 3: 32.2 µs per loop
You can use DataFrame.pow with DataFrame.sum:
print (df.pow(2).sum())
0 40
1 285
2 285
dtype: int64
print (df.pow(2).sum().values.tolist())
[40, 285, 285]
There is also df.var(), which returns the variance of the columns. That is related to, but not the same as, the energy: variance subtracts the column mean, and pandas uses ddof=1 by default, so df.var(ddof=0)*df.shape[0] equals the energy only for zero-mean signals.
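To make that caveat concrete, on the question's data the variance-based estimate differs from the true energy because none of the columns has zero mean:
print((df**2).sum(0).tolist())                  # [40, 285, 285] (true energy)
print((df.var(ddof=0) * df.shape[0]).tolist())  # [0.0, 282.5, 282.5]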
I'm interested in getting the location of the minimum value in a 1-D NumPy array that meets a certain condition (in my case, a minimum threshold). For example:
import numpy as np
limit = 3
a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
I'd like to effectively mask all numbers in a that are under the limit, such that the result of np.argmin would be 6. Is there a computationally cheap way to mask values that don't meet a condition and then apply np.argmin?
You could store the valid indices, use them to select the valid elements from a, and then index back into them with the argmin() among the selected elements to get the final index output. The implementation would look something like this -
valid_idx = np.where(a >= limit)[0]
out = valid_idx[a[valid_idx].argmin()]
Sample run -
In [32]: limit = 3
...: a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
...:
In [33]: valid_idx = np.where(a >= limit)[0]
In [34]: valid_idx[a[valid_idx].argmin()]
Out[34]: 6
Runtime test -
For performance benchmarking, in this section I am comparing the masked-array-based solution from the other answer against the regular-array-based solution proposed earlier in this post, across various data sizes.
def masked_argmin(a,limit): # Defining func for regular array based soln
    valid_idx = np.where(a >= limit)[0]
    return valid_idx[a[valid_idx].argmin()]
In [52]: # Inputs
...: a = np.random.randint(0,1000,(10000))
...: limit = 500
...:
In [53]: %timeit np.argmin(np.ma.MaskedArray(a, a<limit))
1000 loops, best of 3: 233 µs per loop
In [54]: %timeit masked_argmin(a,limit)
10000 loops, best of 3: 101 µs per loop
In [55]: # Inputs
...: a = np.random.randint(0,1000,(100000))
...: limit = 500
...:
In [56]: %timeit np.argmin(np.ma.MaskedArray(a, a<limit))
1000 loops, best of 3: 1.73 ms per loop
In [57]: %timeit masked_argmin(a,limit)
1000 loops, best of 3: 1.03 ms per loop
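One caveat that applies at any size: if no element satisfies the condition, valid_idx is empty and argmin() raises a ValueError. A minimal sketch of a guard:
valid_idx = np.where(a >= limit)[0]
if valid_idx.size:
    out = valid_idx[a[valid_idx].argmin()]
else:
    out = None  # no element meets the condition; handle as appropriate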
This can simply be accomplished using numpy's MaskedArray
import numpy as np
limit = 3
a = np.array([1, 2, 4, 5, 2, 5, 3, 6, 7, 9, 10])
b = np.ma.MaskedArray(a, a<limit)
np.ma.argmin(b) # == 6
Given a numpy array x of shape (m,) and a numpy array y of shape (m/n,), how do I efficiently multiply each length-n block of x by the corresponding element of y?
Here's my best attempt:
In [13]: x = np.array([1, 5, 3, 2, 9, 1])
In [14]: y = np.array([2, 4, 6])
In [15]: n = 2
In [16]: (y[:, np.newaxis] * x.reshape((-1, n))).flatten()
Out[16]: array([ 2, 10, 12, 8, 54, 6])
Your solution looks pretty good to me.
If you wanted to speed it up slightly, you could:
Use ravel() instead of flatten() (the former will return a view if possible, the latter always returns a copy).
Reshape x in Fortran order to avoid the overhead of another indexing operation on y (although subsequent timings suggest this speedup is negligible)
So rewritten the multiplication becomes:
>>> (x.reshape((n, -1), order='f') * y).ravel('f')
array([ 2, 10, 12, 8, 54, 6])
Timings:
>>> %timeit (y[:, np.newaxis] * x.reshape((-1, n))).flatten()
100000 loops, best of 3: 7.4 µs per loop
>>> %timeit (x.reshape((n, -1), order='f') * y).ravel('f')
100000 loops, best of 3: 4.98 µs per loop
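If this pattern comes up often, it could be wrapped in a small helper. A hedged sketch (repeat_multiply is a hypothetical name; it assumes y.size divides x.size evenly):
def repeat_multiply(x, y):
    # each element of y scales one contiguous block of x of length x.size // y.size
    n = x.size // y.size
    return (x.reshape((n, -1), order='F') * y).ravel('F')

>>> repeat_multiply(np.array([1, 5, 3, 2, 9, 1]), np.array([2, 4, 6]))
array([ 2, 10, 12,  8, 54,  6])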
I want to broadcast an array b to the shape it would take if it were in an arithmetic operation with another array a.
For example, if a.shape = (3,3) and b was a scalar, I want to get an array whose shape is (3,3) and is filled with the scalar.
One way to do this is like this:
>>> import numpy as np
>>> a = np.arange(9).reshape((3,3))
>>> b = 1 + a*0
>>> b
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
Although this works practically, I can't help but feel it looks a bit weird, and wouldn't be obvious to someone else looking at the code what I was trying to do.
Is there any more elegant way to do this? I've looked at the documentation for np.broadcast, but iterating over it is orders of magnitude slower:
In [1]: a = np.arange(10000).reshape((100,100))
In [2]: %timeit 1 + a*0
10000 loops, best of 3: 31.9 us per loop
In [3]: %timeit bc = np.broadcast(a,1);np.fromiter((v for u, v in bc),float).reshape(bc.shape)
100 loops, best of 3: 5.2 ms per loop
In [4]: 5.2e-3/32e-6
Out[4]: 162.5
If you just want to fill an array with a scalar, fill is probably the best choice. But it sounds like you want something more generalized. Rather than using broadcast, you can use broadcast_arrays to get the result that (I think) you want.
>>> a = numpy.arange(9).reshape(3, 3)
>>> numpy.broadcast_arrays(a, 1)[1]
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
This generalizes to any two broadcastable shapes:
>>> numpy.broadcast_arrays(a, [1, 2, 3])[1]
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
It's not quite as fast as your ufunc-based method, but it's still on the same order of magnitude:
>>> %timeit 1 + a * 0
10000 loops, best of 3: 23.2 us per loop
>>> %timeit numpy.broadcast_arrays(a, 1)[1]
10000 loops, best of 3: 52.3 us per loop
But for scalars, fill is still the clear front-runner:
>>> %timeit b = numpy.empty_like(a, dtype='i8'); b.fill(1)
100000 loops, best of 3: 6.59 us per loop
Finally, further testing shows that the fastest approach -- in at least some cases -- is to multiply by ones:
>>> %timeit numpy.broadcast_arrays(a, numpy.arange(100))[1]
10000 loops, best of 3: 53.4 us per loop
>>> %timeit (1 + a * 0) * numpy.arange(100)
10000 loops, best of 3: 45.9 us per loop
>>> %timeit b = numpy.ones_like(a, dtype='i8'); b * numpy.arange(100)
10000 loops, best of 3: 28.9 us per loop
The fastest and cleanest solution I know is:
b_arr = numpy.empty(a.shape)  # allocate an uninitialized array with a's shape
b_arr.fill(b)                 # fill every element with the single value b
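For reference, newer NumPy (1.8+) also provides np.full, which performs both steps in one call:
b_arr = numpy.full(a.shape, b)  # allocate and fill in one step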
fill sounds like the simplest way:
>>> a = np.arange(9).reshape((3,3))
>>> a
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> a.fill(10)
>>> a
array([[10, 10, 10],
[10, 10, 10],
[10, 10, 10]])
EDIT: As @EOL points out, you don't need arange if you want to create a new array; np.empty((100,100)) (or whatever shape) is better for this.
Timings:
In [3]: a = np.arange(10000).reshape((100,100))
In [4]: %timeit 1 + a*0
100000 loops, best of 3: 19.9 us per loop
In [5]: a = np.arange(10000).reshape((100,100))
In [6]: %timeit a.fill(1)
100000 loops, best of 3: 3.73 us per loop
If you just need to broadcast a scalar to some arbitrary shape, you can do something like this:
a = b*np.ones(shape=(3,3))
Edit: np.tile is more general. You can use it to duplicate any scalar/vector in any number of dimensions:
b = 1
N = 100
a = np.tile(b, reps=(N, N))
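For example, tiling a vector into two dimensions:
a = np.tile([1, 2, 3], reps=(2, 2))
# array([[1, 2, 3, 1, 2, 3],
#        [1, 2, 3, 1, 2, 3]])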