Given a numpy array, how can I quickly figure out whether it contains only 0 and 1?
Is there any implemented method?
A few approaches -
((a==0) | (a==1)).all()
~((a!=0) & (a!=1)).any()
np.count_nonzero((a!=0) & (a!=1))==0
a.size == np.count_nonzero((a==0) | (a==1))
Runtime test -
In [313]: a = np.random.randint(0,2,(3000,3000)) # Only 0s and 1s
In [314]: %timeit ((a==0) | (a==1)).all()
...: %timeit ~((a!=0) & (a!=1)).any()
...: %timeit np.count_nonzero((a!=0) & (a!=1))==0
...: %timeit a.size == np.count_nonzero((a==0) | (a==1))
...:
10 loops, best of 3: 28.8 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 28.9 ms per loop
10 loops, best of 3: 28.8 ms per loop
In [315]: a = np.random.randint(0,3,(3000,3000)) # Contains 2 as well
In [316]: %timeit ((a==0) | (a==1)).all()
...: %timeit ~((a!=0) & (a!=1)).any()
...: %timeit np.count_nonzero((a!=0) & (a!=1))==0
...: %timeit a.size == np.count_nonzero((a==0) | (a==1))
...:
10 loops, best of 3: 28 ms per loop
10 loops, best of 3: 27.5 ms per loop
10 loops, best of 3: 29.1 ms per loop
10 loops, best of 3: 28.9 ms per loop
Their runtimes seem to be comparable.
It looks like you can achieve it with something like:
np.array_equal(a, a.astype(bool))
If your array is large, this avoids creating several large temporary boolean arrays (as some of the other answers do). Thus, it should probably be slightly faster than the other answers (not tested, however).
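A quick usage sketch (the helper name is just illustrative):

import numpy as np

def is_binary(arr):
    # arr.astype(bool) maps 0 -> False and everything else -> True, so the
    # comparison only succeeds if every element was already exactly 0 or 1
    return np.array_equal(arr, arr.astype(bool))

print(is_binary(np.array([[0, 1, 1], [1, 0, 0]])))  # True
print(is_binary(np.array([0, 1, 2])))               # False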
With only a single loop over the data:
0 <= np.bitwise_or.reduce(ar) <= 1
Note that this doesn't work for floating point dtype.
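A quick illustration for integer input (OR-ing all the elements stays within {0, 1} only if every element is 0 or 1; any other value sets extra bits or the sign bit):

import numpy as np

ar = np.array([0, 1, 1, 0])
print(0 <= np.bitwise_or.reduce(ar) <= 1)   # True

ar2 = np.array([0, 1, 2])
print(0 <= np.bitwise_or.reduce(ar2) <= 1)  # False: OR-ing in 2 gives 3

# for a multi-dimensional array, reduce over the flattened view, e.g. ar.ravel()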
If the values are guaranteed non-negative you can get short-circuiting behavior:
try:
    np.empty((2,), bool)[ar]
    is_binary = True
except IndexError:
    is_binary = False
This method (always) allocates a temp array of the same shape as the argument and seems to loop over the data slower than the first method.
If you have access to Numba (or alternatively Cython), you can write something like the following, which will be significantly faster at catching non-binary arrays since it short-circuits: it stops immediately instead of continuing over all of the elements:
import numpy as np
import numba as nb

@nb.njit
def check_binary(x):
    is_binary = True
    for v in np.nditer(x):
        if v.item() != 0 and v.item() != 1:
            is_binary = False
            break
    return is_binary
Running this in pure python without the aid of an accelerator like Numba or Cython makes this approach prohibitively slow.
Timings:
a = np.random.randint(0,2,(3000,3000)) # Only 0s and 1s
%timeit ((a==0) | (a==1)).all()
# 100 loops, best of 3: 15.1 ms per loop
%timeit check_binary(a)
# 100 loops, best of 3: 11.6 ms per loop
a = np.random.randint(0,3,(3000,3000)) # Contains 2 as well
%timeit ((a==0) | (a==1)).all()
# 100 loops, best of 3: 14.9 ms per loop
%timeit check_binary(a)
# 1000000 loops, best of 3: 543 ns per loop
We could use np.isin().
input_array = input_array.squeeze(-1)
is_binary = np.isin(input_array, [0,1]).all()
1st line:
squeeze drops the trailing length-1 axis of the input array, so we don't have to deal with the complication of np.isin() on a multi-dimensional array.
2nd line:
np.isin() checks whether each element of the input is 0 or 1.
np.isin() returns a boolean array like [True, False, True, True, ...].
Then all() ensures that the result contains only True.
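A minimal self-contained sketch (np.isin accepts arrays of any shape and returns a boolean array of the same shape, so the squeeze step only matters when the input carries a trailing length-1 axis you want to drop):

import numpy as np

arr = np.array([[0, 1], [1, 0]])

# Elementwise membership test against [0, 1], collapsed to a single bool
print(np.isin(arr, [0, 1]).all())  # True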
The following should work:
ans = set(arr).issubset([0,1])
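One caveat, sketched below: set() over a 2-D array iterates over its rows (which are unhashable), so for multi-dimensional input you would flatten first.

import numpy as np

arr = np.array([[0, 1, 1], [1, 0, 0]])

# ravel() gives a flat view whose scalar elements are hashable
print(set(arr.ravel()).issubset([0, 1]))  # True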
How about numpy unique?
np.unique(arr)
Should return array([0, 1]) if the array is binary (or array([0]) / array([1]) if only one of the two values is present).
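To turn that into a boolean test (keeping in mind that np.unique only returns the values actually present, so an all-zeros array yields just [0]), something like this sketch works:

import numpy as np

arr = np.array([0, 1, 1, 0, 1])

# The sorted distinct values must form a subset of {0, 1}
print(set(np.unique(arr)).issubset({0, 1}))  # True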
Consider this performance test in IPython under Python 3:
Create a range, a range_iterator and a generator
In [1]: g1 = range(1000000)
In [2]: g2 = iter(range(1000000))
In [3]: g3 = (i for i in range(1000000))
Measure time for summing using python native sum
In [4]: %timeit sum(g1)
10 loops, best of 3: 47.4 ms per loop
In [5]: %timeit sum(g2)
The slowest run took 374430.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 123 ns per loop
In [6]: %timeit sum(g3)
The slowest run took 1302907.54 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 128 ns per loop
Not sure if I should worry about the warning. The range version timing is very long (why?), but the range_iterator and the generator are similar.
Now let's use numpy.sum
In [7]: import numpy as np
In [8]: %timeit np.sum(g1)
10 loops, best of 3: 174 ms per loop
In [9]: %timeit np.sum(g2)
The slowest run took 8.47 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.51 µs per loop
In [10]: %timeit np.sum(g3)
The slowest run took 9.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 446 ns per loop
g1 and g3 became ~3.5x slower, but the range_iterator g2 is now some 50x slower compared to the native sum. g3 wins.
In [11]: type(g1)
Out[11]: range
In [12]: type(g2)
Out[12]: range_iterator
In [13]: type(g3)
Out[13]: generator
Why such a penalty to range_iterator on numpy.sum? Should such objects be avoided? Does it generalize - do "home made" generators always beat other objects in numpy?
EDIT 1: I realized that np.sum does not evaluate the range_iterator but returns another range_iterator object. So this comparison is not good. Why doesn't it get evaluated?
EDIT 2: I also realized that numpy.sum keeps the range in (32-bit) integer form and accordingly gives the wrong result for my sum due to integer overflow.
In [12]: sum(range(1000000))
Out[12]: 499999500000
In [13]: np.sum(range(1000000))
Out[13]: 1783293664
In [14]: np.sum(range(1000000), dtype=float)
Out[14]: 499999500000.0
Intermediate conclusion - don't use numpy.sum on non-numpy objects...?
Did you look at the results of repeated sums on the iter?
95:~/mypy$ g2=iter(range(10))
96:~/mypy$ sum(g2)
Out[96]: 45
97:~/mypy$ sum(g2)
Out[97]: 0
98:~/mypy$ sum(g2)
Out[98]: 0
Why the 0s? Because g2 can be used only once. Same goes for the generator expression.
Or look at it with list
100:~/mypy$ g2=iter(range(10))
101:~/mypy$ list(g2)
Out[101]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
102:~/mypy$ list(g2)
Out[102]: []
In Python 3, range is a range object, not a list. So it's an iterable that regenerates its values each time it is used.
As for np.sum, np.sum(range(10)) has to make an array first.
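Roughly, what np.sum(range(10)) does under the hood looks like the sketch below; the conversion step is the expensive part for a large range:

import numpy as np

# np.sum first converts the range into an ndarray, then sums that array
arr = np.asarray(range(10))   # the O(n) conversion step
print(np.sum(arr))            # 45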
When operating on a list, the Python sum is quite fast, faster than np.sum on the same:
116:~/mypy$ %%timeit x=list(range(10000))
...: sum(x)
1000 loops, best of 3: 202 µs per loop
117:~/mypy$ %%timeit x=list(range(10000))
...: np.sum(x)
1000 loops, best of 3: 1.62 ms per loop
But operating on an array, np.sum does much better
118:~/mypy$ %%timeit x=np.arange(10000)
...: sum(x)
100 loops, best of 3: 5.92 ms per loop
119:~/mypy$ %%timeit x=np.arange(10000)
...: np.sum(x)
<caching warning>
100000 loops, best of 3: 18.6 µs per loop
Another timing - various ways of making an array. fromiter can be faster than np.array; but the builtin arange is much better.
124:~/mypy$ timeit np.array(range(100000))
10 loops, best of 3: 39.2 ms per loop
125:~/mypy$ timeit np.fromiter(range(100000),int)
100 loops, best of 3: 12.9 ms per loop
126:~/mypy$ timeit np.arange(100000)
The slowest run took 6.93 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 106 µs per loop
Use range if you intend to work with lists; but use numpy's own range (np.arange) if you need to work with arrays. There is an overhead when creating arrays, so they are more valuable when working with large ones.
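A small sketch of that guideline:

import numpy as np

lst = list(range(10000))   # Python list: build it with the builtin range
arr = np.arange(10000)     # ndarray: build it directly with np.arange

print(sum(lst))    # the native sum is the fast path for lists
print(arr.sum())   # ndarray.sum (or np.sum) is the fast path for arrays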
==================
On the question of how np.sum handles an iterator - it doesn't. Look at what np.array does to such an object:
In [12]: np.array(iter(range(10)))
Out[12]: array(<range_iterator object at 0xb5998f98>, dtype=object)
It produces a single element array with dtype object.
fromiter will evaluate this iterable:
In [13]: np.fromiter(iter(range(10)),int)
Out[13]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.array follows some complicated rules when it comes to converting the input to an array. It's designed to work primarily with a list of numbers or nested equal-length lists.
If you have questions about how an np function handles a non-array object, first check what np.array does to that object.
What is the best way to check if a numpy array contains any element of another array?
example:
array1 = [10,5,4,13,10,1,1,22,7,3,15,9]
array2 = [3,4,9,10,13,15,16,18,19,20,21,22,23]
I want to get a True if array1 contains any value of array2, otherwise a False.
Using Pandas, you can use isin:
import numpy as np
import pandas as pd

a1 = np.array([10,5,4,13,10,1,1,22,7,3,15,9])
a2 = np.array([3,4,9,10,13,15,16,18,19,20,21,22,23])
>>> pd.Series(a1).isin(a2).any()
True
And using the in1d numpy function (per the comment from @Norman):
>>> np.any(np.in1d(a1, a2))
True
For small arrays such as those in this example, the solution using set is the clear winner. For larger, dissimilar arrays (i.e. no overlap), the Pandas and Numpy solutions are faster. However, np.intersect1d appears to excel for larger arrays.
Small arrays (12-13 elements)
%timeit set(array1) & set(array2)
The slowest run took 4.22 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.69 µs per loop
%timeit any(i in a1 for i in a2)
The slowest run took 12.29 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 1.88 µs per loop
%timeit np.intersect1d(a1, a2)
The slowest run took 10.29 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 15.6 µs per loop
%timeit np.any(np.in1d(a1, a2))
10000 loops, best of 3: 27.1 µs per loop
%timeit pd.Series(a1).isin(a2).any()
10000 loops, best of 3: 135 µs per loop
Using an array with 100k elements (no overlap):
a3 = np.random.randint(0, 100000, 100000)
a4 = a3 + 100000
%timeit np.intersect1d(a3, a4)
100 loops, best of 3: 13.8 ms per loop
%timeit pd.Series(a3).isin(a4).any()
100 loops, best of 3: 18.3 ms per loop
%timeit np.any(np.in1d(a3, a4))
100 loops, best of 3: 18.4 ms per loop
%timeit set(a3) & set(a4)
10 loops, best of 3: 23.6 ms per loop
%timeit any(i in a3 for i in a4)
1 loops, best of 3: 34.5 s per loop
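On newer NumPy (1.13+), np.isin is the documented element-wise successor to in1d, so the same membership check can be spelled as in this short sketch:

import numpy as np

a1 = np.array([10, 5, 4, 13, 10, 1, 1, 22, 7, 3, 15, 9])
a2 = np.array([3, 4, 9, 10, 13, 15, 16, 18, 19, 20, 21, 22, 23])

# Boolean array marking which elements of a1 appear in a2, collapsed with any()
print(np.isin(a1, a2).any())  # True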
You can try this
>>> array1 = [10,5,4,13,10,1,1,22,7,3,15,9]
>>> array2 = [3,4,9,10,13,15,16,18,19,20,21,22,23]
>>> set(array1) & set(array2)
set([3, 4, 9, 10, 13, 15, 22])
If the result is a non-empty set, there are common elements in both arrays.
If the result is empty, there are no common elements.
You can use the any built-in function with a generator expression:
>>> array1 = [10,5,4,13,10,1,1,22,7,3,15,9]
>>> array2 = [3,4,9,10,13,15,16,18,19,20,21,22,23]
>>> any(i in array2 for i in array1)
True
In pandas' manual, there is this example about indexing:
In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]
Then Wes wrote:
# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]
Can anyone here explain a bit why the map approach is faster? Is this a Python feature or a pandas feature?
Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:
import numpy as np, pandas as pd
import random, string
def make_test(num, width):
    s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
    df = pd.DataFrame({"a": s})
    return df
Let's compare the time they take to make the indexing object -- whether a Series or a list -- and the resulting time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast but before using it as an index it needs to be internally converted to a Series or an ndarray or something and so there's extra time added there.
First, for a small frame:
>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 µs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 µs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 µs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 µs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 µs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 µs per loop
and in this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than using str.startswith directly, but it's really hard to guess. 10 rows is small enough that we're probably still measuring things like setup costs for Series; what happens with a larger frame?
>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop
And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?
>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop
and now the listcomp wins again!
Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
Often when working with numpy I find the distinction annoying - when I pull out a vector or a row from a matrix and then perform operations with np.arrays, there are usually problems.
To reduce headaches, I've sometimes taken to just using np.matrix (converting all np.arrays to np.matrix) for simplicity. However, I suspect there are some performance implications. Could anyone comment on what those might be and the reasons why?
It seems like if they are both just arrays underneath the hood, element access is simply an offset calculation to get the value, so I'm not sure without reading through the entire source what the difference might be.
More specifically, what performance implications does this have:
v = np.matrix([1, 2, 3, 4])
# versus the below
w = np.array([1, 2, 3, 4])
Thanks.
I added some more tests, and it appears that an array is considerably faster than a matrix when the arrays/matrices are small, but the difference gets smaller for larger data structures:
Small (2x4):
In [11]: a = [[1,2,3,4],[5,6,7,8]]
In [12]: aa = np.array(a)
In [13]: ma = np.matrix(a)
In [14]: %timeit aa.sum()
1000000 loops, best of 3: 1.77 us per loop
In [15]: %timeit ma.sum()
100000 loops, best of 3: 15.1 us per loop
In [16]: %timeit np.dot(aa, aa.T)
1000000 loops, best of 3: 1.72 us per loop
In [17]: %timeit ma * ma.T
100000 loops, best of 3: 7.46 us per loop
Larger (100x100):
In [19]: aa = np.arange(10000).reshape(100,100)
In [20]: ma = np.matrix(aa)
In [21]: %timeit aa.sum()
100000 loops, best of 3: 9.18 us per loop
In [22]: %timeit ma.sum()
10000 loops, best of 3: 22.9 us per loop
In [23]: %timeit np.dot(aa, aa.T)
1000 loops, best of 3: 1.26 ms per loop
In [24]: %timeit ma * ma.T
1000 loops, best of 3: 1.24 ms per loop
Notice that matrices are actually slightly faster for multiplication.
I believe that what I am getting here is consistent with what @Jaime explained in the comment.
There is a general discussion on SciPy.org and in this question.
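If you end up with a matrix but want plain-ndarray behaviour (and the faster code paths timed above), np.asarray gives an ndarray view of the same data without copying - a small sketch:

import numpy as np

ma = np.matrix([[1, 2, 3, 4], [5, 6, 7, 8]])

# np.asarray returns a base-class ndarray sharing the matrix's memory,
# so later operations go through the plain-ndarray implementation
aa = np.asarray(ma)
print(type(aa).__name__, aa.shape, aa.sum())  # ndarray (2, 4) 36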
To compare performance, I did the following in IPython. It turns out that arrays are significantly faster.
In [1]: import numpy as np
In [2]: %%timeit
...: v = np.matrix([1, 2, 3, 4])
100000 loops, best of 3: 16.9 us per loop
In [3]: %%timeit
...: w = np.array([1, 2, 3, 4])
100000 loops, best of 3: 7.54 us per loop
Therefore numpy arrays seem to have faster performance than numpy matrices (at least for construction, which is what the timings above measure).
Versions used:
Numpy: 1.7.1
IPython: 0.13.2
Python: 2.7