numpy methods/attributes faster than numpy functions?

I recently noticed that some numpy array attributes/methods seem to be significantly faster than the corresponding numpy functions. Example for np.conj(x) vs. x.conjugate():
import numpy as np
import time

np.random.seed(100)
t0_1 = 0
t0_2 = 0
for i in range(1000):
    a = np.random.rand(10000)
    t0 = time.time()
    b = np.conjugate(a)
    t0_1 += time.time() - t0; t0 = time.time()
    c = a.conjugate()
    t0_2 += time.time() - t0; t0 = time.time()
print(t0_1, t0_2)
# example output times: 0.01222848892211914 0.0008714199066162109
Even without proper benchmarks, this looks like a performance gain of more than a factor of 10. Similarly, x.real, x.imag, x.max() and other basic attributes/methods seem to be faster than the corresponding functions np.real(x), np.imag(x), np.max(x), etc.
Can somebody explain to me where the time saving comes from? Does it have to do with in-place operations vs. new array creation? Are there certain checks that the numpy functions do which are skipped for the array methods? Thank you in advance!
Update: Below is a simple comparison of computation times for several common numpy functions/methods, for float, complex and boolean arrays. The largest speed-gain factors of methods over functions (float/complex/bool) appear to be for a.real (12/15/12), a.imag (70/15/26) and a.conj (80/15/33), as explained in the post by @hpaulj (though imag and conj are not useful for real arrays), and for a.sort (5/5/1.5) (my guess is that this is due to the in-place operation) and a.max/a.min (1.6 for bool) (again, max and min are not useful for bool arrays). Other speed gains are typically between 1.1 and 1.4. For a.argsort, a.std and a.__len__, the factors are often around 1; for a.__abs__ they are even below 1.
So it looks like, except for a.real, a.imag and a.sort, the speed gains are often not large, say around 1.2. However, this may depend on array size, whether the array is (partially) sorted, etc.
import numpy as np
from IPython import get_ipython
ipython = get_ipython()

np.random.seed(1000)
asize = 10000
dtype_list = ['float', 'complex', 'bool']
for i in range(3):
    print(dtype_list[i])
    print('-----------------')
    if i == 0:
        a = np.random.rand(asize)
    elif i == 1:
        a = np.random.rand(asize) + 1j*np.random.rand(asize)
    elif i == 2:
        a = np.random.randint(2, size=asize).astype(bool)
    function_list = [np.real, np.imag, np.conj, np.sum, np.cumsum, np.prod, np.cumprod,
                     np.max, np.min, np.argmax, np.argmin, np.mean, np.var, np.std,
                     np.sort, np.argsort, np.all, np.any, np.abs, len]
    methatt_list = [a.real, a.imag, a.conj, a.sum, a.cumsum, a.prod, a.cumprod,
                    a.max, a.min, a.argmax, a.argmin, a.mean, a.var, a.std,
                    a.sort, a.argsort, a.all, a.any, a.__abs__, a.__len__]
    for j in range(len(function_list)):
        print(function_list[j].__name__)
        ipython.magic('timeit function_list[j](a)')
        if callable(methatt_list[j]):
            ipython.magic('timeit methatt_list[j]()')
        else:
            ipython.magic('timeit methatt_list[j]')
        print('')
# float
# -----------------
# real
# 740 ns ± 13.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.7 ns ± 0.226 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 4.45 µs ± 36.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 60.9 ns ± 0.353 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 9.64 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 124 ns ± 0.238 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# sum
# 15.8 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 11.8 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 42.4 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37.7 µs ± 38.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 32.7 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 29 µs ± 57.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 51.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 47.1 µs ± 154 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 14.5 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 10.7 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# amin
# 14.6 µs ± 90.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 10.7 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmax
# 11.1 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 8.62 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmin
# 11.5 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 8.76 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# mean
# 23.5 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 19.6 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 78.6 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 73.3 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 86.7 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 81.9 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 659 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 141 µs ± 682 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argsort
# 156 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 151 µs ± 704 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# all
# 23.4 µs ± 41.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 17.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# any
# 23.4 µs ± 72.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 17.3 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# absolute
# 7.1 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 7.25 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# len
# 125 ns ± 0.17 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# complex
# -----------------
# real
# 920 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 61.1 ns ± 0.0517 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 898 ns ± 0.792 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 61.3 ns ± 0.178 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 18.1 µs ± 45.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 18.6 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# sum
# 24 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 18.7 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 44.8 µs ± 80.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 39.4 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 99.6 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 95.4 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 94.9 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 89.7 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 41.3 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amin
# 41.7 µs ± 65.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37.1 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argmax
# 27.4 µs ± 47.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 24.5 µs ± 77.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argmin
# 28.8 µs ± 28.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 25.5 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# mean
# 32.2 µs ± 43.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 27.6 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 139 µs ± 844 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 135 µs ± 476 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 147 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 145 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 774 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 201 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# argsort
# 277 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 271 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# all
# 37.9 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 31 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# any
# 37.5 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 30.2 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# absolute
# 217 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 216 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# len
# 121 ns ± 0.38 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 1.23 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# bool
# -----------------
# real
# 726 ns ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.5 ns ± 0.0926 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 1.55 µs ± 2.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.7 ns ± 0.123 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 4.16 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 125 ns ± 0.339 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# sum
# 24.2 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 19.3 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 48.2 µs ± 428 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 41.2 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 29.2 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 25.3 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 53.7 µs ± 83.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 46.6 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 9.37 µs ± 93 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 5.81 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# amin
# 9.16 µs ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 5.75 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmax
# 2.93 µs ± 8.85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 589 ns ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# argmin
# 3.07 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 622 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# mean
# 33.5 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 29.1 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 111 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 105 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 117 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 113 µs ± 409 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 157 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 105 µs ± 433 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argsort
# 115 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 112 µs ± 925 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# all
# 8.26 µs ± 9.85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 3.86 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# any
# 8.49 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 4 µs ± 30.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# absolute
# 1.52 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 1.72 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# len
# 122 ns ± 0.24 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 0.279 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

NumPy functions often delegate the action to the corresponding method, if it exists. But first they must check that the argument is actually an array, and so on. ufuncs also have some extra 'baggage' that handles parameters like out and where. These are fixed per-call costs, so the time differences don't (necessarily) scale with array size.
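A quick way to see that the overhead is a fixed per-call cost is to time both at two very different sizes (a sketch for an IPython session; timings are machine-dependent):

import numpy as np

small = np.random.rand(100)
big = np.random.rand(1_000_000)

# If the wrapper overhead is fixed, the function-vs-method gap should stay
# roughly constant in absolute terms while the total runtime grows with N:
%timeit np.max(small)
%timeit small.max()
%timeit np.max(big)
%timeit big.max()

# np.conjugate really is a ufunc, carrying the out=/where= machinery:
print(isinstance(np.conjugate, np.ufunc))   # True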
In [400]: a = np.random.rand(10000)
Comparing conjugate:
In [404]: timeit np.conjugate(a)
10 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [405]: timeit a.conjugate()
94.2 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
That ns-scale time suggests that the method is taking some sort of shortcut (I'll explore that later).
The max time difference isn't as significant, which I attribute to the function-call overhead:
In [406]: timeit np.max(a)
13.2 µs ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [407]: timeit a.max()
9.46 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But let's test with a complex array, where conjugation isn't trivial:
In [408]: ac = a+1j*a
Now the method and the function take the same time:
In [409]: timeit np.conjugate(ac)
18.2 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [410]: timeit ac.conjugate()
18.3 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The real attribute is still much faster. Looking at the Python code for np.real, I think the time difference is just due to the function wrapper.
In [411]: timeit np.real(ac)
743 ns ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [413]: timeit ac.real
129 ns ± 4.93 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The conjugate method for a float array just returns a view (or maybe the array itself). That accounts for its speed:
In [418]: a.__array_interface__['data']
Out[418]: (84672384, False)
In [419]: a.conjugate().__array_interface__['data']
Out[419]: (84672384, False)
In [420]: ac.__array_interface__['data']
Out[420]: (84992432, False)
In [421]: ac.conjugate().__array_interface__['data']
Out[421]: (85165216, False)
It's the array itself:
In [422]: id(a)
Out[422]: 140673862490512
In [423]: id(a.conjugate())
Out[423]: 140673862490512
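The same conclusion can be reached with np.shares_memory, without comparing raw data pointers (a sketch; whether the method returns the identical object may depend on the NumPy version):

import numpy as np

a = np.random.rand(10000)
ac = a + 1j * a

print(a.conjugate() is a)                    # True here: the array itself
print(np.shares_memory(a, a.conjugate()))    # True: no data is copied

print(ac.conjugate() is ac)                  # False: complex needs new data
print(np.shares_memory(ac, ac.conjugate()))  # False: a fresh buffer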
np.real code:
def real(val):
    try:
        return val.real
    except AttributeError:
        return asanyarray(val).real
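The try/except fallback is what lets np.real accept inputs that lack a .real attribute, such as plain lists; those pay the array-conversion cost first (a small illustration):

import numpy as np

ac = np.array([1 + 2j, 3 + 4j])
print(np.real(ac))                 # [1. 3.] -- fast attribute path, a view

# A list has no .real attribute, so np.real falls back to
# asanyarray(val).real and converts to an array first:
print(np.real([1 + 2j, 3 + 4j]))   # [1. 3.]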

Related

Why is pandas.Series.tolist() faster than pandas.Series.iat[]?

For example, take the following Series object:
mySeries = pd.Series(range(0, 20, 2), index=range(1, 11), name='col')
What is the proper way to access a single element?
I would say mySeries.iat[5] or mySeries.at[5], depending on whether we use position or label.
But I found that mySeries.tolist()[5] is 3 or 4 times faster than mySeries.iat[5], which is in turn faster than mySeries.at[5] ("loc" and "iloc" are even worse).
It surprises me. What is the advantage of "iat" and "at"?
The test uses a short list from a small Series, so converting to a list and then indexing it is really fast:
mySeries = pd.Series(range(0, 20, 2), index=range(1, 11), name='col')
%timeit mySeries.iat[5]
3.61 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
5.11 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
1.58 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit mySeries.tolist()[5]
1.63 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
With 1M values it is slow, because the bottleneck is the conversion to a list:
mySeries = pd.Series(range(0, 2000000, 2), name='col')
%timeit mySeries.iat[5]
3.46 µs ± 72.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
4.74 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
40.2 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mySeries.tolist()[5]
40.3 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
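So if many positional lookups are needed, the conversion cost can be paid once and amortized (a sketch):

import pandas as pd

mySeries = pd.Series(range(0, 2000000, 2), name='col')

values = mySeries.tolist()   # one-time O(n) conversion
x = values[5]                # later accesses are plain C-level list indexing

The advantage of iat/at is that they skip the conversion entirely, which matters when the Series is large and only a few elements are needed.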

Conjugating a complex number is much faster if the number has the Python-native complex type

Conjugating a complex number appears to be about 30 times faster if the type() of the number is complex rather than numpy.complex128; see the minimal example below. However, taking the absolute value takes about the same time, and taking the real and the imaginary parts is only about 3 times faster.
Why is the conjugate so much slower? When I take a single element a from a large complex-valued array, it seems I should cast it to complex first (the conjugation is part of a larger code with many (> 10^6) iterations).
import numpy as np
np.random.seed(100)
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]
b = complex(a)
%timeit a.conjugate() # 2.95 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a.conj() # 2.86 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b.conjugate() # 82.8 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(a) # 112 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(b) # 99.6 ns ± 0.623 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.real # 145 ns ± 0.259 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.real # 54.8 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.imag # 144 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.imag # 55.4 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Calling NumPy routines always comes at a fixed cost, which in this case is higher than the cost of the Python-native routine.
As soon as you start processing more than one number (possibly millions) at once, NumPy will be much faster:
import numpy as np
N = 10
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate() # 481 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit [x.conjugate() for x in b] # 605 ns ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
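To see that crossover, scale N up (a sketch for IPython; only the direction of the result is claimed here, not specific timings):

import numpy as np

N = 1_000_000
a = np.random.rand(N) + 1j * np.random.rand(N)
b = [complex(x) for x in a]

%timeit a.conjugate()                # one C loop over a contiguous buffer
%timeit [x.conjugate() for x in b]   # ~N Python-level method calls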

What's the fastest way to select all rows in one pandas dataframe that do not exist in another?

Beginning with two pandas dataframes of different shapes, what is the fastest way to select all rows in one dataframe that do not exist in the other (or to drop all rows in one dataframe that already exist in the other)? And are the fastest methods different for string-valued columns vs. numeric columns? The operation should be roughly equivalent to the code below.
import pandas as pd
string_df1 = pd.DataFrame({'latin': ['a', 'b', 'c'],
                           'greek': ['alpha', 'beta', 'gamma']})
string_df2 = pd.DataFrame({'latin': ['z', 'c'],
                           'greek': ['omega', 'gamma']})
numeric_df1 = pd.DataFrame({'A': [1, 2, 3],
                            'B': [1.01, 2.02, 3.03]})
numeric_df2 = pd.DataFrame({'A': [3, 9],
                            'B': [3.03, 9.09]})
def index_matching_rows(df1, df2, cols_to_match=None):
    '''
    return index of subset of rows of df1 that are equal to at least one row in df2
    '''
    if cols_to_match is None:
        cols_to_match = df1.columns
    df1 = df1.reset_index()
    m = df1.merge(df2, on=cols_to_match[0], suffixes=('1', '2'))
    query = '&'.join(['{0}1 == {0}2'.format(str(c)) for c in cols_to_match[1:]])
    m = m.query(query)
    return m['index']
print(string_df2.drop(index_matching_rows(string_df2, string_df1)))
print(numeric_df2.drop(index_matching_rows(numeric_df2, numeric_df1)))
output
latin greek
0 z omega
A B
1 9 9.09
Some naive performance testing:
copies = 10
big_sdf1 = pd.concat([string_df1, string_df1]*copies)
big_sdf2 = pd.concat([string_df2, string_df2]*copies)
big_ndf1 = pd.concat([numeric_df1, numeric_df1]*copies)
big_ndf2 = pd.concat([numeric_df2, numeric_df2]*copies)
%%timeit
big_sdf2.drop(index_matching_rows(big_sdf2, big_sdf1))
# copies = 10: 2.61 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.44 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.4 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 74.6 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 100: 19.2 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf2.drop(index_matching_rows(big_ndf2, big_ndf1))
# copies = 10: 2.56 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 20: 4.38 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 30: 18.3 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 40: 76.5 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This code runs about as quickly for strings as for numeric data, and it appears to be exponential in the length of the dataframe (the curve above is 1.6*exp(0.094x), fit to the string data). I'm working with dataframes on the order of 1e5 rows, so this is not a solution for me.
Here's the same performance check for Raymond Kwok's (accepted) answer below in case anyone can beat it later. It's O(n).
%%timeit
big_sdf1_tuples = big_sdf1.apply(tuple, axis=1)
big_sdf2_tuples = big_sdf2.apply(tuple, axis=1)
big_sdf2_tuples.isin(big_sdf1_tuples)
# copies = 100: 4.82 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 44.6 ms ± 386 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 450 ms ± 9.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.42 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
big_ndf2_tuples = big_ndf2.apply(tuple, axis=1)
big_ndf2_tuples.isin(big_ndf1_tuples)
# copies = 100: 4.98 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1000: 47 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e4: 461 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# copies = 1e5: 4.58 s ± 30.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Indexing into the longest dataframe with
big_sdf2_tuples.loc[~big_sdf2_tuples.isin(big_sdf1_tuples)]
to recover the equivalent of the output in my code above adds about 10 ms.
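For reference, here is a minimal end-to-end version of the tuple-based anti-join on the toy frames from the question (a sketch; it assumes string_df1/string_df2 as defined above):

sdf1_tuples = string_df1.apply(tuple, axis=1)
sdf2_tuples = string_df2.apply(tuple, axis=1)
print(string_df2.loc[~sdf2_tuples.isin(sdf1_tuples)])
#   latin  greek
# 0     z  omega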
Beginning with 2 dataframes:
df1 = pd.DataFrame({'Runner': ['A', 'A', 'A', 'A'],
                    'Day': ['1', '3', '8', '9'],
                    'Miles': ['3', '4', '4', '2']})
df2 = df1.copy().drop([1, 3])
where the second has two rows fewer.
We can hash the rows:
df1_hashed = df1.apply(tuple, axis=1).apply(hash)
df2_hashed = df2.apply(tuple, axis=1).apply(hash)
and trust, as most people will, that two different rows are extremely unlikely to get the same hash value,
and get rows from df1 that do not exist in df2:
df1[~df1_hashed.isin(df2_hashed)]
Runner Day Miles
1 A 3 4
3 A 9 2
As for the speed difference between strings and integers, I am sure you can test it with your real data.
Note 1: you may actually remove .apply(hash) from both lines.
Note 2: check the answer of this question out for more on isin and the use of hash.
pandas has a built-in hashing utility that's more than an order of magnitude faster than series of tuples:
%%timeit
big_sdf1_hashed = pd.util.hash_pandas_object(big_sdf1)
big_sdf2_hashed = pd.util.hash_pandas_object(big_sdf2)
big_sdf1.loc[~big_sdf1_hashed.isin(big_sdf2_hashed)]
# copies = 100: 1.05 ms ± 9.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 1.99 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e4: 10.5 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 126 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# copies = 1e7: 14.1 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
big_ndf2_hashed = pd.util.hash_pandas_object(big_ndf2)
big_ndf1.loc[~big_ndf1_hashed.isin(big_ndf2_hashed)]
# copies = 100: 496 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1000: 772 µs ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# copies = 1e4: 3.88 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# copies = 1e5: 67.5 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And note that the difference in performance comes from creating the objects to be compared rather than searching series of different objects. For copies = int(1e5):
%%timeit
big_ndf1_hashed = pd.util.hash_pandas_object(big_ndf1)
# 25 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
big_ndf1_tuples = big_ndf1.apply(tuple, axis=1)
# 2.53 s ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The hashed series is also about three times smaller on disk than the tuple series (9 MB vs. 33 MB).
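For completeness, a minimal end-to-end version of the hashed anti-join (a sketch; note the index=False, so that only row contents are hashed and differing indices don't prevent matches):

import pandas as pd

df1 = pd.DataFrame({'latin': ['a', 'b', 'c'],
                    'greek': ['alpha', 'beta', 'gamma']})
df2 = pd.DataFrame({'latin': ['z', 'c'],
                    'greek': ['omega', 'gamma']})

h1 = pd.util.hash_pandas_object(df1, index=False)
h2 = pd.util.hash_pandas_object(df2, index=False)

print(df1.loc[~h1.isin(h2)])   # rows of df1 that do not appear in df2
#   latin greek
# 0     a alpha
# 1     b  beta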

numpy: applying a function over a list, optimization

I have these two pieces of code that do the same thing, but for different data structures:
res = np.array([np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])], dtype=int)
%timeit np.sum(res, axis=1)
4.08 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
list_obj_array = np.ndarray((2,), dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]
v_func = np.vectorize(np.sum, otypes=[int])
%timeit v_func(list_obj_array)
20.6 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second one is 5 times slower; is there a better way to optimize this?
import numba as nb

@nb.jit()
def nb_np_sum(arry_list):
    return [np.sum(row) for row in arry_list]
%timeit nb_np_sum(list_obj_array)
30.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@nb.jit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
%timeit nb_sum(list_obj_array)
13.6 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Best so far (thanks @hpaulj):
%timeit [sum(l) for l in list_obj_array]
850 ns ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@nb.njit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>
File "<ipython-input-54-3bb48c5273bb>", line 3:
def nb_sum(arry_list):
return [sum(row) for row in arry_list]
For a longer array:
list_obj_array = np.ndarray((n,), dtype=object)
for i in range(n):
    list_obj_array[i] = list(range(7))
the vectorized version comes closer to the best option (the list comprehension):
%timeit [sum(l) for l in list_obj_array]
23.4 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit v_func(list_obj_array)
29.6 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numba is still slower:
%timeit nb_sum(list_obj_array)
74.4 µs ± 6.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Since you used otypes you read enough of the vectorize docs to know that it is not a performance tool.
In [430]: timeit v_func(list_obj_array)
38.3 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A list comprehension is faster:
In [431]: timeit [sum(l) for l in list_obj_array]
2.08 µs ± 62.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even better if you start with a list of lists instead of an object-dtype array:
In [432]: alist = list_obj_array.tolist()
In [433]: timeit [sum(l) for l in alist]
542 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
edit
np.frompyfunc is faster than np.vectorize, especially when working with object dtype arrays:
In [459]: np.frompyfunc(sum,1,1)(list_obj_array)
Out[459]: array([12.0, 30.0], dtype=object)
In [460]: timeit np.frompyfunc(sum,1,1)(list_obj_array)
2.22 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As I've seen elsewhere, frompyfunc is competitive with the list comprehension.
Interestingly, using np.sum instead of sum slows it down. I think that's because np.sum applied to a list has the overhead of converting the list to an array, while sum applied to a list of numbers is quite fast, running in Python's own compiled code.
In [461]: timeit np.frompyfunc(np.sum,1,1)(list_obj_array)
30.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
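The conversion overhead is easy to isolate on a bare list (a sketch for IPython):

import numpy as np

nums = [2.0, 4.0, 6.0]

%timeit np.sum(nums)   # pays the list -> ndarray conversion on every call
%timeit sum(nums)      # iterates over the existing Python floats directly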
So let's try sum in your vectorize:
In [462]: v_func = np.vectorize(sum, otypes=[int])
In [463]: timeit v_func(list_obj_array)
8.7 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much better.

multiIndex slicing performance issue

Define a pandas DataFrame as below:
import numpy as np
import pandas as pd
n=1000
x=np.repeat(range(n),n)
y=np.tile(range(n),n)
z=np.random.random(n*n)
df=pd.DataFrame({'x':x,'y':y,'z':z})
df=df.set_index(['x','y']).sort_index()
idx=pd.IndexSlice
then run some index-slicing timings:
%timeit -n100 df.loc[idx[1],:]
%timeit -n100 df.loc[idx[1,1],:]
%timeit -n100 df.loc[idx[1:10],:]
%timeit -n100 df.loc[idx[1:10,1],:]
gives
361 µs ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
164 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
165 µs ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.35 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, df.loc[idx[1:10,1],:] takes much more time, which looks like a performance bug. What is wrong here?
On the other hand, the pandas index is said to be hash-based, yet indexing is far slower than a dict.
Let's prepare a somewhat equivalent dict
d = {i: {k: k for k in range(n)} for i in range(n)}
and run similar timings:
%timeit -n100 d[1]
%timeit -n100 d[1][1]
%timeit -n100 [d[i] for i in range(10)]
%timeit -n100 [d[i][1] for i in range(10)]
gives
36.3 ns ± 3.68 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
52.7 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
811 ns ± 7.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.02 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wow, 1000 times faster than pandas indexing! Why is pandas index slicing so slow?
