multiIndex slicing performance issue


Define a pandas DataFrame as follows:
import numpy as np
import pandas as pd
n=1000
x=np.repeat(range(n),n)
y=np.tile(range(n),n)
z=np.random.random(n*n)
df=pd.DataFrame({'x':x,'y':y,'z':z})
df=df.set_index(['x','y']).sort_index()
idx=pd.IndexSlice
Then time a few index slices:
%timeit -n100 df.loc[idx[1],:]
%timeit -n100 df.loc[idx[1,1],:]
%timeit -n100 df.loc[idx[1:10],:]
%timeit -n100 df.loc[idx[1:10,1],:]
This gives:
361 µs ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
164 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
165 µs ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.35 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, df.loc[idx[1:10,1],:] takes far longer than the other slices (3.35 ms versus 164 to 361 µs), which looks like a performance bug. What is wrong here?
Also, although the pandas index is said to be hashed, indexing is far slower than with a dict.
Let's prepare a roughly equivalent dict:
d={i:{k:k for k in range(n)} for i in range(n)}
and run similar timings:
%timeit -n100 d[1]
%timeit -n100 d[1][1]
%timeit -n100 [d[i] for i in range(10)]
%timeit -n100 [d[i][1] for i in range(10)]
This gives:
36.3 ns ± 3.68 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
52.7 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
811 ns ± 7.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.02 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wow, more than 1000 times faster than pandas indexing! Why is pandas index slicing so slow?
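To make explicit what the slowest slice selects, here is a minimal sketch. It assumes a smaller n than in the question, and it uses a boolean mask built with get_level_values purely as an illustrative cross-check, not as a recommended fix. It shows that idx[1:10, 1] picks the rows with x from 1 to 10 inclusive and y equal to 1. It makes no claim about which form is faster on a given pandas version.

```python
import numpy as np
import pandas as pd

n = 50  # assumption: smaller than the question's n=1000 to keep the sketch quick
x = np.repeat(range(n), n)
y = np.tile(range(n), n)
z = np.random.random(n * n)
df = pd.DataFrame({'x': x, 'y': y, 'z': z}).set_index(['x', 'y']).sort_index()
idx = pd.IndexSlice

# Label slices are inclusive on both ends: x in 1..10, y == 1.
sliced = df.loc[idx[1:10, 1], :]

# Same rows selected with a boolean mask over the index levels.
lx = df.index.get_level_values('x')
ly = df.index.get_level_values('y')
masked = df[(lx >= 1) & (lx <= 10) & (ly == 1)]

print(len(sliced))            # 10 rows, one per x value
print(sliced.equals(masked))  # True
```

Timing both forms with %timeit on your own pandas version is the only reliable way to compare them.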

Related

Why pandas.Series.tolist() is faster than pandas.Series.iat[]?

For example, take the following Series object:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
What is the proper way to access a single value?
I would say mySeries.iat[5] or mySeries.at[5], depending on whether we use the position or the index label.
But I found that mySeries.tolist()[5] is 3 or 4 times faster than mySeries.iat[5], which in turn is faster than mySeries.at[5]. ("loc" and "iloc" are even worse.)
This surprises me. What is the advantage of "iat" and "at"?

Because the test uses a short list from a small Series, converting to a list and indexing is really fast:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
%timeit mySeries.iat[5]
3.61 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
5.11 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
1.58 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit mySeries.tolist()[5]
1.63 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
With 1M values it is slow, because the bottleneck is the conversion to a list:
mySeries = pd.Series( range(0,2000000,2), name='col')
%timeit mySeries.iat[5]
3.46 µs ± 72.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
4.74 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
40.2 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mySeries.tolist()[5]
40.3 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
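The comparison above can be reproduced outside IPython with a small sketch like the following. The helper lookup_times is hypothetical (not part of pandas) and the lengths are arbitrary. It also shows that iat works by position while at works by label, so the two return different elements for this Series.

```python
import timeit
import pandas as pd

s = pd.Series(range(0, 20, 2), index=range(1, 11), name='col')

by_position = s.iat[5]     # 6th element by position
by_label = s.at[5]         # element whose index label is 5
via_list = s.tolist()[5]   # builds a 10-element list, then indexes it

print(by_position, by_label, via_list)  # 10 8 10

# Hypothetical helper: average time of one scalar lookup for a given length.
def lookup_times(length, number=20):
    big = pd.Series(range(length))
    t_iat = timeit.timeit(lambda: big.iat[5], number=number) / number
    t_list = timeit.timeit(lambda: big.tolist()[5], number=number) / number
    return t_iat, t_list

for length in (10, 100_000):
    print(length, lookup_times(length))
```

The iat time should stay roughly flat as the length grows, while the tolist time grows with the number of elements that have to be converted.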

Conjugating a complex number much faster if number has python-native complex type

Conjugating a complex number appears to be about 30 times faster if the type() of the complex number is complex rather than numpy.complex128, see the minimal example below. However, the absolute value takes about the same time. Taking the real and the imaginary part is only about 3 times faster.
Why is the conjugate slower by that much? When I take a from a large complex-valued array, it seems I should cast it to complex first (the complex conjugation is part of a larger code which has many (> 10^6) iterations).
import numpy as np
np.random.seed(100)
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]
b = complex(a)
%timeit a.conjugate() # 2.95 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a.conj() # 2.86 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b.conjugate() # 82.8 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(a) # 112 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(b) # 99.6 ns ± 0.623 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.real # 145 ns ± 0.259 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.real # 54.8 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.imag # 144 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.imag # 55.4 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Calling NumPy routines always comes at a fixed cost, which in this case is higher than the cost of the Python-native routine.
As soon as you start processing more than one number (possibly millions) at once NumPy will be much faster:
import numpy as np
N = 10
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate() # 481 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit [x.conjugate() for x in b] # 605 ns ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
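If the surrounding code really has to work on one number at a time, one option is to convert the array once with tolist() rather than casting each element inside the loop. This is a sketch under that assumption; the array size and seed are arbitrary. It shows the scalar types involved and checks that both routes give the same values.

```python
import numpy as np

rng = np.random.default_rng(100)
a = rng.random(5) + 1j * rng.random(5)

# Indexing the array yields NumPy scalars; tolist() yields built-in complex objects.
print(type(a[0]).__name__)           # complex128
print(type(a.tolist()[0]).__name__)  # complex

# Option 1: one vectorized call over the whole array.
vectorized = a.conjugate()

# Option 2: convert once, then loop over native Python complex numbers.
looped = [x.conjugate() for x in a.tolist()]

print(np.array_equal(vectorized, np.array(looped)))  # True
```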

numpy, applying function over list optimization

I have these two pieces of code that do the same thing, but for different data structures:
res = np.array([np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])], dtype=int)
%timeit np.sum(res, axis=1)
4.08 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# np.object and np.int were removed in NumPy 1.24; the builtins behave the same here
list_obj_array = np.ndarray((2,), dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]
v_func = np.vectorize(np.sum, otypes=[int])
%timeit v_func(list_obj_array)
20.6 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second one is 5 times slower. Is there a better way to optimize this?
import numba as nb
@nb.jit()
def nb_np_sum(arry_list):
    return [np.sum(row) for row in arry_list]
%timeit nb_np_sum(list_obj_array)
30.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@nb.jit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
%timeit nb_sum(list_obj_array)
13.6 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Best so far (thanks @hpaulj):
%timeit [sum(l) for l in list_obj_array]
850 ns ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@nb.njit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>
File "<ipython-input-54-3bb48c5273bb>", line 3:
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
For a longer array:
list_obj_array = np.ndarray((n,), dtype=object)
for i in range(n):
    list_obj_array[i] = list(range(7))
the vectorized version comes closer to the best option (list comprehension):
%timeit [sum(l) for l in list_obj_array]
23.4 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit v_func(list_obj_array)
29.6 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
numba is still slower:
%timeit nb_sum(list_obj_array)
74.4 µs ± 6.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Since you used otypes you read enough of the vectorize docs to know that it is not a performance tool.
In [430]: timeit v_func(list_obj_array)
38.3 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A list comprehension is faster:
In [431]: timeit [sum(l) for l in list_obj_array]
2.08 µs ± 62.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even better if you start with a list of lists instead of an object dtype array:
In [432]: alist = list_obj_array.tolist()
In [433]: timeit [sum(l) for l in alist]
542 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
edit
np.frompyfunc is faster than np.vectorize, especially when working with object dtype arrays:
In [459]: np.frompyfunc(sum,1,1)(list_obj_array)
Out[459]: array([12.0, 30.0], dtype=object)
In [460]: timeit np.frompyfunc(sum,1,1)(list_obj_array)
2.22 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As I've seen elsewhere frompyfunc is competitive with the list comprehension.
Interestingly, using np.sum instead of sum slows it down. I think that's because np.sum applied to lists has the overhead of converting the lists to arrays. sum applied to lists of numbers is pretty good, using python's own compiled code.
In [461]: timeit np.frompyfunc(np.sum,1,1)(list_obj_array)
30.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So let's try sum in your vectorize:
In [462]: v_func = np.vectorize(sum, otypes=[int])
In [463]: timeit v_func(list_obj_array)
8.7 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much better.
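The three approaches discussed here can be compared for correctness with a short sketch. It assumes all rows have the same length, which is what allows the last variant (a regular 2-D array summed along axis 1, as in the question's first snippet) to work. Timings are deliberately left out since they depend on the machine and NumPy version.

```python
import numpy as np

# Object-dtype array holding two Python lists, as in the question.
list_obj_array = np.empty(2, dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]

by_comprehension = [sum(row) for row in list_obj_array]
by_frompyfunc = np.frompyfunc(sum, 1, 1)(list_obj_array)

# Only valid when all rows have equal length: build a real 2-D array once.
as_2d = np.array(list_obj_array.tolist())
by_axis_sum = as_2d.sum(axis=1)

print(by_comprehension)        # [12.0, 30.0]
print(by_frompyfunc.tolist())  # [12.0, 30.0]
print(by_axis_sum.tolist())    # [12.0, 30.0]
```

Note that frompyfunc returns an object dtype array, so a conversion such as astype(float) may be needed before further numeric work.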

Python numpy methods/attributes faster than numpy functions?

I recently noticed that some numpy array attributes/methods seem to be significantly faster than the corresponding numpy functions. Example for np.conj(x) vs. x.conjugate():
import numpy as np
import time
np.random.seed(100)
t0_1 = 0
t0_2 = 0
for i in range(1000):
    a = np.random.rand(10000)
    t0 = time.time()
    b = np.conjugate(a)
    t0_1 += time.time() - t0; t0 = time.time()
    c = a.conjugate()
    t0_2 += time.time() - t0; t0 = time.time()
print(t0_1, t0_2)
# example output times: 0.01222848892211914 0.0008714199066162109
Even without proper benchmarks, it looks like there is a performance gain of more than a factor of 10. Similarly, it seems that also x.real, x.imag, x.max() and other basic methods are faster than the corresponding functions np.real(x), np.imag(x), np.max(x) etc.
Can somebody explain to me where the time saving comes from? Does it have to do with in-place operations vs. new array creation? Are there certain checks that the numpy functions do which are skipped for the array methods? Thank you in advance!
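For a more stable measurement than accumulating time.time() differences, the same comparison can be sketched with the standard library timeit module. The helper best_time and its repeat/number defaults are assumptions chosen for illustration; the absolute numbers will differ between machines and NumPy versions.

```python
import timeit
import numpy as np

a = np.random.rand(10000)

# Hypothetical helper: best-of-`repeat` time per call, in seconds.
def best_time(func, repeat=5, number=1000):
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number

t_function = best_time(lambda: np.conjugate(a))
t_method = best_time(lambda: a.conjugate())

print(f"np.conjugate(a): {t_function * 1e6:.3f} µs per call")
print(f"a.conjugate():   {t_method * 1e6:.3f} µs per call")
```

Taking the minimum over several repeats reduces the influence of other processes on the result.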
Update: Below is a simple comparison of computation times for several common numpy functions/methods, for float, complex and boolean arrays. The largest speed gain factors of methods over functions (float/complex/bool) appear to be for a.real (12/15/12), a.imag (70/15/26) and a.conj (80/1/33), as explained by the post of @hpaulj (imag and conj are not useful for real arrays though), and for a.sort (5/5/1.5) (my guess is that this is due to in-place operations), a.max/a.min (1.6 for bool) (again, max and min are not useful for bool arrays). Other speed gains are typically between 1.1 and 1.4. For a.argsort, a.std and a.__len__, the factors are often around 1, for a.__abs__ even below 1.
So it looks like except for a.real, a.imag and a.sort, the speed gains are often not too large, say 1.2. However, this may depend on array sizes, whether the array is (partially) sorted or not, etc.
import numpy as np
from IPython import get_ipython
ipython = get_ipython()
np.random.seed(1000)
asize = 10000
dtype_list = ['float', 'complex', 'bool']
for i in range(3):
    print(dtype_list[i])
    print('-----------------')
    if i == 0:
        a = np.random.rand(asize)
    elif i == 1:
        a = np.random.rand(asize) + 1j*np.random.rand(asize)
    elif i == 2:
        a = np.random.randint(2,size=asize).astype(bool)
    function_list = [np.real, np.imag, np.conj, np.sum, np.cumsum, np.prod, np.cumprod,
                     np.max, np.min, np.argmax, np.argmin, np.mean, np.var, np.std,
                     np.sort, np.argsort, np.all, np.any, np.abs, len]
    methatt_list = [a.real, a.imag, a.conj, a.sum, a.cumsum, a.prod, a.cumprod,
                    a.max, a.min, a.argmax, a.argmin, a.mean, a.var, a.std,
                    a.sort, a.argsort, a.all, a.any, a.__abs__, a.__len__]
    for j in range(len(function_list)):
        print(function_list[j].__name__)
        ipython.magic('timeit function_list[j](a)')
        if callable(methatt_list[j]):
            ipython.magic('timeit methatt_list[j]()')
        else:
            ipython.magic('timeit methatt_list[j]')
        print('')
# float
# -----------------
# real
# 740 ns ± 13.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.7 ns ± 0.226 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 4.45 µs ± 36.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 60.9 ns ± 0.353 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 9.64 µs ± 40.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 124 ns ± 0.238 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# sum
# 15.8 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 11.8 µs ± 82.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 42.4 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37.7 µs ± 38.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 32.7 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 29 µs ± 57.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 51.5 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 47.1 µs ± 154 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 14.5 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 10.7 µs ± 61.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# amin
# 14.6 µs ± 90.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 10.7 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmax
# 11.1 µs ± 15.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 8.62 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmin
# 11.5 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 8.76 µs ± 37 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# mean
# 23.5 µs ± 440 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 19.6 µs ± 569 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 78.6 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 73.3 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 86.7 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 81.9 µs ± 663 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 659 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 141 µs ± 682 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argsort
# 156 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 151 µs ± 704 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# all
# 23.4 µs ± 41.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 17.7 µs ± 17.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# any
# 23.4 µs ± 72.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 17.3 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# absolute
# 7.1 µs ± 12.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 7.25 µs ± 20.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# len
# 125 ns ± 0.17 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# complex
# -----------------
# real
# 920 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 61.1 ns ± 0.0517 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 898 ns ± 0.792 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 61.3 ns ± 0.178 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 18.1 µs ± 45.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 18.6 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# sum
# 24 µs ± 40 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 18.7 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 44.8 µs ± 80.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 39.4 µs ± 135 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 99.6 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 95.4 µs ± 108 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 94.9 µs ± 245 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 89.7 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 41.3 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amin
# 41.7 µs ± 65.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 37.1 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argmax
# 27.4 µs ± 47.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 24.5 µs ± 77.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argmin
# 28.8 µs ± 28.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 25.5 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# mean
# 32.2 µs ± 43.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 27.6 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 139 µs ± 844 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 135 µs ± 476 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 147 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 145 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 774 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 201 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# argsort
# 277 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 271 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# all
# 37.9 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 31 µs ± 252 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# any
# 37.5 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 30.2 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# absolute
# 217 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 216 µs ± 272 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# len
# 121 ns ± 0.38 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 1.23 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# bool
# -----------------
# real
# 726 ns ± 4.61 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.5 ns ± 0.0926 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# imag
# 1.55 µs ± 2.44 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 60.7 ns ± 0.123 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# conjugate
# 4.16 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 125 ns ± 0.339 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# sum
# 24.2 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 19.3 µs ± 82.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cumsum
# 48.2 µs ± 428 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 41.2 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# prod
# 29.2 µs ± 73.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 25.3 µs ± 146 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# cumprod
# 53.7 µs ± 83.5 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 46.6 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# amax
# 9.37 µs ± 93 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 5.81 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# amin
# 9.16 µs ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 5.75 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# argmax
# 2.93 µs ± 8.85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 589 ns ± 5.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# argmin
# 3.07 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 622 ns ± 4.37 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# mean
# 33.5 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 29.1 µs ± 286 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# var
# 111 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 105 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# std
# 117 µs ± 112 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 113 µs ± 409 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# sort
# 157 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 105 µs ± 433 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# argsort
# 115 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 112 µs ± 925 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# all
# 8.26 µs ± 9.85 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 3.86 µs ± 11.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# any
# 8.49 µs ± 23 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# 4 µs ± 30.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# absolute
# 1.52 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# 1.72 µs ± 2.95 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# len
# 122 ns ± 0.24 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# 117 ns ± 0.279 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

numpy functions often delegate the action to a method, if it exists. But they must also check that the argument is an array, and so on. ufuncs also have some extra 'baggage' that handles parameters like out, where. So time differences don't (necessarily) scale with array size.
In [400]: a = np.random.rand(10000)
Comparing conjugate:
In [404]: timeit np.conjugate(a)
10 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [405]: timeit a.conjugate()
94.2 ns ± 1.42 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
That ns time suggests that the method is taking some sort of shortcut. (I'll explore that later)
The time difference for max isn't as significant, and I attribute it to the function overhead:
In [406]: timeit np.max(a)
13.2 µs ± 16.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [407]: timeit a.max()
9.46 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But let's test with a complex array, where conjugate isn't trivial
In [408]: ac = a+1j*a
Now the method and function time the same:
In [409]: timeit np.conjugate(ac)
18.2 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [410]: timeit ac.conjugate()
18.3 µs ± 10.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The real attribute is still much faster. Looking at the python code for np.real I think the time difference is just due to the function wrapper.
In [411]: timeit np.real(ac)
743 ns ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [413]: timeit ac.real
129 ns ± 4.93 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
The conjugate method for a float array just returns a view (or maybe the array itself). That accounts for its speed:
In [418]: a.__array_interface__['data']
Out[418]: (84672384, False)
In [419]: a.conjugate().__array_interface__['data']
Out[419]: (84672384, False)
In [420]: ac.__array_interface__['data']
Out[420]: (84992432, False)
In [421]: ac.conjugate().__array_interface__['data']
Out[421]: (85165216, False)
It's the array itself:
In [422]: id(a)
Out[422]: 140673862490512
In [423]: id(a.conjugate())
Out[423]: 140673862490512
np.real code:
def real(val):
    try:
        return val.real
    except AttributeError:
        return asanyarray(val).real
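The buffer-address and id checks above can also be expressed with np.shares_memory, which avoids comparing raw addresses by hand. This is a sketch of that check; the behavior shown is what the answer observed, and it is assumed here to hold for the NumPy version in use.

```python
import numpy as np

a = np.random.rand(10000)   # real-valued
ac = a + 1j * a             # complex-valued

# Method on a real array: no new buffer is allocated.
print(np.shares_memory(a, a.conjugate()))      # True
# ufunc on the same array: a fresh output array is produced.
print(np.shares_memory(a, np.conjugate(a)))    # False
# Complex input: the method has real work to do and allocates a result.
print(np.shares_memory(ac, ac.conjugate()))    # False
# .real on a complex array is a view onto the original buffer.
print(np.shares_memory(ac, ac.real))           # True
```

This is consistent with the timings: the nanosecond-scale cases are exactly the ones where no new data is written.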

Why use reset_index(drop=True) when setting the index is much faster?

Why would I use reset_index(drop=True), when the alternative is much faster? I am sure there is something I am missing. (Or my timings are bad somehow...)
import pandas as pd
l = pd.Series(range(int(1e7)))
%timeit l.reset_index(drop=True)
# 35.9 ms +- 1.29 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
# 13 us +- 455 ns per loop (mean +- std. dev. of 7 runs, 100000 loops each)

The costly operation in resetting the index is not creating the new index (as you showed, that is super fast) but returning a copy of the series. If you compare:
%timeit l.reset_index(drop=True)
22.6 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
14.7 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit l.reset_index(inplace=True, drop=True)
13.7 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can see that the inplace operation (where no copy is returned) is more or less as fast as your method. However, performing inplace operations is generally discouraged.
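The semantic difference between the two approaches can be seen in a small sketch (the Series values and labels are arbitrary): reset_index(drop=True) hands back a new object and leaves the original untouched, while assigning to .index modifies the existing Series.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[5, 6, 7])

# reset_index(drop=True) returns a new Series; the original keeps its index.
r = s.reset_index(drop=True)
print(r is s)            # False
print(list(s.index))     # [5, 6, 7]
print(list(r.index))     # [0, 1, 2]

# Assigning to .index changes the existing object in place.
s.index = range(len(s))
print(list(s.index))     # [0, 1, 2]
```

So the slower call is the one to use when other code still holds a reference to the original Series and must not see its index change.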
