Attribute access time very long with DataFrame - python

I would like to understand why DataFrame attribute access time seems so long (often 100x slower than on other objects). An example:
In [37]: df=pd.DataFrame([])
In [38]: a=np.array([])
In [39]: %timeit df.size
28 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit a.size
136 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
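Note that pandas implements DataFrame.size as a Python-level property computed from the frame's shape on each access, whereas ndarray.size is a plain C-level attribute, which accounts for much of the gap. If the overhead matters, read the attribute once outside any hot loop. A minimal sketch (the loop is hypothetical, purely for illustration):
import pandas as pd
df = pd.DataFrame([])
def sum_sizes_slow(n=10_000):
    return sum(df.size for _ in range(n))   # pays the property overhead on every iteration
def sum_sizes_fast(n=10_000):
    size = df.size                          # pay the overhead once
    return sum(size for _ in range(n))      # reuse the plain int afterwards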

Related

Why is pandas.Series.tolist() faster than pandas.Series.iat[]?

For example, we use the following Series object:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
What is the proper way to access a single value?
I would say mySeries.iat[5] or mySeries.at[5], depending on whether we access by position or by label.
But I found that mySeries.tolist()[5] is 3 or 4 times faster than mySeries.iat[5], which in turn is faster than mySeries.at[5]. ("loc" and "iloc" are even worse.)
It surprises me. What is the advantage of "iat" and "at"?
Because the test uses a short list from a small Series, converting to a list and then indexing is really fast:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
%timeit mySeries.iat[5]
3.61 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
5.11 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
1.58 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit mySeries.tolist()[5]
1.63 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
With 1M values it is slow, because the bottleneck is the conversion to a list:
mySeries = pd.Series( range(0,2000000,2), name='col')
%timeit mySeries.iat[5]
3.46 µs ± 72.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
4.74 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
40.2 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mySeries.tolist()[5]
40.3 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
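The practical takeaway depends on the access pattern: for a single lookup on a large Series, the fixed overhead of iat/at is far cheaper than an O(n) conversion, but if many scalar reads are needed, converting once and indexing the plain container amortizes well. A small sketch of that pattern (using Series.to_numpy for the one-time conversion):
mySeries = pd.Series(range(0, 2000000, 2), name='col')
values = mySeries.to_numpy()        # one O(n) conversion
first, sixth = values[0], values[5] # each subsequent access is a cheap array lookup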

Conjugating a complex number is much faster if the number has the python-native complex type

Conjugating a complex number appears to be about 30 times faster if the type() of the complex number is complex rather than numpy.complex128; see the minimal example below. However, taking the absolute value takes about the same time, and taking the real or the imaginary part is only about 3 times faster.
Why is the conjugate slower by that much? When I take a single element a from a large complex-valued array, it seems I should cast it to complex first (the complex conjugation is part of a larger code which has many (> 10^6) iterations).
import numpy as np
np.random.seed(100)
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]
b = complex(a)
%timeit a.conjugate() # 2.95 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a.conj() # 2.86 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b.conjugate() # 82.8 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(a) # 112 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(b) # 99.6 ns ± 0.623 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.real # 145 ns ± 0.259 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.real # 54.8 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.imag # 144 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.imag # 55.4 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Calling NumPy routines always comes with a fixed overhead, which in this case is more expensive than the cost of the Python-native routine.
As soon as you start processing more than one number (possibly millions) at once, NumPy will be much faster:
import numpy as np
N = 10
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate() # 481 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit [x.conjugate() for x in b] # 605 ns ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
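So, back to the original question: for a scalar that will be touched millions of times, converting it to a native complex once does pay off. A sketch of that pattern (the loop is a stand-in for the larger code mentioned above):
import numpy as np
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]   # numpy.complex128 scalar
z = a.item()               # one-time conversion to a native Python complex
acc = 0 + 0j
for _ in range(10**6):
    acc += z.conjugate()   # native call, no NumPy dispatch overhead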

Efficiency of numpy.dot with dynamic vs. static arrays

I have to do a lot of dot products in my data processing pipeline. So, I was experimenting with the following two pieces of code, where one is about 3 times as fast (in terms of runtime) as its slower counterpart.
slower method (with arrays created on the fly)
In [33]: %timeit np.dot(np.arange(200000), np.arange(200000, 400000))
352 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
faster method (with static arrays)
In [34]: vec1_arr = np.arange(200000)
In [35]: vec2_arr = np.arange(200000, 400000)
In [36]: %timeit np.dot(vec1_arr, vec2_arr)
121 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the first method, which generates the arrays dynamically, 3x slower than the second one? Is it because in the first method much of the extra time is spent allocating memory for the elements? Or are other factors contributing to this degradation?
To gain a little more understanding, I also replicated the setup in pure Python. Surprisingly, there is no performance difference between doing it one way or the other, although it is slower than the NumPy implementation, which is obvious and expected.
In [42]: %timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
12.5 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: vec1 = range(200000)
In [39]: vec2 = range(200000, 400000)
In [40]: %timeit sum(map(operator.mul, vec1, vec2))
12.5 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The behaviour in the pure Python case is clear, because the range function doesn't actually create all those elements; it evaluates lazily (i.e. the values are generated on the fly).
Note: The pure Python implementation is just to convince myself that the array allocation might be the factor causing the drag. It's not meant as a comparison with the NumPy implementation.
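To make the laziness of range concrete, compare memory footprints (sizes are approximate, CPython on a 64-bit build):
import sys
sys.getsizeof(range(200000))         # ~48 bytes: only start, stop and step are stored
sys.getsizeof(list(range(200000)))   # ~1.6 MB: all element slots materialized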
The pure Python test is not fair, because np.arange(200000) really materializes an array while range(200000) only returns a lazy range object. So the two pure Python methods both generate their values on the fly.
import operator
%timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
# 15.1 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = range(200000)
vec2 = range(200000, 400000)
%timeit sum(map(operator.mul, vec1, vec2))
# 15.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = list(range(200000))
vec2 = list(range(200000, 400000))
%timeit sum(map(operator.mul, vec1, vec2))
# 12.4 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And we can see the time cost of the allocation:
import numpy as np
%timeit np.arange(200000), np.arange(200000, 400000)
# 632 µs ± 9.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 703 µs ± 5.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 77.7 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Makes sense.
The difference in speed is due to allocating the arrays in the slower case. Below I pasted %timeit output that takes the allocation of the arrays into account in both cases. The OP's timeit commands only accounted for allocation in the slower case, not in the faster one.
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 524 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000); np.dot(vec1_arr, vec2_arr)
# 523 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The allocation of both arrays takes about 360 microseconds on my machine, and the np.dot operation takes 169 microseconds. The sum of those two durations is 529 microseconds, which matches the %timeit output above.
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000)
# 360 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 169 µs ± 5.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
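The takeaway for a pipeline doing many dot products is to hoist the operand construction out of the loop whenever the arrays can be reused. A sketch (the surrounding loop is hypothetical):
import numpy as np
vec1_arr = np.arange(200000)           # allocate once ...
vec2_arr = np.arange(200000, 400000)
results = [np.dot(vec1_arr, vec2_arr) for _ in range(100)]   # ... pay only for the dot inside the loop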

MultiIndex slicing performance issue

Define a pandas DataFrame like below:
import numpy as np
import pandas as pd
n=1000
x=np.repeat(range(n),n)
y=np.tile(range(n),n)
z=np.random.random(n*n)
df=pd.DataFrame({'x':x,'y':y,'z':z})
df=df.set_index(['x','y']).sort_index()
idx=pd.IndexSlice
then run some index-slicing timings:
%timeit -n100 df.loc[idx[1],:]
%timeit -n100 df.loc[idx[1,1],:]
%timeit -n100 df.loc[idx[1:10],:]
%timeit -n100 df.loc[idx[1:10,1],:]
gives
361 µs ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
164 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
165 µs ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.35 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, df.loc[idx[1:10,1],:] takes much, much more time, which seems like a performance bug. What is wrong here?
On the other hand, although it is said that a pandas index is hashed, indexing is far slower than a dict.
Let's prepare a somewhat equivalent dict
d={i:{k:k for k in range(n)} for i in range(n)}
and similar timing
%timeit -n100 d[1]
%timeit -n100 d[1][1]
%timeit -n100 [d[i] for i in range(10)]
%timeit -n100 [d[i][1] for i in range(10)]
gives
36.3 ns ± 3.68 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
52.7 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
811 ns ± 7.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.02 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wow, 1000 times faster than pandas indexing! Why is pandas index slicing so slow?
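Part of the gap is that each df.loc call does far more than a hash lookup: it parses the indexer, resolves positional locations on the MultiIndex, and constructs a new DataFrame. One way to see that split (assumption: MultiIndex.get_locs is the resolver that .loc relies on for slicers) is to time the lookup and the result construction separately:
locs = df.index.get_locs([slice(1, 10), 1])   # positional indexer for idx[1:10, 1]
sub = df.iloc[locs]                           # result construction; same rows as df.loc[idx[1:10, 1], :]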

Computing a slightly different matrix multiplication

I'm trying to find the best way to compute the minimum element-wise products between two sets of vectors. The usual matrix multiplication C = A @ B computes C_ij as the sum of the pairwise products of the elements of the vectors A_i and B^T_j. I would like to take the minimum of the pairwise products instead. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D array of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually don't know how to do this).
Do you have any idea how I could achieve this operation ?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2, 1*2+1*1], [1*0+1*2, 1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2), min(1*2,1*1)], [min(1*0,1*2), min(1*2,1*1)]] = [[0,1],[0,1]]
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
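Applied to the toy example above, this reproduces the expected result:
import numpy as np
A = np.asarray([[1,1],[1,1]])
B = np.asarray([[0,2],[2,1]])
np.min(A[:,None]*B, axis=2)   # array([[0, 1], [0, 1]])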
If you care about memory footprint, use the numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much as I was expecting.
Numba can also be an option.
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm apply).
import numpy as np
import numba as nb
import numexpr as ne

@nb.njit(fastmath=True, parallel=True)
def min_pairwise_prod(A, B):
    # res[i, j] = min over k of A[i, k] * B[j, k]
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]))
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            min_prod = A[i, 0] * B[j, 0]
            for k in range(B.shape[1]):
                prod = A[i, k] * B[j, k]
                if prod < min_prod:
                    min_prod = prod
            res[i, j] = min_prod
    return res
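Note that the first call triggers JIT compilation; warm the function up once before timing (assumption: the timings below were taken after such a warm-up call):
_ = min_pairwise_prod(np.random.rand(4, 4), np.random.rand(4, 4))   # compile outside the timed runs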
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
