I have these two pieces of code that do the same thing, but for different data structures.
res = np.array([np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])], dtype=int)
%timeit np.sum(res, axis=1)
4.08 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
list_obj_array = np.ndarray((2,), dtype=object)
list_obj_array[0] = [2.0, 4.0, 6.0]
list_obj_array[1] = [8.0, 10.0, 12.0]
v_func = np.vectorize(np.sum, otypes=[int])
%timeit v_func(list_obj_array)
20.6 µs ± 486 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The second one is about 5 times slower. Is there a better way to optimize this?
@nb.jit()
def nb_np_sum(arry_list):
    return [np.sum(row) for row in arry_list]
%timeit nb_np_sum(list_obj_array)
30.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
@nb.jit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
%timeit nb_sum(list_obj_array)
13.6 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Best so far (thanks @hpaulj):
%timeit [sum(l) for l in list_obj_array]
850 ns ± 115 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
@nb.njit()
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'sum': cannot determine Numba type of <class 'builtin_function_or_method'>
File "<ipython-input-54-3bb48c5273bb>", line 3:
def nb_sum(arry_list):
    return [sum(row) for row in arry_list]
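As the error shows, Numba's nopython mode cannot type the Python builtin sum, and it cannot type object-dtype arrays either. A minimal sketch of a workaround (my own, not from the original post; nb_row_sums and the stacking step are assumptions), which works only when the rows can first be stacked into a regular 2D float array:
import numba as nb
import numpy as np

@nb.njit
def nb_row_sums(arr2d):
    # Plain loops compile well under Numba; no builtin sum needed.
    out = np.empty(arr2d.shape[0])
    for i in range(arr2d.shape[0]):
        s = 0.0
        for j in range(arr2d.shape[1]):
            s += arr2d[i, j]
        out[i] = s
    return out

# Object-dtype input must be converted first (possible only when all
# rows have equal length):
rows = np.stack([np.asarray(r, dtype=np.float64) for r in list_obj_array])
nb_row_sums(rows)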
For a longer array:
list_obj_array = np.ndarray((n,), dtype=object)
for i in range(n):
list_obj_array[i] = list(range(7))
the vectorized version comes closer to the best option (the list comprehension):
%timeit [sum(l) for l in list_obj_array]
23.4 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit v_func(list_obj_array)
29.6 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numba is still slower:
%timeit nb_sum(list_obj_array)
74.4 µs ± 6.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Since you used otypes, you read enough of the vectorize docs to know that it is not a performance tool.
In [430]: timeit v_func(list_obj_array)
38.3 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
A list comprehension is faster:
In [431]: timeit [sum(l) for l in list_obj_array]
2.08 µs ± 62.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Even better if you start with a list of lists instead of an object-dtype array:
In [432]: alist = list_obj_array.tolist()
In [433]: timeit [sum(l) for l in alist]
542 ns ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Edit:
np.frompyfunc is faster than np.vectorize, especially when working with object dtype arrays:
In [459]: np.frompyfunc(sum,1,1)(list_obj_array)
Out[459]: array([12.0, 30.0], dtype=object)
In [460]: timeit np.frompyfunc(sum,1,1)(list_obj_array)
2.22 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As I've seen elsewhere, frompyfunc is competitive with the list comprehension.
Interestingly, using np.sum instead of sum slows it down. I think that's because np.sum applied to lists has the overhead of converting the lists to arrays, while sum applied to lists of numbers is quite fast, using Python's own compiled code.
In [461]: timeit np.frompyfunc(np.sum,1,1)(list_obj_array)
30.3 µs ± 165 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
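A quick way to see that conversion overhead in isolation (my own check, not from the original answer; exact numbers will vary by machine):
lst = [2.0, 4.0, 6.0]
arr = np.asarray(lst)
%timeit sum(lst)     # pure-Python sum, no conversion
%timeit np.sum(lst)  # converts the list to an array on every call
%timeit np.sum(arr)  # conversion already paid; mostly call overhead remains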
So let's try sum in your vectorize:
In [462]: v_func = np.vectorize(sum, otypes=[int])
In [463]: timeit v_func(list_obj_array)
8.7 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Much better.
Related
For example, we use the following Series object:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
What is the proper way to access a value element?
I would say mySeries.iat[5] or mySeries.at[5], depending on whether we use position or label.
But I found that mySeries.tolist()[5] is 3 or 4 times faster than mySeries.iat[5], which is faster than mySeries.at[5]. (loc and iloc are even worse.)
It surprises me. What is the advantage of iat and at?
Because you test a short list from a small Series, converting to a list and indexing is really fast:
mySeries = pd.Series( range(0,20,2), index=range(1,11), name='col')
%timeit mySeries.iat[5]
3.61 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
5.11 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
1.58 µs ± 78.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit mySeries.tolist()[5]
1.63 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
With 1M values it is slow, because the bottleneck is converting to a list:
mySeries = pd.Series( range(0,2000000,2), name='col')
%timeit mySeries.iat[5]
3.46 µs ± 72.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.at[5]
4.74 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mySeries.tolist()
40.2 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit mySeries.tolist()[5]
40.3 ms ± 517 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
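If repeated scalar lookups are the goal, the pattern these timings suggest is to pay the conversion once and then index the plain container (a sketch of my own, not from the original answer):
vals = mySeries.tolist()    # pay the conversion once
vals[5]                     # then every lookup is plain list indexing
arr = mySeries.to_numpy()   # or keep it as a NumPy array
arr[5]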
Conjugating a complex number appears to be about 30 times faster if the type() of the complex number is complex rather than numpy.complex128; see the minimal example below. However, taking the absolute value takes about the same time, and taking the real and imaginary parts is only about 3 times faster.
Why is the conjugate slower by that much? When I take a single element a from a large complex-valued array, it seems I should cast it to complex first (the complex conjugation is part of a larger code with many (> 10^6) iterations).
import numpy as np
np.random.seed(100)
a = (np.random.rand(1) + 1j*np.random.rand(1))[0]
b = complex(a)
%timeit a.conjugate() # 2.95 µs ± 24 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit a.conj() # 2.86 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit b.conjugate() # 82.8 ns ± 1.28 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(a) # 112 ns ± 1.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit abs(b) # 99.6 ns ± 0.623 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.real # 145 ns ± 0.259 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.real # 54.8 ns ± 0.121 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit a.imag # 144 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit b.imag # 55.4 ns ± 0.297 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Calling NumPy routines always comes with a fixed cost, which in this case is more expensive than the cost of the Python-native routine.
As soon as you start processing more than one number (possibly millions) at once, NumPy will be much faster:
import numpy as np
N = 10
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate() # 481 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit [x.conjugate() for x in b] # 605 ns ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
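Scaling the same comparison up illustrates the point (my own sketch; exact timings will vary by machine, but the single vectorized call should win by orders of magnitude):
N = 1_000_000
a = np.random.rand(N) + 1j*np.random.rand(N)
b = [complex(x) for x in a]
%timeit a.conjugate()               # one call; the loop runs in C
%timeit [x.conjugate() for x in b]  # ~N interpreter round trips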
I have to do a lot of dot products in my data processing pipeline, so I was experimenting with the following two pieces of code, where one is about 3 times as efficient (in terms of runtime) as its slower counterpart.
Slowest method (with arrays created on the fly):
In [33]: %timeit np.dot(np.arange(200000), np.arange(200000, 400000))
352 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Fastest method (with static arrays):
In [34]: vec1_arr = np.arange(200000)
In [35]: vec2_arr = np.arange(200000, 400000)
In [36]: %timeit np.dot(vec1_arr, vec2_arr)
121 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the first method, which generates the arrays dynamically, 3x slower than the second? Is it because much of the extra time in the first method is spent allocating memory for the elements, or are other factors contributing to this degradation?
To gain a little more understanding, I also replicated the setup in pure Python. Surprisingly, there is no performance difference between doing it one way or the other, although both are slower than the NumPy implementation, which is obvious and expected.
In [42]: %timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
12.5 ms ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [38]: vec1 = range(200000)
In [39]: vec2 = range(200000, 400000)
In [40]: %timeit sum(map(operator.mul, vec1, vec2))
12.5 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The behaviour in the pure Python case is clear because the range function doesn't actually create all those elements; it evaluates lazily (values are generated on the fly).
Note: the pure Python implementation is just to convince myself that array allocation might be the factor causing the drag. It's not meant as a comparison with the NumPy implementation.
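The laziness of range is easy to verify (a quick check of my own, not from the original): a range object has a small constant size regardless of its length, while materializing it as a list allocates storage for every element.
import sys
r = range(200000)
sys.getsizeof(r)        # ~48 bytes: only start/stop/step are stored
sys.getsizeof(list(r))  # ~1.6 MB: a pointer slot for every element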
The pure Python test is not fair, because np.arange(200000) really returns an array while range(200000) only returns a lazy range object, so both pure-Python variants generate their values on the fly either way.
import operator
%timeit sum(map(operator.mul, range(200000), range(200000, 400000)))
# 15.1 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = range(200000)
vec2 = range(200000, 400000)
%timeit sum(map(operator.mul, vec1, vec2))
# 15.2 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vec1 = list(range(200000))
vec2 = list(range(200000, 400000))
%timeit sum(map(operator.mul, vec1, vec2))
# 12.4 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And we can see the time cost of allocation:
import numpy as np
%timeit np.arange(200000), np.arange(200000, 400000)
# 632 µs ± 9.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 703 µs ± 5.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 77.7 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Makes sense.
The difference in speed is due to allocating the arrays in the slower case. Below is %timeit output that takes the array allocation into account in both cases; the OP's timeit commands only included allocation in the slower case, not in the faster one.
%timeit np.dot(np.arange(200000), np.arange(200000, 400000))
# 524 µs ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000); np.dot(vec1_arr, vec2_arr)
# 523 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The allocation of both arrays takes about 360 microseconds on my machine, and the np.dot operation takes 169 microseconds. The sum of those two durations is 529 microseconds, which is consistent with the %timeit output above.
%timeit vec1_arr = np.arange(200000); vec2_arr = np.arange(200000, 400000)
# 360 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vec1_arr = np.arange(200000)
vec2_arr = np.arange(200000, 400000)
%timeit np.dot(vec1_arr, vec2_arr)
# 169 µs ± 5.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Define a pandas DataFrame like below:
import numpy as np
import pandas as pd
n=1000
x=np.repeat(range(n),n)
y=np.tile(range(n),n)
z=np.random.random(n*n)
df=pd.DataFrame({'x':x,'y':y,'z':z})
df=df.set_index(['x','y']).sort_index()
idx=pd.IndexSlice
then some index-slicing timings
%timeit -n100 df.loc[idx[1],:]
%timeit -n100 df.loc[idx[1,1],:]
%timeit -n100 df.loc[idx[1:10],:]
%timeit -n100 df.loc[idx[1:10,1],:]
gives
361 µs ± 53.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
164 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
165 µs ± 8.45 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.35 ms ± 51.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As you can see, df.loc[idx[1:10,1],:] takes much, much more time, which seems like a performance bug. What is wrong here?
On the other hand, the pandas index is said to be hashed, yet indexing is far slower than a dict.
Let's prepare a somewhat equivalent dict:
d={i:{k:k for k in range(n)} for i in range(n)}
and a similar set of timings
%timeit -n100 d[1]
%timeit -n100 d[1][1]
%timeit -n100 [d[i] for i in range(10)]
%timeit -n100 [d[i][1] for i in range(10)]
gives
36.3 ns ± 3.68 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
52.7 ns ± 3.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
811 ns ± 7.54 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.02 µs ± 79.6 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wow, 1000 times faster than pandas indexing! Why is pandas index slicing so slow?
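One way to probe where the time goes (my own sketch, not part of the original question) is to time the label-to-position resolution separately from the row materialization; MultiIndex.get_locs does the former:
%timeit -n100 df.index.get_locs([slice(1, 10), 1])  # resolve labels to positions
locs = df.index.get_locs([slice(1, 10), 1])
%timeit -n100 df.iloc[locs]                         # fetch the rows by position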
I'm trying to find the best way to compute the minimum element-wise products between two sets of vectors. The usual matrix multiplication C = A @ B computes C_ij as the sum of the pairwise products of the elements of row i of A and column j of B. I would instead like to take the minimum of those pairwise products, and I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D array of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually don't know how to do this).
Do you have any idea how I could achieve this operation?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2, 1*2+1*1], [1*0+1*2, 1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2), min(1*2,1*1)], [min(1*0,1*2), min(1*2,1*1)]] = [[0,1],[0,1]]
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
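To see why this works, consider the shapes (an illustration of my own using the example above): A[:,None] has shape (2, 1, 2) and B has shape (2, 2), so broadcasting produces a (2, 2, 2) cube whose [i, j, k] entry is A[i,k]*B[j,k]; taking the min over axis=2 yields exactly the "minimum matmul".
A = np.asarray([[1, 1], [1, 1]])
B = np.asarray([[0, 2], [2, 1]])
C_out = np.min(A[:, None] * B, axis=2)
print(C_out)  # [[0 1]
              #  [0 1]]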
If you care about memory footprint, use the numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much as I was expecting.
Numba can also be an option.
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm apply).
import numpy as np
import numba as nb
import numexpr as ne
@nb.njit(fastmath=True, parallel=True)
def min_pairwise_prod(A, B):
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]))
    for i in nb.prange(A.shape[0]):      # parallel over rows of A
        for j in range(B.shape[0]):
            min_prod = A[i, 0] * B[j, 0]
            for k in range(B.shape[1]):  # running minimum, no temporaries
                prod = A[i, k] * B[j, k]
                if prod < min_prod:
                    min_prod = prod
            res[i, j] = min_prod
    return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)