When I run this parallel dask.bag code below, I seem to get much slower computation than the sequential Python code. Any insights into why?
import dask.bag as db
def is_even(x):
return not x % 2
Dask code:
%%timeit
b = db.from_sequence(range(2000000))
c = b.filter(is_even).map(lambda x: x ** 2)
c.compute()
>>> 12.8 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 50.7 s ± 2.76 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Python code:
%%timeit
b = list(range(2000000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 2, b))
>>> 547 ms ± 8.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With n = 8000000
>>> 2.25 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thanks to #abarnert for the suggestion to look at overhead through longer task length.
It seems like the length of each task was too short, and the overhead made Dask slower. I changed the exponent from 2 to 10000 to make each task longer. This example produces what I was expecting:
Python code:
%%timeit
b = list(range(50000))
b = list(filter(is_even, b))
b = list(map(lambda x: x ** 10000, b))
>>> 34.8 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Dask code:
%%timeit
b = db.from_sequence(range(50000))
c = b.filter(is_even).map(lambda x: x ** 10000)
c.compute()
>>> 26.4 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related
I am unsure as to the cost of transforming a matrix of tuples into a list form which is easier to manipulate. The main priority is being able to change a column of the matrix as fast as possible
I have a matrix in the form of
[(a,b,c),(d,e,f),(g,h,i)]
which can appear in any size of n x m but for this example we'll take 3x3 matrix.
my main goal is to be able to change the values of any column in the matrix (only one at a time) (eg (b,e,h)).
my initial attempt was to transform the matrix into a list ie
[[a,b,c],[d,e,f],[g,h,i]]
which would be easier
but I feel that it would be costly in terms of transforming every tuple into a list and back into a tuple.
My main question could be how to optimize this to its fullest?
In [37]: def change_column_list_comp(old_m, col, value):
...: return [
...: tuple(list(row[:col]) + [value] + list(row[col + 1:]))
...: for row in old_m
...: ]
...:
In [38]: def change_column_list_convert(old_m, col, value):
...: list_m = list(map(list, old_m))
...: for row in list_m:
...: row[col] = value
...:
...: return list(map(tuple, list_m))
...:
In [39]: m = [tuple('abc'), tuple('def'), tuple('ghi')]
In [40]: %timeit change_column_list_comp(m, 1, 2)
2.05 µs ± 89.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [41]: %timeit change_column_list_convert(m, 1, 2)
1.28 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like converting to a list, modifying the values, and converting back to tuple is faster. Note that this may not be the most efficient way of writing these functions.
However, these functions seem to start to converge as we scale up our matrix.
In [6]: m_100k = [tuple(string.printable)] * 100_000
In [7]: %timeit change_column_list_comp(m_100k, 1, 2)
163 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit change_column_list_convert(m_100k, 1, 2)
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: m_1m = [tuple(string.printable)] * 1_000_000
In [43]: %timeit change_column_list_comp(m_1m, 1, 2)
1.72 s ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit change_column_list_convert(m_1m, 1, 2)
1.24 s ± 84.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At the end of the day you should be using the right tools for the job. While it's not really in the OP, it's just worth mentioning that numpy is simply the better way to go.
In [13]: m_np = np.array([list('abc'), list('def'), list('ghi')])
In [17]: %timeit m_np[:, 1] = 2; m_np
610 ns ± 48.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [20]: m_np_100k = np.array([[string.printable] * 100_000])
In [21]: %timeit m_np_100k[:, 1] = 2; m_np_100k
545 ns ± 63.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [22]: m_np_1m = np.array([[string.printable] * 1_000_000])
# This might be using cached data
In [23]: %timeit m_np_1m[:, 1] = 2; m_np_1m
515 ns ± 31.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# Avoiding cache
In [24]: %timeit m_np_1m[:, 4] = 9; m_np_1m
557 ns ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This might not be the fairest comparison as we're manually returning the matrix, but you can see there is significant improvement.
What is the fastest possible way to run:
reduce(lambda x,y : x#y, ls)
in python?
for a list of matrices ls. I don't have an Nvidia GPU, but I do have a lot of CPU cores to work with. I thought I could make the process work in parallel (split it to log iterations), but it seems that for small (1000x1000) matrix, this is actually worst. Here is the code I tried:
from multiprocessing import Pool
import numpy as np
from itertools import zip_longest
def matmul(x):
if x[1] is None:
return x[0]
return x[1]#x[0]
def fast_mul(ls):
while True:
n = len(ls)
if n == 0:
raise Exception("Splitting Error")
if n == 1:
return ls[0]
if n == 2:
return ls[1]#ls[0]
with Pool(processes=(n//2+1)) as pool:
ls = pool.map(matmul, list(zip_longest(*[iter(ls)]*2)))
There is a function to do this: np.linalg.multi_dot, supposedly optimized for the best evaluation order:
np.linalg.multi_dot(ls)
In fact the docs say something very close to your original phrasing:
Think of multi_dot as:
def multi_dot(arrays): return functools.reduce(np.dot, arrays)
You could also try np.einsum, which will allow you to multiply up to 25 matrices:
from string import ascii_lowercase
ls = [...]
index = ','.join(ascii_lowercase[x:x + 2] for x in range(len(ls)))
index += f'->{index[0]}{index[-1]}'
np.einsum(index, *ls)
Timing
Simple case:
ls = np.random.rand(100, 1000, 1000) - 0.5
%timeit reduce(lambda x, y : x # y, ls)
4.3 s ± 76.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.matmul, ls)
4.35 s ± 84.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.dot, ls)
4.86 s ± 68.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.linalg.multi_dot(ls)
5.24 s ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
More complicated case:
ls = [x.T if i % 2 else x for i, x in enumerate(np.random.rand(100, 2000, 500) - 0.5)]
%timeit reduce(lambda x, y : x # y, ls)
7.94 s ± 96.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.matmul, ls)
7.91 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.dot, ls)
9.38 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.linalg.multi_dot(ls)
2.03 s ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Notice that the up-front work done by multi_dot has negative benefit in the straightforward case (and more suprisingly, lambda works faster than the raw operator), but saves 75% of the time in the less straightforward case.
So just for completeness, here is a less non-square case:
ls = [x.T if i % 2 else x for i, x in enumerate(np.random.rand(100, 400, 300) - 0.5)]
%timeit reduce(lambda x, y : x # y, ls)
245 ms ± 8.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.matmul, ls)
245 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit reduce(np.dot, ls)
284 ms ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.linalg.multi_dot(ls)
638 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So really it seems that for most general cases, your original reduce call is actually about as good as you need to get. My only suggestion would be to use operator.matmul instead of the lambda.
EDIT: Threw in yet another possible function
EDIT: I added the results with np.linalg.multi_dot, expecting it would be faster than the rest but actually it is much slower somehow. I suppose it is design with other kind of use case in mind.
I'm not sure you will be able to get much faster than that. Here are a few different implementations of the reduction for the case where the data is a 3D array of square matrices:
from multiprocessing import Pool
from functools import reduce
import numpy as np
import numba as nb
def matmul_n_naive(data):
return reduce(np.matmul, data)
# If you don't care about modifying data pass copy=False
def matmul_n_binary(data, copy=True):
if len(data) < 1:
raise ValueError
data = np.array(data, copy=copy)
n, r, c = data.shape
dt = data.dtype
s = 1
while (n + s - 1) // s > 1:
a = data[:n - s:2 * s]
b = data[s:n:2 * s]
np.matmul(a, b, out=a)
s *= 2
return np.array(a[0])
def matmul_n_pool(data):
if len(data) < 1:
raise ValueError
lst = data
with Pool() as pool:
while len(lst) > 1:
lst_next = pool.starmap(np.matmul, zip(lst[::2], lst[1::2]))
if len(lst) % 2 != 0:
lst_next.append(lst[-1])
lst = lst_next
return lst[0]
#nb.njit(parallel=False)
def matmul_n_numba_nopar(data):
res = np.eye(data.shape[1], data.shape[2], dtype=data.dtype)
for i in nb.prange(len(data)):
res = res # data[i]
return res
#nb.njit(parallel=True)
def matmul_n_numba_par(data):
res = np.eye(data.shape[1], data.shape[2], dtype=data.dtype)
for i in nb.prange(len(data)): # Numba knows how to do parallel reductions correctly
res = res # data[i]
return res
def matmul_n_multidot(data):
return np.linalg.multi_dot(data)
And a test:
# Test
import numpy as np
np.random.seed(0)
a = np.random.rand(10, 100, 100) * 2 - 1
b1 = matmul_n_naive(a)
b2 = matmul_n_binary(a)
b3 = matmul_n_pool(a)
b4 = matmul_n_numba_nopar(a)
b5 = matmul_n_numba_par(a)
b6 = matmul_n_multidot(a)
print(np.allclose(b1, b2))
# True
print(np.allclose(b1, b3))
# True
print(np.allclose(b1, b4))
# True
print(np.allclose(b1, b5))
# True
print(np.allclose(b1, b6))
# True
Here are some benchmarks, it seems there is no consistent winner but the "naive" solution is pretty good all around, binary and Numba vary, the process pool is not really good and np.linalg.multi_dot does not seem to be very advantageous with square matrices.
import numpy as np
# 10 matrices 1000x1000
np.random.seed(0)
a = np.random.rand(10, 1000, 1000) * 0.1 - 0.05
%timeit matmul_n_naive(a)
# 121 ms ± 6.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit matmul_n_binary(a)
# 165 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit matmul_n_numba_nopar(a)
# 108 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit matmul_n_numba_par(a)
# 244 ms ± 7.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit matmul_n_multidot(a)
# 132 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 200 matrices 100x100
np.random.seed(0)
a = np.random.rand(200, 100, 100) * 0.1 - 0.05
%timeit matmul_n_naive(a)
# 4.4 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit matmul_n_binary(a)
# 13.4 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit matmul_n_numba_nopar(a)
# 9.51 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit matmul_n_numba_par(a)
# 4.93 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit matmul_n_multidot(a)
# 1.14 s ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 300 matrices 10x10
np.random.seed(0)
a = np.random.rand(300, 10, 10) * 0.1 - 0.05
%timeit matmul_n_naive(a)
# 526 µs ± 953 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_binary(a)
# 152 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit matmul_n_pool(a)
# 610 ms ± 5.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit matmul_n_numba_nopar(a)
# 239 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_numba_par(a)
# 175 µs ± 422 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit matmul_n_multidot(a)
# 3.68 s ± 87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 1000 matrices 10x10
np.random.seed(0)
a = np.random.rand(1000, 10, 10) * 0.1 - 0.05
%timeit matmul_n_naive(a)
# 1.56 ms ± 4.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_binary(a)
# 392 µs ± 790 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_pool(a)
# 727 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit matmul_n_numba_nopar(a)
# 589 µs ± 356 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_numba_par(a)
# 451 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit matmul_n_multidot(a)
# Never finished...
learn python by myself. I made a function for filling list. But I have 2 variants, and I want to discover which one is better and why. Or they both awful anyway I want to know truth.
def foo (x):
l = [0] * x
for i in range(x):
l[i] = i
return l
def foo1 (x):
l = []
for i in range(x):
l.append(i)
return l
from a performance perspective the first version foo is better:
%timeit foo(1000000)
# 52.4 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit foo1(1000000)
# 67.2 ms ± 916 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
but the pythonic way to unpack an iterator in a list will be:
list(range(x))
also is faster:
%timeit list(range(1000000))
# 26.7 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Is there any downside to validating args this way:
if x in [1,2,3]:
...
Or is it better to do the more traditional way:
if x == 1 or x == 2 or x == 3:
...
The only downside I could see is performance.
On my machine, windows 8.1, python 3.7.3, via ipython, I get:
x = 4
%timeit x in [1,2,3]
183 ns ± 19 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
%timeit x == 1 or x == 2 or x == 3
295 ns ± 19.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So it seems both cleaner and faster.
FYI, using a set, I get:
%timeit x in {1,2,3}
116 ns ± 7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
I've implemented two python versions of the checksum part of Luhn's algorithm. The code snippets are nearly identical except for how they calculate the second summation. It differs from the usual implementation, in that it computes the total sum, and then updates the sum with a correction factor (as seen here).
def manual_loop(xs):
differences = (0, 1, 2, 3, 4, -4, -3, -2, -1, 0)
total = sum(xs) % 10
for x in xs[len(xs)%2::2]:
total += differences[x]
return total % 10
def builtin_loop(xs):
differences = (0, 1, 2, 3, 4, -4, -3, -2, -1, 0)
total = sum(xs) % 10
total += sum(differences[x] for x in xs[len(xs)%2::2])
return total % 10
I then timed the code in an Ipython terminal using timeit on three samples.
from random import randint
randoms5 = [randint(0, 9) for _ in range(10**5)]
randoms6 = [randint(0, 9) for _ in range(10**6)]
randoms7 = [randint(0, 9) for _ in range(10**7)]
The results are as follows:
>>> %timeit manual_loop(randoms5)
2.69 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit manual_loop(randoms6)
29.3 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit manual_loop(randoms7)
311 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit builtin_loop(randoms5)
3.31 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit builtin_loop(randoms6)
34.8 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit builtin_loop(randoms7)
337 ms ± 5.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
What gives? I would have expected python's built-in sum to give far better performance than doing the looping myself, especially for lists of this size.
NOTE: I have left out other variants such as doing both summations manually and swapping which summation used the built-in sum as they gave far worse times.
EDIT: Upon request I've rerun the tests with a fixed seed.
import random
random.seed(42)
randoms5 = [random.randint(0, 9) for _ in range(10**5)]
randoms6 = [random.randint(0, 9) for _ in range(10**6)]
randoms7 = [random.randint(0, 9) for _ in range(10**7)]
Results are more expected for randoms5, but remain weird for bigger tests.
>>> %timeit manual_loop(randoms5)
3.36 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit builtin_loop(randoms5)
3.23 ms ± 76.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit manual_loop(randoms6)
31.2 ms ± 890 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit builtin_loop(randoms6)
35.4 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit manual_loop(randoms7)
311 ms ± 5.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit builtin_loop(randoms7)
341 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your sum call doesn't get to avoid an interpreted loop. sum itself isn't interpreted, but the generator expression it's looping over is interpreted. Generator expressions have higher overhead than regular Python loops, due to all the work of suspending and resuming the generator's stack frame over and over.