I was profiling Erlang's lists:reverse built-in function (BIF) to see how well it scales with the size of the input. More specifically, I tried:
1> X = lists:seq(1, 1000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]
2> timer:tc(lists, reverse, [X]).
{57737,
[1000000,999999,999998,999997,999996,999995,999994,999993,
999992,999991,999990,999989,999988,999987,999986,999985,
999984,999983,999982,999981,999980,999979,999978,999977,
999976,999975,999974|...]}
3> timer:tc(lists, reverse, [X]).
{46896,
[1000000,999999,999998,999997,999996,999995,999994,999993,
999992,999991,999990,999989,999988,999987,999986,999985,
999984,999983,999982,999981,999980,999979,999978,999977,
999976,999975,999974|...]}
4> Y = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]
5> timer:tc(lists, reverse, [Y]).
{434079,
[10000000,9999999,9999998,9999997,9999996,9999995,9999994,
9999993,9999992,9999991,9999990,9999989,9999988,9999987,
9999986,9999985,9999984,9999983,9999982,9999981,9999980,
9999979,9999978,9999977,9999976,9999975,9999974|...]}
6> timer:tc(lists, reverse, [Y]).
{214173,
[10000000,9999999,9999998,9999997,9999996,9999995,9999994,
9999993,9999992,9999991,9999990,9999989,9999988,9999987,
9999986,9999985,9999984,9999983,9999982,9999981,9999980,
9999979,9999978,9999977,9999976,9999975,9999974|...]}
OK, so far it seems like the reverse BIF scales in approximately linear time with respect to the input (multiply the size of the input by 10 and the time taken also increases by a factor of 10). In pure Erlang that would make sense, since we would use something like tail recursion to reverse the list. I guess that even as a BIF implemented in C, the algorithm for reversing a list seems to be the same (maybe because of the way lists are represented in Erlang?).
Now I wanted to compare this with another language - perhaps another dynamically typed language that I already use. So I tried a similar thing in Python, taking care to very explicitly use actual lists instead of generators, which I anticipated would affect Python's performance positively in this test, giving it an unfair advantage.
import time

ms_conv_factor = 10**6

def profile(func, *args):
    start = time.time()
    func(args)
    end = time.time()
    elapsed_seconds = end - start
    print(elapsed_seconds * ms_conv_factor, flush=True)

x = list([i for i in range(0, 1000000)])
y = list([i for i in range(0, 10000000)])
z = list([i for i in range(0, 100000000)])

def f(m):
    return m[::-1]

def g(m):
    return reversed(m)

if __name__ == "__main__":
    print("All done loading the lists, starting now.", flush=True)
    print("f:")
    profile(f, x)
    profile(f, y)
    print("")
    profile(f, x)
    profile(f, y)
    print("")
    profile(f, z)
    print("")
    print("g:")
    profile(g, x)
    profile(g, y)
    print("")
    profile(g, x)
    profile(g, y)
    print("")
    profile(g, z)
This seems to suggest that after the function has been loaded and run once, the length of the input makes no difference and the reversal times are incredibly fast - in the range of ~0.7µs.
Exact result:
All done loading the lists, starting now.
f:
1.430511474609375
0.7152557373046875
0.7152557373046875
0.2384185791015625
0.476837158203125
g:
1.9073486328125
0.7152557373046875
0.2384185791015625
0.2384185791015625
0.476837158203125
My first, naive, guess was that Python might be able to recognize the reverse construct and create something like a reverse iterator and return that (Python can work with references, right? Maybe it was using some kind of optimization here). But I don't think that theory makes sense, since the original list and the returned list are not the same (changing one shouldn't change the other).
So my question(s) here is(are):
Is my profiling technique here flawed? Have I written the tests in a way that favors one language over the other?
What is the difference in the implementation of lists and their reversal in Erlang vs Python that makes this situation (of Python being WAY faster) possible?
Thanks for your time (in advance).
This seems to suggest that after the function has been loaded and run once, the length of the input makes no difference and the reversal times are incredibly fast - in the range of ~0.7µs.
Because your profiling function is incorrect. It accepts variable positional arguments, but when it passes them to the function, it doesn't unpack them, so the function under test only ever receives a tuple of length one (containing the whole list) rather than the list itself. You need to do the following:
def profile(func, *args):
    start = time.time()
    func(*args)  # Make sure to unpack the args!
    end = time.time()
    elapsed_seconds = end - start
    print(elapsed_seconds * ms_conv_factor, flush=True)
So notice the difference:
>>> def foo(*args):
...     print(args)
...     print(*args)
...
>>> foo(1, 2, 3)
(1, 2, 3)
1 2 3
Also note, reversed(m) creates a reversed iterator, so it doesn't actually do anything until you iterate over it. So g will still be constant time.
But rest assured, reversing a list in Python takes linear time.
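If you want to convince yourself, here is a small sketch (not part of the original code) that re-times both approaches with timeit once the unpacking is fixed; exact numbers are machine-dependent, but the slice copy should grow roughly 10x per size step, while merely building the reversed iterator stays flat until you consume it:

import timeit

for n in (10**5, 10**6, 10**7):
    m = list(range(n))
    # m[::-1] builds a new reversed copy: time grows linearly with n.
    print(n, "slice:", timeit.timeit(lambda: m[::-1], number=10))
    # reversed(m) only constructs an iterator: roughly constant time.
    print(n, "iter :", timeit.timeit(lambda: reversed(m), number=10))
    # Consuming the iterator is what actually does the work: linear again.
    print(n, "list :", timeit.timeit(lambda: list(reversed(m)), number=10))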
I am looking for a way to partially apply functions in Python that is simple to understand, readable, reusable, and as little prone to coder mistakes as possible. Most of all I want the style to be as performant as possible - fewer frames on the stack is nice, and a smaller memory footprint for the partially applied functions is also desirable. I have considered 4 styles and written examples below:
import functools

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def divide(n, d):
    return n / d

def divider(d):
    return functools.partial(divide, d=d)

times2 = multiplier(2)
print(times2(3))  # 6

by2 = divider(2)
print(by2(6))  # 3.0

by3 = functools.partial(divide, d=3)
print(by3(9))  # 3.0

by4 = lambda n: divide(n, 4)
print(by4(12))  # 3.0
My analysis of them is:
times2 is a nested thing. I guess Python makes a closure with m bound, and everything is nice. The code is readable (I think) and simple to understand. No external libraries. This is the style I use today.
by2 has an explicitly named function, which makes it simple for the user. It uses functools, so it gives you extra imports. I like this style to some extent since it is transparent, and it gives me the option to use divide in other ways if I want to. Contrast this with inner, which is not reachable.
by3 is like by2, but forces the reader of the code to be comfortable with functools.partial since they have it right in their face. What I like less is that PyCharm can't give me tooltips for what the arguments to functools.partial should be, since they are effectively arguments to by3. I have to know the signature of divide myself every time I define some new partial application.
by4 is simple to type, since I can get autocompletion. It needs no import of functools. I think it looks non-Pythonic, though. Also, I always feel uncomfortable about how variable scoping / closures work with lambdas in Python - I'm never sure how that behaves (see the sketch after this list).
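To make that lambda worry concrete, here is a quick sketch (separate from the 4 styles above) of the late-binding behaviour that usually trips people up: a lambda captures the variable itself, not its value at definition time, unless you bind the value as a default argument.

funcs = [lambda x: x * i for i in range(3)]
print([f(10) for f in funcs])    # [20, 20, 20] - every lambda sees the final i

bound = [lambda x, i=i: x * i for i in range(3)]
print([f(10) for f in bound])    # [0, 10, 20] - each value bound at definition time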
What is the logical difference between the styles and how does that affect memory and CPU?
The first way seems to be the most efficient. I tweaked your code so that all 4 functions compute exactly the same mathematical function:
import functools, timeit

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def mult(x, m):
    return m * x

def multer(m):
    return functools.partial(mult, m=m)

f1 = multiplier(2)
f2 = multer(2)
f3 = functools.partial(mult, m=2)
f4 = lambda x: mult(x, 2)

print(timeit.timeit('f1(10)', setup='from __main__ import f1'))
print(timeit.timeit('f2(10)', setup='from __main__ import f2'))
print(timeit.timeit('f3(10)', setup='from __main__ import f3'))
print(timeit.timeit('f4(10)', setup='from __main__ import f4'))
Typical output (on my machine):
0.08207898699999999
0.19439769299999998
0.20093803199999993
0.1442435820000001
The two functools.partial approaches are practically identical (since one of them is just a wrapper for the other), the first is twice as fast, and the last is somewhere in between (but closer to the first). There is a clear overhead in using functools.partial over a straightforward closure. Since the closure approach is arguably more readable as well (and more flexible than the lambda, which doesn't extend well to more complicated functions), I would just go with it.
Technically you're missing one other option: operator.mul does the same thing you're looking to do, and you can just use functools.partial on it to pre-bind the first argument without having to reinvent the wheel.
Not only is it the fastest option, it also uses the least space compared to a custom function or a lambda expression. The fact that it's a partial is why it uses the same space as the other partials, and I think that's the best route here.
from timeit import timeit
from functools import partial
from sys import getsizeof
from operator import mul

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def mult(x, m):
    return m * x

def multer(m):
    return partial(mult, m=m)

f1 = multiplier(2)
f2 = multer(2)
f3 = partial(mult, m=2)
f4 = lambda n: mult(n, 2)
f5 = partial(mul, 2)

from_main = 'from __main__ import {}'.format
print(timeit('f1(10)', from_main('f1')), getsizeof(f1))
print(timeit('f2(10)', from_main('f2')), getsizeof(f2))
print(timeit('f3(10)', from_main('f3')), getsizeof(f3))
print(timeit('f4(10)', from_main('f4')), getsizeof(f4))
print(timeit('f5(10)', from_main('f5')), getsizeof(f5))
Output
0.5278953390006791 144
1.0804575479996856 96
1.0762036349988193 96
0.9348237040030654 144
0.3904160970050725 96
This should answer your question as far as memory usage and speed.
I'm looking for a simple process-based parallel map for Python, that is, a function
parmap(function,[data])
that would run function on each element of [data] on a different process (well, on a different core, but AFAIK, the only way to run stuff on different cores in Python is to start multiple interpreters), and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method in multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing

pool = multiprocessing.Pool()
print(pool.map(f, range(10)))
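And if you want it in exactly the parmap(function, [data]) shape the question asks for, a thin wrapper around Pool.map is enough. A minimal sketch (the parmap name and the squaring function f are just illustrative):

import multiprocessing

def parmap(function, data, processes=None):
    # processes=None lets the pool default to the number of available cores.
    with multiprocessing.Pool(processes) as pool:
        return pool.map(function, data)

def f(x):
    return x**2

if __name__ == "__main__":
    # The __main__ guard matters: child processes re-import this module.
    print(parmap(f, range(10)))   # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]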
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator, and then invoke it with .remote. This will ensure that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results.
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e., in approximately 3 sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python 3's multiprocessing.Pool class has a map() method, and that's all you need to parallelize map:
from multiprocessing import Pool

with Pool() as P:
    xtransList = P.map(some_func, a_list)
Using with Pool() as P creates a process pool as a context manager, and P.map applies some_func to each item in the list in parallel. You can provide the number of processes:
with Pool(processes=4) as P:
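If the list is long and each call is cheap, the chunksize argument mentioned in the docs quoted above can also help; a rough sketch (some_func, the list size, and the chunk size are placeholders) that hands each worker larger batches to reduce inter-process overhead:

from multiprocessing import Pool, cpu_count

def some_func(x):
    return x * x

if __name__ == "__main__":
    a_list = list(range(1000000))
    with Pool(processes=cpu_count()) as P:
        # Larger chunks mean fewer task handoffs between the parent and the workers.
        xtransList = P.map(some_func, a_list, chunksize=10000)
    print(xtransList[:5])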
For those who are looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool

num_cores = multiprocessing.cpu_count()

def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5  # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)

    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))
    pool.close()
    pool.join()
    return df

def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum(np.multiply(U[int(row["obsI"]), :], V[int(row["obsJ"]), :]))

def mProd_func(data, U, V):
    data["predV"] = data.apply(lambda row: vecProd(row, U, V), axis=1)
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])

    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N, K])
    V = np.random.sample(D*K).reshape([D, K])
    return sparseY, U, V

def main():
    Y, U, V = generate_simulated_data()

    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI": int, "obsJ": int, "testV": float, "predV": float, "squareV": float}
    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))

    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)

    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)

    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5, :])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case: I wrote a tool to make this super, super easy called parmapper (I actually call it parmap in my own use, but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambdas and other unpickleable functions
Can apply starmap and other similar call methods to make it very easy to use directly
Can split amongst both threads and/or processes
Includes features such as progress bars
It does incur a small overhead, but for most uses that is negligible.
I hope you find it useful.
(Note: like map in Python 3+, it returns an iterable, so if you expect all results to pass through immediately, use list().)