Using multiprocessing Pool.map with lambdas - python

Setup: I have a function preprocess(data, predicate) and a list of predicates that might look like this:
preds = [lambda x: x < 1,
         lambda x: x < 2,
         lambda x: x < 3,
         lambda x: x < 42]
EDIT: I probably should have been more precise: I thought 1, 2, 3, 42 were obviously identifiable as placeholder examples, but apparently that was too implicit. Actually I'm doing some NLP, where data is a list of words and a predicate looks like lambda w: (w.lower() not in stopwords.words('english') and re.search("[a-z]", w.lower())). I want to test different predicates to evaluate which performs best.
Here is what I actually want to do: call preprocess with every predicate, in parallel.
EDIT: Because this is a preprocessing step, I need what is returned by preprocess so I can continue working with it.
What I hoped I can do but sadly can't:
pool = Pool(processes=4)
pool.map(lambda p: preprocess(data, p), preds)
As far as I understand, this is because everything passed to pool.map has to be picklable. In this question two solutions are suggested, of which the first (the accepted answer) seems impractical and the second doesn't seem to work in Python 2.7, which I'm using, even though pythonic metaphor suggests in the comments that it does.
My question is whether or not pool.map is the right way to go and, if so, how to do it. Or should I try a different approach?
I know there are quite a lot of questions regarding pool.map, and even though I spent some time searching I didn't find an answer. Also, if my code style is awkward, feel free to point it out. I have read that lambda looks strange to some and that I should probably use functools.partial.
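For reference, the functools.partial shape I mean would look roughly like this (threshold_pred is just a made-up stand-in here, since my real predicates are more complex):

import functools

def threshold_pred(threshold, w):  # hypothetical top-level helper
    return w < threshold

# unlike lambdas, partials of a top-level function should pickle (on Python 2.7+)
preds = [functools.partial(threshold_pred, t) for t in (1, 2, 3, 42)]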
Thanks in advance.

In this simple case you can simply modify the preprocess function to accept a threshold argument. Something like:
def preprocess(data, threshold):
    def predicate(x):
        return x < threshold
    return old_preprocess(data, predicate)
Now in your preds list you can simply put the integers, which are picklable, and bind data to preprocess with functools.partial (both the mapped function and the items it is mapped over have to be picklable):

import functools

preds = [1, 2, 3, 42]
pool = Pool(processes=4)
pool.map(functools.partial(preprocess, data), preds)
You can extend it to choose the operator by using the operator module:
def preprocess(data, pred):
    threshold, op = pred
    def predicate(x):
        return op(x, threshold)
    return old_preprocess(data, predicate)

import operator as op
preds = [(1, op.lt), (2, op.gt), (3, op.ge), (42, op.lt)]
pool = Pool(processes=4)
pool.map(functools.partial(preprocess, data), preds)
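As a quick sanity check of how these operator functions behave (standard library behaviour, shown just for illustration):

import operator as op
print(op.lt(1, 3))   # True: op.lt(a, b) is simply a < b
print(op.ge(3, 3))   # True: op.ge(a, b) is a >= b

so predicate(x) above evaluates op(x, threshold), e.g. x < 1 for the first pair.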
To extend it to arbitrary predicates, things get harder. Probably the easiest way is to use the marshal module, which can convert the code of a function into a bytes object and back.
Something like:
real_preds = [marshal.dumps(pred.__code__) for pred in preds]
And then the preprocess should re-build the predicate functions:
import types

def preprocess(data, pred):
    pred = types.FunctionType(marshal.loads(pred), globals())
Here's a MWE for this last suggestion:
>>> from multiprocessing import Pool
>>> import marshal
>>> import types
>>> def preprocess(pred):
...     pred = types.FunctionType(marshal.loads(pred), globals())
...     return pred(2)
...
>>> preds = [lambda x: x < 1,
...          lambda x: x < 2,
...          lambda x: x < 3,
...          lambda x: x < 42]
>>> real_preds = [marshal.dumps(pred.__code__) for pred in preds]
>>> pool = Pool(processes=4)
>>> pool.map(preprocess, real_preds)
[False, False, True, True]
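One caveat worth adding (my note, not part of the original answer): marshal serializes only the code object, so this trick breaks for predicates that close over outer variables; the closure cells are not carried along, and types.FunctionType raises a TypeError unless you supply them yourself. The literal lambdas above are fine because 1, 2, 3 and 42 are constants baked into the code object:

def make_pred(n):
    return lambda x: x < n   # closes over n, so the code object has a free variable

pred = make_pred(3)
code = marshal.dumps(pred.__code__)
# types.FunctionType(marshal.loads(code), globals())  # TypeError: closure cells missing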
Note that the argument to pool.map must be picklable, which means you cannot use a lambda as the first argument to Pool.map:
>>> pool.map(lambda x: preprocess(x), real_preds)
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python3.3/threading.py", line 639, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.3/threading.py", line 596, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.3/multiprocessing/pool.py", line 351, in _handle_tasks
    put(task)
  File "/usr/lib/python3.3/multiprocessing/connection.py", line 206, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
_pickle.PicklingError: Can't pickle <class 'function'>: attribute lookup builtins.function failed
Regarding the "is Pool.map the right tool? I believe it highly depends on the size of the data. Using multiprocessing increases quite a lot the overhead, so even if you "make it work" there are high chances that it isn't worth it. In particular, in your edited question you put a more "real world" scenario for predicates:
lambda w: (w.lower() not in stopwords.words('english') and re.search("[a-z]", w.lower()))
I believe that this predicate doesn't take enough time to make Pool.map worth using. Obviously it depends on the size of w and on the number of elements to map.
Doing some really fast tests with this predicate, I see that Pool.map starts to become faster when w is around 35,000 characters long. If w is under 1,000 characters, Pool is about 15 times slower than a plain map (with 256 strings to check; with 60,000-character strings, Pool is a bit faster).
Notice that if w is quite long it is worth using a def instead of a lambda to avoid computing w.lower() twice, whether you use a plain map or Pool.map.
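For instance, a sketch of that predicate as a def, with the stopword set hoisted out (assuming NLTK's stopwords corpus; the names here are mine):

import re
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = set(stopwords.words('english'))  # build the set once, not per word

def keep_word(w):
    lw = w.lower()  # lower-case once instead of twice
    return lw not in ENGLISH_STOPWORDS and re.search("[a-z]", lw) is not None

Besides avoiding the duplicate w.lower(), this avoids calling stopwords.words('english') for every word, which in practice can matter far more than the double lower().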

You can do this with Pool.map, you just have to organize what you're mapping properly. Maps basically work like this:
result = map(function, things)
is equivalent to
result = []
for thing in things:
    result.append(function(thing))
or, more concisely,
result = [function(thing) for thing in things]
You can structure your function so that it accepts an argument (the upper bound) and does the comparison itself:
def mapme(bound):
    p = lambda x: x < bound
    return preprocess(data, p)
From there, it doesn't matter whether you're doing a parallel map or a single-threaded one. As long as preprocess doesn't have side effects, you can use a map.
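A minimal usage sketch under that structure (my assumptions: mapme is defined at module top level so it pickles by reference, and data is a module-level global that fork-based workers inherit):

from multiprocessing import Pool

bounds = [1, 2, 3, 42]

if __name__ == "__main__":
    pool = Pool(processes=4)
    results = pool.map(mapme, bounds)  # only the picklable ints travel to the workers

The lambda inside mapme is created inside each worker process, so it never needs to be pickled.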

If you're using the functions for their side effects and don't need the unified output of pool.map(), you can simulate it using os.fork() (at least on Unix-like systems).
You could try something like this:
import numpy as np
import os

nprocs = 4
funcs = np.array_split(np.array(preds), nprocs)

# Forks the program into nprocs programs, each with a procid from 0 to nprocs-1
procid = 0
for x in range(1, nprocs):
    if os.fork() == 0:
        procid = x
        break

map(lambda p: preprocess(data, p), funcs[procid])
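Two caveats with this approach (my notes, not part of the original answer): each child should exit once its slice is done, or it will fall through and keep running whatever code follows in the parent; and the parent should reap the children to avoid zombies. A sketch:

import sys

if procid != 0:
    sys.exit(0)       # children stop here once their slice is processed
for x in range(1, nprocs):
    os.wait()         # parent reaps each child

Also note that on Python 3 the bare map(...) above is lazy and would need to be wrapped in list() to actually execute.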

Related

Issues when profiling list reversal in Python vs Erlang

I was profiling Erlang's lists:reverse Built-In Function (BIF) to see how well it scales with the size of the input. More specifically, I tried:
1> X = lists:seq(1, 1000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]
2> timer:tc(lists, reverse, [X]).
{57737,
[1000000,999999,999998,999997,999996,999995,999994,999993,
999992,999991,999990,999989,999988,999987,999986,999985,
999984,999983,999982,999981,999980,999979,999978,999977,
999976,999975,999974|...]}
3> timer:tc(lists, reverse, [X]).
{46896,
[1000000,999999,999998,999997,999996,999995,999994,999993,
999992,999991,999990,999989,999988,999987,999986,999985,
999984,999983,999982,999981,999980,999979,999978,999977,
999976,999975,999974|...]}
4> Y = lists:seq(1, 10000000).
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
23,24,25,26,27,28,29|...]
5> timer:tc(lists, reverse, [Y]).
{434079,
[10000000,9999999,9999998,9999997,9999996,9999995,9999994,
9999993,9999992,9999991,9999990,9999989,9999988,9999987,
9999986,9999985,9999984,9999983,9999982,9999981,9999980,
9999979,9999978,9999977,9999976,9999975,9999974|...]}
6> timer:tc(lists, reverse, [Y]).
{214173,
[10000000,9999999,9999998,9999997,9999996,9999995,9999994,
9999993,9999992,9999991,9999990,9999989,9999988,9999987,
9999986,9999985,9999984,9999983,9999982,9999981,9999980,
9999979,9999978,9999977,9999976,9999975,9999974|...]}
OK, so far it seems like the reverse BIF scales approximately linearly with the size of the input (e.g. multiply the size of the input by 10 and the time taken also increases by a factor of 10). In pure Erlang that would make sense, since we would use something like tail recursion to reverse the list. I guess that even as a BIF implemented in C, the algorithm for reversing a list is the same (maybe because of the way lists are represented in Erlang?).
Now I wanted to compare this with another language, perhaps another dynamically typed language that I already use. So I tried a similar thing in Python, taking care to, very explicitly, use actual lists instead of generators (which I anticipated would positively affect Python's performance in this test, giving it an unfair advantage).
import time

ms_conv_factor = 10**6

def profile(func, *args):
    start = time.time()
    func(args)
    end = time.time()
    elapsed_seconds = end - start
    print(elapsed_seconds * ms_conv_factor, flush=True)

x = list([i for i in range(0, 1000000)])
y = list([i for i in range(0, 10000000)])
z = list([i for i in range(0, 100000000)])

def f(m):
    return m[::-1]

def g(m):
    return reversed(m)
if __name__ == "__main__":
    print("All done loading the lists, starting now.", flush=True)
    print("f:")
    profile(f, x)
    profile(f, y)
    print("")
    profile(f, x)
    profile(f, y)
    print("")
    profile(f, z)
    print("")
    print("g:")
    profile(g, x)
    profile(g, y)
    print("")
    profile(g, x)
    profile(g, y)
    print("")
    profile(g, z)
This seems to suggest that after the function has been loaded and run once, the length of the input makes no difference and the reversal times are incredibly fast - in the range of ~0.7µs.
Exact result:
All done loading the lists, starting now.
f:
1.430511474609375
0.7152557373046875
0.7152557373046875
0.2384185791015625
0.476837158203125
g:
1.9073486328125
0.7152557373046875
0.2384185791015625
0.2384185791015625
0.476837158203125
My first, naive guess was that Python might recognize the reverse construct and create something like a reverse iterator and return that (Python can work with references, right? Maybe it was using some kind of optimization here). But I don't think that theory makes sense, since the original list and the returned list are not the same (changing one shouldn't change the other).
So my question(s) here is(are):
Is my profiling technique here flawed? Have I written the tests in a way that favors one language over the other?
What is the difference in implementation of lists and their reversal in Erlang vs Python that make this situation (of Python being WAY faster) possible?
Thanks for your time (in advance).
This seems to suggest that after the function has been loaded and run
once, the length of the input makes no difference and the reversal
times are incredibly fast - in the range of ~0.7µs.
Because your profiling function is incorrect. It accepts variable positional arguments, but when it passes them to the function it doesn't unpack them, so you are only ever passing a single tuple of length one. You need to do the following:
def profile(func, *args):
    start = time.time()
    func(*args)  # Make sure to unpack the args!
    end = time.time()
    elapsed_seconds = end - start
    print(elapsed_seconds * ms_conv_factor, flush=True)
So notice the difference:
>>> def foo(*args):
...     print(args)
...     print(*args)
...
>>> foo(1,2,3)
(1, 2, 3)
1 2 3
Also note that reversed(m) creates a reversed iterator, so it doesn't actually do anything until you iterate over it; g will still be constant time.
But rest assured, reversing a list in Python takes linear time.
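To actually time the reversal with reversed, you can force the iterator with list(); a small sketch on top of the fixed profile:

def g_forced(m):
    return list(reversed(m))  # consuming the iterator copies all n elements

profile(g_forced, x)  # now scales linearly with len(x), like f

With that, both f and g_forced show the same linear behaviour you measured in Erlang.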

Performant way to partially apply in Python?

I am looking for a way to partially apply functions in Python that is simple to understand, readable, reusable, and as resistant to coder mistakes as possible. Most of all I want the style to be as performant as possible: fewer frames on the stack is nice, and a smaller memory footprint for the partially applied functions is also desirable. I have considered 4 styles and written examples below:
import functools

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def divide(n, d):
    return n/d

def divider(d):
    return functools.partial(divide, d=d)

times2 = multiplier(2)
print(times2(3)) # 6

by2 = divider(2)
print(by2(6)) # 3.0

by3 = functools.partial(divide, d=3)
print(by3(9)) # 3.0

by4 = lambda n: divide(n, 4)
print(by4(12)) # 3.0
My analysis of them is:
times2 is a nested thing. I guess Python makes a closure with m bound, and everything is nice. The code is readable (I think) and simple to understand. No external libraries. This is the style I use today.
by2 has an explicit named function, which makes it simple for the user. It uses functools, so it costs you an extra import. I like this style to some extent since it is transparent, and it gives me the option to use divide in other ways if I want to. Contrast this with inner, which is not reachable.
by3 is like by2, but forces the reader of the code to be comfortable with functools.partial, since they have it right in the face. What I like less is that PyCharm can't give me tooltips for what the arguments to functools.partial should be, since they are effectively arguments to by3. I have to know the signature of divide myself every time I define a new partial application.
by4 is simple to type, since I can get autocompletion. It needs no import of functools. I think it looks non-Pythonic, though. Also, I always feel uncomfortable about how scoping of variables / closures with lambdas works in Python; I'm never sure how that behaves.
What is the logical difference between the styles and how does that affect memory and CPU?
The first way seems to be the most efficient. I tweaked your code so that all 4 functions compute exactly the same mathematical function:
import functools, timeit

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def mult(x, m):
    return m * x

def multer(m):
    return functools.partial(mult, m=m)

f1 = multiplier(2)
f2 = multer(2)
f3 = functools.partial(mult, m=2)
f4 = lambda x: mult(x, 2)

print(timeit.timeit('f1(10)', setup='from __main__ import f1'))
print(timeit.timeit('f2(10)', setup='from __main__ import f2'))
print(timeit.timeit('f3(10)', setup='from __main__ import f3'))
print(timeit.timeit('f4(10)', setup='from __main__ import f4'))
Typical output (on my machine):
0.08207898699999999
0.19439769299999998
0.20093803199999993
0.1442435820000001
The two functools.partial approaches are identical (since one of them is just a wrapper around the other), the first is twice as fast, and the last is somewhere in between (but closer to the first). There is clear overhead in using functools over a straightforward closure. Since the closure approach is arguably more readable as well (and more flexible than the lambda, which doesn't extend well to more complicated functions), I would just go with it.
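A plausible account of where the gap comes from (my reading, not separately profiled): the closure runs its body directly in a single Python call, while the partial versions must merge the stored keyword argument with the incoming positional one and then call into mult, and keyword-argument calls are themselves slower than plain positional calls. The indirection is visible on the partial object:

import functools

def mult(x, m):
    return m * x

f3 = functools.partial(mult, m=2)
print(f3.func, f3.args, f3.keywords)  # mult, (), {'m': 2}: every f3(10) call
                                      # re-enters mult with the merged arguments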
Technically you're missing one other option: operator.mul does the same thing you're looking to do, and you can just use functools.partial on it to get a default first argument without reinventing the wheel.
Not only is it the fastest option, it also uses the least space compared to a custom function or a lambda expression. The fact that it's a partial is why it uses the same space as the others, and I think that's the best route here.
from timeit import timeit
from functools import partial
from sys import getsizeof
from operator import mul

def multiplier(m):
    def inner(x):
        return m * x
    return inner

def mult(x, m):
    return m * x

def multer(m):
    return partial(mult, m=m)

f1 = multiplier(2)
f2 = multer(2)
f3 = partial(mult, m=2)
f4 = lambda n: mult(n, 2)
f5 = partial(mul, 2)

from_main = 'from __main__ import {}'.format
print(timeit('f1(10)', from_main('f1')), getsizeof(f1))
print(timeit('f2(10)', from_main('f2')), getsizeof(f2))
print(timeit('f3(10)', from_main('f3')), getsizeof(f3))
print(timeit('f4(10)', from_main('f4')), getsizeof(f4))
print(timeit('f5(10)', from_main('f5')), getsizeof(f5))
Output
0.5278953390006791 144
1.0804575479996856 96
1.0762036349988193 96
0.9348237040030654 144
0.3904160970050725 96
This should answer your question as far as memory usage and speed.

multiprocessing.Pool.map() not working as expected

I understand from simple examples that Pool.map is supposed to behave identically to the 'normal' Python code below, except in parallel:

def f(x):
    # complicated processing
    return x + 1

y_serial = []
x = range(100)
for i in x: y_serial += [f(i)]
y_parallel = pool.map(f, x)
# y_serial == y_parallel!
However I have two bits of code that I believe should follow this example:
# Linear version
price_datas = []
for csv_file in loop_through_zips(data_directory):
    price_datas += [process_bf_data_csv(csv_file)]

# Parallel version
p = Pool()
price_data_parallel = p.map(process_bf_data_csv, loop_through_zips(data_directory))
However, the parallel code doesn't work, whereas the linear code does. From what I can observe, the parallel version appears to be looping through the generator (it's printing log lines from the generator function) but then not actually performing the process_bf_data_csv function. What am I doing wrong here?
.map tries to pull all values from your generator to form an iterable before actually starting the work.
Try waiting longer (until the generator runs out), or use multithreading and a queue instead.
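If the waiting is the issue, one thing that may help (a sketch, not tested against your data) is Pool.imap: it still drains the generator to enqueue tasks, but hands results back as workers finish, so you can see progress immediately:

from multiprocessing import Pool

p = Pool()
price_datas = []
for result in p.imap(process_bf_data_csv, loop_through_zips(data_directory)):
    price_datas.append(result)  # results arrive incrementally, in input order
p.close()
p.join()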

Properly lock a multiprocessing.Pool over an iterator without memory overhead

I have an iterable object Z in Python which is too large to fit into memory. I would like to perform a parallel calculation over this object and write the results, in the order they appear in Z, to a file. Consider this silly example:
import numpy as np
import multiprocessing as mp
import itertools as itr

FOUT = open("test", 'w')

def f(x):
    val = hash(np.random.random())
    FOUT.write("%s\n" % val)

N = 10**9
Z = itr.repeat(0, N)

P = mp.Pool()
P.map(f, Z, chunksize=50)
P.close()
P.join()

FOUT.close()
There are two major problems with this:
multiple results can be written to the same line
a result with N objects in it is returned; this will be too big to hold in memory (and we don't need it!)
What I've tried:
Using a global lock mp.Lock() to share the FOUT resource: it doesn't help, because I think each worker creates its own namespace.
Using apply_async instead of map: while the callback fixes problems 1 and 2, apply_async doesn't accept an iterable object.
Using imap instead of map and iterating over the results:
Something like:
def f(x):
    val = hash(np.random.random())
    return val

P = mp.Pool()
C = P.imap(f, Z, chunksize=50)
for x in C:
    FOUT.write("%s\n" % x)
This still uses inordinate amounts of memory, though I'm not sure why.

Is there a simple process-based parallel map for python?

I'm looking for a simple process-based parallel map for Python, that is, a function
parmap(function, [data])
that would run the function on each element of [data] in a different process (well, on a different core, but AFAIK the only way to run stuff on different cores in Python is to start multiple interpreters) and return a list of results.
Does something like this exist? I would like something simple, so a simple module would be nice. Of course, if no such thing exists, I will settle for a big library :-/
It seems like what you need is the map method of multiprocessing.Pool():
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only
one iterable argument though). It blocks till the result is ready.
This method chops the iterable into a number of chunks which it submits to the
process pool as separate tasks. The (approximate) size of these chunks can be
specified by setting chunksize to a positive integer.
For example, if you wanted to map this function:
def f(x):
    return x**2
to range(10), you could do it using the built-in map() function:
map(f, range(10))
or using a multiprocessing.Pool() object's method map():
import multiprocessing
pool = multiprocessing.Pool()
print pool.map(f, range(10))
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your map function with the @ray.remote decorator and then invoke it with .remote. This ensures that every instance of the remote function is executed in a different process.
import time
import ray

ray.init()

# Define the function you want to apply map on, as a remote function.
@ray.remote
def f(x):
    # Do some work...
    time.sleep(1)
    return x*x

# Define a helper parmap(f, list) function.
# This function executes a copy of f() on each element in "list".
# Each copy of f() runs in a different process.
# Note f.remote(x) returns a future of its result (i.e.,
# an identifier of the result) rather than the result itself.
def parmap(f, list):
    return [f.remote(x) for x in list]

# Call parmap() on a list consisting of the first 5 integers.
result_ids = parmap(f, range(1, 6))

# Get the results
results = ray.get(result_ids)
print(results)
This will print:
[1, 4, 9, 16, 25]
and it will finish in approximately len(list)/p seconds (rounded up to the nearest integer), where p is the number of cores on your machine. Assuming a machine with 2 cores, our example will execute in 5/2 rounded up, i.e., in approximately 3 seconds.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Python 3's Pool class has a map() method, and that's all you need to parallelize map:
from multiprocessing import Pool

with Pool() as P:
    xtransList = P.map(some_func, a_list)
with Pool() as P creates a process pool, and P.map executes the function on each item of the list in parallel. You can provide the number of cores:
with Pool(processes=4) as P:
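One caveat (my addition): on platforms that spawn workers instead of forking (Windows, and macOS on newer Pythons), the pool must be created under an import guard, or the workers will re-execute the module:

from multiprocessing import Pool

def some_func(x):
    return x * x

if __name__ == "__main__":  # required where the start method is "spawn"
    with Pool(processes=4) as P:
        print(P.map(some_func, range(10)))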
For those looking for a Python equivalent of R's mclapply(), here is my implementation. It is an improvement on the following two examples:
"Parallelize Pandas map() or apply()", as mentioned by @Rafael Valero.
How to apply map to functions with multiple arguments.
It can be applied to map functions with either single or multiple arguments.
import numpy as np, pandas as pd
from scipy import sparse
import functools, multiprocessing
from multiprocessing import Pool

num_cores = multiprocessing.cpu_count()

def parallelize_dataframe(df, func, U=None, V=None):
    #blockSize = 5000
    num_partitions = 5 # int( np.ceil(df.shape[0]*(1.0/blockSize)) )
    blocks = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    if V is not None and U is not None:
        # apply func with multiple arguments to dataframe (i.e. involves multiple columns)
        df = pd.concat(pool.map(functools.partial(func, U=U, V=V), blocks))
    else:
        # apply func with one argument to dataframe (i.e. involves single column)
        df = pd.concat(pool.map(func, blocks))
    pool.close()
    pool.join()
    return df
def square(x):
    return x**2

def test_func(data):
    print("Process working on: ", data.shape)
    data["squareV"] = data["testV"].apply(square)
    return data

def vecProd(row, U, V):
    return np.sum( np.multiply(U[int(row["obsI"]),:], V[int(row["obsJ"]),:]) )

def mProd_func(data, U, V):
    data["predV"] = data.apply( lambda row: vecProd(row, U, V), axis=1 )
    return data

def generate_simulated_data():
    N, D, nnz, K = [302, 184, 5000, 5]
    I = np.random.choice(N, size=nnz, replace=True)
    J = np.random.choice(D, size=nnz, replace=True)
    vals = np.random.sample(nnz)
    sparseY = sparse.csc_matrix((vals, (I, J)), shape=[N, D])
    # Generate parameters U and V which could be used to reconstruct the matrix Y
    U = np.random.sample(N*K).reshape([N,K])
    V = np.random.sample(D*K).reshape([D,K])
    return sparseY, U, V
def main():
    Y, U, V = generate_simulated_data()

    # find row, column indices and observed values for sparse matrix Y
    (testI, testJ, testV) = sparse.find(Y)
    colNames = ["obsI", "obsJ", "testV", "predV", "squareV"]
    dtypes = {"obsI":int, "obsJ":int, "testV":float, "predV":float, "squareV": float}
    obsValDF = pd.DataFrame(np.zeros((len(testV), len(colNames))), columns=colNames)
    obsValDF["obsI"] = testI
    obsValDF["obsJ"] = testJ
    obsValDF["testV"] = testV
    obsValDF = obsValDF.astype(dtype=dtypes)
    print("Y.shape: {!s}, #obsVals: {}, obsValDF.shape: {!s}".format(Y.shape, len(testV), obsValDF.shape))

    # calculate the square of testVals
    obsValDF = parallelize_dataframe(obsValDF, test_func)

    # reconstruct prediction of testVals using parameters U and V
    obsValDF = parallelize_dataframe(obsValDF, mProd_func, U, V)
    print("obsValDF.shape after reconstruction: {!s}".format(obsValDF.shape))
    print("First 5 elements of obsValDF:\n", obsValDF.iloc[:5,:])

if __name__ == '__main__':
    main()
I know this is an old post, but just in case: I wrote a tool to make this super, super easy, called parmapper (I actually call it parmap in my own use, but the name was taken).
It handles a lot of the setup and teardown of processes and adds tons of features. In rough order of importance:
Can take lambda and other unpickleable functions
Supports starmap and other similar call methods, making it very easy to use directly.
Can split work amongst threads and/or processes
Includes features such as progress bars
It does incur a small cost but for most uses, that is negligible.
I hope you find it useful.
(Note: It, like map in Python 3+, returns an iterable so if you expect all results to pass through it immediately, use list())
