Python multiprocessing outside main module: slow computations

I have code that does multiprocessing in some inner part of a module that is NOT the main.py. To keep it simple, I have created a replica that consists of two files: main.py and f1_mod.py. My main.py looks something like:
import os
from time import time
import f1_mod

parallel = 1
N = 10000000

if __name__ == '__main__':
    if not parallel:
        time_start = time()
        res = f1_mod.single_thread_exec( N )
        time_end = time()
        print("-->time is ", time_end - time_start )
    else:
        time_start = time()
        res = f1_mod.parallel_exec( N )
        time_end = time()
        print("-->time is ", time_end - time_start )
    for j in range( 100 ):
        print( res[j] )
else:
    print("--> CHILD PROCESS ID: ", os.getpid() )
    pass
and the f1_mod.py would be:
import multiprocessing as mp
import numpy as np

def pool_worker_function( k, args ):
    d = args[0]
    return d["p"]*( k**2 ) + d["x"]**2

def single_thread_exec( N ):
    a = list( np.linspace( 0, N, N ) )
    d = { "p": 0.2, "x": 0.5 }
    result = []
    for k in a:
        result.append( d["p"]*( k**2 ) + d["x"]**2 )
    return result

def parallel_exec( N ):
    number_processors = mp.cpu_count()
    number_used_processors = number_processors - 1
    # try pool map method
    from itertools import repeat
    pool = mp.Pool( processes = number_used_processors )
    d = { "p": 0.2, "x": 0.5 }
    a = list( np.linspace( 0, N, N ) )
    args = ( d, )
    number_tasks = number_used_processors
    chunks = []
    n_chunk = int( ( len(a) - 1 )/number_tasks )
    for j in range( 0, number_tasks ):
        if ( j == number_tasks - 1 ):
            chunks.append( a[ j*n_chunk: ] )
        else:
            chunks.append( a[ j*n_chunk: j*n_chunk + n_chunk ] )
    result = pool.starmap( pool_worker_function, zip( a, repeat( args ) ) )
    pool.close()
    pool.join()
    return result
I have checked that both the serial and parallelized versions give the same results, except that the serial version is much faster than the multiprocessed one. In my real code this is sometimes the case, and the "args" tuple entering the worker function actually contains a big data container, with a much bigger dictionary that the operations read from. Can anyone explain why I observe this behaviour (i.e. slow performance when multiprocessing)? Data needs to be passed to the worker function every time, and this makes the worker function take a lot of arguments; maybe that is what slows the code down, given the IPC taking place?
The "repeat" in the args passed to the worker is used since all arguments passed in the tuple have to be the same for each worker; the only iterable is the list "a". Does anyone know how to do this multiprocessing efficiently? Also, note that the multiprocessing does not happen at the "main.py" level, but rather in "deep" functions inside a module within the logic of the code. I would appreciate some help here to better understand how this multiprocessing works! I am using a 4-core machine under Windows, and I know Windows does not support "fork"-like behaviour for multiprocessing. However, running the code in Ubuntu on my machine seems very slow too! Thanks!!

My best guess for the current code: you don't do enough work in pool_worker_function to offset the cost of communication/synchronization.
I may be wrong, but your code seems half finished: the chunks variable is never used. Anyway, passing big arrays (your chunks) around probably wouldn't be the best solution.
You may try to pass start/length parameters to your pool_worker_function, then create the array and do the computation there, but I'm not 100% sure it will be enough to offset the communication cost for the result. A sketch of that idea follows.
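Here is a minimal sketch of what I mean, reusing the d = { "p": 0.2, "x": 0.5 } constants from your f1_mod.py; the names chunk_worker and parallel_exec_chunked are mine, for illustration only:

import multiprocessing as mp
import numpy as np

def chunk_worker( start, stop, n_total, d ):
    # rebuild just this slice of np.linspace( 0, n_total, n_total ) locally,
    # so only two ints and a small dict cross the process boundary
    step = n_total/( n_total - 1.0 )
    k = np.arange( start, stop )*step
    return d["p"]*( k**2 ) + d["x"]**2

def parallel_exec_chunked( N, n_workers = 3 ):
    d = { "p": 0.2, "x": 0.5 }
    bounds = [ ( j*N )//n_workers for j in range( n_workers + 1 ) ]
    with mp.Pool( processes = n_workers ) as pool:
        parts = pool.starmap( chunk_worker,
            [ ( bounds[j], bounds[j+1], N, d ) for j in range( n_workers ) ] )
    return np.concatenate( parts )

Each task now ships two integers and a tiny dict instead of a slice of "a", and each worker sends back one NumPy array, which pickles far more cheaply than a long list of Python floats.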

Related

Can't use Python multiprocessing with large amount of calculations

I have to speed up my current code to do around 10^6 operations in a feasible time. Before using multiprocessing in the actual document, I tried to do it in a mock case. Following is my attempt:
import time
import concurrent.futures
import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

t1 = time.time()
List = []
# in the real case these 20's can go as high as 150
for i in range(1, 20-2):
    for k in range(i+1, 20-1):
        for j in range(k+1, 20):
            List.append((i, k, j))
t3 = time.time()
test = []
List = chunkIt(List, 3)
if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1
However, when I increase the size of my "List", my computer tries to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it would only use 3 of my cores. However, nothing changed. Also, when I tried it on a smaller data set, I noticed the code ran much slower than it did on a single core.
I am still very new to multiprocessing in Python; am I doing something wrong? I would appreciate it if you could help me.
To reduce memory usage, I suggest you instead use the multiprocessing module, and specifically the imap method (or the imap_unordered method). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. This means that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as processors in the pool become free and ready to execute more tasks, the generator will be called upon to generate the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable. When using map with the default chunksize of None, the pool looks at the length of the iterable, first converting it to a list if necessary, and then computes what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know (more or less) what that size would be if the iterable were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have used, and that is what is done below.
The following benchmarks perform the same processing, first as a single process and then with multiprocessing using imap, where each invocation of do_something on my desktop takes approximately 0.5 seconds. do_something has now been modified to process just a single i, k, j tuple, as there is no longer any need to break anything up into smaller lists:
from multiprocessing import Pool, cpu_count
import time

def half_second():
    HALF_SECOND_ITERATIONS = 10_000_000
    sum = 0
    for _ in range(HALF_SECOND_ITERATIONS):
        sum += 1
    return sum

def do_something(tpl):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    half_second() # on my desktop
    return tpl[0]**2, tpl[1]**2, tpl[2]**2

"""
def generate_tpls():
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                yield i, k, j
"""

# Use a smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
    # 64 tuples:
    for i in range(1, 5):
        for k in range(1, 5):
            for j in range(1, 5):
                yield i, k, j

def benchmark1():
    """ single processing """
    t = time.time()
    for tpl in generate_tpls():
        result = do_something(tpl)
    print('benchmark1 time:', time.time() - t)

def compute_chunksize(iterable_size, pool_size):
    """ This is more-or-less the function used by the Pool.map method """
    chunksize, remainder = divmod(iterable_size, 4 * pool_size)
    if remainder:
        chunksize += 1
    return chunksize

def benchmark2():
    """ multiprocessing """
    t = time.time()
    pool_size = cpu_count() # 8 logical cores (4 physical cores)
    N_TUPLES = 64 # number of tuples that will be generated
    pool = Pool(pool_size)
    chunksize = compute_chunksize(N_TUPLES, pool_size)
    for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
        pass
    print('benchmark2 time:', time.time() - t)

if __name__ == '__main__':
    benchmark1()
    benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651
The nested for loops creating the array before the if __name__ == '__main__': guard appear to be the problem: with the spawn start method, every worker process re-imports the module and re-runs that top-level code. Moving that part underneath the guard clears up the memory problems.
import time
import concurrent.futures
import numpy as np

def chunkIt(seq, num):
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def do_something(List):
    # in the real case this function takes about 0.5 seconds to finish for each iteration
    turn = []
    for e in List:
        turn.append((e[0]**2, e[1]**2, e[2]**2))
    return turn

if __name__ == '__main__':
    t1 = time.time()
    List = []
    # in the real case these 20's can go as high as 150
    for i in range(1, 20-2):
        for k in range(i+1, 20-1):
            for j in range(k+1, 20):
                List.append((i, k, j))
    t3 = time.time()
    test = []
    List = chunkIt(List, 3)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(do_something, List)
        for result in results:
            test.append(result)
    test = np.array(test)
    t2 = time.time()
    T = t2 - t1
    T2 = t3 - t1

Parallel Processing in Python with nested loop

Due to a performance issue, I would like to run my function in parallel in Python:
import multiprocessing as mp
# nx is networkx and Directed_G is a DiGraph defined elsewhere in the code
source_nodes = [10413173, 10414530, 10414530, 10437199]
sink_nodes = [10420346, 10438770, 10438711, 10414530, 10436258]

def createpath(source, sink):
    path = []
    for i in source:
        for j in sink:
            path = path + list(nx.all_simple_paths(Directed_G, i, j))
    return path
From my understanding, I must give one iterable to the apply function, but my idea was to do something like:
results = [pool.apply(createpath, args=(source_nodes, sink_nodes))]
and then not give any iterable object to the apply function.
I managed to get it to work, but I don't think it runs in parallel.
Do you think I should move the apply call inside the first loop?
from multiprocessing import Pool

source_nodes = [1, 2, 3, 4, 5, 6]
sink_nodes = [1, 1, 1, 1, 1, 1, 1, 1, 1]

def sum_values(parameter_tuple):
    source, sink, start, stop = parameter_tuple
    out = 0
    for i in range(start, stop):
        val_i = source[i]
        for j in sink:
            out += val_i*j
    return out

if __name__ == "__main__":
    params = (source_nodes, sink_nodes, 0, 6)
    print(sum_values(params))
    with Pool(2) as p:
        print(p.map(sum_values, [
            (source_nodes, sink_nodes, 0, 3),
            (source_nodes, sink_nodes, 3, 6),
        ]))
You can try to run this one. It runs in parallel with the map pattern on a pool of 2 worker processes. In this case your overall result is the sum of the results from each process in the pool.

Multi-threaded Python application slower than single-threaded implementation

I wrote this program to properly learn how to use multi-threading. I want to implement something similar to this in my own program:
import numpy as np
import time
import os
import math
import random
from threading import Thread

def powExp(x, r):
    for c in range(x.shape[1]):
        x[r][c] = math.pow(100, x[r][c])

def main():
    print()
    rows = 100
    cols = 100
    x = np.random.random((rows, cols))
    y = x.copy()
    start = time.time()
    threads = []
    for r in range(x.shape[0]):
        t = Thread(target = powExp, args = (x, r))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    end = time.time()
    print("Multithreaded calculation took {n} seconds!".format(n = end - start))
    start = time.time()
    for r in range(y.shape[0]):
        for c in range(y.shape[1]):
            y[r][c] = math.pow(100, y[r][c])
    end = time.time()
    print("Singlethreaded calculation took {n} seconds!".format(n = end - start))
    print()
    randRow = random.randint(0, rows - 1)
    randCol = random.randint(0, cols - 1)
    print("Checking random indices in x and y:")
    print("x[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = x[randRow][randCol]))
    print("y[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = y[randRow][randCol]))
    print()
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            if(x[r][c] != y[r][c]):
                print("ERROR NO WORK WAS DONE")
                print("x[{r}][{c}]: {n} == y[{r}][{c}]: {ny}".format(
                    r = r,
                    c = c,
                    n = x[r][c],
                    ny = y[r][c]
                ))
                quit()
    assert(np.array_equal(x, y))

if __name__ == '__main__':
    main()
As you can see from the code, the goal here is to parallelize the operation math.pow(100, x[r][c]) by creating a thread for every row. However, this code is extremely slow, a lot slower than the single-threaded version.
Output:
Multithreaded calculation took 0.026447772979736328 seconds!
Singlethreaded calculation took 0.006798267364501953 seconds!
Checking random indices in x and y:
x[58][58]: = 9.792315687115973
y[58][58]: = 9.792315687115973
I searched through Stack Overflow and found some info about the GIL forcing Python bytecode to be executed on a single core only. However, I'm not sure that this is in fact what is limiting my parallelization. I tried rearranging the parallelized for-loop to use pools instead of threads. Nothing seems to work.
Python code performance decreases with threading
EDIT: This thread discusses the same issue. Is it completely impossible to increase performance using multi-threading in python because of the GIL? Is the GIL causing my slowdowns?
EDIT 2 (2017-01-18): From what I can gather after searching online for quite a bit, it seems like Python is really bad at this kind of parallelism. What I'm trying to do is parallelize a Python function used in a neural network implemented in TensorFlow... it seems like adding a custom op is the way to go.
The number of issues here is quite... numerous. Too many (system!) threads that each do too little work, the GIL, and so on. This is what I consider a really good introduction to parallelism in Python:
https://www.youtube.com/watch?v=MCs5OvhV9S4
Live coding is awesome.
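Not from the talk, but to make the "too little work per thread" point concrete, here is a minimal sketch of the same computation with multiprocessing.Pool, one sizeable chunk of rows per process instead of one tiny thread per row. The helper name pow_rows is mine; note also that for a 100x100 array a single vectorized np.power(100, x) call with no parallelism at all would be faster still:

import numpy as np
from multiprocessing import Pool

def pow_rows(chunk):
    # one sizeable piece of work per task, so the process
    # startup and pickling overhead has a chance to pay off
    return np.power(100, chunk)

if __name__ == '__main__':
    x = np.random.random((100, 100))
    with Pool(4) as pool:
        # split the rows into 4 chunks, one per worker process
        result = np.vstack(pool.map(pow_rows, np.array_split(x, 4)))
    assert np.allclose(result, np.power(100, x))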

Can Python threads work on the same process?

I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs, but it ends almost immediately (runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable, but I was worried about the threads interfering with each other (i.e. the small chance that two threads double-count or skip a number due to strange timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this, please share. This is my simple way of trying to get some experience with multi-threading. Thanks.
import threading
import time

start_time = time.time()

tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0

def Func(x,y,tot):
    tot = 0
    i = y-x
    while z in range(0,i):
        tot = tot + i + z

# class Tester(threading.Thread):
#     def run(self):
#         print(n)

w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))

w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()

# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}

total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))
You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function, and it's only by luck (bad luck, really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and it's not!
The next problem is the global variables. First, you are absolutely right that using a single global variable is not thread safe: the threads would mess with each other's calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), Python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot, and it is discarded when the function completes.
A solution is to use a global container to hold the calculations, as shown in the example below. Unfortunately, this is actually slower than doing all the work in one thread. The Python global interpreter lock (GIL) only lets one thread run at a time, and it only hurts CPU-intensive tasks implemented in pure Python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting a process and passing it data.
Here is a working copy of your example:
import threading
import time

start_time = time.time()

tot = [0] * 4

def Func(x,y,tot_index):
    my_total = 0
    i = y-x
    for z in range(0,i):
        my_total = my_total + i + z
    tot[tot_index] = my_total

# class Tester(threading.Thread):
#     def run(self):
#         print(n)

w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))

w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()

# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}

total = sum(tot)
print total
print("--- %s seconds ---" % (time.time() - start_time))
You can pass in a mutable object to collect your results, either keyed by an identifier (e.g. a dict) or just a list you append() the results to, e.g.:
import threading

def Func(start, stop, results):
    results.append(sum(range(start, stop+1)))

rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
    j.start()
for j in jobs:
    j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop
As others have noted, you could look at multiprocessing in order to split the work across multiple processes that can run in parallel. This especially benefits CPU-intensive tasks, assuming there isn't a huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool

POOL_SIZE = 4
NUMBERS = 90000

def func(_range):
    tot = 0
    for z in range(*_range):
        tot += z
    return tot

if __name__ == '__main__':  # guard needed so worker processes can safely import this module
    with Pool(POOL_SIZE) as pool:
        chunk_size = int(NUMBERS / POOL_SIZE)
        chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
        print(sum(pool.imap(func, chunks)))
In the above, chunks is a generator that produces the same ranges that were hardcoded in the original version. It's given to imap, which works like the standard map except that it executes the function in the processes within the pool.
A lesser-known fact about multiprocessing is that you can easily convert the code to use threads instead of processes via the undocumented multiprocessing.pool.ThreadPool. To convert the above example to use threads, just change the import to:
from multiprocessing.pool import ThreadPool as Pool

Multiprocessing in python to speed up functions

I am confused about Python multiprocessing.
I am trying to speed up a function which processes strings from a database, but I must have misunderstood how multiprocessing works, because the function takes longer when given to a pool of workers than with “normal processing”.
Here is an example of what I am trying to achieve.
from time import clock, time
from multiprocessing import Pool, freeze_support
from random import choice

def foo(x):
    TupWerteMany = []
    for i in range(0,len(x)):
        TupWerte = []
        s = list(x[i][3])
        NewValue = choice(s)+choice(s)+choice(s)+choice(s)
        TupWerte.append(NewValue)
        TupWerte = tuple(TupWerte)
        TupWerteMany.append(TupWerte)
    return TupWerteMany

if __name__ == '__main__':
    start_time = time()
    List = [(u'1', u'aa', u'Jacob', u'Emily'),
            (u'2', u'bb', u'Ethan', u'Kayla')]
    List1 = List*1000000

    # METHOD 1 : NORMAL (takes 20 seconds)
    x2 = foo(List1)
    print x2[1:3]

    # METHOD 2 : APPLY_ASYNC (takes 28 seconds)
    # pool = Pool(4)
    # Werte = pool.apply_async(foo, args=(List1,))
    # x2 = Werte.get()
    # print '--------'
    # print x2[1:3]
    # print '--------'

    # METHOD 3 : MAP (!! DOES NOT WORK !!)
    # pool = Pool(4)
    # Werte = pool.map(foo, args=(List1,))
    # x2 = Werte.get()
    # print '--------'
    # print x2[1:3]
    # print '--------'

    print 'Time Elapsed: ', time() - start_time
My questions:
Why does apply_async take longer than the “normal way”?
What am I doing wrong with map?
Does it make sense to speed up such tasks with multiprocessing at all?
Finally: after all I have read here, I am wondering whether multiprocessing in Python works on Windows at all?
So your first problem is that there is no actual parallelism happening in foo(x); you are passing the entire list to the function once.
1)
The idea of a process pool is to have many processes doing computations on separate bits of some data.
# METHOD 2 : APPLY_ASYNC
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = range(size/jobs, size, size/jobs) + [size]
tails = range(0,size,size/jobs)
for tail,head in zip(tails, heads):
    werte = pool.apply_async(foo, args=(List1[tail:head],))
    results.append(werte)

pool.close()
pool.join() # wait for the pool to be done

for result in results:
    werte = result.get() # get the return value from the sub jobs
This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch the processes; that's for the case of four processes and four jobs to be done. These dynamics change, of course, if you've got 4 processes and 100 jobs to be done. Remember that you are creating a completely new Python interpreter four times; this isn't free.
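To put a rough number on that fixed cost, here is a quick sketch (mine, not the OP's) that times a pool doing essentially nothing, so everything measured is interpreter startup, pickling, and IPC:

from multiprocessing import Pool
from time import time

def noop(x):
    return x

if __name__ == '__main__':
    start = time()
    pool = Pool(4)
    pool.map(noop, range(4))
    pool.close()
    pool.join()
    # everything timed here is overhead; your chunks have to
    # out-work this fixed cost before you see any speedup
    print('pool startup/IPC overhead: %.3f seconds' % (time() - start))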
2) The problem you have with map is that it applies foo to EVERY element in List1 in a separate task, and that will take quite a while. So if your pool has 4 processes, map will pop an item off the list four times and send each to a process to be dealt with, wait for the processes to finish, pop some more items off the list, and wait again. This makes sense only if processing a single item takes a long time, for instance if every item is a file name pointing to a one-gigabyte text file. But as it stands, map will take a single string from the list and pass it to foo, whereas apply_async takes a slice of the list. Try the following code:
def foo(thing):
    print thing

map(foo, ['a','b','c','d'])
That's the built-in Python map, which runs in a single process, but the idea is exactly the same for the multiprocess version.
Added as per J.F. Sebastian's comment: you can, however, use the chunksize argument to map to specify an approximate size for each chunk:
pool.map(foo, List1, chunksize=size/jobs)
I don't know, though, whether there is a problem with map on Windows, as I don't have one available for testing.
3) Yes, given that your problem is big enough to justify firing up new Python interpreters.
4) Can't give you a definitive answer on that, as it depends on the number of cores/processors etc., but in general it should be fine on Windows.
On question (2):
With the guidance of Dougal and Matti, I figured out what went wrong.
The original foo function processes a whole list, while map requires a function that processes a single element.
The new function should be
def foo2(x):
    TupWerte = []
    s = list(x[3])
    NewValue = choice(s)+choice(s)+choice(s)+choice(s)
    TupWerte.append(NewValue)
    TupWerte = tuple(TupWerte)
    return TupWerte
and the block to call it:
jobs = 4
size = len(List1)
pool = Pool()
# Werte = pool.map(foo2, List1, chunksize=size/jobs)
Werte = pool.map(foo2, List1)
pool.close()
print Werte[1:3]
Thanks to all of you who helped me understand this.
Results of all methods:
For List * 2 million records: normal 13.3 seconds; parallel with apply_async: 7.5 seconds; parallel with map with chunksize: 7.3 seconds; parallel with map without chunksize: 5.2 seconds.
Here's a generic multiprocessing template if you are interested.
import multiprocessing as mp
import time

def worker(x):
    time.sleep(0.2)
    print "x= %s, x squared = %s" % (x, x*x)
    return x*x

def apply_async():
    pool = mp.Pool()
    for i in range(100):
        pool.apply_async(worker, args = (i, ))
    pool.close()
    pool.join()

if __name__ == '__main__':
    apply_async()
And the output looks like this:
x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144
As you can see, the numbers are not in order, as they are being executed asynchronously.
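If you need the results back in submission order, one option (my addition to the template above) is to keep the AsyncResult handles that apply_async returns and call get() on them in order; the workers still run asynchronously, but the collection is ordered:

import multiprocessing as mp
import time

def worker(x):
    time.sleep(0.2)
    return x*x

def apply_async_ordered():
    pool = mp.Pool()
    # keep the AsyncResult handles in submission order
    handles = [pool.apply_async(worker, args=(i,)) for i in range(100)]
    pool.close()
    pool.join()
    # get() returns each result, matched to its submission
    return [h.get() for h in handles]

if __name__ == '__main__':
    print(apply_async_ordered())  # [0, 1, 4, 9, ...]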
