Optimizing a parallel implementation of a list comprehension

Optimizing a parallel implementation of a list comprehension - python

I have a dataframe, where each row contains a list of integers. I also have a reference-list that I use to check what integers in the dataframe appear in this list.
I have made two implementations of this, one single-threaded and one multi-threaded. The single-threaded implementation is quite fast (takes roughly 0.1s on my machine), whereas the multithreaded takes roughly 5s.
My question is: Is this due to my implementation being poor, or is this merely a case where the overhead due to multithreading is so large that it doesn't make sense to use multiple threads?
The example is below:
import time
from random import randint
import pandas as pd
import multiprocessing
from functools import partial
class A:
def __init__(self, N):
self.ls = [[randint(0, 99) for i in range(20)] for j in range(N)]
self.ls = pd.DataFrame({'col': self.ls})
self.lst_nums = [randint(0, 99) for i in range(999)]
#classmethod
def helper(cls, lst_nums, col):
return any([s in lst_nums for s in col])
def get_idx_method1(self):
method1 = self.ls['col'].apply(lambda nums: any(x in self.lst_nums for x in nums))
return method1
def get_idx_method2(self):
pool = multiprocessing.Pool(processes=1)
method2 = pool.map(partial(A.helper, self.lst_nums), self.ls['col'])
pool.close()
return method2
if __name__ == "__main__":
a = A(50000)
start = time.time()
m1 = a.get_idx_method1()
end = time.time()
print(end-start)
start = time.time()
m2 = a.get_idx_method2()
end = time.time()
print(end - start)

First of all, multiprocessing is useful when the cost of data communication between the main process and the others is less comparable to the time cost of the function.
Another thing is that you made an error in your code:
def helper(cls, lst_nums, col):
return any([s in lst_nums for s in col])
VS
any(x in self.lst_nums for x in nums)
You have that list [] in the helper method, which will make the any() method to wait for the entire array to be computed, while the second any() will just stop at the first True value.
In conclusion if you remove list brackets from the helper method and maybe increase the randint range for lst_nums initializer, you will notice an increase in speed when using multiple processes.
self.lst_nums = [randint(0, 10000) for i in range(999)]
and
def helper(cls, lst_nums, col):
return any(s in lst_nums for s in col)

Related

Can't use Python multiprocessing with large amount of calculations

I have to speed up my current code to do around 10^6 operations in a feasible time. Before I used multiprocessing in my the actual document I tried to do it in a mock case. Following is my attempt:
def chunkIt(seq, num):
avg = len(seq) / float(num)
out = []
last = 0.0
while last < len(seq):
out.append(seq[int(last):int(last + avg)])
last += avg
return out
def do_something(List):
# in real case this function takes about 0.5 seconds to finish for each
iteration
turn = []
for e in List:
turn.append((e[0]**2, e[1]**2,e[2]**2))
return turn
t1 = time.time()
List = []
#in the real case these 20's can go as high as 150
for i in range(1,20-2):
for k in range(i+1,20-1):
for j in range(k+1,20):
List.append((i,k,j))
t3 = time.time()
test = []
List = chunkIt(List,3)
if __name__ == '__main__':
with concurrent.futures.ProcessPoolExecutor() as executor:
results = executor.map(do_something,List)
for result in results:
test.append(result)
test= np.array(test)
t2 = time.time()
T = t2-t1
T2 = t3-t1
However, when I increase the size of my "List" my computer tires to use all of my RAM and CPU and freezes. I even cut my "List" into 3 pieces so it will only use 3 of my cores. However, nothing changed. Also, when I tried to use it on a smaller data set I noticed the code ran much slower than when it ran on a single core.
I am still very new to multiprocessing in Python, am I doing something wrong. I would appreciate it if you could help me.

To reduce memory usage, I suggest you use instead the multiprocessing module and specifically the imap method method (or imap_unordered method). Unlike the map method of either multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor, the iterable argument is processed lazily. What this means is that if you use a generator function or generator expression for the iterable argument, you do not need to create the complete list of arguments in memory; as a processor in the pool become free and ready to execute more tasks, the generator will be called upon to generate the next argument for the imap call.
By default a chunksize value of 1 is used, which can be inefficient for a large iterable size. When using map and the default value of None for the chunksize argument, the pool will look at the length of the iterable first converting it to a list if necessary and then compute what it deems to be an efficient chunksize based on that length and the size of the pool. When using imap or imap_unordered, converting the iterable to a list would defeat the whole purpose of using that method. But if you know what that size would be (more or less) if the iterable were converted to a list, then there is no reason not to apply the same chunksize calculation the map method would have, and that is what is done below.
The following benchmarks perform the same processing first as a single process and then using multiprocessing using imap where each invocation of do_something on my desktop takes approximately .5 seconds. do_something now has been modified to just process a single i, k, j tuple as there is no longer any need to break up anything into smaller lists:
from multiprocessing import Pool, cpu_count
import time
def half_second():
HALF_SECOND_ITERATIONS = 10_000_000
sum = 0
for _ in range(HALF_SECOND_ITERATIONS):
sum += 1
return sum
def do_something(tpl):
# in real case this function takes about 0.5 seconds to finish for each iteration
half_second() # on my desktop
return tpl[0]**2, tpl[1]**2, tpl[2]**2
"""
def generate_tpls():
for i in range(1, 20-2):
for k in range(i+1, 20-1):
for j in range(k+1, 20):
yield i, k, j
"""
# Use smaller number of tuples so we finish in a reasonable amount of time:
def generate_tpls():
# 64 tuples:
for i in range(1, 5):
for k in range(1, 5):
for j in range(1, 5):
yield i, k, j
def benchmark1():
""" single processing """
t = time.time()
for tpl in generate_tpls():
result = do_something(tpl)
print('benchmark1 time:', time.time() - t)
def compute_chunksize(iterable_size, pool_size):
""" This is more-or-less the function used by the Pool.map method """
chunksize, remainder = divmod(iterable_size, 4 * pool_size)
if remainder:
chunksize += 1
return chunksize
def benchmark2():
""" multiprocssing """
t = time.time()
pool_size = cpu_count() # 8 logical cores (4 physical cores)
N_TUPLES = 64 # number of tuples that will be generated
pool = Pool(pool_size)
chunksize = compute_chunksize(N_TUPLES, pool_size)
for result in pool.imap(do_something, generate_tpls(), chunksize=chunksize):
pass
print('benchmark2 time:', time.time() - t)
if __name__ == '__main__':
benchmark1()
benchmark2()
Prints:
benchmark1 time: 32.261038303375244
benchmark2 time: 8.174998044967651

The nested For loops creating the array before the main definition appears to be the problem. Moving that part to underneath the main definition clears up any memory problems.
def chunkIt(seq, num):
avg = len(seq) / float(num)
out = []
last = 0.0
while last < len(seq):
out.append(seq[int(last):int(last + avg)])
last += avg
return out
def do_something(List):
# in real case this function takes about 0.5 seconds to finish for each
iteration
turn = []
for e in List:
turn.append((e[0]**2, e[1]**2,e[2]**2))
return turn
if __name__ == '__main__':
t1 = time.time()
List = []
#in the real case these 20's can go as high as 150
for i in range(1,20-2):
for k in range(i+1,20-1):
for j in range(k+1,20):
List.append((i,k,j))
t3 = time.time()
test = []
List = chunkIt(List,3)
with concurrent.futures.ProcessPoolExecutor() as executor:
results = executor.map(do_something,List)
for result in results:
test.append(result)
test= np.array(test)
t2 = time.time()
T = t2-t1
T2 = t3-t1

Multiprocessing in Python: Parallelize a for loop to fill a Numpy array

I've been reading threads like this one but any of them seems to work for my case. I'm trying to parallelize the following toy example to fill a Numpy array inside a for loop using Multiprocessing in Python:
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
return x**2 + y
def func2(n, parallel=False):
my_array = np.zeros((n))
# Parallelized version:
if parallel:
pool = Pool(processes=6)
for idx, val in enumerate(range(1, n+1)):
result = pool.apply_async(func1, [val])
my_array[idx] = result.get()
pool.close()
# Not parallelized version:
else:
for i in range(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(60000)
end = time.time()
print(my_array)
print("Normal time: {}\n".format(end-start))
start_parallel = time.time()
my_array_parallelized = func2(60000, parallel=True)
end_parallel = time.time()
print(my_array_parallelized)
print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
main()
The lines in the code based on Multiprocessing seem to work and give you the right results. However, it takes far longer than the non parallelized version. Here is the output of both versions:
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Normal time: 0.01605963706970215
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Time based on multiprocessing: 2.8775112628936768
My intuition tells me that it should be a better way of capturing results from pool.apply_async(). What am I doing wrong? What is the most efficient way to accomplish this? Thx.

Creating processes is expensive. On my machine it take at leas several hundred of microsecond per process created. Moreover, the multiprocessing module copy the data to be computed between process and then gather the results from the process pool. This inter-process communication is very slow too. The problem is that your computation is trivial and can be done very quickly, likely much faster than all the introduced overhead. The multiprocessing module is only useful when you are dealing with quite small datasets and perform intensive computation (compared to the amount of computed data).
Hopefully, when it comes to numericals computations using Numpy, there is a simple and fast way to parallelize your application: the Numba JIT. Numba can parallelize a code if you explicitly use parallel structures (parallel=True and prange). It uses threads and not heavy processes that are working in shared memory. Numba can overcome the GIL if your code does not deal with native types and Numpy arrays instead of pure Python dynamic object (lists, big integers, classes, etc.). Here is an example:
import numpy as np
import numba as nb
import time
#nb.njit
def func1(x, y=1):
return x**2 + y
#nb.njit('float64[:](int64)', parallel=True)
def func2(n):
my_array = np.zeros(n)
for i in nb.prange(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(60000)
end = time.time()
print(my_array)
print("Numba time: {}\n".format(end-start))
if __name__ == '__main__':
main()
Because Numba compiles the code at runtime, it is able to fully optimize the loop to a no-op resulting in a time close to 0 second in this case.

Here is the solution proposed by #thisisalsomypassword that improves my initial proposal. That is, "collecting the AsyncResult objects in a list within the loop and then calling AsyncResult.get() after all processes have started on each result object":
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
time.sleep(0.1)
return x**2 + y
def func2(n, parallel=False):
my_array = np.zeros((n))
# Parallelized version:
if parallel:
pool = Pool(processes=6)
####### HERE COMES THE CHANGE #######
results = [pool.apply_async(func1, [val]) for val in range(1, n+1)]
for idx, val in enumerate(results):
my_array[idx] = val.get()
#######
pool.close()
# Not parallelized version:
else:
for i in range(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(600)
end = time.time()
print(my_array)
print("Normal time: {}\n".format(end-start))
start_parallel = time.time()
my_array_parallelized = func2(600, parallel=True)
end_parallel = time.time()
print(my_array_parallelized)
print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
main()
Now it works. Time is reduced considerably with Multiprocessing:
Normal time: 60.107836008071
Time based on multiprocessing: 10.049324989318848
time.sleep(0.1) was added in func1 to cancel out the effect of being a super trivial task.

Using Multithreading or Multiprocessing to improve computational speed

I am iterating through very large file size [mesh]. Since the iteration is independent, I would like to split my mesh into smaller sizes and run them all at the same time in order to lower computation time. Below is a sample code. For example, if mesh is of length=50000, I would like to divide the mesh into 100 and run fun for each mesh/100 at the same time.
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
interpolate = fnc(mydata, mymesh)
I would like to know to achieve this using multiprocessing or multithreading as I'm unable to reconcile it with the execution of my loop.

This will give you the general idea. I couldn't test this since I do not have your data. The default constructor for ProcessPoolExecutor will use the number of processors on your computer. But since that determines the level of multiprocessing you can have, it will probably be more efficient to set the N_CHUNKS parameter to the number of simultaneous processes you can support. That is, if you have a processing pool size of 6, then it is better to just divide your array into 6 large chunks and have 6 processes do the work rather than breaking it up into smaller pieces where processes will have to wait to run. So you should probably specify a specific max_workers number to the ProcessPoolExecutor not greater than the number of processors you have and set N_CHUNKS to the same value.
from concurrent.futures import ProcessPoolExecutor, as_completed
import numpy as np
def fnc(data, mesh):
d = []
for i, dummy_val in enumerate(mesh):
d.append(np.sqrt((data[:, 0]-mesh[i, 0])**2.0 + (data[:, 1]-mesh[i, 1])**2.0))
return d
def main(data, mesh):
#N_CHUNKS = 100
N_CHUNKS = 6 # assuming you have 6 processors; see max_workers parameter
n = len(mesh)
assert n != 0
if n <= N_CHUNKS:
N_CHUNKS = 1
chunk_size = n
last_chunk_size = n
else:
chunk_size = n // N_CHUNKS
last_chunk_size = n - chunk_size * (N_CHUNKS - 1)
with ProcessPoolExcutor(max_workers=N_CHUNKS) as executor: # assuming you have 6 processors
the_futures = {}
start = 0
for i in range(N_CHUNKS - 1):
future = executor.submit(fnc, data, mesh[start:start+chunk_size]) # pass slice
the_futures[future] = (start, start+chunk_size) # map future to request parameters
start += chunk_size
if last_chunk_size:
future = executor.submit(fnc, data, mesh[start:n]) # pass slice
the_futures[future] = (start, start+n)
for future in as_completed(the_futures):
(start, end) = the_futures[future] # the original range
d = future.result() # do something with the results
if __name__ == '__main__':
# the call to main must be done in a block governed by if __name__ == '__main__' or you will get into a recursive
# loop where each subprocess calls main again
main(data, mesh)

Implement merge_sort with multiprocessing solution

I tried to write a merge sort with multiprocessing solution
from heapq import merge
from multiprocessing import Process
def merge_sort1(m):
if len(m) < 2:
return m
middle = len(m) // 2
left = Process(target=merge_sort1, args=(m[:middle],))
left.start()
right = Process(target=merge_sort1, args=(m[middle:],))
right.start()
for p in (left, right):
p.join()
result = list(merge(left, right))
return result
Test it with arr
In [47]: arr = list(range(9))
In [48]: random.shuffle(arr)
It repost error:
In [49]: merge_sort1(arr)
TypeError: 'Process' object is not iterable
What's the problem with my code?

merge(left, right) tries to merge two processes, whereas you presumably want to merge the two lists that resulted from each process. Note that return value of the function passed to Process is lost; it is a different process, not just a different thread, and you can't very easily shuffle data back to parent, so Python doesn't do that, by default. You need to be explicit and code such a channel yourself. Fortunately, there are multiprocessing datatypes to help you; for example, multiprocessing.Pipe:
from heapq import merge
import random
import multiprocessing
def merge_sort1(m, send_end=None):
if len(m) < 2:
result = m
else:
middle = len(m) // 2
inputs = [m[:middle], m[middle:]]
pipes = [multiprocessing.Pipe(False) for _ in inputs]
processes = [multiprocessing.Process(target=merge_sort1, args=(input, send_end))
for input, (recv_end, send_end) in zip(inputs, pipes)]
for process in processes: process.start()
for process in processes: process.join()
results = [recv_end.recv() for recv_end, send_end in pipes]
result = list(merge(*results))
if send_end:
send_end.send(result)
else:
return result
arr = list(range(9))
random.shuffle(arr)
print(merge_sort1(arr))

python-measure function time

I am having a problem with measuring the time of a function.
My function is a "linear search":
def linear_search(obj, item,):
for i in range(0, len(obj)):
if obj[i] == item:
return i
return -1
And I made another function that measures the time 100 times and adds all the results to a list:
def measureTime(a):
nl=[]
import random
import time
for x in range(0,100): #calculating time
start = time.time()
a
end =time.time()
times=end-start
nl.append(times)
return nl
When I'm using measureTime(linear_search(list,random.choice(range(0,50)))), the function always returns [0.0].
What can cause this problem? Thanks.

you are actually passing the result of linear_search into function measureTime, you need to pass in the function and arguments instead for them to be execute inside measureTime function like #martijnn2008 answer
Or better wise you can consider using timeit module to to the job for you
from functools import partial
import timeit
def measureTime(n, f, *args):
# return average runtime for n number of times
# use a for loop with number=1 to get all individual n runtime
return timeit.timeit(partial(f, *args), number=n)
# running within the module
measureTime(100, linear_search, list, random.choice(range(0,50)))
# if running interactively outside the module, use below, lets say your module name mymodule
mymodule.measureTime(100, mymodule.linear_search, mymodule.list, mymodule.random.choice(range(0,50)))

Take a look at the following example, don't know exactly what you are trying to achieve so I guessed it ;)
import random
import time
def measureTime(method, n, *args):
start = time.time()
for _ in xrange(n):
method(*args)
end = time.time()
return (end - start) / n
def linear_search(lst, item):
for i, o in enumerate(lst):
if o == item:
return i
return -1
lst = [random.randint(0, 10**6) for _ in xrange(10**6)]
repetitions = 100
for _ in xrange(10):
item = random.randint(0, 10**6)
print 'average runtime =',
print measureTime(linear_search, repetitions, lst, item) * 1000, 'ms'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Optimizing a parallel implementation of a list comprehension - python

Related

Can't use Python multiprocessing with large amount of calculations

Multiprocessing in Python: Parallelize a for loop to fill a Numpy array

Using Multithreading or Multiprocessing to improve computational speed

Implement merge_sort with multiprocessing solution

python-measure function time

Categories

Resources