Slow speed while parallelizing operation on pandas dataframe - python

I have a dataframe which I perform some operation on and print out. To do this, I have to iterate through each row.
for count, row in final_df.iterrows():
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file
I decided to parallelize this using the Python multiprocessing module:
import multiprocessing

def write_site_files(row, count):
    x = row['param_a']
    y = row['param_b']
    # Perform operation
    # Write to output file

pkg_num = 0
total_runs = final_df.shape[0]  # Total number of rows in final_df
threads = []

while pkg_num < total_runs or len(threads):
    if len(threads) < num_proc and pkg_num < total_runs:
        print(pkg_num, total_runs)
        t = multiprocessing.Process(target=write_site_files,
                                    args=[final_df.iloc[pkg_num], pkg_num])
        pkg_num = pkg_num + 1
        t.start()
        threads.append(t)
    else:
        for thread in threads:
            if not thread.is_alive():
                threads.remove(thread)
However, the latter (parallelized) method is way slower than the simple iteration-based approach. Is there anything I am missing?
thanks!

This will be way less efficient than doing it in a single process unless the actual operation takes a lot of time, like seconds per row.
Normally parallelization is the last tool in the box. After profiling, after local vectorization, after local optimization, then you parallelize.
You are spending time just doing the slicing, then spinning up new processes (which is generally a constant overhead), then pickling a single row (not clear how big it is from your example).
At the very least, you should chunk the rows, e.g. df.iloc[i*chunksize:(i+1)*chunksize].
There hopefully will be some support for parallel apply in 0.14, see here: https://github.com/pydata/pandas/issues/5751
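To make the chunking concrete, here is a minimal sketch assuming a worker that handles a whole chunk of rows per task; the chunk size, process count, and dataframe contents are placeholders, not the actual operation from the question:

import multiprocessing

import pandas as pd

def write_site_files_chunk(chunk):
    # handle a whole chunk of rows per task to amortise the process
    # start-up and pickling overhead
    for count, row in chunk.iterrows():
        x = row['param_a']
        y = row['param_b']
        # Perform operation
        # Write to output file

if __name__ == '__main__':
    # stand-in dataframe; replace with final_df from the question
    final_df = pd.DataFrame({'param_a': range(1000), 'param_b': range(1000)})
    chunksize = 100
    chunks = [final_df.iloc[i:i + chunksize]
              for i in range(0, final_df.shape[0], chunksize)]
    pool = multiprocessing.Pool(processes=4)
    pool.map(write_site_files_chunk, chunks)
    pool.close()
    pool.join()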

random.random() generates same number in multiprocessing

I'm working on an optimization problem, and you can see a simplified version of my code posted below (the original code is too complicated to post for a question like this, and I hope my simplified version reproduces the original behaviour as closely as possible).
My purpose:
use the function foo inside the function optimization, but foo can take a very long time in some hard cases. So I use multiprocessing to set a time limit on the execution of the function (proc.join(iter_time); the method is from an answer to this question: How to limit execution time of a function call?).
My problem:
In the while loop, the generated values for extra are the same every time.
The list lst's length is always 1, which means every iteration of the while loop starts from an empty list.
My guess: a possible reason is that each time I create a process the random seed starts counting from the beginning, and each time the process is terminated some garbage-collection mechanism cleans up the memory the process used, so the list is cleared.
My question
Does anyone know the real reason for these problems?
If I don't use multiprocessing, is there any other way I can achieve my goal while generating different random numbers? By the way, I have tried func_timeout, but it has other problems that I cannot handle...
import multiprocessing
import random
import time

random.seed(123)
lst = []  # a global list for logging data

def foo(epoch):
    ...
    extra = random.random()
    lst.append(epoch + extra)
    ...

def optimization(loop_time, iter_time):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = multiprocessing.Process(target=foo, args=(epoch,))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within the time limit
            print("Time out!")
            proc.terminate()

if __name__ == '__main__':
    optimization(300, 2)
You need to use shared memory if you want to share variables across processes, because child processes do not share their memory space with the parent. The simplest way to do that here is to use a managed list and delete the line where you set the random seed. Setting the seed is what causes the same number to be generated, because all child processes take the same seed when generating random numbers. To get different random numbers, either don't set a seed, or pass a different seed to each process:
import time, random
from multiprocessing import Manager, Process

def foo(epoch, lst):
    extra = random.random()
    lst.append(epoch + extra)

def optimization(loop_time, iter_time, lst):
    start = time.time()
    epoch = 0
    while time.time() <= start + loop_time:
        proc = Process(target=foo, args=(epoch, lst))
        proc.start()
        proc.join(iter_time)
        if proc.is_alive():  # if the process is not terminated within the time limit
            print("Time out!")
            proc.terminate()
    print(lst)

if __name__ == '__main__':
    manager = Manager()
    lst = manager.list()
    optimization(10, 2, lst)
Output
[0.2035898948744943, 0.07617925389396074, 0.6416754412198231, 0.6712193790613651, 0.419777147554235, 0.732982735576982, 0.7137712131028766, 0.22875414425414997, 0.3181113880578589, 0.5613367673646847, 0.8699685474084119, 0.9005359611195111, 0.23695341111251134, 0.05994288664062197, 0.2306562314450149, 0.15575356275408125, 0.07435292814989103, 0.8542361251850187, 0.13139055891993145, 0.5015152768477814, 0.19864873743952582, 0.2313646288041601, 0.28992667535697736, 0.6265055915510219, 0.7265797043535446, 0.9202923318284002, 0.6321511834038631, 0.6728367262605407, 0.6586979597202935, 0.1309226720786667, 0.563889613032526, 0.389358766191921, 0.37260564565714316, 0.24684684162272597, 0.5982042933298861, 0.896663326233504, 0.7884030244369596, 0.6202229004466849, 0.4417549843477827, 0.37304274232635715, 0.5442716244427301, 0.9915536257041505, 0.46278512685707873, 0.4868394190894778, 0.2133187095154937]
Keep in mind that using managers will affect the performance of your code. As an alternative, you could use multiprocessing.Array, which is faster than a manager but less flexible in what data it can store, or a Queue.
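If you do want reproducibility, a minimal sketch of the second suggestion (passing a different seed to each process) could look like this; the seed offsets and the fixed five epochs are just illustrative:

import random
from multiprocessing import Manager, Process

def foo(epoch, seed, lst):
    random.seed(seed)  # each child process seeds its own generator
    lst.append(epoch + random.random())

if __name__ == '__main__':
    manager = Manager()
    lst = manager.list()
    # give every process a different (but reproducible) seed
    procs = [Process(target=foo, args=(epoch, 123 + epoch, lst)) for epoch in range(5)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(lst))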

How to multiprocess for loops in python where each calculation is independent?

I'm trying to learn something a little new in each mini-project I do. I've made a Game of Life (https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life) program.
This involves a numpy array where each point in the array (a "cell") has an integer value. To evolve the state of the game, you have to compute for each cell the sum of all its neighbour values (8 neighbours).
The relevant class in my code is as follows, where evolve() takes in one of the xxx_method methods. It works fine for conv_method and loop_method, but I want to use multiprocessing (which I've identified should work, unlike multithreading?) on loop_method to see any performance increase. I feel it should work as each calculation is independent. I've tried a naive approach, but don't really understand the multiprocessing module well enough. Could I also use it within the evolve() method, as again I feel that each calculation within the double for loop is independent?
Any help appreciated, including general code comments.
Edit - I'm getting a RuntimeError, which I'm half expecting as my understanding of multiprocessing isn't good enough. What needs to be done to the code to get it to work?
class GoL:
    """ Game Engine """
    def __init__(self, size):
        self.size = size
        self.grid = Grid(size)  # Grid is another class I've defined

    def evolve(self, neighbour_sum_func):
        new_grid = np.zeros_like(self.grid.cells)  # start with everything dead, only need to test for keeping/turning alive
        neighbour_sum_array = neighbour_sum_func()
        for i in range(self.size):
            for j in range(self.size):
                cell_sum = neighbour_sum_array[i, j]
                if self.grid.cells[i, j]:  # already alive
                    if cell_sum == 2 or cell_sum == 3:
                        new_grid[i, j] = 1
                else:  # test for dead coming alive
                    if cell_sum == 3:
                        new_grid[i, j] = 1
        self.grid.cells = new_grid

    def conv_method(self):
        """ Uses 2D convolution across the entire grid to work out the neighbour sum at each cell """
        kernel = np.array([
            [1, 1, 1],
            [1, 0, 1],
            [1, 1, 1]],
            dtype=int)
        neighbour_sum_grid = correlate2d(self.grid.cells, kernel, mode='same')
        return neighbour_sum_grid

    def loop_method(self, partition=None):
        """ Also works out the neighbour sum for each cell, using a more naive loop method """
        if partition is None:
            cells = self.grid.cells  # no multiprocessing, just work on the entire grid
        else:
            cells = partition  # just work on a set section of the grid
        neighbour_sum_grid = np.zeros_like(cells)  # copy
        for i, row in enumerate(cells):
            for j, cell_val in enumerate(row):
                neighbours = cells[i-1:i+2, j-1:j+2]
                neighbour_sum = np.sum(neighbours) - cell_val
                neighbour_sum_grid[i, j] = neighbour_sum
        return neighbour_sum_grid

    def multi_loop_method(self):
        cores = cpu_count()
        procs = []
        slices = []
        if cores == 2:  # for my VM, need to implement a generalised method for more cores
            half_grid_point = int(SQUARES / 2)
            slices.append(self.grid.cells[0:half_grid_point])
            slices.append(self.grid.cells[half_grid_point:])
        else:
            raise Exception
        for sl in slices:
            proc = Process(target=self.loop_method, args=(sl,))
            proc.start()
            procs.append(proc)
        for proc in procs:
            proc.join()
I want to use multiprocessing (which I've identified should work, unlike multithreading?)
Multithreading would not work because it would still run on a single processor, which is your current bottleneck. Multithreading is for cases where you are waiting for an API to answer; in the meantime you can do other calculations. But in Conway's Game of Life your program is constantly computing.
Getting multiprocessing right is hard. If you have 4 processors you can define a quadrant for each of them, but then you need to share the results between the processes, and that comes with a performance hit: the workers need to be synchronised on the same update tick, and the results need to be shared.
Multiprocessing starts being feasible when your grid is very big/there is a ton to calculate.
Since the question is very broad and complicated I cannot give you a better answer. There is a paper on getting parallel processing on Conway's Game of Life: http://www.shodor.org/media/content/petascale/materials/UPModules/GameOfLife/Life_Module_Document_pdf.pdf
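That said, the partitioning idea can be sketched roughly as follows, reusing correlate2d from the question's conv_method. The halo rows, worker count and grid size are illustrative assumptions, and spinning up a fresh Pool per call only pays off for large grids:

import numpy as np
from multiprocessing import Pool
from scipy.signal import correlate2d

KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]], dtype=int)

def neighbour_sums(slice_with_halo):
    # each worker gets its slice plus one extra row above and below,
    # so cells on the slice boundary still see all their neighbours
    return correlate2d(slice_with_halo, KERNEL, mode='same')

def parallel_neighbour_sums(cells, n_workers=2):
    n = cells.shape[0]
    bounds = [(i * n // n_workers, (i + 1) * n // n_workers) for i in range(n_workers)]
    slices = [cells[max(lo - 1, 0):hi + 1] for lo, hi in bounds]
    with Pool(n_workers) as pool:
        results = pool.map(neighbour_sums, slices)
    # strip the halo rows again and stitch the pieces back together
    trimmed = []
    for (lo, hi), res in zip(bounds, results):
        start = 1 if lo > 0 else 0
        trimmed.append(res[start:start + (hi - lo)])
    return np.vstack(trimmed)

if __name__ == '__main__':
    grid = np.random.randint(0, 2, size=(200, 200))
    print(parallel_neighbour_sums(grid).shape)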

Set the number of executions per second using Python's multiprocessing

I wrote a script in Python 3.6 that initially used a for loop to call an API, then put all the results into a pandas dataframe and wrote them to a SQL database (approximately 9,000 calls are made to that API every time the script runs).
Realising the calls inside the for loop were processed one-by-one, I decided to use the multiprocessing module to speed things up.
Therefore, I created a module level function called parallel_requests and now I call that instead of having the for loop:
list_of_lists = multiprocessing.Pool(processes=4).starmap(parallel_requests, zip(....))
Side note: I use starmap instead of map only because my parallel_requests function takes multiple arguments which I need to zip.
The good: this approach works and is much faster.
The bad: this approach works but is too fast. By using 4 processes (I tried that because I have 4 cores), parallel_requests is getting executed too fast. More than 15 calls per second are made to the API, and I'm getting blocked by the API itself.
In fact, it only works if I use 1 or 2 processes, otherwise it's too damn fast.
Essentially what I want is to keep using 4 processes, but also to limit the execution of my parallel_requests function to only 15 times per second overall.
Is there any parameter of multiprocessing.Pool that would help with this, or it's more complicated than that?
For this case I'd use a leaky bucket. You can have one process that fills a queue at the prescribed rate, with a maximum size that indicates how many requests you can "bank" if you don't make them at the maximum rate; the worker processes then just need to get from the queue before doing their work.
import time

def make_api_request(this, that, rate_queue):
    rate_queue.get()
    print("DEBUG: doing some work at {}".format(time.time()))
    return this * that

def throttler(rate_queue, interval):
    try:
        while True:
            if not rate_queue.full():  # avoid blocking
                rate_queue.put(0)
            time.sleep(interval)
    except BrokenPipeError:
        # main process is done
        return

if __name__ == '__main__':
    from multiprocessing import Pool, Manager, Process
    from itertools import repeat
    rq = Manager().Queue(maxsize=15)  # conservative; no banking
    pool = Pool(4)
    Process(target=throttler, args=(rq, 1/15.)).start()
    pool.starmap(make_api_request, zip(range(100), range(100, 200), repeat(rq)))
I'll look at the ideas posted here, but in the meantime I've just used a simple approach of opening and closing a Pool of 4 processes for every 15 requests and appending all the results in a list_of_lists.
Admittedly, not the best approach, since it takes time/resources to open/close a Pool, but it was the most handy solution for now.
import multiprocessing
from time import sleep

# define a generator for use below
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

list_of_lists = []
for current_chunk in chunks(all_data, 15):  # 15 is the API's limit of requests per second
    pool = multiprocessing.Pool(processes=4)
    res = pool.starmap(parallel_requests,
                       zip(current_chunk,
                           [to_symbol] * len(current_chunk),
                           [query] * len(current_chunk),
                           [start] * len(current_chunk),
                           [stop] * len(current_chunk)))
    sleep(1)  # Sleep for 1 second after every 15 API requests
    list_of_lists.extend(res)
    pool.close()

flatten_list = [item for sublist in list_of_lists for item in sublist]  # use this to construct a `pandas` dataframe
PS: This solution is really not all that fast due to the repeated opening/closing of pools. Thanks to Nathan Vērzemnieks for suggesting opening just one pool; it's much faster, plus your processor won't look like it's running a stress test.
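For reference, a self-contained sketch of that single-pool version might look like this; parallel_requests and the argument values here are stand-ins for the real API call and data:

import multiprocessing
from time import sleep

def parallel_requests(item, to_symbol, query, start, stop):
    # stand-in for the real API call from the question
    return [item, to_symbol, query, start, stop]

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

if __name__ == '__main__':
    all_data = list(range(100))                        # dummy data
    to_symbol, query, start, stop = 'EUR', 'q', 0, 1   # dummy arguments
    pool = multiprocessing.Pool(processes=4)           # one pool, reused for every chunk
    list_of_lists = []
    for current_chunk in chunks(all_data, 15):         # 15 = the API's per-second limit
        res = pool.starmap(parallel_requests,
                           zip(current_chunk,
                               [to_symbol] * len(current_chunk),
                               [query] * len(current_chunk),
                               [start] * len(current_chunk),
                               [stop] * len(current_chunk)))
        list_of_lists.extend(res)
        sleep(1)                                       # stay under the rate limit
    pool.close()
    pool.join()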
One way to do this is to use a Queue, which can share details about API-call timestamps with the other processes.
Below is an example of how this could work. It takes the oldest entry in the queue, and if it is younger than one second, sleep is called for the duration of the difference.
from multiprocessing import Pool, Manager, queues
from random import randint
import time

MAX_CONNECTIONS = 10
PROCESS_COUNT = 4

def api_request(a, b):
    time.sleep(randint(1, 9) * 0.03)  # simulate request
    return a, b, time.time()

def parallel_requests(a, b, the_queue):
    try:
        oldest = the_queue.get()
        time_difference = time.time() - oldest
    except queues.Empty:
        time_difference = float("-inf")
    if 0 < time_difference < 1:
        time.sleep(1 - time_difference)
    else:
        time_difference = 0
    print("Current time: ", time.time(), "...after sleeping:", time_difference)
    the_queue.put(time.time())
    return api_request(a, b)

if __name__ == "__main__":
    m = Manager()
    q = m.Queue(maxsize=MAX_CONNECTIONS)
    for _ in range(0, MAX_CONNECTIONS):  # Fill the queue with zeroes
        q.put(0)
    p = Pool(PROCESS_COUNT)

    # Create example data
    data_length = 100
    data1 = range(0, data_length)  # Just some dummy data
    data2 = range(100, data_length + 100)  # Just some dummy data
    queue_iterable = [q] * (data_length + 1)  # required for the starmap function
    list_of_lists = p.starmap(parallel_requests, zip(data1, data2, queue_iterable))
    print(list_of_lists)

python parallel calculations in a while loop

I have been looking around for some time, but haven't had luck finding an example that could solve my problem. I have added an example from my code. As one can see, this is slow and the two functions could run separately.
My aim is to print the latest parameter values every second. At the same time, the slow processes can run in the background. The latest value is shown, and whenever a process finishes, the value is updated.
Can anybody recommend a better way to do it? An example would be really helpful.
Thanks a lot.
import time

def ProcessA(parA):
    # imitate slow process
    time.sleep(5)
    parA += 2
    return parA

def ProcessB(parB):
    # imitate slow process
    time.sleep(10)
    parB += 5
    return parB

# start from here
i, parA, parB = 1, 0, 0

while True:  # endless loop
    print(i)
    print(parA)
    print(parB)
    time.sleep(1)
    i += 1
    # update parameter A
    parA = ProcessA(parA)
    # update parameter B
    parB = ProcessB(parB)
I imagine this should do it for you. It has the benefit that you can add extra parallel functions, up to a total equal to the number of cores you have. Edits are welcome.
# import time module
import time
# import the appropriate multiprocessing functions
from multiprocessing import Pool

# define your functions
# whatever your slow function is
def slowFunction(x):
    return someFunction(x)

# printingFunction: keep printing the current value until the new result is ready
def printingFunction(new, current, timeDelay):
    while not new.ready():
        print(current)
        time.sleep(timeDelay)

# set the initial value that will be printed.
# Depending on your function this may take some time.
CurrentValue = slowFunction(someTemporallyDynamicVariable)

# establish your pool
pool = Pool()
while True:  # endless loop
    # an asynchronous call; this will continue
    # to run in the background while your printing operates.
    NewValue = pool.apply_async(slowFunction, (someTemporallyDynamicVariable,))
    printingFunction(NewValue, CurrentValue, 1)
    CurrentValue = NewValue.get()

# close your pool
pool.close()

Using threads for a for-loop in Python

item_list = [("a", 10, 20), ("b", 25, 40), ("c", 40, 100), ("d", 45, 90),
("e", 35, 65), ("f", 50, 110)] #weight/value
results = [("", 0, 0)] #an empty string and a 2-tupel to compare with the new
#values
class Rucksack(object):
def __init__(self, B):
self.B = B #B=maximum weight
self.pack(item_list, 0, ("", 0, 0))
def pack(self, items, n, current):
n += 1 #n is incremented, to stop the recursion, if all
if n >= len(items) - 1:
if current[2] > results[0][2]:
#substitutes the result, if current is bigger and starts no
#new recursion
results[0] = current
else:
for i in items:
if current[1] + i[1] <= self.B and i[0] not in current[0]:
#first condition: current + the new value is not bigger
#than B; 2nd condition: the new value is not the same as
#current
i = (current[0] + " " + i[0], current[1] + i[1],
current[2] + i[2])
self.pack(items, n, i)
else:
#substitutes the result, if current is bigger and starts no
#new recursion
if current[2] > results[0][2]:
results[0] = current
rucksack1 = Rucksack(100)
This is a small algorithm for the knapsack problem. I have to parallelize the code somehow, but I don't get the thread module so far. I think the only place to work with parallelisation is the for loop, right? So, I tried this:
def run(self, items, i, n, current):
    global num_threads, thread_started
    lock.acquire()
    num_threads += 1
    thread_started = True
    lock.release()
    if current[1] + i[1] <= self.B and i[0] not in current[0]:
        i = (current[0] + " " + i[0], current[1] + i[1], current[2] + i[2])
        self.pack(items, n, i)
    else:
        if current[2] > results[0][2]:
            results[0] = current
    lock.acquire()
    num_threads -= 1
    lock.release()
but the results are strange. Nothing happens, and if I issue a KeyboardInterrupt the result is correct, but that's definitely not the point of the implementation. Can you tell me what is wrong with the second code, or where else I could use parallelisation soundly? Thanks.
First, since your code is CPU-bound, you will get very little benefit from using threads for parallelism, because of the GIL, as bereal explains. Fortunately, there are only a few differences between threads and processes—basically, all shared data must be passed or shared explicitly (see Sharing state between processes for details).
Second, if you want to data-parallelize your code, you have to lock all access to mutable shared objects. From a quick glance, while items and current look immutable, the results object is a shared global that you modify all over the place. If you can change your code to return a value up the chain, that's ideal. If not, if you can accumulate a bunch of separate return values and merge them after processing is finished, that's usually good too. If neither is feasible, you will need to guard all access to results with a lock. See Synchronization between processes for details.
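For illustration, a minimal sketch of that last option (guarding a shared result with a lock) might look like this; the worker and the data are placeholders rather than the knapsack code itself:

from multiprocessing import Process, Manager, Lock

def worker(value, best, lock):
    candidate = value * value   # placeholder for the real computation
    with lock:                  # guard every read-modify-write of the shared result
        if candidate > best[0]:
            best[0] = candidate

if __name__ == '__main__':
    manager = Manager()
    best = manager.list([0])    # shared, mutable "best result so far"
    lock = Lock()
    procs = [Process(target=worker, args=(v, best, lock)) for v in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(best[0])              # 81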
Finally, you ask where to put the parallelism. The key is to find the right dividing line between independent tasks.
Ideally you want to find a large number of mid-sized jobs that you can queue up, and just have a pool of processes each picking up the next one. From a quick glance, the obvious places to do that are either at the recursive call to self.pack, or at each iteration of the for i in items: loop. If they actually are independent, just use concurrent.futures, as in the ProcessPoolExecutor example. (If you're on Python 3.1 or earlier, you need the futures module, because it's not built into the stdlib.)
If there's no easy way to do this, often it's at least possible to create a small number (N or 2N, if you have N cores) of long-running jobs of about equal size, and just give each one its own multiprocessing.Process. For example:
n = 8
procs = [Process(target=rucksack.pack,
                 args=(items[i*len(items)//n:(i+1)*len(items)//n], 0, ("", 0, 0)))
         for i in range(n)]
One last note: If you finish your code and it looks like you've gotten away with implicitly sharing globals, what you've actually done is written code that usually-but-not-always works on some platforms, and never on others. See the Windows section of the multiprocessing docs to see what to avoid—and, if possible, test regularly on Windows, because it's the most restrictive platform.
You also ask a second question:
Can you tell me what is wrong with the second code.
It's not entirely clear what you were trying to do here, but there are a few obvious problems (beyond what's mentioned above).
You don't create a thread anywhere in the code you showed us. Just creating variables with "thread" in the name doesn't give you parallelism. And neither does adding locks—if you don't have any threads, all locks can do is slow you down for no reason.
From your description, it sounds like you were trying to use the thread module, instead of threading. There's a reason that the very top of the thread documentation tells you to not use it and use threading instead.
You have a lock protecting your thread count (which shouldn't be needed at all), but no lock protecting your results. You will get away with this in most cases in Python (because of the same GIL issue mentioned above—your threads are basically not going to run concurrently, and therefore they're not going to have races), but it's still a very bad idea (especially if you don't understand exactly what those "most cases" are).
However, it looks like your run function is based on the body of the for i in items: loop in pack. If that's a good place to parallelize, you're in luck, because creating a parallel task out of each iteration of a loop is exactly what futures and multiprocessing are best at. For example, this code:
results = []
for i in items:
    result = dostuff(i)
    results.append(result)
… can, of course, be written as:
results = map(dostuff, items)
And it can be trivially parallelized, without even having to understand what futures are about, as:
pool = concurrent.futures.ProcessPoolExecutor()
results = pool.map(dostuff, items)
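Put together as a runnable sketch (dostuff here is just a placeholder for the real per-item work):

import concurrent.futures

def dostuff(i):
    # placeholder for the real per-item work
    return i * i

if __name__ == '__main__':
    items = list(range(20))
    with concurrent.futures.ProcessPoolExecutor() as pool:
        results = list(pool.map(dostuff, items))
    print(results)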
