I have an object-oriented Python program in which I perform certain data operations on each object using multiprocessing. I store every object in a shared manager dictionary. When I want to update an object, I first retrieve it from the dictionary, and after the update I put it back. My class structure looks like this:
from src.data_element import Data_element
from multiprocessing import freeze_support, Process, Manager
import pandas as pd

class Data_Obj(Data_element):
    def __init__(self, dataset_name, name_wo_fields, fields):
        Data_element.__init__(self, dataset_name, name_wo_fields, fields)
        self.depends = ['data_1', 'data_2']

    def calc(self, obj_dict_manager):
        data_1 = obj_dict_manager['data_1']
        data_2 = obj_dict_manager['data_2']
        self.df = pd.merge(
            data_1.df,
            data_2.df,
            on='week',
            suffixes=('', '_y')
        )[['week', 'val']]

def calculate(obj_dict_manager, data):
    data_obj = obj_dict_manager[data]
    data_obj.calc(obj_dict_manager)
    obj_dict_manager[data] = data_obj

if __name__ == '__main__':
    freeze_support()
    manager = Manager()
    obj_dict_manager = manager.dict()
    obj_dict_manager = create_empty_objects(obj_dict_manager)

    joblist = []
    for data in obj_dict_manager.keys():
        p = Process(target=calculate, args=(obj_dict_manager, data))
        joblist.append(p)
        p.start()

    for job in joblist:
        job.join()
During these operations, a significant amount of time is spent on
data_1=obj_dict_manager['data_1']
data_2=obj_dict_manager['data_2']
i.e., about 1 second is spent retrieving the objects from the manager dictionary, while the rest of the calculation takes another 1 second.
Is there any way that I can reduce the time spent here? I will be doing thousands of such operations and performance is critical for me.
An Important Note
You're doing something that is potentially dangerous: while you're iterating over the keys in obj_dict_manager, you're launching processes that modify the very same dictionary. You should never modify something while you're iterating over it, and doing the modifications asynchronously from a subprocess could introduce especially strange results.
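One way to sidestep that particular hazard (a minimal sketch, not the original code) is to snapshot the keys before launching any workers, so the launch loop iterates over a plain list instead of the live managed dict:

# Snapshot the keys up front; the loop then iterates over a plain list,
# not over the managed dict that the workers are mutating.
for data in list(obj_dict_manager.keys()):
    p = Process(target=calculate, args=(obj_dict_manager, data))
    joblist.append(p)
    p.start()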
Possible Causes of your Issue
1) I can't tell how many objects are actually stored in your shared dictionary (because we don't have the code for create_empty_objects()), but if it is a significant amount, your subprocesses may be competing for access to the shared dictionary. In particular, since you have both reading and writing to the dictionary, it's going to be locked by one process or another a lot of the time.
2) Since we can't see how many keys are in your shared dictionary, we also can't see how many processes are being launched. If you're creating more processes than cores on your system, you may be subjecting your CPU to a lot of context switching, which is going to slow everything down.
3) A combination of #1 & #2 - This could be especially problematic if the manager grants a lock to one process, then that process gets put to sleep because you have dozens of processes competing for CPU time on an 8-core machine, and now everyone has to wait until that process wakes up and releases the lock.
How to Fix It
1) If your issue is skewed towards #1, consider splitting up your dictionary instead of using a shared one: pass a chunk of the dictionary to each subprocess, let it do whatever it needs, have it return the resulting dictionary, then recombine all the returned dictionaries as the processes complete. Something like Pool.map_async() may work better for you if you can divide the dictionary up (see the sketch after this list).
2) In most cases, try to limit the number of processes you spawn to the number of cores on your system, sometimes even fewer if you have a lot of other things running at the same time. An exception is when you're doing a lot of parallel processing AND you expect the subprocesses to block a lot, such as when doing I/O in parallel.
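A minimal sketch of suggestion #1, assuming the objects are picklable and that each object's calculation only needs other objects from its own chunk (not true of the cross-dependent data_1/data_2 example above, so dependent objects would have to land in the same chunk; calc_chunk and split_dict are placeholder helpers, not part of the original code):

import os
from multiprocessing import Pool

def calc_chunk(chunk):
    # chunk is a plain dict owned entirely by this worker: no manager
    # proxy and no locking, just local computation on local objects.
    for name, obj in chunk.items():
        obj.calc(chunk)  # placeholder for whatever per-object work is needed
    return chunk

def split_dict(d, n):
    # Split a dict into n roughly equal chunks.
    items = list(d.items())
    return [dict(items[i::n]) for i in range(n)]

if __name__ == '__main__':
    objects = create_empty_objects({})  # assumes this can also fill a plain dict
    chunks = split_dict(objects, os.cpu_count() or 4)
    with Pool() as pool:
        results = pool.map_async(calc_chunk, chunks).get()
    merged = {}
    for chunk in results:
        merged.update(chunk)

Each worker now touches only its own plain dict, so there is no per-access round trip to the manager process and no lock contention.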
Related
I am writing a program that analyses csv files in a directory, initially one file at a time. This could be several hundred files, but all of them are relatively small. My main runtime limitation was I/O, so I turned to multithreading using the threading library, which is a first for me.
I created a thread for each function call, following this guide, where each function call opens a csv in the desired directory. As a result, I have a thread per file (i.e. hundreds of threads). However, my program still ran slowly, with the bulk of its time spent in method 'acquire' of '_thread.lock' objects, according to cProfile. I believe that this is because the large number of threads results in lots of threads waiting for others to finish their tasks - is this correct?
How would you recommend I resolve this? My current idea is to split my list of files into equally sized chunks and to assign a thread to each chunk, rather than a thread to each file, and for each thread to iterate through the files in each chunk.
Python has something called the Global Interpreter Lock (GIL) which seriously hurts your performance with that many threads, as each one is waiting to hold the interpreter lock. I would recommend using processes, which, if I remember correctly, are similar to Python thread objects in their use but do not suffer the same penalty of waiting for that lock. A thread and a process are different, but for your application, it sounds like it should not matter.
It is worth noting that the GIL can be released when performing I/O such as reading from a file, and therefore using threads might be fine - you just need to use fewer of them. In fact, with the number of threads/processes you are looking to create it might be a better idea to use a fixed pool of workers.
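A minimal sketch of the fixed-pool idea using concurrent.futures (process_csv and 'data_dir' are placeholders for your own analysis function and directory):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_csv(path):
    # placeholder: open and analyse one csv file
    with open(path, newline='') as f:
        return sum(1 for _ in f)

if __name__ == '__main__':
    paths = list(Path('data_dir').glob('*.csv'))
    # A small, fixed number of worker threads instead of one thread per file.
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(process_csv, paths))

The executor hands each thread a new file as soon as it finishes the previous one, so hundreds of threads never exist at once; swapping ThreadPoolExecutor for ProcessPoolExecutor is a one-line change if the work turns out to be CPU-bound rather than I/O-bound.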
The problem:
When sending 1000 tasks to apply_async, they run in parallel on all 48 CPUs, but then sometimes fewer and fewer CPUs run, until only one CPU is left running, and only when that last one finishes its task do all the CPUs continue running again, each with a new task. It shouldn't need to wait for any "task batch" like this.
My (simplified) code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(json2features, (j,)) for j in jsons]
feats = [t.get() for t in tasks]
jsons = [...] is a list of about 1000 JSONs already loaded to memory and parsed to objects.
json2features(json) does some CPU-heavy work on a json, and returns an array of numbers.
This function may take between 1 second and 15 minutes to run, and because of this I sort the jsons using a heuristic, s.t. hopefully the longest tasks are first in the list, and thus start first.
The json2features function also prints when a task is finished and how long it took. It all runs on an Ubuntu server with 48 cores, and like I said above, it starts out great, using all 47 cores. Then, as the tasks get completed, fewer and fewer cores run, which at first sounds perfectly OK, were it not for the fact that after the last core finishes (when I see its print to stdout), all CPUs start running again on new tasks, meaning it wasn't really the end of the list. It may do the same thing again, and then again for the actual end of the list.
Sometimes it can be using just one core for 5 minutes, and when the task is finally done, it starts using all cores again, on new tasks. (So it's not stuck on some IPC overhead)
There are no repeated jsons, nor any dependencies between them (it's all static, fresh-from-disk data, no references etc..), nor any dependency between json2features calls (no global state or anything) except for them using the same terminal for their print.
I was suspicious that the problem was that a worker doesn't get released until get is called on its result, so I tried the following code:
from multiprocessing import Pool
pool = Pool(47)
tasks = [pool.apply_async(print, (i,)) for i in range(1000)]
# feats = [t.get() for t in tasks]
And it does print all 1000 numbers, even though get isn't called.
I have run out of ideas as to what the problem might be.
Is this really the normal behavior of Pool?
Thanks a lot!
The multiprocessing.Pool relies on a single os.pipe to deliver the tasks to the workers.
Usually on Unix, the default pipe size ranges from 4 to 64 KiB. If the JSONs you are delivering are large, you might get the pipe clogged at any given point in time.
This means that, while one of the workers is busy reading the large JSON from the pipe, all the other workers will starve.
It is generally a bad practice to share large data via IPC as it leads to bad performance. This is even underlined in the multiprocessing programming guidelines.
Avoid shared state
As far as possible one should try to avoid shifting large amounts of data between processes.
Instead of reading the JSON files in the main process, just send the workers their file names and let them open and read the content. You will surely notice an improvement in performance because you are moving the JSON loading phase into the concurrent domain as well.
Note that the same is true also for the results. A single os.pipe is used to return the results to the main process as well. If one or more workers clog the results pipe then you will get all the processes waiting for the main one to drain it. Large results should be written to files as well. You can then leverage multithreading on the main process to quickly read back the results from the files.
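A minimal sketch of that approach, assuming the JSONs live in files and reusing json2features from the question (json2features_from_file and json_paths are placeholders):

import json
from multiprocessing import Pool

def json2features_from_file(path):
    # Load the JSON inside the worker so that only a short path string
    # travels over the pipe, not the parsed object itself.
    with open(path) as f:
        obj = json.load(f)
    return json2features(obj)  # the original CPU-heavy function

if __name__ == '__main__':
    json_paths = [...]  # the file paths, sorted by the same heuristic as before
    with Pool(47) as pool:
        feats = pool.map(json2features_from_file, json_paths)

If the returned feature arrays are also large, the workers can write them to files and return only the file names, as suggested above.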
import multiprocessing
import pickle

def content_generator(applications, dict):
    for app in applications:
        yield (app, dict[app])

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
    pass  # do some aggregation on result
I have a really large dictionary whose keys are strings (application names) and whose values are information concerning the application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big, but all the Python processes are killed when the dictionary is too big. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that they all occupy the same amount of resident memory (RES), namely 3.4 GB each. This confuses me, since it seems the whole dictionary has been copied into the spawned processes. I thought I had broken up the dictionary and passed only what is relevant to each spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?
The comments are becoming impossible to follow, so I'm pasting in my important comment here:
On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.
Then some answers to questions:
so in python 2.7, there is no way to spawn a new process?
On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
The big dictionary is passed in to a function as an argument and I could only create the pool inside this function. How would I be able to create the pool before the big dictionary?
As simple as this: make these two lines the first two lines in your program:
import multiprocessing
pool = multiprocessing.Pool()
You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
ANOTHER SUGGESTION
If you're not mutating the dict after it's created, try using this instead:
def content_generator(dict):
    for app in dict:
        yield app, dict[app]
That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
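Putting the two suggestions together, a minimal sketch for a Linux-y Python 2.7 setup like the one in the question (func_process_application here is only a stand-in for the real per-application function; abc.pickle comes from the question):

import multiprocessing
import pickle

def func_process_application(item):
    # placeholder for the real per-application processing
    app, info = item
    return app, len(info)

# Create the pool before loading the giant dict, so the forked workers
# inherit a small address space.
pool = multiprocessing.Pool()

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)

# Iterate directly over the items; no key set or generator needed.
for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
    pass  # do some aggregation on result

pool.close()
pool.join()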
Specification of the problem:
I am running some complex tasks in Python, and to speed them up I decided to use Python's multiprocessing library. It worked pretty well, but after some time I started to wonder how much time the locks I use are consuming and how much the processes are blocking each other.
The structure of the processes is as follows:
One process updates a shared list between the processes. The data-update code is something like this:
lock.acquire()
list_rex[0] = pat.some_list
list_rex[1] = pat.some_dictionary
lock.release()
where list_rex and lock are defined by
list_rex = manager.list([[some_elements], {some_elements}])
lock = multi.Lock()
And then there are several processes that, once in a while, update their own memory space from this list. The code is as follows:
lock.acquire()
some_list = list_rex[0]
some_dict = list_rex[1]
lock.release()
some_list and some_dict are related, so I cannot allow a process to hold information in some_list from a different update than the one in some_dict.
And my question is: how fast are the methods acquire() and release()? In my case they can be called every few seconds, sometimes every few milliseconds. And/or is there some way to avoid using locks in my case?
Thank you for your time.
EDIT: after considering your comment, my question should probably be: how are proxy lists affecting my calculations? I read from "some_list" and "some_dict" a great deal after each update.
I was about to write a really complicated answer involving reader-writer locks, atomic reference assignments not needing locks if you just used one object for both the list and dictionary together, and some other stuff about measuring performance before changing anything...
But then I took some time to look at what you are actually doing inside your locks... essentially nothing more than a couple of reference assignments, neither of which is blocking, doing I/O, or anything "slow". So the short answer is that the cost of acquiring/releasing the locks is likely negligible. Unless you are entering the locks across dozens of processes hundreds of times a second, the impact is negligible. But don't take my word for it - go measure it.
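A minimal sketch of such a measurement (toy data, not the original program); it times the acquire/release pair together with the two proxy reads from a single process, so contention from other processes would only add to the number it reports:

import time
from multiprocessing import Manager, Lock

if __name__ == '__main__':
    manager = Manager()
    list_rex = manager.list([[1, 2, 3], {'a': 1}])  # toy stand-ins for the real data
    lock = Lock()

    N = 10000
    start = time.perf_counter()
    for _ in range(N):
        lock.acquire()
        some_list = list_rex[0]  # each proxy read is a round trip to the manager process
        some_dict = list_rex[1]
        lock.release()
    elapsed = time.perf_counter() - start
    print('%.1f microseconds per locked read' % (1e6 * elapsed / N))

You will likely find that the proxy reads, not the lock, dominate that figure.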
I'm more concerned about what happens when some_list and some_dict are referenced by the other processes outside the lock while list_rex is being updated inside the lock.
How can multiple calculations be launched in parallel, while stopping them all when the first one returns?
The application I have in mind is the following: there are multiple ways of calculating a certain value; each method takes a different amount of time depending on the function parameters; by launching calculations in parallel, the fastest calculation would automatically be "selected" each time, and the other calculations would be stopped.
Now, there are some "details" that make this question more difficult:
The parameters of the function to be calculated include functions (that are calculated from data points; they are not top-level module functions). In fact, the calculation is the convolution of two functions. I'm not sure how such function parameters could be passed to a subprocess (they are not picklable).
I do not have access to all calculation codes: some calculations are done internally by Scipy (probably via Fortran or C code). I'm not sure whether threads offer something similar to the termination signals that can be sent to processes.
Is this something that Python can do relatively easily?
I would look at the multiprocessing module if you haven't already. It offers a way of offloading tasks to separate processes whilst providing you with a simple, threading-like interface.
It provides the same kinds of primitives as you get in the threading module, for example, worker pools and queues for passing messages between your tasks, but it allows you to sidestep the issue of the GIL since your tasks actually run in separate processes.
The actual semantics of what you want are quite specific so I don't think there is a routine that fits the bill out-of-the-box, but you can surely knock one up.
Note: if you want to pass functions around, they cannot be bound functions since these are not pickleable, which is a requirement for sharing data between your tasks.
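A minimal sketch of the "first result wins" pattern built from those primitives (run_method, first_result and the two toy square functions are placeholders; as noted above, this only works if the functions and their arguments are picklable):

import time
from multiprocessing import Process, Queue

def run_method(method, args, results):
    # Each worker computes the value its own way and reports back.
    results.put(method(*args))

def first_result(methods, args):
    results = Queue()
    workers = [Process(target=run_method, args=(m, args, results)) for m in methods]
    for w in workers:
        w.start()
    value = results.get()   # blocks until the fastest worker finishes
    for w in workers:
        w.terminate()       # stop the slower calculations
        w.join()
    return value

# Toy usage: two ways of computing the same value, one deliberately slow.
def slow_square(x):
    time.sleep(5)
    return x * x

def fast_square(x):
    return x * x

if __name__ == '__main__':
    print(first_result([slow_square, fast_square], (12,)))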
Because of the global interpreter lock, you would be hard pressed to get any speedup this way. In reality, even multithreaded programs in Python only run on one core. Thus, you would just be running N computations, each at 1/N of the speed. Even if one finished in half the time of the others, you would still lose time in the big picture.
Processes can be started and killed trivially.
You can do this.
import subprocess
watch = []
for s in ("process1.py", "process2.py", "process3.py"):
    sp = subprocess.Popen(s)
    watch.append(sp)
Now you're simply waiting for one of those to finish. When one finishes, kill the others.
import time
winner = None
while winner is None:
    time.sleep(10)
    for w in watch:
        if w.poll() is not None:
            winner = w
            break

for w in watch:
    if w.poll() is None:
        w.kill()
These are processes -- not threads. No GIL considerations. Make the operating system schedule them; that's what it does best.
Further, each process is simply a script that simply solves the problem using one of your alternative algorithms. They're completely independent and stand-alone. Simple to design, build and test.