Slower Execution with Python multiprocessing - python

I may be in a little over my head here, but I am working on a little bioinformatics project in python. I am trying to parallelism a program that analyzes a large dictionary of sets of strings (~2-3GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries but is of little benefit and mostly slower with the large ones. My first theory was that running out of memory just slowed everything and the bottleneck was from swapping into virtual memory. However, I ran the program on a cluster with 4*48GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
def worker(seqDict, oQueue):
#do stuff with the given partial dictionary
oQueue.put(seqDict)
oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict)/4)) # 4 cores
inDict = {}
i=0
dicts = list()
for key in sdict.keys():
i+=1
if len(sdict[key]) > 0:
inDict[key] = sdict[key]
if i%chunksize==0 or i==len(sdict.keys()):
print(str(len(inDict.keys())) + ", size")
dicts.append(copy(inDict))
inDict.clear()
for pdict in dicts:
p =multiprocessing.Process(target = worker,args = (pdict, oQueue))
p.start()
finalDict = {}
for i in range(4):
finalDict.update(oQueue.get())
return finalDict

As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocess.Manager.dict for sDict (thereby allowing processes to share the same data through a server that proxies the objects created on it) and spawning the processes with slice indices in to that shared sDict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks with that though in the server communication step of working with the shared objects. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocess.Array or multiprocess.Value or look at multiprocess.sharedctypes to create custom datastructures to share between your processes.

Seems like the data from the "large dictionary of sets of strings" could be reformatted into a something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other more preferable form, but that could be minimized by passing each process something indicating what subset of the whole dataset in shared memory they should do their work on and only reconstitute the part required by that process.

Every data which is passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data round.
You could reduce the amount of data, make use of shared memory, write a multi-threading version in a c extension or try a multithreading version of this with a multithreading safe implemention of python (maybe jython or pypy; I don't know).
Oh and by the way: You are using multiprocessing and not multithreading.

Related

How can I share a large data structure among forked Python processes?

I'm creating a 58 GB graph in a Python program, which I then want to process through multiple (Linux) processes. For that I'm using a ProcessPoolExecutor with a fork context, passing to each process just an integer associated with the chunk associated with it.
with concurrent.futures.ProcessPoolExecutor(mp_context=mp.get_context('fork')) as executor:
futures = []
for start in range(0, len(vertices_list), BATCH_SIZE):
future = executor.submit(process_batch, start)
futures += [future]
The process_batch function calculates a metric of the graph's vertices, and returns a small list of results. Both vertices_list and graph are global variables. The graph variable is bound to a C++ Python extension module that implements the required data structure and calculation.
def process_batch(start):
results = []
for v in vertices_list[start:start + BATCH_SIZE]:
results.append((v, graph.some_metric(v)))
return results
This works fine with small samples, but when I run it on the 58 GB structure, the machine grinds to a halt, with memory getting swapped out to allow the forked processes to bring in more memory in ΒΌ MB chunks. (Using strace(1) I saw a few such mmap(2) calls.) I saw two comments (here and here) indicating that on Python fork(2) actually means copy on access, because just accessing an object will change its ref-count. This seems consistent with the behavior I'm observing: the forked processes probably attempting to create their own private copy of the data structure.
So my question is: does Python offer a way to share an arbitrary data structure among forked Python processes? I know about multiprocessing.shared_memory and the corresponding facilities, but these concern the built-in Array and Value types.

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing I would really appreciate it if someone can guide/help me here. I have the following for loop which gets some data from the two functions. The code looks like this
for a in accounts:
dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
group_users[a['Email']] = get_group_users(a['Id'], adConn)
print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results but recently I have noticed it takes a lot of time to get the list of users i.e. dl_users and group_users. It takes almost 14-15 mins to complete. I am looking for ways where I can speed up the function and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users makes calls for LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be of big help
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool
# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, you can use a lambda function:
processed = pool.map(lambda item: your_func(item, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor
# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
processed = pool.map(your_func, your_data)
If you want to avoid having to store your_data in memory if you need some attribute of the items instead of the items itself, you can use a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))

Python: Efficient way to move a huge list of data from one list to another

I'm using multiprocessing, and in a separate process (let's call it thread for clarity), I'm using PySerial to pull data from a device. In my program, I have a list that's shared with the main thread. I create the list using:
import multiprocessing as mp
#... stuff
self.mpManager = mp.Manager()
self.shared_return_list = self.mpManager.list()
This list, is filled inside the process, and then this is transferred to the local thread using:
if len(self.acqBuffer) < 50000:
try:
self.shared_result_lock.acquire()
self.acqBuffer.extend(self.shared_return_list)
del self.shared_return_list[:]
finally:
self.shared_result_lock.release()
where acqBuffer is the local list that takes the data for analysis and storage.
The problem is that if the device has lots of data queued, the transfer process will become really, really slow as lots of data is there, and the GUI freezes. Possible solutions are either to transfer the data in chunks and actively keep reviving the GUI, or find a smart way to transfer the data, which is what I'm asking about.
In C++, I would use some derivative of std::deque or std::list (which is not necessarily contiguous in memory) with a move constructor and use std::move to just push the pointer to the data, instead of re-copying the whole data to the main thread. Is such a thing possible in my case in Python? Are there smarter ways to do this?
Thank you.

CPU (all cores) become idle during python multiprocessing on windows

My system is windows 7. I wrote python program to do data analysis. I use multiprocessing library to achieve parallelism. When I open windows powershell, and type python MyScript.py. It starts to use all the cpu cores. But after a while, the CPU (all cores) became idle. But if I hit Enter in powershell window, all cores are back to full-load. To be clear, the program is fine, and has been tested. The problem here is that CPU-cores went idle by themselves.
This happened not only on my office computer, which runs Windows 7 Pro, but also on my home desktop, which runs Windows 7 Ultimate.
The parallel part of the program is very simple:
def myfunc(input):
##some operations based on a huge data and a small data##
operation1: read in a piece of HugeData #query based HDF5
operation2: some operation based on HugeData and SmallData
return output
# read in Small data
SmallData=pd.read_csv('data.csv')
if __name__ == '__main__':
pool = mp.Pool()
result=pool.map_async(myfunc, a_list_of_input)
out=result.get()
My function are mainly data manipulations using Pandas.
There is nothing wrong with the program, because I've successfully finished my program couple times. But I have to keep watching it, and hit Enter when cores become idle. The job takes couple hours, and I really don't keep watching it.
Is this a problem of windows system itself or my program?
By the way, can all the cores have access to the same variable stored in the memory? e.g. I have a data set mydata read into memory right before if __name__ == '__main__':. This data will be used in myfunc. All the cores should be able to access mydata in the same time, right?
Please help!
I was re-directed to this question as I was facing a similar problem while using Python's Multiprocessing library in Ubuntu. In my case, the processes do not start by hitting enter or such, however, they start after sometime abruptly. My code is an iterative heuristic that uses multiprocessing in each of its iterations. I have to rerun the code after completion of some iterations in order to get a steady runtime-performance. As the question was posted way long back, did you come across the actual reason behind it and solution to it?
I confess to not understanding the subtleties of map_async, but I'm not sure whether you can use it like that (I can't seem to get it to work at all)...
I usually use the following recipe (a list comprehension of the calls I want doing):
In [11]: procs = [multiprocessing.Process(target=f, args=()) for _ in xrange(4)]
....: for p in procs: p.start()
....: for p in procs: p.join()
....:
It's simple and waits until the jobs are finished before continuing.
This works fine with pandas objects provided you're not doing modifications... (I think) copies of the object are passed to each thread and if you perform mutations they not propogate and will be garbage collected.
You can use multiprocessing's version of a dict or list with the Manager class, this is useful for storing the result of each job (simply access the dict/list from within the function):
mgr = multiproccessing.Manager()
d = mgr.dict()
L = mgr.list()
and they will have shared access (as if you had written a Lock). It's hardly worth mentioning, that if you are appending to a list then the order will not just be the same as procs!
You may be able to do something similar to the Manager for pandas objects (writing a lock to objects in memory without copying), but I think this would be a non-trivial task...

Multiprocessing a dictionary in python

I have two dictionaries of data and I created a function that acts as a rules engine to analyze entries in each dictionaries and does things based on specific metrics I set(if it helps, each entry in the dictionary is a node in a graph and if rules match I create edges between them).
Here's the code I use(its a for loop that passes on parts of the dictionary to a rules function. I refactored my code to a tutorial I read):
jobs = []
def loadGraph(dayCurrent, day2Previous):
for dayCurrentCount in graph[dayCurrent]:
dayCurrentValue = graph[dayCurrent][dayCurrentCount]
for day1Count in graph[day2Previous]:
day1Value = graph[day2Previous][day1Count]
#rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
jobs.append(p)
p.start()
print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large(could, because its randomly generated). I think for each day there's about 50,000 entries. Because most of the time is spend on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.
Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the proceses could be split up by that but my above code is slower than using it normally. I think this is because its creating a new process for every entry its doing.
Is there a way to speed this up and use all my cpus? My problem is, I don't want to pass the entire dictionary because then one core will get suck processing it, I would rather have a the process split to each cpu or in a way that I maximum all free cpus for this.
I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a python package called Celery: http://celeryproject.org/. With celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with more involved setup/maintenance.

Categories

Resources