I am new to multiprocessing I would really appreciate it if someone can guide/help me here. I have the following for loop which gets some data from the two functions. The code looks like this
for a in accounts:
dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
group_users[a['Email']] = get_group_users(a['Id'], adConn)
print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()
This works fine and gets all the results but recently I have noticed it takes a lot of time to get the list of users i.e. dl_users and group_users. It takes almost 14-15 mins to complete. I am looking for ways where I can speed up the function and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users makes calls for LDAP, so I am not 100% sure if I should be converting this to multiprocessing or multithreading. Any suggestion would be of big help
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool
# Pick the amount of processes that works best for you
processes = 4
with Pool(processes) as pool:
processed = pool.map(your_func, your_data)
Where your_func is a function to apply to each element of your_data, which is an iterable. If you need to provide some other parameters to the callable, you can use a lambda function:
processed = pool.map(lambda item: your_func(item, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor
# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4
with ThreadPoolExecutor(workers) as pool:
processed = pool.map(your_func, your_data)
If you want to avoid having to store your_data in memory if you need some attribute of the items instead of the items itself, you can use a generator:
processed = pool.map(your_func, (account["Email"] for account in accounts))
Related
I'm running a python code on Sagemaker Processing job, specifically SKLearnProcessor. The code run a for-loop for 200 times (each iteration is independent), each time takes 20 minutes.
for example: script.py
for i in list:
run_function(i)
I'm kicking off the job from a notebook:
sklearn_processor = SKLearnProcessor(
framework_version="1.0-1", role=role,
instance_type="ml.m5.4xlarge", instance_count=1,
sagemaker_session = Session()
)
out_path = 's3://' + os.path.join(bucket, prefix,'outpath')
sklearn_processor.run(
code="script.py",
outputs=[
ProcessingOutput(output_name="load_training_data",
source = f'/opt/ml/processing/output}',
destination = out_path),
],
arguments=["--some-args", "args"]
)
I want to parallel this code and make the Sagemaker processing job use it best capacity to run as many concurrent jobs as possible.
How can I do that
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise multi-thread or multi-process. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os
POOL_SIZE = os.cpu_count()
your_list = [...]
def run_function(i):
# ...
return your_result
if __name__ == '__main__':
with Pool(POOL_SIZE) as pool:
print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution is dependent on the quantity and size of the data. If they are completely independent of each other and have a considerable size, it may make sense to split the data over several instances. This way, execution will be faster and there may also be a reduction in costs based on the instances chosen over the initial larger instance.
It is clear in your case it is the instance_count parameter to set, as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with the ProcessingInput split.
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be to process a number of csvs. If there are 100 csvs, we may decide to instantiate 5 instances so as to pass 20 files per instance. And in each instance decide to parallelise the reading and/or processing of the csvs and/or rows in the relevant functions.
To pursue such an approach, one must monitor well whether one is really bringing improvement to the system rather than wasting resources.
I have a function which takes in one input. I want to run the same function on 25 different inputs and finally combine the results of the 25 function calls. How do I do this in python? Will multithreading or multiprocessing be the better approach? I'm doing it on my work laptop and using the following approach:
import multiprocessing
pool = multiprocessing.Pool(processes=5)
output = pool.map(get_results, response_obj)
It's taking longer than even serially processing the objects. The functions is performing GET requests to an API.
It is generally said that using an asynchronous approach is best practice for get-requests. There's plenty of libraries & other resources on asynchronous programming in Python.
To get your feet wet you should take a look at asyncio.
https://docs.python.org/3/library/asyncio.html
I have created two versions of a program to add the numbers of an array, one version uses concurrent programming and the other is sequential. The problem that I have is that I cannot make the parallel program to return a faster processing time. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time
def fun(v):
s=0
for i in range(0,len(v)):
s=s+l[i]
return s
def sumSeq(v):
s=0
start=time.time()
for i in range(0,len(v)):
s=s+v[i]
start1=time.time()
print ("time seq ",start1-start," sum = ",s)
def main():
workers=4
vector = [random.randint(1,101) for _ in range(1000000)]
sumSeq(vector)
dim=(int)(len(vector)/(workers*10))
s=0
chunks=(vector[k:k+dim] for k in range(0,len(vector),(int)(len(vector)/(workers*10))))
start=time.time()
with ThreadPoolExecutor(max_workers=workers) as executor:
futures=[executor.submit(fun,chunk) for chunk in chunks]
start1=time.time()
for future in as_completed(futures):
s=s+future.result()
print ("concurrent time ",start1-start," sum = ",s)
The problem is that I get the following answer:
time sec 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version to runs faster, I have change the chunks size and the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can happily continue to run Python code in one thread while others are waiting for disk, network, database etc.
This only applies to pure-Python code; several computation-heavy extension modules will release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread.
It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect — but it's more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results.
I should also note that it looks like your example isn't really well suited to this test:
It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. In order to test this, you need at least several seconds, ideally many seconds or a couple of minutes.
The sequential version visits the array in sequence; things like CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.
I am trying to use pandas in a multi-thread environment. I have a few lists of pandas frames (long list, 5000 pandas frames, with dimensions of 300x2500 dimension) which I need to concatenate.
Since I have multiple lists, I want to run the concat for each list in an own thread (or use a threadpool, at least to get some parallel processing).
For some reason the processing in my multi-thread setup is identical to single threaded processing. I am wondering if I am doing something systematically wrong.
Here is my code snippet, I use ThreadPoolExecutor to implement parallelization:
def func_merge(the_list, key):
return (key, pd.concat(the_list))
def my_thread_starter():
buffer = {
'A': [df_1, ..., df_5000],
'B': [df_a1, ...., df_a5000]
}
with ThreadPoolExecutor(max_workers=2) as executor:
submitted=[]
for key, df_list in buffer.items():
submitted.append(executor.submit(func_merge, df_list, key = key))
for future in as_completed(submitted):
out = future.result()
// do with results
Is there a trick to use Pandas' concat in separate threads? I would at least expect my CPU utilization to spark when running more threads but it does seem to have any effect. Consequently, the time advantage is zero, too
Does anyone has an idea what the problem could be?
Because of the Global Interpreter Lock -GIL), I'm not sure your code is leveraging multi-threading.
Basically, ThreadPoolExecutor is useful when workload is not CPU bounded but IO bounded, like making many Web API call at the same time.
It may have change in python 3.8. But I don't know how to interpret the "tasks which release the GIL" the documentation.
ProcessPoolExecutor could help, but because it requires to serialize input and output of function, with huge data volume, it won't be faster.
I may be in a little over my head here, but I am working on a little bioinformatics project in python. I am trying to parallelism a program that analyzes a large dictionary of sets of strings (~2-3GB in RAM). I find that the multiprocessing version is faster when I have smaller dictionaries but is of little benefit and mostly slower with the large ones. My first theory was that running out of memory just slowed everything and the bottleneck was from swapping into virtual memory. However, I ran the program on a cluster with 4*48GB of RAM and the same slowdown occurred. My second theory is that access to certain data was being locked. If one thread is trying to access a reference currently being accessed in another thread, will that thread have to wait? I have tried creating copies of the dictionaries I want to manipulate, but that seems terribly inefficient. What else could be causing my problems?
My multiprocessing method is below:
def worker(seqDict, oQueue):
#do stuff with the given partial dictionary
oQueue.put(seqDict)
oQueue = multiprocessing.Queue()
chunksize = int(math.ceil(len(sdict)/4)) # 4 cores
inDict = {}
i=0
dicts = list()
for key in sdict.keys():
i+=1
if len(sdict[key]) > 0:
inDict[key] = sdict[key]
if i%chunksize==0 or i==len(sdict.keys()):
print(str(len(inDict.keys())) + ", size")
dicts.append(copy(inDict))
inDict.clear()
for pdict in dicts:
p =multiprocessing.Process(target = worker,args = (pdict, oQueue))
p.start()
finalDict = {}
for i in range(4):
finalDict.update(oQueue.get())
return finalDict
As I said in the comments, and as Kinch said in his answer, everything passed through to a subprocess has to be pickled and unpickled to duplicate it in the local context of the spawned process. If you use multiprocess.Manager.dict for sDict (thereby allowing processes to share the same data through a server that proxies the objects created on it) and spawning the processes with slice indices in to that shared sDict, that should cut down on the serialize/deserialize sequence involved in spawning the child processes. You still might hit bottlenecks with that though in the server communication step of working with the shared objects. If so, you'll have to look at simplifying your data so you can use true shared memory with multiprocess.Array or multiprocess.Value or look at multiprocess.sharedctypes to create custom datastructures to share between your processes.
Seems like the data from the "large dictionary of sets of strings" could be reformatted into a something that could be stored in a file or string, allowing you to use the mmap module to share it among all the processes. Each process might incur some startup overhead if it needs to convert the data back into some other more preferable form, but that could be minimized by passing each process something indicating what subset of the whole dataset in shared memory they should do their work on and only reconstitute the part required by that process.
Every data which is passed through the queue will be serialized and deserialized using pickle. I would guess this could be a bottleneck if you pass a lot of data round.
You could reduce the amount of data, make use of shared memory, write a multi-threading version in a c extension or try a multithreading version of this with a multithreading safe implemention of python (maybe jython or pypy; I don't know).
Oh and by the way: You are using multiprocessing and not multithreading.