Multiprocessing a function execution with Python

I have a pandas DataFrame of approximately 1M rows containing information entered by users. I wrote a function that validates whether the number entered by the user is correct. What I'm trying to do is execute the function on multiple processors to overcome the limitation of doing the heavy computation on a single processor. I split my DataFrame into multiple chunks of 50K rows each and then used Python's multiprocessing module to process each chunk separately. The issue is that only the first process starts, and the work still runs on one processor instead of being distributed across all of them. Here is the code I wrote:
import multiprocessing

pool = multiprocessing.Pool(processes=16)
r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0], fields, dictionary))
r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1], fields, dictionary))
print(r7.get())
print(r8.get())
pool.close()
pool.join()
I have attached a screenshot that shows the CPU usage when executing the above code.
Any advice on how I can overcome this issue?

I suggest you try this:
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # each call receives one chunk plus the shared fields and dictionary
    for result in executor.map(validate.validate_phone_number,
                               has_phone_num_list,
                               repeat(fields), repeat(dictionary)):
        pass  # process each result here
By constructing the ProcessPoolExecutor with no arguments, it defaults to one worker per CPU, so all of your CPUs can be utilised. This is a very portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct it with max_workers=N, where N is a small number, to cap how many CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process takes.

As suggested in this answer, you can use pandarallel to run Pandas' apply function in parallel. Unfortunately, as I cannot try your code, I am not able to pinpoint the problem. Did you try using fewer processes (for example 8 instead of 16)?
Note that in some cases the parallelization doesn't work.
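For reference, here is a minimal pandarallel sketch. The DataFrame name df, the column name 'phone' and the check_row wrapper are hypothetical stand-ins, while validate, fields and dictionary come from your code:
from pandarallel import pandarallel

pandarallel.initialize()  # starts one worker per CPU by default

def check_row(row):
    # hypothetical wrapper: adapt your existing validation to a single row
    return validate.validate_phone_number([row['phone']], fields, dictionary)

df['phone_valid'] = df.parallel_apply(check_row, axis=1)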

Related

How to parallelize a for loop in a SageMaker Processing job

I'm running Python code in a SageMaker Processing job, specifically with SKLearnProcessor. The code runs a for loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
for example: script.py
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
import os
from sagemaker import Session
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session=Session()
)

out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')

sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelize this code and make the SageMaker Processing job use its full capacity to run as many concurrent jobs as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os

POOL_SIZE = os.cpu_count()

your_list = [...]

def run_function(i):
    # ...
    return your_result

if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution depends on the quantity and size of the data. If the items are completely independent of each other and of considerable size, it may make sense to split the data over several instances. This way, execution will be faster, and there may also be a reduction in cost, depending on the instances chosen compared with the initial larger instance.
In your case it is clearly the instance_count parameter to set, as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with splitting the ProcessingInput across the instances.
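A minimal sketch of that combination, reusing the names from the question (role, Session, script.py); the source URI is a placeholder, and s3_data_distribution_type="ShardedByS3Key" is what spreads the input objects across the instances:
from sagemaker import Session
from sagemaker.processing import ProcessingInput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=4,   # 4 parallel instances
    sagemaker_session=Session(),
)
sklearn_processor.run(
    code="script.py",
    inputs=[
        ProcessingInput(
            source="s3://your-bucket/your-prefix/input/",
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # each instance gets a subset
        )
    ],
)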
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be to process a number of csvs. If there are 100 csvs, we may decide to instantiate 5 instances so as to pass 20 files per instance. And in each instance decide to parallelise the reading and/or processing of the csvs and/or rows in the relevant functions.
To pursue such an approach, monitor carefully whether you are really improving overall runtime rather than wasting resources.
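As a rough sketch of the per-instance half of that idea (the process_csv function is hypothetical, and it assumes the ProcessingInput destination is /opt/ml/processing/input), each instance could parallelise the handling of its share of the csvs like this:
from multiprocessing import Pool
import glob
import os
import pandas as pd

def process_csv(path):
    # hypothetical per-file work: read one csv and do something with it
    df = pd.read_csv(path)
    return path, len(df)

if __name__ == '__main__':
    # with ShardedByS3Key, each instance only sees its own subset of the files
    csv_paths = glob.glob('/opt/ml/processing/input/*.csv')
    with Pool(os.cpu_count()) as pool:
        for path, n_rows in pool.map(process_csv, csv_paths):
            print(path, n_rows)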

Python 3.8 Convert for loop to multiprocessing/multithreading

I am new to multiprocessing, and I would really appreciate it if someone could guide me here. I have the following for loop, which gets some data from two functions. The code looks like this:
for a in accounts:
    dl_users[a['Email']] = get_dl_users(a['Email'], adConn)
    group_users[a['Email']] = get_group_users(a['Id'], adConn)

print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")

adConn.unbind()
This works fine and gets all the results, but recently I have noticed it takes a lot of time to build the lists of users, i.e. dl_users and group_users; it takes almost 14-15 minutes to complete. I am looking for ways to speed this up and would like to convert this for loop to multiprocessing. get_group_users and get_dl_users make LDAP calls, so I am not 100% sure whether I should be converting this to multiprocessing or multithreading. Any suggestion would be a big help.
As mentioned in the comments, multithreading is appropriate for I/O operations (reading/writing from/to files, sending http requests, communicating with databases), while multiprocessing is appropriate for CPU-bound tasks (such as transforming data, making calculations...). Depending on which kind of operation your functions perform, you want one or the other. If they do a mix, separate them internally and profile which of the two really needs optimisation, since both multiprocessing and -threading introduce overhead that might not be worth adding.
That said, the way to apply multiprocessing or multithreading is pretty simple in recent Python versions (including your 3.8).
Multiprocessing
from multiprocessing import Pool

# Pick the amount of processes that works best for you
processes = 4

with Pool(processes) as pool:
    processed = pool.map(your_func, your_data)
Where your_func is a function applied to each element of your_data, which is an iterable. If you need to pass extra parameters to the callable, use functools.partial; a lambda will not work here, because multiprocessing has to pickle the callable and lambdas cannot be pickled:
from functools import partial

processed = pool.map(partial(your_func, some_kwarg="some value"), your_data)
Multithreading
The API for multithreading is very similar:
from concurrent.futures import ThreadPoolExecutor

# Pick the amount of workers that works best for you.
# Most likely equal to the amount of threads of your machine.
workers = 4

with ThreadPoolExecutor(workers) as pool:
    processed = pool.map(your_func, your_data)
If you want to avoid storing your_data in memory, for example because you only need one attribute of each item rather than the whole item, you can use a generator expression:
processed = pool.map(your_func, (account["Email"] for account in accounts))
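Applied to your loop, a minimal thread-based sketch could look like the following. It assumes get_dl_users, get_group_users, accounts and adConn exist as in your question, and that your LDAP library tolerates concurrent use of adConn; if it does not, open one connection per worker instead:
from concurrent.futures import ThreadPoolExecutor

def fetch_users(account):
    email, account_id = account['Email'], account['Id']
    return email, get_dl_users(email, adConn), get_group_users(account_id, adConn)

dl_users, group_users = {}, {}
with ThreadPoolExecutor(max_workers=8) as pool:
    for email, dl, grp in pool.map(fetch_users, accounts):
        dl_users[email] = dl
        group_users[email] = grp

print(f"Users part of DL - {dl_users}")
print(f"Users part of groups - {group_users}")
adConn.unbind()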

Slow parallel code in Python compared to sequential version, why is it slower?

I have created two versions of a program to add the numbers in an array: one version uses concurrent programming and the other is sequential. The problem is that I cannot make the parallel program return a faster processing time. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time

def fun(v):
    s = 0
    for i in range(0, len(v)):
        s = s + v[i]
    return s

def sumSeq(v):
    s = 0
    start = time.time()
    for i in range(0, len(v)):
        s = s + v[i]
    start1 = time.time()
    print("time seq ", start1 - start, " sum = ", s)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(1000000)]
    sumSeq(vector)
    dim = int(len(vector) / (workers * 10))
    s = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        start1 = time.time()
        for future in as_completed(futures):
            s = s + future.result()
    print("concurrent time ", start1 - start, " sum = ", s)

if __name__ == '__main__':
    main()
The problem is that I get the following answer:
time sec 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version run faster. I have changed the chunk size and set the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can happily continue to run Python code in one thread while others are waiting for disk, network, database etc.
This only applies to pure-Python code; several computation-heavy extension modules will release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread (see the sketch below).
It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect — but it's more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results.
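As an illustration of the second point, here is a sketch that pushes the summation into numpy; np.sum does its work in C and, for plain numeric dtypes, releases the GIL, so the threads can genuinely overlap. This is an aside, not the fix for the code in the question:
import numpy as np
from concurrent.futures import ThreadPoolExecutor

data = np.random.randint(1, 101, size=10_000_000)
chunks = np.array_split(data, 4)          # four roughly equal sub-arrays

with ThreadPoolExecutor(max_workers=4) as executor:
    # each np.sum call releases the GIL while it runs
    total = sum(int(s) for s in executor.map(np.sum, chunks))
print(total)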
I should also note that it looks like your example isn't really well suited to this test:
It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. In order to test this, you need at least several seconds, ideally many seconds or a couple of minutes.
The sequential version visits the array in sequence; things like CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.
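For reference, here is a minimal process-based variant of the same test, with a much larger vector so the work dominates the setup cost (the sizes are arbitrary choices; even then, pickling the chunks to the workers can eat much of the gain for something as cheap as addition):
from concurrent.futures import ProcessPoolExecutor, as_completed
import random
import time

def fun(v):
    s = 0
    for x in v:
        s = s + x
    return s

if __name__ == '__main__':
    workers = 4
    vector = [random.randint(1, 101) for _ in range(10_000_000)]
    dim = len(vector) // workers
    chunks = [vector[k:k + dim] for k in range(0, len(vector), dim)]

    start = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        total = sum(f.result() for f in as_completed(futures))
    print("process pool time", time.time() - start, "sum =", total)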

Pandas pd.concat() in separate threads shows no speed-up

I am trying to use pandas in a multi-threaded environment. I have a few lists of pandas DataFrames (long lists: 5000 DataFrames each, with dimensions of 300x2500) which I need to concatenate.
Since I have multiple lists, I want to run the concat for each list in its own thread (or use a thread pool, at least to get some parallel processing).
For some reason the processing in my multi-threaded setup is identical to single-threaded processing. I am wondering if I am doing something systematically wrong.
Here is my code snippet, I use ThreadPoolExecutor to implement parallelization:
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd

def func_merge(the_list, key):
    return (key, pd.concat(the_list))

def my_thread_starter():
    buffer = {
        'A': [df_1, ..., df_5000],
        'B': [df_a1, ..., df_a5000]
    }
    with ThreadPoolExecutor(max_workers=2) as executor:
        submitted = []
        for key, df_list in buffer.items():
            submitted.append(executor.submit(func_merge, df_list, key=key))
        for future in as_completed(submitted):
            out = future.result()
            # do something with the results
Is there a trick to using Pandas' concat in separate threads? I would at least expect my CPU utilization to spike when running more threads, but it doesn't seem to have any effect. Consequently, the time advantage is zero, too.
Does anyone have an idea what the problem could be?
Because of the Global Interpreter Lock (GIL), I'm not sure your code is benefiting from multi-threading.
Basically, ThreadPoolExecutor is useful when the workload is not CPU-bound but IO-bound, like making many web API calls at the same time.
This may have changed in Python 3.8, but I don't know how to interpret the "tasks which release the GIL" wording in the documentation.
ProcessPoolExecutor could help, but because it requires serializing the input and output of the function, with this data volume it probably won't be faster.
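If you want to measure that for your data, here is a minimal sketch of the process-based variant; the small generated DataFrames are stand-ins for the question's lists, and with real 300x2500 frames expect the pickling of inputs and results to dominate:
from concurrent.futures import ProcessPoolExecutor, as_completed
import pandas as pd

def func_merge(the_list, key):
    return key, pd.concat(the_list)

if __name__ == '__main__':
    buffer = {
        'A': [pd.DataFrame({'x': range(300)}) for _ in range(100)],
        'B': [pd.DataFrame({'y': range(300)}) for _ in range(100)],
    }
    with ProcessPoolExecutor(max_workers=2) as executor:
        submitted = [executor.submit(func_merge, df_list, key)
                     for key, df_list in buffer.items()]
        for future in as_completed(submitted):
            key, merged = future.result()
            print(key, merged.shape)   # use the merged DataFrame here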

Parallel Processing using subprocess in python2.4

I want to calculate a statistic over all pairwise combinations of the columns of a very large matrix. I have a python script, called jaccard.py that accepts a pair of columns and computes this statistic over the matrix.
On my work machine, each calculation takes about 10 seconds, and I have about 95000 of these calculations to complete. However, all these calculations are independent from one another and I am looking to use a cluster we have that uses the Torque queueing system and python2.4. What's the best way to parallelize this calculation so it's compatible with Torque?
I have made the calculations themselves compatible with Python 2.4, but I am at a loss as to how to parallelize these calculations using subprocess, or whether I can even do that because of the GIL.
The main idea I have is to keep a constant pool of subprocesses going; when one finishes, read the output and start a new one with the next pair of columns. I only need the output once the calculation is finished, then the process can be restarted on a new calculation.
My idea was to submit the job this way
qsub -l nodes=4:ppn=8 myjob.sh > outfile
myjob.sh would invoke a main python file that looks like the following:
import os, sys
from subprocess import Popen, PIPE
from select import select

def combinations(iterable, r):
    # backport of itertools.combinations
    pass

col_pairs = combinations(range(598), 2)

processes = [Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                   stdout=PIPE)
             for _ in range(8)]

try:
    while 1:
        for p in processes:
            # If process has completed the calculation, print it out
            # **How do I do this part?**

            # Delete the process and add a new one
            p.stdout.close()
            processes.remove(p)
            processes.append(Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                                   stdout=PIPE))
# When there are no more column pairs, end the job.
except StopIteration:
    pass
Any advice on how best to do this? I have never used Torque and am unfamiliar with subprocess usage like this. I tried using multiprocessing.Pool on my workstation and it worked flawlessly with Pool.map, but since the cluster uses Python 2.4, I'm not sure how to proceed.
EDIT: Actually, on second thought, I could just write multiple qsub scripts, each only working on a single chunk of the 95000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations. It's essentially the same thing.
Actually, on second thought, I could just write multiple qsub scripts, each only working on a single chunk of the 95000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations. It's essentially the same thing. This isn't a solution, but it's a suitable workaround given time and effort constraints.
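For completeness, here is a minimal sketch of the polling loop the original question asks about, kept Python 2.4 compatible (no with statement, print as a statement). It assumes the combinations backport from the question and that ./jaccard.py writes its result to stdout:
import time
from subprocess import Popen, PIPE

col_pairs = combinations(range(598), 2)   # generator of column pairs, as in the question

def start_process():
    # raises StopIteration once every pair has been handed out
    return Popen(['./jaccard.py'] + map(str, col_pairs.next()), stdout=PIPE)

processes = [start_process() for _ in range(8)]
try:
    while 1:
        for p in list(processes):
            if p.poll() is not None:           # this worker has finished
                print p.stdout.read()          # collect its output
                p.stdout.close()
                processes.remove(p)
                processes.append(start_process())
        time.sleep(1)                          # avoid busy-waiting
except StopIteration:
    # no more column pairs: drain whatever is still running
    for p in processes:
        print p.stdout.read()
        p.stdout.close()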
