I want to calculate a statistic over all pairwise combinations of the columns of a very large matrix. I have a Python script, jaccard.py, that accepts a pair of columns and computes this statistic over the matrix.
On my work machine, each calculation takes about 10 seconds, and I have about 95000 of these calculations to complete. However, all these calculations are independent from one another and I am looking to use a cluster we have that uses the Torque queueing system and python2.4. What's the best way to parallelize this calculation so it's compatible with Torque?
I have made the calculations themselves compatible with python2.4, but I am at a loss as to how to parallelize these calculations using subprocess, or whether I can even do that because of the GIL.
The main idea I have is to keep a constant pool of subprocesses going; when one finishes, read the output and start a new one with the next pair of columns. I only need the output once the calculation is finished, then the process can be restarted on a new calculation.
My idea was to submit the job this way
qsub -l nodes=4:ppn=8 myjob.sh > outfile
myjob.sh would invoke a main python file that looks like the following:
import os, sys
from subprocess import Popen, PIPE
from select import select
def combinations(iterable, r):
    # backport of itertools.combinations
    pass

col_pairs = combinations(range(598), 2)

processes = [Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                   stdout=PIPE)
             for _ in range(8)]

try:
    while 1:
        for p in processes:
            # If a process has completed its calculation, print the result
            # **How do I do this part?**
            # Delete the process and add a new one
            p.stdout.close()
            processes.remove(p)
            processes.append(Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                                   stdout=PIPE))
# When there are no more column pairs, StopIteration ends the job.
except StopIteration:
    pass
Any advice on how best to do this? I have never used Torque and am unfamiliar with using subprocess in this way. I tried using multiprocessing.Pool on my workstation and it worked flawlessly with Pool.map, but since the cluster uses python2.4, I'm not sure how to proceed.
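For reference, a minimal sketch of such a polling loop on Python 2.4 might look like the following; Popen.poll() returns None while a child is still running, and column_pairs here is just a stand-in for the combinations backport, so treat it as an untested illustration rather than a finished solution.
import time
from subprocess import Popen, PIPE

POOL_SIZE = 8
N_COLS = 598                       # number of matrix columns, as in the question

def column_pairs(n):
    # stand-in for the combinations() backport: all (i, j) with i < j
    for i in range(n):
        for j in range(i + 1, n):
            yield (i, j)

def start_job(pair):
    return Popen(['./jaccard.py', str(pair[0]), str(pair[1])], stdout=PIPE)

pairs = column_pairs(N_COLS)
running = []
try:
    for _ in range(POOL_SIZE):     # fill the initial pool
        running.append(start_job(pairs.next()))
except StopIteration:
    pass

while running:
    still_running = []
    for p in running:
        if p.poll() is None:       # child has not finished yet
            still_running.append(p)
        else:
            print p.stdout.read()  # collect the finished result
            p.stdout.close()
            try:
                still_running.append(start_job(pairs.next()))
            except StopIteration:
                pass               # no more pairs to hand out
    running = still_running
    time.sleep(0.5)                # avoid a busy-wait loop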
EDIT: Actually, on second thought, I could just write multiple qsub scripts, each only working on a single chunk of the 95000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations. It's essentially the same thing.
As noted in the edit above, I could just write multiple qsub scripts, each working on a single chunk of the 95000 calculations, submitting something like 16 different jobs of 7125 calculations each. It's essentially the same thing. This isn't a solution, but it's a suitable workaround given time and effort constraints.
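For example, a rough sketch of how the pairs could be split into per-job chunk files (the file names and layout are illustrative; each qsub script would then loop over its own pairs_XX.txt and call jaccard.py once per line):
# write_chunks.py -- split the column pairs across N_JOBS files, one per Torque job
N_JOBS = 16
N_COLS = 598

pairs = []
for i in range(N_COLS):
    for j in range(i + 1, N_COLS):
        pairs.append((i, j))

chunk_size = len(pairs) / N_JOBS + 1
for job in range(N_JOBS):
    chunk = pairs[job * chunk_size:(job + 1) * chunk_size]
    f = open('pairs_%02d.txt' % job, 'w')     # hypothetical per-job input file
    for i, j in chunk:
        f.write('%d %d\n' % (i, j))
    f.close()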
I'm running Python code in a SageMaker Processing job, specifically SKLearnProcessor. The code runs a for-loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
for example: script.py
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session=Session()
)

out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')

sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelize this code and make the SageMaker Processing job use its full capacity to run as many concurrent jobs as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os

POOL_SIZE = os.cpu_count()
your_list = [...]

def run_function(i):
    # ...
    return your_result

if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution is dependent on the quantity and size of the data. If they are completely independent of each other and have a considerable size, it may make sense to split the data over several instances. This way, execution will be faster and there may also be a reduction in costs based on the instances chosen over the initial larger instance.
In your case, it is clearly the instance_count parameter that needs to be set; as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with the ProcessingInput split.
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
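For illustration, here is a hedged sketch of what the sharded setup might look like with the SageMaker Python SDK; the bucket, prefix and role values are placeholders, and the instance type and count are just an example, not a recommendation:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

role = "arn:aws:iam::123456789012:role/your-sagemaker-role"   # hypothetical
out_path = "s3://your-bucket/your-prefix/outpath"              # hypothetical

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",   # smaller instances, but five of them
    instance_count=5,
)

sklearn_processor.run(
    code="script.py",
    inputs=[
        ProcessingInput(
            source="s3://your-bucket/your-prefix/input/",       # hypothetical input prefix
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",          # split the objects across the 5 instances
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source="/opt/ml/processing/output",
                         destination=out_path),
    ],
    arguments=["--some-args", "args"],
)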
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be processing a number of CSVs. If there are 100 CSVs, we may decide to instantiate 5 instances so as to pass 20 files per instance. Within each instance, you can then decide to parallelise the reading and/or processing of the CSVs and/or their rows in the relevant functions.
To pursue such an approach, one must monitor well whether one is really bringing improvement to the system rather than wasting resources.
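As a rough illustration of the combined approach, a hypothetical script.py along these lines could run inside each instance; with a sharded input (s3_data_distribution_type='ShardedByS3Key'), every instance only sees its own subset of the CSVs and works through them with a local process pool. The paths and the per-file work are placeholders.
import os
from multiprocessing import Pool

import pandas as pd

INPUT_DIR = "/opt/ml/processing/input"     # assumed ProcessingInput destination
OUTPUT_DIR = "/opt/ml/processing/output"   # assumed ProcessingOutput source

def process_csv(path):
    # placeholder per-file work; replace with the real transformation
    df = pd.read_csv(path)
    out_path = os.path.join(OUTPUT_DIR, os.path.basename(path))
    df.to_csv(out_path, index=False)
    return out_path

if __name__ == "__main__":
    files = [os.path.join(INPUT_DIR, name)
             for name in os.listdir(INPUT_DIR) if name.endswith(".csv")]
    with Pool(os.cpu_count()) as pool:
        print(pool.map(process_csv, files))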
I have a pandas dataframe consisting of approximately 1M rows of information entered by users. I wrote a function that validates whether the number entered by the user is correct or not. What I'm trying to do is execute the function on multiple processors to overcome the limitation of doing this heavy computation on a single processor. I split my dataframe into multiple chunks, where each chunk contains 50K rows, and then used the Python multiprocessing module to process each chunk separately. The issue is that only the first process is starting, and it is still using one processor instead of distributing the load across all processors. Here is the code I wrote:
import multiprocessing

pool = multiprocessing.Pool(processes=16)

r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0], fields, dictionary))
r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1], fields, dictionary))

print(r7.get())
print(r8.get())

pool.close()
pool.join()
I have attached a screenshot that shows the CPU usage when executing the above code.
Any advice on how I can overcome this issue?
I suggest you try this:
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    results = executor.map(validate.validate_phone_number,
                           has_phone_num_list, repeat(fields), repeat(dictionary))
    for result in results:
        pass  # process results here
By constructing the ProcessPoolExecutor with no parameters, most of your CPUs will be fully utilised. This is a very portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct with max_workers=N where N is a low number to ensure that a minimal number of CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process is going to take.
As suggested in this answer, you can use pandarallel to run Pandas' apply function in parallel. Unfortunately, as I cannot run your code, I am not able to pinpoint the problem. Did you try using fewer processors (for example 8 instead of 16)?
Note that in some cases parallelization doesn't pay off: the overhead of starting processes and moving data between them can outweigh the gains.
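For reference, a minimal pandarallel sketch; the dataframe and the per-value validator below are illustrative stand-ins, not the asker's validate.validate_phone_number:
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()   # spreads apply() across all available cores

def is_valid_phone_number(value):
    # stand-in for the real validation logic
    digits = str(value)
    return digits.isdigit() and 7 <= len(digits) <= 15

df = pd.DataFrame({"phone": ["0123456789", "12ab", "442071234567"]})
df["is_valid"] = df["phone"].parallel_apply(is_valid_phone_number)
print(df)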
I have created two versions of a program to add the numbers of an array: one version uses concurrent programming and the other is sequential. The problem is that I cannot make the parallel program run faster than the sequential one. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time

def fun(v):
    s = 0
    for i in range(0, len(v)):
        s = s + v[i]
    return s

def sumSeq(v):
    s = 0
    start = time.time()
    for i in range(0, len(v)):
        s = s + v[i]
    start1 = time.time()
    print("time seq ", start1 - start, " sum = ", s)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(1000000)]
    sumSeq(vector)
    dim = int(len(vector) / (workers * 10))
    s = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        start1 = time.time()
        for future in as_completed(futures):
            s = s + future.result()
    print("concurrent time ", start1 - start, " sum = ", s)

if __name__ == '__main__':
    main()
The problem is that I get the following answer:
time seq 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version run faster. I have changed the chunk size and set the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can happily continue to run Python code in one thread while others are waiting for disk, network, database etc.
This only applies to pure-Python code; several computation-heavy extension modules will release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread.
It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect — but it's more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results.
I should also note that it looks like your example isn't really well suited to this test:
It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. In order to test this, you need at least several seconds, ideally many seconds or a couple of minutes.
The sequential version visits the array in sequence; things like CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.
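Building on those caveats, here is a hedged sketch (not the asker's original task) of the ProcessPoolExecutor route: the per-task work is CPU-bound pure Python, the run is long enough to measure, and only small (start, stop) tuples cross the process boundary, so marshalling costs stay small. Actual timings will of course depend on the machine.
from concurrent.futures import ProcessPoolExecutor
import time

def busy_sum(bounds):
    # deliberately CPU-heavy pure-Python work over a half-open range
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i * i
    return total

def main():
    workers = 4
    n = 20_000_000

    t0 = time.time()
    sequential = busy_sum((0, n))
    print("sequential time", time.time() - t0, "sum =", sequential)

    step = n // workers
    bounds = [(k, min(k + step, n)) for k in range(0, n, step)]
    t0 = time.time()
    with ProcessPoolExecutor(max_workers=workers) as executor:
        parallel = sum(executor.map(busy_sum, bounds))
    print("parallel time  ", time.time() - t0, "sum =", parallel)

if __name__ == '__main__':
    main()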
My python program runs every 5 minutes on the cloud. The program reads a certain list of files, and the files each contain a timestamp. If the timestamp matches within the current 5 minutes, the program does a particular action.
Here is an example:
I have a directory "D:/files" with n files. The loop works like this:
for one_file in files:
    time_in_file = one_file["time"]
    if time_in_file == within_next_five_minutes:
        do_a_particular_action
    else:
        move_to_the_next_file
Currently, I am using a small number of files (approximately 50), which is why it is working fine. In the coming future, the number of files is expected to be in the hundreds or thousands, and the process will take more than 5 minutes to complete. Is there any better way to optimize this other than iterating each file one by one?
I recommend creating 2 different processes: one for loading the files and a second one to process each file. Depending on what the files look like, you can send them on the queue as they are, or in pieces. The queue's responsibility is to connect the 2 processes and pass the data between them.
Create 2 methods, load_files and process_file, and create a process for each of them. The first method writes to the queue, the second one reads from it.
The queue object should be passed as an argument to each of the methods, in args.
import multiprocessing as mp
....
queue = mp.Queue()
loader= mp.Process(target=load_files, args=(queue, ....))
processor = mp.Process(target=process_file, args=(queue, ....))
....
loader.start()
processor.start()
....
loader.join()
processor.join()
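To make the skeleton concrete, here is a hedged sketch in which load_files and process_file are illustrative placeholders (the directory and the per-file action are not the asker's real logic), with None used as an end-of-work sentinel:
import multiprocessing as mp
import os

def load_files(queue, directory):
    # producer: push file paths (or their contents) onto the queue
    for name in os.listdir(directory):
        queue.put(os.path.join(directory, name))
    queue.put(None)              # sentinel: nothing more to load

def process_file(queue):
    # consumer: pull items until the sentinel arrives
    while True:
        path = queue.get()
        if path is None:
            break
        # placeholder for the real per-file work (read the timestamp, act on it)
        print("processing", path)

if __name__ == '__main__':
    queue = mp.Queue()
    loader = mp.Process(target=load_files, args=(queue, "D:/files"))
    processor = mp.Process(target=process_file, args=(queue,))
    loader.start()
    processor.start()
    loader.join()
    processor.join()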
Depending on how fast your loader is compared to your processor, you can choose which process to wait on (e.g. by checking the queue size):
https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Queue.qsize
qsize()  # the qsize() method can help, but please be aware that it does not
         # work on every operating system.
Using this approach, you can start multiple processes to process files, or to put items on the queue. But if the load is higher than a single process can handle, you should find an easy way to balance between processes that share the same responsibility. For even higher loads, there are frameworks you can use; let me know if you need suggestions for such a framework.
Cheers!
My system is Windows 7. I wrote a Python program to do data analysis, and I use the multiprocessing library to achieve parallelism. When I open Windows PowerShell and type python MyScript.py, it starts to use all the CPU cores. But after a while, the CPU (all cores) becomes idle. However, if I hit Enter in the PowerShell window, all cores are back to full load. To be clear, the program is fine and has been tested. The problem here is that the CPU cores go idle by themselves.
This happened not only on my office computer, which runs Windows 7 Pro, but also on my home desktop, which runs Windows 7 Ultimate.
The parallel part of the program is very simple:
import multiprocessing as mp
import pandas as pd

def myfunc(input):
    # some operations based on a huge data set and a small data set:
    # operation 1: read in a piece of HugeData (HDF5-based query)
    # operation 2: some operation based on HugeData and SmallData
    return output

# read in the small data
SmallData = pd.read_csv('data.csv')

if __name__ == '__main__':
    pool = mp.Pool()
    result = pool.map_async(myfunc, a_list_of_input)
    out = result.get()
My function mainly performs data manipulations using pandas.
There is nothing wrong with the program, because I've successfully finished it a couple of times. But I have to keep watching it and hit Enter when the cores become idle. The job takes a couple of hours, and I really can't keep watching it.
Is this a problem with Windows itself or with my program?
By the way, can all the cores have access to the same variable stored in memory? E.g. I have a data set mydata read into memory right before if __name__ == '__main__':. This data will be used in myfunc. All the cores should be able to access mydata at the same time, right?
Please help!
I was redirected to this question as I was facing a similar problem while using Python's multiprocessing library on Ubuntu. In my case, the processes do not start by hitting Enter or anything like that; instead, they start abruptly after some time. My code is an iterative heuristic that uses multiprocessing in each of its iterations. I have to rerun the code after the completion of some iterations in order to get steady runtime performance. As the question was posted long ago, did you come across the actual reason behind it and a solution to it?
I confess to not understanding the subtleties of map_async, but I'm not sure whether you can use it like that (I can't seem to get it to work at all)...
I usually use the following recipe (a list comprehension of the calls I want doing):
procs = [multiprocessing.Process(target=f, args=()) for _ in xrange(4)]
for p in procs: p.start()
for p in procs: p.join()
It's simple and waits until the jobs are finished before continuing.
This works fine with pandas objects provided you're not doing modifications... (I think) copies of the object are passed to each process, and if you perform mutations they will not propagate back and will be garbage collected.
You can use multiprocessing's version of a dict or list with the Manager class, this is useful for storing the result of each job (simply access the dict/list from within the function):
mgr = multiprocessing.Manager()
d = mgr.dict()
L = mgr.list()
and they will have shared access (as if protected by a Lock). It's worth mentioning that if you are appending to a list, the order will not necessarily be the same as the order of procs!
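For example, a small sketch of collecting per-process results into a Manager dict (the square worker is illustrative, not the asker's myfunc):
import multiprocessing

def square(i, d):
    # each worker writes its result into the shared dict under its own key
    d[i] = i * i

if __name__ == '__main__':
    mgr = multiprocessing.Manager()
    d = mgr.dict()
    procs = [multiprocessing.Process(target=square, args=(i, d)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(dict(d))   # e.g. {0: 0, 1: 1, 2: 4, 3: 9}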
You may be able to do something similar to the Manager for pandas objects (writing a lock to objects in memory without copying), but I think this would be a non-trivial task...