I am working in Python 3.4, performing a naive search against partitioned data in memory, and am attempting to fork processes to take advantage of all available processing power. I say naive because I am certain there are additional things that could be done to improve performance, but those are out of scope for the question at hand.
The system I am testing on is a Windows 7 x64 environment.
What I would like to achieve is a relatively even, simultaneous distribution across cpu_count() - 1 cores (my reading suggests that distributing across all cores rather than n - 1 cores does not show any additional improvement, due to baseline OS processes). So: 75% pegged CPU usage on a 4-core machine.
What I am seeing (using the Windows Task Manager 'Performance' and 'Processes' tabs) is that I never achieve greater than 25% system-dedicated CPU utilization, and that the process view shows computation occurring on one core at a time, switching every few seconds between the forked processes.
I haven't instrumented the code for timing, but I am pretty sure that my subjective observations are correct in that I am not gaining the performance increase I expected (3x on an i5 3320m).
I haven't tested on Linux.
Based on the code presented:
- How can I achieve 75% CPU utilization?
#pseudo code
def search_method(search_term, partition):
    <perform fuzzy search>
    return results
partitions = [<list of lists>]
search_terms = [<list of search terms>]
#real code
import multiprocessing as mp
pool = mp.Pool(processes=mp.cpu_count() - 1)
for search_term in search_terms:
    results = []
    results = [pool.apply(search_method, args=(search_term, partitions[x])) for x in range(len(partitions))]
You're actually not doing anything concurrently here, because you're using pool.apply, which will block until the task you pass to it is complete. So, for every item in partitions, you're running search_method in some process inside of pool, waiting for it to complete, and then moving on to the next item. That perfectly coincides with what you're seeing in the Windows process manager. You want pool.apply_async instead:
for search_term in search_terms:
    results = [pool.apply_async(search_method, args=(search_term, partitions[x])) for x in range(len(partitions))]
    # Get the actual results from the AsyncResult objects returned.
    results = [r.get() for r in results]
Or better yet, use pool.map (along with functools.partial to enable passing multiple arguments to our worker function):
from functools import partial
...
for search_term in search_terms:
    func = partial(search_method, search_term)
    results = pool.map(func, partitions)
I'm running Python code in a SageMaker Processing job, specifically SKLearnProcessor. The code runs a for-loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
For example, script.py:
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session=Session()
)

out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')

sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelise this code and make the SageMaker Processing job use its full capacity to run as many concurrent jobs as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any Python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os

POOL_SIZE = os.cpu_count()

your_list = [...]

def run_function(i):
    # ...
    return your_result

if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution depends on the quantity and size of the data. If the items are completely independent of each other and of considerable size, it may make sense to split the data over several instances. This way, execution will be faster, and there may also be a reduction in cost, depending on the instances chosen compared with the initial larger instance.
In your case, it is clearly the instance_count parameter that needs to be set; as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with the ProcessingInput split.
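For illustration, here is a minimal sketch of that combination, assuming the input already sits in S3; the bucket/prefix path is a placeholder, and ShardedByS3Key is what actually splits the objects across instances:

from sagemaker.processing import ProcessingInput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=2,  # the job now runs on two instances
)

sklearn_processor.run(
    code="script.py",
    inputs=[
        ProcessingInput(
            source="s3://your-bucket/your-prefix/input/",  # placeholder path
            destination="/opt/ml/processing/input",
            # each instance receives a different subset of the S3 objects
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
)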
P.S.: This approach makes sense if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance aware.
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example would be processing a number of CSVs. If there are 100 CSVs, we might decide to launch 5 instances so as to pass 20 files to each instance, and within each instance parallelise the reading and/or processing of the CSVs and/or rows in the relevant functions.
To pursue such an approach, one must monitor carefully whether it really improves the system rather than wasting resources.
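As a rough sketch of the per-instance side of this combined approach (the input path follows the processing-container convention used above, and process_csv is a placeholder for your own logic):

from multiprocessing import Pool
import os

INPUT_DIR = "/opt/ml/processing/input"

def process_csv(path):
    # placeholder: read and process a single csv
    return path

if __name__ == "__main__":
    # each instance only sees its own shard of the files
    csv_files = [os.path.join(INPUT_DIR, f)
                 for f in os.listdir(INPUT_DIR) if f.endswith(".csv")]
    with Pool(os.cpu_count()) as pool:
        results = pool.map(process_csv, csv_files)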
I have a pandas dataframe that consists of approximately 1M rows and contains information entered by users. I wrote a function that validates whether the number entered by the user is correct or not. What I'm trying to do is execute the function on multiple processors to overcome the issue of doing heavy computation on a single processor. What I did was split my dataframe into multiple chunks, where each chunk contains 50K rows, and then use the Python multiprocessing module to process each chunk separately. The issue is that only the first process is starting, and it's still using one processor instead of distributing the load across all processors. Here is the code I wrote:
pool = multiprocessing.Pool(processes=16)
r7 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[0],fields ,dictionary))
r8 = pool.apply_async(validate.validate_phone_number, (has_phone_num_list[1],fields ,dictionary))
print(r7.get())
print(r8.get())
pool.close()
pool.join()
I have attached a screenshot that shows the CPU usage when executing the above code.
Any advice on how I can overcome this issue?
I suggest you try this:
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat

with ProcessPoolExecutor() as executor:
    # map over the chunks; repeat() passes the same fields/dictionary to every call
    for result in executor.map(validate.validate_phone_number,
                               has_phone_num_list, repeat(fields), repeat(dictionary)):
        pass  # process results here
By constructing the ProcessPoolExecutor with no parameters, most of your CPUs will be fully utilised. This is a very portable approach because there's no explicit assumption about the number of CPUs available. You could, of course, construct with max_workers=N where N is a low number to ensure that a minimal number of CPUs are used concurrently. You might do that if you're not too concerned about how long the overall process is going to take.
As suggested in this answer, you can use pandarallel to run Pandas' apply function in parallel. Unfortunately, since I cannot try your code, I am not able to pinpoint the problem. Did you try using fewer processors (e.g. 8 instead of 16)?
Note that in some cases the parallelization doesn't work.
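If you do try pandarallel, the usage is roughly as follows (a sketch only; the column name and the validation lambda are placeholders rather than your validate_phone_number):

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # spawns one worker per core by default

df = pd.DataFrame({"phone": ["12345", "abc", "67890"]})
# parallel_apply mirrors apply, but distributes the rows across processes
df["is_valid"] = df["phone"].parallel_apply(lambda x: x.isdigit())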
I have created two versions of a program to add the numbers of an array, one version using concurrent programming and the other sequential. The problem I have is that I cannot get the parallel program to run faster than the sequential one. I am currently using Windows 8 and Python 3.x. My code is:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import random
import time
def fun(v):
    s = 0
    for i in range(0, len(v)):
        s = s + v[i]
    return s

def sumSeq(v):
    s = 0
    start = time.time()
    for i in range(0, len(v)):
        s = s + v[i]
    start1 = time.time()
    print("time seq ", start1 - start, " sum = ", s)

def main():
    workers = 4
    vector = [random.randint(1, 101) for _ in range(1000000)]
    sumSeq(vector)
    dim = (int)(len(vector) / (workers * 10))
    s = 0
    chunks = (vector[k:k + dim] for k in range(0, len(vector), dim))
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        start1 = time.time()
        for future in as_completed(futures):
            s = s + future.result()
        print("concurrent time ", start1 - start, " sum = ", s)

if __name__ == '__main__':
    main()
The problem is that I get the following answer:
time seq 0.048101186752319336 sum = 50998349
concurrent time 0.059157371520996094 sum = 50998349
I cannot make the concurrent version run faster. I have changed the chunk size and set the number of max workers to None, but nothing seems to work. What am I doing wrong? I have read that the problem could be the creation of the processes, so how can I fix that in a simple way?
A long-standing weakness of Python is that it can't run pure-Python code simultaneously in multiple threads; the keyword to search for is "GIL" or "global interpreter lock".
Ways around this:
- This only applies to CPU-heavy operations, like addition; I/O operations and the like can happily run in parallel. You can happily continue to run Python code in one thread while others are waiting for disk, network, database etc.
- This only applies to pure-Python code; several computation-heavy extension modules will release the GIL and let code in other threads run. Things like matrix operations in numpy or image operations can thus run in threads alongside a CPU-heavy Python thread.
- It applies to threads (ThreadPoolExecutor) specifically; the ProcessPoolExecutor will work the way you expect (see the sketch after this list), but it is more isolated, so the program will spend more time marshalling and demarshalling the data and intermediate results.
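As a minimal sketch, reusing fun, workers and chunks from your code (and keeping in mind that for a 0.05s workload the process start-up and pickling costs will likely dominate):

from concurrent.futures import ProcessPoolExecutor, as_completed

# processes instead of threads, so the GIL no longer serialises the additions;
# the __main__ guard is required for process pools on Windows
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fun, chunk) for chunk in chunks]
        s = sum(future.result() for future in as_completed(futures))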
I should also note that it looks like your example isn't really well suited to this test:
- It's too small; with a total time of 0.05s, a lot of that is going to be the setup and tear-down of the parallel run. In order to test this, you need at least several seconds, ideally many seconds or a couple of minutes.
- The sequential version visits the array in sequence; things like the CPU cache are optimised for this sort of access. The parallel version will access the chunks at random, potentially causing cache evictions and the like.
So I wanted to use the Python driver to batch-insert documents into Couchbase. However, if the batch exceeds a few thousand documents, the Python kernel gets restarted (I'm using a notebook).
To reproduce this please try:
from couchbase.bucket import Bucket
cb = Bucket('couchbase://localhost/default')
keys_per_doc = 50
doc_count = 10000
docs = [dict(
    [
        ('very_long_feature_{}'.format(i), float(i) if i % 2 == 0 else i)
        for i in xrange(keys_per_doc)
    ] + [('id', id_)]) for id_ in xrange(doc_count)]

def sframe_to_cb(sf, key_column, key_prefix):
    pipe = cb.pipeline()
    with pipe:
        for r in sf:
            cb.upsert(key_prefix + str(r[key_column]), r)
    return 0

p = sframe_to_cb(docs, 'id', 'test_')
The funny thing is that all docs get inserted, and I suppose the interpreter dies when gathering the results in the pipeline's __exit__ method.
I don't get any error message and the notebook console just says that it has restarted the notebook.
I'm curious what is causing this behavior and if there is a way to fix it.
Obviously I can do mini-batches (up to 3000 docs in my case) but this makes it much slower if they are processed sequentially.
I cannot use multiprocessing because I run the inserts inside of celery.
I cannot use multiple celery tasks because the serialisation of batches is too expensive and could kill our redis instance.
So the questions:
- What is causing the crash with large batches, and is there a way to fix it?
- Assuming that nothing can go wrong with the upserts, can I make the pipeline discard results?
- Is there a different way to achieve high throughput from a single process?
Additional info as requested in the comments:
- VMware Fusion on a Mac running an Ubuntu 14.04 LTS VM
- The guest Ubuntu has 4GB RAM, 12GB swap on SSD, 2 cores (4 threads)
- The impression that doing mini-batches is slower comes from watching the bucket statistics (a large batch peaks at 10K TPS, smaller ones get ca. 2K TPS)
- There is a large speed-up if I use multiprocessing and these batches are distributed across multiple CPUs (20-30K TPS); however, I cannot do this in production because of celery limitations (I cannot use a ProcessPoolExecutor inside a celery task)
- I cannot really tell when exactly the crash happens (I'm not sure if this is relevant)
I am using a PBS-based cluster and running IPython parallel over a set of nodes, each with either 24 or 32 cores and memory ranging from 24G to 72G; this heterogeneity is due to our cluster having history to it. In addition, I have jobs that I am sending to the IPython cluster that have varying resource requirements (cores and memory). I am looking for a way to submit jobs to the ipython cluster that know about their resource requirements and those of the available engines. I imagine there is a way to deal with this situation gracefully using IPython functionality, but I have not found it. Any suggestions as to how to proceed?
In addition to graph dependencies, which you indicate that you already get, IPython tasks can have functional dependencies. These can be arbitrary functions, like tasks themselves. A functional dependency runs before the real task, and if it returns False or raises a special parallel.UnmetDependency exception, the task will not be run on that engine, and will be retried somewhere else.
So to use this, you need a function that checks whatever metric you need. For instance, let's say we only want to run a task on your nodes with a minimum amount of memory. Here is a function that checks the total memory on the system (in bytes):
def minimum_mem(limit):
    import sys
    if sys.platform == 'darwin':  # or BSD in general?
        from subprocess import check_output
        mem = int(check_output(['sysctl', '-n', 'hw.memsize']))
    else:  # linux
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal"):
                    mem = 1024 * int(line.split()[1])
                    break
    return mem >= limit
kB = 1024.
MB = 1024 * kB
GB = 1024 * MB
So minimum_mem(4 * GB) will return True iff you have at least 4 GB of memory on your system. If you want to check available memory instead of total memory, you can use the MemFree and Inactive values in /proc/meminfo to determine what is not already in use.
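For example, here is a Linux-only sketch of such a check, mirroring minimum_mem and treating MemFree plus Inactive as a rough approximation of what is available:

def minimum_free_mem(limit):
    free = 0
    with open("/proc/meminfo") as f:
        for line in f:
            key = line.split()[0]
            # sum the MemFree and Inactive totals, which are reported in kB
            if key in ("MemFree:", "Inactive:"):
                free += 1024 * int(line.split()[1])
    return free >= limit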
Now you can submit tasks only to engines with sufficient RAM by applying the @parallel.depend decorator:
@parallel.depend(minimum_mem, 8 * GB)
def big_mem_task(n):
    import os, socket
    return "big", socket.gethostname(), os.getpid(), n

amr = view.map(big_mem_task, range(10))
Similarly, you can apply restrictions based on the number of CPUs (multiprocessing.cpu_count is a useful function there).
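For example, a small sketch of such a check following the same pattern (minimum_cpus and big_cpu_task are just illustrative names):

def minimum_cpus(n):
    from multiprocessing import cpu_count
    return cpu_count() >= n

@parallel.depend(minimum_cpus, 16)
def big_cpu_task(n):
    import os, socket
    return "cpu", socket.gethostname(), os.getpid(), n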
Here is a notebook that uses these to restrict assignment of some dumb tasks.
Typically, the model is to run one IPython engine per core (not per node), but if you have specific multicore tasks, then you may want to use a smaller number (e.g. N/2 or N/4). If your tasks are really big, then you may actually want to restrict it to one engine per node. If you are running more engines per node, then you will want to be a bit careful about running high-resource tasks together. As I have written them, these checks do not take into account other tasks on the same node, so if a node has 16 GB of RAM and you have two tasks that each need 10, you will need to be more careful about how you track available resources.