Why does threading increase processing time?

I was working on parallelizing a basic 2-D DLA simulation. Diffusion Limited Aggregation (DLA) is when particles perform a random walk and stick to the current aggregate when they touch it.
In the simulation, I have 10,000 particles, each walking in a random direction at every step. I use a pool of workers and a queue to feed them: I put lists of particles on the queue, and each worker calls the method .updatePositionAndAggregate() on every particle in its list.
With one worker, I feed it a list of 10,000 particles; with two workers, I feed them a list of 5,000 particles each; with three workers, a list of 3,333 particles each; and so on.
Here is the code for the worker:
class Worker(Thread):
    """
    The worker class is here to process a list of particles and try to aggregate
    them.
    """
    def __init__(self, name, particles):
        """
        Initialize the worker and its events.
        """
        Thread.__init__(self, name = name)
        self.daemon = True
        self.particles = particles
        self.start()
    def run(self):
        """
        The worker is started just after its creation and waits to be fed with a
        list of particles in order to process them.
        """
        while True:
            particles = self.particles.get()
            # print self.name + ': wake up with ' + str(len(self.particles)) + ' particles' + '\n'
            # Processing the particles that have been fed.
            for particle in particles:
                particle.updatePositionAndAggregate()
            self.particles.task_done()
            # print self.name + ': is done' + '\n'
And in the main thread:
# Create the workers.
workerQueue = Queue(num_threads)
for i in range(0, num_threads):
    Worker("worker_" + str(i), workerQueue)
# We run the simulation until all the particles have been created
while some_condition():
    # Feed all the workers.
    startWorker = datetime.datetime.now()
    for i in range(0, num_threads):
        j = i * len(particles) / num_threads
        k = (i + 1) * len(particles) / num_threads
        # Feeding the worker thread.
        # print "main: feeding " + worker.name + ' ' + str(len(worker.particles)) + ' particles\n'
        workerQueue.put(particles[j:k])
    # Wait for all the workers
    workerQueue.join()
    workerDurations.append((datetime.datetime.now() - startWorker).total_seconds())
print sum(workerDurations) / len(workerDurations)
So I print the average time spent waiting for the workers to finish their tasks. I ran some experiments with different numbers of threads.
| num threads | average workers duration (s.) |
|-------------|-------------------------------|
| 1 | 0.147835636364 |
| 2 | 0.228585818182 |
| 3 | 0.258296454545 |
| 10 | 0.294294636364 |
I really wonder why adding workers increases the processing time. I thought that having at least 2 workers would decrease the processing time, but it increases dramatically, from 0.14 s to 0.23 s. Can you explain why?
EDIT:
So the explanation is Python's threading implementation. Is there a way to get real multitasking?

This happens because Python threads don't execute at the same time: CPython can run only one thread of Python code at a time due to the GIL (Global Interpreter Lock).
While one thread holds the GIL, every other thread waiting to run Python code is blocked. The interpreter hands the lock from thread to thread, so CPU-bound threads take turns instead of running in parallel. Creating threads and switching between them also costs time.
Frankly, the details of the code barely matter here: for CPU-bound work in CPython, code using 100 threads is slower than the same code using 10 threads, because the extra threads add overhead without adding parallelism (and even without a GIL, more threads do not always mean more speed).
Here's an exact quote from the Python docs:
CPython implementation detail:
In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
Wikipedia about GIL
StackOverflow about GIL

Threads in Python (at least in 2.7) are NOT executed simultaneously because of the GIL: https://wiki.python.org/moin/GlobalInterpreterLock - they run in a single process and take turns on the CPU, so you can't use threads to speed up a CPU-bound computation.
If you want to use parallel computation to speed up your calculation (at least in python2.7), use processes - package multiprocessing.
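To make the suggestion concrete, here is a minimal sketch (Python 3) of splitting a particle list across processes with multiprocessing.Pool. The names walk_chunk, split and step_parallel are made up for illustration, particles are reduced to plain (x, y) tuples, and the aggregation step is left out: each process works on a copy of its chunk, so anything shared, such as the aggregate, would have to be merged back in the parent.

```python
import multiprocessing as mp
import random

# Hypothetical stand-in for updatePositionAndAggregate(): one random lattice step.
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def walk_chunk(chunk):
    """Move every (x, y) position in the chunk by one random step."""
    moved = []
    for x, y in chunk:
        dx, dy = random.choice(STEPS)
        moved.append((x + dx, y + dy))
    return moved


def split(items, n):
    """Split items into n contiguous slices of nearly equal length."""
    size = len(items)
    return [items[i * size // n:(i + 1) * size // n] for i in range(n)]


def step_parallel(pool, particles, n_chunks):
    """Advance all particles by one step, one chunk per task."""
    chunks = pool.map(walk_chunk, split(particles, n_chunks))
    return [p for chunk in chunks for p in chunk]


if __name__ == "__main__":
    particles = [(0, 0)] * 10000
    with mp.Pool(processes=2) as pool:
        particles = step_parallel(pool, particles, 2)
    print(len(particles))
```

Whether this beats the single-threaded loop depends on how the per-step work compares with the cost of pickling the chunks to and from the workers; for very cheap steps the overhead may well dominate.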

This is due to Python's global interpreter lock. Unfortunately, with the GIL, threads running Python code block one another, so a CPU-bound program will never use more than 1 CPU core. Have a look here to get started on understanding the GIL: https://wiki.python.org/moin/GlobalInterpreterLock
Check your running processes (Task Manager in Windows, for example) and you will notice that only one core is being utilized by your Python application.
I would suggest looking at multi-processing in Python, which is not hindered by the GIL: https://docs.python.org/2/library/multiprocessing.html

It takes time to create the other thread and start processing on it. Since we don't control the scheduler, I'm willing to bet both of these threads get scheduled on the same core (since the work is so small), so you are adding the time it takes to create the thread while no parallel processing is done.

Related

Multiprocessing and multithreading in Python

I have a Python program which 1) reads a very large file from disk (~95% of the time) and then 2) processes it and produces a relatively small output (~5% of the time). This program is to be run on terabytes of files.
Now I am looking to optimize this program using multiprocessing and multithreading. The platform I am running on is a virtual machine with 4 processors.
I plan to have a scheduler process that starts 4 processes (the same as the number of processors), and each process should have some threads, since most of the work is I/O. Each thread will process 1 file and report the result to its main thread, which in turn will report it back to the scheduler process via IPC. The scheduler can queue these and eventually write them to disk in order.
So I am wondering: how does one decide the number of processes and threads to create for such a scenario? Is there a mathematical way to figure out the best mix?
Thank you

I think I would arrange it the inverse of what you are doing. That is, I would create a thread pool of a certain size that would be responsible for producing the results. The tasks that get submitted to this pool would be passed as an argument a processor pool that could be used by the worker thread for submitting the CPU-bound portions of work. In other words, the thread pool workers would primarily be doing all the disk-related operations and handing off to the processor pool any CPU-intensive work.
The size of the processor pool should just be the number of processors you have in your environment. It's difficult to give a precise size for the thread pool; it depends on how many concurrent disk operations it can handle before the law of diminishing returns comes into play. It also depends on your memory: the larger the pool, the more memory will be used, especially if entire files have to be read into memory for processing. So you may have to experiment with this value. The code below outlines these ideas. What you gain from the thread pool is a greater overlap of I/O operations than you would achieve with just a small processor pool:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os

def cpu_bound_function(arg1, arg2):
    ...
    return some_result

def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'r') as f:
        # Do disk related operations:
        . . . # code omitted
        # Now we have to do a CPU-intensive operation:
        future = process_pool_executor.submit(cpu_bound_function, arg1, arg2)
        result = future.result() # get result
    return result

file_list = [file_1, file_2, file_n]
N_FILES = len(file_list)
MAX_THREADS = 50 # depends on your configuration on how well the I/O can be overlapped
N_THREADS = min(N_FILES, MAX_THREADS) # no point in creating more threads than required
N_PROCESSES = os.cpu_count() # use the number of processors you have
with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list)
Important Note:
Another, far simpler approach is to have a single processor pool whose size is greater than the number of CPUs you have, for example 25. The worker processes will do both I/O and CPU operations. Even though you have more processes than CPUs, many of the processes will be waiting for I/O to complete, which leaves the CPUs free for CPU-intensive work.
The downside to this approach is that the overhead of creating N processes is far greater than the overhead of creating N threads plus a small number of processes. However, as the running time of the tasks submitted to the pool grows, this extra overhead becomes a smaller and smaller percentage of the total run time. So, if your tasks are not trivial, this could be a reasonably performant simplification.
Update: Benchmarks of Both Approaches
I did some benchmarks against the two approaches processing 24 files whose sizes were approximately 10,000KB (actually, these were just 3 different files processed 8 times each, so there might have been some caching done):
Method 1 (thread pool + processor pool)
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from functools import partial
import os
from math import sqrt
import timeit

def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum

def io_bound_function(process_pool_executor, file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    future = process_pool_executor.submit(cpu_bound_function, b)
    result = future.result() # get result
    return result

def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_THREADS = 50 # depends on your configuration on how well the I/O can be overlapped
    N_THREADS = min(N_FILES, MAX_THREADS) # no point in creating more threads than required
    N_PROCESSES = os.cpu_count() # use the number of processors you have
    with ThreadPoolExecutor(N_THREADS) as thread_pool_executor:
        with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
            results = list(thread_pool_executor.map(partial(io_bound_function, process_pool_executor), file_list))
            print(results)

if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Method 2 (processor pool only)
from concurrent.futures import ProcessPoolExecutor
from math import sqrt
import timeit

def cpu_bound_function(b):
    sum = 0.0
    for x in b:
        sum += sqrt(float(x))
    return sum

def io_bound_function(file_name):
    with open(file_name, 'rb') as f:
        b = f.read()
    result = cpu_bound_function(b)
    return result

def main():
    file_list = ['/download/httpd-2.4.16-win32-VC14.zip'] * 8 + ['/download/curlmanager-1.0.6-x64.exe'] * 8 + ['/download/Element_v2.8.0_UserManual_RevA.pdf'] * 8
    N_FILES = len(file_list)
    MAX_PROCESSES = 50 # depends on your configuration on how well the I/O can be overlapped
    N_PROCESSES = min(N_FILES, MAX_PROCESSES) # no point in creating more processes than required
    with ProcessPoolExecutor(N_PROCESSES) as process_pool_executor:
        results = list(process_pool_executor.map(io_bound_function, file_list))
        print(results)

if __name__ == '__main__':
    print(timeit.timeit(stmt='main()', number=1, globals=globals()))
Results:
(I have 8 cores)
Thread Pool + Processor Pool: 13.5 seconds
Processor Pool Alone: 13.3 seconds
Conclusion: I would try the simpler approach first and just use a processor pool for everything. The tricky bit is then deciding the maximum number of processes to create, which was part of your original question and had a simple answer when the pool was only doing the CPU-intensive computations. If you are not reading too many files, the point is moot; you can have one process per file. But if you have hundreds of files, you will not want hundreds of processes in your pool (there is also an upper limit to how many processes you can create, and again there are those nasty memory constraints). There is just no way I can give you an exact number. If you do have a large number of files, start with a smallish pool size and keep increasing it until you get no further benefit (of course, you probably want to cap the number of files processed in these tests, or you will be running forever just deciding on a good pool size for the real run).

For parallel processing:
I saw this question, and quoting the accepted answer:
In practice, it can be difficult to find the optimal number of threads and even that number will likely vary each time you run the program. So, theoretically, the optimal number of threads will be the number of cores you have on your machine. If your cores are "hyper threaded" (as Intel calls it) it can run 2 threads on each core. Then, in that case, the optimal number of threads is double the number of cores on your machine.
For multiprocessing:
Someone asked a similar question here, and the accepted answer said this:
If all of your threads/processes are indeed CPU-bound, you should run as many processes as the CPU reports cores. Due to HyperThreading, each physical CPU cores may be able to present multiple virtual cores. Call multiprocessing.cpu_count to get the number of virtual cores.
If only p of 1 of your threads is CPU-bound, you can adjust that number by multiplying by p. For example, if half your processes are CPU-bound (p = 0.5) and you have two CPUs with 4 cores each and 2x HyperThreading, you should start 0.5 * 2 * 4 * 2 = 8 processes.
The key here is to understand what machine you are using; from that, you can choose a nearly optimal number of threads/processes to split the execution of your code. I say nearly optimal because it will vary a little every time you run your script, so it is difficult to predict this optimal number from a mathematical point of view.
For your specific situation, if your machine has 4 cores, I would recommend creating at most 4 threads, split as follows:
1 to the main thread.
3 for file reading and process.
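The rule of thumb quoted above (logical cores multiplied by the CPU-bound fraction p) can be written down as a small helper. This is only a sketch of that heuristic as quoted, not a measured optimum; the function name suggested_process_count and its parameters are made up for illustration.

```python
import os


def suggested_process_count(cpu_bound_fraction, logical_cores=None):
    """Apply the quoted heuristic: logical cores times the CPU-bound fraction p."""
    if logical_cores is None:
        # os.cpu_count() reports logical (hyper-threaded) cores and may return None
        logical_cores = os.cpu_count() or 1
    if not 0 < cpu_bound_fraction <= 1:
        raise ValueError("cpu_bound_fraction must be in (0, 1]")
    return max(1, round(cpu_bound_fraction * logical_cores))


if __name__ == "__main__":
    # The quoted example: p = 0.5, two CPUs with 4 cores each and 2x HyperThreading
    print(suggested_process_count(0.5, logical_cores=2 * 4 * 2))
```

Treat the result as a starting point for experiments rather than a final answer, since the real optimum depends on the workload.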

Using multiple processes to speed up I/O performance may not be a good idea. Check this and the sample code below it to see whether it is helpful.

One idea is to have one thread that only reads the file (if I understood correctly, there is only one file) and pushes the independent parts (for example, rows) into a queue as messages.
The messages can be processed by 4 threads. In this way, you can optimize the load between the processors.
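A minimal sketch of that layout, assuming the independent parts are lines of text: one reader thread fills a bounded queue and four worker threads drain it. The names reader, worker and process_lines are hypothetical, and the per-line "processing" (a word count) is only a placeholder. Because of the GIL, the worker threads help mainly when the processing waits on I/O or runs in code that releases the GIL; for heavy pure-Python processing the workers would probably need to be processes instead.

```python
import queue
import threading

SENTINEL = None  # marks the end of the input for one worker


def reader(lines, work_queue, n_workers):
    """Push every line into the queue, then one sentinel per worker."""
    for line in lines:
        work_queue.put(line)
    for _ in range(n_workers):
        work_queue.put(SENTINEL)


def worker(work_queue, results, lock):
    """Consume lines until a sentinel arrives; placeholder work = word count."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        value = len(item.split())
        with lock:
            results.append(value)


def process_lines(lines, n_workers=4):
    """Run one reader thread and n_workers worker threads over `lines`."""
    work_queue = queue.Queue(maxsize=100)  # bounded, so the reader cannot run far ahead
    results = []
    lock = threading.Lock()
    workers = [threading.Thread(target=worker, args=(work_queue, results, lock))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    reader_thread = threading.Thread(target=reader, args=(lines, work_queue, n_workers))
    reader_thread.start()
    reader_thread.join()
    for t in workers:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(process_lines(["a b", "c", "d e f"])))
```

`lines` can be any iterable, including an open file object, so the reader never has to hold the whole file in memory.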

On a strongly I/O-bound process (like what you are describing), you do not necessarily need multithreading nor multiprocessing: you could also use more advanced I/O primitives from your OS.
For example, on Linux you can submit read requests to the kernel along with a suitably sized mutable buffer and be notified when the buffer is filled. This can be done using the AIO API, for which I've written a pure-Python binding, python-libaio (libaio on PyPI), or with the more recent io_uring API, for which there seems to be a CFFI Python binding (liburing on PyPI) (I have used neither io_uring nor this Python binding).
This removes the complexity of parallel processing at your level, may reduce the number of OS/userland context switches (reducing the cpu time even further), and lets the OS know more about what you are trying to do, giving it the opportunity of scheduling the IO more efficiently (in a virtualised environment I would not be surprised if it reduced the number of data copies, although I have not tried it myself).
Of course, the downside is that your program will be more tightly bound to the OS you are executing it on, requiring more effort to get it to run on another one.

Running multiple python scripts from a single script, and communicating back and forth between them?

I wrote a script that accepts arguments, and I want to launch multiple simultaneous instances of it (maybe 100+) with unique arguments. My plan was to write another Python script which then launches these subscripts/processes; however, to be effective, I need that script to be able to monitor the subscripts for any errors.
Is there a straightforward way to do this, or a library that offers this functionality? I've been searching for a while and am not having much luck finding anything. Creating subprocesses and multiple threads seems straightforward enough, but I can't really find any guides or tutorials on how to then communicate with those threads/subprocesses.

A better way to do this would be to make use of threads. If you made the script you want to call into a function in this larger script, you could have your main function call this script as many times as you want and have the threads report back with information as needed. You can read a little bit about how threads work here.

I suggest using threading.Thread or multiprocessing.Process, depending on your requirements.
A simple way to communicate between threads or processes is to use a Queue. The multiprocessing module provides several ways to communicate between processes (Queue, Event, Manager, ...).
You can see some elementary communication in this example:
import threading
from Queue import Queue
import random
import time

class Worker(threading.Thread):
    def __init__(self, name, queue_error):
        threading.Thread.__init__(self)
        self.name = name
        self.queue_error = queue_error

    def run(self):
        time.sleep(random.randrange(1, 10))
        # Do some processing ...
        # Report errors
        self.queue_error.put((self.name, 'Error state'))

class Launcher(object):
    def __init__(self):
        self.queue_error = Queue()

    def main_loop(self):
        # Start threads
        for i in range(10):
            w = Worker(i, self.queue_error)
            w.start()
        # Check for errors
        while True:
            while not self.queue_error.empty():
                error_data = self.queue_error.get()
                print 'Worker #%s reported error: %s' % (error_data[0], error_data[1])
            time.sleep(0.1)

if __name__ == '__main__':
    l = Launcher()
    l.main_loop()

Like someone else said, you have to use multiple processes rather than threads for true parallelism, because the GIL prevents threads from running Python code in parallel.
If you want to use the standard multiprocessing library (which is based on launching multiple processes), I suggest using a pool of workers. If I understood correctly, you want to launch 100+ parallel instances. Launching 100+ processes on one host will generate too much overhead. Instead, create a pool of P workers where P is for example the number of cores in your machine and submit the 100+ jobs to the pool. This is simple to do and there are many examples on the web. Also, when you submit jobs to the pool, you can provide a callback function to receive errors. This may be sufficient for your needs (there are examples here).
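As a rough sketch of the "small pool, many jobs" idea applied to launching scripts, the example below swaps in a different tool from the one named above: a concurrent.futures thread pool in which each thread merely waits on one child process started with subprocess.run, so at most max_parallel children run at once. CHILD_CODE is a hypothetical stand-in for the real script (it fails on odd arguments), and run_one and run_all are illustrative names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical child script: exits with an error message for odd arguments.
CHILD_CODE = (
    "import sys; n = int(sys.argv[1]); "
    "sys.exit('odd input %d' % n if n % 2 else 0)"
)


def run_one(arg):
    """Run one child process and return (argument, exit code, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(arg)],
        capture_output=True,
        text=True,
    )
    return arg, proc.returncode, proc.stderr.strip()


def run_all(args, max_parallel=4):
    """Run one child per argument, at most max_parallel at a time.

    Returns a dict mapping each failed argument to its error output.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_one, a) for a in args]
        for fut in as_completed(futures):  # handle each child as soon as it exits
            arg, code, err = fut.result()
            if code != 0:
                failures[arg] = err
    return failures


if __name__ == "__main__":
    print(run_all(range(6), max_parallel=3))
```

Threads are enough here because the heavy work happens in the child processes; the parent only waits for them and inspects their exit codes.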
The Pool in multiprocessing however can't distribute work across multiple hosts (e.g. cluster of machines) last time I looked. So, if you need to do this, or if you need a more flexible communication scheme, like being able to send updates to the controlling process while the workers are running, my suggestion is to use charm4py (note that I am a charm4py developer so this is where I have experience).
With charm4py you could create N workers which are distributed among P processes by the runtime (works across multiple hosts), and the workers can communicate with the controller simply by doing remote method invocation. Here is a small example:
from charm4py import charm, Chare, Group, Array, ArrayMap, Reducer, threaded
import time

WORKER_ITERATIONS = 100

class Worker(Chare):

    def __init__(self, controller):
        self.controller = controller

    @threaded
    def work(self, x, done_future):
        result = -1
        try:
            for i in range(WORKER_ITERATIONS):
                if i % 20 == 0:
                    # send status update to controller
                    self.controller.progressUpdate(self.thisIndex, i, ret=True).get()
                if i == 5 and self.thisIndex[0] % 2 == 0:
                    # trigger NameError on even-numbered workers
                    test[3] = 3
                time.sleep(0.01)
            result = x**2
        except Exception as e:
            # send error to controller
            self.controller.collectError(self.thisIndex, e)
        # send result to controller
        self.contribute(result, Reducer.gather, done_future)

# This custom map is used to prevent workers from being created on process 0
# (where the controller is). Not strictly needed, but allows more timely
# controller output
class WorkerMap(ArrayMap):
    def procNum(self, index):
        return (index[0] % (charm.numPes() - 1)) + 1

class Controller(Chare):

    def __init__(self, args):
        self.startTime = time.time()
        done_future = charm.createFuture()
        # create 12 workers, which are distributed by charm4py among processes
        workers = Array(Worker, 12, args=[self.thisProxy], map=Group(WorkerMap))
        # start work
        for i in range(12):
            workers[i].work(i, done_future)
        print('Results are', done_future.get()) # wait for result
        exit()

    def progressUpdate(self, worker_id, current_step):
        print(round(time.time() - self.startTime, 3), ': Worker', worker_id,
              'progress', current_step * 100 / WORKER_ITERATIONS, '%')
        # the controller can return a value here and the worker would receive it

    def collectError(self, worker_id, error):
        print(round(time.time() - self.startTime, 3), ': Got error', error,
              'from worker', worker_id)

charm.start(Controller)
In this example, the Controller will print status updates and errors as they happen. It
will print final results from all workers when they are all done. The result for workers
that have failed will be -1.
The number of processes P is given at launch. The runtime will distribute the N workers among the available processes. This happens when the workers are created and there is no dynamic load balancing in this particular example.
Also, note that in the charm4py model remote method invocation is asynchronous and returns a future which the caller can block on, but only the calling thread blocks (not the whole process).
Hope this helps.

How to perform batch computation in python by adding processes as soon as cores become free?

Bash has the function "wait -n" which can be used in a relatively trivial way to halt subsequent execution of child processes until a certain number of processor cores have been made available. E.g. I can do the following,
for IJOB in IJOBRANGE;
do
    # run each job in the background so that "wait -n" has something to wait for
    ./func.x ${IJOB} &
    # checking the number of background processes
    # and halting the execution accordingly
    bground=( $(jobs -p) );
    if (( ${#bground[@]} >= CORES )); then
        wait -n
    fi
done || exit 1
This snippet can batch execute an arbitrary C process "func.x" with varying arguments and always maintains a fixed number of parallel instances of the child processes, set to the value "CORES".
I was wondering if something similar could be done with a Python script and Python child processes (or functions). Currently, I define a Python function, set up a one-dimensional parameter array and use the Pool routine from the Python multiprocessing module to compute the function in parallel over the parameter array. The pool functions perform a set number (the number of CPU cores in the following example) of evaluations of my function and wait until all the spawned processes have finished before moving on to the next batch.
import multiprocessing as mp

def func(x):
    # some computation with x
    pass

def main(j):
    # setting the parameter array
    xarray = range(j)
    pool = mp.Pool()
    pool.map(func,xarray)
I would like to know if it is possible to modify this snippet so that it always keeps a fixed number of parallel computations of my subroutine running, i.e. adds another process as soon as one of the child processes has finished. All the "func" processes here are supposed to be independent, and the order of execution does not matter either. I am new to the Python way of doing things, and it would be really great to have some helpful perspectives.

Following our discussion in the comments, here's some test code adapted from yours that shows Pools don't wait for all parallel tasks to complete before assigning a new one to available workers:
import multiprocessing as mp
from time import sleep, time

def func(x):
    """sleeps for x seconds"""
    name = mp.current_process().name
    print("{} {}: sleep {}".format(time(), name, x))
    sleep(x)
    print("{} {}: done sleeping".format(time(), name))

def main():
    # A pool of two processes, for the sake of simplicity
    pool = mp.Pool(processes=2)
    # Here's how that works out visually:
    #
    #    0s        1s        2s        3s
    # P1 [sleep(1)][     sleep(2)     ]
    # P2 [     sleep(2)     ][sleep(1)]
    sleeps = [1, 2, 2, 1]
    pool.map(func, sleeps)

if __name__ == "__main__":
    main()
Running this code gives (timestamps simplified for clarity):
$ python3 mp.py
0s: ForkPoolWorker-1: sleep 1
0s: ForkPoolWorker-2: sleep 2
1s: ForkPoolWorker-1: done sleeping
1s: ForkPoolWorker-1: sleep 2
2s: ForkPoolWorker-2: done sleeping
2s: ForkPoolWorker-2: sleep 1
3s: ForkPoolWorker-1: done sleeping
3s: ForkPoolWorker-2: done sleeping
We can see that the first process doesn't wait for the second process to complete its first task before starting its second task.
So I guess that should answer the point you were raising, hope I've understood you clearly.
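If the results should also be consumed as soon as each task finishes, rather than in input order, one option is Pool.imap_unordered with chunksize=1. The sketch below is a hedged illustration of that variant, not part of the test above: func is a stand-in computation, and run_batch is a made-up helper name.

```python
import multiprocessing as mp


def func(x):
    """Stand-in for the real computation; returns the input with its result."""
    return x, x * x


def run_batch(params, n_workers):
    """Return (param, result) pairs in the order the tasks complete."""
    with mp.Pool(processes=n_workers) as pool:
        # chunksize=1 hands out one task at a time, so a free worker
        # immediately picks up the next parameter
        return list(pool.imap_unordered(func, params, chunksize=1))


if __name__ == "__main__":
    for param, result in sorted(run_batch(range(8), n_workers=2)):
        print(param, result)
```

With a larger chunksize each worker receives several parameters at once, which lowers the dispatch overhead but can leave a worker idle near the end if task durations vary a lot.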

how to start multiple jobs in python and communicate with the main job

I am a novice user of python multithreading/multiprocessing, so please bear with me.
I would like to solve the following problem and I need some help/suggestions in this regard.
Let me describe in brief:
I would like to start a python script which does something in the beginning sequentially.
After the sequential part is over, I would like to start some jobs in parallel.
Assume that there are four parallel jobs I want to start.
I would like to also start these jobs on some other machines using "lsf" on the computing cluster. My initial script is also running on an "lsf" machine.
The four jobs which I started on four machines will perform two logical steps A and B---one after the other.
When a job started initially, they start with logical step A and finish it.
After every job (4jobs) has finished the Step A; they should notify the first job which started these. In other words, the main job which started is waiting for the confirmation from these four jobs.
Once the main job receives confirmation from these four jobs; it should notify all the four jobs to do the logical step B.
Logical step B will automatically terminate the jobs after finishing the task.
Main job is waiting for the all jobs to finish and later on it should continue with the sequential part.
An example scenario would be:
Python script running on an “lsf” machine in the cluster starts four "tcl shells" on four “lsf” machines.
In each tcl shell, a script is sourced to do the logical step A.
Once the step A is done, somehow they should inform the python script which is waiting for the acknowledgement.
Once the acknowledgement is received from all the four, python script inform them to do the logical step B.
Logical step B is also a script which is sourced in their tcl shell; this script will also close the tcl shell at the end.
Meanwhile, python script is waiting for all the four jobs to finish.
After all four jobs are finished; it should continue with the sequential part again and finish later on.
Here are my questions:
I am confused about---should I use multithreading/multiprocessing. Which one suits better?
In fact what is the difference between these two? I read about these but I wasn't able to conclude.
What is python GIL? I also read somewhere at any one point in time only one thread will execute.
I need some explanation here. It gives me an impression that I can't use threads.
Any suggestions on how could I solve my problem systematically and in a more pythonic way.
I am looking for some verbal step by step explanation and some pointers to read on each step.
Once the concepts are clear, I would like to code it myself.
Thanks in advance.

In addition to roganjosh's answer, I would include some signaling to start the step B after A has finished:
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue, proceed):
    print "Process {} has been created".format(process_number)
    print "Process {} has ended step A".format(process_number)
    sys.stdout.flush()
    queue.put((process_number, "done"))
    proceed.wait() #wait for the signal to do the second part
    print "Process {} has ended step B".format(process_number)
    sys.stdout.flush()

def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()
    # the event must be passed to every child so that it can wait on it
    processes = [mp.Process(target=func_A, args=(x, queue, proceed)) for x in range(4)]
    for p in processes:
        p.start()
    #block = True waits until there is something available
    results = [queue.get(block=True) for p in processes]
    proceed.set() #set continue-flag
    for p in processes: #wait for all to finish (also in windows)
        p.join()
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs

1) From the options you listed in your question, you should probably use multiprocessing in this case to leverage multiple CPU cores and compute things in parallel.
2) Going further from point 1: the Global Interpreter Lock (GIL) means that only one thread can actually execute code at any one time.
A simple example for multithreading that pops up often here is having a prompt for user input for, say, an answer to a maths problem. In the background, they want a timer to keep incrementing at one second intervals to register how long the person took to respond. Without multithreading, the program would block whilst waiting for user input and the counter would not increment. In this case, you could have the counter and the input prompt run on different threads so that they appear to be running at the same time. In reality, both threads are sharing the same CPU resource and are constantly passing an object backwards and forwards (the GIL) to grant them individual access to the CPU. This is hopeless if you want to properly process things in parallel. (Note: In reality, you'd just record the time before and after the prompt and calculate the difference rather than bothering with threads.)
3) I have made a really simple example using multiprocessing. In this case, I spawn 4 processes that compute the sum of squares for a randomly chosen range. These processes do not have a shared GIL and therefore execute independently unlike multithreading. In this example, you can see that all processes start and end at slightly different times, but we can aggregate the results of the processes into a single queue object. The parent process will wait for all 4 child processes to return their computations before moving on. You could then repeat the code for func_B (not included in the code).
import multiprocessing as mp
import time
import random
import sys

def func_A(process_number, queue):
    start = time.time()
    print "Process {} has started at {}".format(process_number, start)
    sys.stdout.flush()
    my_calc = sum([x**2 for x in xrange(random.randint(1000000, 3000000))])
    end = time.time()
    print "Process {} has ended at {}".format(process_number, end)
    sys.stdout.flush()
    queue.put((process_number, my_calc))

def multiproc_master():
    queue = mp.Queue()
    processes = [mp.Process(target=func_A, args=(x, queue)) for x in xrange(4)]
    for p in processes:
        p.start()
    # Unhash the below if you run on Linux (Windows and Linux treat multiprocessing
    # differently as Windows lacks os.fork())
    #for p in processes:
    #    p.join()
    results = [queue.get() for p in processes]
    return results

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print split_jobs

Python multiprocessing to binary gives bad scaling

I need to run multiple instances of a binary in parallel. For this I am using python multiprocessing module. The binary itself has a parallelization which can be set using OMP_NUM_THREADS environment variable. A minimalist example of my code is the following
import sys
import os
from numpy import *
import time
import xml.etree.ElementTree as ET
from multiprocessing import Process, Queue

def cal_dist(filename):
    tic = time.time()
    ################################### COPY THE INPUT FILE ########################################
    tree = ET.parse(inputfilename+'.feb')
    tree.write(filename+'.feb',xml_declaration=True,encoding="ISO-8859-1")
    ##################################### SUBMIT THE JOB ###########################################
    os.system('export OMP_NUM_THREADS=12')
    os.system('$HOME/febiosource-2.0/bin/febio2.lnx64 -noconfig -i ' + filename + '.feb -silent')
    toc = time.time()
    print "Job %s completed in %5.2f minutes" %(filename,(toc-tic)/60.);
    return

# INPUT PARAMETERS
inputfilename="main-step1"
tempfilename='temp';
nCPU=7;
for iter in range(0,1):
    ################################### PARALLEL PROCESSING STARTS ########################################
    # CREATE ALL THE PROCESSES,
    p=[];
    maxj=nCPU;
    for j in range(0,nCPU):
        p.append(Process(target=cal_dist, args=(tempfilename+str(j),)))
    # START THE PROCESSES,
    for j in range(0,nCPU):
        p[j].start();
        time.sleep(0.2);
    # JOIN THEM,
    for j in range(0,nCPU):
        p[j].join();
    ################################### PARALLEL PROCESSING ENDS ########################################
If I set OMP_NUM_THREADS=1, then increasing the nCPU gives a good scaling. That is,
for nCPU=1, job time=3.5 minutes
for nCPU=7, job time=4.2 minutes
However, if I set OMP_NUM_THREADS=12, then increasing the nCPU gives a very bad scaling. That is,
for nCPU=1, job time=3.4 minutes
for nCPU=5, job time=5.7 minutes
for nCPU=7, job time=7.5 minutes
Any ideas on how I can solve this issue? I really need to use a high number of CPUs and OMP_NUM_THREADS for my actual problem (I know that on this computer's architecture each node has 12 processors, and I run it on nCPU*12 processors).

It looks like you're overloading your CPUs. With nCPU set to 1 with OMP_NUM_THREADS=12, you're spawning one process that uses twelve threads, which means you're keeping all your CPUs fully saturated. When you set nCPU to 7 with OMP_NUM_THREADS=12, you're spawning seven processes that use twelve threads each, which means you've got 12 * 7 = 84 threads running in parallel, fighting over 12 CPUs. My guess is this is creating a high context-switching overhead for the OS, and that's slowing you down.
With only 12 CPUs to work with, you're going to get diminishing returns if you try to run more than 12 threads+processes in parallel. (Unless a bunch of the work being done is I/O-bound, which doesn't seem to be the case here.)
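One way to act on this is to give each job a thread budget so that the number of jobs times the threads per job stays within the core count, and to pass OMP_NUM_THREADS through the child's environment. Note that each os.system() call runs in its own shell, so an `export` issued in one call does not carry over to the next. The sketch below (Python 3) uses hypothetical helper names, threads_per_job and run_with_omp_threads, and a small probe command in place of the real binary.

```python
import os
import subprocess
import sys


def threads_per_job(n_jobs, total_cores):
    """Largest per-job thread count with n_jobs * threads <= total_cores (at least 1)."""
    return max(1, total_cores // n_jobs)


def run_with_omp_threads(cmd, omp_threads):
    """Run cmd with OMP_NUM_THREADS set in the child's environment; return its stdout."""
    env = dict(os.environ, OMP_NUM_THREADS=str(omp_threads))
    done = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return done.stdout


if __name__ == "__main__":
    n_jobs, total_cores = 7, 12
    budget = threads_per_job(n_jobs, total_cores)
    # Hypothetical stand-in for the real binary: a child that reports the value it sees
    probe = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(budget, run_with_omp_threads(probe, budget).strip())
```

With 7 jobs on 12 cores this yields one thread per job, which matches the configuration that scaled well in the question; whether a different split is faster for the real workload would need to be measured.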
