I am trying to run a parallel loop on a simple example.
What am I doing wrong?
from joblib import Parallel, delayed
import multiprocessing

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=4)(delayed(processInput)(i) for i in inputs)
    print(results)
The problem with the code is that, when executed on Windows under Python 3, it opens num_cores instances of Python to execute the parallel jobs, but only one is active. This should not be the case: processor activity should be 100% instead of 14% (on an i7 with 8 logical cores).
Why are the extra instances not doing anything?
Continuing on your request to provide working multiprocessing code, I suggest that you use pool.map (if the delayed functionality is not important). I'll give you an example; if you're working in Python 3 it's worth mentioning that you can use starmap.
Also worth mentioning that you can use map_async/starmap_async if you don't want to block while waiting for the results, and imap_unordered if the order of the returned results does not have to correspond to the order of inputs.
import multiprocessing as mp

def processInput(i):
    return i * i

if __name__ == '__main__':
    # what are your inputs, and what operation do you want to
    # perform on each input. For example...
    inputs = range(1000000)
    # removing the processes argument makes the code run on all available cores
    pool = mp.Pool(processes=4)
    results = pool.map(processInput, inputs)
    print(results)
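For completeness, here is a small starmap sketch; multiply is a hypothetical two-argument worker, not part of the question's code:

import multiprocessing as mp

def multiply(a, b):
    return a * b

if __name__ == '__main__':
    pairs = [(1, 2), (3, 4), (5, 6)]
    with mp.Pool(processes=4) as pool:
        # starmap unpacks each tuple into the worker's arguments
        products = pool.starmap(multiply, pairs)
    print(products)  # [2, 12, 30]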
On Windows, the multiprocessing module uses the 'spawn' method to start up multiple python interpreter processes. This is relatively slow. Parallel tries to be smart about running the code. In particular, it tries to adjust batch sizes so a batch takes about half a second to execute. (See the batch_size argument at https://pythonhosted.org/joblib/parallel.html)
Your processInput() function runs so fast that Parallel determines that it is faster to run the jobs serially on one processor than to spin up multiple python interpreters and run the code in parallel.
If you want to force your example to run on multiple cores, try setting batch_size to 1000 or making processInput() more complicated so it takes longer to execute.
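For instance, a minimal sketch of the batch_size route (reusing the question's processInput; 1000 is just the illustrative value from above):

from joblib import Parallel, delayed

def processInput(i):
    return i * i

if __name__ == '__main__':
    inputs = range(1000000)
    # batch_size groups many tiny tasks into one dispatch per worker,
    # amortizing the cost of sending work between processes
    results = Parallel(n_jobs=4, batch_size=1000)(
        delayed(processInput)(i) for i in inputs)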
Edit: Working example on Windows that shows multiple processes in use (I'm using Windows 7):
from joblib import Parallel, delayed
from os import getpid

def modfib(n):
    # print the process id to see that multiple processes are used, and
    # re-used during the job.
    if n % 400 == 0:
        print(getpid(), n)

    # Fibonacci sequence mod 1000000
    a, b = 0, 1
    for i in range(n):
        a, b = b, (a + b) % 1000000
    return b

if __name__ == "__main__":
    Parallel(n_jobs=-1, verbose=5)(delayed(modfib)(j) for j in range(1000, 4000))
I would like to use Python's multiprocessing.Pool to limit the number of parallel processes, and then run processes such that two of them always run in sequence, one after another. There is no data exchange between the processes.
This is what I came up with
import multiprocessing
import os

def setup():
    print(f"Hello, my PID is {os.getpid()}\n")

def compute():
    print(f"Fellow, my PID is {os.getpid()}\n")

if __name__ == "__main__":
    # creating a pool object
    p = multiprocessing.Pool(processes=4)
    for i in range(10):  # 10 > processes !
        # Setup needs to run first
        result = p.apply_async(setup)
        # and finish before
        result.wait()
        # compute is called.
        p.apply_async(compute)
I first want all (maximally 4) processes to run setup in parallel. Then I want to make sure that a specific setup is completed, before calling its corresponding compute. Is this what it does? The output looks right, but with asynchronous output/execution that's not the right measure.
Alternatively,
if __name__ == "__main__":
# creating a pool object
p = multiprocessing.Pool(processes=4)
# Run all setups
# 10 > 4 available processes !
setup_results = [p.apply_async(setup) for i in range(10)]
# Run all computes
compute_results = [p.apply_async(compute) for i in range(10)]
The documentation states that
# launching multiple evaluations asynchronously *may* use more processes
multiple_results = [pool.apply_async(os.getpid, ()) for i in range(4)]
print([res.get(timeout=1) for res in multiple_results])
Do I have to worry about oversubscribing the machine? How can I make sure that at most 4 CPUs are used at all times?
Shouldn't I have something like "wait" after the loop-call to setup?
I expect the alternative (loop setup, then loop compute) to be faster, because there's no wait. Is that true?
First of all, and this is more of an aside, in your original code where you have:
result = p.apply_async(setup)
# and finish before
result.wait()
This can be simplified to:
return_value = p.apply(setup) # return_value will be None
That is, just use the blocking method apply, which returns the return value from your worker function, setup.
Now for your second alternative:
You are calling the non-blocking method apply_async 10 times to perform setup in parallel, but you are not waiting for those 10 tasks to complete before doing likewise with function compute. So even though you have a pool size of 4, once a pool process finishes a setup task and no setup tasks remain in the queue, it is free to start working on the compute tasks that have already been submitted. In general you will end up with compute tasks executing while some setup tasks are still executing. This is not what you want. Instead:
if __name__ == "__main__":
# Use maximum pool size but no greater than number of tasks being submitted:
#pool_size = min(10, multiprocessing.cpu_count())
pool_size = 4
# creating a pool object
p = multiprocessing.Pool(processes=pool_size)
# Run all setups
setup_results = [p.apply_async(setup) for i in range(10)]
# Wait for the above tasks to complete:
for setup_result in setup_results:
setup_result.get() # Get None return value
# Run all computes
compute_results = [p.apply_async(compute) for i in range(10)]
for compute_result in compute_results:
compute_result.get() # Get None return value
Or use blocking methods. For example:
if __name__ == "__main__":
# Use maximum pool size but no greater than number of tasks being submitted:
#pool_size = min(10, multiprocessing.cpu_count())
pool_size = 4
# creating a pool object
p = multiprocessing.Pool(processes=pool_size)
# Run all setups
setup_results = p.map(setup, range(10))
# Run all computes
compute_results = p.map(compute, range(10))
I have left the pool size at 4 in case you need to artificially restrict the number of parallel tasks.
But note that in both examples I have left commented-out code showing how to use all available CPU cores for the pool. That sizing assumes the worker functions are mostly CPU-bound, in which case there is no point in creating a pool larger than the number of cores; a larger pool could only be advantageous if there were a lot of I/O or network waiting involved in these functions. If the processing were mostly I/O, you would probably be better off using a multithreading pool whose size is the number of concurrent tasks being submitted, up to some large upper bound. Also, there is no point in creating a pool larger than the number of tasks that will be submitted, which is 10 in this case, regardless of how many cores are available.
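As a rough sketch of that multithreading alternative, assuming the work is mostly I/O-bound (io_task is a hypothetical stand-in, not the question's setup/compute):

from multiprocessing.pool import ThreadPool

def io_task(i):
    # hypothetical I/O-bound task, e.g. reading a file or calling an API
    return i

if __name__ == "__main__":
    # Threads are cheap, so the pool can be sized to the number of concurrent
    # tasks when the workers spend most of their time waiting on I/O.
    with ThreadPool(10) as tp:
        results = tp.map(io_task, range(10))
    print(results)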
I have edited the code and it currently works, but I think it is not executing in parallel. Can anyone please check it?
Code:
import time
from functools import partial
from multiprocessing import Pool, freeze_support

def folderStatistic(t):
    j, dir_name = t
    row = []
    for content in dir_name.split(","):
        row.append(content)
    print(row)

def get_directories():
    import csv
    with open('CONFIG.csv', 'r') as file:
        reader = csv.reader(file, delimiter='\t')
        return [col for row in reader for col in row]

def folderstatsMain():
    freeze_support()
    start = time.time()
    pool = Pool()
    worker = partial(folderStatistic)
    pool.map(worker, enumerate(get_directories()))

def datatobechecked():
    try:
        folderstatsMain()
    except Exception as e:
        # pass
        print(e)

if __name__ == '__main__':
    datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
Welcome to Stack Overflow and the Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in a with context, get the reader object, and close the file the moment you leave the context, so by the time you use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming, do not dive into parallel programming yet. The difficulty of handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this process, though). Processes are even worse, as they don't share memory and can't communicate with each other easily.
My advice is to write it as a single-threaded program first. If you have it working and still need to parallelize it, isolate a single function that does all the work and takes the input file path as a parameter, and then use a thread/process pool on that function.
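For example, a minimal sketch of that pattern (process_path is a hypothetical worker name, not taken from your code):

from multiprocessing import Pool

def process_path(path):
    # hypothetical worker: do all the per-folder work here,
    # taking a single folder path as its only input
    return path.lower()

if __name__ == '__main__':
    paths = [r'C:\USERS', r'C:\WINDOWS']
    with Pool() as pool:
        results = pool.map(process_path, paths)
    print(results)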
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then, for each "cell" in the file, you run folderStatistic in parallel. This part seems correct. The problem may lie in dir_name.split(","): notice that you pass individual "cells" to folderStatistic, not rows. What makes you think it's not running in parallel?
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive that these additional costs are small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are too small to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing time for invoking a worker function, fn, twice: the first time it performs its internal loop only 10 times (low CPU requirements), while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerably slower (you can't even measure the time for the single-processing run). But when we make fn more CPU-intensive, multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time

def fn(iterations, x):
    the_sum = x
    for _ in range(iterations):
        the_sum += x
    return the_sum

# required for Windows:
if __name__ == '__main__':
    for n_iterations in (10, 1_000_000):
        # single processing time:
        t1 = time.time()
        for x in range(1, 20):
            fn(n_iterations, x)
        t2 = time.time()
        # multiprocessing time:
        worker = partial(fn, n_iterations)
        t3 = time.time()
        with Pool() as p:
            results = p.map(worker, range(1, 20))
        t4 = time.time()
        print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).
I have a large Python script (an economic model with more than 1500 lines) which I want to execute in parallel on several CPU cores. All the examples for multiprocessing I have found so far are about simple functions, not whole scripts. Could you please give me a hint on how to achieve this?
Thanks!
Clarification: the model generates as output a dataset for a multitude of variables. Each result is randomly different from the other model runs. Therefore I have to run the model often enough until some deviation measure is achieved (let's say 50 times). Model input is always the same, but not the output.
Edit, got it:
import os
from multiprocessing import Pool

n_cores = 4
n_iterations = 5

def run_process(process):
    os.system('python myscript.py')

if __name__ == '__main__':
    p = Pool(n_cores)
    p.map(run_process, range(n_iterations))
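A variation of the same idea, sketched with subprocess.run so a failed run is visible as a return code (myscript.py is the script name from the edit above):

import subprocess
from multiprocessing import Pool

n_cores = 4
n_iterations = 5

def run_process(i):
    # each worker launches one full run of the model script;
    # check is left at its default (False) so a failed run does not stop the others
    completed = subprocess.run(['python', 'myscript.py'])
    return completed.returncode

if __name__ == '__main__':
    with Pool(n_cores) as p:
        return_codes = p.map(run_process, range(n_iterations))
    print(return_codes)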
If you want to use a pool of workers, I usually do the following.
import multiprocessing as mp

def MyFunctionInParallel(foo, bar, queue):
    res = foo + bar
    queue.put({res: res})
    return

if __name__ == '__main__':
    data = []
    info = {}
    num = len(data)   # placeholder: the number of models you will run
    numProcs = 4      # placeholder: the number of processes you wish to launch
    ManQueue = mp.Manager().Queue()
    with mp.Pool(processes=numProcs) as pool:
        pool.starmap(MyFunctionInParallel, [(data[v], info, ManQueue)
                                            for v in range(num)])
    resultdict = {}
    for i in range(num):
        resultdict.update(ManQueue.get())
To be clearer, your script becomes the body of MyFunctionInParallel. This means that you need to slightly change your script so that the variables which depend on your input (i.e. each of your models) can be passed as arguments to MyFunctionInParallel. Then, depending on what you want to do with the results you get from each run, you can either use a Queue as sketched above or, for example, write your results to a file. If you use a Queue, it means that you want to retrieve your data at the end of the parallel execution (i.e. in the same script execution); I would advise using dictionaries to store your results in the Queue, as they are very flexible about the data they can contain. On the other hand, writing your results to a file is probably better if you wish to share them with other users/applications. You have to be careful with concurrent writing from all the workers in order to produce meaningful output, but writing one file per model can also be OK.
For the main part of the code, num would be the number of models you will be running, data and info some parameters which are specific (or not) to each model, and numProcs the number of processes that you wish to launch. The call to starmap will map the arguments in the list comprehension to each call of MyFunctionInParallel, allowing each execution to have different input arguments.
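As a hedged sketch of the one-file-per-model variant (run_model and the file names are hypothetical, not part of the answer above):

import json
import multiprocessing as mp

def run_model(model_id, params):
    # hypothetical model run; each worker writes its own file, so there is
    # no concurrent writing to a shared output file
    result = {"model_id": model_id, "score": sum(params)}
    with open(f"model_{model_id}.json", "w") as f:
        json.dump(result, f)

if __name__ == '__main__':
    jobs = [(i, [1, 2, 3]) for i in range(4)]
    with mp.Pool(processes=4) as pool:
        pool.starmap(run_model, jobs)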
In a python script, I have a large dataset that I would like to apply multiple functions to. The functions are responsible for creating certain outputs that get saved to the hard drive.
A few things of note:
the functions are independent
none of the functions return anything
the functions will take variable amounts of time
some of the functions may fail, and that is fine
Can I multiprocess this in such a way that each function and the dataset are sent separately to a core and run there? That way I would not need the first function to finish before the second one can kick off; there is no need for them to be sequentially dependent.
Thanks!
Since your functions are independent and only read data, they are also thread safe, as long as it is not an issue if your data is modified during the execution of a function.
Use a thread pool. You would have to create a task per function you want to run.
Note: in order for it to run on more than one core you must use Python multiprocessing. Otherwise all the threads will run on a single core. This happens because Python has a Global Interpreter Lock (GIL). For more information see Python threads all executing on a single core.
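A rough multiprocessing sketch along those lines (func_a and func_b are hypothetical stand-ins for your independent functions):

from concurrent.futures import ProcessPoolExecutor

def func_a(data):
    # hypothetical independent function; writes its own output to disk
    return None

def func_b(data):
    return None

if __name__ == '__main__':
    data = list(range(1000))  # stand-in for the real dataset
    with ProcessPoolExecutor() as ex:
        futures = [ex.submit(f, data) for f in (func_a, func_b)]
        for fut in futures:
            try:
                fut.result()  # re-raises if the function failed
            except Exception as e:
                print(f"task failed: {e}")  # failures are acceptable here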
Alternatively, you could use Dask, which partitions the data in order to run parts of the work in multiple threads. While it adds some overhead, it might be quicker for your needs.
I was in a similar situation as yours, and used Processes with the following function:
import multiprocessing as mp

def launch_proc(nproc, lst_functions, lst_args, lst_kwargs):
    n = len(lst_functions)
    r = 1 if n % nproc > 0 else 0
    for b in range(n // nproc + r):
        bucket = []
        for p in range(nproc):
            i = b * nproc + p
            if i == n:
                break
            proc = mp.Process(target=lst_functions[i], args=lst_args[i], kwargs=lst_kwargs[i])
            bucket.append(proc)
        for proc in bucket:
            proc.start()
        for proc in bucket:
            proc.join()
This has a major drawback: all Processes in a bucket have to finish before a new bucket can start. I tried to use a JoinableQueue to avoid this, but could not make it work.
Example:
def f(i):
    print(i)

nproc = 2
n = 11
lst_f = [f] * n
lst_args = [[i] for i in range(n)]
lst_kwargs = [{}] * n
launch_proc(nproc, lst_f, lst_args, lst_kwargs)
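As an aside, a pool-based variant (a sketch only, not part of the original answer) avoids the bucket barrier, because the pool hands a worker a new task as soon as it finishes its previous one:

from multiprocessing import Pool

def call(func, args, kwargs):
    # generic wrapper so (function, args, kwargs) triples can go through starmap
    return func(*args, **kwargs)

def f(i):
    print(i)

if __name__ == '__main__':
    jobs = [(f, [i], {}) for i in range(11)]
    with Pool(processes=2) as pool:
        pool.starmap(call, jobs)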
Hope it can help.
I'm a beginner in Python and machine learning. I'm trying to reproduce the code for CountVectorizer() using multi-threading. I'm working with the Yelp dataset to do sentiment analysis using LogisticRegression. This is what I've written so far:
Code snippet:
from multiprocessing.dummy import Pool as ThreadPool
from threading import Thread, current_thread
from functools import partial

data = df['text']
rev = df['stars']
y = []

def product_helper(args):
    return featureExtraction(*args)

def featureExtraction(p, t):
    temp = [0] * len(bag_of_words)
    for word in p.split():
        if word in bag_of_words:
            temp[bag_of_words.index(word)] += 1
    return temp

# function to be mapped over
def calculateParallel(threads):
    pool = ThreadPool(threads)
    job_args = [(item_a, rev[i]) for i, item_a in enumerate(data)]
    l = pool.map(product_helper, job_args)
    pool.close()
    pool.join()
    return l

temp_X = calculateParallel(12)
This is just part of the code.
Explanation:
df['text'] has all the reviews and df['stars'] has the ratings (1 through 5). I'm trying to find the word count vector temp_X using multi-threading. bag_of_words is a list of some frequent words of choice.
Question:
Without multi-threading, I was able to compute temp_X in around 24 minutes, and the above code took 33 minutes for a dataset of 100k reviews. My machine has 128GB of DRAM and 12 cores (6 physical cores with hyperthreading, i.e., 2 threads per core).
What am I doing wrong here?
Your whole code seems CPU-bound rather than I/O-bound. You are just using threads, which are under the GIL, so effectively you are running just one thread plus the overhead. It runs only on one core. To run on multiple cores, use multiprocessing:
import multiprocessing

pool = multiprocessing.Pool()
l = pool.map_async(product_helper, job_args)
# map_async returns an AsyncResult; call l.get() to retrieve the results
from multiprocessing.dummy import Pool as ThreadPool is just a wrapper over the threading module. It utilises just one core and not more than that.
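A slightly fuller sketch of that switch (product_helper and job_args below are toy stand-ins for the ones in the question):

import multiprocessing

def product_helper(args):
    # toy stand-in for the question's feature-extraction helper
    p, t = args
    return len(p.split())

if __name__ == '__main__':
    job_args = [("great food", 5), ("slow service", 2)]  # toy data
    with multiprocessing.Pool() as pool:
        # map blocks until all results are ready and preserves input order
        temp_X = pool.map(product_helper, job_args)
    print(temp_X)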
Python and threads don't really work together very well. There is a known issue called the GIL (Global Interpreter Lock). Basically there is a lock in the interpreter that prevents threads from running in parallel (even if you have multiple CPU cores). Python will simply give each thread a few milliseconds of CPU time one after another (and the reason it became slower is the overhead from context switching between those threads).
Here is a really good document explaining how it works: http://www.dabeaz.com/python/UnderstandingGIL.pdf
To fix your problem I suggest you try multiprocessing:
https://pymotw.com/2/multiprocessing/basics.html
Note: multiprocessing is not 100% equivalent to multithreading. Multiprocessing will run in parallel, but the different processes won't share memory, so if you change a variable in one of them it will not be changed in the other process.
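As a small illustration of that last point (a sketch, not part of the original answer):

import multiprocessing

counter = 0

def bump():
    global counter
    counter += 1  # modifies the child process's copy only
    print("in child:", counter)

if __name__ == '__main__':
    p = multiprocessing.Process(target=bump)
    p.start()
    p.join()
    print("in parent:", counter)  # still 0: memory is not shared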