I am fairly new to Python, and my experience is specific to its use in Powerflow modelling through the API provided in Siemens PSS/e. I have a script that I have been using for several years that runs some simulation on a large data set.
To get it to finish quickly, I usually split the inputs into multiple parts, then run multiple instances of the script in IDLE. I've recently added a GUI for the inputs and have refined the code to be more object oriented, creating a class that the GUI passes the inputs to but that otherwise works as the original script did.
My question is: how do I go about splitting the process from within the program itself, rather than making multiple copies? I have read a bit about the multiprocessing module, but I am not sure how to apply it to my situation. Essentially, I want to be able to instantiate N copies of the same object, each running in parallel.
The class I have now (called Bot) is passed a set of arguments and writes a csv output while it runs, until it finishes. I have a separate block of code that puts the pieces together at the end, but for now I just need to understand the best approach to kicking off multiple Bot objects once I hit run in my GUI. There are inputs in the GUI to specify the number N of segments to be used.
I apologize ahead of time if my question is rather vague. Thanks for any information at all, as I'm sort of stuck and don't know where to look for better answers.
Create a list of configurations:
configurations = [...]
Create a function which takes the relevant configuration, and makes use of your Bot:
def function(configuration):
    bot = Bot(configuration)
    bot.create_csv()
Create a Pool of workers with however many CPUs you want to use:
from multiprocessing import Pool
pool = Pool(3)
Call the function once for each configuration in your list of configurations:
pool.map(function, configurations)
For example:
from multiprocessing import Pool
import os
class Bot:
    def __init__(self, inputs):
        self.inputs = inputs

    def create_csv(self):
        pid = os.getpid()
        print('TODO: create csv in process {} using {}'
              .format(pid, self.inputs))

def use_bot(inputs):
    bot = Bot(inputs)
    bot.create_csv()

def main():
    configurations = [
        ['input1_1.txt', 'input1_2.txt'],
        ['input2_1.txt', 'input2_2.txt'],
        ['input3_1.txt', 'input3_2.txt']]
    pool = Pool(2)
    pool.map(use_bot, configurations)

if __name__ == '__main__':
    main()
Output:
TODO: create csv in process 10964 using ['input2_1.txt', 'input2_2.txt']
TODO: create csv in process 8616 using ['input1_1.txt', 'input1_2.txt']
TODO: create csv in process 8616 using ['input3_1.txt', 'input3_2.txt']
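To connect this with the "N segments" input from the GUI, here is a minimal sketch of splitting one large input list into N near-equal segments and handing each segment to a worker. The names split_into_segments, run_segment and run_all are hypothetical, and run_segment only stands in for creating a Bot and calling its create_csv method.

```python
from multiprocessing import Pool

def split_into_segments(items, n_segments):
    """Split items into n_segments contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments get one additional item each.
        end = start + size + (1 if i < extra else 0)
        segments.append(items[start:end])
        start = end
    return segments

def run_segment(segment):
    # Hypothetical stand-in for Bot(segment).create_csv();
    # here it only reports how many items the segment holds.
    return len(segment)

def run_all(items, n_segments):
    segments = split_into_segments(items, n_segments)
    with Pool(n_segments) as pool:
        # map returns results in the same order as the segments.
        return pool.map(run_segment, segments)

if __name__ == '__main__':
    print(run_all(list(range(10)), 3))  # prints [4, 3, 3]
```

Each worker would write its own csv, so the existing code that stitches the pieces together at the end should not need to change.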
If you'd like to make life a little less complicated, you can use multiprocess (a third-party fork of multiprocessing) instead of multiprocessing, as it has better support for classes and for working in the interpreter. As you can see below, we can work directly with a method on a class instance, which multiprocessing on Python 2 cannot do because it cannot pickle bound methods.
>>> from multiprocess import Pool
>>> import os
>>>
>>> class Bot(object):
...     def __init__(self, x):
...         self.x = x
...     def doit(self, y):
...         pid = os.getpid()
...         return (pid, self.x + y)
...
>>> p = Pool()
>>> b = Bot(5)
>>> results = p.imap(b.doit, range(4))
>>> print(dict(results))
{46552: 7, 46553: 8, 46550: 5, 46551: 6}
>>> p.close()
>>> p.join()
Above, I'm using imap, to get an iterator on the results, which I'll just dump into a dict. Note that you should close your pools after you are done, to clean up. On Windows, you may also want to look at freeze_support, for cases where the code otherwise fails to run.
>>> import multiprocess as mp
>>> mp.freeze_support
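For reference, a minimal sketch of where freeze_support is usually called in a script. It uses the standard-library multiprocessing module (the multiprocess package is assumed here to mirror this API), and square is a hypothetical worker function.

```python
import multiprocessing as mp

def square(x):
    # Hypothetical worker; any picklable top-level function would do.
    return x * x

if __name__ == '__main__':
    # Call this first inside the main guard. It matters for frozen Windows
    # executables and is a no-op when the interpreter runs the script normally.
    mp.freeze_support()
    with mp.Pool(2) as pool:
        print(pool.map(square, range(4)))  # prints [0, 1, 4, 9]
```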
I have a large Python script (an economic model with more than 1500 rows) which I want to execute in parallel on several CPU cores. All the examples for multiprocessing I have found so far were about simple functions, not whole scripts. Could you please give me a hint on how to achieve this?
Thanks!
Clarification: the model generates as its output a dataset for a multitude of variables. Each result is randomly different from the other model runs. Therefore I have to run the model often enough until some deviation measure is achieved (let's say 50 times). The model input is always the same, but the output is not.
Edit, got it:
import os
from multiprocessing import Pool
n_cores = 4
n_iterations = 5
def run_process(process):
    os.system('python myscript.py')

if __name__ == '__main__':
    p = Pool(n_cores)
    p.map(run_process, range(n_iterations))
If you want to use a pool of workers, I usually do the following.
import multiprocessing as mp
def MyFunctionInParallel(foo, bar, queue):
    res = foo + bar
    queue.put({res: res})
    return

if __name__ == '__main__':
    data = []
    info = {}
    num =
    ManQueue = mp.Manager().Queue()
    with mp.Pool(processes=numProcs) as pool:
        pool.starmap(MyFunctionInParallel, [(data[v], info, ManQueue)
                                            for v in range(num)])
    resultdict = {}
    for i in range(num):
        resultdict.update(ManQueue.get())
To be clearer, your script becomes the body of MyFunctionInParallel. This means that you need to slightly change your script so that the variables which depend on your input (i.e. each of your models) can be passed as arguments to MyFunctionInParallel. Then, depending on what you want to do with the results you get for each run, you can either use a Queue as sketched above or for example, write your results in a file. If you use a Queue, it means that you want to be able to retrieve your data at the end of the parallel execution (i.e. in the same script execution), and I would advise to use dictionaries as a way to store your results in the Queue, as they are very flexible on the data they can contain. On the other hand, writing up your results in a file is I guess better if you wish to share them with other users/applications. You have to be careful with concurrent writing from all the workers, so as to produce a meaningful output, but writing one file per model can also be OK.
For the main part of the code, num would be the number of models you will be running, data and info some parameters which are specific (or not) to each model and numProcs the number of processes that you wish to launch. For the call to starmap, it will basically map the arguments in the list comprehension to each call of MyFunctionInParallel, allowing each execution to have different input arguments.
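A runnable sketch of the same pattern with the placeholders filled by made-up values. The names run_model, run_models and offset are hypothetical, and the arithmetic only stands in for the body of the real script.

```python
import multiprocessing as mp

def run_model(model_id, offset, queue):
    # Stand-in for the body of the script: compute something per model
    # and store it in the queue as a dictionary keyed by the model id.
    result = model_id * model_id + offset
    queue.put({model_id: result})

def run_models(num_models, num_procs, offset):
    manager = mp.Manager()
    queue = manager.Queue()  # a manager queue can be passed to pool workers
    with mp.Pool(processes=num_procs) as pool:
        # starmap unpacks each tuple into the arguments of run_model.
        pool.starmap(run_model,
                     [(m, offset, queue) for m in range(num_models)])
    results = {}
    for _ in range(num_models):
        results.update(queue.get())
    return results

if __name__ == '__main__':
    results = run_models(num_models=4, num_procs=2, offset=10)
    print(sorted(results.items()))  # prints [(0, 10), (1, 11), (2, 14), (3, 19)]
```

If the results are only needed at the end, returning them from the function and using the list that starmap returns would avoid the queue entirely; the queue is kept here to match the answer above.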
I'm trying to reduce the processing time of reading a database with roughly 100,000 entries, which I need formatted in a specific way. To do this, I tried Python's multiprocessing Pool.map function, which works perfectly except that I can't seem to get any form of queue reference to work across the processes.
I've been using information from Filling a queue and managing multiprocessing in python to guide me in using queues across multiple processes, and Using a global variable with a thread to guide me in using global variables across threads. I've gotten the software to work, but when I check the list/queue/dict/map length after running the process, it always returns zero.
I've written a simple example to show what I mean:
You have to run the script as a file, the map's initialize function does not work from the interpreter.
from multiprocessing import Pool
from collections import deque
global_q = deque()
def my_init(q):
    global global_q
    global_q = q
    q.append("Hello world")

def map_fn(i):
    global global_q
    global_q.append(i)

if __name__ == "__main__":
    with Pool(3, my_init, (global_q,)) as pool:
        pool.map(map_fn, range(3))
    for p in range(len(global_q)):
        print(global_q.pop())
Theoretically, when I pass the queue object reference from the main thread to the worker threads using the pool function, and then initialize each thread's global variable with the given function, then when I later insert elements into the queue from the map function, that object reference should still be pointing to the original queue object (long story short, everything should end up in the same queue, because they all point to the same location in memory).
So, I expect:
Hello World
Hello World
Hello World
1
2
3
of course, the 1, 2, 3's are in arbitrary order, but what you'll see on the output is ''.
How come when I pass object references to the pool function, nothing happens?
Here's an example of how to share something between processes by extending the multiprocessing.managers.BaseManager class to support deques.
There's a Customized managers section in the documentation about creating them.
import collections
from multiprocessing import Pool
from multiprocessing.managers import BaseManager
class DequeManager(BaseManager):
    pass

class DequeProxy(object):
    def __init__(self, *args):
        self.deque = collections.deque(*args)
    def __len__(self):
        return self.deque.__len__()
    def appendleft(self, x):
        self.deque.appendleft(x)
    def append(self, x):
        self.deque.append(x)
    def pop(self):
        return self.deque.pop()
    def popleft(self):
        return self.deque.popleft()

# Currently only exposes a subset of deque's methods.
DequeManager.register('DequeProxy', DequeProxy,
                      exposed=['__len__', 'append', 'appendleft',
                               'pop', 'popleft'])

process_shared_deque = None  # Global only within each process.

def my_init(q):
    """ Initialize module-level global. """
    global process_shared_deque
    process_shared_deque = q
    q.append("Hello world")

def map_fn(i):
    process_shared_deque.append(i)  # deque's don't have a "put()" method.

if __name__ == "__main__":
    manager = DequeManager()
    manager.start()
    shared_deque = manager.DequeProxy()
    with Pool(3, my_init, (shared_deque,)) as pool:
        pool.map(map_fn, range(3))
    for p in range(len(shared_deque)):  # Show left-to-right contents.
        print(shared_deque.popleft())
Output:
Hello world
0
1
2
Hello world
Hello world
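If the goal is only to format the entries in parallel, a simpler sketch avoids shared state altogether: the mapped function returns its result and the parent process collects the return values. The names format_entry and collect, and the formatting itself, are hypothetical.

```python
from collections import deque
from multiprocessing import Pool

def format_entry(i):
    # Hypothetical stand-in for formatting one database entry.
    return "entry-{:03d}".format(i)

def collect(n_entries, n_workers=3):
    with Pool(n_workers) as pool:
        # pool.map sends each return value back to the parent process,
        # so no queue or global needs to be shared with the workers.
        return deque(pool.map(format_entry, range(n_entries)))

if __name__ == "__main__":
    print(list(collect(3)))  # prints ['entry-000', 'entry-001', 'entry-002']
```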
You can't use a global variable for multiprocessing.
Pass a multiprocessing Queue to the function instead.
from multiprocessing import Queue
queue = Queue()
def worker(q):
    q.put(something)
Also, you are probably finding that the code looks all right; but because the pool creates separate processes, their errors are separated too, so you don't see that the code is not only failing to work but is also throwing an error.
The reason your output is '' is that nothing was appended to your q/global_q. If something was appended, it went to a variable that may also be called global_q but is a totally different object from the global_q in your main thread.
Try to print('Hello world') inside the function you want to multiprocess and you will see for yourself that nothing is actually printed at all. Those processes are simply outside of your main thread, and the only way to reach them is through multiprocessing Queues. You use the Queue with queue.put('something') and something = queue.get().
Try to understand this code and you will do well:
import multiprocessing as mp
# This queue will be shared among all processes, but you need to pass it as an
# argument to each process. You CANNOT use it as a global variable. Understand
# that the functions run in totally different processes and nothing can really
# reach into them, except a multiprocessing.Queue, which can be shared across
# all processes.
shared_queue = mp.Queue()

def channel(que, channel_num):
    que.put(channel_num)

if __name__ == '__main__':
    processes = [mp.Process(target=channel, args=(shared_queue, channel_num)) for channel_num in range(8)]
    for p in processes:
        p.start()
    for p in processes:  # wait for all processes to finish
        p.join()
    for i in range(8):  # Get data from the Queue (you can actually get data out of it at any time)
        print(shared_queue.get())
I am relatively new to python and definitely new to multiprocessing. I'm following this question/answer for the structure of my multiprocessing, but in def func_A, I'm calling a module that passes a class object as one of the arguments. In the module, I change an object attribute that I would like the main program to see and update the user with the object attribute value. The child processes run for very long times, so I need the main program to provide updates as they run.
My suspicion is that I'm not understanding namespace/object scoping or something similar, but from what I've read, passing an object (an instance of a class?) to a module as an argument passes a reference to the object and not a copy. I would have thought this meant that changing the attributes of the object in the child process/module would have changed the attributes in the main program object (since they're the same object). Or am I confusing things?
The code for my main program:
# MainProgram.py
import multiprocessing as mp
import time
from time import sleep
import sys
from datetime import datetime
import myModule
MYOBJECTNAMES = ['name1','name2']
class myClass:
    def __init__(self, name):
        self.name = name
        self.value = 0

myObjects = []
for n in MYOBJECTNAMES:
    myObjects.append(myClass(n))

def func_A(process_number, queue):
    start = datetime.now()
    print("Process {} (object: {}) started at {}".format(process_number, myObjects[process_number].name, start))
    myModule.Eval(myObjects[process_number])
    sys.stdout.flush()

def multiproc_master():
    queue = mp.Queue()
    proceed = mp.Event()
    processes = [mp.Process(target=func_A, args=(x, queue)) for x in range(len(myObjects))]
    for p in processes:
        p.start()
    for i in range(100):
        for o in myObjects:
            print("In main: Value of {} is {}".format(o.name, o.value))
        sleep(10)
    for p in processes:
        p.join()

if __name__ == '__main__':
    split_jobs = multiproc_master()
    print(split_jobs)
The code for my module program:
# myModule.py
from time import sleep
def Eval(myObject):
    for i in range(100):
        myObject.value += 1
        print("In module: Value of {} is {}".format(myObject.name, myObject.value))
        sleep(5)
That question/answer you linked to was probably a poor choice to use as a template, as it's doing many things that your code doesn't require (much less use).
I think your biggest misconception about how multiprocessing works is thinking that all the code is running in the same address-space. The main task runs in its own, and there are separate ones for each subtask. The way your code is written, each of them will end up with its own separate myObjects list. That's why the main task doesn't see any of the changes made by any of the other tasks.
While there are ways to share objects using the multiprocessing module, doing so often introduces significant overhead, because keeping them all in sync between the processes requires lots of things happening "under the covers" to make them seem shared (they can't actually be shared, because each process has a separate address space). This overhead frequently cancels out any speed gained by parallel processing.
As stated in the documentation: "when doing concurrent programming it is usually best to avoid using shared state as far as possible".
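One way to follow that advice and still get progress updates is to have each child send its current value to the main process through a queue, instead of mutating a shared object. The sketch below assumes hypothetical names (evaluate, monitor) and uses None as a made-up "finished" marker; the counting loop only stands in for the long-running work in myModule.Eval.

```python
import multiprocessing as mp

def evaluate(name, steps, updates):
    # Runs in a child process: report every new value to the parent.
    value = 0
    for _ in range(steps):
        value += 1
        updates.put((name, value))
    updates.put((name, None))  # None marks this worker as finished

def monitor(names, steps):
    updates = mp.Queue()
    processes = [mp.Process(target=evaluate, args=(n, steps, updates))
                 for n in names]
    for p in processes:
        p.start()
    latest, finished = {}, 0
    while finished < len(names):
        name, value = updates.get()  # blocks until some child reports
        if value is None:
            finished += 1
        else:
            latest[name] = value
            print("In main: Value of {} is {}".format(name, value))
    for p in processes:
        p.join()
    return latest

if __name__ == '__main__':
    print(monitor(['name1', 'name2'], 3))
```

The main process only ever sees copies of the values, which is exactly why the original attribute update was invisible to it; the queue makes that copying explicit.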
as someone who explores the new and mighty world of Python, I am running into an understanding problem for my coding and it would be great if someone could help me on this one.
To make my problem simple I have made an example.
Let's say I have two functions running simultaneously via multiprocessing. One is a permanent data listener and the other prints out the listener's value. In addition, I have one object which owns the data; the data is set and read via set/get methods. So the challenge is how both functions can access the data without making it global. I guess my lack of understanding is somewhere in how to transfer the object between functions.
NOTE : both functions do not need to be in sync and the while is just for endless loop. It just how to bring the data over.
This gives a code like (I know it is not working, just to get the idea) :
import multiprocessing
#simply a data object
class data(object):
    def __init__(self):
        self.__value = 1
    def set_value(self, value):
        self.__value = value
    def get_value(self):
        return self.__value

# Data listener
def f1(count):
    zae = 0
    while True:
        zae += 1
        count.set_value = zae

def f2(count):
    while True:
        print (count.get_value)

#MainPart
if __name__ == '__main__':
    print('start')
    count = data()
    jobs = []
    p1 = multiprocessing.Process(target =f1(count))
    p2 = multiprocessing.Process(target =f2(count))
    jobs.append(p1)
    jobs.append(p2)
    p1.start()
    p2.start()
    print ('end')
Please enlight me,
regards
AdrianMonk
This looks like a neat case for using memory-mapped files.
When a process memory-maps a file (say F) and another process comes along and maps the same file (i.e. maps F.fileno() too), then exactly the same block of memory is mapped into the second process's address space. This allows the two processes to exchange information extremely rapidly by writing into the shared memory.
Of course, you have to manage the proper access (read, write, etc.) in your mappings, and then it is just a matter of polling/writing the proper locations in the file to satisfy the logic of your application
(see http://docs.python.org/2/library/mmap.html).
The communication channels Pipe and Queue from multiprocessing are designed to solve exactly this kind of problem.
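A minimal sketch of the listener/printer pair using a one-way Pipe. The function names, the finite loop, and the use of None as an end marker are assumptions made for illustration; here the listener runs in a child process and the printer runs in the main process.

```python
import multiprocessing as mp

def listener(conn, n_values):
    # Produces values and sends each one through the pipe.
    for value in range(1, n_values + 1):
        conn.send(value)
    conn.send(None)  # made-up end marker so the reader knows when to stop
    conn.close()

def printer(conn):
    # Receives values until the end marker arrives.
    received = []
    while True:
        value = conn.recv()
        if value is None:
            break
        received.append(value)
    return received

if __name__ == '__main__':
    print('start')
    # duplex=False returns (receive-only end, send-only end).
    recv_end, send_end = mp.Pipe(duplex=False)
    p1 = mp.Process(target=listener, args=(send_end, 5))
    p1.start()
    print(printer(recv_end))  # prints [1, 2, 3, 4, 5]
    p1.join()
    print('end')
```

Note that the target is passed as a function plus an args tuple; writing target=f1(count) would call the function immediately in the main process instead of in the child.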
I have found information on multiprocessing and multithreading in python but I don't understand the basic concepts and all the examples that I found are more difficult than what I'm trying to do.
I have X independent programs that I need to run. I want to launch the first Y programs (where Y is the number of cores of my computer and X>>Y). As soon as one of the independent programs is done, I want the next program to run in the next available core. I thought that this would be straightforward, but I keep getting stuck on it. Any help in solving this problem would be much appreciated.
Edit: Thanks a lot for your answers. I also found another solution using the joblib module that I wanted to share. Suppose that you have a script called 'program.py' that you want to run with different combination of the input parameters (a0,b0,c0) and you want to use all your cores. This is a solution.
import os
from numpy import arange  # arange is assumed to come from NumPy
from joblib import Parallel, delayed
a0 = arange(0.1,1.1,0.1)
b0 = arange(-1.5,-0.4,0.1)
c0 = arange(1.,5.,0.1)
params = []
for i in range(len(a0)):
    for j in range(len(b0)):
        for k in range(len(c0)):
            params.append((a0[i],b0[j],c0[k]))
def func(parameters):
    s = 'python program.py %g %g %g' % (parameters[0], parameters[1], parameters[2])
    command = os.system(s)
    return command
output = Parallel(n_jobs=-1,verbose=1000)(delayed(func)(i) for i in params)
You want to use multiprocessing.Pool, which represents a "pool" of workers (default one per core, though you can specify another number) that do your jobs. You then submit jobs to the pool, and the workers handle them as they become available. The easiest function to use is Pool.map, which runs a given function for each of the arguments in the passed sequence, and returns the result for each argument. If you don't need return values, you could also use apply_async in a loop.
import multiprocessing

def do_work(arg):
    pass # do whatever you actually want to do

def run_battery(args):
    # args should be like [arg1, arg2, ...]
    pool = multiprocessing.Pool()
    ret_vals = pool.map(do_work, args)
    pool.close()
    pool.join()
    return ret_vals
If you're trying to call external programs and not just Python functions, use subprocess. For example, this will call cmd_name with the list of arguments passed, raise an exception if the return code isn't 0, and return the output:
import subprocess

def do_work(subproc_args):
    return subprocess.check_output(['cmd_name'] + list(subproc_args))
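Since each external program already runs in its own process, a pool of threads is enough to keep a fixed number of them running at once. The sketch below swaps in concurrent.futures.ThreadPoolExecutor for multiprocessing.Pool, and uses a tiny inline Python command as a stand-in for the real program; run_program and run_battery are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real external program: prints the sum of its arguments.
CHILD_CODE = "import sys; print(sum(float(a) for a in sys.argv[1:]))"

def run_program(args):
    cmd = [sys.executable, "-c", CHILD_CODE] + [str(a) for a in args]
    # check_output raises CalledProcessError on a non-zero return code.
    return subprocess.check_output(cmd, text=True).strip()

def run_battery(param_sets, max_workers=None):
    # Each thread just waits on one child process; as soon as one finishes,
    # the thread picks up the next parameter set.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_program, param_sets))

if __name__ == '__main__':
    print(run_battery([(1, 2, 3), (0.5, 0.5)], max_workers=2))
```

Setting max_workers to the number of cores gives the "start the next program as soon as a core is free" behaviour asked for in the question.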
Hi, I'm using the QThread object from PyQt.
From what I understood, while your thread is running it can only use its own variables and methods; it cannot change your main object's variables.
So before you run it, be sure to define all the QThread variables you will need,
like this for example:
# Imports assume PyQt4, as implied by QtGui.QMainWindow and QtCore.SIGNAL.
from PyQt4 import QtCore, QtGui
from PyQt4.QtCore import QThread

class Worker(QThread):
    def define(self, phase):
        print('define')
        self.phase = phase
        self.start()  # will run your thread
    def continueJob(self):
        self.start()
    def run(self):
        self.launchProgramme(self.phase)
        self.phase += 1
    def launchProgramme(self, phase):
        print(phase)
I'm not well aware of how the basic Python thread works, but in PyQt your thread emits a signal
to your main object, like this:
class mainObject(QtGui.QMainWindow):
    def __init__(self):
        super(mainObject, self).__init__()
        self.numberProgramme = 4
        self.thread = Worker()
        # connect the thread's signals to a method of the main object
        self.connect(self.thread, QtCore.SIGNAL("finished()"), self.threadStopped)
        self.connect(self.thread, QtCore.SIGNAL("terminated()"), self.threadStopped)
Connected like this, when the thread's run method stops, it will call the threadStopped method of your main object, where you can get the value of your thread's variables:
    def threadStopped(self):
        value = self.thread.phase
        if value < self.numberProgramme:
            self.thread.continueJob()
After that, you just have to launch another thread or not, depending on the value you get.
This is for PyQt threading, of course; with the basic Python thread, the way to execute threadStopped could be different.
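In plain Python, one possible counterpart to the finished() signal is a done-callback on a concurrent.futures future. This is only a sketch under that assumption; launch_programme and run_phases are hypothetical names, and the callback may run in the worker thread rather than in the main thread.

```python
from concurrent.futures import ThreadPoolExecutor

def launch_programme(phase):
    # Hypothetical stand-in for the work done in one phase.
    return phase

def run_phases(number_programme):
    completed = []
    # A single worker thread runs the phases one after another.
    with ThreadPoolExecutor(max_workers=1) as executor:
        for phase in range(number_programme):
            future = executor.submit(launch_programme, phase)
            # Called when the phase finishes, like the threadStopped slot.
            future.add_done_callback(lambda f: completed.append(f.result()))
    # Leaving the with block waits for all submitted phases to finish.
    return completed

if __name__ == '__main__':
    print(sorted(run_phases(4)))  # prints [0, 1, 2, 3]
```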