Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions; on the other, I believe multiprocessing creates a copy of the code which the original instance would not have access to. I need to figure out a way to manipulate class instances, within a class, using multiprocessing, and have the parent class retain all of the manipulated values in the nested instances.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[report] = {}
            tasks.append(pool.apply_async(worker, args=(report, dictReports[report])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to process the list of reports in parallel AND, as processWorker manipulates each Report object, store those results. As I am fairly new to this I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really return anything; I just need the objects to finish being processed, and they should have the relevant information within them.
Thanks!
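One way to get that behavior (a minimal sketch, not the only approach): move processWorker to module level so it pickles cleanly, have it return the mutated Report, and rebind the returned copies in the parent. The names mirror the question; '<todays_date>' is kept as a placeholder string, and on Windows the call would need the usual if __name__ == '__main__' guard.

import multiprocessing

def processWorker(report):
    # runs in a child process on a pickled copy of the Report:
    # mutate the copy and ship it back as the return value
    report.name = 'foo'
    report.path = '/path/to/report'
    return report

class Transfer():
    def __init__(self, reports):
        self.masterReportDictionary = {'<todays_date>': reports}

    def processReports(self):
        pool = multiprocessing.Pool()
        # map() returns the workers' copies; rebind them so the
        # parent keeps the manipulated objects
        self.masterReportDictionary['<todays_date>'] = pool.map(
            processWorker, self.masterReportDictionary['<todays_date>'])
        pool.close()
        pool.join()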

Related

Shadow variable in optapy is not updated as expected

This clarification makes a lot of sense to me, but if I try to apply the reasoning to the code below (which is based on the employee scheduling example available in optapy), I would have expected set_timeslot_list to be called when set_task is called, but it does not look like it is.
The optimisation runs OK and finds a suitable set of tasks to assign to the list of time slots that I have available, but each task.timeslot_list remains empty, and it looks like the set_timeslot_list method is never called.
I believe I am missing something. Can you please help me understand what is wrong with how I modified the example in the code below, or with how I am interpreting how shadow variables work?
I can provide longer snippets, or the @planning_solution class, if this is not sufficient.
@planning_entity(pinning_filter=timeslot_pinning_filter)
class Timeslot:
    def __init__(self, start: datetime.datetime = None, end: datetime.datetime = None,
                 location: str = None, required_skill: str = None, task: object = None):
        self.start = start
        self.end = end
        self.location = location
        self.required_skill = required_skill
        self.task = task

    @planning_id
    def get_id(self):
        return self.id

    # The type of the planning variable is Task, but we cannot use it here
    # because Task is defined below Timeslot.
    @planning_variable(object, value_range_provider_refs=['task_range'], nullable=False)
    def get_task(self):
        return self.task

    def set_task(self, task):
        self.task = task

@planning_entity
class Task:
    def __init__(self, name: str = None, duration: int = None, skill_set: list = None):
        self.name = name
        self.duration = duration
        self.skill_set = skill_set
        # The shadow property, which is a list, can never be None. If no genuine
        # variable references that shadow entity, then it is an empty list.
        self.timeslot_list = []

    @inverse_relation_shadow_variable(Timeslot, source_variable_name = "task")
    def get_timeslot_list(self):
        return self.timeslot_list

    def set_timeslot_list(self, ts):
        self.timeslot_list = ts
Inverse relation shadow variables work differently than other variables: in particular, they directly modify the list returned by get_timeslot_list, so set_timeslot_list is never called (a conceptual illustration follows the code examples below). Your code looks correct, which leads me to believe you are checking the original planning entities and not the solution's planning entities. In OptaPy (and OptaPlanner), the working solution is cloned whenever a new best solution is found. As a result, the original problem (and the original planning entities) are never touched. So if your code looks similar to this:
solver = optapy.solver_factory_create(...).buildSolver()
timeslot_list = [...]
task_1 = Task(...)
task_2 = Task(...)
task_list = [task_1, task_2]
problem = EmployeeSchedulingProblem(timeslot_list, task_list, ...)
solution = solver.solve(problem)
# this is incorrect; it prints the timeslot_list of the original problem
print(task_1.timeslot_list)
It should be changed to this instead:
solver = optapy.solver_factory_create(...).buildSolver()
timeslot_list = [...]
task_1 = Task(...)
task_2 = Task(...)
task_list = [task_1, task_2]
problem = EmployeeSchedulingProblem(timeslot_list, task_list, ...)
solution = solver.solve(problem)
# this is correct; it prints the timeslot_list of the solution
print(solution.task_list[0].timeslot_list)
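To illustrate why the setter is bypassed: the engine updates the inverse shadow list in place rather than assigning a new list. Conceptually (this is an illustration of the behavior described above, not actual OptaPy API calls):

# when the solver assigns the genuine variable...
timeslot.set_task(task)
# ...it updates the inverse shadow list in place,
# so set_timeslot_list is never invoked:
task.get_timeslot_list().append(timeslot)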

Threading, multiprocessing shared memory and asyncio

I am having trouble implementing the following scheme:
class A:
    def __init__(self):
        self.content = []
        self.current_len = 0

    def __len__(self):
        return self.current_len

    def update(self, new_content):
        self.content.append(new_content)
        self.current_len += 1

class B:
    def __init__(self, id):
        self.id = id
And I also have these two functions that will be called later in the main:
async def do_stuff(first_var, second_var):
    """This function is ideally called from the main in another
    process. Also, first_var and second_var are not modified, so it
    would be nice if they could be given by reference without
    having to copy them."""
    ### used for first call
    yield None
    while len(first_var) < CERTAIN_NUMBER:
        time.sleep(10)
    while True:
        ## do stuff
        if condition_met:
            yield new_second_var  ## which is a new instance of B
        ## continue doing stuff

def do_other_stuff(first_var, second_var):
    while True:
        queue = multiprocessing.JoinableQueue()
        results = multiprocessing.Queue()
        ### do stuff
        first_var.update(results)
The main looks like this at the moment:
first_var = A()
second_var = B()
while True:
    async for new_second_var in do_stuff(first_var, second_var):
        if new_second_var:
            ## stop the do_other_stuff that is currently running
            ## to re-launch it with the updated new_var
            do_other_stuff(first_var, new_second_var)
        else:  ## used for the first call
            do_other_stuff(first_var, second_var)
Here are my questions:
Is there a better solution to make this scheme work?
How can I implement the "stopping" part, since there is a while True loop that fills first_var by reference? (see the Event-based sketch below)
Will the instance of A (first_var) be passed by reference to do_stuff if first_var doesn't get modified inside it?
Is it even possible to have an asynchronous generator in another process?
Is it even possible at all?
This is using Python 3.6 for the async generators.
I hope this is somewhat clear! Thanks a lot!
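For the "stopping" question, one common pattern is a multiprocessing.Event shared with the child process: the loop polls the flag and exits cleanly when the parent sets it. A minimal sketch under that assumption (the Manager list stands in for first_var so the parent actually sees the updates; a plain object would be copied into the child):

import multiprocessing
import time

def do_other_stuff(shared_list, stop_event):
    # poll the stop flag on each pass so the parent can end the loop cleanly
    while not stop_event.is_set():
        shared_list.append(42)  # stand-in for "do stuff"
        time.sleep(1)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_list = manager.list()  # proxy object: updates are visible to the parent
    stop_event = multiprocessing.Event()
    worker = multiprocessing.Process(target=do_other_stuff,
                                     args=(shared_list, stop_event))
    worker.start()
    time.sleep(3)
    stop_event.set()  # ask the loop to exit...
    worker.join()     # ...then relaunch with the updated second_var
    print(len(shared_list))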

Python Multiprocessing with shared data source and multiple class instances

My program needs to spawn multiple instances of a class, each processing data that is coming from a streaming data source.
For example:
parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class DoStuff:
    def __init__(self, parameter):
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print output

streaming_api = FakeStreamingApi()

# Here's how this would work with no multiprocessing
instance_1 = DoStuff(parameters[0])
instance_1.run()
Once the instances are running they don't need to interact with each other; they just have to get the data as it comes in (and print error messages, etc.).
I am totally at a loss as to how to make this work with multiprocessing, since I first have to create a new instance of the class DoStuff and then have it run.
This is definitely not the way to do it:
# Let's try multiprocessing
import multiprocessing

for parameter in parameters:
    processes = [ multiprocessing.Process(target = DoStuff, args = (parameter)) ]
# Hmm, this doesn't work...
We could try defining a function to spawn classes, but that seems ugly:
import multiprocessing

def spawn_classes(parameter):
    instance = DoStuff(parameter)
    instance.run()

for parameter in parameters:
    processes = [ multiprocessing.Process(target = spawn_classes, args = (parameter,)) ]
# Can't tell if it works -- no output on screen?
Plus, I don't want to have 3 different copies of the API interface class running, I want that data to be shared between all the processes... and as far as I can tell, multiprocessing creates copies of everything for each new process.
Ideas?
Edit:
I think I may have got it... is there anything wrong with this?
import multiprocessing

parameters = [1, 2, 3]

class FakeStreamingApi:
    def __init__(self):
        pass

    def data(self):
        return 42

class Worker(multiprocessing.Process):
    def __init__(self, parameter):
        super(Worker, self).__init__()
        self.parameter = parameter

    def run(self):
        data = streaming_api.data()
        output = self.parameter ** 2 + data  # Some CPU intensive task
        print output

streaming_api = FakeStreamingApi()

if __name__ == '__main__':
    jobs = []
    for parameter in parameters:
        p = Worker(parameter)
        jobs.append(p)
        p.start()
    for j in jobs:
        j.join()
I came to the conclusion that it would be necessary to use multiprocessing.Queue to solve this: the data source (the streaming API) needs to pass copies of the data to all the different processes so they can consume it.
There's another way to solve this using a multiprocessing.Manager to create a shared dict, but I didn't explore it further, as it looks fairly inefficient and cannot propagate changes to inner values (e.g. if you have a dict of lists, changes to the inner lists will not propagate).
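A minimal sketch of that queue-based approach, assuming one Queue per worker so each process receives its own copy of every datum (the sentinel None tells workers the stream has ended):

import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self, parameter, queue):
        super(Worker, self).__init__()
        self.parameter = parameter
        self.queue = queue

    def run(self):
        while True:
            data = self.queue.get()  # blocks until the source sends data
            if data is None:         # sentinel: stream finished
                break
            print(self.parameter ** 2 + data)

if __name__ == '__main__':
    parameters = [1, 2, 3]
    queues = [multiprocessing.Queue() for _ in parameters]
    workers = [Worker(p, q) for p, q in zip(parameters, queues)]
    for w in workers:
        w.start()
    for datum in (42, 43):  # stand-in for the streaming API
        for q in queues:
            q.put(datum)    # each worker gets its own copy
    for q in queues:
        q.put(None)
    for w in workers:
        w.join()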

PyQt and general Python, can this be considered a correct approach for coding?

I have a dialog window containing check-boxes; when each of them is checked, a particular class needs to be instantiated and a task run on a separate thread (one for each check-box). I have 14 check-boxes whose .isChecked() property needs testing, and obviously checking the returned Boolean for each of them individually is not efficient and requires a lot more code.
Hence I decided to get all the children items corresponding to check-box elements, take just those that are checked, append their names to a list, and loop through them, matching each name against a dictionary whose key is the name of the check-box and whose value is the corresponding class to instantiate.
EXAMPLE:
# class dictionary
self.summary_runnables = {'dupStreetCheckBox': [DupStreetDesc(), 0],
                          'notStreetEsuCheckBox': [StreetsNoEsuDesc(), 1],
                          'notType3CheckBox': [Type3Desc(False), 2],
                          'incFootPathCheckBox': [Type3Desc(True), 2],
                          'dupEsuRefCheckBox': [DupEsuRef(True), 3],
                          'notEsuStreetCheckBox': [NoLinkEsuStreets(), 4],
                          'invCrossRefCheckBox': [InvalidCrossReferences()],
                          'startEndCheckBox': [CheckStartEnd(tol=10), 8],
                          'tinyEsuCheckBox': [CheckTinyEsus("esu", 1)],
                          'notMaintReinsCheckBox': [CheckMaintReins()],
                          'asdStartEndCheckBox': [CheckAsdCoords()],
                          'notMaintPolysCheckBox': [MaintNoPoly(), 16],
                          'notPolysMaintCheckBox': [PolyNoMaint()],
                          'tinyPolysCheckBox': [CheckTinyEsus("rd_poly", 1)]}
# looping through list
self.long_task = QThreadPool(None).globalInstance()
self.long_task.setMaxThreadCount(1)
start_report = StartReport(val_file_path)
end_report = EndReport()
# start_report.setAutoDelete(False)
# end_report.setAutoDelete(False)
end_report.signals.result.connect(self.log_progress)
end_report.signals.finished.connect(self.show_finished)
# end_report.setAutoDelete(False)
start_report.signals.result.connect(self.log_progress)
self.long_task.start(start_report)
# print str(self.check_boxes_names)
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    if run_class[0].__class__.__name__ == 'CheckStartEnd':
        run_class[0].tolerance = tolerance
    runnable = run_class[0]
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
self.long_task.start(end_report)
Here is an example of a runnable (even though some of them use different global functions).
I can't post the global functions that write content to the file, as there are too many and not all 14 tasks execute the same type of function; the arguments of these functions are int keys into other dictionaries that contain the report's static content and the SQL queries that return the report's main dynamic content.
class StartReport(QRunnable):
    def __init__(self, file_path):
        super(StartReport, self).__init__()
        # open the db connection in thread
        db.open()
        self.signals = GeneralSignals()
        # self.simple_signal = SimpleSignal()
        # print self.signals.result
        self.file_path = file_path
        self.task = "Starting Report"
        self.progress = 1
        self.org_name = org_name
        self.user = user
        self.report_title = "Validation Report"
        print "instantiation of start report"

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if self.file_path is None:
            print "I started and found file none"
            return
        else:
            global report_file
            # create the file and print the header
            report_file = open(self.file_path, 'wb')
            report_file.write(str(self.report_title) + ' for {0} \n'.format(self.org_name))
            report_file.write('Created on : {0} at {1} By : {2} \n'.format(
                datetime.today().strftime("%d/%m/%Y"),
                datetime.now().strftime("%H:%M"),
                str(self.user)))
            report_file.write(
                "------------------------------------------------------------------------------------------ \n \n \n \n")
            report_file.flush()
            os.fsync(report_file.fileno())
class EndReport(QRunnable):
    def __init__(self):
        super(EndReport, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Finishing report"
        self.progress = 100

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is not None:
            # write footer and close file
            report_file.write("\n \n \n")
            report_file.write("---------- End of Report -----------")
            report_file.flush()
            os.fsync(report_file.fileno())
            report_file.close()
            self.signals.finished.emit()
            # TODO: check whether opening a db connection in the thread might affect the db in the GUI
            # if db.isOpen():
            #     db.close()
        else:
            return
class DupStreetDesc(QRunnable):
    """
    Duplicate street description report section creation.
    :return: void if the report is to text,
             list[string] if the report is to screen
    """
    def __init__(self):
        super(DupStreetDesc, self).__init__()
        self.signals = GeneralSignals()
        self.task = "Checking duplicate street descriptions..."
        self.progress = 16.6

    def run(self):
        self.signals.result.emit(self.task, self.progress)
        if report_file is None:
            print "report file is none"
            # items_list = write_content(0, 0, 0, 0)
            # for item in items_list:
            #     self.signals.list.emit(item)
        else:
            write_content(0, 0, 0, 0)
Now, I have used this approach before and it has always worked fine without using multiprocessing. In this case it works to some extent: I can run the tasks the first time, but if I try to run them a second time I get the following Python error:
self.long_task.start(run_class[0])
RuntimeError: wrapped C/C++ object of type DupStreetDesc has been deleted
I tried to use run_class[0].setAutoDelete(False) before running them in the loop, but PyQt crashes with a minidump error (I am running the code in QGIS) and the program exits with few clues as to what has happened.
On the other hand, if I run my classes separately, checking each check-box with an if/else statement, then it works fine: I can run the tasks again and the C++ objects are not deleted. But it isn't a nice coding approach, at least from my very little experience.
Is there anyone out there who can advise a different approach to make this run smoothly without too many lines of code? Or who knows whether there is a more efficient pattern for handling this problem, which I think must be quite common?
It seems that you should create a new instance of each runnable, and allow Qt to automatically delete it. So your dictionary entries could look like this:
'dupStreetCheckBox': [lambda: DupStreetDesc(), 0],
and then you can do:
for check_box_name in self.check_boxes_names:
    run_class = self.summary_runnables[check_box_name]
    runnable = run_class[0]()
    runnable.signals.result.connect(self.log_progress)
    self.long_task.start(runnable)
I don't know why setAutoDelete does not work (assuming you are calling it before starting the threadpool). I suppose there might be a bug, but it's impossible to be sure without having a fully-working example to test.
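For entries that need constructor arguments, the same factory idea applies. A sketch based on the dictionary from the question (only a few entries shown; capturing the tolerance in the factory also removes the CheckStartEnd special case from the loop):

self.summary_runnables = {'dupStreetCheckBox': [lambda: DupStreetDesc(), 0],
                          'notType3CheckBox': [lambda: Type3Desc(False), 2],
                          'incFootPathCheckBox': [lambda: Type3Desc(True), 2],
                          # tolerance is baked in when the factory is called
                          'startEndCheckBox': [lambda: CheckStartEnd(tol=10), 8],
                          'tinyEsuCheckBox': [lambda: CheckTinyEsus("esu", 1)]}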

Working with parallel python and classes

I was shocked to learn how few tutorials and guides there are to be found on the internet regarding Parallel Python (PP) and handling classes. I've run into a problem where I want to initiate a couple of instances of the same class and after that retrieve some variables (for instance reading 5 data files in parallel, and then retrieving their data). Here's a simple piece of code to illustrate my problem:
import pp

class TestClass:
    def __init__(self, i):
        self.i = i

    def doSomething(self):
        print "\nI'm being executed!, i = " + str(self.i)
        self.j = 2 * self.i
        print "self.j is supposed to be " + str(self.j)
        return self.i

class parallelClass:
    def __init__(self):
        job_server = pp.Server()
        job_list = []
        self.instances = []  # for storage of the class objects
        for i in xrange(3):
            TC = TestClass(i)  # initiate a new instance of the TestClass
            self.instances.append(TC)  # store the instance
            job_list.append(job_server.submit(TC.doSomething, (), ()))  # add some jobs to the job_list
        results = [job() for job in job_list]  # execute order 66...
        print "\nIf all went well there's a nice bunch of objects in here:"
        print self.instances
        print "\nAccessing an object's i works ok, but accessing j does not"
        print "i = " + str(self.instances[2].i)
        print "j = " + str(self.instances[2].j)

if __name__ == '__main__':
    parallelClass()  # initiate the program
I've added comments for your convenience. What am I doing wrong here?
You should use callbacks.
A callback is a function that you pass to the submit call. That function will be called with the result of the job as its argument (have a look at the API for more arcane usage).
In your case:
Set up a callback:
class TestClass:
    def doSomething(self):
        j = 2 * self.i
        return j  # It's REQUIRED that you return j here.

    def set_j(self, j):
        self.j = j
Add the callback to the job submit call:
class parallelClass:
    def __init__(self):
        # your code...
        job_list.append(job_server.submit(TC.doSomething, callback=TC.set_j))
And you're done.
I made some improvements to the code to avoid using self.j in the doSomething call, and to only use a local j variable.
As mentioned in the comments, in pp you only communicate the result of your job. That's why you have to return this variable: it will be passed to the callback.
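Putting the pieces together, the whole example from the question would look roughly like this (a sketch following the pp API as used above, where submit() accepts a callback keyword argument):

import pp

class TestClass:
    def __init__(self, i):
        self.i = i

    def doSomething(self):
        return 2 * self.i  # the returned value is passed to the callback

    def set_j(self, j):
        self.j = j  # runs in the parent process, storing the result

class parallelClass:
    def __init__(self):
        job_server = pp.Server()
        job_list = []
        self.instances = []
        for i in xrange(3):
            TC = TestClass(i)
            self.instances.append(TC)
            # callback=TC.set_j stores each job's result on the instance
            # held by the parent process
            job_list.append(job_server.submit(TC.doSomething, (), callback=TC.set_j))
        [job() for job in job_list]  # wait for all jobs to finish
        print "j = " + str(self.instances[2].j)

if __name__ == '__main__':
    parallelClass()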
