Threading, multiprocessing shared memory and asyncio in Python

I am having trouble implementing the following scheme:
class A:
    def __init__(self):
        self.content = []
        self.current_len = 0

    def __len__(self):
        return self.current_len

    def update(self, new_content):
        self.content.append(new_content)
        self.current_len += 1

class B:
    def __init__(self, id):
        self.id = id
I also have these two functions that will be called later in the main:
async def do_stuff(first_var, second_var):
    """ this function is ideally called from the main in another
    process. Also, first_var and second_var are not modified so it
    would be nice if they could be given by reference without
    having to copy them """
    ### used for first call
    yield None
    while len(first_var) < CERTAIN_NUMBER:
        time.sleep(10)
    while True:
        ## do stuff
        if condition_met:
            yield new_second_var  ## which is a new instance of B
        ## continue doing stuff

def do_other_stuff(first_var, second_var):
    while True:
        queue = multiprocessing.JoinableQueue()
        results = multiprocessing.Queue()
        ### do stuff
        first_var.update(results)
The main looks like this at the moment:
first_var = A()
second_var = B()

while True:
    async for new_second_var in do_stuff(first_var, second_var):
        if new_second_var:
            ## stop the do_other_stuff that is currently running
            ## to re-launch it with the updated new_second_var
            do_other_stuff(first_var, new_second_var)
        else:  ## used for the first call
            do_other_stuff(first_var, second_var)
Here are my questions:
Is there a better solution to make this scheme work?
How can I implement the "stopping" part since there is a while True loop that fills first_var by reference?
Will the instance of A (first_var) be passed by reference to do_stuff if first_var doesn't get modified inside it?
Is it even possible to have an asynchronous generator in another process?
Is it even possible at all?
This is using Python 3.6 for the async generators.
I hope this is somewhat clear! Thanks a lot!
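One way to approach the "stopping" question is to avoid killing the worker process and instead share a multiprocessing.Event that the while True loop polls. Below is a minimal sketch, with illustrative names (stop_event, results) that are not part of the code above:

import multiprocessing
import time

def do_other_stuff(stop_event, results):
    # stands in for the real worker loop; exits when the parent sets the event
    while not stop_event.is_set():
        results.put("partial result")  # placeholder for the real work
        time.sleep(1)

if __name__ == "__main__":
    stop_event = multiprocessing.Event()
    results = multiprocessing.Queue()
    worker = multiprocessing.Process(target=do_other_stuff, args=(stop_event, results))
    worker.start()
    time.sleep(3)
    stop_event.set()  # ask the loop to stop instead of terminating the process
    worker.join()

Note that an A instance handed to another process this way is pickled and copied, not shared by reference; true sharing across processes needs something like a multiprocessing.Manager proxy.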

Related

Incremented dict does not keep the value

I have the following situation:
class Test:
    cities_visited: dict

    @staticmethod
    def prepare_city_dict(persons):
        Test.cities_visited = {}
        for i in range(len(persons)):
            name = persons[i].surname
            Test.cities_visited[name] = Test.create_visit()

    @staticmethod
    def create_visit():
        counter: dict = {"City1": 0, "City2": 0, "City3": 0}
        return counter

    @staticmethod
    def increment_visit(surname: str, key):
        counter_visit = Test.cities_visited[surname]
        current_value = counter_visit[key]
        print(current_value)
        counter_visit[key] = current_value + 1
        Test.cities_visited[surname] = counter_visit
At start-up I call Test.prepare_city_dict; then I create a thread, acquire a lock, and call other stuff. At some point I try to increment two cities:
Test.increment_visit("Dummy", "City1")
Test.increment_visit("Dummy", "City2")
If I log how many times a city was visited, only 'City1' is correctly incremented.
I am coming from a different language (which is pretty obvious, I think :D). Running my code in a Docker container on Windows, everything is incremented properly.
Running the same configuration (container) under Linux, only the first 'City1' is properly incremented.
I thought it was a race condition, but unfortunately I cannot reproduce it and cannot figure out what is going on.
+++ UPDATE:
class TestClass:
    def main():
        Test.prepare_city_dict(persons)
        lock = threading.Lock()
        thread = threading.Thread(target=TestClass.process_message,
                                  args=(lock, persons,))
        thread.start()

    def process_message(lock, persons):
        lock.acquire()
        Test.increment_visit("Dummy", "City1")
        # ..... lots of calculations
        Test.increment_visit("Dummy", "City2")
        lock.release()
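For comparison, here is a minimal thread-safe version of the increment logic, assuming the only requirement is that any thread can bump a counter: keep a single lock next to the shared dict and take it inside increment_visit itself, instead of only around one block in one thread. This is a sketch, not the full class above:

import threading

class Test:
    cities_visited: dict = {}
    _lock = threading.Lock()  # one lock shared by every caller (name is illustrative)

    @staticmethod
    def increment_visit(surname: str, key: str):
        with Test._lock:
            counter = Test.cities_visited.setdefault(
                surname, {"City1": 0, "City2": 0, "City3": 0})
            counter[key] += 1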

Passing variables between classes in different threads?

Assume I have two classes that use threads:
class foo(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self, name="foo=>bar")
        self.var1 = {}

    def run(self):
        while True:
            value, name = getvalue()  # name is a string
            self.var1[name] = value
            bar(self)

class bar(threading.Thread):
    def __init__(self, fooInstance):
        threading.Thread.__init__(self, name="bar")

    def run(self):
        while True:
            arg = myfunction()  # some function (not shown for simplicity)
            val = myOtherfunction(fooInstance.var1[arg])  # other function
            print(val)

f = foo()
f.start()
The variable var1 in foo will change over time, and bar needs to be aware of these changes. It makes sense to me, but I wonder if there is something fundamental here that could fail eventually. Is this correct in Python?
The actual sharing part is the same question as "how do I share a value with another object?" without threads, and all the same solutions will work.
For example, you're already passing the foo instance into the bar initializer, so just get it from there:
class bar(threading.Thread):
    def __init__(self, fooInstance):
        threading.Thread.__init__(self, name="bar")
        self.var1 = fooInstance.var1
But is this thread-safe?
Well, yes, but only because you never actually start the background thread. But I assume in your real code, you're going to have two threads running at the same time, both accessing that var1 value. In which case it's not thread-safe without some kind of synchronization. For example:
class foo(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self, name="foo=>bar")
        self.var1 = {}
        self.var1lock = threading.Lock()

class bar(threading.Thread):
    def __init__(self, fooInstance):
        threading.Thread.__init__(self, name="bar")
        self.var1 = fooInstance.var1
        self.var1lock = fooInstance.var1lock
And now, instead of this:
self.var1[name] = value
… you do this:
with self.var1lock:
    self.var1[name] = value
And likewise, instead of this:
val = myOtherfunction(fooInstance.var1[arg])  # other function
… you do this:
with self.var1lock:
    var1arg = self.var1[arg]
val = myOtherfunction(var1arg)
Or… as it turns out, in CPython, updating a value for a single key in a dict (only a builtin dict, not a subclass or custom mapping class!) has always been atomic, and probably always will be. If you want to rely on that fact, you can. But I'd only do that if the lock turned out to be a significant performance issue. And I'd comment every use of it to make it clear, too.
If you'd rather pass values instead of share them, the usual answer is queue.Queue or one of its relatives.
But this requires a redesign of your program. For example, maybe you want to pass each new/changed key-value pair over the queue. That would go something like this:
class foo(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self, name="foo=>bar")
        self.var1 = {}
        self.q = queue.Queue()

    def run(self):
        b = bar(self)
        b.start()
        while True:
            value, name = getvalue()  # name is a string
            self.var1[name] = value
            self.q.put((name, value))

class bar(threading.Thread):
    def __init__(self, fooInstance):
        threading.Thread.__init__(self, name="bar")
        self.var1 = copy.deepcopy(fooInstance.var1)
        self.q = fooInstance.q

    def _checkq(self):
        while True:
            try:
                key, val = self.q.get_nowait()
            except queue.Empty:
                break
            else:
                self.var1[key] = val

    def run(self):
        while True:
            self._checkq()
            arg = myfunction()  # some function (not shown for simplicity)
            val = myOtherfunction(self.var1[arg])  # other function
            print(val)

Python Multiprocessing.Pool in a class

I am trying to change my code to a more object-oriented format. In doing so, I am lost on how to 'visualize' what is happening with multiprocessing and how to solve it. On the one hand, the class should track changes to local variables across functions; on the other, I believe multiprocessing creates a copy of the code to which the original instance would not have access. I need to figure out a way to manipulate classes, within a class, using multiprocessing, and have the parent class retain all the manipulated values in the nested classes.
A simple version (OLD CODE):
def runMultProc():
    ...
    dictReports = {}
    listReports = ['reportName1.txt', 'reportName2.txt']
    tasks = []
    pool = multiprocessing.Pool()
    for report in listReports:
        if report not in dictReports:
            dictReports[today][report] = {}
            tasks.append(pool.apply_async(worker, args=([report, dictReports[today][report]])))
        else:
            continue
    for task in tasks:
        report, currentReportDict = task.get()
        dictReports[report] = currentReportDict

def worker(report, currentReportDict):
    <Manipulate_reports_dict>
    return report, currentReportDict
NEW CODE:
class Transfer():
    def __init__(self):
        self.masterReportDictionary[<todays_date>] = [reportObj1, reportObj2]

    def processReports(self):
        self.pool = multiprocessing.Pool()
        self.pool.map(processWorker, self.masterReportDictionary[<todays_date>])
        self.pool.close()
        self.pool.join()

    def processWorker(self, report):
        # **process and manipulate report, currently no return**
        report.name = 'foo'
        report.path = '/path/to/report'

class Report():
    def __init__(self):
        self.name = ''
        self.path = ''
        self.errors = {}
        self.runTime = ''
        self.timeProcessed = ''
        self.hashes = {}
        self.attempts = 0
I don't think this code does what I need it to do, which is to have it process the list of reports in parallel AND, as processWorker manipulates each report object, store those results. As I am fairly new to this, I was hoping someone could help.
The big difference between the two is that the first one builds a dictionary and returns it. The second model shouldn't really return anything; I just need the classes to finish being processed, and they should have the relevant information within them.
Thanks!
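One possible direction, assuming the goal is simply to get the manipulated Report objects back into the parent process: since Pool workers operate on pickled copies, have the worker return the modified copy and keep what pool.map returns. The sketch below uses a module-level worker function (bound methods do not pickle cleanly) and simplified classes, so the names differ from the code above:

import multiprocessing

class Report(object):
    def __init__(self, name=''):
        self.name = name
        self.path = ''

def process_worker(report):
    # runs in a child process on a copy of the Report
    report.path = '/path/to/' + report.name
    return report  # ship the modified copy back to the parent

class Transfer(object):
    def __init__(self, reports):
        self.reports = reports

    def processReports(self):
        pool = multiprocessing.Pool()
        # map returns the modified copies in order; keep them in place of the originals
        self.reports = pool.map(process_worker, self.reports)
        pool.close()
        pool.join()

if __name__ == '__main__':
    t = Transfer([Report('reportName1.txt'), Report('reportName2.txt')])
    t.processReports()
    print([r.path for r in t.reports])

If the objects really must be mutated in place rather than replaced, a multiprocessing.Manager proxy is the usual alternative, at the cost of proxy overhead.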

Python Running Multiple Locks across Multiple Threads

So the situation is that I have multiple methods, which might be threaded simultaneously, but each needs its own lock against being re-threaded until it has run. They are established by initialising a class with some data-processing options:
class InfrequentDataDaemon(object): pass
class FrequentDataDaemon(object): pass

def addMethod(name):
    def wrapper(f):
        setattr(name, f.__name__, staticmethod(f))
        return f
    return wrapper

class DataProcessors(object):
    lock = threading.Lock()

    def __init__(self, options):
        self.common_settings = options['common_settings']
        self.data_processing_configurations = options['data_processing_configurations']  # configs for each processing method
        self.data_processing_types = options['data_processing_types']
        self.Data_Processing_Functions = {}
        # I __init__ each processing method as a separate function so that it can be locked
        for type in options['data_processing_types']:
            def bindFunction1(name):
                def func1(self, data=None, lock=None):
                    config = self.data_processing_configurations[data['type']]  # get the right config for the data type
                    with lock:
                        FetchDataBaseStuff(data['type'])
                        # I don't want this to be run more than once at a time per DataProcessing type,
                        # but it's fine if multiple DoSomethings run at once, as long as each data type is different!
                        DoSomething(data, config)
                        WriteToDataBase(data['type'])
                func1.__name__ = "Processing_for_{}".format(type)
                self.Data_Processing_Functions[func1.__name__] = func1  # add this function to the dictionary
            bindFunction1(type)

        # Then I add some methods to a daemon that are going to check if our DataProcessors need to be called
        def fast_process_types(data):
            if not example_condition is True: return
            if not data['type'] in self.data_processing_types: return  # check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data, lock)).start()

        def slow_process_types(data):
            if not some_other_condition is True: return
            if not data['type'] in self.data_processing_types: return  # check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data, lock)).start()

        addMethod(InfrequentDataDaemon)(slow_process_types)
        addMethod(FrequentDataDaemon)(fast_process_types)
The idea is to lock each method in DataProcessors.Data_Processing_Functions so that each method is only accessed by one thread at a time (and the rest of the threads for the same method are queued). How does locking need to be set up to achieve this effect?
I'm not sure I completely follow what you're trying to do here, but could you just create a separate threading.Lock object for each type?
class DataProcessors(object):
    def __init__(self, options):
        self.common_settings = options['common_settings']
        self.data_processing_configurations = options['data_processing_configurations']  # configs for each processing method
        self.data_processing_types = options['data_processing_types']
        self.Data_Processing_Functions = {}
        self.locks = {}
        # __init__ each processing method as a separate function so that it can be locked
        for type in options['data_processing_types']:
            self.locks[type] = threading.Lock()

            def bindFunction1(name):
                def func1(self, data=None):
                    config = self.data_processing_configurations[data['type']]  # get the right config for the data type
                    with self.locks[data['type']]:
                        FetchDataBaseStuff(data['type'])
                        DoSomething(data, config)
                        WriteToDataBase(data['type'])
                func1.__name__ = "Processing_for_{}".format(type)
                self.Data_Processing_Functions[func1.__name__] = func1  # add this function to the dictionary
            bindFunction1(type)

        # Then I add some methods to a daemon that are going to check if our DataProcessors need to be called
        def fast_process_types(data):
            if not example_condition is True: return
            if not data['type'] in self.data_processing_types: return  # check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data)).start()

        def slow_process_types(data):
            if not some_other_condition is True: return
            if not data['type'] in self.data_processing_types: return  # check that we are doing something with this type of data
            threading.Thread(target=self.Data_Processing_Functions["Processing_for_{}".format(data['type'])],
                             args=(self, data)).start()

        addMethod(InfrequentDataDaemon)(slow_process_types)
        addMethod(FrequentDataDaemon)(fast_process_types)
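A tiny standalone sketch of the same idea, with made-up type names: calls for the same type queue up behind one lock while different types proceed independently.

import threading
import time

locks = {"typeA": threading.Lock(), "typeB": threading.Lock()}

def process(data_type, n):
    with locks[data_type]:
        print("worker {} holds the {} lock".format(n, data_type))
        time.sleep(0.5)  # stand-in for the fetch/process/write sequence

threads = [threading.Thread(target=process, args=(t, i))
           for i, t in enumerate(["typeA", "typeA", "typeB"])]
for t in threads:
    t.start()
for t in threads:
    t.join()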

Working with parallel python and classes

I was shocked to learn how few tutorials and guides there are on the internet regarding Parallel Python (PP) and handling classes. I've run into a problem where I want to initiate a couple of instances of the same class and afterwards retrieve some variables (for instance, reading 5 data files in parallel and then retrieving their data). Here's a simple piece of code to illustrate my problem:
import pp

class TestClass:
    def __init__(self, i):
        self.i = i

    def doSomething(self):
        print "\nI'm being executed!, i = " + str(self.i)
        self.j = 2 * self.i
        print "self.j is supposed to be " + str(self.j)
        return self.i

class parallelClass:
    def __init__(self):
        job_server = pp.Server()
        job_list = []
        self.instances = []  # for storage of the class objects
        for i in xrange(3):
            TC = TestClass(i)  # initiate a new instance of the TestClass
            self.instances.append(TC)  # store the instance
            job_list.append(job_server.submit(TC.doSomething, (), ()))  # add some jobs to the job_list
        results = [job() for job in job_list]  # execute order 66...
        print "\nIf all went well there's a nice bunch of objects in here:"
        print self.instances
        print "\nAccessing an object's i works ok, but accessing j does not"
        print "i = " + str(self.instances[2].i)
        print "j = " + str(self.instances[2].j)

if __name__ == '__main__':
    parallelClass()  # initiate the program
I've added comments for your convenience. What am I doing wrong here?
You should use callbacks.
A callback is a function that you pass to the submit call. It will be called with the result of the job as its argument (have a look at the API for more arcane usage).
In your case:
Set up a callback:
class TestClass:
    def doSomething(self):
        j = 2 * self.i
        return j  # It's REQUIRED that you return j here.

    def set_j(self, j):
        self.j = j
Add the callback to the job submit call:
class parallelClass:
    def __init__(self):
        # your code...
        job_list.append(job_server.submit(TC.doSomething, callback=TC.set_j))
And you're done.
I made some improvements to the code to avoid using self.j in the doSomething call, and only use a local j variable.
As mentioned in the comments, in pp you only communicate the result of your job. That's why you have to return this variable; it will be passed to the callback.
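Putting it together, here is a condensed version of the submit loop under the same assumptions (the TestClass above and pp's callback argument), in the Python 2 style of the question:

job_server = pp.Server()
instances = [TestClass(i) for i in xrange(3)]
job_list = [job_server.submit(TC.doSomething, callback=TC.set_j) for TC in instances]
for job in job_list:
    job()  # wait for the job; the callback stores the result on the instance as j
print [TC.j for TC in instances]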
