Data inconsistency in multithreaded Python code - python

First time SO user here. I have a problem with my "thread-safe" Python singleton. The class stores data that is only read by the worker threads. The main thread updates some of the data. I was under the impression that a singleton will ensure that my worker threads will have access to the same data. However, in reality, some of the threads still process "old" data (pre data change in the main thread).
I used the singleton implementation from https://refactoring.guru:
import logging
from threading import Lock

logger = logging.getLogger(__name__)


class SingletonMeta(type):
    _instances = {}
    _lck = Lock()

    def __call__(cls, *args, **kwargs):
        # Serialize instance creation so two threads cannot both construct it
        with cls._lck:
            if cls not in cls._instances:
                instance = super().__call__(*args, **kwargs)
                cls._instances[cls] = instance
        return cls._instances[cls]


class Storage(metaclass=SingletonMeta):
    def __init__(self, shared=None):
        self._shared = shared

    @property
    def shared(self):
        return self._shared

    def refresh(self, update):
        if update:
            self._shared['managed'] = update
            logger.debug("[+] Shared data refreshed.")
The refresh() method is only called from the main thread. The worker threads read the dictionary via the shared property.
Is this the correct approach? I am afraid not, since the data is not consistent among the worker threads. Can anybody help me understand what I am doing wrong and why the data is not updated for all threads?
Thank you
Update
After more investigation and reading, it turns out the singleton is not the problem. My problem (and the part I didn't mention because I didn't think of it at first) is rooted in the different gunicorn worker processes. The thread(s) in the worker process that applies the data update have the correct data and the others don't. I will have to think about how I can synchronize the data between workers.
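As a sanity check on the conclusion in the update, here is a minimal sketch (the read_from_workers helper and the sample values are made up for illustration) showing that threads inside one process do see the main thread's refresh through the singleton. Each worker process, by contrast, holds its own separate instance in its own memory, so a refresh in one process is invisible to the others.

```python
import threading


class SingletonMeta(type):
    _instances = {}
    _lock = threading.Lock()

    def __call__(cls, *args, **kwargs):
        with cls._lock:
            if cls not in cls._instances:
                cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class Storage(metaclass=SingletonMeta):
    def __init__(self, shared=None):
        self._shared = shared if shared is not None else {}

    @property
    def shared(self):
        return self._shared

    def refresh(self, update):
        if update:
            self._shared["managed"] = update


def read_from_workers(n_workers=4):
    """Hypothetical helper: each thread looks up the singleton and reads 'managed'."""
    seen = []
    seen_lock = threading.Lock()

    def worker():
        value = Storage().shared.get("managed")
        with seen_lock:
            seen.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


Storage({"managed": "old"})      # first call creates the instance
Storage().refresh("new")         # main thread updates the shared dict
results = read_from_workers()    # every thread in this process sees "new"
```

If all threads in one process agree like this but requests still return stale data, the stale reads are most likely served by a different process.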

Related

How to periodically call instance method from a separate process

I'm trying to write a class to help with buffering some data that takes a while to read in, and which needs to be periodically updated. The python version is 3.7.
There are 3 criteria I would like the class to satisfy:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
I've tried having instances create their own subprocess for running the updates. This causes problems because simply passing the instance to another process seems to create a copy, so the desired instance is not updated automatically.
Below is an example of the approach I'm trying. Can anyone help getting the automatic update to work?
import multiprocessing as mp
import random
import time


def refresh_helper(buffer, lock):
    """Periodically calls refresh method in a buffer instance."""
    while True:
        # A Lock is itself a context manager; lock.acquire() only returns a bool
        with lock:
            buffer._refresh_data()
        time.sleep(10)


class Buffer:
    def __init__(self):
        # Set up a helper process to periodically update data
        self.lock = mp.Lock()
        self.proc = mp.Process(target=refresh_helper, args=(self, self.lock), daemon=True)
        self.proc.start()
        # Do an initial update
        self.data = None
        self.update()

    def _refresh_data(self):
        """Pretends to read in some data. This would take a while for real data"""
        numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
        data = [random.choice(numbers) for _ in range(3)]
        self.data = data

    def update(self):
        with self.lock:
            self._refresh_data()

    def get_data(self):
        return self.data


if __name__ == '__main__':
    buffer = Buffer()
    data_first = buffer.get_data()
    time.sleep(11)
    data_second = buffer.get_data()  # should be different from first
Here is an approach that makes use of a multiprocessing queue. It's similar to what you had implemented, but your implementation was trying to assign to self within Buffer._refresh_data in both processes. Because self refers to a different Buffer object in each process, they did not affect each other.
To send data from one process to another you need to use shared memory, pipes, or some other such mechanism. Python's multiprocessing library provides multiprocessing.Queue, which simplifies this for us.
To send data from the refresh helper to the main process we need only use queue.put in the helper process, and queue.get in the main process. The data being sent must be serializable using Python's pickle module to be sent between the processes through a multiprocessing.Queue.
Using a multiprocessing.Queue also saves us from having to use locks ourselves, since the queue handles that internally.
To handle the helper process starting and stopping cleanly for the example, I have added __enter__ and __exit__ methods to make Buffer into a context manager. They can be removed if you would rather manually stop the helper process.
I have also changed your _refresh_data method into _get_new_data, which returns new data half the time, and has no new data to give the other half of the time (i.e. it returns None). This was done to make it more similar to what I imagine a real application for this class would be.
It is important that only static/class methods or external functions are called from the other process, as otherwise they may operate on a self attribute that refers to a completely different instance. The exception is if the attribute is meant to be sent across the process barrier, like with self.queue. That is why the update method can use self.queue to send data to the main process despite self being a different Buffer instance in the other process.
The method get_next_data will return the oldest item found in the queue. If there is nothing in the queue, it will wait until something is added to the queue. You can change this behaviour by giving the call to self.queue.get a timeout (which raises queue.Empty if it times out), or by using self.queue.get_nowait (which raises queue.Empty immediately if the queue is empty).
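A minimal sketch of the non-blocking variants: both get(timeout=...) and get_nowait() signal an empty queue by raising queue.Empty, which a small wrapper can turn into None. The try_get helper is hypothetical and not part of the Buffer class in this answer.

```python
import multiprocessing as mp
import queue


def try_get(q, timeout=None):
    """Return the next item from q, or None if nothing is available.

    In this helper, timeout=None means "do not wait at all";
    otherwise wait up to `timeout` seconds.
    """
    try:
        if timeout is None:
            return q.get_nowait()
        return q.get(timeout=timeout)
    except queue.Empty:
        # multiprocessing.Queue reuses the exception class from the queue module
        return None
```

A get_next_data method could call such a helper with self.queue so callers can tell "no new data yet" apart from a real item, provided None is never a valid item.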
from __future__ import annotations

import multiprocessing as mp
import random
import time


class Buffer:
    def __init__(self):
        self.queue = mp.Queue()
        self.proc = mp.Process(target=self._refresh_helper, args=(self,))
        self.update()

    def __enter__(self):
        self.proc.start()
        return self

    def __exit__(self, ex_type, ex_val, ex_tb):
        self.proc.kill()
        self.proc.join()

    @staticmethod
    def _refresh_helper(buffer: "Buffer", period: float = 1.0) -> None:
        """Periodically calls refresh method in a buffer instance."""
        while True:
            buffer.update()
            time.sleep(period)

    @staticmethod
    def _get_new_data() -> list[int] | None:
        """Pretends to read in some data. This would take a while for real data"""
        if random.randint(0, 1):
            return random.choices(range(10), k=3)
        return None

    def update(self) -> None:
        new_data = self._get_new_data()
        if new_data is not None:
            self.queue.put(new_data)

    def get_next_data(self):
        return self.queue.get()


if __name__ == '__main__':
    with Buffer() as buffer:
        for _ in range(5):
            print(buffer.get_next_data())
Running this code will, as an example, start the helper process, then print out the first 5 pieces of data it gets from the buffer. The first one will be from the update that is performed when the buffer is initialized, if that update happened to produce data. The others will all be provided by the helper process running update.
Let's review your criteria:
Manual update: An instance of the class should have an 'update' function, which reads in new data.
The Buffer.update method can be used for this.
Automatic update: An instance's update method should be periodically run, so the buffered data never gets too old. As reading takes a while, I'd like to do this without blocking the main process.
This is done by a helper process which adds data to a queue for later processing. If you would rather throw away old data, and only process the newest data, then the queue can be swapped out for a multiprocessing.Array, or whatever other multiprocessing-compatible shared memory wrapper you prefer.
Self contained: Users should be able to inherit from the class and overwrite the method for refreshing data, i.e. the automatic updating should work out of the box.
This works by overwriting the _get_new_data method. So long as it's a static or class method which returns the data, automatic updating should work with it without any changes.
All processes exist in different areas of memory from one another, each of which is meant to be fully separate from all others. As you pointed out, the additional process creates a copy of the instance on which it operates, meaning the updated version exists in a separate memory space from the instance you're running get_data() on. Because of this there is no easy way to perform this operation on this specific instance from a different process.
Given that you want the updating of the data to not block the checking of the data, threading is a poor fit if the update is CPU-bound, because CPython's global interpreter lock lets only one thread execute Python bytecode at a time in any given process. Instead, you need to use an object which exists in a memory space shared between all processes. To do this, you can use a multiprocessing.Value object or a multiprocessing.Array, both of which store ctypes objects. Both of these objects already existed in 3.7 (see the multiprocessing documentation).
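A minimal sketch of the shared-memory idea, assuming the buffered data fits a fixed-size array of C ints. The helper names (write_latest, read_latest, refresh_forever) and the 0.2 second period are made up for illustration.

```python
import multiprocessing as mp
import random
import time


def write_latest(shared, values):
    """Overwrite the shared array in place with the newest values."""
    with shared.get_lock():              # the default lock is re-entrant
        for i, value in enumerate(values):
            shared[i] = value


def read_latest(shared):
    """Return a consistent snapshot of the shared array as a plain list."""
    with shared.get_lock():
        return shared[:]


def refresh_forever(shared, period):
    """Helper-process loop: pretend to read new data, then publish it."""
    while True:
        write_latest(shared, [random.randrange(10) for _ in range(len(shared))])
        time.sleep(period)


if __name__ == "__main__":
    latest = mp.Array("i", 3)            # three C ints, zero-initialised
    helper = mp.Process(target=refresh_forever, args=(latest, 0.2), daemon=True)
    helper.start()
    time.sleep(0.5)
    print(read_latest(latest))           # values written by the helper process
    helper.terminate()
    helper.join()
```

Unlike a queue, the array only ever holds the newest values, so readers never see a backlog of old data.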
If this approach does not work, consider examining these similar threads:
Sharing a complex object between processes?
multiprocessing: sharing a large read-only object between processes?
Good luck with your project!

Limit number of cores in nondaemon pool Python

I have a script in which I run some processes with pool.apply_async, running them as non-daemonic to avoid "zombie" processes overwhelming memory. It has worked well so far, but now that I have scaled to a larger in-memory dataset, using all my cores blows up memory. I want to limit the number of cores used in those cases, but can't get it to work.
Normally I would integrate something like the following
pool = Pool(self.nb_cores)
to limit the number of cores. However, I can't work out where to integrate it into a non-daemonic pool.
import multiprocessing
import multiprocessing.pool


class NoDaemonProcess(multiprocessing.Process):
    """
    Extends the multiprocessing Process class to disable
    the daemonic property. Polling the daemonic property
    will always return False and cannot be set.
    """

    @property
    def daemon(self):
        """
        Always return False
        """
        return False

    @daemon.setter
    def daemon(self, value):
        """
        Pass over the property setter
        :param bool value: Ignored setting
        """
        pass


class NoDaemonContext(type(multiprocessing.get_context())):
    """
    With the new multiprocessing module, everything is based
    on contexts after the overhaul. This extends the base
    context so that we set all Processes to NoDaemonProcesses
    """
    Process = NoDaemonProcess


class NoDaemonPool(multiprocessing.pool.Pool):
    """
    This extends the normal multiprocessing Pool class so that
    all spawned child processes are non-daemonic, allowing them
    to spawn their own children processes.
    """

    def __init__(self, *args, **kwargs):
        kwargs['context'] = NoDaemonContext()
        super(NoDaemonPool, self).__init__(*args, **kwargs)
I know I need to integrate a number of cores limit somewhere ... just can't seem to find the precise function I need in my context.
Your custom NoDaemonPool class is derived from multiprocessing.pool.Pool, so it accepts processes (the number of worker processes to use) as a keyword argument:
pool = NoDaemonPool(processes=nb_cores)
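A minimal sketch of choosing that number, using the plain multiprocessing.Pool for brevity; the same processes keyword should apply to a subclass that forwards its arguments to Pool. The capped_worker_count helper is hypothetical. Note that processes caps the number of worker processes, while the operating system still decides which cores they run on.

```python
import multiprocessing
import os


def capped_worker_count(requested, available=None):
    """Clamp the requested worker count to the range [1, available cores]."""
    if available is None:
        available = os.cpu_count() or 1   # cpu_count() can return None
    return max(1, min(requested, available))


if __name__ == "__main__":
    nb_cores = capped_worker_count(2)
    # At most nb_cores tasks run at once, which bounds peak memory use
    with multiprocessing.Pool(processes=nb_cores) as pool:
        print(pool.map(abs, [-3, -2, 1]))
```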

subclassing Celery Task for a `ClassTask` mixin

Prefacing my question with the fact that I'm new to Celery and this (1) may have been answered somewhere else (if so, I couldn't find the answer) or (2) there may be a better way to accomplish my objective than what I'm directly asking.
Also, I am aware of celery.contrib.methods, but task_method does not quite accomplish what I am looking for.
My Objective
I would like to create a class mixin that turns a whole class into a Celery task. For example, a mixin represented by something like the code below (which right now does not run):
from celery import Task


class ClassTaskMixin(Task):
    @classmethod
    def enqueue(cls, *args, **kwargs):
        cls.delay(*args, **kwargs)

    def run(self, *args, **kwargs):
        Obj = type(self.name, (), {})
        Obj(*args, **kwargs).run_now()

    def run_now(self):
        raise NotImplementedError()
Unlike when using task_method, I do not want to fully instantiate the class before the task is queued and .delay() is called. Rather, I want to simply hand off the class name along with any relevant initialization parameters to the async process. The async process would then fully instantiate the class using the class name and the given initialization parameters, and then call some method (say .run_now(), for example) on the instantiated object.
Example Use Case
Constructing and sending email asynchronously would be an example use for the mixin I need.
class WelcomeEmail(EmailBase, ClassTaskMixin):
    def __init__(self, recipient_address, template_name, template_context):
        self.recipient_address = recipient_address
        self.template_name = template_name
        self.template_context = template_context

    def send(self):
        self.render_templates()
        self.construct_mime()
        self.archive_to_db()
        self.send_smtp_email()

    def run_now(self):
        self.send()
The above code would send an email in an async Celery process by calling WelcomeEmail.enqueue(recipient_address, template_name, template_context). Sending the email synchronously in-process would be accomplished by calling WelcomeEmail(recipient_address, template_name, template_context).send().
Questions
Is there any reason that what I'm trying to do is very, very wrong within the Celery framework?
Is there a better way to structure the mixin to make it more Celery-onic than what I've proposed (better attribute names, different method structure, etc.)?
What am I missing to make the mixin functional in a use case as I've described?
Apparently this issue isn't hugely interesting to a lot of people, but... I've accomplished what I set out to do.
See pull request https://github.com/celery/celery/pull/1897 for details.
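The hand-off described in the question (send only the class name and its initialization parameters, then instantiate on the worker side) can be sketched without Celery at all. Everything below is hypothetical illustration in plain Python, not Celery's API and not the contents of the pull request: TASK_CLASSES, register, make_message and run_message are made-up names, and the WelcomeEmail here is a simplified stand-in.

```python
TASK_CLASSES = {}  # hypothetical registry: class name -> class


def register(cls):
    """Class decorator that makes a class discoverable by name."""
    TASK_CLASSES[cls.__name__] = cls
    return cls


def make_message(cls, *args, **kwargs):
    """Producer side: describe the work without instantiating the class."""
    return {"cls": cls.__name__, "args": args, "kwargs": kwargs}


def run_message(message):
    """Worker side: look the class up, instantiate it, then call run_now()."""
    cls = TASK_CLASSES[message["cls"]]
    return cls(*message["args"], **message["kwargs"]).run_now()


@register
class WelcomeEmail:
    def __init__(self, recipient_address, template_name):
        self.recipient_address = recipient_address
        self.template_name = template_name

    def run_now(self):
        # Stand-in for rendering and sending the email
        return f"sent {self.template_name} to {self.recipient_address}"
```

In a real setup the message would have to travel through the task queue, so the arguments would need to be serializable by whatever serializer the queue uses.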

How to share object tree with process fork?

I don't have much experience with multithreading, and I'm trying to get something like the below working:
from multiprocessing import Process


class Node:
    def __init__(self):
        self.children = {}


class Test(Process):
    def __init__(self, tree):
        super().__init__()
        self.tree = tree

    def run(self):
        # infinite loop which does stuff to the tree
        self.tree.children[1] = Node()
        self.tree.children[2] = Node()


x = Node()
t = Test(x)
t.start()
print(x.children)  # random access to tree
I realize this shouldn't (and doesn't) work for a variety of very sensible reasons, but I'm not sure how to get it to work. Referring to the documentation, it seems that I need to do something with Managers and Proxies, but I honestly have no idea where to start, or whether that is actually what I'm looking for. Could someone provide an example of the above that works?
multiprocessing has limited support for implicitly shared objects, which can even share lists and dicts.
In general, multiprocessing is shared-nothing (after the initial fork) and relies on explicit communication between the processes. This adds overhead (how much really depends on the kind of interaction between the processes), but neatly avoids a lot of the pitfalls of multithreaded programming. The high-level building blocks of multiprocessing favor master/slave models (esp. the Pool class), with masters handing out work items, and slaves operating on them, returning results.
Keeping state in sync across several processes may, depending how often they change, incur a prohibitive overhead.
TL;DR: It can be done, but probably shouldn't.
import time, multiprocessing


class Test(multiprocessing.Process):
    def __init__(self, manager):
        super().__init__()
        self.quit = manager.Event()
        self.dict = manager.dict()

    def stop(self):
        self.quit.set()
        self.join()

    def run(self):
        self.dict['item'] = 0
        while not self.quit.is_set():
            time.sleep(1)
            self.dict['item'] += 1


m = multiprocessing.Manager()
t = Test(m)
t.start()
for x in range(10):
    time.sleep(1.2)
    print(t.dict)
t.stop()
The multiprocessing examples show how to create proxies for more complicated objects, which should allow you to implement the tree structure in your question.
It seems to me that what you want is actual multithreading, rather than multiprocessing. With threads rather than processes, you can do precisely that, since threads run in the same process, sharing all memory and therefore data with each other.
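A minimal sketch of the threaded variant, reusing the Node class from the question. The TreeWorker name and the lock are additions for illustration: the lock keeps readers from seeing the tree halfway through an update.

```python
import threading


class Node:
    def __init__(self):
        self.children = {}


class TreeWorker(threading.Thread):
    def __init__(self, tree, lock):
        super().__init__()
        self.tree = tree      # same object as in the main thread, not a copy
        self.lock = lock

    def run(self):
        # Stand-in for the loop that modifies the tree
        with self.lock:
            self.tree.children[1] = Node()
            self.tree.children[2] = Node()


root = Node()
tree_lock = threading.Lock()
worker = TreeWorker(root, tree_lock)
worker.start()
worker.join()

with tree_lock:
    child_keys = sorted(root.children)   # the worker's changes are visible here
```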

Passing values between two threads

I have two threads and both the threads do a set of calculations and obtain results. The problem is at one point both the thread's calculations require the results obtained in the other. I thought of inheritance but could only pass values from one thread to another. How can I pass values between two threads without using a global variable?
I want to do something like this.
class first(threading.Thread):
    def __init__(self, flag, second):
        ##rest of the class first##

class second(threading.Thread):
    def __init__(self, flag, first):
        ##rest of the class second##

def main():
    flag = threading.Condition()
    First = first(flag, Second)
    First.start()
    Second = second(flag, First)
    Second.start()
I get an error when I do the above.
You can use the Queue module (named queue in Python 3): give each of your threads a Queue.Queue object (queue.Queue in Python 3). Each thread can then do its calculations, put its result in the other thread's queue, and then listen on its own queue until the result of the other thread arrives.
Make sure to post the result first and then wait for the other thread's result, otherwise your threads will end up deadlocked.
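A minimal sketch of that exchange using Python 3's queue module. The worker and exchange functions and the doubling and adding steps are placeholders for the real calculations.

```python
import queue
import threading


def worker(name, own_value, inbox, peer_inbox, results):
    partial = own_value * 2             # stand-in for the first stage of work
    peer_inbox.put(partial)             # post the result first...
    other = inbox.get(timeout=5)        # ...then wait for the other thread's
    results[name] = partial + other     # second stage uses both results


def exchange(a, b):
    """Run two threads that each need the other's intermediate result."""
    inbox_first, inbox_second = queue.Queue(), queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker,
                         args=("first", a, inbox_first, inbox_second, results)),
        threading.Thread(target=worker,
                         args=("second", b, inbox_second, inbox_first, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each thread only holds references to the two queues, neither thread object needs to exist before the other is constructed, which avoids the circular dependency in the question's code.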
