Python synchronise between threads and processes

A bit of background:
I am writing a function in Django to get the next invoice number, which needs to be sequential (no gaps), so the function looks like this:
def get_next_invoice_number():
    """
    Returns the max(invoice_number) + 1 from the payment records
    Does NOT pre-allocate number
    """
    # TODO ensure this is thread safe
    max_num = Payment.objects.aggregate(Max('invoice_number'))['invoice_number__max']
    if max_num is not None:
        return max_num + 1
    return PaymentConfig.min_invoice_number
Now the problem is that this function only returns max() + 1. In my production environment I have multiple Django processes, so if this function is called twice for two different payments (before the first record is saved), both will get the same invoice number.
To mitigate this I can override the save() method to call get_next_invoice_number(), which minimises the time gap between these calls, but there is still a very small chance of the problem occurring.
So I want to implement a lock in the approve method, something like
from multiprocessing import Lock

lock = Lock()

class Payment(models.Model):
    def approve(self):
        lock.acquire()
        try:
            self.invoice_number = get_next_invoice_number()
            self.save()
        except:
            pass
        finally:
            lock.release()
So my questions are:
Does this look okay?
The lock is for multiprocessing; what about threads?
UPDATE:
As my colleague pointed out, this is not going to work when it's deployed to multiple servers; the locks will be meaningless.
Looks like DB transaction locking is the way to go.
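A minimal sketch of what that could look like in Django (InvoiceCounter is a hypothetical single-row counter model, not part of the code above; select_for_update() makes concurrent callers queue on the counter row):

from django.db import transaction

def get_next_invoice_number():
    # Sketch only: InvoiceCounter is a hypothetical single-row model
    # holding the last issued number.
    with transaction.atomic():
        counter = InvoiceCounter.objects.select_for_update().get(pk=1)
        counter.last_number += 1
        counter.save(update_fields=['last_number'])
        return counter.last_number

Because the lock is taken in the database, it holds across threads, processes and servers alike.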

The easiest way to do this, by far, is with your database's existing tools for creating sequences. In fact, if you don't mind the value starting from 1 you can just use Django's AutoField.
If your business requirements are such that you need to choose a starting number, you'll have to look up how to create a sequence with a custom start value in your particular database.
Trying to ensure this with locks or transactions will be harder to do and slower to perform.
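For illustration, a sketch of the sequence approach, assuming PostgreSQL and a made-up sequence name; the sequence itself would be created once with something like CREATE SEQUENCE invoice_number_seq START WITH 1000:

from django.db import connection

def get_next_invoice_number():
    # Assumes a PostgreSQL sequence named invoice_number_seq already exists.
    with connection.cursor() as cursor:
        cursor.execute("SELECT nextval('invoice_number_seq')")
        return cursor.fetchone()[0]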

Related

Signaling a long running Huey task to stop

I'm using Huey to perform some processing operations on objects I have in my Django application.
@db_task()
def process_album(album: Album) -> None:
    images = album.objects.find_non_processed_images()
    for image in images:
        if album.should_pause():
            return
        process_image(album, image)
This is a simplified example of the situation I wish to solve. I have an Album model that stores a variable number of Images that need to be processed. The processing operation is defined in a different function that is wrapped with the @task decorator so it can run concurrently (when the number of workers > 1).
The question is how to implement album.should_pause() in the right way. Current implementation looks like this:
def should_pause(self):
    self.refresh_from_db()
    return self.processing_state != AlbumProcessingState.RUNNING
Therefore, on each iteration the database is queried to refresh the model, to make sure the state field hasn't changed to something other than AlbumProcessingState.RUNNING, which would indicate that the album processing task should stop.
Although it works, it feels wrong to have to reload the model from the database on every iteration, but that feeling might be misplaced. What do you think?
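If the per-iteration query does turn out to matter, one possible refinement (a sketch only, reusing the names from the question) is to re-check the flag every N images rather than on every single one:

@db_task()
def process_album(album: Album, check_every: int = 10) -> None:
    images = album.objects.find_non_processed_images()
    for i, image in enumerate(images):
        # Only hit the database every `check_every` images.
        if i % check_every == 0 and album.should_pause():
            return
        process_image(album, image)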

Turn Off Celery Tasks

I'm trying to find a way to be able to turn celery tasks on/off from the django admin. This is mostly to disable tasks that call external services when those services are down or have a scheduled maintenance period.
For my periodic tasks, this is easy, especially with django-celery. But for tasks that are called on demand I'm having some trouble. Currently I'm exploring just storing an on/off status for various tasks in a TaskControl model and then checking that status at the beginning of task execution, returning None if the status is False. This makes me feel dirty due to all the extra db lookups every time a task kicks off. I could use a cache backend that isn't the db, but it seems a little overkill to add caching just for this handful of key/value pairs.
in models.py
# this is a singleton model. singleton code bits omitted for brevity.
class TaskControl(models.Model):
    some_status = models.BooleanField(default=True)
    # more statuses
in tasks.py
@celery.task(ignore_result=True)
def some_task():
    task_control = TaskControl.objects.get(pk=1)
    if not task_control.some_status:
        return None
    # otherwise execute task as normal
What is a better way to do this?
Option 1. Try your simple approach. See if it affects performance. If not, lose the “dirty” feeling.
Option 2. Cache in process memory with a singleton. Add freshness information to your TaskControl model:
from datetime import datetime

class TaskControl(models.Model):
    some_status = models.BooleanField(default=True)
    # more statuses
    expires = models.DateTimeField()
    check_interval = models.IntegerField(default=5 * 60)

    def is_stale(self):
        return (
            (datetime.utcnow() >= self.expires) or
            ((datetime.utcnow() - self.retrieved).total_seconds() >= self.check_interval))
Then in a task_ctl.py:
from datetime import datetime

from myapp.models import TaskControl  # adjust to your project's import path

_control = None

def is_enabled():
    global _control
    if (_control is None) or _control.is_stale():
        _control = TaskControl.objects.get(pk=1)
        # There's probably a better way to set `retrieved`,
        # maybe with a signal or a `Model` override,
        # but this should work.
        _control.retrieved = datetime.utcnow()
    return _control.some_status
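With that in place, the check at the top of each task shrinks to something like this (the import path for task_ctl is illustrative):

from myproject import task_ctl  # wherever task_ctl.py lives

@celery.task(ignore_result=True)
def some_task():
    if not task_ctl.is_enabled():
        return None
    # otherwise execute task as normal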
Option 3. Like option 2, but instead of time-based expiration, use Celery’s remote control to force all workers to reload the TaskControl (you’ll have to write your own control command, and I don’t know if all the internals you will need are public APIs).
Option 4, only applicable if all your Celery workers run on a single machine. Store the on/off flag as a file on that machine’s file system. Query its existence with os.path.exists (that should be a single stat() or something, cheap enough). If the workers and the admin panel are on different machines, use a special Celery task to create/remove the file.
Option 5. Like option 4, but on the admin/web machine: if the file exists, just don’t queue the task in the first place.
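For options 4 and 5 the check itself is tiny; a sketch with a made-up flag path, using the convention that the task is disabled while the file exists:

import os

SOME_TASK_DISABLED_FLAG = '/var/run/myapp/some_task_disabled'  # hypothetical path

def some_task_enabled():
    # A single stat() call per check; cheap compared to a DB round trip.
    return not os.path.exists(SOME_TASK_DISABLED_FLAG)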

Should I use MySQL's LOCK TABLES, or is there a better solution?

I'm working on a multi-process application, and each process sometimes executes the following code:
db_cursor.execute("SELECT MAX(id) FROM prqueue;")
for record in db_cursor.fetchall():
if record[0]:
db_cursor.execute("DELETE FROM prqueue WHERE id='%s'" % record[0]);
db_connector.commit()
And I'm facing the following problem: there may be a situation where two processes take the same maximum value and both try to delete it. Such a situation is not acceptable in the context of my application; each value must be taken (and deleted) by exactly one process.
How can I achieve this? Is locking the table while taking the maximum and deleting absolutely necessary, or is there another, nicer way to do it?
Thank you.
Consider simulating record locks with GET_LOCK().
Choose a name specific to the operation you want to lock, e.g. 'prqueue_max_del'.
Call SELECT GET_LOCK('prqueue_max_del', 30) to lock the name 'prqueue_max_del'; it will return 1 and set the lock if the name becomes available, or return 0 if the lock is still unavailable after 30 seconds (the second parameter is the timeout).
Use SELECT RELEASE_LOCK('prqueue_max_del') when you are finished.
You will have to use the same name in each transaction, and calling GET_LOCK() again in a transaction will release the previously set lock.
Beware: as only the abstract name is locked, all other processes not using this method and abstract name will be able to savage your table independently.
GET_LOCK() docs
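Putting those steps around the original queue code might look roughly like this (a sketch reusing db_cursor and db_connector from the question, not tested):

db_cursor.execute("SELECT GET_LOCK('prqueue_max_del', 30)")
(got_lock,) = db_cursor.fetchone()
if got_lock == 1:
    try:
        db_cursor.execute("SELECT MAX(id) FROM prqueue")
        row = db_cursor.fetchone()
        if row and row[0] is not None:
            db_cursor.execute("DELETE FROM prqueue WHERE id = %s", (row[0],))
            db_connector.commit()
    finally:
        db_cursor.execute("SELECT RELEASE_LOCK('prqueue_max_del')")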

SQLAlchemy - Multithreaded Persistent Object Creation, how to merge back into single session to avoid state conflict?

I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due to the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships, which triggers their automatic addition to the identity map despite being created in separate, concurrent threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
Hence I believe the identity map has already been populated: the assertion error is triggered when I try to add and commit using the global session outside of the concurrent code.
The final detail is that I cannot find a way to get a reference to whatever session object(s) are involved (scoped or otherwise; I don't fully understand how automatic addition to the identity map works in the multithreaded case), so even if I wanted to deal with a separate session per process I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging; merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
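A minimal sketch of that pattern (the worker function, build_child, and the variable names are illustrative, not from the question):

def worker(app, detached_instances):
    # Run inside the passed application's context so this thread has a session.
    with app.app_context():
        db.session.add_all(detached_instances)  # re-attach to this thread's session
        new_objects = [build_child(instance) for instance in detached_instances]
        return new_objects  # merge these into the master session afterwards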
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
class ObjA(object):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(object):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(object):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b

def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]

def parallel_generate():
    list_of_a = session.query(ObjA).all()  # assume there are 1000 of these
    list_of_b = session.query(ObjB).all()  # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    all_the_things = pool.map(fxn, list_of_a)
    return all_the_things
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: the error occurs during the adding phase; the commit never even happens because the assertion fires pre-commit, as far as I could tell.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
    session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascading) are still beyond me, and I cannot explain why the conflict originally occurred. All I know is that using the merge function causes SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious about the why, if anyone happens across this.

Python: deferToThread XMLRPC Server - Twisted - Cherrypy?

This question is related to others I have asked on here, mainly regarding sorting huge sets of data in memory.
Basically this is what I want / have:
Twisted XMLRPC server running. This server keeps several (32) instances of the Foo class in memory. Each Foo instance contains a list bar (which will contain several million records). There is a service that retrieves data from a database and passes it to the XMLRPC server. The data is basically a dictionary, with keys corresponding to each Foo instance, and values that are a list of dictionaries, like so:
data = {'foo1':[{'k1':'v1', 'k2':'v2'}, {'k1':'v1', 'k2':'v2'}], 'foo2':...}
Each Foo instance is then passed the value corresponding to its key, and the Foo.bar lists are updated and sorted.
class XMLRPCController(xmlrpc.XMLRPC):
    def __init__(self):
        ...
        self.foos = {'foo1': Foo(), 'foo2': Foo(), 'foo3': Foo()}
        ...

    def update(self, data):
        for k, v in data.items():
            threads.deferToThread(self.foos[k].processData, v)

    def getData(self, fookey):
        # return the first 10 records of the specified Foo.bar
        return self.foos[fookey].bar[0:10]

class Foo():
    def __init__(self):
        self.bar = []

    def processData(self, new_bar_data):
        for record in new_bar_data:
            # do processing, and add record, then sort
            # BUNCH OF PROCESSING CODE
            self.bar.sort(reverse=True)
The problem is that when the update function is called on the XMLRPCController with a lot of records (say 100K+), it stops responding to my getData calls until all 32 Foo instances have completed their processData method. I thought deferToThread would help, but I think I am misunderstanding where the problem is.
Any suggestions? I am open to using something else, like Cherrypy, if it supports the required behavior.
EDIT
@Troy: This is how the reactor is set up:
reactor.listenTCP(port_no, server.Site(XMLRPCController()))
reactor.run()
As far as the GIL goes, would it be a viable option to change the sys.setcheckinterval() value to something smaller, so that the lock on the data is released more often and it can be read?
The easiest way to make the app responsive is to break the CPU-intensive processing into smaller chunks, letting the Twisted reactor run in between, for example by calling reactor.callLater(0, process_next_chunk) to advance to the next chunk. You are effectively implementing cooperative multitasking yourself.
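A rough sketch of that idea, with illustrative names (process_record stands in for the per-record work in the question):

from twisted.internet import reactor

def process_in_chunks(foo, records, chunk_size=1000):
    def process_next_chunk(start=0):
        for record in records[start:start + chunk_size]:
            process_record(foo, record)  # hypothetical per-record work
        if start + chunk_size < len(records):
            # Yield to the reactor so getData calls can be served in between.
            reactor.callLater(0, process_next_chunk, start + chunk_size)
        else:
            foo.bar.sort(reverse=True)  # sort once at the end
    process_next_chunk()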
Another way would be to use separate processes to do the work; then you will benefit from multiple cores. Take a look at Ampoule (https://launchpad.net/ampoule), which provides an API similar to deferToThread.
I don't know how long your processData method runs or how you're setting up your Twisted reactor. By default, the Twisted reactor has a thread pool of between 0 and 10 threads. You may be trying to defer as many as 32 long-running calculations to as many as 10 threads. This is sub-optimal.
You also need to ask what role the GIL is playing in updating all these collections.
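If you do stay with deferToThread, the pool size can be raised explicitly, though the GIL will still serialise pure-Python CPU work across those threads:

from twisted.internet import reactor

# Allow up to 32 concurrent deferToThread calls instead of the default cap of 10.
reactor.suggestThreadPoolSize(32)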
Edit:
Before you make any serious changes to your program (like calling sys.setcheckinterval()) you should probably run it using the profiler or the python trace module. These should tell you what methods are using all your time. Without the right information, you can't make the right changes.
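For example, one lightweight way to profile a single suspect call with the standard library (controller and data are illustrative names):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
controller.update(data)  # the call you want to measure
profiler.disable()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)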
