Django and thread safety

Django and thread safety - python

If I read the django documentation, only the documents about the template tags mention the potential danger of thread safety.
However, I'm curious what kind of things I have to do / avoid to write thread-safe code in Django...
One example is, I have the following function to config the loggers used in django.
_LOGGER_CONFIGURED = False
def config_logger():
global _LOGGER_CONFIGURED
if _LOGGER_CONFIGURED: return
_LOGGER_CONFIGURED = True
rootlogger = logging.getLogger('')
stderr_handler = StreamHandler(sys.stderr)
rootlogger.addHandler(stderr_handler)
and at the end of my root urlconf, i have the following function call:
config_logger()
The question is:
Is this code threadsafe?
What kind of variables are shared between the django threads?

There's not really a whole lot you can do about django templates and their issues with threading, besides not using them, or at least not using the tags that are sensitive to threading issues. There aren't many django template tags that do have issues, only the stateful ones do, such as cycle.
In the example you have given, you are not doing anything about thread safety, and you shouldn't be anyway: the logging module is already perfectly thread safe, so long as you use it in the normal way, which is to call logging.getLogger in the modules that need it, and LOGGING or LOGGING_CONFIG is set appropriately in your settings.py. No need to be clever with this.
other things you might be concerned about are database integrity in the face of concurrent updates. Don't be, if you are using PostgreSQL or MySQL/INNOdb databases, then you are completely protected from concurrency shennanegans.

Related

Is it possible to create a Tornado application from instances of Request/Web Handlers instead of class definitions?

The apparent requirement to provide class definitions instead of instances causes very difficult problems. I have two different classes and one of them needs a reference to the other
app = tornado.web.Application([
(r"/fusion.*", FusionListener),
(r"/admin.*", AdminListener),
])
. The AdminListener needs a reference to the FusionListener since there are internal items needing to be managed. Sending messages is an unacceptable additional complexity here. The current mechanism does not seem to afford that possibility.
What kind of pattern can get around this shortcoming in Tornado?

For my use-case there are both persistent and in-memory state. We have spark and postgres repositories for the former. For the latter I had already designed and written the application to have instance-level data structures. But I have gathered that instance attributes on Tornado launched RequestHandler / WebHandler subclasses are not persistent.
The latter wants to live in a class managing the state: but I am compelled to significantly redraw the boundaries due to this design ot Tornado. Instead it will be necessary to push everything to global variables. Few would argue this were a preferred design. I will be dumping tornado as soon as I can get the time.
Not sure what will be the alternative: I already reverted from cherrypy due to significant limitations of its own: here are a couple of my questions on it
404 for path served by Cherrypy
How to specify the listening server instances using cherrypy tree.mount?
I got through those with some scars but still in one piece. There were additional issues that knocked me out: url's were not being served, and there was no clear end to the mole whacking. It also generally does not get a lot of attention and has confusing outdated or incomplete documentation. There is plenty of docs - that's why I got started on it: but the holes make for a series of rabbit-chasing episodes.
Flask and django have their own issues. It seems finding a functionally adequate but not super heavy weight web server in python is an illusory target. Not certain yet which framework has the least gotchas.

Posting this as answer in order to benefit from proper code formatting.
The paradigm I used for keeping track of existing instances of a RequestHandler is very simple:
class MyHandler(RequestHandler):
_instances = set()
def get(self):
if needs_to_be_added(self.request): # some conditions can be used here
if len(MyHandler._instances) > THRESHOLD: # careful with memory usage
return self.finish("some_error")
MyHandler._instances.add(self)
...
def post(self):
if needs_to_be_removed(self.request):
MyHandler._instances.discard(self)
...
Of course you might need to change when to add / discard elements.
Depending on how you want to refer to existing instances in the future (by some key for example) you could use a dict for keeping track of them.
I don't think you can use something weak references (as in classes from the weakref module) because those will only track live instances which won't work with the way request handlers instances are created and destroyed.

Large(ish) django application architecture

How does one properly structure a larger django website such as to retain testability and maintainability?
In the best django spirit (I hope) we started out by not caring too much about decoupling between different parts of our website. We did separate it into different apps, but those depend rather directly upon each other, through common use of model classes and direct method calls.
This is getting quite entangled. For example, one of our actions/services looks like this:
def do_apply_for_flat(user, flat, bid_amount):
assert can_apply(user, flat)
application = Application.objects.create(
user=user, flat=flat, amount=bid_amount,
status=Application.STATUS_ACTIVE)
events.logger.application_added(application)
mails.send_applicant_application_added(application)
mails.send_lessor_application_received(application)
return application
The function does not only perform the actual business process, no, it also handles event logging and sending mails to the involved users. I don't think there's something inherently wrong with this approach. Yet, it's getting more and more difficult to properly reason about the code and even test the application, as it's getting harder to separate parts intellectually and programmatically.
So, my question is, how do the big boys structure their applications such that:
Different parts of the application can be tested in isolation
Testing stays fast by only enabling parts that you really need for a specific test
Code coupling is reduced
My take on the problem would be to introduce a centralized signal hub (just a bunch of django signals in a single python file) which the single django apps may publish or subscribe to. The above example function would publish an application_added event, which the mails and events apps would listen to. Then, for efficient testing, I would disconnect the parts I don't need. This also increases decoupling considerably, as services don't need to know about sending mails at all.
But, I'm unsure, and thus very interested in what's the accepted practice for these kind of problems.

For testing, you should mock your dependencies. The logging and mailing component, for example, should be mocked during unit testing of the views. I would usually use python-mock, this allows your views to be tested independently of the logging and mailing component, and vice versa. Just assert that your views are calling the right service calls and mock the return value/side effect of the service call.
You should also avoid touching the database when doing tests. Instead try to use as much in memory objects as possible, instead of Application.objects.create(), defer the save() to the caller, so that you can test the services without having to actually have the Application in the database. Alternatively, patch out the save() method, so it won't actually save, but that's much more tedious.

Transfer some parts of your app to different microservices. This will make some parts of your app focused on doing one or two things right (e.g. event logging, emails). Code coupling is also reduced and different parts of the site can be tested in isolation as well.
The microservice architecture style involves developing a single application as a collection of smaller services that communicates usually via an API.
You might need to use a smaller framework like Flask.
Resources:
For more information on microservices click here:
http://martinfowler.com/articles/microservices.html
http://aurelavramescu.blogspot.com/2014/06/user-microservice-python-way.html

First, try to brake down your big task into smaller classes. Connect them with usual method calls or Django signals.
If you feel that the sub-tasks are independent enough, you can implement them as several Django applications in the same project. See the Django tutorial, which describes relation between applications and projects.

How To Run Arbitrary Code After Django is "Fully Loaded"

I need to perform some fairly simple tasks after my Django environment has been "fully loaded".
More specifically I need to do things like Signal.disconnect() some Django Signals that are setup by my third party library by default and connect my own Signals and I need to do some "monkey patching" to add convenience functions to some Django models from another library.
I've been doing this stuff in my Django app's __init__.py file, which seems to work fine for the monkey patching, but doesn't work for my Signal disconnecting. The problem appears to be one of timing--for whatever reason the Third Party Library always seems to call its Signal.connect() after I try to Signal.disconnect() it.
So two questions:
Do I have any guarantee based on the order of my INSTALLED_APPS the order of when my app's __init__.py module is loaded?
Is there a proper place to put logic that needs to run after Django apps have been fully loaded into memory?

In Django 1.7 Apps can implement the ready() method: https://docs.djangoproject.com/en/dev/ref/applications/#django.apps.AppConfig.ready

My question is a more poorly phrased duplicate of this question: Where To Put Django Startup Code. The answer comes from that question:
Write middleware that does this in init and afterwards raise django.core.exceptions.MiddlewareNotUsed from the init, django will remove it for all requests...
See the Django documentation on writing your own middleware.

I had to do the following monkey patching. I use django 1.5 from github branch. I don't know if that's the proper way to do it, but it works for me.
I couldn't use middleware, because i also wanted the manage.py scripts to be affected.
anyway, here's this rather simple patch:
import django
from django.db.models.loading import AppCache
django_apps_loaded = django.dispatch.Signal()
def populate_with_signal(cls):
ret = cls._populate_orig()
if cls.app_cache_ready():
if not hasattr(cls, '__signal_sent'):
cls.__signal_sent = True
django_apps_loaded.send(sender=None)
return ret
if not hasattr(AppCache, '_populate_orig'):
AppCache._populate_orig = AppCache._populate
AppCache._populate = populate_with_signal
and then you could use this signal like any other:
def django_apps_loaded_receiver(sender, *args, **kwargs):
# put your code here.
django_apps_loaded.connect(django_apps_loaded_receiver)

As far as I know there's no such thing as "fully loaded". Plenty of Django functions include import something right in the function. Those imports will only happen if you actually invoke that function. The only way to do what you want would be to explicitly import the things you want to patch (which you should be able to do anywhere) and then patch them. Thereafter any other imports will re-use them.

What is so bad with threadlocals

Everybody in Django world seems to hate threadlocals(http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this(http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on threadlocals is bad because it is inelegant.
I have a scenario where theadlocals will make things significantly easier. (I have a app where people will have subdomains, so all the models need to have access to the current subdomain, and passing them from requests is not worth it, if the only problem with threadlocals is that they are inelegant, or make for brittle code.)
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?

I avoid this sort of usage of threadlocals, because it introduces an implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc). If I access some threadlocals data in models.py, now I have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access threadlocals and call the model methods with the needed data. This way the models.py remains self-contained and models can be used without the threadlocals coupling.

I don't think there is anything wrong with threadlocals - yes, it is a global variable, but besides that it's a normal tool. We use it just for this purpose (storing subdomain model in the context global to the current request from middleware) and it works perfectly.
So I say, use the right tool for the job, in this case threadlocals make your app much more elegant than passing subdomain model around in all the model methods (not mentioning the fact that it is even not always possible - when you are overriding django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example - so threadlocals is the natural and only answer).

Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and or multi-processor CPUs, but also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.

You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, eg. having one database connection for each thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from request not worth it? Why don't you store it in session or user profile?
There difference with Java is that web development there is much more stateful than in Python/PERL/PHP/Ruby world so people are used to all kind of contexts and stuff like that. I don't think that is an advantage, but it does seem like it at the beginning.

I have found using ThreadLocal is an excellent way to implement Dependency Injection in a HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the MB of Spring Jars (not to mention its learning curve) and without all the cryptic repetitive #annotation nonsense and because it doesn't individually inject many object instances with the dependencies it's probably a heck of a lot faster and uses less memory.
It worked so well we opened sourced our exPOJO Filter that can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com

Static methods and thread safety

In python with all this idea of "Everything is an object" where is thread-safety?
I am developing django website with wsgi. Also it would work in linux, and as I know they use effective process management, so we could not think about thread-safety alot. I am not doubt in how module loads, and there functions are static or not? Every information would be helpfull.

Functions in a module are equivalent to static methods in a class. The issue of thread safety arises when multiple threads may be modifying shared data, or even one thread may be modifying such data while others are reading it; it's best avoided by making data be owned by ONE module (accessed via Queue.Queue from others), but when that's not feasible you have to resort to locking and other, more complex, synchronization primitives.
This applies whether the access to shared data happens in module functions, static methods, or instance methods -- and the shared data is such whether it's instance variables, class ones, or global ones (scoping and thread safety are essentially disjoint, except that function-local data is, to a pont, intrinsically thread-safe -- no other thread will ever see the data inside a function instance, until and unless the function deliberately "shares" it through shared containers).
If you use the multiprocessing module in Python's standard library, instead of the threading module, you may in fact not have to care about "thread safety" -- essentially because NO data is shared among processes... well, unless you go out of your way to change that, e.g. via mmapped files;-).

See the python documentation to better understand the general thread safety implications of Python.
Django itself seems to be thread safe as of 1.0.3, but your code may not and you will have to verify that...
My advice would be to simply don't care about that and serve your application with multiple processes instead of multiple threads (for example by using apache 'prefork' instead of 'worker' MPM).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Django and thread safety - python

Related

Is it possible to create a Tornado application from instances of Request/Web Handlers instead of class definitions?

Large(ish) django application architecture

How To Run Arbitrary Code After Django is "Fully Loaded"

What is so bad with threadlocals

Static methods and thread safety

Categories

Resources