How does one properly structure a larger django website such as to retain testability and maintainability?
In the best django spirit (I hope) we started out by not caring too much about decoupling between different parts of our website. We did separate it into different apps, but those depend rather directly upon each other, through common use of model classes and direct method calls.
This is getting quite entangled. For example, one of our actions/services looks like this:
def do_apply_for_flat(user, flat, bid_amount):
    assert can_apply(user, flat)
    application = Application.objects.create(
        user=user, flat=flat, amount=bid_amount,
        status=Application.STATUS_ACTIVE)
    events.logger.application_added(application)
    mails.send_applicant_application_added(application)
    mails.send_lessor_application_received(application)
    return application
The function does not only perform the actual business process; it also handles event logging and sending mails to the involved users. I don't think there's anything inherently wrong with this approach. Yet, it's getting more and more difficult to properly reason about the code and even to test the application, as it's getting harder to separate the parts intellectually and programmatically.
So, my question is, how do the big boys structure their applications such that:
Different parts of the application can be tested in isolation
Testing stays fast by only enabling parts that you really need for a specific test
Code coupling is reduced
My take on the problem would be to introduce a centralized signal hub (just a bunch of django signals in a single python file) which the single django apps may publish or subscribe to. The above example function would publish an application_added event, which the mails and events apps would listen to. Then, for efficient testing, I would disconnect the parts I don't need. This also increases decoupling considerably, as services don't need to know about sending mails at all.
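A minimal sketch of such a hub, assuming a single shared signals.py module; the application_added signal and the receiver below are illustrative names, not existing code:

# signals.py -- the central hub; every app imports its signals from here
import django.dispatch

application_added = django.dispatch.Signal()

# services.py -- the business process only publishes the event
def do_apply_for_flat(user, flat, bid_amount):
    assert can_apply(user, flat)
    application = Application.objects.create(
        user=user, flat=flat, amount=bid_amount,
        status=Application.STATUS_ACTIVE)
    application_added.send(sender=do_apply_for_flat, application=application)
    return application

# mails/receivers.py -- subscribers attach themselves; a test that doesn't need
# mails simply never connects (or explicitly disconnects) this receiver
def on_application_added(sender, application, **kwargs):
    mails.send_applicant_application_added(application)
    mails.send_lessor_application_received(application)

application_added.connect(on_application_added)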
But I'm unsure, and thus very interested in what the accepted practice is for this kind of problem.
For testing, you should mock your dependencies. The logging and mailing components, for example, should be mocked during unit testing of the views. I would usually use python-mock; this allows your views to be tested independently of the logging and mailing components, and vice versa. Just assert that your views make the right service calls, and mock the return value/side effect of each service call.
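A minimal sketch of that with unittest.mock; the myapp.services module path and the factory helpers are assumptions made for the example:

from unittest import mock

def test_apply_notifies_everyone():
    user, flat = make_user(), make_flat()   # hypothetical factories
    with mock.patch("myapp.services.mails") as mails, \
         mock.patch("myapp.services.events") as events:
        application = do_apply_for_flat(user, flat, bid_amount=100)
    mails.send_applicant_application_added.assert_called_once_with(application)
    mails.send_lessor_application_received.assert_called_once_with(application)
    events.logger.application_added.assert_called_once_with(application)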
You should also avoid touching the database when doing tests. Instead, try to use as many in-memory objects as possible: instead of Application.objects.create(), defer the save() to the caller, so that you can test the services without actually having the Application in the database. Alternatively, patch out the save() method so it won't actually save, but that's much more tedious.
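For instance, a sketch of the defer-the-save() idea; build_application is an invented name, not part of the original code:

def build_application(user, flat, bid_amount):
    # no .objects.create() here -- nothing touches the database
    return Application(user=user, flat=flat, amount=bid_amount,
                       status=Application.STATUS_ACTIVE)

# in a test: works on a purely in-memory object
application = build_application(user, flat, bid_amount=100)
assert application.status == Application.STATUS_ACTIVE

# in the calling code: persist explicitly
application = build_application(user, flat, bid_amount)
application.save()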
Transfer some parts of your app to different microservices. This will make some parts of your app focused on doing one or two things right (e.g. event logging, emails). Code coupling is also reduced and different parts of the site can be tested in isolation as well.
The microservice architecture style involves developing a single application as a collection of smaller services that communicate with each other, usually via an API.
You might need to use a smaller framework like Flask.
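Purely as an illustration, a tiny notification service in Flask could be as small as this; the endpoint, payload, and queue_notification helper are made up for the sketch:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/notifications", methods=["POST"])
def create_notification():
    payload = request.get_json()          # e.g. {"application_id": 42, "type": "added"}
    queue_notification(payload)           # hypothetical: hand off to mailer / event logger
    return jsonify({"status": "accepted"}), 202

if __name__ == "__main__":
    app.run(port=5001)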
Resources:
For more information on microservices, see:
http://martinfowler.com/articles/microservices.html
http://aurelavramescu.blogspot.com/2014/06/user-microservice-python-way.html
First, try to break down your big task into smaller classes. Connect them with ordinary method calls or Django signals.
If you feel that the sub-tasks are independent enough, you can implement them as several Django applications in the same project. See the Django tutorial, which describes relation between applications and projects.
I've found an example: https://medium.com/velotio-perspectives/a-comprehensive-tutorial-to-implementing-opentracing-with-jaeger-a01752e1a8ce
I have a pretty large codebase and I really don't want to modify every function by adding a line like 'with tracer.start_span('booking') as span:'. Is there any way to avoid that?
Thanks in advance.
Jaeger is a distributed tracer, inspired by Google's Dapper paper, and so it is mainly used for tracing communication between different processes in a microservices / distributed system architecture, not so much for portions of code inside an application.
The way Jaeger is introduced into most applications is to integrate it into the part of the application that is receiving requests from the network. For example, if your Python application is receiving HTTP requests using Django or Flask, or other types of requests (e.g. gRPC) using some other framework, there will probably be a project somewhere on the internet that lets you hook Jaeger into your framework with a couple of lines of code. For the most popular frameworks, the Jaeger docs point to opentracing-contrib as a good source for these "client libraries".
While making extra tracing calls inside an application is not unheard of or discouraged with distributed tracers, it's not something that tends to happen a lot, because distributed tracers are typically used in microservices environments where the interactions between components matter more than what's happening inside each component.
If you do want to create tracing records inside an application, then it would be very unusual to do tracing of every single function. Instead, tracing inside an application would typically be done at the boundary of components in a modular monolith, i.e. when one component calls into another component.
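If you do end up tracing those boundaries, one low-touch option is a decorator, so that function bodies stay untouched; a sketch using the OpenTracing span from the question (the traced function below is hypothetical):

import functools
import opentracing

def traced(operation_name):
    """Wrap a component-boundary function in a span without editing its body."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with opentracing.tracer.start_span(operation_name):
                return func(*args, **kwargs)
        return wrapper
    return decorator

@traced("booking")
def create_booking(request_data):   # hypothetical component entry point
    ...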
Lastly, if what you really want is performance analysis of your single Python application at the level of each function, and you don't care about its interaction with other applications in your system (maybe you only have the one?), then Jaeger is probably not the right tool. In that case, you would probably want to look for an Application Performance Monitoring (APM) tool that works with Python and suits your needs.
The apparent requirement to provide class definitions instead of instances causes very difficult problems. I have two different classes, and one of them needs a reference to the other:
app = tornado.web.Application([
    (r"/fusion.*", FusionListener),
    (r"/admin.*", AdminListener),
])
The AdminListener needs a reference to the FusionListener, since there are internal items that need to be managed. Sending messages is an unacceptable additional complexity here. The current mechanism does not seem to afford that possibility.
What kind of pattern can get around this shortcoming in Tornado?
For my use-case there is both persistent and in-memory state. We have Spark and Postgres repositories for the former. For the latter I had already designed and written the application to have instance-level data structures. But I have gathered that instance attributes on Tornado-launched RequestHandler / WebHandler subclasses do not persist across requests.
The latter wants to live in a class that manages the state, but I am compelled to significantly redraw the boundaries due to this design of Tornado. Instead it will be necessary to push everything into global variables. Few would argue that this is a preferred design. I will be dumping Tornado as soon as I can find the time.
Not sure what the alternative will be: I already moved away from CherryPy due to significant limitations of its own; here are a couple of my questions on it:
404 for path served by Cherrypy
How to specify the listening server instances using cherrypy tree.mount?
I got through those with some scars but still in one piece. There were additional issues that knocked me out: URLs were not being served, and there was no clear end to the mole-whacking. CherryPy also generally does not get a lot of attention and has confusing, outdated, or incomplete documentation. There are plenty of docs - that's why I got started on it - but the holes make for a series of rabbit-chasing episodes.
Flask and Django have their own issues. It seems that finding a functionally adequate but not super heavyweight web server in Python is an illusory target. Not certain yet which framework has the fewest gotchas.
Posting this as answer in order to benefit from proper code formatting.
The paradigm I used for keeping track of existing instances of a RequestHandler is very simple:
class MyHandler(RequestHandler):
    _instances = set()

    def get(self):
        if needs_to_be_added(self.request):  # some conditions can be used here
            if len(MyHandler._instances) > THRESHOLD:  # careful with memory usage
                return self.finish("some_error")
            MyHandler._instances.add(self)
        ...

    def post(self):
        if needs_to_be_removed(self.request):
            MyHandler._instances.discard(self)
        ...
Of course you might need to change when to add / discard elements.
Depending on how you want to refer to existing instances in the future (by some key for example) you could use a dict for keeping track of them.
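For example, a sketch of the dict variant; the client_id request argument is just an assumed way of identifying an instance:

class MyHandler(RequestHandler):
    _instances = {}

    def get(self):
        key = self.get_argument("client_id")      # hypothetical lookup key
        MyHandler._instances[key] = self
        ...

    def post(self):
        key = self.get_argument("client_id")
        MyHandler._instances.pop(key, None)       # forget the instance when done
        ...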
I don't think you can use something like weak references (as in the classes from the weakref module), because those only track live instances, which won't work with the way request handler instances are created and destroyed.
I am trying to run a large number of sites which share about 90% of their code. They are simply designed to query an API and return the results. They will have a common userbase / database but will be configured slightly differently and will have different CSS (perhaps even different templating).
My initial idea was to run them as separate applications with a common library, but I have read about the sites framework, which would allow them to run from a single instance of Django and may help to reduce memory usage.
https://docs.djangoproject.com/en/dev/ref/contrib/sites/
My question is: is the sites framework the right approach to a problem like this, and does it have real benefits over running separate applications? Initially I thought it was, but now I think otherwise. I have heard the following:
Your SITE_ID is set in settings.py, so in order to have multiple sites, you need multiple settings.py configurations, which means multiple distinct processes/instances. You can of course share the code base between them, but each site will need a dedicated worker / WSGIDaemon to serve the site.
This effectively removes any benefit of running multiple sites under one hood, if each site needs a uWSGI instance running.
Alternative ideas of systems:
https://github.com/iivvoo/django_layers
https://github.com/shestera/django-multisite
http://www.huyng.com/posts/franchising-running-multiple-sites-from-one-django-codebase/
I don't know what route to be taking with this.
IMHO it comes down to what degree of change is possible, what the impact is, and how likely it is to happen. For example:
They will have a common userbase / database
Are you saying the same people use all the sites? If so then the risk profile will be less severe than if it was different people (say different organizations). Basically (through good appropriate architecture) you want to be de-coupling things so that when one thing changes it doesn't have a massive impact on everything else.
If you run off the same instance, then it's easy to update every site at once (say you need to apply a maintenance patch to the base system), but on the other hand that can bite you: one group of users is happy to have the change but others aren't, either because of the functional change or because of the downtime needed to apply the patch, for example.
Running the same code-base but in different instances is a larger maintenance overhead but removes a lot of risk associated with managing change; the conversation then becomes one of how to most efficiently maintain many instances of the same thing, rather than mapping risk associated with each time you make a change.
Actually, you can run those 2 (or more) sites under the same WSGI instance.
Depending on your version of Django and the features you need there are some drawbacks (like using threadlocals) but all in all those 2 solutions work pretty well.
Django 1.8+: https://bitbucket.org/levit_scs/airavata
Django <= 1.7: https://bitbucket.org/uysrc/django-dynamicsites/overview (but it will probably require some fiddling depending on your version of Django)
What those 2 applications add compared to Django sites framework is the ability to easily serve sites on the same instance according to the domain name.
I am now working on a big backend system for a real-time and history tracking web service.
I am highly experienced in Python and intend to use it with sqlalchemy (MySQL) to develop the backend.
I don't have any major experience developing robust and sustainable backend systems, and I was wondering if you could point me to some documentation / books about backend design patterns. I basically need to feed a database by querying different services (over HTML / SOAP / JSON) in real time, and to keep a history of that data.
Thanks!
Can you define "backend" more precisely? Normally, in web dev, I follow an MVC-ish structure where my "front-end" (html/css/js and the code dealing with displaying them) is loosely coupled with my "backend" model (business objects and data persistence, i.e. the database).
I like Django's Model/View/Template approach:
http://docs.djangoproject.com/en/dev/faq/general/#django-appears-to-be-a-mvc-framework-but-you-call-the-controller-the-view-and-the-view-the-template-how-come-you-don-t-use-the-standard-names
But you haven't really defined what you mean by "backend", so it's hard to give advice on design patterns. You said you are experienced in Python; have you ever developed a database-driven web application before?
Update
Based on your comment, I won't be able to help much as I don't have much experience doing "backends" like that. However, seeing as how you are pulling in resources from the web, your latency/throughput is going to be pretty high. So, in order to increase overall effectiveness, you are going to want to have something that can run multiple threads or processes with pretty high concurrency. I suggest you check out the answers on this thread (and search for similar ones):
Concurrent downloads - Python
Specifically, I found the example for the recursive web server and the example following it to probably be a very good start on your solution:
http://eventlet.net/doc/examples.html#recursive-web-crawler
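If eventlet isn't a hard requirement, the same idea can be sketched with the standard library's concurrent.futures instead; the URLs and the store_result helper are placeholders:

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

URLS = ["http://example.com/service-a", "http://example.com/service-b"]  # placeholders

def fetch(url):
    with urlopen(url, timeout=10) as response:
        return url, response.read()

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch, url) for url in URLS]
    for future in as_completed(futures):
        url, body = future.result()
        store_result(url, body)          # hypothetical: parse and write to the database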
As for taking that idea and then turning it into a robust/continuous process, that's going to depend a lot on your platform and how well you do error handling. Basically:
run it in a loop and make sure you handle any error that can possibly be thrown
have some kind of process monitor watching your worker process to kill/restart it if it hangs or dies
make sure you have a monitoring solution to notify you if it stops working (nagios, etc.)
One of the best ways to keep things "robust" is to make them as simple (not simplistic) as possible. If all you are doing is pulling in info from the web, parsing it in some way, and then storing that info in a DB, then try to keep the process that simple. Don't add unnecessary complexity in an effort to make it more robust. If you end up with a 200 line script that does what you want, great!
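A bare-bones sketch of the "run it in a loop and handle errors" advice above; do_one_pass and the sleep interval are placeholders:

import logging
import time

def main():
    while True:
        try:
            do_one_pass()          # hypothetical: fetch, parse, store one batch
        except Exception:
            # never let one bad batch kill the worker; log it and carry on
            logging.exception("pass failed")
        time.sleep(60)             # placeholder polling interval

if __name__ == "__main__":
    main()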
Use Apache, Django and Piston.
Use REST as the protocol.
Write as little code as possible.
Django models, forms, and admin interface.
Piston wrappers for your resources.
I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions, but the only relevant thing I've found is POSH. However, POSH doesn't seem to be widely used, and I don't feel comfortable integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
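A minimal sketch of what the shared ctypes objects look like; note it only helps for processes spawned via multiprocessing, which is exactly the caveat above:

from multiprocessing import Array, Process, Value

def worker(counter, scores):
    with counter.get_lock():          # shared ctypes come with an optional lock
        counter.value += 1
    scores[0] = 0.99                  # visible to the parent and sibling processes

if __name__ == "__main__":
    counter = Value("i", 0)           # shared int
    scores = Array("d", [0.0] * 10)   # shared array of doubles
    procs = [Process(target=worker, args=(counter, scores)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, scores[0])   # -> 4 0.99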
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Another possibility could be an in-memory SQLite db. Being in-memory, it may speed things up a bit, but you would still have to do the serialization and all.
Note: an in-memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not a common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or while the module containing the object is being loaded; multi-processing will do fine for you.
If you need to change the object and access it from every thread, you need to make sure your object is thread safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.
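A minimal sketch of that lock-protected, single-process setup; the SharedState class and its per-user layout are placeholders, not a prescribed design:

import threading

class SharedState:
    """One instance per process, shared by all request-handling threads."""
    def __init__(self):
        self._lock = threading.Lock()
        self._per_user = {}                      # placeholder: user_id -> algorithm state

    def update(self, user_id, **changes):
        with self._lock:                         # serialize writers
            self._per_user.setdefault(user_id, {}).update(changes)

    def get(self, user_id):
        with self._lock:
            return dict(self._per_user.get(user_id, {}))

shared_state = SharedState()                     # module-level global, created once per process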
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like, if the serialisation is the bottleneck, the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the Reddit guys discuss what they use for state, so that may be useful to listen to.