How to keep a shared functional object in memory in Django? - python

I have an object that wraps some Active Directory functions which are used quite frequently in my codebase. I have a convenience function to create it, but each time it is creating an SSL connection which is slow and inefficient. The way I can improve this in some places is to pass it to functions in a loop but this is not always convenient.
The class is state-free so it's thread-safe and could be shared within each Django instance. It should maintain its AD connection for at least a few minutes, and ideally not longer than an hour. There are also other non-AD objects I would like to do the same with.
I have used the various cache types, including in-memory; is it appropriate to use these for functional objects? I thought they were only meant for (serializable) data.
Alternatively: is there a Django-suitable pattern for service locators or connection pooling like you often see in Java apps?
Thanks,
Joel

I have found a solution that appears to work well, which is simply a Python feature that is similar to a static variable in Java.
import logging

from django.conf import settings
# CompanyADService is the Active Directory wrapper class described above,
# imported from wherever it lives in the project.

logger = logging.getLogger(__name__)

def get_ad_service():
    if "ad_service" not in get_ad_service.__dict__:
        logger.debug("Creating AD service")
        get_ad_service.ad_service = CompanyADService(settings.LDAP_SERVER_URL,
                                                     settings.LDAP_USER_DN,
                                                     settings.LDAP_PASSWORD)
        logger.debug("Created AD service")
    else:
        logger.debug("Returning existing AD service")
    return get_ad_service.ad_service
My code already calls this function to get an instance of the AD service so I don't have to do anything further to make it persistent.
I found this and similar solutions here: What is the Python equivalent of static variables inside a function?
Happy to hear alternatives :)
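One alternative that would also cover the "at least a few minutes, not longer than an hour" requirement from the question is a plain module-level cache with a timestamp. This is only a minimal sketch, assuming the same CompanyADService constructor and Django settings as above; module-level globals are per-process, which matches "shared within each Django instance":

import time
import logging

from django.conf import settings
# CompanyADService is the wrapper class from the question.

logger = logging.getLogger(__name__)

_ad_service = None
_ad_service_created = 0.0
AD_SERVICE_MAX_AGE = 60 * 60  # seconds; rebuild the connection after an hour

def get_ad_service():
    global _ad_service, _ad_service_created
    if _ad_service is None or time.time() - _ad_service_created > AD_SERVICE_MAX_AGE:
        logger.debug("Creating AD service")
        _ad_service = CompanyADService(settings.LDAP_SERVER_URL,
                                       settings.LDAP_USER_DN,
                                       settings.LDAP_PASSWORD)
        _ad_service_created = time.time()
    return _ad_service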

Related

Is it possible to create a Tornado application from instances of Request/Web Handlers instead of class definitions?

The apparent requirement to provide class definitions instead of instances causes very difficult problems. I have two different classes and one of them needs a reference to the other:
app = tornado.web.Application([
    (r"/fusion.*", FusionListener),
    (r"/admin.*", AdminListener),
])
The AdminListener needs a reference to the FusionListener since there are internal items needing to be managed. Sending messages is an unacceptable additional complexity here. The current mechanism does not seem to afford that possibility.
What kind of pattern can get around this shortcoming in Tornado?
For my use case there is both persistent and in-memory state. We have Spark and Postgres repositories for the former. For the latter I had already designed and written the application to have instance-level data structures. But I have gathered that instance attributes on Tornado-launched RequestHandler / WebHandler subclasses are not persistent.
The latter wants to live in a class managing the state, but I am compelled to significantly redraw the boundaries due to this design of Tornado. Instead it will be necessary to push everything to global variables. Few would argue that is a preferred design. I will be dumping Tornado as soon as I can get the time.
Not sure what will be the alternative: I already reverted from cherrypy due to significant limitations of its own: here are a couple of my questions on it
404 for path served by Cherrypy
How to specify the listening server instances using cherrypy tree.mount?
I got through those with some scars but still in one piece. There were additional issues that knocked me out: URLs were not being served, and there was no clear end to the mole-whacking. It also generally does not get a lot of attention and has confusing, outdated, or incomplete documentation. There are plenty of docs - that's why I got started on it - but the holes make for a series of rabbit-chasing episodes.
Flask and Django have their own issues. It seems finding a functionally adequate but not super-heavyweight web framework in Python is an illusory target. Not certain yet which framework has the fewest gotchas.
Posting this as an answer in order to benefit from proper code formatting.
The paradigm I used for keeping track of existing instances of a RequestHandler is very simple:
from tornado.web import RequestHandler

class MyHandler(RequestHandler):
    _instances = set()

    def get(self):
        if needs_to_be_added(self.request):  # some condition can be used here
            if len(MyHandler._instances) > THRESHOLD:  # careful with memory usage
                return self.finish("some_error")
            MyHandler._instances.add(self)
        ...

    def post(self):
        if needs_to_be_removed(self.request):
            MyHandler._instances.discard(self)
        ...
Of course you might need to change when to add / discard elements.
Depending on how you want to refer to existing instances in the future (by some key for example) you could use a dict for keeping track of them.
I don't think you can use weak references (as in classes from the weakref module) because those will only track live instances, which won't work with the way request handler instances are created and destroyed.
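For example, a minimal sketch of the dict variant; the client_id query argument used as the key here is made up, pick whatever actually identifies the instance in your application:

import tornado.web

class MyHandler(tornado.web.RequestHandler):
    _instances = {}  # key -> handler instance

    def get(self):
        key = self.get_argument("client_id", None)  # hypothetical key
        if key is not None:
            MyHandler._instances[key] = self
        # later, other code can look up MyHandler._instances.get(key)

    def post(self):
        key = self.get_argument("client_id", None)
        MyHandler._instances.pop(key, None)  # discard by key when no longer needed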

Static Objects in Python

I have a complex set of classes that I call through a REST API. When I make calls through the REST interface, the class objects are created and they are either persisted or they die at the end of the call. I want these objects to stay in memory so I don't have to create and kill the object every time I make the call. Any suggestions on how I can do this? A friend suggested using something like a static class, but I don't seem to find how I can achieve this in Python.
Any help will be appreciated.
Without knowing more about how your server is implemented, I suggest that you could use some kind of cache, e.g. memcached. You can use python-memcached to interface to it, or a framework-specific one such as those for Django, Bottle, Flask, et al.
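For instance, with Django's cache framework (backed by memcached or similar) the pattern is roughly the sketch below. Note that cached values must be picklable, so this suits the data your objects hold more than live connections or code objects; compute_expensive_result is a hypothetical rebuild function standing in for your own code:

from django.core.cache import cache

def get_expensive_result(key):
    result = cache.get(key)
    if result is None:
        result = compute_expensive_result(key)  # hypothetical: rebuild from scratch
        cache.set(key, result, 60 * 15)         # keep it around for 15 minutes
    return result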

Python object hierarchy, and REST resources?

I am currently running into an "architectural" problem in my Python app (using Twisted) that uses a REST api, and I am looking for feedback.
Warning! Long post ahead!
Let's assume the following object hierarchy:
class Device(object):
    def __init__(self):
        self._driver = Driver()
        self._status = Status()
        self._tasks = TaskManager()

    def __getattr__(self, attr_name):
        if hasattr(self._tasks, attr_name):
            return getattr(self._tasks, attr_name)
        else:
            raise AttributeError(attr_name)


class Driver(object):
    def __init__(self):
        self._status = DriverStatus()

    def connect(self):
        """some code here"""

    def disconnect(self):
        """some code here"""


class DriverStatus(object):
    def __init__(self):
        self._isConnected = False
        self._isPluggedIn = False
I also have a rather deep object hierarchy (the above elements are only a sub-part of it), so right now this gives me the following resources in the REST API (I know, REST isn't about URL hierarchy but about media types, but this is for simplicity's sake):
/rest/environments
/rest/environments/{id}
/rest/environments/{id}/devices/
/rest/environments/{id}/devices/{deviceId}
/rest/environments/{id}/devices/{deviceId}/driver
/rest/environments/{id}/devices/{deviceId}/driver/driverstatus
I switched a few months back from a "dirty" SOAP-type API to REST, but I am becoming unsure about how to handle what seems like added complexity:
Proliferation of REST resources/media types: for example, instead of having just a Device resource I now have all these resources:
Device
DeviceStatus
Driver
DriverStatus
While these all make sense from a RESTful point of view, is it normal to have a lot of sub-resources that each map to a separate Python class?
Mapping a method-rich application core to a RESTful API: in REST, resources should be nouns, not verbs. Are there good rules/tips to intelligently define a set of resources from a set of methods? (The most comprehensive example I found so far seems to be this article.)
API logic influencing application structure: should an application's API logic at least partially guide some of its internal logic, or is it good practice to apply separation of concerns? I.e., should I have an intermediate layer of "resource" objects that have the job of communicating with the application core, but that do not map one-to-one to the core's classes?
How would one correctly handle the following in a RESTful way: I need to be able to display a list of available driver types (i.e. class names, not Driver instances) in the client. Would this mean creating yet another resource like "DriverTypes"?
These are rather long winded questions, so thanks for your patience, and any pointers, feedback and criticism is more than welcome !
To S.Lott:
By "too fragmented resources" what i meant was, lots of different sub resources that basically still apply to the same server side entity
For The "connection" : So that would be a modified version of the "DriverStatus" resource then ? I consider the connection to be always existing, hence the use of "PUT" , but would that be bad thing considering "PUT" should be idempotent ?
You are right about "stopping coding and rethinking", that is the reason I asked all these questions and putting things down, on paper to get a better overview.
-The thing is, right now the basic "real world objects" as you put them make sense to me as rest resources /collections of resources, and they are correctly manipulated via POST, GET, UPDATE, DELETE , but I am having a hard time getting my head around the Rest approach for things that I do not instinctively view as "Resources".
Rule 1. REST is about objects. Not methods.
The REST "resources" have become too fragmented
False. Always false. REST resources are independent. They can't be "too" fragmented.
instead of having just a Device resource I now have all these resources: Device, DeviceStatus, Driver, DriverStatus.
While these all make sense from a [RESTful] point of view, is it normal to have a lot of sub-resources that each map to a separate Python class?
Actually, they don't make sense. Hence your question.
Device is a thing. /rest/environments/{id}/devices/{deviceId}
It has status. You should consider providing the status and the device information together as a single composite document that describes a device.
Just because your relational database is normalized does not mean your RESTful objects need to be precisely as normalized as your database. While it's simpler (and many frameworks make it very, very simple to do this) it may not be meaningful.
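For example, a single composite representation for GET /rest/environments/{id}/devices/{deviceId} might look roughly like the sketch below; the field names are made up for illustration, the point is simply that the status objects travel inside the device document instead of being separate resources:

# one composite document describing a device, instead of four separate resources
device_representation = {
    "id": "deviceId",
    "status": {"isPluggedIn": True},
    "driver": {
        "name": "some-driver",
        "status": {"isConnected": False},
    },
}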
consider the connection to be always existing, hence the use of "PUT", but would that be a bad thing considering "PUT" should be idempotent?
Connections do not always exist. They may come and go.
While a relational database may have a many-to-many association table which you can UPDATE, that's a peculiar special case that doesn't really make much sense outside the world of DBA's.
The connection between two RESTful things is rarely a separate thing. It's an attribute of each of the RESTful things.
It's perfectly unclear what this "connection" thing is. You talk vaguely about it, but provide no details.
Lacking any usable facts, I'll guess that you're connecting devices to drivers and there's some kind of [Device]<-[Driver Status]->[Driver] relationship. The connection from device to driver can be a separate RESTful resource.
It can just as easily be an attribute of Device or Driver that does not actually have a separate, visible, RESTful resource.
[Again. Some frameworks like Django-Piston make it trivial to simply expose the underlying classes. This may not always be appropriate, however.]
are there good rules/tips to intelligently define a set of resources from a set of methods?
Yes. Don't do it. Resources aren't methods. Pretty much that's that.
If you have a lot of methods -- outside CRUD -- then you may have a data model issue. You may have too few classes of things expressed in your relational model and too many stateful updates of things.
Stateful objects are not inherently evil, but they need to be examined critically. In some cases, a PUT to change status of an object perhaps should have been a POST to add to the history of an object. The "current" state is the last thing POSTed.
Also.
You don't have to trivialize each resource as a class of things. You can have resources which are collections. You can POST a fairly complex document to a composite (properly a Facade) "resource". That complex document can imply several CRUD operations in the database.
You're wandering away from simple RESTful. Your question remains intentionally murky. "method rich application core" doesn't mean much. Without concrete examples, it's impossible to imagine.
Api logic influencing application structure
If these are somehow different, you're probably creating needless, no-value complexity.
is it good practice to apply separation of concerns ?
Always. Why ask?
a lot of this seems to come from my confusion about how to map a rather method-rich API to a RESTful one, where resources should be nouns, not verbs: so when is it wise to consider an element a REST "resource"?
A resource is defined by your problem domain. It's usually something tangible. The methods (as in "method-rich API" are usually irrelevant. They're CRUD (Create, Retrieve, Update, Delete) operations. If you have something that's not essentially CRUD, you have to STOP coding. STOP writing code, and rethink the API to make it CRUD-like.
CRUD - Create-Retrieve-Update-Delete maps to REST's POST-GET-PUT-DELETE. If you can't recast your problem domain into those terms, stop coding. Stop coding until you get to CRUD rules.
I need to be able to display a list of available driver types (i.e. class names, not Driver instances) in the client: would this mean creating yet another resource like "DriverTypes"?
Correct. They're already part of your problem domain. You already have this class defined. You're just making it available through REST.
Here's the point. The problem domain has real-world objects. You have class definitions. They're tangible things. REST transfers the state of those tangible things.
Your software may have intangible things like "associations" or "links" or "connections" or other junk that's part of the software solution. This junk doesn't matter very much. It's implementation detail. Not real-world things.
An "association" is always visible from both of the two real-world RESTful resources. One resource may have a foreign-key-like reference that allows the client to do a RESTful fetch of another, related object. Or a resource may have a collection of other, related objects, and a single GET retrieves an object and a collection of related objects.
Either way, the real-world RESTful resources are what's available. The relationship is merely implied. Even if it's a physical many-to-many database table -- that doesn't mean it must be exposed. [Again. Some frameworks make it trivially easy to expose everything. This isn't always good.]
You can represent the path portion /rest with a Site object, but environments in the path must be a Resource. From there you have to handle the hierarchy yourself in the render_* methods of environments. The request object you get will have a postpath attribute that gives you the remainder of the path (i.e. after /rest/environments). You'll have to parse out the id, detect whether or not devices is given in the path, and if so pass the remainder of the path (and the request) down to your devices collection. Unfortunately, Twisted will not handle this decision for you.
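A rough sketch of that arrangement (Python 2-era twisted.web; the resource names are made up for illustration, and newer Twisted expects bytes for path segments and response bodies):

from twisted.internet import reactor
from twisted.web import resource, server

class EnvironmentsResource(resource.Resource):
    isLeaf = True  # everything after /rest/environments arrives in request.postpath

    def render_GET(self, request):
        segments = [s for s in request.postpath if s]  # e.g. ['42', 'devices', '7', 'driver']
        if not segments:
            return "all environments"
        env_id = segments[0]
        if len(segments) > 1 and segments[1] == "devices":
            # hand the rest of the path (and the request) down to the devices collection here
            return "devices for environment %s" % env_id
        return "environment %s" % env_id

root = resource.Resource()
rest = resource.Resource()
root.putChild("rest", rest)
rest.putChild("environments", EnvironmentsResource())

reactor.listenTCP(8080, server.Site(root))
reactor.run()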

What is so bad with threadlocals

Everybody in Django world seems to hate threadlocals(http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this(http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on threadlocals is bad because it is inelegant.
I have a scenario where threadlocals will make things significantly easier. (I have an app where people will have subdomains, so all the models need to have access to the current subdomain, and passing it in from requests is not worth it if the only problem with threadlocals is that they are inelegant or make for brittle code.)
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
I avoid this sort of usage of threadlocals, because it introduces an implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc). If I access some threadlocals data in models.py, now I have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access threadlocals and call the model methods with the needed data. This way the models.py remains self-contained and models can be used without the threadlocals coupling.
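A tiny sketch of that separation: the model method stays explicit and testable, and only a thin helper module touches the threadlocals. The Article model, its subdomain field, and the get_current_subdomain accessor (and its import path) are all made up for illustration:

# models.py -- stays explicit, so it also works from management commands, imports, tests
from django.db import models

class Article(models.Model):
    subdomain = models.CharField(max_length=100)

    @classmethod
    def for_subdomain(cls, subdomain):
        return cls.objects.filter(subdomain=subdomain)

# web_helpers.py -- the only module that knows about the threadlocal request context
from myapp.threadlocals import get_current_subdomain  # hypothetical accessor

def articles_for_current_subdomain():
    return Article.for_subdomain(get_current_subdomain())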
I don't think there is anything wrong with threadlocals - yes, it is a global variable, but besides that it's a normal tool. We use it just for this purpose (storing subdomain model in the context global to the current request from middleware) and it works perfectly.
So I say, use the right tool for the job; in this case threadlocals make your app much more elegant than passing the subdomain model around in all the model methods (not to mention the fact that it is not even always possible - when you are overriding Django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example - so threadlocals are the natural and only answer).
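Roughly, the pattern looks like the sketch below (old-style process_request middleware and the pre-1.6 get_query_set manager method, matching the Django of the time; the naive host-splitting and the subdomain field are assumptions):

import threading

from django.db import models

_locals = threading.local()

def get_current_subdomain():
    return getattr(_locals, "subdomain", None)

class SubdomainMiddleware(object):
    """Stash the current request's subdomain in a threadlocal."""
    def process_request(self, request):
        _locals.subdomain = request.get_host().split(".")[0]  # naive extraction

class SubdomainManager(models.Manager):
    """Limit queries to the current subdomain, as described above."""
    def get_query_set(self):
        qs = super(SubdomainManager, self).get_query_set()
        subdomain = get_current_subdomain()
        return qs.filter(subdomain=subdomain) if subdomain else qs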
Also a lot of Java frameworks seem to be using threadlocals a lot, so how is their case different from Python/Django 's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and or multi-processor CPUs, but also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.
You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, eg. having one database connection for each thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from request not worth it? Why don't you store it in session or user profile?
The difference with Java is that web development there is much more stateful than in the Python/Perl/PHP/Ruby world, so people are used to all kinds of contexts and stuff like that. I don't think that is an advantage, but it does seem like one at the beginning.
I have found that using ThreadLocal is an excellent way to implement dependency injection in an HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the MB of Spring Jars (not to mention its learning curve) and without all the cryptic repetitive #annotation nonsense and because it doesn't individually inject many object instances with the dependencies it's probably a heck of a lot faster and uses less memory.
It worked so well we open-sourced our exPOJO Filter that can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes infeasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, using session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether MySQL socket or named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
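A minimal sketch of the shared-ctypes idea; as noted above, this only helps for processes that multiprocessing itself spawns:

from multiprocessing import Process, Value

counter = Value("i", 0)  # an int living in shared memory

def worker(shared_counter):
    with shared_counter.get_lock():  # a Value carries its own lock
        shared_counter.value += 1

if __name__ == "__main__":
    procs = [Process(target=worker, args=(counter,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # 4: all workers saw and updated the same shared int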
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Another possibility could be an in-memory SQLite DB; being in-memory, it may speed up the process a bit, but you would still have to do the serialization stuff and all.
Note: an in-memory DB is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
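A minimal sketch of the ZODB usage described in that quote (standalone ZODB package with a FileStorage on disk; the file name and keys are made up, and anything picklable can go into the root mapping):

import transaction
from ZODB import FileStorage, DB

storage = FileStorage.FileStorage("app-state.fs")
db = DB(storage)
connection = db.open()
root = connection.root()   # behaves like a persistent dictionary

root["user_state"] = {"user42": {"score": 1.0}}  # any picklable object
transaction.commit()

connection.close()
db.close()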
First of all, your approach is not a common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-processing environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do it easily by using a global variable that is initialized while your WSGI application is being created, or while the module that contains the object is being loaded, etc.; multi-processing will do fine for you.
If you need to change the object and access it from every thread, you need to make sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But, if multiple threads are accessing and changing your object the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
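A sketch of both cases in plain WSGI terms; build_state and the dictionary it returns are hypothetical stand-ins for the large object. A read-only object can simply be built at module import, while a mutable one needs the lock:

import threading

def build_state():
    # hypothetical: stands in for loading the big, mostly read-only object
    return {"hits": 0}

SHARED_STATE = build_state()      # created once per worker process, at import time
_state_lock = threading.Lock()    # only needed if requests mutate SHARED_STATE

def application(environ, start_response):
    with _state_lock:
        SHARED_STATE["hits"] += 1                       # example mutation
        body = "hits so far: %d" % SHARED_STATE["hits"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]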
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state. It sounds like, if the serialisation is the bottleneck, then the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the Reddit guys discuss what they use for state, so that may be useful to listen to.
