Python object hierarchy, and REST resources?

I am currently running into an "architectural" problem in my Python app (using Twisted) that uses a REST API, and I am looking for feedback.
Warning! Long post ahead!
Let's assume the following object hierarchy:
class Device(object):
    def __init__(self):
        self._driver = Driver()
        self._status = Status()
        self._tasks = TaskManager()

    def __getattr__(self, attr_name):
        # Fall back to the task manager for unknown attributes
        if hasattr(self._tasks, attr_name):
            return getattr(self._tasks, attr_name)
        else:
            raise AttributeError(attr_name)

class Driver(object):
    def __init__(self):
        self._status = DriverStatus()

    def connect(self):
        """some code here"""

    def disconnect(self):
        """some code here"""

class DriverStatus(object):
    def __init__(self):
        self._isConnected = False
        self._isPluggedIn = False
I also have a rather deep object hierarchy (the above elements are only a small part of it), so right now this gives me the following resources in the REST API (I know REST is about media types, not URL hierarchy, but this is for simplicity's sake):
/rest/environments
/rest/environments/{id}
/rest/environments/{id}/devices/
/rest/environments/{id}/devices/{deviceId}
/rest/environments/{id}/devices/{deviceId}/driver
/rest/environments/{id}/devices/{deviceId}/driver/driverstatus
I switched a few months back from a "dirty" SOAP-type API to REST, but I am becoming unsure about how to handle what seems like added complexity:
Proliferation of REST resources/media types: for example, instead of having just a Device resource I now have all these resources:
Device
DeviceStatus
Driver
DriverStatus
While these all make sense from a RESTful point of view, is it normal to have a lot of sub-resources that each map to a separate Python class?
Mapping a method-rich application core to a RESTful API: in REST, resources should be nouns, not verbs. Are there good rules/tips for intelligently defining a set of resources from a set of methods? (The most comprehensive example I have found so far is this article.)
API logic influencing application structure: should an application's API logic at least partially guide some of its internal logic, or is it good practice to apply separation of concerns? I.e., should I have an intermediate layer of "resource" objects whose job is to communicate with the application core, but that do not map one-to-one to the core's classes?
How would one correctly handle the following in a RESTful way: I need to be able to display a list of available driver types (i.e. class names, not Driver instances) in the client. Would this mean creating yet another resource like "DriverTypes"?
These are rather long-winded questions, so thanks for your patience; any pointers, feedback and criticism are more than welcome!
To S.Lott:
By "too fragmented resources" what i meant was, lots of different sub resources that basically still apply to the same server side entity
For The "connection" : So that would be a modified version of the "DriverStatus" resource then ? I consider the connection to be always existing, hence the use of "PUT" , but would that be bad thing considering "PUT" should be idempotent ?
You are right about "stopping coding and rethinking", that is the reason I asked all these questions and putting things down, on paper to get a better overview.
-The thing is, right now the basic "real world objects" as you put them make sense to me as rest resources /collections of resources, and they are correctly manipulated via POST, GET, UPDATE, DELETE , but I am having a hard time getting my head around the Rest approach for things that I do not instinctively view as "Resources".

Rule 1. REST is about objects. Not methods.
The REST "resources" have become too fragmented
False. Always false. REST resources are independent. They can't be "too" fragmented.
instead of having just a Device resource I now have all these resources:
Device DeviceStatus Driver DriverStatus
While these all make sense from a [RESTful] point of view, is it normal to have a lot of sub-resources that each map to a separate Python class?
Actually, they don't make sense. Hence your question.
Device is a thing. /rest/environments/{id}/devices/{deviceId}
It has status. You should consider providing the status and the device information together as a single composite document that describes a device.
Just because your relational database is normalized does not mean your RESTful objects need to be precisely as normalized as your database. While it's simpler (and many frameworks make it very, very simple to do this) it may not be meaningful.
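For illustration, a single GET of a device might then return one composite document like the following (a sketch only; all field names are invented):
# GET /rest/environments/{id}/devices/{deviceId} could return the device,
# its status and its driver info together (hypothetical field names):
device_document = {
    "id": "deviceId",
    "status": {"isConnected": True, "isPluggedIn": True},
    "driver": {
        "type": "SerialDriver",
        "status": {"isConnected": True},
    },
}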
consider the connection to be always existing, hence the use of "PUT", but would that be a bad thing considering "PUT" should be idempotent?
Connections do not always exist. They may come and go.
While a relational database may have a many-to-many association table which you can UPDATE, that's a peculiar special case that doesn't really make much sense outside the world of DBAs.
The connection between two RESTful things is rarely a separate thing. It's an attribute of each of the RESTful things.
It's perfectly unclear what this "connection" thing is. You talk vaguely about it, but provide no details.
Lacking any usable facts, I'll guess that you're connecting devices to drivers and there's some kind of [Device]<-[Driver Status]->[Driver] relationship. The connection from device to driver can be a separate RESTful resource.
It can just as easily be an attribute of Device or Driver that does not actually have a separate, visible, RESTful resource.
[Again, some frameworks like Django-Piston make it trivial to simply expose the underlying classes. This may not always be appropriate, however.]
are there good rules/tips to intelligently define a set of resources from a set of methods?
Yes. Don't do it. Resources aren't methods. Pretty much that's that.
If you have a lot of methods -- outside CRUD -- then you may have a data model issue. You may have too few classes of things expressed in your relational model and too many stateful updates of things.
Stateful objects are not inherently evil, but they need to be examined critically. In some cases, a PUT to change status of an object perhaps should have been a POST to add to the history of an object. The "current" state is the last thing POSTed.
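As a sketch of that idea (the URLs are invented for illustration):
POST /rest/environments/{id}/devices/{deviceId}/status-history   (append a state change)
GET  /rest/environments/{id}/devices/{deviceId}                  (the representation reflects the last POSTed status)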
Also.
You don't have to trivialize each resource as a class of things. You can have resources which are collections. You can POST a fairly complex document to a composite (properly a Facade) "resource". That complex document can imply several CRUD operations in the database.
You're wandering away from simple RESTful. Your question remains intentionally murky. "method rich application core" doesn't mean much. Without concrete examples, it's impossible to imagine.
Api logic influencing application structure
If these are somehow different, you're probably creating needless, no-value complexity.
is it good practice to apply separation of concerns ?
Always. Why ask?
a lot of this seems to come from my confusion about how to map a rather method-rich API to a RESTful one, where resources should be nouns, not verbs: so when is it wise to consider an element a REST "resource"?
A resource is defined by your problem domain. It's usually something tangible. The methods (as in "method-rich API") are usually irrelevant. They're CRUD (Create, Retrieve, Update, Delete) operations. If you have something that's not essentially CRUD, you have to STOP coding. STOP writing code, and rethink the API to make it CRUD-like.
CRUD - Create-Retrieve-Update-Delete maps to REST's POST-GET-PUT-DELETE. If you can't recast your problem domain into those terms, stop coding. Stop coding until you get to CRUD rules.
I need to be able to display a list of available driver types (i.e. class names, not Driver instances) in the client: would this mean creating yet another resource like "DriverTypes"?
Correct. They're already part of your problem domain. You already have this class defined. You're just making it available through REST.
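A minimal sketch of how such a collection could be produced server-side, assuming all concrete drivers subclass Driver (the handler name is hypothetical):
def list_driver_types():
    # Enumerate the driver classes themselves, not instances;
    # a GET on /rest/drivertypes could return this list as JSON,
    # e.g. ["SerialDriver", "UsbDriver"].
    return [cls.__name__ for cls in Driver.__subclasses__()]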
Here's the point. The problem domain has real-world objects. You have class definitions. They're tangible things. REST transfers the state of those tangible things.
Your software may have intangible things like "associations" or "links" or "connections" and other junk that's part of the software solution. This junk doesn't matter very much. It's implementation detail, not real-world things.
An "association" is always visible from both of the two real-world RESTful resources. One resource may have a foreign-key-like reference that allows the client to do a RESTful fetch of another, related object. Or a resource may have a collection of other, related objects, and a single GET retrieves an object and a collection of related objects.
Either way, the real-world RESTful resources are what's available. The relationship is merely implied. Even if it's a physical many-to-many database table -- that doesn't mean it must be exposed. [Again. Some frameworks make it trivially easy to expose everything. This isn't always good.]

You can represent the path portion /rest with a Site object, but environments in the path must be a Resource. From there you have to handle the hierarchy yourself in the render_* methods of environments. The request object you get will have a postpath attribute that gives you the remainder of the path (i.e. after /rest/environments). You'll have to parse out the id, detect whether or not devices is given in the path, and if so pass the remainder of the path (and the request) down to your devices collection. Unfortunately, Twisted will not handle this decision for you.
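A minimal sketch of that manual dispatch (untested; the payload details are placeholders):
from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

class Environments(Resource):
    isLeaf = True  # we handle the remainder of the path ourselves

    def render_GET(self, request):
        # request.postpath holds the segments after this resource,
        # e.g. [b'42', b'devices', b'7'] for /rest/environments/42/devices/7
        segments = [s.decode() for s in request.postpath if s]
        if not segments:
            return b'{"environments": []}'  # the collection itself
        env_id = segments[0]
        if len(segments) > 1 and segments[1] == "devices":
            # hand the rest of the path down to the devices collection here
            return b'{"devices": []}'
        return b'{"environment": "%s"}' % env_id.encode()

root = Resource()
rest = Resource()
root.putChild(b"rest", rest)
rest.putChild(b"environments", Environments())

if __name__ == "__main__":
    reactor.listenTCP(8080, Site(root))
    reactor.run()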

Related

Is it possible to create a Tornado application from instances of Request/Web Handlers instead of class definitions?

The apparent requirement to provide class definitions instead of instances causes very difficult problems. I have two different classes, and one of them needs a reference to the other:
app = tornado.web.Application([
    (r"/fusion.*", FusionListener),
    (r"/admin.*", AdminListener),
])
The AdminListener needs a reference to the FusionListener, since there are internal items that need to be managed. Sending messages is an unacceptable additional complexity here. The current mechanism does not seem to afford that possibility.
What kind of pattern can get around this shortcoming in Tornado?
For my use case there is both persistent and in-memory state. We have Spark and Postgres repositories for the former. For the latter I had already designed and written the application to have instance-level data structures. But I have gathered that instance attributes on Tornado-launched RequestHandler subclasses do not persist across requests.
The latter wants to live in a class that manages the state, but I am compelled to significantly redraw the boundaries due to this design of Tornado. Instead it will be necessary to push everything into global variables. Few would argue that this is a preferred design. I will be dumping Tornado as soon as I can get the time.
Not sure what the alternative will be: I already reverted from CherryPy due to significant limitations of its own; here are a couple of my questions on it:
404 for path served by Cherrypy
How to specify the listening server instances using cherrypy tree.mount?
I got through those with some scars but still in one piece. There were additional issues that knocked me out: URLs were not being served, and there was no clear end to the mole-whacking. It also generally does not get a lot of attention, and has confusing, outdated or incomplete documentation. There are plenty of docs (that's why I got started on it), but the holes make for a series of rabbit-chasing episodes.
Flask and Django have their own issues. It seems finding a functionally adequate but not super-heavyweight web server in Python is an illusory target. I am not certain yet which framework has the fewest gotchas.
Posting this as answer in order to benefit from proper code formatting.
The paradigm I used for keeping track of existing instances of a RequestHandler is very simple:
class MyHandler(RequestHandler):
    _instances = set()

    def get(self):
        if needs_to_be_added(self.request):  # some conditions can be used here
            if len(MyHandler._instances) > THRESHOLD:  # careful with memory usage
                return self.finish("some_error")
            MyHandler._instances.add(self)
        ...

    def post(self):
        if needs_to_be_removed(self.request):
            MyHandler._instances.discard(self)
        ...
Of course you might need to change when to add / discard elements.
Depending on how you want to refer to existing instances in the future (by some key for example) you could use a dict for keeping track of them.
I don't think you can use weak references here (as in the classes from the weakref module), because those only track live instances, which won't work with the way request handler instances are created and destroyed.
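Alternatively, if the underlying goal is to share non-global state between handlers, Tornado lets you pass objects to a handler's initialize() method via the optional dict in each route tuple. A minimal sketch (the SharedState class is invented; the handler names echo the question):
import tornado.ioloop
import tornado.web

class SharedState(object):
    """Hypothetical container for the in-memory state both handlers need."""
    def __init__(self):
        self.items = {}

class FusionListener(tornado.web.RequestHandler):
    def initialize(self, state):
        self.state = state  # the same object on every request

    def get(self):
        self.write({"count": len(self.state.items)})

class AdminListener(tornado.web.RequestHandler):
    def initialize(self, state):
        self.state = state  # the very same object FusionListener sees

    def get(self):
        self.write({"keys": list(self.state.items)})

if __name__ == "__main__":
    state = SharedState()
    app = tornado.web.Application([
        (r"/fusion.*", FusionListener, dict(state=state)),
        (r"/admin.*", AdminListener, dict(state=state)),
    ])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()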

Large(ish) django application architecture

How does one properly structure a larger django website such as to retain testability and maintainability?
In the best Django spirit (I hope) we started out by not caring too much about decoupling between different parts of our website. We did separate it into different apps, but those depend rather directly upon each other, through common use of model classes and direct method calls.
This is getting quite entangled. For example, one of our actions/services looks like this:
def do_apply_for_flat(user, flat, bid_amount):
    assert can_apply(user, flat)
    application = Application.objects.create(
        user=user, flat=flat, amount=bid_amount,
        status=Application.STATUS_ACTIVE)
    events.logger.application_added(application)
    mails.send_applicant_application_added(application)
    mails.send_lessor_application_received(application)
    return application
The function does not only perform the actual business process; it also handles event logging and sending mails to the involved users. I don't think there's anything inherently wrong with this approach. Yet it's getting more and more difficult to properly reason about the code and even test the application, as it's getting harder to separate the parts intellectually and programmatically.
So, my question is, how do the big boys structure their applications such that:
Different parts of the application can be tested in isolation
Testing stays fast by only enabling parts that you really need for a specific test
Code coupling is reduced
My take on the problem would be to introduce a centralized signal hub (just a bunch of Django signals in a single Python file) to which the individual Django apps may publish or subscribe. The above example function would publish an application_added event, which the mails and events apps would listen to. Then, for efficient testing, I would disconnect the parts I don't need. This also increases decoupling considerably, as services don't need to know about sending mails at all.
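A minimal sketch of such a hub (untested; the module layout and names are hypothetical):
# signals.py: the hub is nothing more than a module of Signal objects
import django.dispatch

application_added = django.dispatch.Signal()

# services.py: publish instead of calling mails/events directly
from signals import application_added

def do_apply_for_flat(user, flat, bid_amount):
    assert can_apply(user, flat)
    application = Application.objects.create(
        user=user, flat=flat, amount=bid_amount,
        status=Application.STATUS_ACTIVE)
    application_added.send(sender=do_apply_for_flat, application=application)
    return application

# mails.py: subscribe to the hub; tests can simply disconnect this receiver
from django.dispatch import receiver
from signals import application_added

@receiver(application_added)
def on_application_added(sender, application, **kwargs):
    send_applicant_application_added(application)
    send_lessor_application_received(application)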
But I'm unsure, and thus very interested in what the accepted practice is for these kinds of problems.
For testing, you should mock your dependencies. The logging and mailing components, for example, should be mocked during unit testing of the views. I would usually use python-mock; this allows your views to be tested independently of the logging and mailing components, and vice versa. Just assert that your views are making the right service calls, and mock the return value/side effect of those calls.
You should also avoid touching the database when doing tests. Instead, try to use in-memory objects as much as possible: instead of Application.objects.create(), defer the save() to the caller, so that you can test the services without actually having the Application in the database. Alternatively, patch out the save() method so it won't actually save, but that's much more tedious.
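A minimal sketch of such a test with unittest.mock, using the function from the question (the module paths and fixture helpers are hypothetical):
from unittest import mock

@mock.patch("myapp.services.Application")
@mock.patch("myapp.services.mails")
@mock.patch("myapp.services.events")
def test_do_apply_for_flat(mock_events, mock_mails, mock_application):
    user, flat = make_user(), make_flat()  # in-memory fixtures, no DB
    application = do_apply_for_flat(user, flat, bid_amount=100)
    # the service returned whatever the (mocked) ORM created ...
    assert application is mock_application.objects.create.return_value
    # ... and called its collaborators, without logging or mailing anything
    mock_events.logger.application_added.assert_called_once_with(application)
    mock_mails.send_applicant_application_added.assert_called_once_with(application)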
Transfer some parts of your app to different microservices. This will make some parts of your app focused on doing one or two things right (e.g. event logging, emails). Code coupling is also reduced and different parts of the site can be tested in isolation as well.
The microservice architecture style involves developing a single application as a collection of smaller services that usually communicate via an API.
You might need to use a smaller framework like Flask.
Resources:
For more information on microservices click here:
http://martinfowler.com/articles/microservices.html
http://aurelavramescu.blogspot.com/2014/06/user-microservice-python-way.html
First, try to break down your big task into smaller classes. Connect them with the usual method calls or with Django signals.
If you feel that the sub-tasks are independent enough, you can implement them as several Django applications in the same project. See the Django tutorial, which describes relation between applications and projects.

I need to run untrusted server-side code in a web app - what are my options?

# Context -- skip if you want to get right to the point
I've been building a rather complex web application in Python (Bottle/gevent/MongoDB). It is an RSVP system which allows several independent front-end instances with registration forms, as well as back-end access with granular user permissions (those users are our clients). I now need to implement a flexible map-reduce engine to collect statistics on the registration data. A one-size-fits-all solution is impossible since the data gathered varies from instance to instance. I also want to keep this open to our more technically inclined clients.
# End of context
So I need to execute arbitrary strings of code (some kind of ad-hoc plugin - language doesn't matter) entered through a web interface. I've already learned that it's virtually impossible to properly sandbox Python, so that's no option.
As of now I've looked into Lua and found Lupa, Lunatic Python and Lupy, but all three of them allow access to parts of the Python runtime.
There's also PyExecJS and its various runtimes (V8, Node, SpiderMonkey), but I have no idea whether it poses any security risks.
Questions:
1. Does anyone know of another (more fitting) option?
2. To those familiar with any of the Lua bindings: Is it possible to make them completely safe without too much hassle?
3. To those familiar with PyExecJS: How secure is it? Also, what kind of performance should I expect for, say, calling a short mapping function 1000 times and then iterating over a 1000-item list?
Here are a few ways you can run untrusted code:
a Docker container that runs the code; I would suggest checking out codecube.io, it does exactly what you want, and you can learn more about the process here (a sketch of this approach follows below)
using the libsandbox libraries, though at present the documentation is pretty bad
PyPy’s sandboxing
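As promised above, here is a sketch of the container approach: run each snippet in a locked-down, throwaway Docker container via the CLI. All of the limits and the image name are illustrative; tune them to your threat model.
import subprocess

def run_untrusted(code, timeout=5):
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",   # no network access
        "--memory", "64m",     # cap memory
        "--cpus", "0.5",       # cap CPU
        "--pids-limit", "32",  # no fork bombs
        "--read-only",         # immutable filesystem
        "python:3-alpine",
        "python", "-c", code,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return None, "timed out"
    return result.stdout, result.stderr

print(run_untrusted("print(sum(range(10)))"))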
Sneklang is a strict subset of Python that is safely evaluated in your provided scope.
It is limited by scope size and by the number of node evaluation steps, and it protects against infinite loops, stack overflows, and excessive memory usage.
There is an online sandbox as well: https://sneklang.functup.com
I've made this project specifically because I had the same requirements.

What is so bad with threadlocals

Everybody in the Django world seems to hate threadlocals (http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this (http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on threadlocals being bad because they are inelegant.
I have a scenario where threadlocals will make things significantly easier. (I have an app where people will have subdomains, so all the models need access to the current subdomain, and passing it down from the requests is not worth it if the only problem with threadlocals is that they are inelegant, or make for brittle code.)
Also, a lot of Java frameworks seem to use threadlocals heavily, so how is their case different from Python/Django's?
I avoid this sort of usage of threadlocals, because it introduces implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc.). If I access threadlocals data in models.py, I now have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access the threadlocals and call the model methods with the needed data. This way models.py remains self-contained and the models can be used without the threadlocals coupling.
I don't think there is anything wrong with threadlocals; yes, it is a global variable, but besides that it's a normal tool. We use it for exactly this purpose (storing the subdomain model in a context global to the current request, set from middleware) and it works perfectly.
So I say: use the right tool for the job. In this case threadlocals make your app much more elegant than passing the subdomain model around in all the model methods (not to mention that this is not even always possible: when you override Django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example, so threadlocals are the natural and only answer).
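A minimal sketch of the pattern described above (old-style Django middleware assumed; the model and field names are hypothetical):
import threading

_local = threading.local()

class SubdomainMiddleware(object):
    """Stash the current request's subdomain in a thread-local."""
    def process_request(self, request):
        _local.subdomain = request.get_host().split(".")[0]

def get_current_subdomain():
    return getattr(_local, "subdomain", None)

# A manager can then scope every query without any extra parameters:
from django.db import models

class SubdomainManager(models.Manager):
    # get_query_set() was renamed get_queryset() in later Django versions
    def get_query_set(self):
        qs = super(SubdomainManager, self).get_query_set()
        return qs.filter(subdomain=get_current_subdomain())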
Also, a lot of Java frameworks seem to use threadlocals heavily, so how is their case different from Python/Django's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and or multi-processor CPUs, but also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.
You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, e.g. having one database connection per thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from the request not worth it? Why don't you store it in the session or the user profile?
The difference with Java is that web development there is much more stateful than in the Python/Perl/PHP/Ruby world, so people are used to all kinds of contexts and the like. I don't think that is an advantage, but it does seem like one at the beginning.
I have found that using ThreadLocal is an excellent way to implement dependency injection in an HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the megabytes of Spring JARs (not to mention the learning curve) and without all the cryptic, repetitive #annotation nonsense; and because it doesn't individually inject many object instances with the dependencies, it's probably a lot faster and uses less memory.
It worked so well that we open-sourced our exPOJO Filter, which can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user-specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes infeasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared-memory solutions, but the only relevant thing I've found is POSH. However, POSH doesn't seem to be widely used, and I don't feel comfortable integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it: learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether a MySQL socket or a named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to allow meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
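A minimal sketch of the cached-session setup being suggested (Django assumed; the cache backend and key names are illustrative):
# settings.py
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
        "LOCATION": "127.0.0.1:11211",
    }
}
SESSION_ENGINE = "django.contrib.sessions.backends.cache"

# views.py: the per-user algorithm state just lives in the session
from django.http import JsonResponse

def update_state(request):
    state = request.session.get("algo_state", {})
    state["steps"] = state.get("steps", 0) + 1
    request.session["algo_state"] = state  # reassign so the change is saved
    return JsonResponse(state)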
I think the multiprocessing framework has something applicable here, namely the sharedctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
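A tiny sketch of what that looks like (shared ctypes only work between processes related through multiprocessing, per the caveat above):
from multiprocessing import Process, Value, Array

def worker(counter, scores):
    with counter.get_lock():
        counter.value += 1   # safe concurrent increment
    scores[0] = 0.5          # shared array of doubles

if __name__ == "__main__":
    counter = Value("i", 0)          # shared int
    scores = Array("d", [0.0] * 4)   # shared double array
    procs = [Process(target=worker, args=(counter, scores)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value, scores[:])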
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
Another possibility could be an in-memory SQLite DB, which may speed up the process a bit (being an in-memory DB), but you would still have to do the serialization and all.
Note: an in-memory DB is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
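A minimal sketch of ZODB usage with the standalone package (the stored object is arbitrary picklable data):
import transaction
import ZODB
import ZODB.FileStorage

storage = ZODB.FileStorage.FileStorage("app-state.fs")
db = ZODB.DB(storage)
connection = db.open()
root = connection.root()

# Anything picklable dropped into the root mapping becomes persistent.
root["user_state"] = {"user42": {"score": 10}}
transaction.commit()

print(root["user_state"])
db.close()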
First of all, your approach is not a common web-development practice. Even when multithreading is used, web applications are designed to be able to run in multi-processing environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do so easily by using a global variable that is initialized while your WSGI application is being created, or when the module containing the object is loaded; multi-processing will do fine for you then.
If you need to change the object and access it from every thread, you need to be sure your object is thread-safe; use locks to ensure that. And use a single-server context, one process. Any multithreaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance, which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like, if the serialization is the bottleneck, the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the Reddit guys discuss what they use for state, so that may be useful to listen to.
