What is so bad about threadlocals? - Python

Everybody in the Django world seems to hate threadlocals (http://code.djangoproject.com/ticket/4280, http://code.djangoproject.com/wiki/CookBookThreadlocalsAndUser). I read Armin's essay on this (http://lucumr.pocoo.org/2006/7/10/why-i-cant-stand-threadlocal-and-others), but most of it hinges on the argument that threadlocals are bad because they are inelegant.
I have a scenario where threadlocals will make things significantly easier. (I have an app where people will have subdomains, so all the models need access to the current subdomain, and passing it down from the request is not worth it if the only problem with threadlocals is that they are inelegant or make for brittle code.)
Also, a lot of Java frameworks seem to use threadlocals heavily, so how is their case different from Python/Django's?

I avoid this sort of usage of threadlocals because it introduces an implicit non-local coupling. I frequently use models in all kinds of non-HTTP-oriented ways (local management commands, data import/export, etc.). If I access some threadlocals data in models.py, I now have to find some way to ensure that it is always populated whenever I use my models, and this could get quite ugly.
In my opinion, more explicit code is cleaner and more maintainable. If a model method requires a subdomain in order to operate, that fact should be made obvious by having the method accept that subdomain as a parameter.
If I absolutely could find no way around storing request data in threadlocals, I would at least implement wrapper methods in a separate module that access threadlocals and call the model methods with the needed data. This way the models.py remains self-contained and models can be used without the threadlocals coupling.
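As a rough sketch of that wrapper idea (the module, helper and model names here are purely illustrative), the threadlocals access lives in one thin module while models.py keeps taking the subdomain explicitly:

```python
# subdomain_helpers.py -- hypothetical wrapper module; only this file touches threadlocals
from myapp.threadlocals import get_current_subdomain  # assumed helper around threading.local
from myapp.models import Report

def reports_for_current_subdomain():
    # Pull the subdomain out of thread-local storage here, then call the
    # explicit model API, so models.py stays usable from management commands.
    subdomain = get_current_subdomain()
    return Report.objects.filter(subdomain=subdomain)
```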

I don't think there is anything wrong with threadlocals - yes, it is a global variable, but besides that it's a normal tool. We use it for just this purpose (storing the subdomain model in a context global to the current request, set from middleware) and it works perfectly.
So I say, use the right tool for the job. In this case threadlocals make your app much more elegant than passing the subdomain model around in all the model methods (not to mention that it is not even always possible - when you override Django manager methods to limit queries by subdomain, you have no way to pass anything extra to get_query_set, for example - so threadlocals are the natural, and only, answer).
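For reference, a bare-bones version of that pattern (the middleware and manager names are illustrative; on newer Django versions the middleware API differs and get_query_set is get_queryset):

```python
# threadlocals.py -- sketch of the subdomain-in-threadlocals pattern
import threading

_locals = threading.local()

def get_current_subdomain():
    return getattr(_locals, "subdomain", None)

class SubdomainMiddleware(object):
    # Old-style middleware: remember the request's subdomain for this thread.
    def process_request(self, request):
        _locals.subdomain = request.get_host().split(".")[0]

# managers.py
from django.db import models

class SubdomainManager(models.Manager):
    # get_query_set() was the hook on older Django; newer versions use get_queryset().
    def get_query_set(self):
        qs = super(SubdomainManager, self).get_query_set()
        subdomain = get_current_subdomain()
        return qs.filter(subdomain=subdomain) if subdomain is not None else qs
```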

Also, a lot of Java frameworks seem to use threadlocals heavily, so how is their case different from Python/Django's?
CPython's interpreter has a Global Interpreter Lock (GIL) which means that only one Python thread can be executed by the interpreter at any given time. It isn't clear to me that a Python interpreter implementation would necessarily need to use more than one operating system thread to achieve this, although in practice CPython does.
Java's main locking mechanism is via objects' monitor locks. This is a decentralized approach that allows the use of multiple concurrent threads on multi-core and/or multi-processor CPUs, but it also produces much more complicated synchronization issues for the programmer to deal with.
These synchronization issues only arise with "shared-mutable state". If the state isn't mutable, or as in the case of a ThreadLocal it isn't shared, then that is one less complicated problem for the Java programmer to solve.
A CPython programmer still has to deal with the possibility of race conditions, but some of the more esoteric Java problems (such as publication) are presumably solved by the interpreter.
A CPython programmer also has the option to code performance critical code in Python-callable C or C++ code where the GIL restriction does not apply. Technically a Java programmer has a similar option via JNI, but this is rightly or wrongly considered less acceptable in Java than in Python.

You want to use threadlocals when you're working with multiple threads and want to localize some objects to a specific thread, e.g. having one database connection per thread.
In your case, you want to use it more as a global context (if I understand you correctly), which is probably a bad idea. It will make your app a bit slower, more coupled and harder to test.
Why is passing it from the request not worth it? Why don't you store it in the session or in the user profile?
The difference with Java is that web development there is much more stateful than in the Python/Perl/PHP/Ruby world, so people are used to all kinds of contexts and things like that. I don't think that is an advantage, but it does seem like one at the beginning.

I have found using ThreadLocal is an excellent way to implement Dependency Injection in an HTTP request/response environment (i.e. any webapp). You just set up a servlet filter to 'inject' the object you need into the thread on receiving the request and 'uninject' it on returning the response.
It's a smart man's DI without all the XML ugliness, without the megabytes of Spring jars (not to mention the learning curve), and without all the cryptic, repetitive #annotation nonsense; and because it doesn't individually inject many object instances with the dependencies, it's probably a heck of a lot faster and uses less memory.
It worked so well that we open-sourced our exPOJO Filter, which can inject a Hibernate session or a JDO PersistenceManager using ThreadLocal:
http://www.expojo.com

Related

Should I take steps to ensure a Django app can scale before writing it?

So, I'm looking at writing an app with Python 2, Django (plus django-rest-framework), Postgres and Angular.
I'm aware there are lots of things that can be done:
multi-server setup behind load balancer
DB replication/sharding?
caching (in various ways)
swapping DRF serialiser for serpy
running on python3
running on pypy
My question is: which of these (or other things) should really be done right at the start of the project?
Write with scalability in mind.
Scalability is not limited just to production servers/environments; it also applies to development environments.
Always write with scalability in mind.
At development
Scalability during development lets you develop the product seamlessly.
Structure your repository
Use git branching models like GitFlow so that developers can work in parallel, or a single developer can switch between working on different features. Use a bug tracker.
Design your apps.
Before actually writing a single line of code, write down what apps you are going to write. Design apps so as to minimize relations (ManyToMany, ForeignKey, etc.) and imports. Django offers a pluggable app architecture; feel free to use it wisely.
Write your tests first.
This ensures that you can migrate (in production environments), upgrade and downgrade with less pain and hair loss. Trust me, writing tests feels boring, but it is worth it.
Abstract models, managers
Use abstract models and managers; they can eliminate boilerplate model code and help you maintain the code.
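For instance, a small abstract base model (field and model names here are just an example) that keeps repeated fields out of every concrete model:

```python
from django.db import models

class TimeStampedModel(models.Model):
    # Shared columns defined once; every concrete subclass inherits them.
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        abstract = True   # no table is created for this base class

class Invoice(TimeStampedModel):
    amount = models.DecimalField(max_digits=10, decimal_places=2)
```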
Name variables, classes and methods descriptively.
Name variables, classes and methods descriptively, so that you can tell what they represent without looking at the documentation.
Document code.
Feel free to document classes and methods, so you or other peers who look at the code get an idea of what it is intended for, rather than having to trace through it to see what a method is doing.
Use debug toolbar
Use django-debug-toolbar in development; as you test your API, use prefetch_related() and select_related() to minimize or eliminate duplicated queries.
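A quick illustration of the kind of duplicated queries the toolbar exposes (Book, author and tags are hypothetical model names):

```python
from myapp.models import Book   # hypothetical model with an `author` FK and a `tags` M2M

# Naive version: accessing book.author inside the loop issues one extra query per book.
for book in Book.objects.all():
    print(book.author.name)

# Tuned version: related rows are fetched up front, which the debug toolbar will confirm.
for book in Book.objects.select_related("author").prefetch_related("tags"):
    print(book.author.name, [t.name for t in book.tags.all()])
```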
Modularize code.
Modularize the code. Python and Django in general encourage the use of modules. Modules are easy to manage. Use classes, inheritance and abstract base classes to reuse code.
Use Continuous Integration
Use continuous integration to test your repo and make sure new pushes don't break the system.
At production.
Scalability in production lets you serve the product to an essentially unlimited number of users seamlessly.
For a multi-server setup:
Stick with REST design principles.
Eliminate sessions.
Use a distributed cache like Redis.
Swapping DRF serializer for serpy
Start with serpy if you need more speed and if you are comfortable with it. It is better to start with serpy than to rewrite DRF serializers later, since writing either looks similar, but make sure you are not wasting time optimizing for a lost 1 or 2 ms.
Running on python3
Depends on the libraries that you plan to use.
Running on pypy
PyPy is faster than the standard implementation. Whether you can use it depends on the compatibility of the libraries you need; there is a published list of compatible packages and their compatibility status.
Now, the question:
Which of these (or other things) should really be done right at the start of the project?
Answer: Development (1, 2, 3, 4, 5, 6, 7, 8); Production (1, 2).
I don't think you need to start worrying about the setup right away. I would discourage premature optimization. Rather, run the app in production and profile it. See what affects the performance when you hit scale - then you will know where the bottleneck is.
The first and main things you have to get right are a clean, correct DB schema and clear, readable, correctly factored (DRY... unless it's accidental duplication) and decoupled code. If you know how to design a relational DB schema and learn to use Python and Django properly, you shouldn't have many problems so far. If you get both of these things right, it will (well, it should) be easy to scale: by adding cache where needed (Redis, Memcached, or an intermediary NoSQL document database storing "pre-processed" versions of your often-accessed data), adding servers, load-balancing, etc., depending on your application's needs. Django is built to scale easily, and unless you do stupid things it does scale easily.

I need to run untrusted server-side code in a web app - what are my options?

# Context -- skip if you want to get right to the point
I've been building a rather complex web application in Python (Bottle/gevent/MongoDB). It is an RSVP system which allows several independent front-end instances with registration forms, as well as back-end access with granular user permissions (those users are our clients). I now need to implement a flexible map-reduce engine to collect statistics on the registration data. A one-size-fits-all solution is impossible since the data gathered varies from instance to instance. I also want to keep this open to our more technically inclined clients.
# End of context
So I need to execute arbitrary strings of code (some kind of ad-hoc plugin - language doesn't matter) entered through a web interface. I've already learned that it's virtually impossible to properly sandbox Python, so that's no option.
As of now I've looked into Lua and found Lupa, Lunatic Python and Lupy, but all three of them allow access to parts of the Python runtime.
There's also PyExecJS and its various runtimes (V8, Node, SpiderMonkey), but I have no idea whether it poses any security risks.
Questions:
1. Does anyone know of another (more fitting) option?
2. To those familiar with any of the Lua bindings: Is it possible to make them completely safe without too much hassle?
3. To those familiar with PyExecJS: How secure is it? Also, what kind of performance should I expect for, say, calling a short mapping function 1000 times and then iterating over a 1000-item list?
Here are a few ways you can run untrusted code:
a Docker container that runs the code; I would suggest checking out codecube.io, which does exactly what you want, and you can learn more about the process here (a rough sketch of this approach follows the list)
using the libsandbox libraries, though at the present time the documentation is pretty bad
PyPy’s sandboxing
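To make the Docker option concrete, here is a rough sketch (the image name, flags and limits are just one possible setup, not a hardened sandbox):

```python
import subprocess

def run_untrusted(code, timeout=5):
    # Throwaway container: no network, capped memory and CPU, removed when done.
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",
        "--memory", "64m",
        "--cpus", "0.5",
        "python:3-alpine",
        "python", "-c", code,
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)

result = run_untrusted("print(sum(range(10)))")
print(result.returncode, result.stdout)
```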
Sneklang is a strict subset of Python that is safely evaluated in your provided scope.
It is limited by scope size and by the number of node evaluation steps, and it protects against infinite loops, stack overflows, and excessive memory usage.
There is an online sandbox as well: https://sneklang.functup.com
I've made this project specifically because I had the same requirements.

Bad practice to have ORMs with NoSQL stores?

I use Redis (redis-py) inside my Python platform. Recently it was suggested that I switch to an ORM.
E.g.: python-stdnet, rom or redisco
Is use of ORMs considered bad practice in the NoSQL world?
Ultimately the question boils down to: at what layer do you want to write code?
Do you want to write code that manipulates data structures in a remote database, or do you want to write higher-level code that uses the abstractions built on top of those data structures? You can think of it as the same question people ask about relational databases: do you want to write SQL, or do you want to write higher-level code?
Personally, despite using rom myself for a variety of tasks (I am the author), I also directly manipulate Redis in the same projects where it makes sense.
Comments pointing out that the R in ORM is for relational are technically correct. That doesn't mean there aren't valid uses and reasons for libraries that abstract redis away.
There are some great libraries that make interfacing with Redis feel much nicer and more idiomatic to the language you are using. For Ruby, libraries like ohm or redis-native_hash (disclosure: I wrote that one) do just that. For Python there are tools like redisco and surely others. These make persisting objects to Redis very simple and make working with Redis feel much more Ruby-ish or Python-ish.
Here are a few more benefits of using even the most basic abstraction, like a very thin wrapper you might write and keep in your application (a sketch follows the list below):
Switching redis clients will be easier. Maybe you'll never do this, but if you did, changing your calls to redis in one place (your wrapper) is much simpler than changing them everywhere you use redis.
Implementing things you might need for scaling, like sharding or connection pooling, is likely going to be easier if your calls are made through some abstraction.
Replacing redis with some other key/value store or data structure server would be simpler if an abstraction is in place.
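To make that concrete, here is what such a thin wrapper might look like (the key layout and method names are illustrative; hset(mapping=...) requires redis-py 3.5+):

```python
import redis

class UserStore(object):
    # Every direct redis call lives here, so swapping clients, adding pooling
    # or sharding later only touches this one module.
    def __init__(self, client=None):
        self.client = client or redis.Redis(host="localhost", port=6379, db=0)

    def save_user(self, user_id, fields):
        self.client.hset("user:%s" % user_id, mapping=fields)

    def get_user(self, user_id):
        return self.client.hgetall("user:%s" % user_id)

store = UserStore()
store.save_user(42, {"name": "alice", "plan": "free"})
print(store.get_user(42))
```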
I'm not advocating using an object-mapping library or building your own abstraction, just pointing out that there are valid reasons why you would. It's up to you to evaluate your needs and pick what works best for you. There is nothing wrong with calling Redis directly either.

Pythonic thread-safe object

After reading a lot on this subject and discussing on IRC, the response seems to be: stay away from threads. Sorry for repeating this question, my intention is to go deeper in the subject by not accepting the "threading is evil" answer, with the hope to find a common solution.
EDIT: Just Say No to the combined evils of locking, deadlocks, lock granularity, livelocks, nondeterminism and race conditions. --Guido van Rossum
I'm developing a Python web application, and I'd like to create a global object for each user which is only accessible by the current user (for example, the requested URI).
The suggested way is to pass the object around, which IMO makes the application harder to maintain and doesn't make for beautiful code if I need the same value in different places (some of which might be 3rd-party plugins).
I see that many popular frameworks (Django, CherryPy, Flask) use Python threading locks to solve the issue.
If all these frameworks go against the Pythonic way and feel the need to create a globally accessible object, it means that the community needs this sort of thing. And me too.
Is the "best" way to pass objects around?
Is the only alternative solution to use the "evil" threading locks?
Would it be more Pythonic to store this information in a database or memcached?
Thanks in advance!
If you don't want to lock, then either don't use globals, or use thread-local storage (in a webapp, you can be fairly sure that a request won't cross thread boundaries). If global state can be avoided, it should be avoided. This makes multi-threading way easier to implement and debug.
I also disagree that passing objects around makes the application harder to maintain — it's usually the other way around — global state hides dependencies in addition to requiring careful synchronisation.
Well, there are also lock-free approaches, like STM or whatnot, but they are probably overkill for a web application.
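A tiny demonstration of the thread-local option (no locks needed, because each thread only ever sees the value it set itself):

```python
import threading

request_context = threading.local()

def handle(uri):
    request_context.uri = uri   # visible only to the thread that set it
    print(threading.current_thread().name, "->", request_context.uri)

threads = [threading.Thread(target=handle, args=("/page/%d" % i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```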

How would one make Python objects persistent in a web-app?

I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which do not change often, plus user-specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether a MySQL socket or a named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
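A small example of what the shared ctypes support looks like (this only applies to processes started via multiprocessing itself):

```python
from multiprocessing import Process, Value, Array

def worker(counter, scores):
    with counter.get_lock():      # Value carries its own lock for safe updates
        counter.value += 1
    scores[0] = 99.5

if __name__ == "__main__":
    counter = Value("i", 0)             # shared int in shared memory
    scores = Array("d", [0.0] * 5)      # shared array of doubles
    p = Process(target=worker, args=(counter, scores))
    p.start()
    p.join()
    print(counter.value, scores[:])     # -> 1 [99.5, 0.0, 0.0, 0.0, 0.0]
```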
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but lately a standalone package has also become available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read it but haven't given it a shot myself though.
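For a feel of the API, a minimal ZODB session looks roughly like this (the file name and keys are illustrative):

```python
import transaction
from ZODB import FileStorage, DB

storage = FileStorage.FileStorage("appdata.fs")   # on-disk storage file
db = DB(storage)
connection = db.open()
root = connection.root()                          # the "magic bag" dictionary

root["algorithm_state"] = {"user_42": {"score": 17}}
transaction.commit()                              # persist the change

print(root["algorithm_state"])
connection.close()
db.close()
```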
Another possibility could be an in-memory SQLite DB, which may speed up the process a bit (being an in-memory DB), but you would still have to do the serialization and deserialization work.
Note: an in-memory DB is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not a common web development practice. Even when multi-threading is used, web applications are designed to be able to run in multi-processing environments, for both scalability and easier deployment.
If you only need to initialize a large object and do not need to change it later, you can do this easily by using a global variable that is initialized while your WSGI application is being created, or while the module containing the object is being loaded, etc.; multi-processing will do fine for you.
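As a sketch of that read-only case (the loader function is hypothetical): build the object once at module import, and every request handled by that worker process reuses it:

```python
from myapp.loader import build_reference_data   # hypothetical, expensive to run

# Built once per worker process when the WSGI module is imported.
REFERENCE_DATA = build_reference_data()

def application(environ, start_response):
    body = ("known keys: %d\n" % len(REFERENCE_DATA)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```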
If you need to change the object and access it from every thread, you need to make sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. a single process. Any multi-threaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance, which is likely to make all the benefits go away.
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of a persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like, if the serialisation is the bottleneck, the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the reddit guys discuss what they use for state, so that may be useful to listen to.
