I am developing a web app with Flask, Python, SQLAlchemy and PostgreSQL.
My question is about concurrency handling in this app.
How I wrote the app:
Take the example of adding a user to the database. I post the form and a view is called. I process all the form data and then call add_user(*arg), which uses SQLAlchemy code to insert the user into the database and returns on successful execution. I then return the response from the view.
What I assumed:
I assumed that my web server (which I have not decided on yet) will spawn either a thread or a process if two users try to sign up at the same time, and that it will handle all the concurrency requirements.
Do I need to write threaded code here? By threaded code I mean acquiring a lock before writing and releasing it after the write.
I am pretty new to web development and multithreaded/multiprocess programming and would like some guidance on how to write a web app that handles concurrency well.
Is it right to write concurrency handling from the start, or should this only be considered once a large number of concurrent users are using the web app? Even if it should be done later, I would like some pointers about it.
Basically I have no idea about the concurrency part of web app development. If you can point me to resources where I can learn more about it, that would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy's scoped_session, each thread gets its own session, so session access is thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway, so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)
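The point about not needing an application-level lock can be sketched with the standard library alone. This is a minimal illustration, not the asker's code: sqlite3 stands in for PostgreSQL, a thread per signup stands in for a threaded WSGI server, and the names init_db, add_user and simulate_signups are hypothetical. Each thread uses its own connection, and a uniqueness constraint in the database decides which of two identical signups wins.

```python
import os
import sqlite3
import tempfile
import threading


def init_db(path):
    """Create the users table; the PRIMARY KEY enforces unique names."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
    conn.close()


def add_user(path, name):
    """Insert a user on a per-call connection; return True if created."""
    conn = sqlite3.connect(path, timeout=10)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        # The database rejected a duplicate; no Python lock was needed.
        return False
    finally:
        conn.close()


def simulate_signups(names):
    """Run one thread per signup, roughly as a threaded server would."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    init_db(path)
    results = []

    def worker(name):
        results.append((name, add_user(path, name)))

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.remove(path)
    return results
```

Calling simulate_signups(["alice", "alice", "bob"]) should report exactly one successful "alice" signup, whichever thread gets there first.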
We have an existing python application (let's call it control app) that does operation data logging as well as smaller controlling tasks on a machine. We want to extend this application with a web interface, which is based on flask (let's call it web app). Both parts, the control app as well as the web app, are already present, however, the setup feels somehow fishy. In the process of rethinking the setup, I'm somehow undecided on how to structure those two parts.
At the moment, the control app gathers machine data and stores it in a postgres database. Based on several machine states, additional operations are performed that provide new input for the PLCs that control the machine.
The web app currently polls the database to react to machine states to e.g. update visualisation data, change some (state representing) images and such things.
The web app polling the database is the part that somehow smells. So my idea was to unify both apps into one, coupling the web app tightly to the control app so that it can react to machine state changes instead of polling the database for them.
Based on that idea, I'm wondering how to add a flask app to an existing python app. If I'm not mistaken, the flask app consumes the application's main thread, which would break the already existing logic. Thus I would need to have one of the two parts running on another thread. Thinking about this problem, I'm further wondering whether this merging is a good idea at all.
So, the questions are: Is it a good idea to merge both applications? If yes, how to merge them without breaking one of them? If not, how else should I try to get rid of the database polling (how to synchronize and also move some data from the web app to the control app)?
It's not a good idea to merge them per se -- problems in one part will affect the other, and this sort of tight coupling is a bad idea both because you can't run the two parts of the program on separate machines and because if one crashes, so does the other one. It's better to have them communicating over some sort of protocol.
If I were designing this, I would probably do the same thing as you did, except that instead of using an SQL database for this, I would use something like Redis which stores its data in memory. Redis allows you to subscribe to events rather than poll for updates, and polling for updates is cheaper because it's in memory.
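As a rough sketch of the subscribe-instead-of-poll idea: the two helpers below assume the redis-py client, where a redis.Redis instance offers publish(channel, message) and a PubSub object offers listen(). The channel name and the function names are hypothetical, and the helpers take the client objects as arguments so the sketch itself has no third-party import.

```python
import json

CHANNEL = "machine-state"  # hypothetical channel name


def publish_state(client, state):
    """Control app side: announce a state change.

    `client` is assumed to be a redis.Redis instance (or anything with a
    compatible publish(channel, message) method).
    """
    return client.publish(CHANNEL, json.dumps(state))


def iter_states(pubsub):
    """Web app side: yield decoded states as they arrive.

    `pubsub` is assumed to be a redis-py PubSub object that has already
    subscribed to CHANNEL. listen() blocks until a message arrives, so
    there is no polling loop here.
    """
    for message in pubsub.listen():
        # Skip subscribe/unsubscribe confirmations.
        if message["type"] == "message":
            yield json.loads(message["data"])
```

With redis-py installed, the web app would typically create the subscriber with pubsub = client.pubsub() followed by pubsub.subscribe(CHANNEL), and run iter_states in a background thread.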
Edit to clarify my question:
I want to attach a python service on uwsgi using this feature (I can't understand the examples) and I also want to be able to communicate results between them. Below I present some context and also present my first thought on the communication matter, expecting maybe some advice or another approach to take.
I have an already developed python application that uses multiprocessing.Pool to run on demand tasks. The main reason for using the pool of workers is that I need to share several objects between them.
On top of that, I want to have a flask application that triggers tasks from its endpoints.
I've read several questions here on SO looking for possible drawbacks of using flask with python's multiprocessing module. I'm still a bit confused but this answer summarizes well both the downsides of starting a multiprocessing.Pool directly from flask and what my options are.
This answer shows an uWSGI feature to manage daemon/services. I want to follow this approach so I can use my already developed python application as a service of the flask app.
One of my main problems is that I look at the examples and do not know what I need to do next. In other words, how would I start the python app from there?
Another problem is about the communication between the flask app and the daemon process/service. My first thought is to use flask-socketIO to communicate, but then, if my server stops I need to deal with the connection... Is this a good way to communicate between server and service? What are other possible solutions?
Note:
I'm well aware of Celery, and I intend to use it in the near future. In fact, I have an already developed node.js app, on which users perform actions that should trigger specific tasks from the (also) already developed python application. The thing is, I need a production-ready version as soon as possible, and instead of modifying the python application, which uses multiprocessing, I thought it would be faster to create a simple flask server to communicate with node.js through HTTP. This way I would only need to implement a flask app that instantiates the python app.
Edit:
Why do I need to share objects?
Simply because the creation of the objects in question takes too long. Actually, the creation takes an acceptable amount of time if done once, but since I'm expecting (maybe) hundreds to thousands of simultaneous requests, having to load every object again is something I want to avoid.
One of the objects is a scikit classifier model, persisted in a pickle file, which takes 3 seconds to load. Each user can create several "job spots", each of which will take over 2k documents to be classified. Each document will be uploaded at an unknown point in time, so I need to have this model loaded in memory (loading it again for every task is not acceptable).
This is one example of a single task.
Edit 2:
I've asked some questions related to this project before:
Bidirectional python-node communication
Python multiprocessing within node.js - Prints on sub process not working
Adding a shared object to a manager.Namespace
As stated, but to clarify: I think the best solution would be to use Celery, but in order to quickly have a production-ready solution, I am trying to use this uWSGI attach-daemon solution.
I can see the temptation to hang on to multiprocessing.Pool. I'm using it in production as part of a pipeline. But Celery (which I'm also using in production) is much better suited to what you're trying to do, which is distribute work across cores to a resource that's expensive to set up. Have N cores? Start N Celery workers, each of which can load (or maybe lazy-load) the expensive model as a global. When a request comes in to the app, launch a task (e.g., task = predict.delay(args)), wait for it to complete (e.g., result = task.get()) and return a response. You're trading a little bit of time learning Celery for not having to write a bunch of coordination code.
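The "lazy-load the expensive model as a global" part can be sketched without Celery. The names below (get_model, classify, load) are hypothetical; load stands for whatever builds the model, for example unpickling the classifier. In a Celery setup, a function like classify would presumably be the body of the task, so each worker process pays the load cost once.

```python
import threading

_model = None                    # one instance per worker process
_model_lock = threading.Lock()   # guards first-time initialisation


def get_model(load):
    """Return the process-wide model, calling load() only on first use."""
    global _model
    if _model is None:
        with _model_lock:
            # Re-check: another thread may have loaded it meanwhile.
            if _model is None:
                _model = load()
    return _model


def classify(documents, load):
    """Classify documents with the cached model (scikit-style predict)."""
    model = get_model(load)
    return model.predict(documents)
```

Every call after the first reuses the already loaded object, which is the property the question asks for.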
I am having a hard time trying to figure out the big picture of the handling of multiple requests by the uwsgi server with django or pyramid application.
My understanding at the moment is this:
When multiple HTTP requests are sent to the uwsgi server concurrently, the server creates separate processes or threads (copies of itself) for every request (or assigns the request to them), and every process/thread loads the web application's code (say django or pyramid) into the computer's memory, executes it and returns the response. In between, every copy of the code can access the session, cache or database. There is usually a separate database server, and it can also handle concurrent requests to the database.
So here are some questions I am struggling with.
Is my above understanding correct or not?
Do the copies of code interact with each other somehow, or are they wholly separated from each other?
What about the session or cache? Are they shared between them or are they local to each copy?
How are they created: by the webserver or by copies of python code?
How are responses returned to the requesters: by each process concurrently, or are they put into some kind of queue and sent synchronously?
I have googled these questions and have found very interesting answers on StackOverflow but anyway can't get the whole picture and the whole process remains a mystery for me. It would be fantastic if someone can explain the whole picture in terms of django or pyramid with uwsgi or whatever webserver.
Sorry for asking kind of dumb questions, but they really torment me every night and I am looking forward to your help:)
There's no magic in pyramid or django that gets you past process boundaries. The answers depend entirely on the particular server you've selected and the settings you've selected. For example, uwsgi has the ability to run multiple threads and multiple processes. If uwsgi spins up multiple processes then they will each have their own copies of data, which are not shared unless you took the time to create some IPC (this is why you should keep state in a third party like a database instead of in in-memory objects, which are not shared across processes). Each process initializes a WSGI object (let's call it app) which the server calls via body_iter = app(environ, start_response). This app object is shared across all of the threads in the process and is invoked concurrently, thus it needs to be threadsafe (usually the structures the app uses are either threadlocal or readonly to deal with this, for example a connection pool to the database).
In general the answers to your questions are that things happen concurrently, and objects may or may not be shared based on your server model but in general you should take anything that you want to be shared and store it somewhere that can handle concurrency properly (a database).
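To make the body_iter = app(environ, start_response) call concrete, here is a minimal WSGI sketch using only the standard library. The counter and the call_app helper are hypothetical illustrations: the counter is per-process state shared by all threads in that process (hence the lock), and a second server process would have its own independent copy.

```python
import threading
from wsgiref.util import setup_testing_defaults

_hits = 0                       # per-process state, NOT shared across processes
_hits_lock = threading.Lock()   # the app may be invoked by several threads


def app(environ, start_response):
    """A WSGI application: called once per request by the server."""
    global _hits
    with _hits_lock:
        _hits += 1
        count = _hits
    body = f"hit {count} in this process\n".encode()
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(path="/"):
    """Invoke the app the way a server would, with a synthetic environ."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body_iter = app(environ, start_response)
    return captured["status"], b"".join(body_iter)
```

Because the count lives in process memory, two worker processes would each report their own numbers, which is why shared state belongs in something like a database.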
The power and weakness of webservers is that they are in principle stateless. This enables them to be massively parallel. So indeed for each page request a different thread may be spawned. Whether or not this actually happens depends on the configuration settings of your webserver. There's also a cost to spawning many threads, so if possible threads are reused from a thread pool.
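Thread reuse from a pool can be observed with a small sketch. It uses concurrent.futures from the standard library as a stand-in for a webserver's own pool; handle_request and serve are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Pretend to serve a request; report which thread did the work."""
    return request_id, threading.current_thread().name


def serve(n_requests, pool_size):
    """Handle n_requests using at most pool_size reusable threads."""
    with ThreadPoolExecutor(max_workers=pool_size,
                            thread_name_prefix="worker") as pool:
        return list(pool.map(handle_request, range(n_requests)))
```

Serving 20 requests with a pool of 4 never involves more than 4 distinct thread names, since finished threads pick up the next request instead of being recreated.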
Almost all serious webservers have page cache. So if the same page is requested multiple times, it can be retrieved from cache. In addition, browsers do their own caching. A webserver has to be clever about what to cache and what not. Static pages aren't a big problem, although they may be replaced, in which case it is quite confusing to still get the old page served because of the cache.
One way to defeat the cache is by adding (dummy) parameters to the page request.
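A small sketch of that trick, using only urllib.parse. The function name bust_cache and the parameter name "_" are arbitrary choices for illustration; any otherwise unused parameter would do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def bust_cache(url, token):
    """Append a dummy query parameter so caches see a new URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_", str(token)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

In practice the token is often a timestamp or a version number, so the URL changes whenever the page does.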
The statelessness of the web was initially welcomed as a necessity to achieve scalability, where webpages of busy sites could even be served concurrently from different servers at nearby or remote locations.
However the trend is to have stateful apps. State can be maintained on the browser, easing the burden on the server. If it's maintained on the server it requires the server to know 'who's talking'. One way to do this is saving and recognizing cookies (small identifiable bits of data) on the client.
For databases the story is a bit different. As soon as anything gets stored that relates to a particular user, the application is in principle stateful. While there's no conceptual difference between retaining state on disk and in RAM, traditionally statefulness was left to the database, which in turn used thread pools and load balancing to do its job efficiently.
With the advent of very large internet shops like amazon and google, mandatory disk access to achieve statefulness created a performance problem. The answer was in-memory databases. While they may be accessed traditionally using e.g. SQL, they offer much more flexibility in the way data is stored conceptually.
A type of database that enjoys growing popularity is the persistent object store. With this kind of database, while the distinction can still be made formally, the boundary between webserver and database is blurred. Both have their data in RAM (but can swap to disk if needed), and both work with objects rather than flat records as in SQL tables. These objects can be interconnected in complex ways.
In short there's an explosion of innovative storage / thread pooling / caching / persistence / redundancy / synchronisation technology, driving what has become popularly known as 'the cloud'.
I'm trying to build a Twisted/Django mashup that will let me control various client connections managed by a Twisted server via Django's admin interface. Meaning, I want to be able to login to Django's admin and see what protocols are currently in use, any details specific to each connection (e.g. if the server is connected to freenode via IRC, it should list all the channels currently connected to), and allow me to disconnect or connect new clients by modifying or creating database records.
What would be the best way to do this? There are lots of posts out there about combining Django with Twisted, but I haven't found any prior art for doing quite what I've outlined. All the Twisted examples I've seen use hardcoded connection parameters, which makes it difficult for me to imagine how I would dynamically run reactor.connectTCP(...) or loseConnection(...) when signalled by a record in the database.
My strategy is to create a custom ClientFactory that solely polls the Django/managed database every N seconds for any commands, and to modify/create/delete connections as appropriate, reflecting the new status in the database when complete.
Does this seem feasible? Is there a better approach? Does anyone know of any existing projects that implement similar functionality?
Polling the database is lame, but unfortunately, databases rarely have good tools (and certainly there are no database-portable tools) for monitoring changes. So your approach might be okay.
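If polling is kept, the per-tick work reduces to comparing what the database asks for with what is currently connected. The sketch below shows only that comparison step as a pure function; plan_changes is a hypothetical name, connections are represented as (host, port) tuples, and the actual connect/disconnect calls and the database query are left out.

```python
def plan_changes(desired, active):
    """Compare database records with live connections.

    Returns (to_connect, to_disconnect) as sorted lists of
    (host, port) tuples.
    """
    desired, active = set(desired), set(active)
    to_connect = sorted(desired - active)
    to_disconnect = sorted(active - desired)
    return to_connect, to_disconnect
```

Under Twisted, a function like this could be run every N seconds (for instance from a twisted.internet.task.LoopingCall), with the results fed to connectTCP and loseConnection.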
However, if your app is in Django and you're not supporting random changes to the database from other (non-Django) clients, and your WSGI container is Twisted, then you can do this very simply by calling reactor.callFromThread(reactor.connectTCP, ...).
I've been working on yet another way of combining django and twisted. Feel free to give it a try: https://github.com/kowalski/featdjango.
The way it works is slightly different from the others. It starts a twisted application and an http site. The requests made to django are processed inside a special thread pool. What makes it special is that these threads can wait on a Deferred, which makes it easy to combine synchronous django application code with asynchronous twisted code.
The reason I came up with a structure like this is that my application needs to perform a lot of http requests from inside the django views. Instead of performing them one by one, I can delegate all of them at once to "the main application thread", which runs twisted, and wait for them. The similarity to your problem is that I also have an asynchronous component, which is a singleton, and I access it from django views.
For example, this is how you would initiate the twisted component and later get a reference to it from the view.
import threading

from django.conf import settings

_initiate_lock = threading.Lock()


def get_component():
    # Fast path: skip the lock once the client has been created.
    if not hasattr(settings, 'YOUR_CLIENT'):
        with _initiate_lock:
            # Another thread might have done our job while we
            # were waiting for the lock, so check again.
            if not hasattr(settings, 'YOUR_CLIENT'):
                # YourComponent and whatever are placeholders for
                # your own class and its constructor arguments.
                client = YourComponent(**whatever)
                # Block this thread until the Deferred returned by
                # client.initiate fires.
                threading.current_thread().wait_for_deferred(
                    client.initiate)
                settings.YOUR_CLIENT = client
    return settings.YOUR_CLIENT
The code above initiates my client and calls the initiate method on it. This method is asynchronous and returns a Deferred. I do all the necessary setup in there. The django thread will wait for it to finish before proceeding to the next line.
This is how I do it, because I only access it from the request handler. You would probably want to initiate your component at startup, to call listenTCP or listenSSL. Then your django request handlers could get the data about the connections just by accessing some public methods on your client. These methods could even return a Deferred, in which case you should use .wait_for_deferred() to call them.
I have a django app which is used for managing registrations to a survey.
There are fixed number of slots and I want to "reserve" slots for users when they sign up.
In one of my views, I get the next available slot and reserve it (or redirect the user if there are no slots available.)
I want to protect against the case where two users signing up at the same time get registered for the same slot because the method "get_next_available_slot" returned the same slot for both users.
For this I am trying to understand the use of processes and threads with Django's views.
1) Is each request handled in a separate thread, and will using a Lock from Python's threading module take care of exclusive access?
2) I am running Apache on RHEL with mod_wsgi. How do I configure Apache/mod_wsgi to get a simple solution for the above situation?
I am not expecting a huge load on the web application at all. So I would like a simpler solution instead of a high performance one.
You should not make assumptions about the thread/process setup of a django application, because it depends on the web server you're using and how django is integrated into it. Therefore, interprocess communication methods should not rely on these details to be reliable. One good solution is using the built-in cache library for locks and shared data.
Here's a good example of a cache lock ensuring only one instance of a celery task is run at a time. You can apply it to regular requests as well.
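A minimal sketch of such a cache lock follows. It relies only on the cache offering an atomic add(key, value, timeout) that returns True when the key was absent, plus delete(key), which is how Django's cache API behaves. The names cache_lock and DictCache are hypothetical, and DictCache is just an in-memory stand-in so the sketch runs without Django.

```python
import threading
from contextlib import contextmanager


@contextmanager
def cache_lock(cache, key, timeout=60):
    """Yield True if this caller acquired the lock, else False."""
    # add() only stores the key if it does not exist yet.
    acquired = cache.add(key, "locked", timeout)
    try:
        yield acquired
    finally:
        # Only the holder may release the lock.
        if acquired:
            cache.delete(key)


class DictCache:
    """Tiny in-memory stand-in exposing add/delete (ignores timeout)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def add(self, key, value, timeout=None):
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)
```

For this to work across processes, the real cache backend has to be shared between them (memcached, for example); a per-process local-memory cache would only protect threads within one process.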
You shouldn't be worrying about this kind of stuff.
These slots are stored in a database right? The database should handle all the locking mechanisms for you, just make sure you run everything under a transaction and you will be fine.
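As a sketch of letting the database do the locking, the example below reserves the next free slot inside a single transaction. It uses sqlite3 from the standard library as a stand-in for the real database, and the table layout and function names (create_slots, reserve_slot, simulate) are assumptions. In sqlite, BEGIN IMMEDIATE takes the write lock up front; in PostgreSQL, SELECT ... FOR UPDATE inside a transaction plays a similar role.

```python
import os
import sqlite3
import tempfile
import threading


def create_slots(path, n_slots):
    """Create n_slots empty slots numbered from 1."""
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("CREATE TABLE slots (id INTEGER PRIMARY KEY, user TEXT)")
        conn.executemany("INSERT INTO slots (id) VALUES (?)",
                         [(i,) for i in range(1, n_slots + 1)])
    conn.close()


def reserve_slot(path, user):
    """Reserve the next free slot; return its id, or None if full."""
    conn = sqlite3.connect(path, timeout=10, isolation_level=None)
    try:
        # Take the write lock before reading, so that two requests
        # cannot both see the same slot as free.
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT id FROM slots WHERE user IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return None
        conn.execute("UPDATE slots SET user = ? WHERE id = ?", (user, row[0]))
        conn.execute("COMMIT")
        return row[0]
    finally:
        conn.close()


def simulate(users, n_slots):
    """Let every user try to reserve a slot at the same time."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    create_slots(path, n_slots)
    results = {}

    def worker(user):
        results[user] = reserve_slot(path, user)

    threads = [threading.Thread(target=worker, args=(u,)) for u in users]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.remove(path)
    return results
```

With five users competing for two slots, exactly two of them get a slot id and the rest get None, regardless of how many threads or processes the server uses.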