I've been writing a Python web app (in Flask) for a while now, and I don't believe I fully grasp how database access should work across multiple request/response cycles. Prior to Python my web programming experience was in PHP (several years worth) and I'm afraid that my PHP experience is misleading some of my Python work.
In PHP, each new request creates a brand new DB connection, because nothing is shared across requests. The more requests you have, the more connections you need to support. However, in a Python web app, where there is shared state across requests, DB connections can persist.
So I need to manage those connections, and ensure that I close them. Also, I need to have some kind of connection pool, because if I have just one connection shared across all requests, then requests could block waiting on DB access, if I don't have enough connections available.
Is this a correct understanding? Or have I identified the differences well? In a Python web app, do I need to have a DB connection pool that shares its connections across many requests? And the number of connections in the pool will depend on my application's request load?
I'm using Psycopg2.
Have you looked in to SQLAlchemy at all? It takes care of a lot of the dirty details - it maintains a pool of connections, and reuses/closes them as necessary.
to avoid connection cost, seems that best way is to use pooling.
either directly with psycopg2.pool from docs or via higher level abstraction layer as SQL Alchemy as underlined on comments here.
Related
If I don't need transactions, can I reuse the same database connection for multiple requests?
Flask documentation says:
Because database connections encapsulate a transaction, we also need to make sure that only one request at the time uses the connection.
Here's how I understand the meaning of the above sentence:
Python DB-API connection can only handle one transaction at a time; to start a new transaction, one must first commit or roll back the previous one. So if each of our requests needs its own transaction, then of course each request needs its own database connection.
Please let me know if I got it wrong.
But let's say I set autocommit mode, and handle each request in a single SQL statement. Or, alternatively, let's say I only read - not write - to the database. In either case, it seems I can just reuse the same database connection for all my requests to save the overhead of multiple connections. But I'm not sure if there's any downside to this approach.
Edit: I can see one issue with what I'm proposing: each request might be handled by a different process. Since connections should probably not be reused across processes, let me clarify my question: I mean creating one connection per process, and using it for all requests that happen to be handled by this process.
On the other hand, the whole point of (green or native) threads is usually to serve one request per thread, so my proposed approach implies sharing connection across threads. It seems one connection can be used concurrently in multiple native threads, but not in multiple green threads.
So let's say for concreteness my environment is flask + gunicorn with multiple multi-threaded sync workers.
Based on #Craig Ringer comment on a different question, I think I know the answer.
The only possible advantage of connection sharing is performance (other factors - like transaction encapsulation and simplicity - favor a separate connection per request). And since a connection can't be shared across processes or green threads, it only has a chance with native threads. But psycopg2 (and presumably other drivers) doesn't allow concurrent access from the same connection. So unless each request spends very little time talking to the database, there is likely a performance hit, not benefit, from connection sharing.
I'm working on a distributed system where one process is controlling a hardware piece and I want it to be running as a service. My app is Django + Twisted based, so Twisted maintains the main loop and I access the database (SQLite) through Django, the entry point being a Django Management Command.
On the other hand, for user interface, I am writing a web application on the same Django project on the same database (also using Crossbar as websockets and WAMP server). This is a second Django process accessing the same database.
I'm looking for some validation here. Is anything fundamentally wrong to this approach? I'm particularly scared of issues with database (two different processes accessing it via Django ORM).
Consider that Django, like all WSGI-based web servers, almost always has multiple processes accessing the database. Because a single WSGI process can handle only one connection at a time, it's normal for servers to run multiple processes in parallel when they get any significant amount of traffic.
That doesn't mean there's no cause for concern. You have the database as if data might change between any two calls to it. Familiarize yourself with how Django uses transactions (default is autocommit mode, not atomic requests), and …
and oh, you said sqlite. Yeah, sqlite is probably not the best database to use when you need to write to it from multiple processes. I can imagine that might work for a single-user interface to a piece of hardware, but if you run in to any problems when adding the webapp, you'll want to trade up to a database server like postgresql.
No there is nothing inherently wrong with that approach. We currently use a similar approach for a lot of our work.
I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the mysqldb library, catch exceptions and reconnect when necessary? Or is there an already made solution as part of sqlalchemy or another package? Any ideas appreciated!
I think the right answer is to try and handle the connection errors; it sounds like you'd only be pulling in a much a larger library just for this feature, while trying and catching is probably how it's done, whatever level of the stack it's at. If necessary, you could multithread these things since they're probably IO-bound (i.e. suitable for Python GIL threading as opposed to multiprocessing) and decouple the production and the consumption with a queue, too, which would maybe take some of the load off of the database connection.
I am developing web app on flask, python, sqlalchemy and postgresql.
My question is here regarding concurrency handling in this app.
How I wrote the app :
I take the example of adding user in database. I post the form and a view is called. I process all the form data and then call add_user(*arg) which uses sqlalchemy code to insert user in database and returns on successful execution and I return the response from the view.
What I assumed:
Ok now I assumed that my web server (which I have not decided yet) will either spawn a thread or a process if two users are trying to signup at the same time and will handle all the concurreny requirements.
Do i need to write threaded code here? By threaded code I mean that before writing I acquire a lock and after write release it.
I am pretty new to web development and multithreading/multiprocessing programing and would like some guidance on how write web app which can handle concurrency well.
Writing concurrency handling from start is right or this thought should come when a large number of concurrent users are using the webapp. Even If it should be done later I would like some pointers about it.
Basically I have no idea about concurrency part of webapp development. If you can point to resources from where I can learn more about it would be really helpful.
Flask will execute each request in a separate thread or even in separate processes. The number of threads and processes to spawn is determined by the WSGI server (for example, Apache with mod_wsgi).
If you use SQLAlchemy ScopedSessions, the session is perfectly thread-safe. You must not share ORM-controlled objects across threads (but in the large majority of cases, you won't let your objects live longer than a request anyway so this is usually not a concern).
In other words, as long as you don't intend to share state between requests other than through the database or cookies, you don't need to worry about concurrency issues. You don't need to create a lock for writing to the database.
If you create your own long-lived objects within your application, which you most likely don't need to do, and if those objects communicate or share state with request handling code, then you must take appropriate precautions to avoid synchronization issues (race conditions, deadlocks, use of libraries that are not thread-safe, etc.)
At my organization, PostgreSQL databases are created with a 20-connection limit as a matter of policy. This tends to interact poorly when multiple applications are in play that use connection pools, since many of those open up their full suite of connections and hold them idle.
As soon as there are more than a couple of applications in contact with the DB, we run out of connections, as you'd expect.
Pooling behaviour is a new thing here; until now we've managed pooled connections by serializing access to them through a web-based DB gateway (?!) or by not pooling anything at all. As a consequence, I'm having to explain (literally, 5 trouble tickets from one person over the course of the project) over and over again how the pooling works.
What I want is one of the following:
A solid, inarguable rationale for increasing the number of available connections to the database in order to play nice with pools.
If so, what's a safe limit? Is there any reason to keep the limit to 20?
A reason why I'm wrong and we should cut the size of the pools down or eliminate them altogether.
For what it's worth, here are the components in play. If it's relevant how one of these is configured, please weigh in:
DB: PostgreSQL 8.2. No, we won't be upgrading it as part of this.
Web server: Python 2.7, Pylons 1.0, SQLAlchemy 0.6.5, psycopg2
This is complicated by the fact that some aspects of the system access data using SQLAlchemy ORM using a manually configured engine, while others access data using a different engine factory (Still sqlalchemy) written by one of my associates that wraps the connection in an object that matches an old PHP API.
Task runner: Python 2.7, celery 2.1.4, SQLAlchemy 0.6.5, psycopg2
I think it's reasonable to require one connection per concurrent activity, and it's reasonable to assume that concurrent HTTP requests are concurrently executed.
Now, the number of concurrent HTTP requests you want to process should scale with a) the load on your server, and b) the number of CPUs you have available. If all goes well, each request will consume CPU time somewhere (in the web server, in the application server, or in the database server), meaning that you couldn't process more requests concurrently than you have CPUs. In practice, it's not that all goes well: some requests will wait for IO at some point, and not consume any CPU. So it's ok to process some more requests concurrently than you have CPUs.
Still, assuming that you have, say, 4 CPUs, allowing 20 concurrent requests is already quite some load. I'd rather throttle HTTP requests than increasing the number of requests that can be processed concurrently. If you find that a single request needs more than one connection, you have a flaw in your application.
So my recommendation is to cope with the limit, and make sure that there are not too many idle connections (compared to the number of requests that you are actually processing concurrently).