If I don't need transactions, can I reuse the same database connection for multiple requests?
Flask documentation says:
Because database connections encapsulate a transaction, we also need to make sure that only one request at a time uses the connection.
Here's how I understand the meaning of the above sentence:
A Python DB-API connection can only handle one transaction at a time; to start a new transaction, one must first commit or roll back the previous one. So if each of our requests needs its own transaction, then of course each request needs its own database connection.
Please let me know if I got it wrong.
But let's say I set autocommit mode, and handle each request in a single SQL statement. Or, alternatively, let's say I only read - not write - to the database. In either case, it seems I can just reuse the same database connection for all my requests to save the overhead of multiple connections. But I'm not sure if there's any downside to this approach.
Edit: I can see one issue with what I'm proposing: each request might be handled by a different process. Since connections should probably not be reused across processes, let me clarify my question: I mean creating one connection per process, and using it for all requests that happen to be handled by this process.
On the other hand, the whole point of (green or native) threads is usually to serve one request per thread, so my proposed approach implies sharing connection across threads. It seems one connection can be used concurrently in multiple native threads, but not in multiple green threads.
So let's say for concreteness my environment is flask + gunicorn with multiple multi-threaded sync workers.
Based on @Craig Ringer's comment on a different question, I think I know the answer.
The only possible advantage of connection sharing is performance (other factors - like transaction encapsulation and simplicity - favor a separate connection per request). And since a connection can't be shared across processes or green threads, it only has a chance with native threads. But psycopg2 (and presumably other drivers) doesn't allow truly concurrent access through the same connection. So unless each request spends very little time talking to the database, there is likely a performance hit, not a benefit, from connection sharing.
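For completeness, here is a minimal sketch of the more conventional alternative: a small connection pool per worker process, with each request (and therefore each thread) checking out its own connection. It assumes psycopg2 and Flask; the DSN, route, and table name are placeholders.

from flask import Flask, jsonify
from psycopg2.pool import ThreadedConnectionPool

app = Flask(__name__)

# One pool per worker process. If gunicorn preloads the app before forking,
# you may prefer to create the pool lazily so each process gets its own.
pool = ThreadedConnectionPool(minconn=1, maxconn=5,
                              dsn="dbname=example user=example")

@app.route("/items")
def items():
    conn = pool.getconn()              # one connection per request/thread
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id FROM items")
            ids = [row[0] for row in cur.fetchall()]
        conn.commit()                  # each request gets its own transaction
        return jsonify(ids)
    finally:
        pool.putconn(conn)             # return it; never share it concurrently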
Related
I am having a hard time trying to figure out the big picture of the handling of multiple requests by the uwsgi server with a django or pyramid application.
My understanding at the moment is this:
When multiple HTTP requests are sent to the uwsgi server concurrently, the server creates separate processes or threads (copies of itself) for the requests (or assigns the requests to existing ones), and every process/thread loads the web application's code (say django or pyramid) into memory, executes it, and returns the response. In between, every copy of the code can access the session, the cache, or the database. There is usually a separate database server, and it can also handle concurrent requests to the database.
So here some questions I am fighting with.
Is my above understanding correct or not?
Do the copies of the code interact with each other somehow, or are they wholly separate from each other?
What about the session or cache? Are they shared between them or are they local to each copy?
How are they created: by the webserver or by copies of python code?
How are responses returned to the requesters: by each process concurrently, or are they put into some kind of queue and sent synchronously?
I have googled these questions and have found very interesting answers on StackOverflow, but I still can't get the whole picture, and the whole process remains a mystery to me. It would be fantastic if someone could explain the whole picture in terms of django or pyramid with uwsgi or whatever webserver.
Sorry for asking kind of dumb questions, but they really torment me every night and I am looking forward to your help:)
There's no magic in pyramid or django that gets you past process boundaries. The answers depend entirely on the particular server you've selected and the settings you've selected. For example, uwsgi has the ability to run multiple threads and multiple processes. If uwsgi spins up multiple processes then they will each have their own copies of data, which are not shared unless you take the time to set up some IPC (this is why you should keep state in a third party like a database instead of in in-memory objects, which are not shared across processes). Each process initializes a WSGI object (let's call it app) which the server calls via body_iter = app(environ, start_response). This app object is shared across all of the threads in the process and is invoked concurrently, thus it needs to be threadsafe (usually the structures the app uses are either threadlocal or readonly to deal with this, for example a connection pool to the database).
In general the answers to your questions are that things happen concurrently, and objects may or may not be shared based on your server model but in general you should take anything that you want to be shared and store it somewhere that can handle concurrency properly (a database).
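To make the shape of this concrete, here is a bare-bones sketch of a WSGI callable that holds a thread-safe connection pool. It is only an illustration of the threading model described above; the database URL and query are placeholders, and it assumes SQLAlchemy for the pool.

from sqlalchemy import create_engine, text

# Created once per process; SQLAlchemy's default QueuePool is thread-safe,
# so the same engine can be used from every thread in the process.
engine = create_engine("postgresql://user:pass@localhost/example", pool_size=5)

def app(environ, start_response):
    # The server calls this concurrently from multiple threads; each call
    # borrows its own connection from the shared pool.
    with engine.connect() as conn:
        count = conn.execute(text("SELECT count(*) FROM users")).scalar()
    body = ("users: %d" % count).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]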
The power and weakness of webservers is that they are in principle stateless. This enables them to be massively parallel. So indeed for each page request a different thread may be spawned. Whether or not this actually happens depends on the configuration settings of your webserver. There's also a cost to spawning many threads, so if possible threads are reused from a thread pool.
Almost all serious webservers have a page cache. So if the same page is requested multiple times, it can be retrieved from the cache. In addition, browsers do their own caching. A webserver has to be clever about what to cache and what not. Static pages aren't a big problem, although they may be replaced, in which case it is quite confusing to still be served the old page from the cache.
One way to defeat the cache is by adding (dummy) parameters to the page request.
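For example, a client can make each request's URL unique so caches treat it as a new resource; a small illustration (the URL below is just a placeholder):

import time
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# Append a changing dummy parameter so intermediate caches see a new URL.
url = "http://example.com/page?nocache=%d" % int(time.time() * 1000)
html = urlopen(url).read()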
The statelessness of the web was initially welcomed as a necessity for achieving scalability, where webpages of busy sites could even be served concurrently from different servers at nearby or remote locations.
However, the trend is toward stateful apps. State can be maintained in the browser, easing the burden on the server. If it's maintained on the server, it requires the server to know 'who's talking'. One way to do this is by saving and recognizing cookies (small identifiable bits of data) on the client.
For databases the story is a bit different. As soon as anything gets stored that relates to a particular user, the application is in principle stateful. While there's no conceptual difference between retaining state on disk and in RAM, traditionally statefulness was left to the database, which in turn used thread pools and load balancing to do its job efficiently.
With the advent of very large internet shops like amazon and google, mandatory disk access to achieve statefulness created a performance problem. The answer was in-memory databases. While they may be accessed traditionally using e.g. SQL, they offer much more flexibility in the way data is stored conceptually.
A type of database that enjoys growing popularity is the persistent object store. With this kind of database, while the distinction can still be made formally, the boundary between webserver and database is blurred. Both have their data in RAM (but can swap to disk if needed), and both work with objects rather than flat records as in SQL tables. These objects can be interconnected in complex ways.
In short, there's an explosion of innovative storage / thread pooling / caching / persistence / redundancy / synchronisation technology, driving what has become popularly known as 'the cloud'.
I am working on an online judge. I am using Python 2.7 and MySQL (as I am working on the back-end part).
My Method:
I create a main thread which pulls submissions from the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them, and write the results back to the database.
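A rough sketch of what I mean (table names, credentials, and the evaluate() function are simplified placeholders):

import threading
import time
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7, which I am using
import MySQLdb              # or any other DB-API MySQL driver

tasks = queue.Queue(maxsize=50)

def evaluate(code):
    return "accepted"       # stand-in for the actual judging logic

def producer():
    conn = MySQLdb.connect(host="localhost", user="judge",
                           passwd="secret", db="judge")
    while True:
        cur = conn.cursor()
        cur.execute("SELECT id, code FROM submissions "
                    "WHERE status = 'queued' LIMIT 10")
        rows = cur.fetchall()
        for sub_id, code in rows:
            cur.execute("UPDATE submissions SET status = 'assessing' "
                        "WHERE id = %s", (sub_id,))
            tasks.put((sub_id, code))
        conn.commit()
        cur.close()
        if not rows:
            time.sleep(1)   # nothing queued; poll again shortly

def worker():
    conn = MySQLdb.connect(host="localhost", user="judge",
                           passwd="secret", db="judge")   # one per thread
    while True:
        sub_id, code = tasks.get()
        verdict = evaluate(code)
        cur = conn.cursor()
        cur.execute("UPDATE submissions SET status = %s WHERE id = %s",
                    (verdict, sub_id))
        conn.commit()
        cur.close()
        tasks.task_done()

threading.Thread(target=producer).start()
for _ in range(4):
    threading.Thread(target=worker).start()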
Now I have some doubts (I know they are doubts on different topics, but an approach to any of them is also highly appreciated).
Currently, when I start the threads, I give each of them its own db connection, which it uses. Is it good practice to give one connection per thread? Does sharing connections between threads create problems? How do I go about this?
My main thread uses a single connection, as its only work is to pull submissions from the db and put them in the queue (and also update their status in the db to 'Assessing Submission'). But sometimes I get the error: Lost connection to MySQL server while querying. I keep getting it even when I stop the program and start it again. What do I do about it? Also, should I implement a pool of connections for only the main thread?
Also, does a db connection stay alive forever? What do I do when its session memory etc. gets exhausted, and how do I handle that?
Use a connection pool. Sharing the database connection is not always bad but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it used too many resources, etc. A connection pool could help you solve this.
It all depends; theoretically it could stay alive indefinitely, but usually there is a timeout somewhere.
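As a sketch of what the pooled setup can look like with SQLAlchemy (the URL and table are placeholders; pool_recycle is what guards against stale connections being handed back out):

from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql://judge:secret@localhost/judge",
    pool_size=5,          # connections kept open per process
    max_overflow=5,       # temporary extras allowed under load
    pool_recycle=3600,    # replace connections older than an hour
)

def mark_assessing(sub_id):
    # Each call checks a connection out of the pool and returns it afterwards.
    with engine.begin() as conn:
        conn.execute(text("UPDATE submissions SET status = 'assessing' "
                          "WHERE id = :id"), {"id": sub_id})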
If you give the same connection to every thread, then the threads will not be able to query the database safely and race conditions will occur. So you need to provide a separate connection to every thread, and that is indeed a good idea. Use a connection pool for this purpose; it will help you get different connections.
A connection pool will surely help.
Release the connection once your work is over. There is a limit on how long a connection can be held, which is termed the connection timeout. You may want to use a third-party library to handle that; c3p0 (a Java library) is a good one that can help with this.
Please refer the below link to configure it:
Best configuration of c3p0
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list, and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
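A small sketch of the safer pattern: rather than sharing one connection with check_same_thread=False, each thread opens its own sqlite3 connection to the same file (file and table names are placeholders):

import sqlite3
import threading

DB_PATH = "results.db"

# One-time setup so the example is self-contained.
setup = sqlite3.connect(DB_PATH)
setup.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
setup.commit()
setup.close()

def worker(rows):
    conn = sqlite3.connect(DB_PATH, timeout=30)   # private to this thread
    with conn:                                    # commits, or rolls back on error
        conn.executemany("INSERT INTO results (value) VALUES (?)", rows)
    conn.close()

threads = [threading.Thread(target=worker, args=([(i,)],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()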
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you; then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to confirm that it is really the sqlite3 connection causing your problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connections without issues, but your description exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course left an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
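A sketch of that retry idea, in case it helps (fetch_from_api is just a placeholder for whatever function actually hits the web API):

import time

def fetch_with_retry(fetch_from_api, url, attempts=3, delay=2.0):
    last_error = None
    for _ in range(attempts):
        try:
            data = fetch_from_api(url)
            if data:                  # treat empty responses as failures too
                return data
        except Exception as exc:      # narrow this to the API's real errors
            last_error = exc
        time.sleep(delay)
    raise RuntimeError("API call failed after %d attempts: %r"
                       % (attempts, last_error))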
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe, using SQLITE, then I would start by writing an async server (using the asynchat module) to handle all of the SQLITE database access, and then I would write the other processes to use that server. When there is only one process accessing the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At a minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
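Not the asynchat server described above, but a simpler sketch of the same principle: one dedicated process owns the SQLite file and applies writes in strict sequence, while every other process only puts work on a queue (file and table names are placeholders):

import sqlite3
from multiprocessing import Process, Queue

def writer(q, db_path="results.db"):
    # The only process that ever touches the database file.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (value INTEGER)")
    while True:
        item = q.get()
        if item is None:                  # sentinel: shut down cleanly
            break
        conn.execute("INSERT INTO results (value) VALUES (?)", (item,))
        conn.commit()
    conn.close()

if __name__ == "__main__":
    q = Queue()
    p = Process(target=writer, args=(q,))
    p.start()
    for value in range(100):              # producers only enqueue work
        q.put(value)
    q.put(None)
    p.join()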
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
At my organization, PostgreSQL databases are created with a 20-connection limit as a matter of policy. This tends to interact poorly when multiple applications are in play that use connection pools, since many of those open up their full suite of connections and hold them idle.
As soon as there are more than a couple of applications in contact with the DB, we run out of connections, as you'd expect.
Pooling behaviour is a new thing here; until now we've managed pooled connections by serializing access to them through a web-based DB gateway (?!) or by not pooling anything at all. As a consequence, I'm having to explain (literally, 5 trouble tickets from one person over the course of the project) over and over again how the pooling works.
What I want is one of the following:
A solid, inarguable rationale for increasing the number of available connections to the database in order to play nice with pools.
If so, what's a safe limit? Is there any reason to keep the limit to 20?
A reason why I'm wrong and we should cut the size of the pools down or eliminate them altogether.
For what it's worth, here are the components in play. If it's relevant how one of these is configured, please weigh in:
DB: PostgreSQL 8.2. No, we won't be upgrading it as part of this.
Web server: Python 2.7, Pylons 1.0, SQLAlchemy 0.6.5, psycopg2
This is complicated by the fact that some aspects of the system access data using the SQLAlchemy ORM with a manually configured engine, while others access data using a different engine factory (still SQLAlchemy) written by one of my associates that wraps the connection in an object that matches an old PHP API.
Task runner: Python 2.7, celery 2.1.4, SQLAlchemy 0.6.5, psycopg2
I think it's reasonable to require one connection per concurrent activity, and it's reasonable to assume that concurrent HTTP requests are concurrently executed.
Now, the number of concurrent HTTP requests you want to process should scale with a) the load on your server, and b) the number of CPUs you have available. If all goes well, each request will consume CPU time somewhere (in the web server, in the application server, or in the database server), meaning that you couldn't process more requests concurrently than you have CPUs. In practice, it's not that all goes well: some requests will wait for IO at some point, and not consume any CPU. So it's ok to process some more requests concurrently than you have CPUs.
Still, assuming that you have, say, 4 CPUs, allowing 20 concurrent requests is already quite some load. I'd rather throttle HTTP requests than increase the number of requests that can be processed concurrently. If you find that a single request needs more than one connection, you have a flaw in your application.
So my recommendation is to cope with the limit, and make sure that there are not too many idle connections (compared to the number of requests that you are actually processing concurrently).
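A sketch of what coping with the limit can look like in SQLAlchemy, sizing each application's pool explicitly so the total stays under the 20-connection cap (the URL and the web/celery split are illustrative numbers only):

from sqlalchemy import create_engine

# e.g. 3 web processes * (4 + 1) = 15 connections worst case ...
web_engine = create_engine(
    "postgresql://app:secret@dbhost/app",
    pool_size=4,        # connections held open per process
    max_overflow=1,     # short-lived extras under burst load
    pool_timeout=10,    # fail fast instead of piling up waiters
)

# ... leaving 5 for five celery worker processes, one each.
task_engine = create_engine(
    "postgresql://app:secret@dbhost/app",
    pool_size=1,
    max_overflow=0,
)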
I recall hearing that the connection process in mysql was designed to be very fast compared to other RDBMSes, and that therefore using a library that provides connection pooling (SQLAlchemy) won't actually help you that much if you enable the connection pool.
Does anyone have any experience with this?
I'm leery of enabling it because of the possibility that if some code does something stateful to a db connection and (perhaps mistakenly) doesn't clean up after itself, that state which would normally get cleaned up upon closing the connection will instead get propagated to subsequent code that gets a recycled connection.
There's no need to worry about residual state on a connection when using SQLA's connection pool, unless your application is changing connection-wide options like transaction isolation levels (which generally is not the case). SQLA's connection pool issues a connection.rollback() on the connection when it's checked back in, so that any transactional state or locks are cleared.
It is possible that MySQL's connection time is pretty fast, especially if you're connecting over unix sockets on the same machine. If you do use a connection pool, you also want to ensure that connections are recycled after some period of time, as the MySQL server will automatically shut down connections that have been idle for more than 8 hours by default (in SQLAlchemy this is the pool_recycle option).
You can quickly do some benchmarking of connection pool vs. no pool with a SQLA application by changing the pool implementation from the default QueuePool to NullPool, which is a pool implementation that doesn't actually pool anything - it connects and disconnects for real when the proxied connection is acquired and later closed.
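A rough sketch of that bench (the URL is a placeholder; only the pool class differs between the two engines):

import time
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

def bench(engine, n=200):
    start = time.time()
    for _ in range(n):
        with engine.connect() as conn:   # pool checkout, or a real connect
            conn.execute(text("SELECT 1"))
    return time.time() - start

url = "mysql://user:secret@localhost/test"
pooled = create_engine(url)                        # default QueuePool
unpooled = create_engine(url, poolclass=NullPool)  # no pooling at all

print("pooled:   %.3fs" % bench(pooled))
print("unpooled: %.3fs" % bench(unpooled))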
Even if the connection part of MySQL itself is pretty slick, presumably there's still a network connection involved (whether that's loopback or physical). If you're making a lot of requests, that could get significantly expensive. It will depend (as is so often the case) on exactly what your application does, of course - if you're doing a lot of work per connection, then that will dominate and you won't gain a lot.
When in doubt, benchmark - but I would by-and-large trust that a connection pooling library (at least, a reputable one) should work properly and reset things appropriately.
Short answer: you need to benchmark it.
Long answer: it depends. MySQL is fast for connection setup, so avoiding that cost is not, by itself, a strong reason to go for connection pooling. Where you win is when the queries you run are few and fast, because then connection overhead is a larger share of the total time and pooling shows a clear benefit.
The other worry is how the application treats the SQL connection. If it does no SQL transactions, and makes no assumptions about the state of the connection, then pooling won't be a problem. OTOH, code that relies on the closing of the connection to discard temporary tables or to roll back transactions will have a lot of problems with pooling.
The connection pool speeds things up by the fact that you do not have to create a java.sql.Connection object every time you do a database query. I use the Tomcat connection pool to a MySQL database for web applications that do a lot of queries; during high user load there is a noticeable speed improvement.
I made a simple RESTful service with Django and tested it with and without connection pooling. In my case, the difference was quite noticeable.
In a LAN, without it, response time was between 1 and 5 seconds. With it, less than 20 ms.
Results may vary, but the configuration I'm using for the MySQL & Apache servers is pretty standard low-end.
If you're serving UI pages over the internet the extra time may not be noticeable to the user, but in my case it was unacceptable, so I opted for using the pool. Hope this helps you.