I plan to use Python at my job to connect directly to our main production database. However, the IT department is reluctant: apparently there is no easy way to control how much I query the database, so they are worried I could affect performance for the rest of the users.
Is there a way to limit the frequency of queries from a Python connection to the database? Or some other method that I can "sell" to my IT department so they will let me connect directly to the production DB via Python?
Many thanks
Database Resource Manager gives you quite a few options for this, depending on how the load you will be adding compares to existing production usage. This does not depend on the type of client.
https://blogs.oracle.com/db/oracle-resource-manager-and-dbmsresourcemanager
Often a plan is created in which an order of resource usage is specified: regular production gets most of the resources, and your project is placed in a class below it. While production is running, your session(s) get whatever is left over.
Also very useful is the cost estimation, which allows a query deemed too expensive to be cancelled.
Some thought must be given to slow, long-running transactions that hold blocking locks. It does take a bit of experimentation to get this right.
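Independent of the server-side resource plan, a simple client-side throttle can also be offered to IT as a good-faith measure. The sketch below is illustrative only: it wraps a generic DB-API connection (e.g. a cx_Oracle connection), and the one-query-every-two-seconds budget is an assumption, not something Resource Manager requires.

import time

class ThrottledConnection:
    """Wrap a DB-API connection so queries run at most once per min_interval seconds."""

    def __init__(self, connection, min_interval=2.0):  # assumed budget
        self.connection = connection
        self.min_interval = min_interval
        self._last_query = 0.0

    def execute(self, sql, params=None):
        # Sleep until enough time has passed since the previous query.
        wait = self.min_interval - (time.monotonic() - self._last_query)
        if wait > 0:
            time.sleep(wait)
        self._last_query = time.monotonic()
        cursor = self.connection.cursor()
        cursor.execute(sql, params or [])
        return cursor

Every query issued through such a wrapper then waits its turn, which is easy to demonstrate to the DBAs.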
I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that a connection represents just that: a connection to a database. Whatever you pass into it comes back out pretty much as-is. It won't keep track of your transactions unless you tell it to, and it won't care about the order in which you send it data. So if it matters that you create your Widget object before you create your Sprocket object, you had better call that in a thread-safe context. The same generally goes for keeping track of a database transaction.
A Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions; without a way to know that everything you want is inside one transaction, you could very well end up committing in one thread while you expect to be able to add another object (or several) in another.
In case you don't know what a transaction is, Wikipedia covers it in depth, but the short version is that transactions help keep your data consistent. If you have 15 inserts and updates, and insert 15 fails, you might not want to keep the other 14. A transaction lets you cancel the entire operation in bulk.
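If the service does go multithreaded, the usual belt-and-braces pattern is to share the Engine (and its connection pool) globally but check out a short-lived connection per request. A minimal sketch, with the URI and query as placeholders:

from sqlalchemy import create_engine, text

# The Engine and its pool are safe to share across threads.
engine = create_engine("postgresql://user:pass@host/dbname")  # placeholder URI

def handle_request():
    # Each request checks a connection out of the pool and returns it on exit.
    with engine.connect() as connection:
        for result in connection.execute(text("SELECT 1")):  # placeholder query
            pass  # <do something fancy>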
First, the server setup:
nginx frontend to the world
gunicorn running a Flask app with gevent workers
Postgres database, connection pooled in the app, running from Amazon RDS, connected with psycopg2 patched to work with gevent
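For reference, the "psycopg2 patched to work with gevent" part is commonly done with the psycogreen helper; the snippet below is a sketch of that setup, assuming that is the library in use here, applied before the app is imported (for example in the gunicorn config or a gevent worker bootstrap):

from gevent import monkey
monkey.patch_all()

from psycogreen.gevent import patch_psycopg
patch_psycopg()  # make psycopg2 yield to the gevent hub while waiting on the socket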
The problem I'm encountering is inexplicably slow queries that sometimes run on the order of 100 ms (ideal) but often spike to 10 seconds or more. While time is a parameter in the query, the difference between fast and slow queries happens much more frequently than a change in the result set. This doesn't seem to be tied to any meaningful spike in CPU usage, memory usage, read/write I/O, request frequency, etc. It seems to be arbitrary.
I've tried:
Optimizing the query - definitely valid, but it runs quite well locally, as well as any time I've tried it directly on the server through psql.
Running on a larger/better RDS instance - I'm currently working on an m3.medium instance with PIOPS and not coming close to that read rate, so I don't think that's the issue.
Tweaking the number of gunicorn workers - I thought this could be an issue, if the psycopg2 driver is having to context switch excessively, but this had no effect.
More - I've been at this for a decent amount of time, so these were just a couple of the things I've tried.
Does anyone have ideas about how to debug this problem?
This is what shared tenancy gets you: unpredictable results.
What is the size of the data set the queries run on? Although Craig says it sounds like bursty checkpoint activity, that doesn't make sense because this is RDS. It sounds more like cache fallout, e.g. your relations are falling out of cache.
You say you are running PIOPS, but m3.medium is not an EBS-optimized instance.
You need at least:
A higher instance class. Make sure you have more memory than the active data set.
EBS optimized instances, see here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
Lots of memory.
PIOPS
By the time you have all of that, you will realize you could save a ton of money by pushing PostgreSQL (or any database) onto bare metal and leaving AWS to what it is good at: memory and CPU (not I/O).
You could try this from within psql to get more details on actual query timing:
EXPLAIN ANALYZE sql_statement
Also turn on more database logging. MySQL has slow query analysis; PostgreSQL's equivalent is the log_min_duration_statement setting (plus the auto_explain contrib module), which logs every statement slower than a given threshold.
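If it is easier to capture the plan from the application than from psql, the same thing can be done through psycopg2. A minimal sketch; the connection details and the query are placeholders, and note that EXPLAIN (ANALYZE, BUFFERS) actually executes the statement, so run it against something safe:

import psycopg2

conn = psycopg2.connect(host="your-rds-endpoint", dbname="app", user="app", password="...")  # placeholders
with conn, conn.cursor() as cur:
    # ANALYZE reports real execution times; BUFFERS shows cache hits vs. disk reads.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE created_at > now() - interval '1 hour'")
    for (line,) in cur.fetchall():
        print(line)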
Recently our testing system is experiencing the error:
Error connecting to database: (1040, 'Too many connections')
Due to memory limitations we cannot increase the current max_connections much, and adding more memory to the server is only a short-term solution.
The current DB is MySQL and we're using Python to do the database programming. Basically, the Python code controls a test run on a specific node and stores the result into the DB. So: open DB, store data, close DB. Every day we have millions of tests to run and store. I wonder whether there's a way, methodology, or philosophy I could follow from the programming side.
Also, what is the definition of one client connection? Is it counted per machine, or per core in total?
Thanks.
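One common programming-side answer to the open-DB/store/close-DB-per-test pattern described above is to reuse a small, fixed pool of connections instead of opening a fresh one for every test, so the server never sees more than the pool size. This is only a sketch; the driver (mysql-connector-python), the table, and the pool size are assumptions:

import queue
import mysql.connector  # assumed driver

POOL_SIZE = 8  # hard upper bound on connections, no matter how many tests run

pool = queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(mysql.connector.connect(host="db-host", user="tester", password="...", database="results"))

def store_result(node, outcome):
    conn = pool.get()              # block until a connection is free
    try:
        cur = conn.cursor()
        cur.execute("INSERT INTO test_results (node, outcome) VALUES (%s, %s)", (node, outcome))
        conn.commit()
        cur.close()
    finally:
        pool.put(conn)             # return the connection instead of closing it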
I've got a sqlite3 database and I want to write to it from multiple threads. I've got multiple ideas but I'm not sure which I should implement.
create multiple connections, and detect and wait if the DB is locked
use one connection and try to make use of serialized connections (which don't seem to be implemented in Python)
have a background process with a single connection, which collects the queries from all threads and then executes them on their behalf
forget about SQLite and use something like PostgreSQL
What are the advantages of these different approaches, and which is most likely to be fruitful? Are there any other possibilities?
Try to use https://pypi.python.org/pypi/sqlitedict
A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
But take into account "Concurrent requests are still serialized internally, so this "multithreaded support" doesn't give you any performance benefits. It is a work-around for sqlite limitations in Python."
PostgreSQL, MySQL, etc. give you better performance when there are several connections at the same time.
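For reference, a minimal sqlitedict usage sketch (the file name and key are placeholders); every thread writes through the same dict-like object and the library serializes the writes internally:

from sqlitedict import SqliteDict

db = SqliteDict("./results.sqlite", autocommit=True)  # autocommit writes each assignment straight through
db["test-42"] = {"status": "ok"}                      # values are pickled, so most Python objects work
print(db["test-42"])
db.close()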
I used method 1 before. It is the easiest to code. Since that project was a small website and each query took only a few milliseconds, all user requests could be processed promptly.
I also used method 3 before. When queries take longer, it is better to queue them, since frequent "detect and wait" makes no sense; it requires a classic consumer-producer model and therefore more time to code (see the sketch below).
But if the queries are really heavy and frequent, I suggest looking at other DBs like MS SQL/MySQL.
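A minimal sketch of the consumer-producer approach from method 3: a single writer thread owns the only SQLite connection, and every other thread just enqueues its statements. The table and statements are placeholders:

import queue
import sqlite3
import threading

write_queue = queue.Queue()

def writer():
    # The writer thread is the only code that ever touches the connection.
    conn = sqlite3.connect("app.db")
    while True:
        item = write_queue.get()
        if item is None:           # sentinel to shut the thread down
            break
        sql, params = item
        conn.execute(sql, params)
        conn.commit()
    conn.close()

threading.Thread(target=writer, daemon=True).start()

# Any thread can enqueue a write without holding a connection of its own.
write_queue.put(("INSERT INTO events (name) VALUES (?)", ("started",)))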
I am working on a large Django (v1.5.1) application that includes multiple application servers, MySQL servers etc. Before rolling out NewRelic onto all of the servers I want to have an idea of what kind of overhead I will incur per transaction.
If possible, I'd even like to distinguish between the application tracking overhead and the server monitoring overhead; that would be ideal.
Does anyone know of generally accepted numbers for this? Perhaps a site that has done this sort of investigation, or steps so that we can do the investigation on our own.
For the Python agent and monitoring of a Django web application, the overhead per request is driven by how many functions are executed within a specific request that are instrumented. This is because full profiling is not being done. Instead only specific functions of interest are instrumented. It is therefore only the overhead of having a wrapper being executed for that one function call, not nested calls, unless those nested functions were in turn ones which were being instrumented.
Specific functions which are instrumented in Django are the middleware and view handler function, plus template rendering and the function within the template renderer which deals with each template block. Distinct from Django itself, you have instrumentation on the low level database client module functions for executing a query, plus memcache and web externals etc.
What this means is that if execution of a specific web request only passes through 100 instrumented functions, then only those incur the extra overhead. If instead your view handler performs a large number of distinct database queries, or you render a very complicated template, the number of instrumented functions could be a lot higher, and so the overhead for that web request will be higher. That said, if your view handler is doing more work, it would generally have a longer response time than a less complex one anyway.
In other words, the per request overhead is not fixed and depends on how much work is being done, or more specifically how many instrumented functions are invoked. It is not therefore possible to quantify things and give you a fixed per request figure for the overhead.
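To make the per-wrapper cost concrete, here is an illustrative timing wrapper; this is not New Relic's actual implementation, and record_metric is a hypothetical stand-in for the agent's metric aggregation. The point is that each instrumented call pays for roughly one extra function call and two clock reads, so total overhead scales with how many instrumented functions a request passes through:

import functools
import time

def instrumented(func):
    # Roughly what instrumentation adds per call: a wrapper, two clock reads, one metric record.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record_metric(func.__qualname__, time.perf_counter() - start)
    return wrapper

def record_metric(name, duration):
    pass  # hypothetical stand-in for the agent's metric aggregation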
That all said, there will be some overhead and the general target range being aimed at is around 5%.
What generally happens, though, is that the insight gained from having the performance metrics means that for most customers there are usually some quite easy improvements that can be found almost immediately. Having made such changes, response times can quite quickly be brought below what they were before you started monitoring, so you end up ahead of where you started when you had no monitoring. With further digging and tuning, improvements can be even more dramatic. Pay attention to certain aspects of the performance metrics being provided and you can also better tune your WSGI server, perhaps better utilise it, and reduce the number of hosts required, and so reduce your hosting costs.