First, the server setup:
nginx frontend to the world
gunicorn running a Flask app with gevent workers
Postgres database, connection pooled in the app, running from Amazon RDS, connected with psycopg2 patched to work with gevent
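For reference, this is roughly what that wiring tends to look like in a gunicorn config file; a minimal sketch, assuming psycogreen is used for the psycopg2/gevent patching (the module name and worker counts are illustrative, not the actual settings from this setup):

# gunicorn_conf.py (hypothetical)
from psycogreen.gevent import patch_psycopg

worker_class = "gevent"
workers = 4                # illustrative
worker_connections = 100   # greenlets per worker, illustrative

def post_fork(server, worker):
    # Patch psycopg2 so its socket waits yield to the gevent hub
    # instead of blocking the whole worker.
    patch_psycopg()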
The problem I'm encountering is inexplicably slow queries that are sometimes running on the order of 100ms or so (ideal), but which often spike to 10s or more. While time is a parameter in the query, the difference between the fast and slow query happens much more frequently than a change in the result set. This doesn't seem to be tied to any meaningful spike in CPU usage, memory usage, read/write I/O, request frequency, etc. It seems to be arbitrary.
I've tried:
Optimizing the query - definitely valid, but it runs quite well locally, as well as any time I've tried it directly on the server through psql.
Running on a larger/better RDS instance - I'm currently working on an m3.medium instance with PIOPS and not coming close to that read rate, so I don't think that's the issue.
Tweaking the number of gunicorn workers - I thought this could be an issue, if the psycopg2 driver is having to context switch excessively, but this had no effect.
More - I've been working for a decent amount of time at this, so these were just a couple of the things I've tried.
Does anyone have ideas about how to debug this problem?
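One way to narrow this down is to log, per query, how long the app spends checking a connection out of the pool versus actually executing on the server; if the 10s spikes show up in the checkout time, the problem is pool/greenlet contention rather than Postgres itself. A rough sketch using psycopg2's built-in pool (the DSN, pool sizes, and threshold are placeholders):

import logging
import time

import psycopg2.pool

log = logging.getLogger("db.timing")
pool = psycopg2.pool.ThreadedConnectionPool(1, 10, "postgresql://...")  # placeholder DSN

def timed_query(sql, params=None, slow_ms=500):
    t0 = time.time()
    conn = pool.getconn()
    t_checkout = time.time()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            rows = cur.fetchall()
        t_done = time.time()
        if (t_done - t0) * 1000 > slow_ms:
            log.warning("slow query: checkout=%.0fms execute=%.0fms sql=%s",
                        (t_checkout - t0) * 1000, (t_done - t_checkout) * 1000, sql)
        return rows
    finally:
        pool.putconn(conn)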
This is what shared tenancy gets you: unpredictable results.
What is the size of the data set the queries run on? Although Craig says it sounds like bursty checkpoint activity, that doesn't make sense because this is RDS. It sounds more like cache fallout, e.g., your relations are falling out of cache.
You say you are running PIOPS, but m3.medium is not an EBS-optimized instance.
You need at least:
A higher instance class. Make sure your memory is larger than the active data set.
EBS optimized instances, see here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
Lots of memory.
PIOPS
By the time you have all of that, you will realize you will save a ton of money pushing PostgreSQL (or any database) to bare metal and leaving AWS to what it is good at: memory and CPU (not I/O).
You could try this from within psql to get more details on query timing (ANALYZE actually runs the query and reports real execution times):
EXPLAIN ANALYZE sql_statement
Also turn on more database logging. MySQL has a slow query log; PostgreSQL's equivalent is log_min_duration_statement (plus the auto_explain module), which on RDS is set through the DB parameter group.
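To capture the plan under the same connection conditions the app sees (rather than a fresh psql session), you can also run EXPLAIN from the application side; a small sketch, with the connection string and query as placeholders:

import psycopg2

conn = psycopg2.connect("postgresql://user:password@host/dbname")  # placeholder
query = "SELECT ..."  # the intermittently slow query

with conn.cursor() as cur:
    # ANALYZE executes the query and reports real timings; BUFFERS shows
    # cache hits vs. disk reads, which helps distinguish a bad plan from
    # data falling out of cache.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + query)
    for (line,) in cur.fetchall():
        print(line)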
Related
I plan to use python at my job to connect directly to our main system production database. However the IT department is reluctant since apparently there is no easy way to control how much I can query the database. Hence they are worried I could affect performance for the rest of the users.
Is there a way to limit frequency of queries from python connection to the database? Or some other method that I can "sell" to my IT department so they will let me connect directly to production DB via python?
Many thanks
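One concrete thing you could offer the IT department is a hard client-side cap, so your connection can never issue more than an agreed number of queries per second regardless of what your scripts do. A minimal sketch (the class, rate, and helper names are made up for illustration):

import threading
import time

class QueryThrottle:
    """Client-side rate limiter: at most max_per_sec queries per second."""
    def __init__(self, max_per_sec=5):
        self.min_interval = 1.0 / max_per_sec
        self._lock = threading.Lock()
        self._last = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            sleep_for = self.min_interval - (now - self._last)
            if sleep_for > 0:
                time.sleep(sleep_for)
            self._last = time.monotonic()

throttle = QueryThrottle(max_per_sec=5)

def run_query(cursor, sql, params=None):
    throttle.wait()  # never hit the DB faster than the agreed rate
    cursor.execute(sql, params)
    return cursor.fetchall()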
Oracle's Database Resource Manager gives quite a few options for this, depending on how heavy production usage is compared to what you will be adding. This does not depend on the type of client.
https://blogs.oracle.com/db/oracle-resource-manager-and-dbmsresourcemanager
Often a plan is created that specifies a priority order for resource usage: regular production gets most of the resources, and your project sits a class lower. While production is running, your session(s) get whatever production leaves over.
Also very nice is the cost estimation, which allows you to cancel a query deemed too expensive.
A bit of thought must be given to slow, long-running transactions that hold blocking locks. It does take a bit of experimentation to get this right.
I am about to deploy a Django app, and it struck me that I couldn't find a way to anticipate how many requests per second my application can handle.
Is there a way of calculating how many requests per second a Django application can handle, without resorting to things like doing a test deployment and using an external tool such as locust?
I know there are several factors involved (such as the number of database queries, etc.), but perhaps there is a convenient way of calculating, or even estimating, how many visitors a single Django app instance can handle.
EDIT: Removed the mention to Gunicorn, since it only adds confusion to what I truly wanted to know.
Is there a way of calculating how many requests per second a Django
application can handle, without resorting to things like doing a
test deployment and using an external tool such as locust?
No and yes. As mackarone pointed out, I don't think there's any way you can avoid measuring it. Consider the case where you did a local benchmark on your local dev server talking to a local DB instance, in order to generate a baseline for estimation. The issue with this is that the hardware and the network (distance between services) make a huge difference. So any numbers you generated locally would be relatively worthless for capacity planning.
In my experience, local testing is great for relative changes. Consider the case where you wanted to see the performance impact of a SQL query-planning change. Establishing a local baseline, making the change, then observing the effect locally is useful for gauging the relative speedup.
How to generate these numbers?
I would recommend deploying the app to the hardware and network you plan on testing on. This deploy should use your production configuration and component topology (i.e. if you're going to run gunicorn, make sure gunicorn is running instead of NGINX, or if you're going to have a proxy in front of gunicorn, make sure that is set up). I would run a single instance of your application using your production config.
Once this is running, I would launch a load test against the single instance using any of the popular load testing tools:
Apache Benchmark
Siege
Vegeta
K6
etc
You can launch these load tests from your single machine and ramp up traffic until response times are no longer acceptable, in order to get a feel for the number of concurrent connections and the throughput your application can accommodate.
Now you have some idea of what a single instance of your service is able to handle. Up until your db (or other shared resources) are saturated these numbers can be used to project how many instances of your service are necessary to handle some amount of traffic!
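Since locust was already mentioned, the measuring script itself can be tiny; a minimal locustfile sketch (the host and endpoint are placeholders):

# locustfile.py -- run with: locust -f locustfile.py --host http://your-staging-host
from locust import HttpUser, task, between

class SiteUser(HttpUser):
    wait_time = between(1, 3)  # simulated per-user think time

    @task
    def index(self):
        # Hit a representative view; replace with your app's hottest endpoint.
        self.client.get("/")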
According to the Gunicorn documentation
How Many Workers?
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Obviously, your particular hardware and application are going to affect the optimal number of workers. Our recommendation is to start with the above guess and tune using TTIN and TTOU signals while the application is under load.
Always remember, there is such a thing as too many workers. After a point your worker processes will start thrashing system resources decreasing the throughput of the entire system.
The best thing is to tune it using some load-testing tool such as locust, as you mentioned.
Emphasis mine
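For reference, the (2 x $num_cores) + 1 starting point above translates directly into a gunicorn config file; a sketch that assumes nothing about your actual hardware:

# gunicorn.conf.py
import multiprocessing

# Gunicorn's suggested starting point: (2 x $num_cores) + 1
workers = multiprocessing.cpu_count() * 2 + 1

# Tune from here under load; the TTIN/TTOU signals mentioned above
# add/remove workers at runtime.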
You have to install loadtest first; it is an npm package.
I was learning Redis when I found this; you can use it, it worked for me.
For more, check this tutorial: https://realpython.com/caching-in-django-with-redis/#start-by-measuring-performance
npm install -g loadtest
loadtest -n 100 -k http://localhost:8000/myUrl/
I launched a simple Crate AMI EC2 instance and opened up the ports: 4200 for Crate and 5000 for Flask.
When I run the query on the EC2 instance with the Crate AMI directly, the speeds are slower but still fast enough (~1-2 seconds); but when I call the same query through the Flask endpoint, which passes the query to CrateDB (on the same instance), it takes close to 10 seconds.
I tested the endpoint on localhost and there was no change in execution speed as such. Hence, I've ruled out the code being the problem.
My questions:
Why are the queries being run through the Flask-Restful endpoint on EC2 so slow?
Does it make a difference in performance to build an EC2 AMI from scratch and install CrateDB on it, rather than using an out-of-the-box Crate AMI?
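One way to see where the extra seconds go is to time the CrateDB call separately from the rest of the Flask request; a rough sketch, assuming a hypothetical endpoint and the crate Python client (the port and query handling are illustrative only):

import time

from crate import client
from flask import Flask, jsonify, request

app = Flask(__name__)
connection = client.connect("http://localhost:4200")  # Crate on the same instance

@app.route("/query")
def run_query():
    sql = request.args.get("q", "SELECT 1")  # illustrative only
    t0 = time.perf_counter()
    cursor = connection.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    db_ms = (time.perf_counter() - t0) * 1000
    app.logger.info("CrateDB query took %.0f ms", db_ms)
    return jsonify({"rows": rows, "db_ms": db_ms})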
That can be one of several things; mostly, however, I suspect a 'hardware' issue:
Are the hardware specs the same? More cores, more memory, SSD vs. spinning disks?
Is the environment variable CRATE_HEAP_SIZE set to half the available RAM? (/etc/sysconfig/crate)
Is the CREATE TABLE statement the same? A different number of cores results in a different number of shards if not specified. Oversharding/undersharding will degrade performance noticeably (see the sketch below).
I am assuming the table size and queries are the same ;) otherwise seemingly minor changes can make a difference in performance. Partitioned tables are optimized when the partition column is in the WHERE clause, and queries hitting the primary key(s) directly are much faster. Similarly, aggregations/comparisons on strings are slower than on numeric types, etc.
Cheers, Claus
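To make the sharding point above concrete: if the table was created without an explicit shard count, the default can differ between environments, so pinning it at creation time removes one variable. A sketch using the crate Python client, with a made-up table and shard count:

from crate import client

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

# Pin the shard count explicitly so it does not depend on the node it was
# created on.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        ts TIMESTAMP,
        value DOUBLE
    ) CLUSTERED INTO 6 SHARDS
""")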
I've been given the task of writing an ETL (Extract, Transform, Load) process between a PostgreSQL 9.1 database hosted on Heroku (we can call it the Master) to another, application-purposed copy of the data that will be in another Heroku (Cedar Stack) hosted PostgreSQL database. Our primary development stack is Python 2.7.2, Django 1.3.3 and PostgreSQL 9.1. As many of you may know, the file system in Heroku is limited in what you can do, and I'm not sure if I completely understand what the rules are for the Ephemeral Filesystem.
So, I'm trying to figure out what my options are here. The obvious one is that I can just write a Django management command and have two separate database connections (and a destination and source set of models) and pump the data over that way, handling the ETL in the process. While effective, my initial tests show this is a very slow approach. Obviously, a faster approach would be to use PostgreSQL's COPY functionality. But normally, if I were doing this, I would write it out to a file and then use psql to pull it in. Has anyone done anything like this between two dedicated PostgreSQL databases on Heroku? Any advice or tips will be appreciated.
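One way to get COPY speed without relying on the dyno's filesystem is to stream COPY output from the source connection straight into the target connection with psycopg2's copy_expert. A sketch with a reasonably recent psycopg2 (the connection strings and table name are placeholders; this version buffers the whole table in memory, so chunk it or use a pipe for big tables):

import io

import psycopg2

src = psycopg2.connect("postgres://source-database-url")  # placeholder DSNs
dst = psycopg2.connect("postgres://target-database-url")

buf = io.StringIO()
with src.cursor() as scur:
    scur.copy_expert("COPY my_table TO STDOUT WITH CSV", buf)

buf.seek(0)
with dst.cursor() as dcur:
    dcur.copy_expert("COPY my_table FROM STDIN WITH CSV", buf)
dst.commit()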
One solution may be to do the whole ETL process in Postgres land. That is, use the dblink extension to pull data from the source database into the target database. This may or may not be sufficient, but it's worth investigating.
You are free to use the filesystem on a Heroku dyno, but I don't think this is a bulletproof solution. The way it works is that you can write to the filesystem just fine, but as soon as that process exits, away goes the data within it. The size of that filesystem is not guaranteed at all, but it is quite large and should be fine unless you need multiple hundreds of GBs worth of storage.
Finally, you can speed up some of the process by turning some session-level Postgres knobs. Instead of listing them here, I'll just point you to the excellent Postgres docs.
EDIT: We now support the Postgres FDW, a better alternative to dblink: http://www.postgresql.org/docs/current/static/postgres-fdw.html
You can see the combination of software components I'm using in the title of the question.
I have a simple 10-table database running on a Postgres server (Win 7 Pro). I have client apps (python using psycopg to connect to Postgres) who connect to the database at random intervals to conduct relatively light transactions. There's only one client app at a time doing any kind of heavy transaction, and those are typically < 500ms. The rest of them spend more time connecting than actually waiting for the database to execute the transaction. The point is that the database is under light load, but the load is evenly split between reads and writes.
My client apps run as servers/services themselves. I've found that it is pretty common for me to be able to (1) take the Postgres server completely down, and (2) ruin the database by killing the client app with a keyboard interrupt.
By (1), I mean that the Postgres process on the server aborts and the service needs to be restarted.
By (2), I mean that the database crashes again whenever a client tries to access the database after it has restarted and (presumably) finished "recovery mode" operations. I need to delete the old database/schema from the database server, then rebuild it each time to return it to a stable state. (After recovery mode, I have tried various combinations of Vacuums to see whether that improves stability; the vacuums run, but the server will still go down quickly when clients try to access the database again.)
I don't recall seeing the same effect when I kill the client app using a "taskkill" - only when using a keyboard interrupt to take the python process down. It doesn't happen all the time, but frequently enough that it's a major concern (25%?).
I'm really surprised that anything on a client would actually be able to take down an "enterprise-class" database. Can anyone share tips on how to improve robustness, and hopefully help me to understand why this is happening in the first place? Thanks, M
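Whatever the server-side cause turns out to be, it is also worth making the client shut down cleanly on Ctrl-C, so interrupted transactions are rolled back and connections are closed rather than abandoned; a minimal psycopg2 sketch (connection details and workload are made up):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details

try:
    with conn.cursor() as cur:
        cur.execute("UPDATE widgets SET count = count + 1")  # hypothetical workload
    conn.commit()
except KeyboardInterrupt:
    # Roll back the in-flight transaction instead of leaving it open.
    conn.rollback()
    raise
finally:
    conn.close()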
If you're having problems with PostgreSQL acting up like this, you should read this page:
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
For an example of a real bug, and how to ask a question that gets action and answers, read this thread.
http://archives.postgresql.org/pgsql-general/2010-12/msg01030.php