Poor MySQL performance on EC2 Micro instance - python

I have a small webapp which uses Python/Flask and a MySQL database for storage. It has a students table with around 3 thousand rows. When I try to load the page that lists them, loading takes a very long time, usually around 20 seconds and sometimes close to a minute, which is really slow, and I am wondering what is causing this. The server is otherwise idle before the request is made; the slowdown appears only when I load that page.
As I said, that is not too many records, and I am puzzled as to why this is so inefficient. I am using Ubuntu 12.04, and the MySQL version is Ver 14.14 Distrib 5.5.32, for debian-linux-gnu (x86_64) using readline 6.2. Other queries run fine; for example, listing students whose name starts with a given letter takes around 2-3 seconds, which is acceptable. That query returns only a portion of the table, so I am guessing something is not optimized right.
My my.cnf file is linked here. I tried a few tweaks and added some lines at the bottom, but without much success.
The actual queries are issued through SQLAlchemy, and this is the specific code used to load this page:
score = db.session.query(Scores.id).order_by(Scores.date.desc()).correlate(Students).filter(Students.email == Scores.email).limit(1)
students = db.session.query(Students, score.as_scalar()).filter_by(archive=0).order_by(Students.exam_date)
return render_template("students.html", students=students.all())
This appears to be the SQL generated:
SELECT student.id AS student_id, student.first_name AS student_first_name, student.middle_name AS student_middle_name, student.last_name AS student_last_name, student.email AS student_email, student.password AS student_password, student.address1 AS student_address1, student.address2 AS student_address2, student.city AS student_city, student.state AS student_state, student.zip AS student_zip, student.country AS student_country, student.phone AS student_phone, student.cell_phone AS student_cell_phone, student.active AS student_active, student.archive AS student_archive, student.imported AS student_imported, student.security_pin AS student_security_pin,
       (SELECT scores.id
        FROM scores
        WHERE student.email = scores.email
        ORDER BY scores.date DESC
        LIMIT 1) AS anon_1
FROM student
WHERE student.archive = 0
Thanks in advance for your time and help!

@datasage is right - the micro instance can only do so much. You might try starting a second micro instance for your MySQL database; running both Apache and MySQL on a single micro instance will be slow.
In my experience with AWS's RDS service (MySQL), you can get reasonable performance on a micro instance for testing. Depending on how long the instance has been up, you can also get crawlers pinging your site, so it can help to restrict access to your own IP in the security policy.
It doesn't look like your database structure is that complex - you might add an index on your email fields, but I suspect unless your dataset is over 5000 rows it won't make much difference. If you're using the sqlalchemy ORM, this would look like:
from sqlalchemy import Column, Integer, String

class Scores(Base):  # Base is your declarative base
    __tablename__ = 'scores'
    id = Column(Integer(), primary_key=True)
    email = Column(String(255), index=True)  # index=True adds an index on the email column

Micro instances are pretty slow performance-wise. They are designed with burstable CPU profiles and are heavily throttled once the burst allowance is exceeded.
That said, your problem here is likely with your database design. Any time you want to join two tables, you want indexes on the columns on both sides of the join. In this case you are joining on the email field.
Joining on strings isn't as efficient as joining on an integer id. Also, using the EXPLAIN keyword when running the query directly in MySQL will show you the execution plan and can help you quickly identify where you are missing indexes or have other problems.
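For example, a minimal sketch using the question's db session to add the missing indexes and inspect the plan from Python (the index names here are made up):
from sqlalchemy import text

# One-off DDL: index both sides of the email lookup used by the subquery.
# (Run these once, e.g. from a shell; names are arbitrary.)
db.session.execute(text("CREATE INDEX ix_student_email ON student (email)"))
db.session.execute(text("CREATE INDEX ix_scores_email_date ON scores (email, date)"))
db.session.commit()

# Ask MySQL for the execution plan of the slow query.
for row in db.session.execute(text("EXPLAIN SELECT * FROM student WHERE archive = 0")):
    print(row)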

Related

Having issues doing fast enough db inserts inside a Flask endpoint

I have an HTTP POST endpoint in Flask which needs to insert whatever data comes in into a database. This endpoint can receive up to hundreds of requests per second. Doing an insert every time a new request comes in takes too much time. I thought that doing a bulk insert every 1000 requests, with the data from the previous 1000 requests, should work like some sort of buffering mechanism. So I have tried saving the incoming data objects into a collection and doing a bulk insert once the array is 'full'.
Currently my code looks like this:
@app.route('/user', methods=['POST'])
def add_user():
    firstname = request.json['firstname']
    lastname = request.json['lastname']
    email = request.json['email']
    usr = User(firstname, lastname, email)
    global bulk
    bulk.append(usr)
    if len(bulk) > 1000:
        db.session.bulk_save_objects(bulk)
        db.session.commit()
        bulk = []  # reset the buffer only after the batch has been flushed
    return user_schema.jsonify(usr)
The problem I'm having with this is that the database becomes 'locked', and I really don't know whether this is a good solution that is just poorly implemented, or a bad idea altogether.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
Your error message indicates that you are using an sqlite DB with SQLAlchemy. You may want to try changing the setting of the sqlite "synchronous" flag to turn syncing OFF. This can speed INSERT queries up dramatically, but it does come with the increased risk of data loss. See https://sqlite.org/pragma.html#pragma_synchronous for more details.
With synchronous OFF (0), SQLite continues without syncing as soon as it has handed data off to the operating system. If the application running SQLite crashes, the data will be safe, but the database might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface. On the other hand, commits can be orders of magnitude faster with synchronous OFF.
If your application and use case can tolerate the increased risks, then disabling syncing may negate the need for bulk inserts.
See "How to set SQLite PRAGMA statements with SQLAlchemy": How to set SQLite PRAGMA statements with SQLAlchemy
Once I moved the code to AWS and used an Aurora instance as the database, the problems went away, so I suppose it's safe to conclude that the issues were solely related to my sqlite3 instance.
The final solution gave me satisfactory results and I ended up changing only this line:
db.session.bulk_save_objects(bulk)
to this:
db.session.add_all(bulk)
I can now safely do 400 or more (I haven't tested beyond that) calls per second to that specific endpoint, all ending with valid inserts.
Not an expert on this, but it seems like the database has reached its concurrency limits. You could try using Pony ORM for better concurrency and transaction management:
https://docs.ponyorm.org/transactions.html
By default Pony uses the optimistic concurrency control concept for increasing performance. With this concept, Pony doesn’t acquire locks on database rows. Instead it verifies that no other transaction has modified the data it has read or is trying to modify.
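A minimal sketch of what the endpoint's model and insert might look like under Pony (the entity and file name are hypothetical, mirroring the User model above):
from pony.orm import Database, Required, db_session

db = Database()

class User(db.Entity):
    firstname = Required(str)
    lastname = Required(str)
    email = Required(str)

db.bind("sqlite", "users.db", create_db=True)
db.generate_mapping(create_tables=True)

@db_session  # each call runs in its own transaction with optimistic checks
def add_user(firstname, lastname, email):
    User(firstname=firstname, lastname=lastname, email=email)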

PostgreSQL query slow, what's the issue?

I'm trying to store some measurement data in my PostgreSQL DB using Python/Django.
So far so good: I've made a Docker container with Django, and another one with the PostgreSQL server.
However, I am getting close to 2M rows in my measurement table, and queries are starting to get really slow. I'm not really sure why; I'm not doing very intense queries.
This query
SELECT ••• FROM "measurement" WHERE "measurement"."device_id" = 26 ORDER BY "measurement"."measure_timestamp" DESC LIMIT 20
for example takes between 3 and 5 seconds to run, depending on which device I query.
I would expect this to run a lot faster, since I'm not doing anything fancy.
The measurement table
id INTEGER
measure_timestamp TIMESTAMP WITH TIME ZONE
sensor_height INTEGER
device_id INTEGER
with indices on id and measure_timestamp.
The server doesn't look too busy; even though it has only 512MB of memory, I have plenty left during queries.
I configured the postgresql server with shared_buffers=256MB and work_mem=128MB.
The total database is just under 100MB, so it should easily fit.
If I run the query in pgAdmin, I see a lot of block I/O, so I suspect it has to read from disk, which is obviously slow.
Could anyone give me a few pointers in the right direction on how to find the issue?
EDIT:
Added the output of EXPLAIN ANALYZE on a query. I have now added an index on device_id, which helped a lot, but I would expect even quicker query times.
https://pastebin.com/H30JSuWa
Do you have indexes on measure_timestamp and device_id? If the queries always take that form, you might also like multi-column indexes.
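If the model lives in Django, a multi-column index covering this query might look like the following (the model is reconstructed from the fields described in the question):
from django.db import models

class Measurement(models.Model):
    measure_timestamp = models.DateTimeField()
    sensor_height = models.IntegerField()
    device_id = models.IntegerField()

    class Meta:
        indexes = [
            # Matches WHERE device_id = ... ORDER BY measure_timestamp DESC
            models.Index(fields=["device_id", "measure_timestamp"]),
        ]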
Please look at the distribution key of your table. It is possible that the data is sparsely distributed, which affects performance. Selecting a proper distribution key is very important when you have 2M records. For more details, read this on why the distribution key is important.

SqlAlchemy look-ahead caching?

Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? Synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced it to SqlAlchemy. I'm sure there are changes in my declarative and db-schema that could improve time, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and in total my database has 800 records (all Integers and UnicodeText). My database engine is SQLite and the actual size is currently 242Kb.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own Python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. The access to these attributes is largely non-linear, and every time I do a lookup, my call stack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.
A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category  # no SQL will be emitted for the Category
This works because the query.get() that a many-to-one lookup emits will check the local identity map for the primary key first.
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.

Overhead of a Round-trip to MySql?

So I've been building Django applications for a while now, drinking the Kool-Aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80%-90% of their time) was getting slow once a user had a large amount of user-specific content (i.e. photos, friends, other data, etc.).
So I popped in the SQL logger (it was pre-installed with Pinax; I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries!! With hand-coded SQL I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good.
...even if only a dozen or so of the queries take 1ms+
So I'm wondering, how much overhead is there on a round trip to MySQL? Django and MySQL are running on the same server, so there shouldn't be any network-related overhead.
Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
I had - like you - a home page of one of my applications that had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring more of the data I needed - I went from hundreds of queries to tens.
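For example (a sketch with a hypothetical Photo model that has an owner ForeignKey, as on a profile page):
# Before: one query for the photos, plus one more query per photo as soon as
# the template touches photo.owner.
photos = Photo.objects.filter(owner__in=friends)

# After: select_related() turns those follow-up lookups into a single JOIN.
photos = Photo.objects.filter(owner__in=friends).select_related("owner")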
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
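A minimal sketch of Django's per-view caching, assuming a cache backend is already configured in settings:
from django.http import HttpResponse
from django.views.decorators.cache import cache_page

@cache_page(60 * 5)  # serve the cached response for 5 minutes
def home(request):
    # Expensive ORM work only runs on a cache miss.
    return HttpResponse("...")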
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a use case where studying and tuning the ORM navigation didn't help, and you are sure that you could do it much better with a standard query, use raw SQL for that case.
The overhead of each query is only part of the picture. The actual round-trip time between your Django and MySQL servers is probably very small, since most of your queries come back in less than a millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much, and even 50 seems like a lot to me. If ten users view complicated pages, you're now up to 5000 queries.
The round trip time to the database server is more of a factor when the caller is accessing the database from a Wide Area Network, where roundtrips can easily be between 20ms and 100ms.
I would definitely look into using some kind of caching.
There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or template via {%if%}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{foo.bar.baz.quux}}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query business with some careful .filter() and Python processing to assemble a useful tuple in the view function.
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.
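A rough sketch of that idea with hypothetical models (Foo has a ForeignKey to Bar), replacing {{foo.bar}}-style navigation with one batched query per table and a dictionary lookup in the view:
from django.shortcuts import render

def dashboard(request):
    foos = list(Foo.objects.filter(owner=request.user))

    # One batched query instead of one query per {{ foo.bar }} in the template.
    bars = {b.pk: b for b in Bar.objects.filter(pk__in=[f.bar_id for f in foos])}

    # Assemble useful tuples in Python; the template just iterates over rows.
    rows = [(f, bars[f.bar_id]) for f in foos]
    return render(request, "dashboard.html", {"rows": rows})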
There is always overhead in database calls. In your case the overhead is not that bad, because the application and database are on the same machine so there is no network latency, but there is still a significant cost.
When you make a request to the database it has to prepare to service that request by doing a number of things including:
Allocating resources (memory buffers, temp tables etc) to the database server connection/thread that will handle the request,
De-serializing the SQL and parameters (this is necessary even on one machine, as this is an inter-process request, unless you are using an embedded database),
Checking whether the query exists in the query cache and, if not, optimising it and putting it in the cache.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL), this may result in cache misses for statements that should be the same, meaning that each request results in the query being analysed and optimized each time (see the sketch below).
Process the query.
Prepare and return the results to the client.
This is just an overview of the kinds of things most database management systems do to process an SQL request. You incur this overhead 500 times, even if the query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.
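As a sketch of the parametrisation point above, using Django's raw cursor (table and column names are made up):
from django.db import connection

user_id = 42
with connection.cursor() as cursor:
    # Bad: the literal value is baked into the SQL text, so every distinct id
    # looks like a brand-new statement to the server.
    cursor.execute("SELECT * FROM photos WHERE user_id = %d" % user_id)

    # Better: the value is passed separately, so the statement text stays
    # identical across requests and can be cached and reused.
    cursor.execute("SELECT * FROM photos WHERE user_id = %s", [user_id])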

Django Table with Million of rows

I have a project with 2 applications (books and reader).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject (20 models with 20 tables of 200,000 rows each: book_horror, book_drammatic, etc.).
In "reader" application, I am thinking to insert this fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which identifies the appropriate table) and "book_id" (which identifies the book within the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?
Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step; it sounds like you've done this already, though. 4 million rows should be OK for the DB to handle with an index.
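For instance, if lookups are mostly by title, an index is a single argument on the field (a sketch against the model from the question; the class name is assumed):
from django.db import models

class Book(models.Model):
    book_title = models.CharField(max_length=40, db_index=True)  # indexed lookups on title
    book_description = models.CharField(max_length=400)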
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
Caching is the next step: use memcached for pages or parts of pages that are unchanged for most users. This is where you will see your biggest performance boost for the little effort required.
If you really, really need to split up the tables, the latest version of Django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (Postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like author, and divide by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big --- this is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums, etc. in, for example, the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until Django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you --- but well before sharding or horizontal partitioning.
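A small sketch of that kind of denormalised counter (hypothetical Author/Book models; the counter is maintained by application code instead of being computed with a JOIN):
from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)
    book_count = models.PositiveIntegerField(default=0)  # denormalised counter

class Book(models.Model):
    author = models.ForeignKey(Author, on_delete=models.CASCADE)
    book_title = models.CharField(max_length=40)

    def save(self, *args, **kwargs):
        is_new = self.pk is None
        super(Book, self).save(*args, **kwargs)
        if is_new:
            # Bump the cached count without re-counting the author's books.
            Author.objects.filter(pk=self.author_id).update(
                book_count=models.F("book_count") + 1)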
ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.
You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.
I'm not familiar with Django, but I have a general understanding of DB.
When you have large databases, it's pretty normal to index them. That way, retrieving data should be pretty quick.
When it comes to associating a book with a reader, you should create another table that links readers to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.
Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.
A common approach to this type of problem is Sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look in to caching with something like memcached. Django supports this quite well.
You can use a server-side DataTable. If you implement server-side processing, you can page through more than 4 million records with sub-second response times, because only the rows for the current page are fetched on each request.
