I have an HTTP POST endpoint in Flask which needs to insert whatever data comes in into a database. The endpoint can receive up to hundreds of requests per second. Doing an insert for every single request takes too much time, so I thought that doing a bulk insert every 1000 requests, with the data from the previous 1000 requests, would work as a sort of caching mechanism. I have tried saving the incoming data objects into a collection and doing a bulk insert once the array is 'full'.
Currently my code looks like this:
@app.route('/user', methods=['POST'])
def add_user():
    firstname = request.json['firstname']
    lastname = request.json['lastname']
    email = request.json['email']
    usr = User(firstname, lastname, email)
    global bulk
    bulk.append(usr)
    if len(bulk) > 1000:
        db.session.bulk_save_objects(bulk)
        db.session.commit()
        bulk = []
    return user_schema.jsonify(usr)
The problem I'm having with this is that the database becomes 'locked', and I really don't know whether this is a good solution that's just poorly implemented, or a stupid idea altogether.
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
Your error message indicates that you are using an sqlite DB with SQLAlchemy. You may want to try changing the setting of the sqlite "synchronous" flag to turn syncing OFF. This can speed INSERT queries up dramatically, but it does come with the increased risk of data loss. See https://sqlite.org/pragma.html#pragma_synchronous for more details.
With synchronous OFF (0), SQLite continues without syncing as soon as it has handed data off to the operating system. If the application running SQLite crashes, the data will be safe, but the database might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface. On the other hand, commits can be orders of magnitude faster with synchronous OFF.
If your application and use case can tolerate the increased risks, then disabling syncing may negate the need for bulk inserts.
See "How to set SQLite PRAGMA statements with SQLAlchemy": How to set SQLite PRAGMA statements with SQLAlchemy
Once I moved the code to AWS and used an Aurora instance as the database, the problems went away, so I suppose it's safe to conclude that the issues were solely related to my sqlite3 instance.
The final solution gave me satisfactory results and I ended up changing only this line:
db.session.bulk_save_objects(bulk)
to this:
db.session.add_all(bulk)
I can now safely handle 400 or more (haven't tested for more) calls per second on that specific endpoint, all ending in valid inserts.
Not an expert on this, but it seems like the database has reached its concurrency limits. You could try using Pony for better concurrency and transaction management:
https://docs.ponyorm.org/transactions.html
By default Pony uses the optimistic concurrency control concept for increasing performance. With this concept, Pony doesn’t acquire locks on database rows. Instead it verifies that no other transaction has modified the data it has read or is trying to modify.
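A minimal sketch of what that might look like with Pony (the database file, entity, and field names are placeholders, not taken from your models):

from pony.orm import Database, Required, db_session

db = Database('sqlite', 'users.sqlite', create_db=True)  # placeholder DB file

class User(db.Entity):
    firstname = Required(str)
    lastname = Required(str)
    email = Required(str)

db.generate_mapping(create_tables=True)

@db_session  # the whole function body runs inside one transaction
def add_user(firstname, lastname, email):
    User(firstname=firstname, lastname=lastname, email=email)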
Related
I have a query in my live app that has gone "odd"...
Running 1.8.4 SDK... 1.8.5 live instance using Python 2.7
Measurement is an NDB model... with a string property called status and a key property called asset....
(Deep in my handler code.... )
cursor = None
limit = 10
asset_key = <a key to an actual asset>
qry = Measurement.query(
    Measurement.status == 'PENDING',
    Measurement.asset == asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)
Now the weird thing is if I run this sometimes I get 4 items and sometimes only 1. (the right answer is 4)....
The dump of the query is exactly the same ... cursor is set to None... limit is always the same....same handler...same query and no new records in between each query. Fresh instance (eg 1st time + no other users)
Each query is only separated by seconds, yet the results are different.
Am I missing something here... has anyone else experienced this? Is this some sort of corrupt index? (It is a relatively large "table" with 482,911 items) Is NDB caching a cursor variable???
Very very odd.
Queries do not look up values in any cache. However, query results are written back to the in-context cache if the cache policy says so (as per the docs). https://developers.google.com/appengine/docs/python/ndb/cache#incontext
Perhaps review the caching policy for the entity in question. However, from your snippet I'm unsure if your query is strongly consistent. That is more likely the cause of this issue: https://developers.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency
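If your Measurement entities are written with the asset as their datastore parent, an ancestor query is strongly consistent. A sketch of what that would look like (this only helps if that parent relationship actually exists in your data):

# Ancestor queries are strongly consistent, but only for entities that were
# originally written with asset_key as their parent.
qry = Measurement.query(Measurement.status == 'PENDING', ancestor=asset_key)
results, cursor, more = qry.fetch_page(page_size=limit, start_cursor=cursor)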
I have one small webapp which uses Python/Flask and a MySQL db for storage of data. It has a students database with around 3 thousand rows. When trying to load that page, loading takes a very long time, sometimes even a minute or so; usually it's around 20 seconds, which is really slow, and I am wondering what is causing this. This is the state of the server before any request is made, and this happens when I try to load that site.
As I said, these are not too many records, and I am puzzled why this is so ineffective. I am using Ubuntu 12.04, with MySQL Ver 14.14 Distrib 5.5.32, for debian-linux-gnu (x86_64) using readline 6.2. Other queries run fine; for example, listing students whose name starts with some letter takes around 2-3 seconds, which is acceptable. That shows a portion of the table, so I am guessing something is not optimized right.
My.cnf file is located here. I tried some stuff, added some lines at the bottom, but without too much success.
The actual queries are done by sqlalchemy, and this is the specific code used to load this:
score = db.session.query(Scores.id).order_by(Scores.date.desc()).correlate(Students).filter(Students.email == Scores.email).limit(1)
students = db.session.query(Students, score.as_scalar()).filter_by(archive=0).order_by(Students.exam_date)
return render_template("students.html", students=students.all())
This appears to be the sql generated:
SELECT student.id AS student_id, student.first_name AS student_first_name, student.middle_name AS student_middle_name, student.last_name AS student_last_name, student.email AS student_email, student.password AS student_password, student.address1 AS student_address1, student.address2 AS student_address2, student.city AS student_city, student.state AS student_state, student.zip AS student_zip, student.country AS student_country, student.phone AS student_phone, student.cell_phone AS student_cell_phone, student.active AS student_active, student.archive AS student_archive, student.imported AS student_imported, student.security_pin AS student_security_pin,
    (SELECT scores.id
     FROM scores
     WHERE student.email = scores.email
     ORDER BY scores.date DESC
     LIMIT 1) AS anon_1
FROM student
WHERE student.archive = 0
Thanks in advance for your time and help!
@datasage is right - the micro instance can only do so much. You might try starting a second micro instance for your MySQL database. Running both Apache and MySQL on a single micro instance will be slow.
From my experience, when using AWS's RDS service (MySQL), you can get reasonable performance on the micro instance for testing. Depending on how long the instance has been on, sometimes you can get crawlers pinging your site, so it can help to IP-restrict it to your computer in the security policy.
It doesn't look like your database structure is that complex - you might add an index on your email fields, but I suspect unless your dataset is over 5000 rows it won't make much difference. If you're using the sqlalchemy ORM, this would look like:
class Scores(base):
    __tablename__ = 'center_master'
    id = Column(Integer(), primary_key=True)
    email = Column(String(255), index=True)
Micro instances are pretty slow performance wise. They are designed with burstable CPU profiles and will be heavily restricted when the burstable time is exceeded.
That said, your problem here is likely with your database design. Any time you want to join two tables, you want to have indexes on columns of the right and left side of the join. In this case you are using the email field.
Joining on strings isn't quite as optimal as joining on an integer id. Also, using the EXPLAIN keyword when running the query directly in MySQL will show you an execution plan and can help you quickly identify where you may be missing indexes or have other problems.
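For illustration only, a sketch of what joining on an indexed integer foreign key might look like in the same declarative style as the snippet above (the table and column names are assumptions, not the questioner's schema):

from sqlalchemy import Column, ForeignKey, Integer

class Scores(base):
    __tablename__ = 'scores'
    id = Column(Integer(), primary_key=True)
    # Join on an indexed integer foreign key instead of the email string.
    student_id = Column(Integer(), ForeignKey('student.id'), index=True)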
I have a problem with the speed of page loading.
Now it takes about 7 seconds to load the pages, of which 2~3 seconds is Django processing.
The obvious thing to blame is my lack of knowledge of architecture: the pages execute around 50 queries on average, as shown by the Django Debug Toolbar, but most of the queries are things like "yesterday's snapshot (group by something)" or "daily snapshot (group by something) before yesterday" and don't have to be recalculated on each request.
The ideas I've come up with are in-memory caching, or creating a new table for data that can be prepared in advance.
Is there any convention or Design Pattern for this kind of issue?
Sample queries are these (I believe they should not have to be run each time for yesterday's or last month's data):
SELECT `sample_salestarget`.`id`, `sample_salestarget`.`country_id`, `sample_salestarget`.`year`, `sample_salestarget`.`month`, `sample_salestarget`.`sales` FROM `sample_salestarget` WHERE (`sample_salestarget`.`country_id` = "abc" AND `sample_salestarget`.`month` = 8 AND `sample_salestarget`.`year` = 2012 )
SELECT `sample_dailysummary`.`id`, `sample_dailysummary`.`country_id`, `sample_dailysummary`.`date`, `sample_dailysummary`.`pv_day`, `sample_dailysummary`.`pv_week`, `sample_dailysummary`.`pv_month`, `sample_dailysummary`.`active_uu_day`, `sample_dailysummary`.`active_uu_week`, `sample_dailysummary`.`active_uu_month`, `sample_dailysummary`.`active_uu_7days`, `sample_dailysummary`.`active_uu_30days`, `sample_dailysummary`.`paid_uu_day`, `sample_dailysummary`.`paid_uu_week`, `sample_dailysummary`.`paid_uu_month`, `sample_dailysummary`.`sales_day`, `sample_dailysummary`.`sales_week`, `sample_dailysummary`.`sales_month`, `sample_dailysummary`.`register_uu_day`, `sample_dailysummary`.`register_uu_week`, `sample_dailysummary`.`register_uu_month`, `sample_dailysummary`.`pay_count_day`, `sample_dailysummary`.`pay_count_week`, `sample_dailysummary`.`pay_count_month`, `sample_dailysummary`.`total_user`, `sample_dailysummary`.`inv_access_uu`, `sample_dailysummary`.`inv_sender_uu`, `sample_dailysummary`.`inv_accepted_uu`, `sample_dailysummary`.`inv_send_count`, `sample_dailysummary`.`memo`, `sample_dailysummary`.`first_charge_uu` FROM `sample_dailysummary` WHERE `sample_dailysummary`.`date` = 2012-09-07 AND `sample_dailysummary`.`country_id` = "abc" )
Using Memcached can really speed things up for you. However, that does come with its problems. You have to be extra careful on dynamic pages about explicitly invalidating caches whenever required.
Along with Memcached, try johnny-cache, which does a very good job of caching your Django ORM queries.
Also, make use of Django's session variables as far as possible. (Try the cached_db session engine if you're using Memcached.) You could save objects (like your user profile settings) which stay consistent throughout a session. This way you're reducing the number of SQL calls again.
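For the snapshot-style queries that never change once the day is over, the low-level cache API may be all you need. A rough sketch, where the DailySummary model, the key format, and the timeout are assumptions rather than your code:

from django.core.cache import cache
from sample.models import DailySummary  # assumed model for sample_dailysummary

def get_daily_summary(country_id, date):
    key = 'dailysummary:%s:%s' % (country_id, date)
    summary = cache.get(key)
    if summary is None:
        # Hit the database only on a cache miss; yesterday's rows never change.
        summary = list(DailySummary.objects.filter(country_id=country_id, date=date))
        cache.set(key, summary, 60 * 60 * 24)
    return summary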
And if you really, really need quick page loads, maybe try loading your page first and then calling your SQL statements asynchronously using Celery, loading the results in an AJAXy manner.
If this is a production application exposed to the internet, and you can't reduce the number of queries you make, then you should at least reuse the answers. I would suggest using Django's built-in cache framework to store database results in RAM using Memcached. If this is a local app, then I would suggest Django's RAM-based (local-memory) cache. The reason for this is that Memcached can be scaled a lot further than Django's own cache, but Django's requires very little setup.
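A sketch of the two backends being discussed, as they would appear in settings.py (the Memcached address is a placeholder):

# Production: Memcached backend (scales beyond a single process)
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}

# Local development: Django's in-process local-memory cache (no extra setup)
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}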
Caching for Django
I'm using Django with an sqlite backend, and write performance is a problem. I may graduate to a "proper" db at some stage, but for the moment I'm stuck with sqlite. I think that my write performance problems are probably related to the fact that I'm creating a large number of rows, and presumably each time I save() one it's locking, unlocking and syncing the DB on disk.
How can I aggregate a large number of save() calls into a single database operation?
EDITED: commit_on_success is deprecated and was removed in Django 1.8. Use transaction.atomic instead. See Fraser Harris's answer.
Actually this is easier to do than you think. You can use transactions in Django. These batch database operations (specifically save, insert and delete) into one operation. I've found the easiest one to use is commit_on_success. Essentially you wrap your database save operations into a function and then use the commit_on_success decorator.
from django.db.transaction import commit_on_success

@commit_on_success
def lot_of_saves(queryset):
    for item in queryset:
        modify_item(item)
        item.save()
This will have a huge speed increase. You'll also get the benefit of having rollbacks if any of the items fail. If you have millions of save operations then you may have to commit them in blocks using commit_manually and transaction.commit(), but I've rarely needed that.
New as of Django 1.6 is atomic, a simple API to control DB transactions. Copied verbatim from the docs:
atomic is usable both as a decorator:
from django.db import transaction

@transaction.atomic
def viewfunc(request):
    # This code executes inside a transaction.
    do_stuff()
and as a context manager:
from django.db import transaction

def viewfunc(request):
    # This code executes in autocommit mode (Django's default).
    do_stuff()

    with transaction.atomic():
        # This code executes inside a transaction.
        do_more_stuff()
Legacy django.db.transaction functions autocommit(), commit_on_success(), and commit_manually() have been deprecated and will be removed in Django 1.8.
I think this is the method you are looking for: https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create
Code copied from the docs:
Entry.objects.bulk_create([
    Entry(headline='This is a test'),
    Entry(headline='This is only a test'),
])
Which in practice, would look like:
my_entries = list()
for i in range(100):
    my_entries.append(Entry(headline='Headline #' + str(i)))

Entry.objects.bulk_create(my_entries)
According to the docs, this executes a single query, regardless of the size of the list (maximum 999 items on SQLite3), which can't be said for the atomic decorator.
There is an important distinction to make. From the OP's question, it sounds like he is attempting to bulk create rather than bulk save. The atomic decorator is the fastest solution for saving, but not for creating.
"How can I aggregate a large number of save() calls into a single database operation?"
You don't need to. Django already manages a cache for you. You can't improve its DB caching by trying to fuss around with saves.
"write performance problems are probably related to the fact that I'm creating a large number of rows"
Correct.
SQLite is pretty slow. That's the way it is. Queries are faster than most other DB's. Writes are pretty slow.
Consider more serious architecture change. Are you loading rows during a web transaction (i.e., bulk uploading files and loading the DB from those files)?
If you're doing bulk loading inside a web transaction, stop. You need to do something smarter. Use celery or use some other "batch" facility to do your loads in the background.
We try to limit ourselves to file validation in a web transaction and do the loads when the user's not waiting for their page of HTML.
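A rough sketch of that pattern with Celery, where the Record model, app module, and task name are made up for illustration rather than taken from the question:

from celery import shared_task
from myapp.models import Record  # hypothetical model holding the uploaded rows

@shared_task
def load_records(validated_rows):
    # Runs on a Celery worker, outside the web request.
    Record.objects.bulk_create([Record(**row) for row in validated_rows])

The view then only validates the upload and calls load_records.delay(rows), returning the HTML page immediately while the insert happens in the background.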
So I've been building Django applications for a while now, and drinking the Kool-Aid and all: only using the ORM and never writing custom SQL.
The main page of the site (the primary interface where users will spend 80% - 90% of their time) was getting slow once you have a large amount of user specific content (ie photos, friends, other data, etc)
So I popped in the sql logger (was pre-installed with pinax, I just enabled it in the settings) and imagine my surprise when it reported over 500 database queries!! With hand coded sql I hardly ever ran more than 50 on the most complex pages.
In hindsight it's not altogether surprising, but it seems that this can't be good.
...even if only a dozen or so of the queries take 1ms+
So I'm wondering, how much overhead is there on a round trip to mysql? django and mysql are running on the same server so there shouldn't be any networking related overhead.
Just because you are using an ORM doesn't mean that you shouldn't do performance tuning.
I had - like you - a home page of one of my applications that had low performance. I saw that I was doing hundreds of queries to display that page. I went looking at my code and realized that with some careful use of select_related() my queries would bring more of the data I needed - I went from hundreds of queries to tens.
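For example, a generic sketch of the difference (the Photo model and its owner foreign key are made-up names, not the questioner's models):

from myapp.models import Photo  # hypothetical model with an `owner` ForeignKey

def photos_for(album):
    # Without select_related: one extra query per photo when the template
    # later touches photo.owner.
    # return Photo.objects.filter(album=album)

    # With select_related: the owner rows arrive in the same query via a JOIN.
    return Photo.objects.filter(album=album).select_related('owner')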
You can also run a SQL profiler and see if there aren't indices that would help your most common queries - you know, standard database stuff.
Caching is also your friend, I would think. If a lot of a page is not changing, do you need to query the database every single time?
If all else fails, remember: the ORM is great, and yes - you should try to use it because it is the Django philosophy; but you are not married to it.
If you really have a usecase where studying and tuning the ORM navigation didn't help, if you are sure that you could do it much better with a standard query: use raw sql for that case.
The overhead of each query is only part of the picture. The actual round-trip time between your Django and MySQL servers is probably very small, since most of your queries are coming back in less than one millisecond. The bigger problem is that the number of queries issued to your database can quickly overwhelm it. 500 queries for a page is way too much, and even 50 seems like a lot to me. If ten users view complicated pages you're now up to 5000 queries.
The round trip time to the database server is more of a factor when the caller is accessing the database from a Wide Area Network, where roundtrips can easily be between 20ms and 100ms.
I would definitely look into using some kind of caching.
There are some ways to reduce the query volume.
Use .filter() and .all() to get a bunch of things; pick and choose in the view function (or template via {%if%}). Python can process a batch of rows faster than MySQL.
"But I could send too much to the template". True, but you'll execute fewer SQL requests. Measure to see which is better.
This is what you used to do when you wrote SQL. It's not wrong -- it doesn't break the ORM -- but it optimizes the underlying DB work and puts the processing into the view function and the template.
Avoid query navigation in the template. When you do {{foo.bar.baz.quux}}, SQL is used to get the bar associated with foo, then the baz associated with the bar, then the quux associated with baz. You may be able to reduce this query business with some careful .filter() and Python processing to assemble a useful tuple in the view function.
Again, this was something you used to do when you hand-crafted SQL. In this case, you gather larger batches of ORM-managed objects in the view function and do your filtering in Python instead of via a lot of individual ORM requests.
This doesn't break the ORM. It changes the usage profile from lots of little queries to a few bigger queries.
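A sketch of that profile shift with made-up Foo and Bar models (Foo has a foreign key to Bar), gathering everything the page needs in two queries and pairing the rows up in Python instead of letting the template trigger a query per {{ foo.bar }}:

from django.shortcuts import render_to_response
from myapp.models import Foo, Bar  # hypothetical models: Foo.bar is a ForeignKey(Bar)

def page(request):
    foos = list(Foo.objects.filter(active=True))
    # One query for all the Bars this page needs.
    bars = Bar.objects.in_bulk([f.bar_id for f in foos])
    rows = [(f, bars.get(f.bar_id)) for f in foos]
    return render_to_response('page.html', {'rows': rows})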
There is always overhead in database calls. In your case the overhead is not that bad, because the application and database are on the same machine so there is no network latency, but there is still a significant cost.
When you make a request to the database it has to prepare to service that request by doing a number of things including:
Allocating resources (memory buffers, temp tables etc) to the database server connection/thread that will handle the request,
De-serializing the SQL and parameters (this is necessary even on one machine, as it is an inter-process request, unless you are using an embedded database),
Checking whether the query exists in the query cache; if not, optimising it and putting it in the cache.
Note also that if your queries are not parametrised (that is, the values are not separated from the SQL), this may result in cache misses for statements that should be the same, meaning that each request results in the query being analysed and optimised every time; see the short sketch after this list.
Process the query.
Prepare and return the results to the client.
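On the parametrisation point, a tiny DB-API sketch (the connection details, table, and email value are placeholders, not from the question):

import MySQLdb

conn = MySQLdb.connect(db='school', user='app', passwd='secret')  # placeholder credentials
cursor = conn.cursor()
email = 'someone@example.com'

# Parametrised: the SQL text is identical on every call, so the server can
# reuse its cached analysis of the statement.
cursor.execute("SELECT id FROM student WHERE email = %s", (email,))

# Not parametrised: every distinct email produces a different SQL string,
# which looks like a brand-new query each time.
cursor.execute("SELECT id FROM student WHERE email = '%s'" % email)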
This is just an overview of the kinds of things most database management systems do to process an SQL request. You incur this overhead 500 times even if the query itself runs relatively quickly. Bottom line: database interactions, even with a local database, are not as cheap as you might expect.