Reduce GAE HRD (db) read operation counts - python

To reduce my GAE Python usage costs, I want to optimize DB read operations. Do you have any suggestions?
I also can't understand why GAE reports far more DB read operations than I expected. If you can explain the general logic of how GAE counts DB read operations, that would be very helpful as well.
Thanks!

You can get the full breakdown of what a high-level operation (get, query, put, delete, ...) costs in low-level operations (small, read, write) here - https://developers.google.com/appengine/docs/billing (scroll down about half way).
I highly recommend using AppStats to help track down where your read operations are coming from. One big thing to watch out for is not to use the offset option with .fetch() for pagination, as this just skips results, but still costs reads. That means if you do .fetch(10, offset=20), it will cost you 30 reads. You want to use query cursors instead.
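For illustration, here is a minimal sketch of cursor-based paging with NDB; the StatusUpdate model is made up for this example:

from google.appengine.ext import ndb

class StatusUpdate(ndb.Model):  # hypothetical model for illustration
    date = ndb.DateTimeProperty(auto_now_add=True)

# First page of 10 entities; no cursor yet.
results, cursor, more = StatusUpdate.query().order(-StatusUpdate.date).fetch_page(10)

# Next page: resume from the cursor instead of using offset, so you only pay
# reads for the entities actually returned. cursor.urlsafe() can be handed to
# the client and restored later with Cursor(urlsafe=...).
if more and cursor:
    results, cursor, more = StatusUpdate.query().order(-StatusUpdate.date).fetch_page(10, start_cursor=cursor)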
Another optimization is to fetch by key (.get(keys)) instead of querying: a get costs only 1 read operation per entity, whereas a query costs 1 read for the query plus 1 read for each entity returned (so a query returning 1 entity costs 2 reads, but a .get() for that same entity costs only 1 read). You might also want to look at projection queries, which cost 1 read for the query but only 1 small op per projected entity retrieved (note: all projected properties must be indexed).
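As a rough sketch of the cost difference (the Article model and key id are made up for illustration):

from google.appengine.ext import ndb

class Article(ndb.Model):            # hypothetical model for illustration
    title = ndb.StringProperty()     # indexed by default, so it can be projected
    body = ndb.TextProperty()        # not indexed, so it cannot be projected

articles = Article.query().fetch(20)                            # query: 1 read + 1 read/entity = 21 reads
one = ndb.Key(Article, 12345).get()                             # get by key: 1 read
titles = Article.query().fetch(20, projection=[Article.title])  # projection: 1 read + 1 small/entity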
Also, if you're not already, you should be using the NDB API which automatically caches fetches and will help reduce your read operations. Along with the official docs, the NDB cheat sheet by Rodrigo and Guido is a great way to transition from ext.db to ndb.
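A tiny sketch of what that caching buys you (UserProfile is a hypothetical model):

from google.appengine.ext import ndb

class UserProfile(ndb.Model):   # hypothetical model for illustration
    name = ndb.StringProperty()

key = ndb.Key(UserProfile, 'alice')
profile = key.get()        # served from memcache if present, otherwise 1 datastore read
profile_again = key.get()  # served from NDB's in-context cache; no additional read op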
There are some good tips under Managing Datastore Usage here:
https://developers.google.com/appengine/articles/managing-resources
Lastly, you might also be interested in using gae_mini_profiler, which provides convenient access to AppStats for the current request, as well as other helpful profiling and logging information.

Hard to say why without seeing your code, but if you're not already, use memcache to save on DB reads.
https://developers.google.com/appengine/docs/python/memcache/usingmemcache
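For what it's worth, here is a minimal sketch of the usual read-through caching pattern; the Product model and get_product helper are made up for illustration, and if you use NDB it already does much of this for you automatically:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class Product(ndb.Model):    # hypothetical model for illustration
    name = ndb.StringProperty()

def get_product(product_id):
    # Check memcache first and fall back to the datastore on a miss.
    cache_key = 'product:%s' % product_id
    product = memcache.get(cache_key)
    if product is None:
        product = ndb.Key(Product, product_id).get()    # 1 read op on a cache miss
        if product is not None:
            memcache.set(cache_key, product, time=600)  # cache for 10 minutes
    return product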

Related

in-memory database with publish subscribe and query filter?

I am looking for a trading UI solution for my work. I require an in-memory database that can:
Store table-pattern data (rows and columns) with indexing capability.
Provide a publish/subscribe mechanism; there will be multiple subscribers to the topic/table.
Provide query-filter capability, since every user will have different criteria for subscription.
I have found a few technologies/options myself.
AMPS (60 East Technologies): The most efficient one. It provides pretty much everything I mentioned above, but it is a paid solution. It uses column-based storage and allows indexing as well.
MongoDB tailable cursor / capped collection: This also provides query-based subscription with open cursors, though it is not in-memory. Any thoughts on its performance? (I expect more than a million rows with hundreds of columns.)
Use a simple pub/sub mechanism and perform the query filtering on the client. But that would require unnecessary data flow, which would create security issues and a performance bottleneck.
Any suggestions on a product or toolset ideal for such a scenario? Our client side is a Python/C++ UI, and the server side will have a mixture of C++/Java/Python components. All ideas are welcome.
Many thanks!
SQLite, maybe? https://www.sqlite.org/index.html
I'm not exactly sure about your publish/subscribe mechanism requirements, but SQLite is used all over the place.
Though, to be honest, your in-memory database seems like it's going to be huge ("I expect more than [a] million rows with 100s of column").
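If it helps to see it concretely, here is a minimal sketch of an in-memory SQLite table with an index (the quotes table and columns are made up for illustration); note that SQLite itself has no built-in publish/subscribe mechanism, so that part would have to come from somewhere else:

import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database; contents vanish when the process exits
conn.execute('CREATE TABLE quotes (symbol TEXT, price REAL, volume INTEGER)')
conn.execute('CREATE INDEX idx_symbol ON quotes (symbol)')
conn.execute("INSERT INTO quotes VALUES ('ABC', 101.5, 2000)")
rows = conn.execute('SELECT symbol, price FROM quotes WHERE price > 100').fetchall()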

Is it possible to Bulk Insert using Google Cloud Datastore

We are migrating some data from our production database and would like to archive most of this data in the Cloud Datastore.
Eventually we would move all our data there, however initially focusing on the archived data as a test.
Our language of choice is Python, and we have been able to transfer data from MySQL to the datastore row by row.
We have approximately 120 million rows to transfer, and a one-row-at-a-time approach will take a very long time.
Has anyone found some documentation or examples on how to bulk insert data into cloud datastore using python?
Any comments or suggestions are appreciated. Thank you in advance.
There is no "bulk-loading" feature for Cloud Datastore that I know of today, so if you're expecting something like "upload a file with all your data and it'll appear in Datastore", I don't think you'll find anything.
You could always write a quick script using a local queue that parallelizes the work.
The basic gist would be:
Queuing script pulls data out of your MySQL instance and puts it on a queue.
(Many) Workers pull from this queue, and try to write the item to Datastore.
On failure, push the item back on the queue.
Datastore is massively parallelizable, so if you can write a script that will send off thousands of writes per second, it should work just fine. Further, your big bottleneck here will be network IO (after you send a request, you have to wait a bit to get a response), so lots of threads should get a pretty good overall write rate. However, it'll be up to you to make sure you split the work up appropriately among those threads.
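A rough sketch of that queue-and-workers approach, assuming the google-cloud-datastore client library; the 'ArchiveRow' kind, row['id'], and fetch_rows_from_mysql() are made up for illustration, and each worker could also batch items with client.put_multi() instead of single puts:

import threading
import Queue  # 'queue' on Python 3

from google.cloud import datastore  # assumes the google-cloud-datastore client library

client = datastore.Client()
work_queue = Queue.Queue(maxsize=10000)

def worker():
    while True:
        row = work_queue.get()
        try:
            entity = datastore.Entity(key=client.key('ArchiveRow', row['id']))
            entity.update(row)
            client.put(entity)
        except Exception:
            work_queue.put(row)  # on failure, push the item back on the queue
        finally:
            work_queue.task_done()

for _ in range(50):  # many workers; network IO is the main bottleneck
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

for row in fetch_rows_from_mysql():  # hypothetical generator over your MySQL table
    work_queue.put(row)

work_queue.join()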
Now, that said, you should investigate whether Cloud Datastore is the right fit for your data and durability/availability needs. If you're taking 120m rows and loading it into Cloud Datastore for key-value style querying (aka, you have a key and an unindexed value property which is just JSON data), then this might make sense, but loading your data will cost you ~$70 in this case (120m * $0.06/100k).
If you have properties (which will be indexed by default), this cost goes up substantially.
The cost of operations is $0.06 per 100k, but a single "write" may contain several "operations". For example, let's assume you have 120m rows in a table that has 5 columns (which equates to one Kind with 5 properties).
A single "new entity write" is equivalent to:
+ 2 (1 x 2 write ops fixed cost per new entity)
+ 10 (5 x 2 write ops per indexed property)
= 12 "operations" per entity.
So your actual cost to load this data is:
120m entities * 12 ops/entity * ($0.06/100k ops) = $864.00
I believe what you are looking for is the put_multi() method.
From the docs, you can use put_multi() to batch multiple put operations. This will result in a single RPC for the batch rather than one for each of the entities.
Example:
from google.appengine.ext import ndb

# a list of many entities (UserEntity is an ndb.Model defined elsewhere)
user_entities = [UserEntity(name='user %s' % i) for i in xrange(10000)]
users_keys = ndb.put_multi(user_entities)  # keys are in same order as user_entities
Also to note, from the docs is that:
Note: The ndb library automatically batches most calls to Cloud Datastore, so in most cases you don't need to use the explicit batching operations shown below.
That said, you may still, as suggested above, use a task queue (I prefer the deferred library) in order to batch-put a lot of data in the background.
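As a sketch of that deferred approach (ArchiveRow and all_rows are hypothetical; keep each deferred payload small):

from google.appengine.ext import deferred, ndb

BATCH_SIZE = 500  # keep each deferred task's payload comfortably small

def put_batch(rows):
    # Runs on the task queue: turn raw rows into entities and batch-put them.
    # ArchiveRow is a hypothetical ndb.Model for the archived data.
    entities = [ArchiveRow(id=row['id'], data=row['data']) for row in rows]
    ndb.put_multi(entities)

# In the loader / request handler (all_rows is a hypothetical list of row dicts):
for i in xrange(0, len(all_rows), BATCH_SIZE):
    deferred.defer(put_batch, all_rows[i:i + BATCH_SIZE])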
As an update to JJ Geewax's answer: as of July 1st, 2016, the cost of read and write operations has changed, as explained here: https://cloud.google.com/blog/products/gcp/google-cloud-datastore-simplifies-pricing-cuts-cost-dramatically-for-most-use-cases
So writing should have gotten cheaper for the case described above, since writing a single entity now costs only 1 write regardless of indexes, at $0.18 per 100,000 writes.

Collecting keys vs automatic indexing in Google App Engine

After enabling Appstats and profiling my application, I went on a panic rage trying to figure out how to reduce costs by any means. A lot of my costs per request came from queries, so I set out to eliminate querying as much as possible.
For example, I had one query where I wanted to get a User's StatusUpdates after a certain date X. I used a query to fetch: statusUpdates = StatusUpdates.query(StatusUpdates.date > X).
So I thought I might outsmart the system and avoid a query, incurring higher write costs for the sake of lower read costs. The idea was that every time a user writes a Status, I store the key to that status in a list property on the user. So instead of querying, I would just do ndb.get_multi(user.list_of_status_keys).
The question is, what is the difference for the system between these two approaches? Sure I avoid a query with the second case, but what is happening behind the scenes here? Is what I'm doing in the second case, where I'm collecting keys, just me doing a manual indexing that GAE would have done for me with queries?
In general, what is the difference between get_multi(keys) and a query? Which is more efficient? Which is less costly?
Check the docs on billing:
https://developers.google.com/appengine/docs/billing
It's pretty straightforward. Reads are $0.07/100k, smalls are $0.01/100k, so you want to do smalls.
A query is 1 read + 1 small / entity
A get is 1 read. If you are getting more than 1 entity back with a query, it's cheaper to do a query than reading entities from keys.
Query is likely more efficient too. The only benefit from doing the gets is that they'll be fully consistent (whereas a query is eventually consistent).
Storing the keys does not replace the query, because you cannot do anything with just the keys: you still have to fetch the Status objects themselves. Also, since you want to filter on the date of the Status objects, you would need to fetch all of them into memory and compare their dates yourself. If you use a query, App Engine will fetch only the Status entities with the required date. Since you fetch less, your read costs will be lower.
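A rough sketch of the two approaches being compared; the models, the cutoff date, and the user entity are made up for illustration:

from google.appengine.ext import ndb

class StatusUpdate(ndb.Model):      # hypothetical models for illustration
    date = ndb.DateTimeProperty()

class User(ndb.Model):
    status_keys = ndb.KeyProperty(kind=StatusUpdate, repeated=True)

# Approach 1: query -- the datastore index does the date filtering, so you
# pay 1 read for the query plus 1 read per matching entity.
recent = StatusUpdate.query(StatusUpdate.date > cutoff).fetch()

# Approach 2: stored keys -- every status is fetched (1 read each) and the
# date filtering has to happen in your own code.
all_statuses = ndb.get_multi(user.status_keys)
recent = [s for s in all_statuses if s.date > cutoff]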
As this is basically the same question as you have posed here, I suggest that you look at the answer I gave there.

Is it best to query by keys_only=True then get_multi or just full query?

I am using NDB with python 2.7 with threadsafe mode turned on.
I understand that querying for entities with NDB does not use local cache or memcache but goes straight to the datastore unlike getting by key name. (The rest of the question might be redundant if this premise is not correct.)
Therefore would a good paradigm be to only query with keys_only=True and then do a get_multi to obtain the full entities?
The benefits would be that keys_only=True queries are much faster than keys_only=False queries, that get_multi could potentially be served entirely from memcache, and that after calling get_multi your entities are cached in memcache in case you need to do the query again.
The drawbacks are that you now have a query RPC plus a get_multi call, and I think there is a limit to how many entities you can request in one get_multi, so your effective query size might be limited.
What do you think? Should we only ever query using keys_only=True then perform get_multi? Are there certain minimum and maximum query size limits that make this technique not as effective as just doing a query that returns the full entities?
This has been extensively researched. See http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118
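For reference, a minimal sketch of the pattern being discussed (the Article model is made up for illustration):

from google.appengine.ext import ndb

class Article(ndb.Model):       # hypothetical model for illustration
    published = ndb.BooleanProperty()

# Keys-only query (billed as small ops), then a batched get that NDB can
# serve from its caches where possible.
keys = Article.query(Article.published == True).fetch(500, keys_only=True)
articles = ndb.get_multi(keys)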

Optimizing join query performance in google app engine

Scenario
Entity1 (id,itmname)
Entity2 (id,itmname,price)
Entity3 (id,itmname,profit)
profit and price are both IntegerProperty
I want to count all the items with a price of more than 500 and a profit of more than 10.
I know this is a join operation and is not supported by Google App Engine. I tried my best to find a way other than executing the queries separately and performing the count myself, but I didn't find anything.
The reason for not executing the queries separately is query execution time: each query returns more than 50,000 records, so it takes nearly 20 seconds just to fetch the records from the first query.
Google App Engine developers have always focused on read optimization, and that is where denormalization comes in. Before you design your data structures, you should work through the ways in which the data could be retrieved; designing the models comes later. A closer look at the Google I/O session on
Building Scalable Web Applications with Google App Engine will prove helpful.
In the current situation, if you are interested only in the counts, you could use a sharded counter. It will require you to update every associated counter whenever the relevant fields change.
Another approach is a nightly scheduled task which does the heavy calculations and updates the counts and other stats you might need. You might find MapReduce helpful in this case. This approach will never give you real-time data.
The standard solution to this problem is denormalization. Try storing a copy of price and profit in Entity1 and then you can answer your question with a single, simple query on Entity1.
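A rough sketch of that denormalization with NDB: note that the datastore allows inequality filters on only one property per query, so one way to handle the price/profit condition is to precompute a boolean flag at write time (the high_margin property below is made up for illustration):

from google.appengine.ext import ndb

class Entity1(ndb.Model):
    itmname = ndb.StringProperty()
    price = ndb.IntegerProperty()    # denormalized copy from Entity2
    profit = ndb.IntegerProperty()   # denormalized copy from Entity3
    # Precomputed at write time, because a single datastore query can only
    # apply inequality filters to one property.
    high_margin = ndb.ComputedProperty(lambda self: self.price > 500 and self.profit > 10)

count = Entity1.query(Entity1.high_margin == True).count()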
