For 100k+ entities in Google Datastore, ndb.query().count() is going to be cancelled by the deadline, even with an index. I've tried the produce_cursors option, but only iter() and fetch_page() return a cursor; count() doesn't.
How can I count large numbers of entities?
To do something that expensive you should take a look at the Task Queue Python API. Based on the Task Queue API, Google App Engine provides the deferred library, which we can use to simplify the whole process of running background tasks.
Here is an example of how you could use the deferred library in your app:
import logging

def count_large_query(query):
    total = query.count()
    logging.info('Total entities: %d' % total)
Then you can call the above function from within your app like this:
from google.appengine.ext import deferred

# Somewhere in your request:
deferred.defer(count_large_query, MyModel.query())  # MyModel being your ndb model
While I'm still not sure whether count() will return any results with such a large datastore, you could use this count_large_query() function instead, which uses cursors (untested):
LIMIT = 1024

def count_large_query(query):
    cursor = None
    more = True
    total = 0
    while more:
        ndbs, cursor, more = query.fetch_page(LIMIT, start_cursor=cursor, keys_only=True)
        total += len(ndbs)
    logging.info('Total entities: %d' % total)
To try the above locally, set LIMIT to 4 and check whether the Total entities: ## line shows up in your console.
As Guido mentioned in the comments, this is not going to scale either:
This still doesn't scale (though it may postpone the problem). A task has a 10 minute deadline instead of a 1 minute one, so maybe you can count 10x as many entities. But it's pretty expensive! Have a search for sharded counters if you want to solve this properly (unfortunately it's a lot of work).
So you might want to take a look at the best practices for writing scalable applications, and especially at sharding counters.
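For reference, a minimal sharded counter along the lines of the documented pattern might look like this (CounterShard, NUM_SHARDS and the key scheme here are illustrative, not taken from the article):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, slightly slower reads

class CounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Spread writes across NUM_SHARDS entities so concurrent increments
    # rarely contend on the same entity group.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(name):
    # Reading the total is one cheap batch get of NUM_SHARDS small entities.
    keys = [ndb.Key(CounterShard, '%s-%d' % (name, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)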
This is indeed a frustrating issue. I've been doing some work in this area lately to get some general count stats - basically, the number of entities that satisfy some query. count() is a great idea, but it is hobbled by the datastore RPC timeout.
It would be nice if count() supported cursors somehow so that you could cursor across the result set and simply add up the resulting integers rather than returning a large list of keys only to throw them away. With cursors, you could continue across all 1-minute / 10-minute boundaries, using the "pass the baton" deferred approach. With count() (as opposed to fetch(keys_only=True)) you can greatly reduce the waste and hopefully increase the speed of the RPC calls, e.g., it takes a shocking amount of time to count to 1,000,000 using the fetch(keys_only=True) approach - an expensive proposition on backends.
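That baton-passing idea is easy to sketch with fetch_page (even though, as noted above, a cursor-aware count() would avoid shipping the keys back at all). MyModel and BATCH are placeholders; the cursor round-trips through its urlsafe string so it survives the defer:

import logging
from google.appengine.datastore.datastore_query import Cursor
from google.appengine.ext import deferred

BATCH = 1000  # page size; tune so each task finishes well inside its deadline

def count_in_batches(cursor_urlsafe=None, total=0):
    # Count one page of keys, then hand the baton to a fresh task.
    cursor = Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    keys, next_cursor, more = MyModel.query().fetch_page(
        BATCH, start_cursor=cursor, keys_only=True)
    total += len(keys)
    if more and next_cursor:
        deferred.defer(count_in_batches, next_cursor.urlsafe(), total)
    else:
        logging.info('Final count: %d', total)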
Sharded counters are a lot of overhead if you only need/want periodic count statistics (e.g., a daily count of all my accounts in the system by, e.g., country).
It is better to use Google App Engine backends.
Backends are exempt from the 60-second deadline for user requests and the 10-minute deadline for tasks, and can run indefinitely.
Please take a look at the document here: https://developers.google.com/appengine/docs/java/backends/overview
This is a simplified version of my code, in which each process crawls a link, gets the data, and stores it in the database, all in parallel.
import sys
import time
from multiprocessing import Pool

import requests

def crawl_and_save_data(url):
    while True:
        res = requests.get(url)
        price_list = res.json()
        if len(price_list) == 0:
            sys.exit()
        # Save all data in DB HERE
        # for price in price_list:
        #     Save price in PostgreSQL database table (same table)
        until_date = convert_date_format(price_list[-1]['candleDateTime'])
        time.sleep(1)

if __name__ == '__main__':
    # When executed with pure python
    pool = Pool()
    pool.map(
        crawl_and_save_data,
        get_bunch_of_url_list()
    )
The key point of this code is here:
# Save all data in DB HERE
# for price in price_list:
#     Save price in PostgreSQL database table (same table)
where each process accesses the same database table.
I wonder whether this kind of task hurts the concurrency of the job as a whole.
Or could data be lost because of the concurrent database accesses?
Or would all the queries just be put into some kind of I/O queue?
I need your advice. Thanks.
tl;dr - you should be fine, but the question doesn't include enough detail to answer definitively. You will need to run some tests, but you should expect to get a good amount of concurrency (a few dozen simultaneous writes) before things start to slow down.
Note though - as currently written, it seems like your workers will fetch the same URL over and over again, because the while True loop never breaks or exits. You detect whether the list is empty, but does the URL track state somehow? I would expect multiple, identical GETs to return the same data over and over...
As far as concurrency goes, that ultimately depends on:
The resources available to the database (memory, I/O, CPU)
The server-side resources consumed by each connection/operation.
That second point includes memory, etc., but also whether independent operations are competing for the same resources (are 10 different connections trying to update the same set of rows in the database?). Updating the same table is fine, more or less, because the database can use row-level locks.
Also note the difference between concurrency (how many things happen at once) and throughput (how many things happen within a period of time). Concurrency and throughput can relate to each other in counter-intuitive ways - it's not uncommon to see a situation where 1 process can do N operations per second, but M processes sustain far less than M x N operations per second, possibly even bringing the whole thing to a screeching halt (e.g., via a deadlock).
Thinking about your code snippet, here are some observations:
You are using multiprocessing.Pool, which uses sub-processes for concurrency and will work well for your case if you...
Make sure you open your connections in the sub-process; trying to re-use a connection from the parent process will not work
If you do nothing else to your code, you will be using a number of sub-processes equal to the number of cores on your db client machine
This is a good starting point. If a function is CPU-bound, you really can't go higher. If your function is I/O-bound, the CPU will be idle waiting for I/O operations to return. You can start ramping up the worker count in this case.
Thus, each sub-process will have a connection to the database, with some amount of server memory per connection.
This also means that each insert should be in isolated transactions, with no additional work on your part.
Given that, simple, append-only, row-by-row transactions should support relatively high concurrency and high throughput, again depending on how big and fast your DB server is.
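To make the "open the connection in the sub-process" point concrete, here's a minimal sketch assuming psycopg2 and a hypothetical price table (the tradePrice field is invented; candleDateTime comes from your snippet):

from multiprocessing import Pool

import psycopg2  # assumed driver; any DB-API driver behaves similarly
import requests

def crawl_and_save(url):
    # Open the connection inside the worker: connections and their
    # sockets must not be inherited from the parent process.
    conn = psycopg2.connect('dbname=market user=crawler')  # hypothetical DSN
    try:
        for price in requests.get(url).json():
            # 'with conn' commits one small transaction per insert.
            with conn, conn.cursor() as cur:
                cur.execute(
                    'INSERT INTO price (candle_time, value) VALUES (%s, %s)',
                    (price['candleDateTime'], price['tradePrice']),
                )
    finally:
        conn.close()

if __name__ == '__main__':
    pool = Pool()  # one worker per CPU core by default
    pool.map(crawl_and_save, get_bunch_of_url_list())  # helper from the question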
Also, note that you are already queueing :) With no args, Pool() creates a number of child processes equal to os.cpu_count() (see the docs).
If that's greater than the number of URLs in your collection, that collection is a queue of sorts, just not a durable one. If your master process dies, the list of URLs is gone.
Unrelated - unless you are worried about your URL fetches getting throttled, from a db perspective, there is no need for the time.sleep(1) statement.
Hope this helps.
I'm seeing very poor performance when fetching multiple keys from Memcache using ndb.get_multi() in App Engine (Python).
I am fetching ~500 small objects, all of which are in memcache. If I do this using ndb.get_multi(keys), it takes 1500ms or more. Here is typical output from App Stats:
[two App Stats screenshots omitted]
As you can see, all the data is served from memcache. Most of the time is reported as being outside of RPC calls. However, my code is about as minimal as you can get, so if the time is spent on CPU it must be somewhere inside ndb:
# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get ~500 small items from memcache. This is very slow (~1500ms).
items = ndb.get_multi(item_keys)
The first memcache.get you see in App Stats is the single fetch to get a set of keys. The second memcache.get is the ndb.get_multi call.
The items I am fetching are super-simple:
class Item(ndb.Model):
    name = ndb.StringProperty(indexed=False)
    image_url = ndb.StringProperty(indexed=False)
    image_width = ndb.IntegerProperty(indexed=False)
    image_height = ndb.IntegerProperty(indexed=False)
Is this some kind of known ndb performance issue? Something to do with deserialization cost? Or is it a memcache issue?
I found that if instead of fetching 500 objects, I instead aggregate all the data into a single blob, my function runs in 20ms instead of >1500ms:
# Get set of keys for items. This runs very quickly.
item_keys = memcache.get(items_memcache_key)
# Get individual item data.
# If we get all the data from memcache as a single blob it is very fast (~20ms).
item_data = memcache.get(items_data_key)
if not item_data:
    items = ndb.get_multi(item_keys)
    flat_data = json.dumps([{'name': item.name} for item in items])
    memcache.add(items_data_key, flat_data)
This is interesting, but isn't really a solution for me since the set of items I need to fetch isn't static.
Is the performance I'm seeing typical/expected? All these measurements are on the default App Engine production config (F1 instance, shared memcache). Is it deserialization cost? Or due to fetching multiple keys from memcache maybe?
I don't think the issue is instance ramp-up time. I profiled the code line by line using time.clock() calls and I see roughly similar numbers (3x faster than what I see in AppStats, but still very slow). Here's a typical profile:
# Fetch keys: 20 ms
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
Update: out of interest, I also profiled this with all the App Engine performance settings increased to maximum (F4 instance, 2400MHz, dedicated memcache). The performance wasn't much better. On the faster instance the App Stats timings now match my time.clock() profile (so 500ms to fetch 500 small objects instead of 1500ms). However, it still seems extremely slow.
I investigated this in a bit of detail, and the problem is ndb and Python, not memcache. The reason things are so incredibly slow is partly deserialization (explains about 30% of the time), and the rest seems to be overhead in ndb's task queue implementation.
This means that, if you really want to, you can avoid ndb and instead fetch and deserialize from memcache directly. In my test case with 500 small entities, this gives a massive 2.5x speedup (650ms vs 1600ms on an F1 instance in production, or 200ms vs 500ms on an F4 instance).
This gist shows how to do it:
https://gist.github.com/mcummins/600fa8852b4741fb2bb1
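For reference, the core of the idea is roughly the following (the 'NDB9:' memcache key prefix and ModelAdapter are undocumented ndb internals and may differ between SDK versions, so treat this purely as a sketch and use the gist for the tested details):

from google.appengine.api import memcache
from google.appengine.datastore import entity_pb
from google.appengine.ext import ndb

def fast_get_multi(keys):
    # Fetch the serialized entities straight from memcache, bypassing
    # ndb's context and futures machinery.
    raw = memcache.get_multi([k.urlsafe() for k in keys], key_prefix='NDB9:')
    adapter = ndb.model.ModelAdapter()
    # Deserialize each protobuf directly into a model instance.
    return [adapter.pb_to_entity(entity_pb.EntityProto(value))
            for value in raw.itervalues()]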
Here is the appstats output for the manual memcache fetch and deserialization:
[App Stats screenshot omitted]
Now compare this to fetching exactly the same entities using ndb.get_multi(keys):
[App Stats screenshot omitted]
Almost a 3x difference!
Profiling each step is shown below. Note the timings don't match appstats because they're running on an F1 instance, so real time is 3x clock time.
Manual version:
# memcache.get_multi: 50.0 ms
# Deserialization: 140.0 ms
# Number of keys is 521, fetch time per key is 0.364683301344 ms
vs ndb version:
# ndb.get_multi: 500 ms
# Number of keys is 521, fetch time per key is 0.96 ms
So ndb takes 1ms per entity fetched, even if the entity has one single property and is in memcache. That's on an F4 instance. On an F1 instance it takes 3ms. This is a serious practical limitation: if you want to maintain reasonable latency, you can't fetch more than ~100 entities of any kind when handling a user request on an F1 instance.
Clearly ndb is doing something really expensive and (at least in this case) unnecessary. I think it has something to do with its task queue and all the futures it sets up. Whether it is worth going around ndb and doing things manually depends on your app. If you have some memcache misses then you will have to go do the datastore fetches. So you essentially end up partly reimplementing ndb. However, since ndb seems to have such massive overhead, this may be worth doing. At least it seems so based on my use case of a lot of get_multi calls for small objects, with a high expected memcache hit rate.
It also seems to suggest that if Google were to implement some key bits of ndb and/or deserialization as C modules, Python App Engine could be massively faster.
I am writing an application that uses a remote API that serves up a fairly static data (but still can update several times a day). The problem is that the API is quite slow, and I'd much rather import that data into my own datastore anyway, so that I can actually query the data on my end as well.
The problem is that the results contain ~700 records that need to be sync'd every 5 hours or so. This involves adding new records, updating old records and deleting stale ones.
I have a simple solution that works -- but it's slow as molasses, and uses 30,000 datastore read operations before it times out (after about 500 records).
The worst part about this is that the 700 records are for a single client, and I was doing it as a test. In reality, I would want to do the same thing for hundreds or thousands of clients with a similar number of records... you can see how that is not going to scale.
Here is my entity class definition:
class Group(ndb.Model):
    groupid = ndb.StringProperty(required=True)
    name = ndb.StringProperty(required=True)
    date_created = ndb.DateTimeProperty(required=True, auto_now_add=True)
    last_updated = ndb.DateTimeProperty(required=True, auto_now=True)
Here is my sync code (Python):
currentTime = datetime.now()
groups = get_list_of_groups_from_api(clientid)  # [{'groupname': 'Group Name', 'id': '12341235'}, ...]

for group in groups:
    groupid = group["id"]
    groupObj = Group.get_or_insert(groupid, groupid=group["id"], name=group["name"])
    groupObj.put()

staleGroups = Group.query(Group.last_updated < currentTime)
for staleGroup in staleGroups:
    staleGroup.key.delete()
I can't tell you why you are getting 30,000 read operations.
You should start by running appstats and profiling this code, to see where the datastore operations are being performed.
That being said I can see some real inefficiencies in your code.
For instance your delete stale groups code is horribly inefficient.
You should be doing a keys_only query, and then doing batch deletes.
What you are doing is really slow with lots of latency for each delete() in the loop.
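For example, a sketch of the cleanup using the names from your code:

# One RPC to fetch just the keys, then one batch RPC to delete them all.
stale_keys = Group.query(Group.last_updated < currentTime).fetch(keys_only=True)
ndb.delete_multi(stale_keys)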
Also, get_or_insert uses a transaction (and if the group didn't exist, a put is already done, and then you do a second put()); if you don't need transactions, you will find things run faster. The fact that you are not storing any additional data means you could just blind-write the groups (so no initial get/read), unless you want to preserve date_created.
Other ways of making this faster would be by doing batch gets/puts on the list of keys.
Then for all the entities that didn't exist, do a batch put()
Again this would be much faster than iterating over each key.
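A sketch of the batched version, reusing the Group model from the question (every entity is re-put here so that auto_now refreshes last_updated; skip unchanged ones if you track staleness differently):

keys = [ndb.Key(Group, g["id"]) for g in groups]
entities = ndb.get_multi(keys)  # one batch read

to_put = []
for entity, group in zip(entities, groups):
    if entity is None:
        entity = Group(id=group["id"], groupid=group["id"], name=group["name"])
    else:
        entity.name = group["name"]
    to_put.append(entity)

ndb.put_multi(to_put)  # one batch write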
In addition, you should use a TaskQueue to run this code; you then have a 10 min processing window.
After that further scaling can be achieved by splitting the process into two tasks. The first creates/updates the group entities. Once that completes you start the task that deletes stale groups - passing the datetime as an argument to the next task.
If you have even more entities than can be processed in this simple model then start looking at MapReduce.
But for starters just concentrate on making the job you are currently running more efficient.
I'm building an application based around a task queue: it serves a series of tasks to multiple, asynchronously connected clients. The twist is that the tasks must be served in a random order.
My problem is that the algorithm I'm using now is computationally expensive, because it relies on many large queries and transfers from the database. I have a strong hunch that there's a cheaper way to achieve the same result, but I can't quite see the solution. Can you think of a clever fix for this problem?
Here's the (computationally expensive) algorithm I'm using now:
When the client queries for a new task...
1. Query the database for "unfinished" tasks
2. Put all tasks in a list
3. Shuffle the list (using random.shuffle)
4. Flag the first task as "in progress"
5. Send the task parameters to the client for completion
When the client finishes the task...
6a. Record the result and flag the task as "finished."
If the client fails to finish the task by some deadline...
6b. Re-flag the task as "unfinished."
Seems like we could do better by replacing steps 1, 2, and 3, with pseudorandom sequences or hash functions. But I can't quite figure out the whole solution. Ideas?
Other considerations:
In case it's important, I'm using python and mongodb for all of this. (Mongodb doesn't have some clever "use find_one to efficiently return a random matching entry" usage, does it?)
The term "queue" is a little misleading. All the tasks are stored in subfields of a single collection within the mongodb. The length (total number of tasks) in the collection is known and fixed at the outset.
If it's necessary, it might be okay to let the same task be assigned multiple times, as long as the occurrence is rare. But instances of this kind would need to be very rare, because completing each task is costly.
I have identifying information on each client, so we know exactly who originates each task request.
There is an easy way to get a random document from MongoDB!
See Random record from MongoDB
If you don't want a task to be picked twice, you could mark the task as active and not select it.
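As a sketch of that combination in pymongo (the tasks collection and its status field are assumptions about your schema):

import random
from pymongo import MongoClient

tasks = MongoClient().taskdb.tasks  # hypothetical database/collection

def claim_random_task():
    # Skip to a random offset among the unfinished tasks...
    n = tasks.count_documents({'status': 'unfinished'})
    if n == 0:
        return None
    doc = next(tasks.find({'status': 'unfinished'})
                    .skip(random.randrange(n)).limit(1))
    # ...then atomically flag it in-progress; this returns None if another
    # client grabbed the same document between the find and the update.
    return tasks.find_one_and_update(
        {'_id': doc['_id'], 'status': 'unfinished'},
        {'$set': {'status': 'in-progress'}})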
Ah, based on the comments that I missed, you can do something along these lines:
import random

available = list(range(lengthofdatabase))
inprogress = []

while len(available) > 0:
    taskindex = available.pop(random.randrange(0, len(available)))
    # I'm not sure of your implementation, but you said something
    # along these lines was possible
    task = GetTask(taskindex)
    inprogress.append(taskindex)
I'm not sure of any of the functions you are using - this is just an algorithm.
Happy Coding!
I've got 3 parts to this question:
I have an application where users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and are invalid. I'm storing the objects as entities. To do the timeout, I have a cron job that runs once a minute to clear out the expired objects.
Most of the time right now, I don't have any active objects. In this case, the mapreduce handler checks the entity it gets, and does nothing if it's not active, no writes. However, my free datastore write quota is running out from the mapreduce calls after about 7 hours. According to my rough estimate, it looks like just running mapreduce causes ~ 120 writes/call. (Rough math, 60 calls/hr * 7 hr = 420 calls, 50k ops limit / 420 calls ~ 120 writes/call)
Q1: Can anyone verify that just running mapreduce triggers ~120 datastore writes?
To get around it, I'm checking the datastore before I kick off the mapreduce:
def cronhandler():
    count = model.all(keys_only=True).count(limit=1000)
    if count:
        shards = (count / 100) + 1
        from mapreduce import control
        control.start_map("Timeout open objects",
                          "expire.maphandler",
                          "expire.OpenOrderInputReader",
                          {'entity_kind': 'model'},
                          shard_count=shards)
    return HttpResponse()
Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure mapreduce to avoid extraneous writes? I was thinking potentially it was possible with a better custom InputReader
Q3: I'm guessing more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting shards by the expected number of objects I need to write appropriately?
What if you kept your objects on memcache instead of the datastore? My only worry is whether a memcache is consistent between all instances running a given application, but, if it is, the problem has a very neat solution.
This doesn't exactly answer your question, but could you reduce the frequency of the cron job?
Instead of deleting models as soon as they become invalid, simply remove them from the queries that your Users see.
For example:
import datetime

now = datetime.datetime.now()
five_minutes_ago = now - datetime.timedelta(minutes=5)

q = model.all()
q.filter('created_at >=', five_minutes_ago)
Or if you don't want to use an inequality filter you could use == based on five minute blocks.
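A sketch of that equality-filter variant, assuming you add a hypothetical bucket property that is set when the entity is created:

import datetime

def five_minute_bucket(dt):
    # Round down to the start of the containing 5-minute block,
    # e.g. 12:37:41 -> 12:35:00.
    return dt.replace(minute=dt.minute - dt.minute % 5, second=0, microsecond=0)

# On write: entity.bucket = five_minute_bucket(datetime.datetime.now())
# On read, live objects sit in the current or the previous block:
now = datetime.datetime.now()
q = model.all()
q.filter('bucket IN', [five_minute_bucket(now),
                       five_minute_bucket(now - datetime.timedelta(minutes=5))])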
Then, you run your cron every hour or so to clean out the inactive models.
The downside to this approach is that the entities would be returned by a key-only fetch, in which case you would need to verify that they were still valid before returning them to the user.
I'm assuming what I've done is the best way to go about doing things. It looks like the Mapreduce API uses the datastore to keep track of the jobs launched and synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but that reduces wall time performance as well.