How to query and update entries in MongoDB (Pymongo) effectively? - python

I have a function that takes a document from a MongoDB collection and may or may not return a vector [1,2,3,4...]. I need to fetch all the documents that match a query and then update each of them based on that result. The code looks like this:
def return_vector(q):
    # condition = my_logic
    if condition:
        return vector
    else:
        return None

for q in db.coll_name.find({'a.b.c':{'$gt':0.4}}):
    vector = return_vector(q)
    # only write back when a vector was actually produced
    if vector is not None:
        db.coll_name.update_one({'_id': q['_id']},{'$set': {'vector_v1': vector}})
The problem is that this takes too much time. How can I do it in a more optimized way?

You issue one update per document, which could be one of the reasons for the slowness.
You can combine multiple updates and send them to the server in one request using bulk_write. This significantly reduces the number of network round trips.
This assumes you don't have multiple write requests for the same document at the same time.
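A minimal sketch of that idea, assuming PyMongo 3 or later is installed. The names batched and update_vectors, and the batch size of 1000, are illustrative choices rather than anything from the question; return_vector is the asker's own function and is passed in as an argument.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items taken from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def update_vectors(collection, return_vector, batch_size=1000):
    """Compute a vector per matching document and write the results in batches.

    `collection` is assumed to be a pymongo Collection.
    """
    from pymongo import UpdateOne  # local import: only this function needs pymongo

    def operations():
        for doc in collection.find({'a.b.c': {'$gt': 0.4}}):
            vector = return_vector(doc)
            if vector is not None:
                yield UpdateOne({'_id': doc['_id']},
                                {'$set': {'vector_v1': vector}})

    modified = 0
    for chunk in batched(operations(), batch_size):
        # One request per chunk instead of one request per document.
        result = collection.bulk_write(chunk, ordered=False)
        modified += result.modified_count
    return modified
```

Passing ordered=False lets the server continue past an individual failed update. Whether that is acceptable depends on the application.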

Related

running mongo queries against data in memory

I have a mongodb collection against which I need to run many count operations (each with a different query) every hour. When I first set this up, the collection was small, and these count operations ran in approx one minute, which was acceptable. Now they take approx 55 minutes, so they're running nearly continuously.
The query associated with each count operation is rather involved, and I don't think there's a way to get them all to run with indices (i.e. as COUNT_SCAN operations).
The only feasible solution I've come up with is to:
Run a full collection scan every hour, pulling every document out of the db
Once each document is in memory, run all of the count operations against it myself
Without my solution the server is running dozens and dozens of full collection scans each hour. With my solution the server is only running one. This has led me to a strange place where I need to take my complex queries and re-implement them myself so I can come up with my own counts every hour.
So my question is whether there's any support from mongo drivers (pymongo in my case, but I'm curious in general) in interpreting query documents but running them locally against data in memory, not against data on the mongodb server.
Initially this felt like an odd request, but there are actually quite a few places where this approach would probably greatly lessen the load on the database in my particular use case. So I wonder if it comes up from time to time in other production deployments.
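The single-scan idea described above can be sketched roughly as follows. This is only an illustration: the predicates are hypothetical stand-ins written as plain Python callables (not MongoDB query documents), and docs is a small in-memory list where a real setup would iterate over a cursor such as collection.find().

```python
def count_matches(docs, predicates):
    """Scan `docs` once and count, per named predicate, how many documents match."""
    counts = {name: 0 for name in predicates}
    for doc in docs:
        for name, matches in predicates.items():
            if matches(doc):
                counts[name] += 1
    return counts


# Hypothetical predicates standing in for the real, more involved queries.
predicates = {
    "active_adults": lambda d: bool(d.get("active")) and d.get("age", 0) >= 18,
    "no_email": lambda d: d.get("email") is None,
}

docs = [
    {"age": 30, "active": True, "email": "a@example.com"},
    {"age": 15, "active": True},
    {"age": 42, "active": False, "email": None},
]

counts = count_matches(docs, predicates)
```

Each document is fetched once and checked against every predicate, so the database performs one collection scan no matter how many counts are needed.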
MongoDB In-Memory storage engine
If you want to process data with complex queries entirely in RAM while keeping MongoDB's query syntax, you can configure MongoDB to use the In-Memory storage engine, which avoids disk I/O altogether.
In my opinion this is the best option: you keep the complex queries and get the best performance.
Python in-memory databases:
You can use one of the following:
PyDbLite - a fast, pure-Python, untyped, in-memory database engine, using Python syntax to manage data, instead of SQL
TinyDB - if you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you. It is not a fast solution, though, and it has a few other disadvantages.
They should allow working with data directly in RAM, but I'm not sure if this is better than the previous option.
Own custom solution (e.g. written in Python)
Some services handle data in RAM purely at the application level. If your solution is not complicated and the queries are simple, this is fine. Over time, though, queries tend to become more complicated and the code starts to need an abstraction layer (for advanced CRUD), much like the databases above.
This last option can give the best performance, but it takes more time to develop and support.
As you are using Python, have you considered pandas? You could transform your JSON data into a pandas DataFrame and query it as you like, with operations such as count, group by and aggregate. Please take a look at the docs. Here is a small example:
import pandas as pd

data = {
    "data_points": [
        {"a": 1, "b": 3, "c": 2},
        {"a": 3, "b": 2, "c": 1},
        {"a": 5, "b": 4, "d": 3}
    ]
}
# convert JSON to a data frame
# (pd.json_normalize replaces the deprecated pandas.io.json.json_normalize)
df = pd.json_normalize(data["data_points"])
Now you can perform operations such as sum and count on it.
Example:
# sum of column `a`
df['a'].sum()
output: 9
# sum of column `c`, which contains a null value
df['c'].sum()
output: 3.0
# count of non-null values in column `c`
df['c'].count()
output: 2
Here's the code I have currently to solve this problem. I have enough tests running against it to qualify it for my use case, but it's probably not 100% correct. I certainly don't handle all possible query documents.
import operator
from collections.abc import Mapping

# `Mongo` (operator-name constants) and `nested_dict_get` (key lookup in the doc)
# are defined elsewhere and are not shown here.


def check_doc_against_mongo_query(doc, query):
    """Return whether the given doc would be returned by the given query.

    Initially this might seem like work the db should be doing, but consider a use case where we
    need to run many complex queries regularly to count matches. If each query results in a full-
    collection scan, it is often faster to run a single scan fetching the entire collection into
    memory, then run all of the matches locally.

    We don't support mongo's full query syntax here, so we'll need to add support as the need
    arises."""
    # Run our check recursively
    return _match_query(doc, query)


def _match_query(doc, query):
    """Return whether the given doc matches the given query."""
    # We don't expect a null query
    assert query is not None
    # Check each top-level field for a match, we AND them together, so return on mismatch
    for k, v in query.items():
        # Check for AND/OR operators
        if k == Mongo.AND:
            if not all(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.OR:
            if not any(_match_query(doc, x) for x in v):
                return False
        elif k == Mongo.COMMENT:
            # Ignore comments
            pass
        else:
            # Now grab the doc's value and match it against the given query value
            doc_v = nested_dict_get(doc, k)
            if not _match_doc_and_query_value(doc_v, v):
                return False
    # All top-level fields matched so return match
    return True


def _match_doc_and_query_value(doc_v, query_v):
    """Return whether the given doc and query values match."""
    cmps = []  # we AND these together below, trailing bool for negation
    # Check for operators
    if isinstance(query_v, Mapping):
        # To handle 'in' we use a tuple, otherwise we use an operator and a value
        for k, v in query_v.items():
            if k == Mongo.IN:
                cmps.append((operator.eq, tuple(v), False))
            elif k == Mongo.NIN:
                cmps.append((operator.eq, tuple(v), True))
            else:
                op = {
                    Mongo.EQ: operator.eq, Mongo.GT: operator.gt, Mongo.GTE: operator.ge,
                    Mongo.LT: operator.lt, Mongo.LTE: operator.le, Mongo.NE: operator.ne,
                }[k]
                cmps.append((op, v, False))
    else:
        # We expect a simple value here, perform an equality check
        cmps.append((operator.eq, query_v, False))
    # Now perform each comparison
    return all(_invert(_match_cmp(op, doc_v, v), invert) for op, v, invert in cmps)


def _invert(result, invert):
    """Invert the given result if necessary."""
    return not result if invert else result


def _match_cmp(op, doc_v, v):
    """Return whether the given values match with the given comparison operator.

    If v is a tuple then we require op to match with any element.

    We take care to handle comparisons with null the same way mongo does, i.e. only null ==/<=/>=
    null returns true, all other comps with null return false. See:
    https://stackoverflow.com/questions/29835829/mongodb-comparison-operators-with-null
    for details.

    As an important special case of null comparisons, ne null matches any non-null value.
    """
    if doc_v is None and v is None:
        return op in (operator.eq, operator.ge, operator.le)
    elif op is operator.ne and v is None:
        return doc_v is not None
    elif v is None:
        return False
    elif isinstance(v, tuple):
        return any(op(doc_v, x) for x in v)
    else:
        return op(doc_v, v)
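The code above refers to a Mongo namespace and a nested_dict_get helper that are not shown in the post. The following is a guess at what they might look like, not the author's actual definitions: the constant values are the standard MongoDB operator names, and the lookup only walks nested mappings (it does not reproduce MongoDB's array traversal rules).

```python
from collections.abc import Mapping


class Mongo:
    """Operator names used by the matcher (assumed values, not shown in the post)."""
    AND, OR, COMMENT = "$and", "$or", "$comment"
    IN, NIN = "$in", "$nin"
    EQ, NE = "$eq", "$ne"
    GT, GTE, LT, LTE = "$gt", "$gte", "$lt", "$lte"


def nested_dict_get(doc, dotted_key, default=None):
    """Follow a dotted path such as 'a.b.c' through nested mappings.

    Returns `default` as soon as a path component is missing.
    """
    current = doc
    for part in dotted_key.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return default
        current = current[part]
    return current
```

Returning None for a missing field fits the matcher's null handling, where a missing value is compared as if it were null.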
Maybe you could try another approach?
MongoDB performs really badly at counting, especially on big collections.
I had a pretty similar problem at my last company, and what we did was create a "counters" document and update it on every write to the data.
This way, you avoid counting altogether.
The document would be something like:
{
    query1count: 12,
    query2count: 512312,
    query3count: 6
}
If query1count corresponds to the query "all documents where userId = 13", then your Python layer can check, before creating or updating a document, whether userId = 13 and, if so, increment that counter.
It does add a lot of extra complexity to your code, but reading the counters is O(1).
Of course, not all queries will be that easy, but you can cut the execution time a lot with this approach.
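As a rough sketch of the bookkeeping this implies: the rule names and fields below are hypothetical, and the function only computes how each counter should change for a single write. Applying the result to the counters document (for example with a $inc update) is left to the caller.

```python
# Hypothetical counter definitions: counter name -> predicate over a document.
COUNTER_RULES = {
    "query1count": lambda doc: doc.get("userId") == 13,
    "query2count": lambda doc: doc.get("status") == "open",
}


def counter_deltas(old_doc, new_doc):
    """Return how each counter changes when old_doc is replaced by new_doc.

    Pass old_doc=None for an insert and new_doc=None for a delete.
    Counters that do not change are left out of the result.
    """
    deltas = {}
    for name, rule in COUNTER_RULES.items():
        before = old_doc is not None and bool(rule(old_doc))
        after = new_doc is not None and bool(rule(new_doc))
        if before != after:
            deltas[name] = int(after) - int(before)
    return deltas
```

For an update, the old version of the document has to be known as well, otherwise a document that stops matching a rule would never be subtracted from its counter.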

Django: queryset.count() is significantly slower on chained filters than single filters regardless of returned query size--is there a solution?

EDIT: Best solution thanks to Hakan--
queriedForms.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
I tried more of his suggestions but I can't avoid an INNER JOIN. This seems to be a stable solution that gets me small but predictable speed increases across the board. Look through his answer for more details!
I've been struggling with a problem I haven't seen an answer to online.
When chaining two filters in Django e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
masterQuery.count()
#returns 100,000 results in < 1 second
#test filter--all 100,000+ names have "test x" where x is 0-9
storedCount = masterQuery.filter(name__contains="9").count()
#returns ~50,000 results but takes 5-6 seconds
Trying a slightly different way:
masterQuery = masterQuery.filter(name__contains="9")
masterQuery.count()
#also returns ~50,000 results in 5-6 seconds
Performing an & merge seems to improve performance ever so slightly, e.g.
masterQuery = bigmodel.relatedmodel_set.all()
masterQuery = masterQuery.filter(name__contains="test")
(masterQuery & masterQuery.filter(name__contains="9")).count()
It seems as if count takes significantly longer once there is more than a single filter in a queryset.
I assume it may have something to do with MySQL, which apparently doesn't like nested statements, and that two filters create a nested query that slows MySQL down regardless of the SELECT COUNT(*) Django uses.
So my question is: is there any way to speed this up? I'm getting ready to do a lot of regular nested querying using only queryset counts (I don't need the actual model values), without database hits to load the models. E.g. I don't need to load 100,000 models from the database, I just need to know there are 100,000 there. It's obviously much faster to do this through querysets than with len(), but at 5 seconds per count, running 40 counts for an entire complex query takes 3+ minutes, and I'd prefer it to be under a minute. Am I just fantasizing, or does someone have a suggestion as to how this could be accomplished without increasing the server's processor speed?
EDIT: If it's helpful--the time.clock() speed is .3 secs for the chained filter() count--the actual time to console and django view output is 5-6s
EDIT2: To answer any questions about indexing, the filters use both an indexed and non indexed value for each link in the chain:
mainQuery = masterQuery = bigmodel.relatedmodel_set.all()
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Where "record_attribute_type" is another foreign key being used as a filter
mainQuery.count() #produces 100,000 results in < 1sec
mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="9", reverseforeignkeytestmodel__record_attribute_type__pk=5).count()
#produces ~50,000 results in 5-6 secs
So each filter in the chain is functionally similar, it is an AND filter(condition,condition) where one condition is indexed, and the other is not. I can't index both conditions.
Edit 3:
Similar queries that result in smaller results, e.g. < 10,000 are much faster, regardless of the nesting--e.g. the first filter in the chain produces 10,000 results in ~<1sec but the second filter in the chain will produce 5,000 results in ~<1sec
Edit 4:
Still not working based on @Hakan's solution
mainQuery = bigmodel.relatedmodel_set.all()
#Setup the first filter as normal
mainQuery = mainQuery.filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=1)
#Grab a values list for the second chained filter instead of chaining it
values = bigmodel.relatedmodel_set.all().filter(reverseforeignkeytestmodel__record_value__contains="test", reverseforeignkeytestmodel__record_attribute_type__pk=8).values_list('pk', flat=True)
#filter the first query based on the values_list rather than a second filter
mainQuery = mainQuery.filter(pk__in=values)
mainQuery.count()
#Still takes on average the same amount of time after enough test runs--seems to be slightly faster than average--similar to the (querysetA & querysetB) merge solution I tried.
It's possible I did this wrong--but the count results are consistent between the new value_list filter technique, e.g. I'm getting the same # of results. So it's definitely working--but seemingly taking the same amount of time
EDIT 5:
Also based on @Hakan's solution with some slight tweaks
mainQuery.filter(pk__in=list(formtype.form_set.all().filter(formrecordattributevalue__record_value__contains=constraint['TVAL'], formrecordattributevalue__record_attribute_type__pk=rtypePK).values_list('pk', flat=True))).count()
This seems to operate faster for larger results in a queryset, e.g. > 50,000, but is actually much slower on smaller queryset results, e.g. < 50,000--where they used to be <1sec--sometimes 2-3 running in 1 second for chain filtering, they now all take 1 second individually. Essentially the speed gains in the larger queryset have been nullified by the speed loss in the smaller querysets.
I'm still going to try and break up the queries as per his suggestion further--but I'm not sure I'm able to. I'll update again(possibly on Monday) when I figure that out and let everyone interested know the progress.
Not sure if this helps, since I don't have a mysql project to test with.
The QuerySet API reference contains a section about the performance of nested queries.
Performance considerations
Be cautious about using nested queries and understand your database
server’s performance characteristics (if in doubt, benchmark!). Some
database backends, most notably MySQL, don’t optimize nested queries
very well. It is more efficient, in those cases, to extract a list of
values and then pass that into the second query. That is, execute two
queries instead of one:
values = Blog.objects.filter(
    name__contains='Cheddar').values_list('pk', flat=True)
entries = Entry.objects.filter(blog__in=list(values))
Note the list() call around the Blog QuerySet to force execution of the first query.
Without it, a nested query would be executed, because QuerySets are
lazy.
So, maybe you can improve the performance by trying something like this:
masterQuery = bigmodel.relatedmodel_set.all()
pks = list(masterQuery.filter(name__contains="test").values_list('pk', flat=True))
count = masterQuery.filter(pk__in=pks, name__contains="9").count()
Since your initial MySQL performance is so slow, it might even be faster to do the second step in Python instead of in the database.
# flat=True yields plain strings instead of 1-tuples, so the substring test works
names = masterQuery.filter(name__contains='test').values_list('name', flat=True)
count = sum('9' in n for n in names)
Edit:
From your updates, I see that you are querying fields in related models, which results in multiple SQL JOIN operations. That's likely a big reason why the query is slow.
To avoid joins, you could try something like this. The goal is to avoid doing deeply chained lookups across relations.
# query only RelatedModel, avoid JOIN
related_pks = RelatedModel.objects.filter(
    record_value__contains=constraint['TVAL'],
    record_attribute_type=rtypePK,
).values_list('pk', flat=True)
# list(queryset) will do a database query, resulting in a list of integers.
pks_list = list(related_pks)
# use that result to filter your main model.
count = MainModel.objects.filter(
    formrecordattributevalue__in=pks_list
).count()
I'm assuming that the relation is defined as a foreign key from MainModel to RelatedModel.

GAE NDB model sequential ID [duplicate]

I have to label something in a "strong monotone increasing" fashion, be it invoice numbers, shipping label numbers or the like.
A number MUST NOT be used twice.
Every number SHOULD be used exactly when all smaller numbers have been used (no holes).
Fancy way of saying: I need to count 1,2,3,4 ...
The number space I have available is typically 100.000 numbers, and I need perhaps 1000 a day.
I know this is a hard problem in distributed systems and that we are often much better off with GUIDs. But in this case, for legal reasons, I need "traditional numbering".
Can this be implemented on Google AppEngine (preferably in Python)?
If you absolutely have to have sequentially increasing numbers with no gaps, you'll need to use a single entity, which you update in a transaction to 'consume' each new number. You'll be limited, in practice, to about 1-5 numbers generated per second - which sounds like it'll be fine for your requirements.
If you drop the requirement that IDs must be strictly sequential, you can use a hierarchical allocation scheme. The basic idea/limitation is that transactions must not affect multiple storage groups.
For example, assuming you have the notion of "users", you can allocate a storage group for each user (creating some global object per user). Each user has a list of reserved IDs. When allocating an ID for a user, pick a reserved one (in a transaction). If no IDs are left, make a new transaction allocating 100 IDs (say) from the global pool, then make a new transaction to add them to the user and simultaneously withdraw one. Assuming each user interacts with the application only sequentially, there will be no concurrency on the user objects.
The gaetk - Google AppEngine Toolkit now comes with a simple library function to get a number in a sequence. It is based on Nick Johnson's transactional approach and can be used quite easily as a foundation for Martin von Löwis' sharding approach:
>>> from gaetk.sequences import *
>>> init_sequence('invoce_number', start=1, end=0xffffffff)
>>> get_numbers('invoce_number', 2)
[1, 2]
The functionality is basically implemented like this:
def _get_numbers_helper(keys, needed):
    results = []
    for key in keys:
        seq = db.get(key)
        start = seq.current or seq.start
        end = seq.end
        avail = end - start
        consumed = needed
        if avail <= needed:
            seq.active = False
            consumed = avail
        seq.current = start + consumed
        seq.put()
        results += range(start, start + consumed)
        needed -= consumed
        if needed == 0:
            return results
    raise RuntimeError('Not enough sequence space to allocate %d numbers.' % needed)

def get_numbers(needed):
    query = gaetkSequence.all(keys_only=True).filter('active = ', True)
    return db.run_in_transaction(_get_numbers_helper, query.fetch(5), needed)
If you aren't too strict about the numbers being sequential, you can "shard" your incrementer. This could be thought of as an "eventually sequential" counter.
Basically, you have one entity that is the "master" count. Then you have a number of entities (based on the load you need to handle) that have their own counters. These shards reserve chunks of ids from the master and serve out from their range until they run out of values.
Quick algorithm:
You need to get an ID.
Pick a shard at random.
If the shard's start is less than its end, take its start and increment it.
If the shard's start is equal to (or, oh-oh, greater than) its end, go to the master, take its value and add an amount n to it. Set the shard's start to the retrieved value plus one and its end to the retrieved value plus n.
This can scale quite well; however, the amount you can be out by is the number of shards multiplied by your n value. If you want your records to appear to go up this will probably work, but if you want them to represent order it won't be accurate. It is also important to note that the latest values may have holes, so if you are using them to scan for some reason you will have to mind the gaps.
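The algorithm above can be simulated in plain Python to see how it behaves. This is an in-memory sketch only, not Datastore code: the class name and parameters are made up for illustration, ranges are half-open ([start, end)) for simplicity, and a real implementation would have to update the master counter and each shard inside transactions.

```python
import random


class ShardedIdAllocator:
    """In-memory simulation of shards that reserve ID blocks from a master counter."""

    def __init__(self, num_shards=3, block_size=10, first_id=1, rng=None):
        self.next_block_start = first_id  # the "master" count
        self.block_size = block_size      # the amount n reserved per request
        # Each shard holds a half-open range [start, end) of reserved IDs.
        self.shards = [[0, 0] for _ in range(num_shards)]
        self.rng = rng or random.Random()

    def next_id(self):
        shard = self.rng.choice(self.shards)  # pick a shard at random
        if shard[0] >= shard[1]:
            # Shard is exhausted: reserve a fresh block from the master.
            shard[0] = self.next_block_start
            shard[1] = shard[0] + self.block_size
            self.next_block_start = shard[1]
        value = shard[0]
        shard[0] += 1
        return value


allocator = ShardedIdAllocator(num_shards=3, block_size=10, rng=random.Random(0))
ids = [allocator.next_id() for _ in range(100)]
```

The IDs handed out are unique, but they are not issued in increasing order, and the most recent blocks contain unused numbers. Those are the holes mentioned above.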
Edit
I needed this for my app (that was why I was searching the question :P ) so I have implemented my solution. It can grab single IDs as well as efficiently grab batches. I have tested it in a controlled environment (on appengine) and it performed very well. You can find the code on github.
Take a look at how the sharded counters are made. It may help you. Also, do you really need them to be numeric? If uniqueness is enough, just use the entity keys.
Alternatively, you could use allocate_ids(), as people have suggested, and then create these entities up front (i.e. with placeholder property values).
first, last = MyModel.allocate_ids(1000000)
keys = [Key(MyModel, id) for id in range(first, last+1)]
Then, when creating a new invoice, your code could run through these entries to find the one with the lowest ID such that the placeholder properties have not yet been overwritten with real data.
I haven't put that into practice, but seems like it should work in theory, most likely with the same limitations people have already mentioned.
Remember: Sharding increases the probability that you will get a unique, auto-increment value, but does not guarantee it. Please take Nick's advice if you MUST have a unique auto-increment.
I implemented something very simplistic for my blog, which increments an IntegerProperty, iden, rather than the Key ID.
I define max_iden() to find the maximum iden integer currently being used. This function scans through all existing blog posts.
def max_iden():
    max_entity = Post.gql("order by iden desc").get()
    if max_entity:
        return max_entity.iden
    return 1000  # If this is the very first entry, start at number 1000
Then, when creating a new blog post, I assign it an iden property of max_iden() + 1
new_iden = max_iden() + 1
p = Post(parent=blog_key(), header=header, body=body, iden=new_iden)
p.put()
I wonder if you might also want to add some sort of verification function after this, i.e. to ensure the max_iden() has now incremented, before moving onto the next invoice.
Altogether: fragile, inefficient code.
I'm thinking of using the following solution: use CloudSQL (MySQL) to insert the records and assign the sequential ID (maybe with a Task Queue), then later (using a Cron Task) move the records from CloudSQL back to the Datastore.
The entities also can have a UUID, so we can map the entities from the Datastore in CloudSQL, and also have the sequential ID (for legal reasons).

Converting a list of strings to a single pattern

I have a list of strings that follow a specific pattern. Here's an example
['ratelimiter:foobar:201401011157',
'ratelimiter:foobar:201401011158',
'ratelimiter:foobar:201401011159',
'ratelimiter:foobar:201401011200']
I'm trying to end up with a glob pattern that represents this list, like the following
'ratelimiter:foobar:201401011*'
I know the first two fields ahead of time. The third field is a timestamp, and I want to find the first column at which the values start to differ.
In the example given, the timestamps range from 2014-01-01 11:57 to 2014-01-01 12:00, and the first column that differs is the third from last, where 1 changes to 2. If I can find that, I can slice the string with [:-3] and append '*' (for this example).
Every time I try to tackle this problem I end up with loops everywhere. I just feel like there's a better way of doing this.
Or maybe someone knows a better way of doing this with Redis. I'm doing this because I'm trying to get keys from Redis, and I don't want to make a request for every key but rather one batch request using the pattern parameter. Maybe there's a better way of doing this, but I haven't found anything yet.
Thanks
Staying with the pattern approach (converting to timestamps is probably best, though), I would do this to find the longest common prefix:
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']
print(items[0][:[len(set(x)) == 1 for x in zip(*items)].index(False)] + '*')
# ratelimiter:foobar:201401011*
This reads as: cut the first element of items at the first position where the nth characters of all items are no longer equal.
[len(set(x)) == 1 for x in zip(*items)] returns a list of booleans that is True at position i if the characters at position i are equal across all items. Note that .index(False) raises ValueError if the items never differ.
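As an alternative sketch, the standard library already has a character-wise common-prefix helper, os.path.commonprefix, which works on arbitrary strings despite living in os.path. The variable names below are illustrative.

```python
import os.path

items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']

# commonprefix compares character by character, not path component by component.
prefix = os.path.commonprefix(items)
pattern = prefix + '*'
print(pattern)
```

If all items are identical, the prefix is the whole string, so this variant does not raise an error in that case.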
This is what I would do:
convert the timestamps to numbers
find the max and min (if your list is not ordered)
take the difference between max and min and convert it back to a pattern.
For example, in your case the difference between max and min is 43, and the min already ends in 57, so you can quickly deduce that if the min ends with ***157, the max ends with ***200. From that you know the pattern.
You almost never want to use the '*' parameter in Redis in production because it is very slow, in the vast majority of cases much slower than making a request for each key individually. Unless you're requesting so many keys that your bottleneck becomes the sheer amount of data you're transferring over the network (in which case you should really convert things to Lua and run the logic server-side), a pipeline is really what you want.
The reason you want a pipeline is that you're probably getting hit by the cost of sending each command to your Redis server and waiting for its reply in a separate hop. A pipeline, in contrast, queues up a bunch of commands to run against Redis and then executes them all at once, when you're ready. Assuming you're using redis-py (if you're not, you really should be), and r is your connection to your Redis server, you can do this like so:
import redis

r = redis.Redis(...)
pipe = r.pipeline()
items = ['ratelimiter:foobar:201401011157',
         'ratelimiter:foobar:201401011158',
         'ratelimiter:foobar:201401011159',
         'ratelimiter:foobar:201401011200']
for item in items:
    pipe.get(item)
# all the values for each item you're getting from Redis will be here.
item_values = pipe.execute()
Note: this sends all the commands to Redis in a single round trip and will be much faster than either getting each value individually or running a pattern selection.
All of the other answers so far are good Python answers, but you're dealing with a Redis problem. You need a Redis answer.
