Performing an operation on every document in a MongoDB instance - python

I have a MongoDB collection with 1.5 million documents, all of which have the same fields, and I want to take the contents of Field A (which is unique in every document) and perform f(A) on it, then create and populate Field B. Pseudocode in Python:
for i in collection.find():
    x = i**2
    collection.update(i, x)  # update i with x
NOTE: I am aware that the update code is probably wrong, but unless it affects the speed of operation, I chose to leave it there for the sake of simplicity
The problem is, this code is really, really slow, primarily because it can run through 1000 documents in about a second, then the server cuts off the cursor for about a minute, then it allows another 1000. I'm wondering if there is any way to optimize this operation, or if I'm stuck with this slow bottleneck.
Additional notes:
I have adjusted batch_size as an experiment; it is faster, but still not efficient, and still takes hours
I am also aware that SQL could probably do this faster; there are other reasons I am using a NoSQL DB that are not relevant to this problem
The instance is running locally, so for all intents and purposes there is no network latency
I have seen this question, but its answer doesn't really address my problem

Database clients tend to be extremely abstracted from actual database activity, so observed delay behaviors can be deceptive. It's likely that you are actually hammering the database in that time, but the activity is all hidden from the Python interpreter.
That said, there are a couple things you can do to make this lighter.
1) Put an index on the property A that you're basing the update on. This will allow it to return much faster.
2) Put a projection operator on your find call:
for doc in collection.find(projection=['A']):
That will ensure that you only return the fields you need to, and if you've properly indexed the unique A property, will ensure your results are drawn entirely from the very speedy index.
3) Use an update operator to ensure you only have to send the new field back. Rather than send the whole document, send back the dictionary:
{'$set': {'B': doc['A']**2}}
which will create the field B in each document without affecting any of the other content.
So, the whole block will look like this:
for doc in collection.find(projection=['A', '_id']):
    collection.update_one({'_id': doc['_id']},
                          {'$set': {'B': doc['A']**2}})
That should cut down substantially on the work that Mongo has to do, as well as (currently irrelevant to you) network traffic.
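If the per-document round-trips are still the bottleneck, the same updates can be batched. A minimal sketch, assuming pymongo 3.x, a local mongod, and hypothetical database/collection names (UpdateOne and bulk_write are standard pymongo APIs):
from pymongo import MongoClient, UpdateOne

client = MongoClient()                       # assumes a local mongod
collection = client['mydb']['mycollection']  # hypothetical names

ops = []
for doc in collection.find(projection=['A']):
    ops.append(UpdateOne({'_id': doc['_id']},
                         {'$set': {'B': doc['A']**2}}))
    if len(ops) == 1000:                     # send updates in batches of 1000
        collection.bulk_write(ops, ordered=False)
        ops = []
if ops:                                      # flush the remainder
    collection.bulk_write(ops, ordered=False)
Batching trades a little memory for far fewer round-trips, which is usually where loops like this lose most of their time.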

Maybe you should do your updates in multiple threads. I think it may be better to load the data in one thread, split it into multiple parts, and pass those parts to parallel worker threads that will perform the updates (see the sketch after the pseudocode below). It will be faster.
EDIT:
I suggest doing paginated queries.
Python pseudocode:
count = collection.count()
page_size = 20
i = 0
while i < count:
    for row in collection.find().limit(page_size).skip(i):
        x = row['A']**2
        collection.update_one({'_id': row['_id']}, {'$set': {'B': x}})
    i += page_size
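A minimal sketch of the threaded variant suggested above, assuming Python 3 (or the futures backport) and pymongo, whose MongoClient is safe to share across threads; the database/collection names and the squaring of field A are illustrative:
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

client = MongoClient()                       # thread-safe; share one client
collection = client['mydb']['mycollection']  # hypothetical names

def update_chunk(chunk):
    # each worker writes back its own slice of (_id, A) pairs
    for _id, a in chunk:
        collection.update_one({'_id': _id}, {'$set': {'B': a**2}})

docs = [(d['_id'], d['A']) for d in collection.find(projection=['A'])]
size = 10000
chunks = [docs[i:i + size] for i in range(0, len(docs), size)]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(update_chunk, chunks))     # consume to surface worker errors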

Related

Google app engine: using a property value as a key

I have the following entities:
class Player(ndb.Model):
    player_id = ndb.IntegerProperty()
and
class TimeRecord(ndb.Model):
    time = ndb.StringProperty()
So a TimeRecord instance is a child of a certain Player instance.
If I need to put an instance of TimeRecord under a certain Player, I do it like this:
tr = TimeRecord(parent = ndb.Key("Player", Player.query(Player.player_id == int(certain_id)).get().key.integer_id()), time = value)
This query is expensive and convoluted. According to the docs:
qry = Account.query(Account.userid == 42)
If you are sure that there was just one Account with that userid, you might prefer to use userid as a key. Account.get_by_id(...) is faster than Account.query(...).get().
As I understand it, I need to change the structure of my datastore:
Use player_id as the key of Player and move TimeRecord (time) to a property of Player. player_id is a unique value.
class Player(ndb.Model):
    time = ndb.StringProperty()
Q: Is that the right approach?
This feels like mixing different levels of the entity hierarchy, since as I see it each piece of data should be its own entity, with the hierarchy expressed through ancestor keys.
Upd:
But in this case I can store just one TimeRecord value for a certain Player, and I need a set of TimeRecords per Player. Is a repeated property the solution to this problem?
The redesign you're proposing is essentially, from the POV of a relational database user, a "de-normalization" -- which is almost a bad word in the relational field, but absolutely "normal" (ha ha) once you move into NoSQL.
If you know how things will be queried and updated, de-normalization improves performance (usually) and/or storage (sometimes) at the expense of some flexibility.
Do be aware of the trade-offs, though. Often, de-normalizing improves performance in querying/reading at the expense of extra burdens in updating -- that can be fine since typically reading is much more frequent than writing, but you need to know whether this is the case for your application.
Examining your specific use case, I see definite savings in storage (esp. if you can use a more specialized type for your time property, see https://cloud.google.com/appengine/docs/python/ndb/properties#Date_and_Time) and fewer interactions with the backend (thus better performance) on retrievals. It also simplifies your code (simplicity is good: fewer risks of bugs, easier to unit-test).
However, if saving new "time records" is a very frequent need for a player, the repeated property grows larger and larger (at some point this slows things down despite it still being a single interaction; at worst it would "bump its head" against a single entity's maximum size, which is one megabyte -- sure, that would take many tens of thousands of "time records" per player, but, not knowing your app at all, I can't tell whether that's a risk... only you can!-).
Queries can also be a problem, again entirely depending on what your app needs. I'm specifically thinking of inequality queries. Suppose you need all players with time records greater than, say, '20141215-10:00:00', and smaller than, say, '20141215-18:00:00'. Alas, an inequality query on a repeated property won't do that for you! That is, if you query for
ndb.AND(Player.time > '20141215-10:00:00',
        Player.time < '20141215-18:00:00')
you'll get players with a time greater than the first constant and a time less than the second one -- not necessarily the same time! This means the query may return many more players than you wish it would, and you'll need to "filter" the resulting bunch of players in your app's code.
If you had an entity where time is not a repeated property (such as your original TimeRecord entity), then the query analogous to this one would return exactly the bunch of entities of interest (though if you then needed to fetch the players sporting those times, you'd then need another interaction with the storage back-end, typically an ndb.get_multi, so it's hard to predict performance effects without knowing much more about your app's operational parameters!).
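A rough sketch of that two-step pattern, assuming the original structure is kept (TimeRecord as a child entity of Player, as in the question) and the illustrative time bounds from above:
from google.appengine.ext import ndb

# query the child entities that carry the time values
qry = TimeRecord.query(TimeRecord.time > '20141215-10:00:00',
                       TimeRecord.time < '20141215-18:00:00')
time_keys = qry.fetch(keys_only=True)

# each TimeRecord's parent key identifies the Player that owns it
player_keys = list({k.parent() for k in time_keys})
players = ndb.get_multi(player_keys)   # the extra round-trip mentioned above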
That's what de-normalization usually boils down to: trade-offs between different aspect of "desirability" (simplicity, storage saving, fewer backend interactions, smaller amounts of data going to/from the backend -- and we're not even getting into atomic transactions and applicability of async techniques!-) -- trade-offs that can be made only with some deep understanding of an app's operational parameters.
Indeed it may be worth deploying two or more prototypes, each to a small set of users, to get actual data about how they perform (the new Cloud Monitoring offering can help with the "get actual data" part), before choosing a "definitive" (ha!) architecture -- despite the fact that migrating the data from the prototypes to the "definitive" schema will itself take extra effort.
And if the app is an overnight success and suddenly you get tens of thousands of queries per second, rather than the orders-of-magnitude fewer you had planned for, the performance characteristics may just as suddenly change to the point the pain of re-architecting and migrating again may be warranted (a good problem to have, for sure, but still...).

How to fetch Riak object, change its value and store it back with all indexes in Python

I am using the Riak database to store my Python application objects that are used and processed in parallel by multiple scripts. Because of that, I need to lock them in various places to avoid their being processed by more than one script at once, like this:
riak_bucket = riak_connect('clusters')
cluster = riak_bucket.get(job_uuid).get_data()
cluster['status'] = 'locked'
riak_obj = riak_bucket.new(job_uuid, data=cluster)
riak_obj.add_index('username_bin', cluster['username'])
riak_obj.add_index('hostname_bin', cluster['hostname'])
riak_obj.store()
The thing is, this is quite a bit of code to do one simple, repeatable thing, and given that locking occurs quite often, I would like to find a simpler, cleaner way of doing it. I've tried to write a function to do the locking/unlocking, like this (for a different object, called 'build'):
def build_job_locker(uuid, status='locked'):
    riak_bucket = riak_connect('builds')
    build = riak_bucket.get(uuid).get_data()
    build['status'] = status
    riak_obj = riak_bucket.new(build['uuid'], data=build)
    riak_obj.add_index('cluster_uuid_bin', build['cluster_uuid'])
    riak_obj.add_index('username_bin', build['username'])
    riak_obj.store()
    # when locking, return the locked db object to avoid fetching it again
    if 'locked' in status:
        return build
    else:
        return
but since the objects are obviously quite different from one another, with different indexes and so on, I ended up writing a locking function for every object... which is almost as messy as not having the functions at all and repeating the code.
The question is: is there a way to write a general function to do this, knowing that every object has a 'status' field, that would lock them in the db while retaining all indexes and other attributes? Or, perhaps, is there another, easier way I haven't thought about?
After some more research, and questions asked on various IRC channels it seems that this is not doable, as there's no way to fetch this kind of metadata about objects from Riak.
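One partial workaround, if the caller passes the indexed fields in explicitly rather than trying to read them back from Riak, is a sketch along these lines (reusing the question's riak_connect helper and its _bin index naming convention):
def locker(bucket_name, uuid, index_fields, status='locked'):
    # generic version: the caller names the bucket and the indexed fields
    riak_bucket = riak_connect(bucket_name)
    data = riak_bucket.get(uuid).get_data()
    data['status'] = status
    riak_obj = riak_bucket.new(uuid, data=data)
    for field in index_fields:
        riak_obj.add_index('%s_bin' % field, data[field])
    riak_obj.store()
    # when locking, return the locked object to avoid fetching it again
    return data if status == 'locked' else None

# usage, matching the original per-object functions:
# build = locker('builds', some_uuid, ['cluster_uuid', 'username'])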

In practice, how eventual is the "eventual consistency" in HRD?

I am in the process of migrating an application from Master/Slave to HRD. I would like to hear some comments from those who have already been through the migration.
I tried a simple example: just post a new entity without an ancestor and redirect to a page listing all entities of that model. I tried it several times and it was always consistent. Then I put 500 indexed properties and again, always consistent...
I was also worried about some claims of a limit of 1 put() per entity group per second. I put() 30 entities with the same ancestor (same HTTP request, but put() one by one) and there was basically no difference from putting 30 entities without an ancestor. (I am using NDB; could it be doing some kind of optimization?)
I tested this with an empty app without any traffic and I am wondering how much a real traffic would affect the "eventual consistency".
I am aware I can test "eventual consistency" on local development. My question is:
Do I really need to restructure my app to handle eventual consistency?
Or would it be acceptable to leave it the way it is, because in practice eventual consistency is actually consistent 99% of the time?
If you have a small app then your data probably live on the same part of the same disk and you have one instance. You probably won't notice eventual consistency. As your app grows, you notice it more. Usually it takes milliseconds to reach consistency, but I've seen cases where it takes an hour or more.
Generally, queries are where you notice it most. One way to reduce the impact is to query by keys only and then use ndb.get_multi() to load the entities. Fetching entities by key ensures that you get the latest version of each entity; it doesn't guarantee that the list of keys is strongly consistent, though. You might therefore get entities that don't match the query conditions, so loop through the entities and skip the ones that don't match.
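A minimal sketch of that pattern, using a hypothetical NDB model with an indexed price property:
from google.appengine.ext import ndb

class Item(ndb.Model):               # hypothetical model
    price = ndb.IntegerProperty()

# keys-only query: the index may lag behind recent writes
keys = Item.query(Item.price < 100).fetch(keys_only=True)

# gets by key are strongly consistent, so each entity is current
items = ndb.get_multi(keys)

# the index may have matched stale values, so re-check the condition
items = [i for i in items if i is not None and i.price < 100]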
From what I've noticed, the pain of eventual consistency grows gradually as your app grows. At some point you do need to take it seriously and update the critical areas of your code to handle it.
What's the worst case if you get inconsistent results?
Does a user see some unimportant info that's out of date? That's probably ok.
Will you miscalculate something important, like the price of something? Or the number of items in stock in a store? In that case, you would want to avoid that chance occurrence.
From observation only, it seems like eventually consistent results show up more as your dataset gets larger, I suspect as your data is split across more tablets.
Also, if you're reading your entities back with get() requests by key/id, the results will always be strongly consistent; it's queries that can return eventually consistent results.
The replication speed is going to be primarily server-workload-dependent. Typically on an unloaded system the replication delay is going to be milliseconds.
But the idea of "eventually consistent" is that you need to write your app so that you don't rely on that; any replication delay needs to be allowable within the constraints of your application.

Google App Engine counters

For all my data in the GAE Datastore I have a model for keeping track of counters/total number of records (since we can't use traditional SUM queries). I want to know the most efficient way of incrementing these global count values whenever I insert/delete a record. This is what I'm currently doing:
counter = DBCounter.all().fetch(1)
dbc = DBCounter(totalTopics=counter[0].totalTopics+1)
dbc.put()
But this seems quite sloppy to me. Any thoughts on a better way to do this?
There are a few issues with your approach:
It may under-count since you don't use a transaction to atomically update the counter (a transactional sketch follows this list).
It is inefficient:
Contention may become a problem if you need to update this counter frequently. Since you only have one counter, it won't scale well. Datastore entities can only be written at a rate of at most 5 times per second.
You're writing to the datastore twice every time you insert a record. If you end up using transactions to fix the above problem, then you'll be making two round-trips to the datastore every time you insert the record (once to insert, and once to update the counter). You might be able to use an approach which avoids this extra round-trip to the datastore.
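For the first issue, a minimal sketch of a transactional increment, keeping the question's DBCounter model and the old db API; it assumes the counter entity is stored under a fixed key name so it can be fetched by key inside the transaction:
from google.appengine.ext import db

def increment_topic_counter():
    def txn():
        # fetch by key name so the read happens inside the transaction
        counter = DBCounter.get_by_key_name('global')
        if counter is None:
            counter = DBCounter(key_name='global', totalTopics=0)
        counter.totalTopics += 1
        counter.put()
    db.run_in_transaction(txn)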
Here are some alternate approaches (from least accurate [and fastest] to most accurate [and slowest]):
If you only need a rough count of the number of entities of particular kind in the datastore, then you can use the Stats API. The counts you retrieve are not constantly updated, however.
If you need more granularity but are okay with a small possibility of occasionally under-counting, then you could use a memcache-enhanced counter. There are several good implementations discussed in this question. In particular, see the code in the comments in this recipe.
If you really want to avoid undercounting, then you should consider a sharded datastore counter. This will eliminate the contention issue from above.
If you need to keep scalability while counting, you should look into Joe Gregorio's article on sharding counters and DocSavage's implementation of the idea (a rough sketch of the sharded approach follows this list).
AppEngineFan's excellent blog also has info on scalable non-sharded counters, see this one which uses task queues and points to the previous article on using cron jobs instead.
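For reference, the sharded idea boils down to something like this rough sketch (hypothetical shard model, not the full implementation from the articles above):
import random
from google.appengine.ext import db

NUM_SHARDS = 20                      # more shards = more write throughput

class CounterShard(db.Model):        # hypothetical shard model
    count = db.IntegerProperty(default=0)

def increment():
    def txn():
        # pick a random shard so concurrent writes rarely contend
        index = random.randint(0, NUM_SHARDS - 1)
        shard = CounterShard.get_by_key_name('shard%d' % index)
        if shard is None:
            shard = CounterShard(key_name='shard%d' % index)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count():
    # reads sum all shards: slower than a single get, but writes don't contend
    return sum(s.count for s in CounterShard.all())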

Shelve is too slow for large dictionaries, what can I do to improve performance?

I am storing a table using Python and I need persistence.
Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:
self.DB=shelve.open("%s%sMoleculeLibrary.shelve"%(directory,os.sep),writeback=True)
I set writeback to True as I found the system tends to be unstable if I don't.
After the computations the system needs to close the database, and store it back. Now the database (the table) is about 540MB, and it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.
I am probably using the wrong form of persistence. What can I do to improve performance?
For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, Pymongo. MongoDB itself is lightweight and incredibly fast, and json objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.
As an example of how easy the code would be, see the following:
d = {'string1': 1, 'string2': 2, 'string3': 3}

from pymongo import Connection
conn = Connection()
db = conn['example-database']
collection = db['example-collection']

for string, num in d.items():
    collection.save({'_id': string, 'value': num})

# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']

print newD
# output is: {u'string2': 2, u'string3': 3, u'string1': 1}
You'd just have to convert back from unicode, which is trivial.
Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and key counts; millions of keys and gigabytes of data are not a problem, at which point shelve is totally wasted. Also, having a separate db process isn't beneficial; it just requires more context switches. In my tests I found that SQLite3 was the preferred option for handling larger data sets locally. Running a local database engine like mongo, mysql or postgresql doesn't provide any additional value, and was also slower.
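A minimal sketch of a string-to-number table in SQLite3 (the standard library sqlite3 module; the file and table names are illustrative):
import sqlite3

conn = sqlite3.connect('MoleculeLibrary.sqlite')   # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value REAL)')

def put(key, value):
    conn.execute('INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)',
                 (key, value))

def get(key):
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return row[0] if row else None

put('string1', 1.5)
print(get('string1'))   # 1.5
conn.commit()
conn.close()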
I think your problem is due to the fact that you use writeback=True. The documentation says:
Because of Python semantics, a shelf cannot know when a mutable persistent-dictionary entry is modified. By default modified objects are written only when assigned to the shelf (see Example). If the optional writeback parameter is set to True, all entries accessed are also cached in memory, and written back on sync() and close(); this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated).
You could avoid using writeback=True and make sure the data is written only once (you have to pay attention that subsequent modifications are going to be lost).
If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3: it's included with Python (thus very portable) and has very good performance, though it's somewhat more complicated than a simple key-value store.
See other answers for alternatives.
How much larger? What are the access patterns? What kinds of computation do you need to do on it?
Keep in mind that you are going to have some performance limits if you can't keep the table in memory no matter how you do it.
You may want to look at going to SQLAlchemy, or directly using something like bsddb, but both of those will sacrifice simplicity of code. However, with SQL you may be able to offload some of the work to the database layer depending on the workload.

Categories

Resources