Provide a hint in bulk upserts

Provide a hint in bulk upserts - python

Is there a way to provide a hint for an upsert in a bulk in MongoDB / Python?
I would like to add a hint in a query like: Bulk.find(<query>).upsert().update(<update>).
I have tried:
Bulk.find(<query>).hint(<index>).upsert().update(<update>): .hint() method does not exist.
Bulk.find({'$query': <query>, '$hint': <hint>}).upsert().update(<update>): one cannot mix {$query: <query>} syntax with method chaining (see this & this for example).
Am I missing something?

This is not so much about Bulk Operations but is rather about the general behavior of queries in "update" statements. See SERVER-1599.
So the same format of operations supported by the basic Op_Query which is linked to .find() has never been supported in update statements. This is also true of the Bulk API because the .find() method there is it's own method and belongs to the Bulk API where it is not related to the basic collection method, hence the lacking .hint() method.
So using the special forms as with $query does not work even with .update() in a basic form. But there is something you can do as of MongoDB 2.6 to influence the index chosen by the query.
The new addition here is "index filters", this allows you to set up a list of indexes to be considered for a given "query shape". The main definition here is through the planCacheSetFilter command. This allows you do do something like the following ( just in shell for brevity ):
db.junk.ensureIndex({ "b": 1, "a": 1 })
db.runCommand({
"planCacheSetFilter": "junk",
"query": { "a": 1 },
"indexes": [
{ "b": 1, "a": 1 }
]
})
The values provided in the "query" argument there are irrelevant, but what is important is the "shape". So regardless of what data is being queried for, as long as the "shape" is basically the same then the filter set is considered. i.e:
db.junk.find({ "a": 1 }).explain(1).filterSet; // returns true
db.junk.find({ "a": 2 }).explain(1).filterSet; // returns true
db.junk.find({ "b": 1 }).explain(1).filterSet; // returns false, different shape
Unlike the direct form of $hint, this will work with both .update() statements or in the Bulk .find().update() chain as a way to provide an index choice for the query operation.
Beware though that this is not a "permanent" setting, nor is it able to be isolated to a singular operation or sequence of operations. This "filter" will stay in the plan cache once set until the server instance is restarted. You can alternately clear it with the planCacheClearFilters command.
So until that JIRA Issue is resolved, "filters" are the only possible way like what you are asking to achieve without factoring in other queries to narrow down additional filtering parameters to optimize on the likely selected index.

Related

What's the most Pythonic way to pull from structured data with an inconsistent maximum "depth"?

I've got a JSON file holding different dialog lines for a Discord bot to use, sorted by which bot command triggers the line. It looks something like this:
{
"!remind": {
"responses": {
"welcome": "Reminder set for {time}.",
"reminder": "At {time} you asked me to remind you {thing}."
},
"errors": {
"R0": "Invalid reminder time.",
"R1": "Reminder time is in the past."
},
"help": "To set a reminder say `!reminder [time] [thing]`"
},
"<!timezone, !spoilimage, !crosspost, etc.>": {
<same structure>
}
}
I have a function that's meant to access the values stored in the JSON file, do any necessary formatting using kwargs, and return a string. My original approach was
def dialog(command, category, name, **fmt):
json_data = <json stuff>
return json_data[command][category][name].format(**fmt)
# Sample call:
pastcommand = <magic>
reply = dialog("!remind", "response", "reminder", time=pastcommand.time, thing=pastcommand.message)
# Although in practice I've made wrapper methods to avoid having to specify all of these args each time
But this will only work for "responses" and "errors", not "help," since in "help" the message to send is a level "shallower".
Two other things to note:
It's unlikely there will ever need to be anything in "help" other than the single value.
Currently there are no name conflicts between keys in different subcategories, and it's very easy to keep it that way. However, "responses"/"errors"/"help" is consistent across all categories, and some key names are repeated across categories (although I could change that if necessary).
So, in terms of fixing this, I could always just restructure the JSON file, something like
"help": {"main": "To set a reminder say `!reminder [time] [thing]`"}
but I don't like the idea of turning a string into a dict containing just a single string, just to satisfy the constraints of a function that pulls it.
Beyond that, I've run through a number of options, namely: explicitly checking the category and making it a special case (if category == "help"); trying both options with a try/except block, and using pandas.json_normalize (which I'm pretty sure would work? I haven't actually worked with it. Either way, any time a seemingly simple problem brings me to a third-party library, it makes me suspect I'm doing something wrong.).
What I've settled on, so far, is this:
def dialog(*json_keys, **fmt):
json_data = <json stuff>
current_level = json_data
for key in json_keys:
# Let's pretend I did error-handling here.
current_level = current_level[key]
return current_level.format(**fmt)
It's a lot more elegant and more flexible than any of the other things I considered, but I'm self-taught and pretty inexperienced, and I'm wondering if I'm overlooking some better approach.

AttributeError: 'QuerySet' object has no attribute 'is_staff' [duplicate]

I was having a debate on this with some colleagues. Is there a preferred way to retrieve an object in Django when you're expecting only one?
The two obvious ways are:
try:
obj = MyModel.objects.get(id=1)
except MyModel.DoesNotExist:
# We have no object! Do something...
pass
And:
objs = MyModel.objects.filter(id=1)
if len(objs) == 1:
obj = objs[0]
else:
# We have no object! Do something...
pass
The first method seems behaviorally more correct, but uses exceptions in control flow which may introduce some overhead. The second is more roundabout but won't ever raise an exception.
Any thoughts on which of these is preferable? Which is more efficient?

get() is provided specifically for this case. Use it.
Option 2 is almost precisely how the get() method is actually implemented in Django, so there should be no "performance" difference (and the fact that you're thinking about it indicates you're violating one of the cardinal rules of programming, namely trying to optimize code before it's even been written and profiled -- until you have the code and can run it, you don't know how it will perform, and trying to optimize before then is a path of pain).

You can install a module called django-annoying and then do this:
from annoying.functions import get_object_or_None
obj = get_object_or_None(MyModel, id=1)
if not obj:
#omg the object was not found do some error stuff

1 is correct. In Python an exception has equal overhead to a return. For a simplified proof you can look at this.
2 This is what Django is doing in the backend. get calls filter and raises an exception if no item is found or if more than one object is found.

I'm a bit late to the party, but with Django 1.6 there is the first() method on querysets.
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.first
Returns the first object matched by the queryset, or None if there is no matching object. If the QuerySet has no ordering defined, then the queryset is automatically ordered by the primary key.
Example:
p = Article.objects.order_by('title', 'pub_date').first()
Note that first() is a convenience method, the following code sample is equivalent to the above example:
try:
p = Article.objects.order_by('title', 'pub_date')[0]
except IndexError:
p = None

Why do all that work? Replace 4 lines with 1 builtin shortcut. (This does its own try/except.)
from django.shortcuts import get_object_or_404
obj = get_object_or_404(MyModel, id=1)

I can't speak with any experience of Django but option #1 clearly tells the system that you are asking for 1 object, whereas the second option does not. This means that option #1 could more easily take advantage of cache or database indexes, especially where the attribute you're filtering on is not guaranteed to be unique.
Also (again, speculating) the second option may have to create some sort of results collection or iterator object since the filter() call could normally return many rows. You'd bypass this with get().
Finally, the first option is both shorter and omits the extra temporary variable - only a minor difference but every little helps.

Some more info about exceptions. If they are not raised, they cost almost nothing. Thus if you know you are probably going to have a result, use the exception, since using a conditional expression you pay the cost of checking every time, no matter what. On the other hand, they cost a bit more than a conditional expression when they are raised, so if you expect not to have a result with some frequency (say, 30% of the time, if memory serves), the conditional check turns out to be a bit cheaper.
But this is Django's ORM, and probably the round-trip to the database, or even a cached result, is likely to dominate the performance characteristics, so favor readability, in this case, since you expect exactly one result, use get().

I've played with this problem a bit and discovered that the option 2 executes two SQL queries, which for such a simple task is excessive. See my annotation:
objs = MyModel.objects.filter(id=1) # This does not execute any SQL
if len(objs) == 1: # This executes SELECT COUNT(*) FROM XXX WHERE filter
obj = objs[0] # This executes SELECT x, y, z, .. FROM XXX WHERE filter
else:
# we have no object! do something
pass
An equivalent version that executes a single query is:
items = [item for item in MyModel.objects.filter(id=1)] # executes SELECT x, y, z FROM XXX WHERE filter
count = len(items) # Does not execute any query, items is a standard list.
if count == 0:
return None
return items[0]
By switching to this approach, I was able to substantially reduce number of queries my application executes.

.get()
Returns the object matching the given lookup parameters, which should
be in the format described in Field lookups.
get() raises MultipleObjectsReturned if more than one object was
found. The MultipleObjectsReturned exception is an attribute of the
model class.
get() raises a DoesNotExist exception if an object wasn't found for
the given parameters. This exception is also an attribute of the model
class.
.filter()
Returns a new QuerySet containing objects that match the given lookup
parameters.
Note
use get() when you want to get a single unique object, and filter()
when you want to get all objects that match your lookup parameters.

Interesting question, but for me option #2 reeks of premature optimisation. I'm not sure which is more performant, but option #1 certainly looks and feels more pythonic to me.

I suggest a different design.
If you want to perform a function on a possible result, you could derive from QuerySet, like this: http://djangosnippets.org/snippets/734/
The result is pretty awesome, you could for example:
MyModel.objects.filter(id=1).yourFunction()
Here, filter returns either an empty queryset or a queryset with a single item. Your custom queryset functions are also chainable and reusable. If you want to perform it for all your entries: MyModel.objects.all().yourFunction().
They are also ideal to be used as actions in the admin interface:
def yourAction(self, request, queryset):
queryset.yourFunction()

Option 1 is more elegant, but be sure to use try..except.
From my own experience I can tell you that sometimes you're sure there cannot possibly be more than one matching object in the database, and yet there will be two... (except of course when getting the object by its primary key).

Sorry to add one more take on this issue, but I am using the django paginator, and in my data admin app, the user is allowed to pick what to query on. Sometimes that is the id of a document, but otherwise it is a general query returning more than one object, i.e., a Queryset.
If the user queries the id, I can run:
Record.objects.get(pk=id)
which throws an error in django's paginator, because it is a Record and not a Queryset of Records.
I need to run:
Record.objects.filter(pk=id)
Which returns a Queryset with one item in it. Then the paginator works just fine.

".get()" can return one object:
{
"name": "John",
"age": "26",
"gender": "Male"
}
".filter()" can return **a list(set) of one or more objects:
[
{
"name": "John",
"age": "26",
"gender": "Male"
},
{
"name": "Tom",
"age": "18",
"gender": "Male"
},
{
"name": "Marry",
"age": "22",
"gender": "Female"
}
]

Using google.protobuf.Any in python file

I have such .proto file
syntax = "proto3";
import "google/protobuf/any.proto";
message Request {
google.protobuf.Any request_parameters = 1;
}
How can I create Request object and populate its fields? I tried this:
import ma_pb2
from google.protobuf.any_pb2 import Any
parameters = {"a": 1, "b": 2}
Request = ma_pb2.Request()
some_any = Any()
some_any.CopyFrom(parameters)
Request.request_parameters = some_any
But I have an error:
TypeError: Parameter to CopyFrom() must be instance of same class: expected google.protobuf.Any got dict.
UPDATE
Following prompts of #Kevin I added new message to .proto file:
message Small {
string a = 1;
}
Now code looks like this:
Request = ma_pb2.Request()
small = ma_pb2.Small()
small.a = "1"
some_any = Any()
some_any.Pack(small)
Request.request_parameters = small
But at the last assignment I have an error:
Request.request_parameters = small
AttributeError: Assignment not allowed to field "request_parameters" in protocol message object.
What did I do wrong?

Any is not a magic box for storing arbitrary keys and values. The purpose of Any is to denote "any" message type, in cases where you might not know which message you want to use until runtime. But at runtime, you still need to have some specific message in mind. You can then use the .Pack() and .Unpack() methods to convert that message into an Any, and at that point you would do something like Request.request_parameters.CopyFrom(some_any).
So, if you want to store this specific dictionary:
{"a": 1, "b": 2}
...you'll need a .proto file which describes some message type that has integer fields named a and b. Personally, I'd see that as overkill; just throw your a and b fields directly into the Request message, unless you have a good reason for separating them out. If you "forget" one of these keys, you can always add it later, so don't worry too much about completeness.
If you really want a "magic box for storing arbitrary keys and values" rather than what I described above, you could use a Map instead of Any. This has the advantage of not requiring you to declare all of your keys upfront, in cases where the set of keys might include arbitrary strings (for example, HTTP headers). It has the disadvantage of being harder to lint or type-check (especially in statically-typed languages), because you can misspell a string more easily than an attribute. As shown in the linked resource, Maps are basically syntactic sugar for a repeated field like the following (that is, the on-wire representation is exactly the same as what you'd get from doing this, so it's backwards compatible to clients which don't support Maps):
message MapFieldEntry {
key_type key = 1;
value_type value = 2;
}
repeated MapFieldEntry map_field = N;

Avoiding multiple queries for multiple counts

I'm trying to figure out a good way to get a few analytics counts from my DB without doing a bunch of queries and somehow doing one
What I have right now is a function that returns counts
def get_counts(self):
return {
'item_one_counts' : self.items_one.count(),
'item_two_counts' : self.items_two.count(),
'item_three_count' : self.items_three.count(),
}
etc.
I know I can do this with a raw query that does a SELECT as count1,2,3 FROM table X
Is there a more django-y way to do this?

You're a tad late if you want to get the counts in an instance method. The easiest way to optimize this is by using annotations in the initial query:
obj = MyModel.objects.annotate(item_one_count=Count('items_one')) \
.annotate(item_two_count=Count('items_two')) \
.annotate(item_three_count=Count('items_three')) \
.get(...)
Another good optimization is to cache the results, e.g.:
MyModel(models.Model):
def get_item_one_count(self):
if not hasattr(self, '_item_one_count'):
self._item_one_count = self.items_one.count()
return self._item_one_count
...
def get_counts(self):
return {
'item_one_counts' : self.get_item_one_count(),
'item_two_counts' : self.get_item_two_count(),
'item_three_count' : self.get_item_three_count(),
}
Combine these methods (i.e. .annotate(_item_one_count=Count('items_one'))), and you can optimize the counts into a single query when you have control over the query, while having fallback method in case you can't annotate the results.
Another option is to perform the annotation in your model manager, but you will no longer have fine-grained control over the queries.

Django - When is a query made?

I want to minimize the number of database queries my application makes, and I am familiarizing myself more with Django's ORM. I am wondering, what are the cases where a query is executed.
For instance, this format is along the lines of the answer I'm looking for (for example purposes, not accurate to my knowledge):
Model.objects.get()
Always launches a query
Model.objects.filter()
Launches a query if objects is empty only
(...)
I am assuming curried filter operations never make additional requests, but from the docs it looks like filter() does indeed make database requests if it's the first thing called.

If you're using test cases, you can use this custom assertion included in django's TestCase: assertNumQueries().
Example:
with self.assertNumQueries(2):
x = SomeModel.objects.get(pk=1)
y = x.some_foreign_key_in_object
If the expected number of queries was wrong, you'd see an assertion failed message of the form:
Num queries (expected - actual):
2 : 5
In this example, the foreign key would cause an additional query even though there's no explicit query (get, filter, exclude, etc.).
For this reason, I would use a practical approach: Test or logging, instead of trying to learn each of the cases in which django is supposed to query.
If you don't use unit tests, you may use this other method which prints the actual SQL statements sent by django, so you can have an idea of the complexity of the query, and not just the number of queries:
(DEBUG setting must be set to True)
from django.db import connection
x = SomeModel.objects.get(pk=1)
y = x.some_foreign_key_in_object
print connection.queries
The print would show a dictionary of queries:
[
{'sql': 'SELECT a, b, c, d ... FROM app_some_model', 'time': '0.002'},
{'sql': 'SELECT j, k, ... FROM app_referenced_model JOIN ... blabla ',
'time': '0.004'}
]
Docs on connection.queries.
Of course, you can also combine both methods and use the print connection.queries in your test cases.

See Django's documentation on when querysets are evaluated: https://docs.djangoproject.com/en/dev/ref/models/querysets/#when-querysets-are-evaluated
Evaluation in this case means that the query is executed. This mostly happens when you are trying to access the results, eg. when calling list() or len() on it or iterating over the results.
get()in your example doesn't return a queryset but a model objects, therefore it is evaluated immediately.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.