'Stringing together' a pymongo query based on a set of conditions - python

I have a set of conditions that I need to use to retrieve some data from a mongodb database (using pymongo). Some of these conditions are optional, and others may have more than one possible value.
I'm wondering if there is a way of 'dynamically' constructing a pymongo query based on these conditions (instead of creating individual queries for each possible combination of conditions).
For example, assume that I have one query which has to be constrained to the following conditions:
tag contains any of this, is, a, tag
user is johnsmith
date_published is before today
...whereas another query may only be constrained to the following:
user is johnsmith
date_published is after today
Summary: Instead of having to create every possible combination of conditions, is there a way of stringing conditions together to form a query in pymongo?

A PyMongo query is just a Python dictionary, so you can use all the usual techniques to build one on the fly:
def find_things(tags=None, user=None, published_since=None):
# all queries begin with something common, which may
# be an empty dict, but here's an example
query = {
'is_published': True
}
if tags:
# assume that it is an array of strings
query['tags'] = {'$in': tags}
if user:
# assume that it is a string
query['user'] = user
if published_since:
# assume that it is a datetime.datetime
query['date_published'] = {'$gte': published_since}
# etc...
return db.collection.find(query)
The actual logic you implement is obviously dependent on what you want to vary your find calls by, these are just a few examples. You will also want to validate the input if it is coming from an untrusted source (e.g. a web application form, URL parameters, etc).

Related

Django - Filter GTE LTE for alphanumeric IDs

I am trying to serve up our APIs to allow filtering capabilities using LTE and GTE on our ID look ups. However, the IDs we have are alphanumeric like AB:12345, AB:98765 and so on. I am trying to do the following on the viewset using the Django-Filter:
class MyFilter(django_filters.FilterSet):
item_id = AllLookupsFilter()
class Meta:
model = MyModel
fields = {
'item_id': ['lte', 'gte']
}
But the issue is, if I query as: http://123.1.1.1:7000/my-entities/?item_id__gte=AB:1999 or even http://123.1.1.1:7000/my-entities/?item_id__lte=AB:100 it won't exactly return items with ID's greater than 1999 or less than 100 since the filter will take ID as a string and tries to filter by every character. Any idea how to achieve so I can filter on IDs so I get items exactly greater / less than the numeric ID (ignoring the initial characters) ?
What you'll want to do is write a custom lookup. You can read more about them here: https://docs.djangoproject.com/en/2.0/howto/custom-lookups/
The code sample below has everything you need to define your own except the actual function. For that part of the example check the link.
from django.db.models import Lookup
#Field.register_lookup
class NotEqual(Lookup):
lookup_name = 'ne'
In the lookup, you'll need to split the string and then search based on your own parameters. This will likely require you to do one of the following:
Write custom SQL that you can pass through Django into your query.
Request a large number of results containing the subset you're looking for and filter that through Python, returning only the important bits.
What you're trying to accomplish is usually called Natural Sorting, and tends to be very difficult to do on the SQL side of things. There's a good trick, explained well here: https://www.copterlabs.com/natural-sorting-in-mysql/ However, the highlights are simple in the SQL world:
Sort by Length first
Sort by column value second

What is the difference between with_entities and load_only in SQLAlchemy?

When querying my database, I only want to load specified columns. Creating a query with with_entities requires a reference to the model column attribute, while creating a query with load_only requires a string corresponding to the column name. I would prefer to use load_only because it is easier to create a dynamic query using strings. What is the difference between the two?
load_only documentation
with_entities documentation
There are a few differences. The most important one when discarding unwanted columns (as in the question) is that using load_only will still result in creation of an object (a Model instance), while using with_entities will just get you tuples with values of chosen columns.
>>> query = User.query
>>> query.options(load_only('email', 'id')).all()
[<User 1 using e-mail: n#d.com>, <User 2 using e-mail: n#d.org>]
>>> query.with_entities(User.email, User.id).all()
[('n#d.org', 1), ('n#d.com', 2)]
load_only
load_only() defers loading of particular columns from your models.
It removes columns from query. You can still access all the other columns later, but an additional query (in the background) will be performed just when you try to access them.
"Load only" is useful when you store things like pictures of users in your database but you do not want to waste time transferring the images when not needed. For example, when displaying a list of users this might suffice:
User.query.options(load_only('name', 'fullname'))
with_entities
with_entities() can either add or remove (simply: replace) models or columns; you can even use it to modify the query, to replace selected entities with your own function like func.count():
query = User.query
count_query = query.with_entities(func.count(User.id)))
count = count_query.scalar()
Note that the resulting query is not the same as of query.count(), which would probably be slower - at least in MySQL (as it generates a subquery).
Another example of the extra capabilities of with_entities would be:
query = (
Page.query
.filter(<a lot of page filters>)
.join(Author).filter(<some author filters>)
)
pages = query.all()
# ok, I got the pages. Wait, what? I want the authors too!
# how to do it without generating the query again?
pages_and_authors = query.with_entities(Page, Author).all()

Django Raw Query with params on Table Column (SQL Injection)

I have a kinda unusual scenario but in addition to my sql parameters, I need to let the user / API define the table column name too. My problem with the params is that the query results in:
SELECT device_id, time, 's0' ...
instead of
SELECT device_id, time, s0 ...
Is there another way to do that through raw or would I need to escape the column by myself?
queryset = Measurement.objects.raw(
'''
SELECT device_id, time, %(sensor)s FROM measurements
WHERE device_id=%(device_id)s AND time >= to_timestamp(%(start)s) AND time <= to_timestamp(%(end)s)
ORDER BY time ASC;
''', {'device_id': device_id, 'sensor': sensor, 'start': start, 'end': end})
As with any potential for SQL injection, be careful.
But essentially this is a fairly common problem with a fairly safe solution. The problem, in general, is that query parameters are "the right way" to handle query values, but they're not designed for schema elements.
To dynamically include schema elements in your query, you generally have to resort to string concatenation. Which is exactly the thing we're all told not to do with SQL queries.
But the good news here is that you don't have to use the actual user input. This is because, while possible query values are infinite, the superset of possible valid schema elements is quite finite. So you can validate the user's input against that superset.
For example, consider the following process:
User inputs a value as a column name.
Code compares that value (raw string comparison) against a list of known possible values. (This list can be hard-coded, or can be dynamically fetched from the database schema.)
If no match is found, return an error.
If a match is found, use the matched known value directly in the SQL query.
So all you're ever using are the very strings you, as the programmer, put in the code. Which is the same as writing the SQL yourself anyway.
It doesn't look like you need raw() for the example query you posted. I think the following queryset is very similar.
measurements = Measurement.objects.filter(
device_id=device_id,
to_timestamp__gte=start,
to_timestamp__lte,
).order_by('time')
for measurement in measurements:
print(getattr(measurement, sensor)
If you need to optimise and avoid loading other fields, you can use values() or only().

Mongo Embedded Document Query

I've 2 DynamicDocuments:
class Tasks(db.DynamicDocument):
task_id = db.UUIDField(primary_key=True,default=uuid.uuid4)
name = db.StringField()
flag = db.IntField()
class UserTasks(db.DynamicDocument):
user_id = db.ReferenceField('User')
tasks = db.ListField(db.ReferenceField('Tasks'),default=list)
I want to filter the UserTasks document by checking whether the flag value (from Tasks Document) of the given task_id is 0 or 1, given the task_id and user_id. So I query in the following way:-
obj = UserTasks.objects.get(user_id=user_id,tasks=task_id)
This fetches me an UserTask object.
Now I loop around the task list and first I get the equivalent task and then check its flag value in the following manner.
task_list = obj.tasks
for t in task_list:
if t['task_id'] == task_id:
print t['flag']
Is there any better/direct way of querying UserTasks Document in order to fetch the flag value of Tasks Document.
PS : I could have directly fetched flag value from the Tasks Document, but I also need to check whether the task is associated with the user or not. Hence I directly queried the USerTasks document.
Can we directly filter on a document with the ReferenceField's fields in a single query?
No, its not possible to directly filter a document with the fields of ReferenceField as doing this would require joins and mongodb does not support joins.
As per MongoDB docs on database references:
MongoDB does not support joins. In MongoDB some data is denormalized,
or stored with related data in documents to remove the need for joins.
From another page on the official site:
If we were using a relational database, we could perform a join on
users and stores, and get all our objects in a single query. But
MongoDB does not support joins and so, at times, requires bit of
denormalization.
Relational purists may be feeling uneasy already, as if we were
violating some universal law. But let’s bear in mind that MongoDB
collections are not equivalent to relational tables; each serves a
unique design objective. A normalized table provides an atomic,
isolated chunk of data. A document, however, more closely represents
an object as a whole.
So in 1 query, we can't both filter tasks with a particular flag value and with the given user_id and task_id on the UserTasks model.
How to perform the filtering then?
To perform the filtering as per the required conditions, we will need to perform 2 queries.
In the first query we will try to filter the Tasks model with the given task_id and flag. Then, in the 2nd query, we will filter UserTasks model with the given user_id and the task retrieved from the first query.
Example:
Lets say we have a user_id, task_id and we need to check if the related task has flag value as 0.
1st Query
We will first retrive the my_task with the given task_id and flag as 0.
my_task = Tasks.objects.get(task_id=task_id, flag=0) # 1st query
2nd Query
Then in the 2nd query, you need to filter on UserTask model with the given user_id and my_task object.
my_user_task = UserTasks.objects.get(user_id=user_id, tasks=my_task) # 2nd query
You should perform 2nd query only if you get a my_task object with the given task_id and flag value. Also, you will need to add error handling in case there are no matched objects.
What if we have used EmbeddedDocument for the Tasks model?
Lets say we have defined our Tasks document as an EmbeddedDocument and the tasks field in UserTasks model as an EmbeddedDocumentField, then to do the desired filtering we could have done something like below:
my_user_task = UserTasks.objects.get(user_id=user_id, tasks__task_id=task_id, tasks__flag=0)
Getting the particular my_task from the list of tasks
The above query will return a UserTask document which will contain all the tasks. We will then need to perform some sort of iteration to get the desired task.
For doing that, we can perform list comprehension using enumerate().
Then the desired index will be the 1st element of the 1-element list returned.
my_task_index = [i for i,v in enumerate(my_user_task.tasks) if v.flag==0][0]
#Praful, based on your schema you need two queries because mongodb does not have joins, so if you want to get "all the data" in one query you need a schema which fit that case.
ReferenceField is a special field which does a lazy load of the other collection (it requires a query).
Based on the query you need, I recommend you to change your schema to fit that. The idea behind NOSQL engines is "denormalization" so it is not bad to have a list of EmbeddedDocument. EmbeddedDocument can be a smaller document (denormalized version) with a set of fields instead of all of them.
If you do not want to load the whole document into memory while querying you can exclude that fields using a "projection".
Supossing your UserTasks has a list of EmbeddedDocument with the task you could do:
UserTasks.objects.exclude('tasks').filter(**filters)
I hope it helps you.
Good luck!

Django Haystack Distinct Value for Field

I am building a small search engine using Django Haystack + Elasticsearch + Django REST Framework, and I'm trying to figure out reproduce the behavior of a Django QuerySet's distinct method.
My index looks something like this:
class ItemIndex(indexes.SearchIndex, indexes.Indexable):
text = indexes.CharField(document=True, use_template=True)
item_id = indexes.IntegerField(faceted=True)
def prepare_item_id(self, obj):
return obj.item_id
What I'd like to be able to do is the following:
sqs = SearchQuerySet().filter(content=my_search_query).distinct('item_id')
However, Haystack's SearchQuerySet doesn't have a distinct method, so I'm kind of lost. I tried faceting the field, and then querying Django using the returned list of item_id's, but this loses the performance of Elasticsearch, and also makes it impossible to use Elasticsearch's sorting features.
Any thoughts?
EDIT:
Example data:
Example data:
Item Model
==========
id title
1 'Item 1'
2 'Item 2'
3 'Item 3'
VendorItem Model << the table in question
================
id item_id vendor_id lat lon
1 1 1 38 -122
2 2 1 38.2 -121.8
3 3 2 37.9 -121.9
4 1 2 ... ...
5 2 2 ... ...
6 2 3 ... ...
As you can see, there are multiple VendorItem's for the same Item, however when searching I only want to retrieve at most one result for each item. Therefore I need the item_id column to be unique/distinct.
I have tried faceting on the item_id column, and then executing the following query:
facets = SearchQuerySet().filter(content=query).facet('item_id')
counts = sqs.facet_counts()
# ids will look like: [345, 892, 123, 34,...]
ids = [i[0] for i in counts['fields']['item_id']]
items = VendorItem.objects.filter(vendor__lat__gte=latMin,
vendor__lon__gte=lonMin, vendor__lat__lte=latMax,
vendor__lon__lte=lonMax, item_id__in=ids).distinct(
'item').select_related('vendor', 'item')
The main problem here is that results are limited to 100 items, and they cannot be sorted with haystack.
I think the best advice I can give you is to stop using Haystack.
Haystack's default backend (the elasticsearch_backend.py) is mostly written with Solr in mind. There are a lot of annoyances that I find in haystack, but the biggest has to be that it packs all queries into something called query_string. Using query string, they can use the lucene syntax, but it also means losing the entire elasticsearch DSL. The lucene syntax has some advantages, especially if this is what you are used to, but it is very limiting from an elasticsearch point of view.
Furthermore, I think you are applying an RDBMS concept to a search engine. That isn't to say that you shouldn't get the results you need, but the approach is often different.
The way you might query and retrieve this data might be different if you don't use haystack because haystack creates indexes in a way more appropriate for solr than for elasticsearch.
For example, in creating a new index, haystack will assign a "type" called "modelresult" to all models that will go in an index.
So, let's say you have some entities called Items and some other entities called vendoritems.
It might be appropriate to have them both in the same index but with vendoritems as a type of vendoritems and items having a type of items.
When querying, you would then query based on the rest endpoint so, something like localhost:9200/index/type (query). The way haystack achieves is this is through the django content types module. Accordingly, there is a field called "django_ct" that haystack queries and attaches to any query you might make when you are only looking for unique items.
To illustrate the above:
This endpoint searches accross all indexes
`localhost:9200/`
This endpoint searches across all types in an index:
`localhost:9200/yourindex/`
This endpoint searches in a type within an index:
`localhost:9200/yourindex/yourtype/`
and this endpoint searches in two specified types within an index:
`localhost:9200/yourindex/yourtype,yourothertype/`
Back to haystack though, you can possibly get unique values by adding a django_ct to your query, but likely that isn't what you want.
What you really want to do is a facet, and probably you want to use term facets. This could be a problem in haystack because it A.) analyzes all text and B.) applies store=True to all fields (really not something you want to do in elasticsearch, but something you often want to do in solr).
You can order facet results in elasticsearch (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_ordering)
I don't mean for this to be a slam on haystack. I think it does a lot of things right conceptually. It's especially good if all you need to do is index a single model (like say a blog) and just have it quickly return results.
That said, I highly recommend to use elasticutils. Some of the concepts from haystack are similar, but it uses the search dsl, rather than query_string (but you can still use query_string if you wanted).
Be warned though, I don't think you can order facets using elasticutils by default, but you can just pass in a python dictionary of the facets you want to facet_raw method (something I don't think you can do in haystack).
Your last option is to create your own haystack backend, inherit from the existing backend and just add some functionality to the .facet() method to allow for ordering per the above dsl.

Categories

Resources