Django Haystack Distinct Value for Field - python

I am building a small search engine using Django Haystack + Elasticsearch + Django REST Framework, and I'm trying to figure out how to reproduce the behavior of a Django QuerySet's distinct() method.
My index looks something like this:
class ItemIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    item_id = indexes.IntegerField(faceted=True)

    def prepare_item_id(self, obj):
        return obj.item_id
What I'd like to be able to do is the following:
sqs = SearchQuerySet().filter(content=my_search_query).distinct('item_id')
However, Haystack's SearchQuerySet doesn't have a distinct method, so I'm kind of lost. I tried faceting on the field and then querying Django with the returned list of item_ids, but this loses the performance of Elasticsearch and also makes it impossible to use Elasticsearch's sorting features.
Any thoughts?
EDIT:
Example data:
Item Model
==========
id  title
1   'Item 1'
2   'Item 2'
3   'Item 3'

VendorItem Model  << the table in question
================
id  item_id  vendor_id  lat   lon
1   1        1          38    -122
2   2        1          38.2  -121.8
3   3        2          37.9  -121.9
4   1        2          ...   ...
5   2        2          ...   ...
6   2        3          ...   ...
As you can see, there are multiple VendorItem's for the same Item, however when searching I only want to retrieve at most one result for each item. Therefore I need the item_id column to be unique/distinct.
I have tried faceting on the item_id column, and then executing the following query:
sqs = SearchQuerySet().filter(content=query).facet('item_id')
counts = sqs.facet_counts()
# ids will look like: [345, 892, 123, 34, ...]
ids = [i[0] for i in counts['fields']['item_id']]
items = VendorItem.objects.filter(
    vendor__lat__gte=latMin, vendor__lon__gte=lonMin,
    vendor__lat__lte=latMax, vendor__lon__lte=lonMax,
    item_id__in=ids,
).distinct('item').select_related('vendor', 'item')
The main problem here is that results are limited to 100 items, and they cannot be sorted with haystack.

I think the best advice I can give you is to stop using Haystack.
Haystack's default Elasticsearch backend (elasticsearch_backend.py) is mostly written with Solr in mind. There are a lot of annoyances I find in haystack, but the biggest has to be that it packs every query into a query_string query. With query_string you get the Lucene syntax, but you also lose the entire Elasticsearch DSL. The Lucene syntax has some advantages, especially if it is what you are used to, but it is very limiting from an Elasticsearch point of view.
Furthermore, I think you are applying an RDBMS concept to a search engine. That isn't to say that you shouldn't get the results you need, but the approach is often different.
The way you query and retrieve this data might also be different without haystack, because haystack creates indexes in a way that is more appropriate for Solr than for Elasticsearch.
For example, when creating a new index, haystack assigns a single "type" called "modelresult" to every model that goes into the index.
So, let's say you have some entities called items and some other entities called vendoritems.
It might be appropriate to keep them both in the same index, but with the vendoritems documents under a vendoritems type and the items documents under an items type.
When querying, you would then hit the REST endpoint for that type, something like localhost:9200/index/type. The way haystack achieves this is through the Django content types module: there is a field called "django_ct" that haystack attaches to any query you make when you only want results from particular models.
To illustrate the above:
This endpoint searches across all indexes:
`localhost:9200/`
This endpoint searches across all types in an index:
`localhost:9200/yourindex/`
This endpoint searches in a type within an index:
`localhost:9200/yourindex/yourtype/`
and this endpoint searches in two specified types within an index:
`localhost:9200/yourindex/yourtype,yourothertype/`
Back to haystack though, you can possibly get unique values by adding a django_ct to your query, but likely that isn't what you want.
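For completeness, haystack exposes that django_ct restriction through SearchQuerySet.models(). A minimal sketch, assuming ItemIndex above is registered for a VendorItem model (the model name is an assumption, adjust to your own):

from haystack.query import SearchQuerySet

# .models() adds the django_ct filter for you, limiting results to documents
# indexed from the given model(s).
sqs = SearchQuerySet().models(VendorItem).filter(content=my_search_query)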
What you really want to do is a facet, and probably you want to use term facets. This could be a problem in haystack because it A.) analyzes all text and B.) applies store=True to all fields (really not something you want to do in elasticsearch, but something you often want to do in solr).
You can order facet results in elasticsearch (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_ordering)
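If you talk to Elasticsearch directly, the distinct item_ids can be pulled out with a single terms facet, ordered and with the small default bucket limit raised. A rough sketch using the legacy facets API from the docs linked above; the index/type names follow the haystack conventions discussed earlier and the query is assumed to be a plain query_string, so adjust both to your setup:

import json
import requests

body = {
    "query": {"query_string": {"query": my_search_query}},
    "facets": {
        "item_ids": {
            # 'size' lifts the default cap on returned terms,
            # 'order' sorts the buckets (count, term, reverse_count, reverse_term)
            "terms": {"field": "item_id", "size": 1000, "order": "count"}
        }
    },
    "size": 0,  # we only care about the facet buckets, not the hits
}
resp = requests.post("http://localhost:9200/yourindex/modelresult/_search",
                     data=json.dumps(body))
item_ids = [entry["term"] for entry in resp.json()["facets"]["item_ids"]["terms"]]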
I don't mean for this to be a slam on haystack. I think it does a lot of things right conceptually. It's especially good if all you need to do is index a single model (like say a blog) and just have it quickly return results.
That said, I highly recommend elasticutils. Some of the concepts from haystack are similar, but it uses the search DSL rather than query_string (though you can still use query_string if you want).
Be warned though, I don't think you can order facets using elasticutils by default, but you can just pass a Python dictionary of the facets you want to the facet_raw method (something I don't think you can do in haystack).
Your last option is to create your own haystack backend: inherit from the existing backend and add some functionality to the .facet() method to allow ordering per the DSL above.
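If you take that route, a rough sketch of what it might look like is below. It assumes haystack 2.x, where the Elasticsearch backend assembles its search body in build_search_kwargs; the exact kwargs layout varies between haystack versions, so treat this as a starting point rather than a drop-in implementation:

from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchEngine,
)


class OrderedFacetBackend(ElasticsearchSearchBackend):
    def build_search_kwargs(self, query_string, **kwargs):
        search_kwargs = super(OrderedFacetBackend, self).build_search_kwargs(
            query_string, **kwargs)
        # Force an ordering on every terms facet haystack built
        # (assumes the backend stores them under a 'facets' key).
        for facet in search_kwargs.get('facets', {}).values():
            if 'terms' in facet:
                facet['terms'].setdefault('order', 'count')
        return search_kwargs


class OrderedFacetEngine(ElasticsearchSearchEngine):
    backend = OrderedFacetBackend

You would then point HAYSTACK_CONNECTIONS['default']['ENGINE'] at OrderedFacetEngine in settings.py.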

Related

Django - Filter GTE LTE for alphanumeric IDs

I am trying to extend our APIs with LTE and GTE filtering on our ID lookups. However, the IDs we have are alphanumeric, like AB:12345, AB:98765 and so on. I am trying to do the following on the viewset using django-filter:
class MyFilter(django_filters.FilterSet):
    item_id = AllLookupsFilter()

    class Meta:
        model = MyModel
        fields = {
            'item_id': ['lte', 'gte'],
        }
But the issue is, if I query http://123.1.1.1:7000/my-entities/?item_id__gte=AB:1999 or even http://123.1.1.1:7000/my-entities/?item_id__lte=AB:100, it won't return items with IDs greater than 1999 or less than 100, since the filter treats the ID as a string and compares it character by character. Any idea how I can filter on IDs so that I get items greater/less than the numeric part of the ID (ignoring the initial characters)?
What you'll want to do is write a custom lookup. You can read more about them here: https://docs.djangoproject.com/en/2.0/howto/custom-lookups/
The code sample below has everything you need to define your own except the actual function. For that part of the example check the link.
from django.db.models import Field, Lookup

@Field.register_lookup
class NotEqual(Lookup):
    lookup_name = 'ne'
In the lookup, you'll need to split the string and then search based on your own parameters. This will likely require you to do one of the following:
Write custom SQL that you can pass through Django into your query.
Request a large number of results containing the subset you're looking for and filter that through Python, returning only the important bits.
What you're trying to accomplish is usually called Natural Sorting, and tends to be very difficult to do on the SQL side of things. There's a good trick, explained well here: https://www.copterlabs.com/natural-sorting-in-mysql/ However, the highlights are simple in the SQL world:
Sort by Length first
Sort by column value second
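If the prefix is fixed-width (always "AB:" as in the examples), another option that avoids raw SQL entirely is to annotate the numeric part with database functions and compare or sort it as an integer. A minimal sketch using the MyModel and item_id names from the question; Cast requires Django 1.10+:

from django.db.models import IntegerField
from django.db.models.functions import Cast, Substr

# Strip the three-character "AB:" prefix, cast the rest to an integer,
# then filter and sort numerically.
qs = (
    MyModel.objects
    .annotate(item_num=Cast(Substr('item_id', 4), output_field=IntegerField()))
    .filter(item_num__gte=1999)
    .order_by('item_num')
)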

What is the difference between with_entities and load_only in SQLAlchemy?

When querying my database, I only want to load specified columns. Creating a query with with_entities requires a reference to the model column attribute, while creating a query with load_only requires a string corresponding to the column name. I would prefer to use load_only because it is easier to create a dynamic query using strings. What is the difference between the two?
load_only documentation
with_entities documentation
There are a few differences. The most important one when discarding unwanted columns (as in the question) is that using load_only will still result in the creation of a full object (a Model instance), while using with_entities will just get you tuples with the values of the chosen columns.
>>> query = User.query
>>> query.options(load_only('email', 'id')).all()
[<User 1 using e-mail: n#d.com>, <User 2 using e-mail: n#d.org>]
>>> query.with_entities(User.email, User.id).all()
[('n#d.org', 1), ('n#d.com', 2)]
load_only
load_only() loads only the columns you name and defers the rest.
The deferred columns are removed from the query. You can still access them later, but an additional query will be performed in the background the moment you try to.
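A small illustration of that lazy loading, assuming the User model from the example above:

from sqlalchemy.orm import load_only

user = User.query.options(load_only('name')).first()
print(user.name)   # already loaded by the original query
print(user.email)  # deferred column: triggers an extra SELECT on first access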
"Load only" is useful when you store things like pictures of users in your database but you do not want to waste time transferring the images when not needed. For example, when displaying a list of users this might suffice:
User.query.options(load_only('name', 'fullname'))
with_entities
with_entities() can either add or remove (simply: replace) models or columns; you can even use it to modify the query, to replace selected entities with your own function like func.count():
query = User.query
count_query = query.with_entities(func.count(User.id))
count = count_query.scalar()
Note that the resulting query is not the same as query.count(), which would probably be slower, at least in MySQL, as it generates a subquery.
Another example of the extra capabilities of with_entities would be:
query = (
    Page.query
    .filter(<a lot of page filters>)
    .join(Author).filter(<some author filters>)
)
pages = query.all()

# ok, I got the pages. Wait, what? I want the authors too!
# how to do it without generating the query again?
pages_and_authors = query.with_entities(Page, Author).all()

Django Model and Many-to-Many Relationships -- finding most similar objects

I'm running into an issue that I can't find an explanation for.
Given one object (in this case, an "Article"), I want to use another type of object (in this case, a "Category") to determine which other articles are most similar to article X, as measured by the number of categories they have in common. The relationship between Article and Category is Many-to-Many. The use case is to get a quick list of related Objects to present as links.
I know exactly how I would write the SQL by hand:
select
    ac.article_id
from
    Article_Category ac
where
    ac.category_id in
    (
        select
            category_id
        from
            Article_Category
        where
            article_id = 1 -- get all categories for article in question
    )
    and ac.article_id <> 1
group by
    ac.article_id
order by
    count(ac.category_id) desc, random()
limit 5
What I'm struggling with is how to use the Django model aggregation to match this logic and only run one query. I'd obviously prefer to do it within the framework if possible. Does anybody have pointers on this?
Adding this in now that I've found a way within the model framework to do this.
related_article_list = Article.objects.filter(category=self.category.all())\
    .exclude(id=self.id)

related_article_ids = related_article_list.values('id')\
    .annotate(count=models.Count('id'))\
    .order_by('-count', '?')
In the related_article_list part, other Article objects that match on 2 or more Categories will be included multiple times. Thus, when using annotation to count them, the count will be > 1 and they can be ordered by it.
I think the correct answer, if you really want to filter articles on all the categories, should look like this:
related_article_list = Article.objects.filter(category__in=self.category.all())\
    .exclude(id=self.id)
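For reference, because the filter() precedes the annotate() the two clauses share the same join, so the two steps above can be collapsed into one queryset that mirrors the hand-written SQL (this assumes self is an Article and the many-to-many field is named category):

from django.db.models import Count

related_articles = (
    Article.objects
    .filter(category__in=self.category.all())
    .exclude(id=self.id)
    .annotate(shared=Count('category'))  # counts only the shared categories
    .order_by('-shared', '?')[:5]
)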

Variable interpolation in python/django, django query filters [duplicate]

Given a class:
from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=20)
Is it possible, and if so how, to have a QuerySet that filters based on dynamic arguments? For example:
# Instead of:
Person.objects.filter(name__startswith='B')
# ... and:
Person.objects.filter(name__endswith='B')
# ... is there some way, given:
filter_by = '{0}__{1}'.format('name', 'startswith')
filter_value = 'B'
# ... that you can run the equivalent of this?
Person.objects.filter(filter_by=filter_value)
# ... which will throw an exception, since `filter_by` is not
# an attribute of `Person`.
Python's argument expansion may be used to solve this problem:
kwargs = {
    '{0}__{1}'.format('name', 'startswith'): 'A',
    '{0}__{1}'.format('name', 'endswith'): 'Z',
}
Person.objects.filter(**kwargs)
This is a very common and useful Python idiom.
A simplified example:
In a Django survey app, I wanted an HTML select list showing registered users. But because we have 5000 registered users, I needed a way to filter that list based on query criteria (such as just people who completed a certain workshop). In order for the survey element to be re-usable, I needed for the person creating the survey question to be able to attach those criteria to that question (don't want to hard-code the query into the app).
The solution I came up with isn't 100% user friendly (requires help from a tech person to create the query) but it does solve the problem. When creating the question, the editor can enter a dictionary into a custom field, e.g.:
{'is_staff':True,'last_name__startswith':'A',}
That string is stored in the database. In the view code, it comes back as self.question.custom_query. The value of that is a string that looks like a dictionary. We turn it back into a real dictionary with eval() and then stuff it into the queryset with **kwargs:
kwargs = eval(self.question.custom_query)
user_list = User.objects.filter(**kwargs).order_by("last_name")
To extend the previous answer, which prompted some requests for further code, here is some working code that I use with Q. Let's say that a request may or may not include filters on fields like:
publisher_id
date_from
date_until
Those fields can appear in the query, but they may also be missing.
This is how I am building filters based on those fields on an aggregated query that cannot be further filtered after the initial queryset execution:
from django.db.models import Q

# prepare filters to apply to queryset
filters = {}
if publisher_id:
    filters['publisher_id'] = publisher_id
if date_from:
    filters['metric_date__gte'] = date_from
if date_until:
    filters['metric_date__lte'] = date_until

filter_q = Q(**filters)
queryset = Something.objects.filter(filter_q)...
Hope this helps, since I spent quite some time digging this up.
Edit:
As an additional benefit, you can use lists too. For the previous example, if instead of publisher_id you had a list called publisher_ids, then you could use this piece of code:
if publisher_ids:
    filters['publisher_id__in'] = publisher_ids
django.db.models.Q is exactly what you want, the Django way.
This looks much more understandable to me:
kwargs = {
    'name__startswith': 'A',
    'name__endswith': 'Z',
    # add more filters here
}
Person.objects.filter(**kwargs)
A really complex search form usually indicates that a simpler model is trying to dig its way out.
How, exactly, do you expect to get the values for the column name and operation?
Where do you get the values of 'name' and 'startswith'?
filter_by = '%s__%s' % ('name', 'startswith')
A "search" form? You're going to -- what? -- pick the name from a list of names? Pick the operation from a list of operations? While open-ended, most people find this confusing and hard-to-use.
How many columns have such filters? 6? 12? 18?
A few? A complex pick-list doesn't make sense. A few fields and a few if-statements make sense.
A large number? Your model doesn't sound right. It sounds like the "field" is actually a key to a row in another table, not a column.
Specific filter buttons. Wait... That's the way the Django admin works. Specific filters are turned into buttons. And the same analysis as above applies. A few filters make sense. A large number of filters usually means a kind of first normal form violation.
A lot of similar fields often means there should have been more rows and fewer fields.

'Stringing together' a pymongo query based on a set of conditions

I have a set of conditions that I need to use to retrieve some data from a mongodb database (using pymongo). Some of these conditions are optional, and others may have more than one possible value.
I'm wondering if there is a way of 'dynamically' constructing a pymongo query based on these conditions (instead of creating individual queries for each possible combination of conditions).
For example, assume that I have one query which has to be constrained to the following conditions:
tag contains any of this, is, a, tag
user is johnsmith
date_published is before today
...whereas another query may only be constrained to the following:
user is johnsmith
date_published is after today
Summary: Instead of having to create every possible combination of conditions, is there a way of stringing conditions together to form a query in pymongo?
A PyMongo query is just a Python dictionary, so you can use all the usual techniques to build one on the fly:
def find_things(tags=None, user=None, published_since=None):
    # all queries begin with something common, which may
    # be an empty dict, but here's an example
    query = {
        'is_published': True
    }
    if tags:
        # assume that it is an array of strings
        query['tags'] = {'$in': tags}
    if user:
        # assume that it is a string
        query['user'] = user
    if published_since:
        # assume that it is a datetime.datetime
        query['date_published'] = {'$gte': published_since}
    # etc...
    return db.collection.find(query)
The actual logic you implement obviously depends on what you want to vary your find calls by; these are just a few examples. You will also want to validate the input if it comes from an untrusted source (e.g. a web application form, URL parameters, etc.).
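Hypothetical usage, mirroring the two example queries from the question (the sketch above only covers "published since"; a "published before today" condition would need an analogous '$lt' branch):

import datetime

today = datetime.datetime.combine(datetime.date.today(), datetime.time.min)

first = find_things(tags=['this', 'is', 'a', 'tag'], user='johnsmith')
second = find_things(user='johnsmith', published_since=today)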
