Django Model and Many-to-Many Relationships -- finding most similar objects


I'm running into an issue that I can't find an explanation for.
Given one object (in this case, an "Article"), I want to use another type of object (in this case, a "Category") to determine which other articles are most similar to article X, as measured by the number of categories they have in common. The relationship between Article and Category is Many-to-Many. The use case is to get a quick list of related Objects to present as links.
I know exactly how I would write the SQL by hand:
select
    ac.article_id
from
    Article_Category ac
where
    ac.category_id in
    (
        select
            category_id
        from
            Article_Category
        where
            article_id = 1 -- get all categories for article in question
    )
    and ac.article_id <> 1
group by
    ac.article_id
order by
    count(ac.category_id) desc, random() limit 5
What I'm struggling with is how to use Django's model aggregation to match this logic and only run one query. I'd obviously prefer to do it within the framework if possible. Does anybody have pointers on this?

Adding this in now that I've found a way within the model framework to do this.
related_article_list = Article.objects.filter(category=self.category.all())\
.exclude(id=self.id)
related_article_ids = related_article_list.values('id')\
.annotate(count=models.Count('id'))\
.order_by('-count','?')
In the related_article_list part, other Article objects that match on 2 or more Categories will be included separate times. Thus, when using annotation to count them the number will be > 1 and they can be ordered that way.
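The counting idea can be sketched in plain Python, without Django. This is only an illustration under assumptions: the `article_categories` mapping and the `related_article_ids` helper are hypothetical stand-ins for the many-to-many table, and the random tie-break from the SQL is left out so the result stays deterministic.

```python
from collections import Counter


def related_article_ids(article_categories, article_id, limit=5):
    """Rank other articles by how many categories they share with article_id."""
    own = article_categories[article_id]
    shared = Counter()
    for other_id, categories in article_categories.items():
        if other_id == article_id:
            continue  # same as .exclude(id=self.id)
        overlap = len(own & categories)
        if overlap:  # articles with no shared category are never matched
            shared[other_id] = overlap
    # most_common() sorts by count, highest first, like order_by('-count')
    return [aid for aid, _ in shared.most_common(limit)]


# Hypothetical sample data: article id -> set of category names
article_categories = {
    1: {"python", "django", "orm"},
    2: {"python", "django"},
    3: {"orm"},
    4: {"cooking"},
}
print(related_article_ids(article_categories, 1))
```

Article 2 shares two categories with article 1 and article 3 shares one, so they come back in that order; article 4 shares none and is left out.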

I think the correct answer, if you really want to filter articles on all categories, should look like this:
related_article_list = Article.objects.filter(category__in=self.category.all())\
.exclude(id=self.id)

Related

Django annotate count doesn't work, always returns 1

My models:
class PicKeyword(models.Model):
    """chat context collection
    """
    keyword = models.TextField(blank=True, null=True)
My filter:
from django.db.models import Count
PicKeyword.objects.annotate(Count('keyword'))[0].keyword__count
The result is always 1.
For example:
print(PicKeyword.objects.filter(keyword='hi').count())
show:
3
PicKeyword.objects.filter(keyword='hi').annotate(Count('keyword'))[0].keyword__count
show:
1
Is it because I use SQLite, or because my keyword field is a TextField?

I had exactly the same problem and none of the answers worked for me.
I was able to get it working with the following syntax, passing only one field to values() (this worked for me because I was trying to spot duplicates):
dups = PicKeyword.objects.values('keyword').annotate(Count('keyword')).filter(keyword__count__gt=1)
The following always returns an empty queryset, which it should not:
dups = PicKeyword.objects.values('keyword','id').annotate(Count('keyword')).filter(keyword__count__gt=1)
Hope it helps
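A possible explanation for the difference between the two queries is the grouping: values() before annotate() decides the GROUP BY columns, and including id makes every group a single row. The sqlite3 sketch below approximates this with hand-written SQL; the table name `pickeyword` and the sample rows are assumptions, and the SQL is not the exact statement the ORM generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pickeyword (id INTEGER PRIMARY KEY, keyword TEXT)")
conn.executemany(
    "INSERT INTO pickeyword (keyword) VALUES (?)",
    [("hi",), ("hi",), ("hi",), ("bye",)],
)

# Roughly what values('keyword').annotate(Count('keyword')) groups by
by_keyword = conn.execute(
    "SELECT keyword, COUNT(keyword) FROM pickeyword "
    "GROUP BY keyword HAVING COUNT(keyword) > 1"
).fetchall()

# Roughly what values('keyword', 'id') groups by: one group per row
by_keyword_and_id = conn.execute(
    "SELECT keyword, id, COUNT(keyword) FROM pickeyword "
    "GROUP BY keyword, id HAVING COUNT(keyword) > 1"
).fetchall()

print(by_keyword)
print(by_keyword_and_id)
```

Grouping by keyword alone finds the duplicated 'hi'; grouping by keyword and id can never produce a count above 1, so the filter removes everything.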

The way you are trying to annotate the count of keywords here,
PicKeyword.objects.annotate(Count('keyword'))[0].keyword__count
only works when you want to summarize the relationship between multiple objects. Since you do not have any objects related to the keyword field, the output is always 1 (its own instance).
As the API docs for QuerySet.annotate state,
Annotates each object in the QuerySet with the provided list of query expressions. An expression may be a simple value, a reference to a field on the model (or any related models), or an aggregate expression (averages, sums, etc.) that has been computed over the objects that are related to the objects in the QuerySet.
Blog Model reference for the Queryset.annotate example
Finally, if there are no relationships among objects and your aim is just to get the count of objects in your PicKeyword model, the answers from @samba and @Willem are correct.

The following will show the exact count:
PicKeyword.objects.annotate(Count('keyword')).count()

How to do a Django subquery

I have two examples of code which accomplish the same thing. One is written in Python, the other in SQL.
Exhibit A (Python):
surveys = Survey.objects.all()
consumer = Consumer.objects.get(pk=24)
consumer_ballot_list = []
consumer_survey_list = []
for ballot in consumer.ballot_set.all():
    consumer_ballot_list.append(ballot.question_id)
for survey in surveys:
    if survey.id not in consumer_ballot_list:
        consumer_survey_list.append(survey.id)
Exhibit B (SQL):
SELECT * FROM clients_survey WHERE id NOT IN (SELECT question_id FROM consumers_ballot WHERE consumer_id=24) ORDER BY id;
I want to know how I can make exhibit A much cleaner and more efficient using Django's ORM and subqueries.
In this example:
I have ballots which contain a question_id that refers to the survey which a consumer has answered.
I want to find all of the surveys that the consumer hasn't answered. So I need to check each question_id (survey.id) in the consumer's set of ballots against the Survey model's ids and make sure that only the surveys for which the consumer does NOT have a ballot are returned.

You more or less have the correct idea. To replicate your SQL code using Django's ORM, you just have to break the SQL into its discrete parts:
1. Create a table of the question_ids that consumer 24 has answered.
2. Filter the surveys for all ids not in the aforementioned table.
consumer = Consumer.objects.get(pk=24)
# step 1
answered_survey_ids = consumer.ballot_set.values_list('question_id', flat=True)
# step 2
unanswered_surveys_ids = Survey.objects.exclude(id__in=answered_survey_ids).values_list('id', flat=True)
This is basically what you did in your current Python-based approach, except that I took advantage of a few of Django's nice ORM features.
.values_list() - this allows you to extract a specific field from all the objects in the given queryset.
.exclude() - this is the opposite of .filter() and returns all items in the queryset that don't match the condition.
__in - this is useful if we have a list of values and we want to filter/exclude all items that match those values.
Hope this helps!
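To see the two steps at work without a Django project, here is a small sqlite3 sketch that runs the SQL from Exhibit B. The table and column names come from the question; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients_survey (id INTEGER PRIMARY KEY);
CREATE TABLE consumers_ballot (
    id INTEGER PRIMARY KEY,
    consumer_id INTEGER,
    question_id INTEGER
);
INSERT INTO clients_survey (id) VALUES (1), (2), (3), (4);
INSERT INTO consumers_ballot (consumer_id, question_id)
VALUES (24, 1), (24, 3), (7, 2);
""")

# Step 1 is the inner SELECT, step 2 is the NOT IN filter around it
unanswered = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM clients_survey "
        "WHERE id NOT IN ("
        "    SELECT question_id FROM consumers_ballot WHERE consumer_id = ?"
        ") ORDER BY id",
        (24,),
    )
]
print(unanswered)
```

Consumer 24 has ballots for surveys 1 and 3, so surveys 2 and 4 remain. The ballot for survey 2 belongs to another consumer and does not count.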

Django filter and get the whole record back when using a .values() column-based annotation

This may be a common query but I've struggled to find an answer. This answer to an earlier question gets me half-way using .annotate() and Count but I can't figure out how then to get the full record for the filtered results.
I'm working with undirected networks and would like to limit the query based on a subset of target nodes.
Sample model:
class Edges(models.Model):
    id = models.AutoField(primary_key=True)
    source = models.BigIntegerField()
    target = models.BigIntegerField()
I want to get a queryset of Edges where the .target exists within a list passed to filter. I then want to exclude any Edges whose source does not occur more than a given number of times (1 in the example below, but this may change).
Here's the query so far (parentheses added just for better legibility):
(Edges.objects.filter(target__in=[1234, 5678, 9012])
    .values('source')
    .annotate(source_count=Count("source"))
    .filter(source_count__gt=1)
)
This query just delivers the source and new source_count fields but I want the whole record (id, source and target) for the subset.
Should I be using this as a subquery or am I missing some obvious Django-foo?

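I would suggest using your grouped query as a subquery. A filter on source_count has to come after the annotate() that defines it, and annotating full Edges rows groups by the primary key, so the count would always be 1. Build the list of repeated sources first:
targets = [1234, 5678, 9012]
repeated_sources = (Edges.objects.filter(target__in=targets)
    .values('source')
    .annotate(source_count=Count('source'))
    .filter(source_count__gt=1)
    .values('source'))
Then use either
Edges.objects.filter(target__in=targets, source__in=repeated_sources).values('id', 'source', 'target')
to get only the values of id, source and target, or
Edges.objects.filter(target__in=targets, source__in=repeated_sources)
to get a QuerySet of Edges instances, on which you can also call any methods you have defined (this may cost more database work, though).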
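One way to get whole rows back is to use the grouped query as a subquery and select full rows around it. The sqlite3 sketch below shows the idea in plain SQL; the `edges` table name and the sample rows are assumptions, and the statement is hand-written rather than ORM output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges (id INTEGER PRIMARY KEY, source INTEGER, target INTEGER);
INSERT INTO edges (source, target) VALUES
    (1, 1234), (1, 5678), (2, 1234), (3, 9012), (3, 1234), (1, 4321);
""")

targets = (1234, 5678, 9012)
marks = ", ".join("?" for _ in targets)  # one placeholder per target

# Inner query: sources seen more than once among the target rows.
# Outer query: full rows for those sources, limited to the same targets.
query = f"""
SELECT id, source, target FROM edges
WHERE target IN ({marks})
  AND source IN (
      SELECT source FROM edges
      WHERE target IN ({marks})
      GROUP BY source
      HAVING COUNT(source) > 1
  )
ORDER BY id
"""
rows = conn.execute(query, targets + targets).fetchall()
print(rows)
```

Sources 1 and 3 each appear twice among the target rows, so their four edges are returned with all columns; source 2 appears once and is dropped.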

Can I create Dynamic columns (and models) in django?

I want to create a database of disliked items, but depending on the category of an item, there are different columns I'd like to show, for example when all you're looking at is cars. In fact, I'd like the columns to be dynamic based on the category, so we can easily add an additional property to cars in the future and have that column show up too.
For example:
But when you filter on car or person, additional rows show up for filtering.
All the examples that I can find about using django models aren't giving me a very clear picture on how I might accomplish this behavior in a clean, simple web interface.

I would probably go for a model describing a "dislike criterion":
class DislikeElement(models.Model):
    # Item is the model corresponding to your first table
    item = models.ForeignKey(Item, on_delete=models.CASCADE)
    field_name = models.CharField(max_length=255)  # e.g. "Model", "Year born"...
    value = models.CharField(max_length=255)  # e.g. "Mustang", "1960"...
You would have quite a lot of flexibility in what data you can retrieve. For example, to get all the dislike elements for a given item, you would just have to do something like item.dislikeelement_set.all().
The only problem with this solution is that you would have to store numbers, strings, dates... in value under the same data type. But maybe that's not an issue for you.
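To show how such key/value rows could become dynamic columns in a view, here is a plain-Python sketch. The `pivot` helper and the sample rows are hypothetical; in a real project the rows would come from a DislikeElement queryset.

```python
from collections import defaultdict


def pivot(rows):
    """Turn (item, field_name, value) rows into one dict of columns per item."""
    table = defaultdict(dict)
    for item, field_name, value in rows:
        table[item][field_name] = value
    return dict(table)


# Hypothetical rows, shaped like DislikeElement records
rows = [
    ("Ford", "Model", "Mustang"),
    ("Ford", "Year", "1964"),
    ("Bob", "Year born", "1960"),
]

table = pivot(rows)
# The column headers are whatever field names occur in the data
columns = sorted({field_name for _, field_name, _ in rows})
print(columns)
print(table["Ford"].get("Model"), table["Bob"].get("Model"))
```

Adding a new property later means adding rows with a new field_name; the column list picks it up without a schema change, and items that lack the property simply return None for it.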

Django Haystack Distinct Value for Field

I am building a small search engine using Django Haystack + Elasticsearch + Django REST Framework, and I'm trying to figure out how to reproduce the behavior of a Django QuerySet's distinct method.
My index looks something like this:
class ItemIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    item_id = indexes.IntegerField(faceted=True)

    def prepare_item_id(self, obj):
        return obj.item_id
What I'd like to be able to do is the following:
sqs = SearchQuerySet().filter(content=my_search_query).distinct('item_id')
However, Haystack's SearchQuerySet doesn't have a distinct method, so I'm kind of lost. I tried faceting the field, and then querying Django using the returned list of item_id's, but this loses the performance of Elasticsearch, and also makes it impossible to use Elasticsearch's sorting features.
Any thoughts?
EDIT:
Example data:
Item Model
==========
id   title
1    'Item 1'
2    'Item 2'
3    'Item 3'
VendorItem Model << the table in question
================
id   item_id   vendor_id   lat    lon
1    1         1           38     -122
2    2         1           38.2   -121.8
3    3         2           37.9   -121.9
4    1         2           ...    ...
5    2         2           ...    ...
6    2         3           ...    ...
As you can see, there are multiple VendorItems for the same Item. However, when searching I only want to retrieve at most one result for each item, so I need the item_id column to be unique/distinct.
I have tried faceting on the item_id column, and then executing the following query:
sqs = SearchQuerySet().filter(content=query).facet('item_id')
counts = sqs.facet_counts()
# ids will look like: [345, 892, 123, 34,...]
ids = [i[0] for i in counts['fields']['item_id']]
items = VendorItem.objects.filter(vendor__lat__gte=latMin,
    vendor__lon__gte=lonMin, vendor__lat__lte=latMax,
    vendor__lon__lte=lonMax, item_id__in=ids).distinct(
    'item').select_related('vendor', 'item')
The main problem here is that results are limited to 100 items, and they cannot be sorted with haystack.
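For small result sets, the "at most one result per item" requirement can also be met on the client side by keeping the first hit for each item_id from an already sorted result list. This is a plain-Python sketch, not a Haystack or Elasticsearch feature; the `first_hit_per_item` helper and the sample hits are hypothetical.

```python
def first_hit_per_item(hits, key="item_id"):
    """Keep the first hit for each item_id, preserving the incoming order."""
    seen = set()
    unique = []
    for hit in hits:
        if hit[key] not in seen:
            seen.add(hit[key])
            unique.append(hit)
    return unique


# Hypothetical search hits, already sorted by relevance
hits = [
    {"item_id": 2, "vendor_id": 1, "score": 0.9},
    {"item_id": 1, "vendor_id": 1, "score": 0.8},
    {"item_id": 2, "vendor_id": 3, "score": 0.7},
    {"item_id": 3, "vendor_id": 2, "score": 0.5},
    {"item_id": 1, "vendor_id": 2, "score": 0.4},
]

unique_hits = first_hit_per_item(hits)
print([hit["item_id"] for hit in unique_hits])
```

The trade-off is that deduplication happens after the search engine has returned results, so page sizes become uneven and duplicates further down the ranking still have to be fetched.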

I think the best advice I can give you is to stop using Haystack.
Haystack's default backend (the elasticsearch_backend.py) is mostly written with Solr in mind. There are a lot of annoyances that I find in haystack, but the biggest has to be that it packs all queries into something called query_string. Using query string, they can use the lucene syntax, but it also means losing the entire elasticsearch DSL. The lucene syntax has some advantages, especially if this is what you are used to, but it is very limiting from an elasticsearch point of view.
Furthermore, I think you are applying an RDBMS concept to a search engine. That isn't to say that you shouldn't get the results you need, but the approach is often different.
The way you might query and retrieve this data might be different if you don't use haystack because haystack creates indexes in a way more appropriate for solr than for elasticsearch.
For example, in creating a new index, haystack will assign a "type" called "modelresult" to all models that will go in an index.
So, let's say you have some entities called Items and some other entities called vendoritems.
It might be appropriate to have them both in the same index but with vendoritems as a type of vendoritems and items having a type of items.
When querying, you would then query based on the REST endpoint, so something like localhost:9200/index/type (query). The way haystack achieves this is through the django content types module. Accordingly, there is a field called "django_ct" that haystack queries and attaches to any query you might make when you are only looking for unique items.
To illustrate the above:
This endpoint searches across all indexes:
`localhost:9200/`
This endpoint searches across all types in an index:
`localhost:9200/yourindex/`
This endpoint searches in a type within an index:
`localhost:9200/yourindex/yourtype/`
and this endpoint searches in two specified types within an index:
`localhost:9200/yourindex/yourtype,yourothertype/`
Back to haystack though, you can possibly get unique values by adding a django_ct to your query, but likely that isn't what you want.
What you really want to do is a facet, and probably you want to use term facets. This could be a problem in haystack because it A.) analyzes all text and B.) applies store=True to all fields (really not something you want to do in elasticsearch, but something you often want to do in solr).
You can order facet results in elasticsearch (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-terms-facet.html#_ordering)
I don't mean for this to be a slam on haystack. I think it does a lot of things right conceptually. It's especially good if all you need to do is index a single model (like say a blog) and just have it quickly return results.
That said, I highly recommend using elasticutils. Some of the concepts are similar to haystack's, but it uses the search DSL rather than query_string (you can still use query_string if you want).
Be warned though, I don't think you can order facets using elasticutils by default, but you can just pass a Python dictionary of the facets you want to the facet_raw method (something I don't think you can do in haystack).
Your last option is to create your own haystack backend, inherit from the existing backend and just add some functionality to the .facet() method to allow for ordering per the above dsl.
