Django full text search using indexes with PostgreSQL

After solving the problem I asked about in this question, I am trying to optimize the performance of full text search (FTS) using indexes.
I ran this command on my database:
CREATE INDEX my_table_idx ON my_table USING gin(to_tsvector('italian', very_important_field), to_tsvector('italian', also_important_field), to_tsvector('italian', not_so_important_field), to_tsvector('italian', not_important_field), to_tsvector('italian', tags));
Then I edited my model's Meta class as follows:
class MyEntry(models.Model):
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)
    class Meta:
        managed = False
        db_table = 'my_table'
        indexes = [
            GinIndex(
                fields=['very_important_field', 'also_important_field', 'not_so_important_field', 'not_important_field', 'tags'],
                name='my_table_idx'
            )
        ]
But nothing seems to have changed. The lookup takes exactly the same amount of time as before.
This is the lookup script:
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
# other unrelated stuff here
vector = SearchVector("very_important_field", weight="A") + \
    SearchVector("tags", weight="A") + \
    SearchVector("also_important_field", weight="B") + \
    SearchVector("not_so_important_field", weight="C") + \
    SearchVector("not_important_field", weight="D")
query = SearchQuery(search_string, config="italian")
rank = SearchRank(vector, query, weights=[0.4, 0.6, 0.8, 1.0])  # D, C, B, A
full_text_search_qs = MyEntry.objects.annotate(rank=rank).filter(rank__gte=0.4).order_by("-rank")
What am I doing wrong?
Edit:
The lookup above is wrapped in a function that I time with a decorator. The function actually returns a list, like this:
@timeit
def search(search_string):
    # the above code here
    qs = list(full_text_search_qs)
    return qs
Might this be the problem, maybe?

You need to add a SearchVectorField to your MyEntry, update it from your actual text fields and then perform the search on this field. However, the update can only be performed after the record has been saved to the database.
Essentially:
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
class MyEntry(models.Model):
    # The fields that contain the raw data.
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)
    # The field we are actually going to search.
    # Must be null=True because we cannot set it immediately during create()
    search_vector = SearchVectorField(editable=False, null=True)
    class Meta:
        # The search index pointing to our actual search field.
        indexes = [GinIndex(fields=["search_vector"])]
Then you can create the plain instance as usual, for example:
# Does not set MyEntry.search_vector yet.
my_entry = MyEntry.objects.create(
    very_important_field="something very important",  # Fake Italian text ;-)
    also_important_field="something different but equally important",
    not_so_important_field="this one matters less",
    not_important_field="we don't care about that one at all",
    tags="things, stuff, whatever",
)
Now that the entry exists in the database, you can update the search_vector field using all kinds of options. For example weight to specify the importance and config to use one of the default language configurations. You can also completely omit fields you don't want to search:
# Update search vector on existing database record.
my_entry.search_vector = (
    SearchVector("very_important_field", weight="A", config="italian")
    + SearchVector("also_important_field", weight="A", config="italian")
    + SearchVector("not_so_important_field", weight="C", config="italian")
    + SearchVector("tags", weight="B", config="italian")
)
my_entry.save()
Manually updating the search_vector field every time one of the text fields changes is error prone, so you might consider adding an SQL trigger that does it for you, installed through a Django migration. For an example of how to do that, see for instance a blog article on Full-text Search with Django and PostgreSQL.
To actually search in MyEntry using the index, you need to filter and rank by your search_vector field. The config for the SearchQuery should match the one used for the SearchVector above, so that both use the same stop words, stemming, etc.
For example:
from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F
search_query = SearchQuery("important", search_type="websearch", config="italian")
search_rank = SearchRank(F("search_vector"), search_query)
my_entries_found = (
    MyEntry.objects.annotate(rank=search_rank)
    .filter(search_vector=search_query)  # Perform full text search on index.
    .order_by("-rank")  # Yield most relevant entries first.
)

I'm not sure, but according to the PostgreSQL documentation (https://www.postgresql.org/docs/9.5/static/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX):
Because the two-argument version of to_tsvector was used in the index
above, only a query reference that uses the 2-argument version of
to_tsvector with the same configuration name will use that index. That
is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index,
but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an
index will be used only with the same configuration used to create the
index entries.
I don't know which configuration Django uses, but you can try removing the first argument.
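The quoted passage boils down to this: an expression index is only considered when the query repeats the indexed expression exactly. PostgreSQL cannot be run from a self-contained snippet, so the sketch below illustrates the same principle with SQLite's expression indexes as a stand-in. That swap is an assumption made purely for illustration, and the table docs, the index idx_docs_lower_body and the helper plan_for are hypothetical names. For the question above, this suggests comparing the SQL that Django actually generates (for example via print(queryset.query) or queryset.explain()) with the expressions used in CREATE INDEX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
# Index on an expression, analogous to indexing to_tsvector('italian', col).
conn.execute("CREATE INDEX idx_docs_lower_body ON docs (lower(body))")
conn.executemany("INSERT INTO docs (body) VALUES (?)", [("Ciao",), ("Mondo",)])


def plan_for(where_clause, value):
    """Return the query plan text SQLite reports for the given WHERE clause."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT body FROM docs WHERE {where_clause}",
        (value,),
    ).fetchall()
    # The human-readable plan step is the last column of each row.
    return " | ".join(row[-1] for row in rows)


# Same expression as the index: the planner can use it.
matching_plan = plan_for("lower(body) = ?", "ciao")
# Different expression: the index is not applicable, so the table is scanned.
other_plan = plan_for("upper(body) = ?", "CIAO")

print(matching_plan)
print(other_plan)
```

If the plan for the real query never mentions the index, the expression in the query and the expression in the index most likely differ.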

Related

How can i change this query to ORM?

Hi, I have two models like this:
class Sample(models.Model):
    name = models.CharField(max_length=256)
    processid = models.IntegerField(default=0)
class Process(models.Model):
    sample = models.ForeignKey(Sample, blank=False, null=True, on_delete=models.SET_NULL, related_name="process_set")
    end_at = models.DateTimeField(null=True, blank=True)
I want to join the Sample and Process models, because each Sample is related to a Process and I want to get the process information together with the sample.
SELECT sample.id, sample.name, process.endstat
FROM sample
INNER JOIN process
ON sample.processid = process.id
AND process.endstat = 1;
(I'm using SQLite.)
I used
sample_list = sample_list.filter(process_set__endstat=1)
but it returned
SELECT sample.id, sample.name
FROM sample
INNER JOIN process
ON (sample.id = process.sample_id)
AND process.endstat = 1
This is NOT what I want.
How can I solve the problem?

This should work for you:
Process.objects.filter(end_at=1).values('sample__id','sample__name','end_at')
The .values() method returns only the fields you list.

I'm assuming sample_list = Sample.objects.
When you filter a model, only the fields defined on that model are selected: in your example, id, name and processid. If you want to retrieve values from related models as a single record, you need to use values or values_list. To get the desired query, do this:
sample_list = sample_list.filter(process_set__endstat=1).values('id', 'name', 'process_set__endstat')
By the way, Django joins on the foreign key field, so you can't get ON sample.processid = process.id, since processid is not a ForeignKey field.
Reference:
https://docs.djangoproject.com/en/4.0/ref/models/querysets/#values

I found a way to JOIN on a field that is not a foreign key in Django:
sample_list = sample_list.filter(processid__in=Process.objects.filter(endstat=1))
I used the method from
Django-queryset join without foreignkey
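To make the difference between the two approaches concrete, here is a minimal sketch using the standard-library sqlite3 module (the question mentions SQLite). The table layout and sample rows are assumptions for illustration, and the endstat column follows the question's SQL rather than the model shown. The first query has roughly the shape that filter(processid__in=<queryset>) produces; the second is the explicit join from the question, which also exposes columns of process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE process (id INTEGER PRIMARY KEY, endstat INTEGER);
    CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT, processid INTEGER);
    INSERT INTO process VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO sample VALUES (1, 'a', 1), (2, 'b', 2), (3, 'c', 3), (4, 'd', 0);
""")

# Subquery form: filters samples, but returns only sample columns.
via_subquery = conn.execute("""
    SELECT id, name FROM sample
    WHERE processid IN (SELECT id FROM process WHERE endstat = 1)
    ORDER BY id
""").fetchall()

# Explicit join form from the question: same rows, plus process.endstat.
via_join = conn.execute("""
    SELECT sample.id, sample.name, process.endstat
    FROM sample
    INNER JOIN process
        ON sample.processid = process.id AND process.endstat = 1
    ORDER BY sample.id
""").fetchall()

print(via_subquery)
print(via_join)
```

Both queries select the same samples; only the join form can return process columns in the same row.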

django - prefetch only the newest record?

I am trying to prefetch only the latest record for each parent record.
My models are as follows:
class LinkTargets(models.Model):
    device_circuit_subnet = models.ForeignKey(DeviceCircuitSubnets, verbose_name="Device", on_delete=models.PROTECT)
    interface_index = models.CharField(max_length=100, verbose_name='Interface index (SNMP)', blank=True, null=True)
    get_bgp = models.BooleanField(default=False, verbose_name="get BGP Data?")
    dashboard = models.BooleanField(default=False, verbose_name="Display on monitoring dashboard?")
class LinkData(models.Model):
    link_target = models.ForeignKey(LinkTargets, verbose_name="Link Target", on_delete=models.PROTECT)
    interface_description = models.CharField(max_length=200, verbose_name='Interface Description', blank=True, null=True)
    ...
The query below fails with the error
AttributeError: 'LinkData' object has no attribute '_iterable_class'
Query:
link_data = LinkTargets.objects.filter(dashboard=True) \
    .prefetch_related(
        Prefetch(
            'linkdata_set',
            queryset=LinkData.objects.all().order_by('-id')[0]
        )
    )
I thought about querying LinkData instead and using select_related, but I have no idea how to get only one record for each link_target_id:
link_data = LinkData.objects.filter(link_target__dashboard=True) \
    .select_related('link_target')..?
EDIT:
Using rtindru's solution, the prefetched set seems to be empty. There are 6 records in there currently, at least 1 record for each of the 3 LinkTargets.
>>> link_data[0]
<LinkTargets: LinkTargets object>
>>> link_data[0].linkdata_set.all()
<QuerySet []>
>>>

The reason is that Prefetch expects a Django Queryset as the queryset parameter and you are giving an instance of an object.
Change your query as follows:
link_data = LinkTargets.objects.filter(dashboard=True) \
    .prefetch_related(
        Prefetch(
            'linkdata_set',
            queryset=LinkData.objects.filter(pk=LinkData.objects.latest('id').pk)
        )
    )
This does have the unfortunate effect of undoing the purpose of Prefetch to a large degree.
Update
Note that the query above prefetches exactly one record globally, not the latest LinkData record per LinkTarget.
To prefetch the latest LinkData for each LinkTarget, start at LinkData:
LinkData.objects.filter(link_target__dashboard=True).values('link_target').annotate(max_id=Max('id'))
This returns a queryset of dictionaries such as {'link_target': 12, 'max_id': 3223}.
You can then use it to return the right set of objects, for example by filtering LinkData on the max_id values.
That will look something like this:
from django.db.models import Max, Prefetch
latest_link_data_pks = LinkData.objects.filter(link_target__dashboard=True).values('link_target').annotate(max_id=Max('id')).values_list('max_id', flat=True)
link_data = LinkTargets.objects.filter(dashboard=True) \
    .prefetch_related(
        Prefetch(
            'linkdata_set',
            queryset=LinkData.objects.filter(pk__in=latest_link_data_pks)
        )
    )

The following works on PostgreSQL. I understand it won't help OP, but it might be useful to somebody else.
from django.db.models import Count, Prefetch
from .models import LinkTargets, LinkData
link_data_qs = LinkData.objects.order_by(
    'link_target__id',
    '-id',
).distinct(
    'link_target__id',
)
qs = LinkTargets.objects.prefetch_related(
    Prefetch(
        'linkdata_set',
        queryset=link_data_qs,
    )
).all()

LinkData.objects.all().order_by('-id')[0] is not a queryset, it is a model object, hence your error.
You could try LinkData.objects.all().order_by('-id')[0:1] which is indeed a QuerySet, but it's not going to work. Given how prefetch_related works, the queryset argument must return a queryset that contains all the LinkData records you need (this is then further filtered, and the items in it joined up with the LinkTarget objects). This queryset only contains one item, so that's no good. (And Django will complain "Cannot filter a query once a slice has been taken" and raise an exception, as it should).
Let's back up. Essentially you are asking an aggregation/annotation question - for each LinkTarget, you want to know the most recent LinkData object, or the 'max' of an 'id' column. The easiest way is to just annotate with the id, and then do a separate query to get all the objects.
So, it would look like this (I've checked with a similar model in my project, so it should work, but the code below may have some typos):
from django.db.models import Max
linktargets = (LinkTargets.objects
    .filter(dashboard=True)
    # The reverse lookup name defaults to the lowercased model name, 'linkdata'.
    .annotate(most_recent_linkdata_id=Max('linkdata__id')))
# Now, if we need them, let's collect the ids and get the actual objects.
linkdata_ids = [t.most_recent_linkdata_id for t in linktargets]
linkdata_objects = LinkData.objects.filter(id__in=linkdata_ids)
# And we can decorate the LinkTarget objects as well if we want:
linkdata_d = {l.id: l for l in linkdata_objects}
for t in linktargets:
    if t.most_recent_linkdata_id is not None:
        t.most_recent_linkdata = linkdata_d[t.most_recent_linkdata_id]
I have deliberately not made this into a prefetch that masks linkdata_set, because the result is that you have objects that lie to you - the linkdata_set attribute is now missing results. Do you really want to be bitten by that somewhere down the line? Best to make a new attribute that has just the thing you want.
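For readers who want to see the two-step idea outside the ORM, here is a minimal sketch with the standard-library sqlite3 module. It is not the SQL Django generates; the table names, columns and sample rows are assumptions for illustration. Step 1 finds the highest link_data id per dashboard target, step 2 fetches those rows in one query, and step 3 attaches the result to each target under a separate name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link_target (id INTEGER PRIMARY KEY, dashboard INTEGER);
    CREATE TABLE link_data (
        id INTEGER PRIMARY KEY,
        link_target_id INTEGER REFERENCES link_target(id),
        interface_description TEXT
    );
    INSERT INTO link_target VALUES (1, 1), (2, 1), (3, 0);
    INSERT INTO link_data VALUES
        (10, 1, 'old a'), (11, 1, 'new a'),
        (12, 2, 'only b'),
        (13, 3, 'hidden c');
""")

# Step 1: one row per dashboard target with its highest link_data id
# (NULL when the target has no link_data at all).
targets = conn.execute("""
    SELECT t.id, MAX(d.id)
    FROM link_target AS t
    LEFT JOIN link_data AS d ON d.link_target_id = t.id
    WHERE t.dashboard = 1
    GROUP BY t.id
    ORDER BY t.id
""").fetchall()

# Step 2: fetch exactly those link_data rows in a single query.
latest_ids = [max_id for _, max_id in targets if max_id is not None]
marks = ", ".join("?" for _ in latest_ids)
rows = conn.execute(
    f"SELECT id, interface_description FROM link_data WHERE id IN ({marks})",
    latest_ids,
).fetchall()
by_id = dict(rows)

# Step 3: map each target to its newest record, kept under its own name
# instead of masking the full set of related rows.
latest_per_target = {t_id: by_id.get(max_id) for t_id, max_id in targets}
print(latest_per_target)
```

The total number of queries stays at two regardless of how many targets there are.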

Tricky, but it seems to work:
class ForeignKeyAsOneToOneField(models.OneToOneField):
    def __init__(self, to, on_delete, to_field=None, **kwargs):
        super().__init__(to, on_delete, to_field=to_field, **kwargs)
        self._unique = False
class LinkData(models.Model):
    # link_target = models.ForeignKey(LinkTargets, verbose_name="Link Target", on_delete=models.PROTECT)
    link_target = ForeignKeyAsOneToOneField(LinkTargets, verbose_name="Link Target", on_delete=models.PROTECT, related_name='linkdata_helper')
    interface_description = models.CharField(max_length=200, verbose_name='Interface Description', blank=True, null=True)
link_data = LinkTargets.objects.filter(dashboard=True) \
    .prefetch_related(
        Prefetch(
            'linkdata_helper',
            queryset=LinkData.objects.all().order_by('-id'),
            to_attr='linkdata'
        )
    )
# Now you can access linkdata:
link_data[0].linkdata
Of course, with this approach you can't use linkdata_helper to get the related objects.

This is not a direct answer to your question, but it solves the same problem. It is possible to annotate the newest object with a subquery, which I think is clearer. You also don't have to do things like Max("id") to limit the prefetch query.
It makes use of django.db.models.functions.JSONObject (added in Django 3.2) to combine multiple fields:
from django.db.models import OuterRef
from django.db.models.functions import JSONObject
MainModel.objects.annotate(
    last_object=RelatedModel.objects.filter(mainmodel=OuterRef("pk"))
    .order_by("-date_created")
    .values(
        data=JSONObject(
            id="id", body="body", date_created="date_created"
        )
    )[:1]
)

Django filtering based on count of related model

I have the following working code:
houses_of_agency = House.objects.filter(agency_id=90)
area_list = AreaHouse.objects.filter(house__in=houses_of_agency).values('area')
area_ids = Area.objects.filter(area_id__in=area_list).values_list('area_id', flat=True)
That returns a queryset with a list of area_ids. I want to filter further so that I only get area_ids where there are more than 100 houses belonging to the agency.
I tried the following adjustment:
houses_of_agency = House.objects.filter(agency_id=90)
area_list = AreaHouse.objects.filter(house__in=houses_of_agency).annotate(num_houses=Count('house_id')).filter(num_houses__gte=100).values('area')
area_ids = Area.objects.filter(area_id__in=area_list).values_list('area_id', flat=True)
But it returns an empty queryset.
My models (simplified) look like this:
class House(TimeStampedModel):
    house_pk = models.IntegerField()
    agency = models.ForeignKey(Agency, on_delete=models.CASCADE)
class AreaHouse(TimeStampedModel):
    area = models.ForeignKey(Area, on_delete=models.CASCADE)
    house = models.ForeignKey(House, on_delete=models.CASCADE)
class Area(TimeStampedModel):
    area_id = models.IntegerField(primary_key=True)
    parent = models.ForeignKey('self', null=True)
    name = models.CharField(null=True, max_length=30)
Edit: I'm using MySQL for the database backend.

You are querying for agency_id with just one underscore. I corrected your queries below. Also, in Django it's more common to use pk instead of id; the behaviour is the same. Further, there's no need for three separate queries, as you can combine everything into one.
Also note that your fields area_id and house_pk are unnecessary: Django automatically creates primary key fields, which are accessible via modelname__pk.
# Note how I inlined your first query in the .filter() call.
area_list = (
    AreaHouse.objects
    .filter(house__agency__pk=90)
    .annotate(num_houses=Count('house'))  # <- 'house'
    .filter(num_houses__gte=100)
    .values('area')
)
# Note the double underscore.
area_ids = Area.objects.filter(pk__in=area_list) \
    .values_list('pk', flat=True)
You could simplify this even further if you don't need the intermediate results. Here are both queries combined:
area_ids = AreaHouse.objects \
    .filter(house__agency__pk=90) \
    .annotate(num_houses=Count('house')) \
    .filter(num_houses__gte=100) \
    .values_list('area__pk', flat=True)
Finally, you seem to be manually defining a many-to-many relation in your model (through AreaHouse). There are better ways of doing this, please read the django docs.
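One thing worth checking when a count filter like this returns nothing is what the count is grouped by. Annotating AreaHouse directly counts per AreaHouse row, so each count is 1; in the ORM, calling .values('area') before .annotate() groups by area instead. The sketch below shows the intended grouping in plain SQL with the standard-library sqlite3 module. The table names, sample rows and the small MIN_HOUSES threshold (3 instead of 100) are assumptions chosen to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE house (id INTEGER PRIMARY KEY, agency_id INTEGER);
    CREATE TABLE area_house (
        id INTEGER PRIMARY KEY, area_id INTEGER, house_id INTEGER
    );
""")
# Agency 90 owns houses 1-5; agency 7 owns house 6.
conn.executemany(
    "INSERT INTO house VALUES (?, ?)",
    [(i, 90) for i in range(1, 6)] + [(6, 7)],
)
# Area 100 contains houses 1, 2, 3 and 6; area 200 contains houses 4 and 5.
conn.executemany(
    "INSERT INTO area_house (area_id, house_id) VALUES (?, ?)",
    [(100, 1), (100, 2), (100, 3), (100, 6), (200, 4), (200, 5)],
)

MIN_HOUSES = 3  # stand-in for the threshold of 100 in the question

# Group by area, count only the agency's houses, keep areas over the threshold.
area_ids = [row[0] for row in conn.execute("""
    SELECT ah.area_id
    FROM area_house AS ah
    JOIN house AS h ON h.id = ah.house_id
    WHERE h.agency_id = ?
    GROUP BY ah.area_id
    HAVING COUNT(ah.house_id) >= ?
    ORDER BY ah.area_id
""", (90, MIN_HOUSES))]
print(area_ids)
```

Area 100 qualifies with three houses of agency 90, while area 200 has only two and is dropped.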

Django queries with complex filter

I have the following model:
...
from django.contrib.auth.models import User
class TaxonomyNode(models.Model):
    node_id = models.CharField(max_length=20)
    name = models.CharField(max_length=100)
    ...
class Annotation(models.Model):
    ...
    taxonomy_node = models.ForeignKey(TaxonomyNode, blank=True, null=True)
class Vote(models.Model):
    created_by = models.ForeignKey(User, related_name='votes', null=True, on_delete=models.SET_NULL)
    vote = models.FloatField()
    annotation = models.ForeignKey(Annotation, related_name='votes')
    ...
In the app, a User can produce a Vote for an Annotation instance.
A User can vote only once for an Annotation instance.
I want to get a query set with the TaxonomyNode objects for which a User can still vote on at least one Annotation. For now, I do it this way:
def user_can_annotate(node_id, user):
    if Annotation.objects.filter(node_id=node_id).exclude(votes__created_by=user).count() == 0:
        return False
    else:
        return True
def get_categories_to_validate(user):
    """
    Returns a query set with the TaxonomyNode which have Annotation that can be validated by a user
    """
    nodes = TaxonomyNode.objects.all()
    nodes_to_keep = [node.node_id for node in nodes if user_can_annotate(node.node_id, user)]
    return nodes.filter(node_id__in=nodes_to_keep)
categories_to_validate = get_categories_to_validate(<user instance>)
I guess there is a way to do it in one query, that would speed up the process quite a lot. In brief, I want to exclude from the TaxonomyNode set, all the nodes that have all their annotations already voted once by the user.
Any idea of how I could do it? With django ORM or in SQL?
I have Django version 1.10.6

Try to use this:
# SQL query
unvoted_annotations = Annotation.objects.exclude(votes__created_by=user).select_related('taxonomy_node')
# remove duplicates
taxonomy_nodes = []
for annotation in unvoted_annotations:
    if annotation.taxonomy_node not in taxonomy_nodes:
        taxonomy_nodes.append(annotation.taxonomy_node)
There would be only one SQL query as select_related will return the related taxonomy_node in a single query. Also there might be a better way to remove duplicates, eg: by using .distinct().

What I have done so far:
taxonomy_node_pk = [a[0] for a in Annotation.objects.exclude(votes__created_by=user)
                    .select_related('taxonomy_node').values_list('taxonomy_node').distinct()]
nodes = TaxonomyNode.objects.filter(pk__in=taxonomy_node_pk)
I am doing two queries, but the second one is not very costly.
It is quite a bit faster than my original version.
Still, what I do is not really beautiful. Is there no way to get a query set of TaxonomyNode from the Annotation set directly, and then apply distinct() to it?
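Since the question also allows an SQL answer, here is a minimal single-query sketch using the standard-library sqlite3 module. The table names, columns, sample rows and the helper nodes_to_validate are assumptions for illustration (Django's real table names carry the app label). It keeps a node when at least one of its annotations has no vote by the given user. In the ORM, passing the annotation queryset to an __in filter, along the lines of TaxonomyNode.objects.filter(pk__in=<annotation queryset>.values('taxonomy_node')), should similarly push the work into one query with a subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxonomy_node (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE annotation (id INTEGER PRIMARY KEY, taxonomy_node_id INTEGER);
    CREATE TABLE vote (
        id INTEGER PRIMARY KEY, annotation_id INTEGER, created_by_id INTEGER
    );

    INSERT INTO taxonomy_node VALUES (1, 'birds'), (2, 'cars'), (3, 'rain');
    -- birds has annotations 1 and 2, cars has annotation 3, rain has none
    INSERT INTO annotation VALUES (1, 1), (2, 1), (3, 2);
    -- user 7 voted on annotations 1 and 3, user 8 voted on annotation 2
    INSERT INTO vote (annotation_id, created_by_id) VALUES (1, 7), (3, 7), (2, 8);
""")


def nodes_to_validate(user_id):
    """Ids of nodes with at least one annotation the user has not voted on."""
    rows = conn.execute("""
        SELECT n.id
        FROM taxonomy_node AS n
        WHERE EXISTS (
            SELECT 1 FROM annotation AS a
            WHERE a.taxonomy_node_id = n.id
              AND NOT EXISTS (
                  SELECT 1 FROM vote AS v
                  WHERE v.annotation_id = a.id AND v.created_by_id = ?
              )
        )
        ORDER BY n.id
    """, (user_id,))
    return [row[0] for row in rows]


print(nodes_to_validate(7))
print(nodes_to_validate(8))
```

Because the outer query selects from taxonomy_node, each node appears at most once and no distinct() step is needed.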

How to sort a Django QuerySet by (field, custom function, field)

I am trying to get a QuerySet that is sorted by field1, function, field2.
The model:
class Task(models.Model):
    issue_id = models.CharField(max_length=20, unique=True)
    title = models.CharField(max_length=100)
    priority_id = models.IntegerField(blank=True, null=True)
    created_date = models.DateTimeField(auto_now_add=True)
    def due_date(self):
        ...
        return ageing
I'm looking for something like:
taskList = Task.objects.all().order_by('priority_id', ***duedate***, 'title')
Obviously, you can't sort a queryset by a custom function. Any advice?

Since the actual sorting happens in the database, which does not speak Python, you cannot use a Python function for ordering. You will need to implement your due date logic in an SQL expression, as a QuerySet.extra(select={...}) calculated field, something along the lines of:
due_date_expr = '(implementation of your logic in SQL)'
taskList = Task.objects.all().extra(select={'due_date': due_date_expr}).order_by('priority_id', 'due_date', 'title')
If your logic is too complicated, you might need to implement it as a stored procedure in your database.
Alternatively, if your data set is very small (say, tens to a few hundred records), you can fetch the entire result set into a list and sort it afterwards:
taskList = list(Task.objects.all())
taskList.sort(key=key_function)  # on Python 2, .sort(cmp=comparison_function) also works

The answer by @lanzz, although it seems correct, didn't work for me, but this answer from another thread did the trick:
https://stackoverflow.com/a/37648265/6420686
from django.db.models import Case, When
ids = [5, 2, 9]  # placeholder: your list of ids in the desired order
preserved = Case(*[When(id=pk, then=pos) for pos, pk in enumerate(ids)])
filtered_users = User.objects \
    .filter(id__in=ids) \
    .order_by(preserved)
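What the Case/When expression does is ask the database to order rows by each id's position in a Python list, so any ordering computed in Python can be carried over to the query. A minimal sketch of the same idea in plain SQL, using the standard-library sqlite3 module; the users table and its rows are assumptions for illustration, and this is not the exact SQL Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(1, "ann"), (2, "bob"), (3, "cy"), (4, "dee")],
)

ids = [3, 1, 4]  # desired order, e.g. computed by a Python function

# One "WHEN <id> THEN <position>" branch per id, like Case(*[When(...)]).
whens = " ".join("WHEN ? THEN ?" for _ in ids)
placeholders = ", ".join("?" for _ in ids)
sql = (
    f"SELECT id, name FROM users WHERE id IN ({placeholders}) "
    f"ORDER BY CASE id {whens} END"
)

# Parameters follow the order of the placeholders in the SQL text:
# first the IN list, then (id, position) pairs for the CASE branches.
params = list(ids)
for pos, pk in enumerate(ids):
    params.extend([pk, pos])

rows = conn.execute(sql, params).fetchall()
print(rows)
```

The rows come back in the order given by ids rather than in primary key order.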

You can use sort in Python if the queryset is not too large:
ordered = sorted(Task.objects.all(), key=lambda o: (o.priority_id, o.due_date(), o.title))
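A self-contained sketch of this tuple-key sort, runnable without Django. The Task dataclass is a stand-in for the model, and the 7-day due date rule is a hypothetical placeholder for the elided due_date logic. Because priority_id is nullable, comparing None with an int would raise TypeError on Python 3, so the key puts a boolean first to push tasks without a priority to the end.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Task:
    """Stand-in for the Task model; only the fields used for sorting."""
    title: str
    priority_id: Optional[int]
    created: date

    def due_date(self) -> date:
        # Hypothetical rule: a task is due 7 days after it was created.
        return self.created + timedelta(days=7)


def sort_key(task: Task):
    # Tuples compare element by element: priority, then due date, then title.
    # The leading boolean sorts tasks without a priority after all others.
    return (
        task.priority_id is None,
        task.priority_id or 0,
        task.due_date(),
        task.title,
    )


tasks = [
    Task("write docs", 2, date(2024, 1, 5)),
    Task("fix login", 1, date(2024, 1, 9)),
    Task("triage inbox", None, date(2024, 1, 1)),
    Task("add tests", 1, date(2024, 1, 3)),
]
ordered = sorted(tasks, key=sort_key)
titles = [t.title for t in ordered]
print(titles)
```

The result is a plain list, so any further queryset methods such as filter() must be applied before sorting.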
