A short intoduction to the problem...
PostgreSQL has very neat array fields (int array, string array) and functions for them like UNNEST and ANY.
These fields are supported by Django (I am using djorm_pgarray for that), but functions are not natively supported.
One could use .extra(), but Django 1.8 introduced a new concept of database functions.
Let me provide a most primitive example of what I am basicly doing with all these. A Dealer has a list of makes that it supports. A Vehicle has a make and is linked to a dealer. But it happens that Vehicle's make does not match Dealer's make list, that is inevitable.
MAKE_CHOICES = [('honda', 'Honda'), ...]
class Dealer(models.Model):
make_list = TextArrayField(choices=MAKE_CHOICES)
class Vehicle(models.Model):
dealer = models.ForeignKey(Dealer, null=True, blank=True)
make = models.CharField(max_length=255, choices=MAKE_CHOICES, blank=True)
Having a database of dealers and makes, I want to count all vehicles for which the vehicle's make and its dealer's make list do match. That's how I do it avoiding .extra().
from django.db.models import functions
class SelectUnnest(functions.Func):
function = 'SELECT UNNEST'
...
Vehicle.objects.filter(
make__in=SelectUnnest('dealer__make_list')
).count()
Resulting SQL:
SELECT COUNT(*) AS "__count" FROM "myapp_vehicle"
INNER JOIN "myapp_dealer"
ON ( "myapp_vehicle"."dealer_id" = "myapp_dealer"."id" )
WHERE "myapp_vehicle"."make"
IN (SELECT UNNEST("myapp_dealer"."make_list"))
And it works, and much faster than a traditional M2M approach we could use in Django. BUT, for this task, UNNEST is not a very good solution: ANY is much faster. Let's try it.
class Any(functions.Func):
function = 'ANY'
...
Vehicle.objects.filter(
make=Any('dealer__make_list')
).count()
It generates the following SQL:
SELECT COUNT(*) AS "__count" FROM "myapp_vehicle"
INNER JOIN "myapp_dealer"
ON ( "myapp_vehicle"."dealer_id" = "myapp_dealer"."id" )
WHERE "myapp_vehicle"."make" =
(ANY("myapp_dealer"."make_list"))
And it fails, because braces around ANY are bogus. If you remove them, it runs in the psql console with no problems, and fast.
So my question.
Is there any way to remove these braces? I could not find anything about that in Django documentation.
If not, - maybe there are other ways to rephrase this query?
P. S. I think that an extensive library of database functions for different backends would be very helpful for database-heavy Django apps.
Of course, most of these will not be portable. But you typically do not often migrate such a project from one database backend to another. In our example, using array fields and PostGIS we are stuck to PostgreSQL and do not intend to move.
Is anybody developing such a thing?
P. P. S. One might say that, in this case, we should be using a separate table for makes and intarray instead of string array, that is correct and will be done, but nature of problem does not change.
UPDATE.
TextArrayField is defined at djorm_pgarray. At the linked source file, you can see how it works.
The value is list of text strings. In Python, it is represented as a list. Example: ['honda', 'mazda', 'anything else'].
Here is what is said about it in the database.
=# select id, make from appname_tablename limit 3;
id | make
---+----------------------
58 | {vw}
76 | {lexus,scion,toyota}
39 | {chevrolet}
And underlying PostgreSQL field type is text[].
I've managed to get (more or less) what you need using following:
from django.db.models.lookups import BuiltinLookup
from django.db.models.fields import Field
class Any(BuiltinLookup):
lookup_name = 'any'
def get_rhs_op(self, connection, rhs):
return " = ANY(%s)" % (rhs,)
Field.register_lookup(Any)
and query:
Vehicle.objects.filter(make__any=F('dealer__make_list')).count()
as result:
SELECT COUNT(*) AS "__count" FROM "zz_vehicle"
INNER JOIN "zz_dealer" ON ("zz_vehicle"."dealer_id" = "zz_dealer"."id")
WHERE "zz_vehicle"."make" = ANY(("zz_dealer"."make_list"))
btw. instead djorm_pgarray and TextArrayField you can use native django:
make_list = ArrayField(models.CharField(max_length=200), blank=True)
(to simplify your dependencies)
Related
I have these two models, Cases and Specialties, just like this:
class Case(models.Model):
...
judge = models.CharField()
....
class Specialty(models.Model):
name = models.CharField()
sys_num = models.IntegerField()
I know this sounds like a really weird structure but try to bare with me:
The field judge in the Case model refer to a Specialty instance sys_num value (judge is a charfield but it will always carries an integer) (each Specialty instance has a unique sys_num). So I can get the Specialty name related to a specific Case instance using something like this:
my_pk = #some number here...
my_case_judge = Case.objects.get(pk=my_pk).judge
my_specialty_name = Specialty.objects.get(sys_num=my_case_judge)
I know this sounds really weird but I can't change the underlying schemma of the tables, just work around it with sql and Django's orm.
My problem is: I want to annotate the Specialty names in a queryset of Cases that have already called values().
I only managed to get it working using Case and When but it's not dynamic. If I add more Specialty instances I'll have to manually alter the code.
cases.annotate(
specialty=Case(
When(judge=0, then=Value('name 0 goes here')),
When(judge=1, then=Value('name 1 goes here')),
When(judge=2, then=Value('name 2 goes here')),
When(judge=3, then=Value('name 3 goes here')),
...
Can this be done dynamically? I look trough django's query reference docs but couldn't produce a working solution with the tools specified there.
You can do this with a subquery expression:
from django.db.models import OuterRef, Subquery
Case.objects.annotate(
specialty=Subquery(
Specialty.objects.filter(sys_num=OuterRef('judge')).values('name')[:1]
)
)
For some databases, casting might even be necessary:
from django.db.models import IntegerField, OuterRef, Subquery
from django.db.models.functions import Cast
Case.objects.annotate(
specialty=Subquery(
Specialty.objects.filter(sys_num=Cast(
OuterRef('judge'),
output_field=IntegerField()
)).values('name')[:1]
)
)
But the modeling is very bad. Usually it is better to work with a ForeignKey, this will guarantee that the judge can only point to a valid case (so referential integrity), will create indexes on the fields, and it will also make the Django ORM more effective since it allows more advanced querying with (relativily) small queries.
Here is my model:
class Item(models.Model):
status = models.IntegerField(choices=STATUS_CHOICES, default=3)
def __str__(self):
return 'Item: {0}'.format(self.id)
class Name(models.Model):
name = models.CharField(, max_length=600, default='')
item = models.ForeignKey(Item, db_index=True, blank=True, null=True)
main = models.BooleanField(default=False)
def __str__(self):
return '{}'.format(self.name)
I would like to query set Items, so it returns X items sorted by Fuzzy wuzzy.
Basically, I need to find matching Items and merge them.
I tried to create a dictionary, but it is extremely slow. I have about 80 000 items and it's still counting.
I tried to something like:
items = Item.objects.filter(status=3)
.annotate( score=fuzz.ratio(query,i.name_set.all().first().name))
.order_by('-score')
Can anyone give me some light on the topic?
thanks
If there are 80.000 entries in the database you have to consider the options:
(A) let the database sort (preferrably using some pre-created index) and return only the chosen rows. This allows for paging via the database.
(B) return everything from the DB as fast as possible and sort all 80.000 in RAM. If you want to stick to the python module FuzzyWuzzy you would have to do that. But as you have experienced by now, this might not be fast. You'd have to do paging on your own.
[FuzzyWuzzy] uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.
If you are using PostgreSQL as backend, you can use the levenshtein function as described here:
https://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
There seem to be contributions to Django to integrate this:
https://github.com/django/django/pull/4825
TrigramSimilarity is already available. You could have a look at the source code and implement something similar based on the postgres levenshtein. But I'd recommend to first try it out. It might already suit your needs.
EDIT:
In general for tables of this size and more, make sure that your DB has indexes where required and makes use of them. For example: Django's __icontains filter is not covered by Django's db_index. You have to add a trigram index on that column yourself.
It could be that your current code takes this long because the query already takes unnecessarily long.
if you are using PostgreSQL you can use Trigram Similar
trigram_similar lookup allows you to perform trigram lookups, measuring the number of trigrams (three consecutive characters) shared, using a dedicated PostgreSQL extension.
Installation:
update your settings.py::
INSTALLED_APPS = [
...
'django.contrib.postgres',
]
add a new migration::
python manage.py makemigrations --empty yourappname
a new migration file will be created eg: migrations/0002_auto_<date>_<time>.py. update it::
from django.db import migrations
from django.contrib.postgres.operations import TrigramExtension
class Migration(migrations.Migration):
dependencies = [
...
]
operations = [
TrigramExtension(),
]
now migrate:
./manage.py migrate
now you can use trigram_similar lookup on CharField and TextField eg:
>>> City.objects.filter(name__trigram_similar="Middlesborough")
['<City: Middlesbrough>']
I would like to save array of enums.
I have the following:
CREATE TABLE public.campaign
(
id integer NOT NULL,
product product[]
)
product is an enum.
In Django I defined it like this:
PRODUCT = (
('car', 'car'),
('truck', 'truck')
)
class Campaign(models.Model):
product = ArrayField(models.CharField(null=True, choices=PRODUCT))
However, when I write the following:
campaign = Campaign(id=5, product=["car", "truck"])
campaign.save()
I get the following error:
ProgrammingError: column "product" is of type product[] but expression is of type text[]
LINE 1: ..."product" = ARRAY['car...
Note
I saw this answer, but I don't use sqlalchemy and would rather not use it if not needed.
EDITED
I tried #Roman Konoval suggestion below like this:
class PRODUCT(Enum):
CAR = 'car'
TRUCK = 'truck'
class Campaign(models.Model):
product = ArrayField(EnumField(PRODUCT, max_length=10))
and with:
campaign = Campaign(id=5, product=[CAR, TRUCK])
campaign.save()
However, I still get the same error,
I see that django is translating it to list of strings.
if I write the following directly the the psql console:
INSERT INTO campaign ("product") VALUES ('{car,truck}'::product[])
it works just fine
There are two fundamental problems here.
Don't use Enums
If you continue to use enum, your next question here on Stackoverflow will be "how do I add a new entry to an enum?". Django does not support enum type out of the box (thank heavens). So you have to use third party libraries for this. Your mileage will vary with how complete the library is.
An enum value occupies four bytes on disk. The length of an enum
value's textual label is limited by the NAMEDATALEN setting compiled
into PostgreSQL; in standard builds this means at most 63 bytes.
If you are thinking that you are saving space on disk by using enum, the above quote from the manual shows that it's an illusion.
See this Q&A for more on advantages and disadvantages of enum. But generally the disadvantages outweigh the advantages.
Don't use Arrays
Tip: Arrays are not sets; searching for specific array elements can be
a sign of database misdesign. Consider using a separate table with a
row for each item that would be an array element. This will be easier
to search, and is likely to scale better for a large number of
elements.
Source: https://www.postgresql.org/docs/9.6/static/arrays.html
If you are going to search for a campaign that deals with Cars or Trucks you are going to have to do a lot of hard work. So will the database.
The correct design
The correct design is the one suggested in the postgresql arrays documentation page. Create a related table. This is the standard django way as well.
class Campaign(models.Model):
name = models.CharField(max_length=20)
class Product(Models.model):
name = models.CharField(max_length=20)
campaign = models.ForeignKey(Campaign)
This makes your code simpler. Doesn't require any extra storage. Doesn't require third party libraries. And best of all the vast api of the django related models becomes available to you.
The definition of product field is incorrect as it specifies that it is array of CharFields but it is array of enums in reality. Django does not support enum type now so you can try this extension to define the type correctly:
class Product(Enum):
ProductA = 'a'
...
class Campaign(models.Model):
product = ArrayField(EnumField(Product, max_length=<whatever>))
Try this:
def django2psql(s):
return '{'+','.join(s) + '}'
campaign = Campaign(id=5, product=django2psql(["car", "truck"]))
I think you may have to subclass CharField to get it to report the correct db_type. There may be more problems than this but you can give this a try:
class Product(models.CharField):
def db_type(self, connection):
return 'product'
PRODUCT = (
('car', 'car'),
('truck', 'truck')
)
class Campaign(models.Model):
product = ArrayField(Product(null=True, choices=PRODUCT))
Assuming that the file models.py in my django application (webapp) is like the following :
from django.db import models
from django.db import connection
class Foo(models.Model):
name = models.CharField(...)
surname = models.CharField(...)
def dictfetchall(cursor):
"Returns all rows from a cursor as a dict"
desc = cursor.description
return [
dict(zip([col[0] for col in desc], row))
for row in cursor.fetchall()
]
def get_foo():
cursor = connection.cursor()
cursor.execute('SELECT * FROM foo_table')
rows = dictfetchall(cursor)
return rows
To get access to my database content, I have basicly two options :
Option 1 :
from webapp.models import Foo
bar = Foo.objects.raw('SELECT * FROM foo_table')
Option 2 :
from application.models import get_foo
bar = get_foo()
Which option is the fastest in execution ?
Is there a better way to do what I want to do ?
There is no direct and clear answer on which approach is better.
Using Manager.raw() still keeps you within the ORM layer and while it returns Model instances you still have a nice database abstraction. But, while making a raw query, django does more than just cursor.execute in order to translate the results into Model instances (see what is happening in RawQuerySet and RawQuery classes).
But (quote from docs):
Sometimes even Manager.raw() isn’t quite enough: you might need to
perform queries that don’t map cleanly to models, or directly execute
UPDATE, INSERT, or DELETE queries.
So, generally speaking, what to choose depends on what results are going to get and what you are going to do with them.
See also:
Performing raw SQL queries
executing-custom-sql-directly
Raw sql queries in Django views
Using the connection cursor is for sure the faster than using raw() as it doesn't instantiate additionals objects... But for really telling what the fastest solution is you should do some benchmarking!
And don't overdo optimizations if not necessary because you are avoiding some of Django's most useful features this way as long as you don't have any serious performance problems. And if you have some they will most likely not be the result of how you execute the query. Of course you will be able to write better queries if you exactly know your use case and the ORM doesn't.
I'm trying to implement a simple triplestore using Django's ORM. I'd like to be able to search for arbitrarily complex triple patterns (e.g. as you would with SparQL).
To do this, I'm attempting to use the .extra() method. However, even though the docs mention it can, in theory, handle duplicate references to the same table by automatically creating an alias for the duplicate table references, I've found it does not do this in practice.
For example, say I have the following model in my "triple" app:
class Triple(models.Model):
subject = models.CharField(max_length=100)
predicate = models.CharField(max_length=100)
object = models.CharField(max_length=100)
and I have the following triples stored in my database:
subject predicate object
bob has-a hat .
bob knows sue .
sue has-a house .
bob knows tom .
Now, say I want to query the names of everyone bob knows who has a house. In SQL, I'd simply do:
SELECT t2.subject AS name
FROM triple_triple t1
INNER JOIN triple_triple t2 ON
t1.subject = 'bob'
AND t1.predicate = 'knows'
AND t1.object = t2.subject
AND t2.predicate = 'has-a'
AND t2.object = 'house'
I'm not completely sure what this would look like with Django's ORM, although I think it would be along the lines of:
q = Triple.objects.filter(subject='bob', predicate='knows')
q = q.extra(tables=['triple_triple'], where=["triple_triple.object=t1.subject AND t1.predicate = 'has-a' AND t1.object = 'house'"])
q.values('t1.subject')
Unfortunately, this fails with the error "DatabaseError: no such column: t1.subject"
Running print q.query shows:
SELECT "triple_triple"."subject" FROM "triple_triple" WHERE ("triple_triple"."subject" = 'bob' AND "triple_triple"."predicate" = 'knows'
AND triple_triple.object = t1.subject AND t1.predicate = 'has-a' AND t1.object = 'house')
which appears to show that the tables param in my call to .extra() is being ignored, as there's no second reference to triple_triple inserted anywhere.
Why is this happening? What's the appropriate way to refer to complex relationships between records in the same table using Django's ORM?
EDIT: I found this useful snippet for including custom SQL via .extra() so that it's usable inside a model manager.
I think what you're missing is the select parameter (for the extra method)
This seems to work:
qs = Triple.objects.filter(subject="bob", predicate="knows").extra(
select={'known': "t1.subject"},
tables=['"triple_triple" AS "t1"'],
where=['''triple_triple.object=t1.subject
AND t1.predicate="has-a" AND t1.object="'''])
qs.values("known")
I've had the same issue where Django escapes (adds back-ticks) to my table names, meaning that I can't add an alias manually; the resulting FROM clause looks like this:
"mytable" AS T100
But at the same time, Django won't automatically create aliases for you if the table is already mentioned; instead it ignores the tables and just adds on the WHERE clauses as if they refer to the original tables.
The documentation for Django 1.8 suggests that .extra() will create aliases for you:
https://docs.djangoproject.com/en/1.8/ref/models/querysets/#django.db.models.query.QuerySet.extra
But this doesn't appear to be the case for my query, possibly because original table is part of a LEFT OUTER JOIN rather than a simple FROM x,y,z clause.