I'm trying to extract information from a number of denormalized tables, using Django models. The tables are pre-existing, part of a legacy MySQL database.
Schema description
Let's say that each table describes traits about a person, and each person has a name (this essentially identifies the person, but does not correspond to some unifying "Person" table). For example:
class JobInfo(models.Model):
    name = models.CharField(primary_key=True, db_column='name', max_length=...)
    startdate = models.DateField(db_column='startdate')
    ...

class Hobbies(models.Model):
    name = models.CharField(primary_key=True, db_column='name', max_length=...)
    exercise = models.CharField(db_column='exercise', max_length=...)
    ...

class Clothing(models.Model):
    name = models.CharField(primary_key=True, db_column='name', max_length=...)
    shoes = models.CharField(db_column='shoes', max_length=...)
    ...

# Twenty more classes exist, all of the same format
Accessing via SQL
In raw SQL, when I want to access information across all tables, I do a series of ugly OUTER JOINs, refining it with a WHERE clause.
SELECT JobInfo.startdate, JobInfo.employer, JobInfo.salary,
Hobbies.exercise, Hobbies.fun,
Clothing.shoes, Clothing.shirt, Clothing.pants
...
FROM JobInfo
LEFT OUTER JOIN Hobbies ON Hobbies.name = JobInfo.name
LEFT OUTER JOIN Clothing ON Clothing.name = JobInfo.name
...
WHERE
Clothing.shoes REGEXP "Nike" AND
Hobbies.exercise REGEXP "out"
...;
Model-based approach
I'm trying to convert this to a Django-based approach, where I can easily get a QuerySet that pulls in information from all tables.
I've looked into using a OneToOneField (example), making one table have a field tying it to each of the others. However, this would mean one table has to act as the "central" table, which all the others reference in reverse. That seems like a mess with twenty-odd fields, and it doesn't really make schematic sense (is "job info" the set of core properties? clothes?).
I feel like I'm going about this the wrong way. How should I be building a QuerySet on related tables, where each table has one primary key field common across all tables?
If your DB access allows this, I would probably do it by defining a Person model, then declaring the name DB column on each of the other tables to be a foreign key to that model, with to_field set to name on the Person model. Then you can use the usual __ syntax in your queries.
Assuming Django doesn't complain about a ForeignKey field with primary_key=True, anyway.
class Person(models.Model):
    name = models.CharField(primary_key=True, max_length=...)

class JobInfo(models.Model):
    person = models.ForeignKey(Person, primary_key=True, db_column='name',
                               to_field='name', on_delete=models.DO_NOTHING)
    startdate = models.DateField(db_column='startdate')
    ...
I don't think to_field is actually required as long as name is declared as the primary key on Person, but I think it's good for clarity, and it becomes necessary if you don't declare name as the PK on Person.
I haven't tested this, though.
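For the queries themselves, something like this should then work (also untested, and assuming Hobbies and Clothing are given the same person foreign key as JobInfo):

people = Person.objects.filter(
    clothing__shoes__regex='Nike',
    hobbies__exercise__regex='out',
)

# or start from one trait table and span the others through person
jobs = JobInfo.objects.filter(
    person__clothing__shoes__regex='Nike',
    person__hobbies__exercise__regex='out',
).select_related('person')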
To use a view, you have two options. I think both would do best with an actual table containing all the known user names, maybe with a numeric PK as Django usually expects as well. Let's assume that table exists - call it person.
One option is to create a single large view to encompass all information about a user, similar to the big join you use above - something like:
create or replace view person_info as
select person.id, person.name,
jobinfo.startdate, jobinfo.employer, jobinfo.salary,
hobbies.exercise, hobbies.fun,
clothing.shoes, ...
from person
left outer join hobbies on hobbies.name = person.name
left outer join jobinfo on jobinfo.name = person.name
left outer join clothing on clothing.name = person.name
;
That might take a little debugging, but the idea should be clear.
Then declare your model with db_table = person_info and managed = False in the Meta class.
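Something like this, perhaps (untested; PersonInfo and its field list are just a sketch of whatever columns the view exposes):

class PersonInfo(models.Model):
    name = models.CharField(max_length=..., db_column='name')
    startdate = models.DateField(null=True)
    exercise = models.CharField(max_length=..., null=True)
    shoes = models.CharField(max_length=..., null=True)
    # ... one field per view column; null=True because of the outer joins,
    # and person.id from the view maps onto Django's implicit id primary key

    class Meta:
        db_table = 'person_info'
        managed = False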
A second option would be to declare a view for each subsidiary table that includes the person_id value matching the name, then just use Django FKs.
create or replace view jobinfo_by_person as
select person.id as person_id, jobinfo.*
from person inner join jobinfo on jobinfo.name = person.name;
create or replace view hobbies_by_person as
select person.id as person_id, hobbies.*
from person inner join hobbies on hobbies.name = person.name;
etc. Again, I'm not totally sure the .* syntax will work - if not, you'd have to list all the fields you're interested in. And check what the column names from the subsidiary tables are.
Then point your models at the by_person versions and use the standard FK setup.
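For instance, JobInfo remapped onto its view might look something like this (untested; it assumes a Person model mapped onto the person table, and on_delete is a placeholder since you won't be writing through the view anyway):

class JobInfo(models.Model):
    person = models.ForeignKey(Person, db_column='person_id',
                               on_delete=models.DO_NOTHING)
    startdate = models.DateField(db_column='startdate')
    # ... the remaining jobinfo columns

    class Meta:
        db_table = 'jobinfo_by_person'
        managed = False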
This is a little inelegant and I make no claims for good performance, but it does let you avoid further denormalizing your database.
Related
We have a hand-written SQL query as a proof of concept, and we hope to implement the same functionality with the Django framework.
Specifically, Django's QuerySet usually implements a join query by matching the foreign key with the primary key of the referenced table. However, in the sample SQL below, we need additional matching conditions besides the foreign key, like the eav_postalcode.attribute_id = 122 in the example snippet below.
...
left outer join eav_value as eav_postalcode
on t.id = eav_postalcode.entity_id and eav_postalcode.attribute_id = 122
...
Questions:
We wonder if there is a way to do it with Python, the Django framework, or other libraries.
We also wonder whether other programming languages have any mature toolkits we could refer to as a design pattern, so we would highly appreciate any hints and suggestions.
Background and Technical Details:
The scenario is a report consisting of transactions with customized columns, implemented with Django-EAV. This library provides the eav_value table, which contains columns for different data types, e.g. value_text, value_date, value_float, etc.
We forked an internal repository of Django-EAV and upgraded it to Python 3, so we can use any up-to-date Python features, although we are not using Django-EAV2. As far as we know, the new version, EAV2, follows the same database schema design.
So, the application defines a product with attributes of specific data types, which we refer to as metadata in this question, e.g.:
attribute_id   slug         datatype
122            postalcode   text
123            phone        text
...            ...          ... (e.g. date, float, etc.)
One transaction is one entity, and the eav_value table contains multiple records with the matching entity_id, corresponding to the different customized attributes. We want to build a dynamic QuerySet according to the metadata that assembles the customized columns with left outer joins, similar to the sample SQL query below.
select
t.id, t.create_ts
, eav_postalcode.value_text as postalcode
, eav_phone.value_text as phone
from
(
select * from transactions
where product_id = __PRODUCT_ID__
) as t
left outer join eav_value as eav_postalcode
on t.id = eav_postalcode.entity_id and eav_postalcode.attribute_id = 122
left outer join eav_value as eav_phone
on t.id = eav_phone.entity_id and eav_phone.attribute_id = 123
;
We followed #NickODell's hint on FilteredRelation, and our tentative solution looks like the snippet below:
transaction_eav = transaction.annotate(
    eav_postalcode=FilteredRelation('eav_values', condition=Q(eav_values__attribute_id=122))
)
transaction_eav = transaction_eav.annotate(
    value_postalcode=F('eav_postalcode__value_text')
)
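For completeness, the full pattern we are aiming for would chain both relations, roughly like this (assuming eav_values is the related name from the transaction model to eav_value, and that the attribute ids come from the metadata table above):

# from django.db.models import F, FilteredRelation, Q
transaction_eav = transaction.annotate(
    eav_postalcode=FilteredRelation(
        'eav_values', condition=Q(eav_values__attribute_id=122)),
    eav_phone=FilteredRelation(
        'eav_values', condition=Q(eav_values__attribute_id=123)),
).annotate(
    value_postalcode=F('eav_postalcode__value_text'),
    value_phone=F('eav_phone__value_text'),
)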
We are new to the Django ORM, so please point out if the sample code above contains any inefficient or non-standard patterns.
Many thanks to all for the great suggestions!
I want to map a class to a table that is a join between two tables, selecting (mapping) all the columns from one table and only one column from the joined table.
join_table = join(table1, table2, table1.c.description == table2.c.description)
model_table_join = select([table1, table2.c.description]).select_from(join_table).alias()
Am I doing this right?
If all you want to do is pull in one extra column from a JOIN, I'd not muck about with an arbitrary select mapping. As the documentation points out:
The practice of mapping to arbitrary SELECT statements, especially complex ones as above, is almost never needed; it necessarily tends to produce complex queries which are often less efficient than that which would be produced by direct query construction. The practice is to some degree based on the very early history of SQLAlchemy where the mapper() construct was meant to represent the primary querying interface; in modern usage, the Query object can be used to construct virtually any SELECT statement, including complex composites, and should be favored over the “map-to-selectable” approach.
You'd just either select that extra column in your application:
session.query(Table1Model, Table2Model.description).join(Table2Model)
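which yields (Table1Model, description) tuples as you iterate, e.g. (a sketch, assuming a configured session):

query = session.query(Table1Model, Table2Model.description).join(Table2Model)
for obj, description in query:
    print(obj, description)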
or you can register a relationship on Table1Model together with an association proxy that always pulls in the extra column:
class Table1Model(Base):
    # ...

    _table2 = relationship('Table2Model', lazy='joined')
    description = association_proxy('_table2', 'description')
The association proxy exposes the Table2Model.description column of the joined row as you interact with it on Table1Model instances.
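On an instance it then behaves like a plain column (a sketch, assuming a configured session):

obj = session.query(Table1Model).first()
print(obj.description)  # read from the eagerly joined Table2Model row, no extra query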
That said, if you must stick with a join() query as the base, then you could just exclude the extra, duplicated columns from the join with an exclude_properties mapper argument:
join_table = join(table1, table2, table1.c.description == table2.c.description)

class JoinedTableModel(Base):
    __table__ = join_table
    __mapper_args__ = {
        'exclude_properties': [table1.c.description]
    }
The new model then uses all the columns from the join to create attributes with the same names, except for those listed in exclude_properties.
Or you can keep the duplicated column names in the model simply by giving one of them a new attribute name:
join_table = join(table1, table2, table1.c.description == table2.c.description)

class JoinedTableModel(Base):
    __table__ = join_table
    table1_description = table1.c.description
You can rename any column from the join this way, at which point it will no longer conflict with a column of the same name from the other table.
I know it's possible to query a model through a reverse related field with the Django ORM. But is it possible to also get all the fields of the reverse related model for which the query matched?
For example, if we have the following models:
class Location(models.Model):
    name = models.CharField(max_length=50)

class Availability(models.Model):
    location = models.ForeignKey(Location, on_delete=models.CASCADE)
    start_datetime = models.DateTimeField()
    end_datetime = models.DateTimeField()
    price = models.PositiveIntegerField()
would it be possible to find all Locations that are available in a specific timeframe AND also get the price of the Location during that availability? We are working under the assumption that Availability objects with the same location cannot have overlapping start/end datetimes.
If start_datetime and end_datetime are provided by the user, then we could possibly do something like the following:
Location.objects.filter(
    availability__start_datetime__lte=start_datetime,
    availability__end_datetime__gte=end_datetime)
But I'm not sure how to also get the price field for the specific availability that did result in a match for the query.
In raw SQL, the behavior I'm talking about might be achievable via something like this:
SELECT l.id, l.name, a.price
FROM Location l
INNER JOIN Availability a
ON a.location_id = l.id
WHERE /* availability is within user-inputted timeframe */
I've considered using something like prefetch_related('availability_set'), but that would just give me all the availabilities for the Location objects that matched the query. I just want the one availability that was within the timeframe that was queried, and more specifically, the price of that availability.
When you are using an ORM, in general you fetch results from one model class at a time. Since Location and Availability are separate models, you can simply do the following:
availabilities = Availability.objects.filter(
    start_datetime__lte=start_datetime,
    end_datetime__gte=end_datetime)

for availability in availabilities:
    print(availability.location.id, availability.location.name, availability.price)
This is an easy-to-read implementation.
Now, accessing Location from an Availability object (in availability.location) requires a second SQL query. You can optimise this using select_related:
This is a performance booster which results in a single more complex query but means later use of foreign-key relationships won’t require database queries.
Simply append it to your original query, i.e.:
availabilities = Availability.objects.select_related('location').filter(...
This will create an SQL join statement in the background and the Location objects will not require an extra query.
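Putting it together (same variable names as above):

availabilities = Availability.objects.select_related('location').filter(
    start_datetime__lte=start_datetime,
    end_datetime__gte=end_datetime)

for availability in availabilities:
    # a single query with a JOIN; no per-row lookups for location
    print(availability.location.id, availability.location.name, availability.price)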
We have a limitation for order_by/distinct fields.
From the docs: "fields in order_by() must start with the fields in distinct(), in the same order"
Now here is the use case:
class Course(models.Model):
    is_vip = models.BooleanField()
    ...

class CourseEvent(models.Model):
    date = models.DateTimeField()
    course = models.ForeignKey(Course, on_delete=models.CASCADE)
The goal is to fetch the courses ordered by nearest date, but with VIP courses first.
The solution could look like this:
CourseEvent.objects.order_by('-course__is_vip', '-date',).distinct('course_id',).values_list('course')
But it causes an error because of the limitation above.
Yeah, I understand why ordering is necessary when using distinct: we get the first row for each value of course_id, so if we don't specify an order we would get some arbitrary row.
But what's the purpose of limiting order to the same field that we have distinct on?
If I change order_by to something like ('course_id', '-course__is_vip', 'date',), it would give me one row per course, but the order of the courses will have nothing in common with the goal.
Is there any way to bypass this limitation besides walking through the entire queryset and filtering it in a loop?
You can use a nested query using id__in. In the inner query you single out the distinct events and in the outer query you custom-order them:
CourseEvent.objects.filter(
    id__in=CourseEvent.objects
        .order_by('course_id', '-date')
        .distinct('course_id')
).order_by('-course__is_vip', '-date')
From the docs on distinct(*fields):
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I'm using Django (I'm new to it). I want to define a foreign key, and I'm not sure how to go about it.
I have a table called stat_types:
class StatTypes(models.Model):
    stat_type = models.CharField(max_length=20)
Now I want to define a foreign key in the overall_stats table to the StatTypes id that is automatically generated by Django. Would that be the following?
stat_types_id = models.ForeignKey('beta.StatTypes')
What if I wanted instead to have the stat_type column of the stat_types table be the foreign key? Would that be:
stat_type = models.ForeignKey('beta.StatTypes')
I guess my confusion arises in not knowing what to name the column in the second model, in order for it to know which column of the first model to use as the foreign key.
Thanks!
It does not matter what name you give the foreign key field. Django figures out that it is a ForeignKey and appends _id to get the database column name, so you do not need the _id suffix here. I think this is good enough:
stat_type = models.ForeignKey('beta.StatTypes')
The docs say:
It’s suggested, but not required, that the name of a ForeignKey field
(manufacturer in the example above) be the name of the model,
lowercase. You can, of course, call the field whatever you want.
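To make it concrete, a minimal sketch (the OverallStats model name and the on_delete choice are assumptions; on_delete is required on newer Django versions):

class OverallStats(models.Model):
    stat_type = models.ForeignKey('beta.StatTypes', on_delete=models.CASCADE)

# the underlying database column is stat_type_id, but in Python you work with objects:
stats = OverallStats.objects.select_related('stat_type').first()
print(stats.stat_type.stat_type)   # the CharField on StatTypes
print(stats.stat_type_id)          # the raw key value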