(A Django rookie here)
For a "project" model I want to store some data. The data is about housing property. So, for example: Number of living spaces, on which floor those spaces are, and how big those spaces are.
Living-space 1 - Groundfloor - 50m2
Living-space 2 - First floor - 82m2
etc.
Because not every project object has the same number of living spaces, and some project objects also have a row for something like shop space or restaurant space, I was wondering about a good approach to storing this data.
Basically I want to store a dynamically sized table in a model. Below is a pretty good example:
Restaurant - Groundfloor - 147m2
Livingspace 1 - First floor - 55m2
Livingspace 2 - First floor - 110m2
Livingspace 3 - Second floor - 55m2
Livingspace 4 - Second floor - 110m2
Livingspace 5 - Third floor - 147m2
Now some projects will have only 2, maybe 3, living spaces and no restaurants etc. Others will have maybe up to 10 living spaces. I was thinking about creating 10 row fields, so I can put in comma-separated values (or maybe a JSONField). Something like:
row_01 = models.CharField(max_length=100)
row_02 = models.CharField(max_length=100)
row_03 = models.CharField(max_length=100)
row_etc = models.CharField(max_length=100)
"Livingspace 1","First floor","55"
"Livingspace 2","Second floor","100"
etc
Would this be a correct approach for putting this table in the database? How about a JSONField?
Also, in my model I have a field in which the number of housing spaces has to be entered by the user. So I was wondering whether it's possible to dynamically create rows based on other fields in the model, so that if the user enters 4 for the number of houses in the Django administration, the user only sees 4 rows there.
No, that is not good practice.
From what I understand, the data can be stored using two tables. In Django, each model corresponds to a table, with fields as columns. So you just have to make a House model and another model, Room, with a foreign key relationship to the House model.
A simple example:
class House(models.Model):
    name = models.CharField(max_length=100)  # max_length values here are illustrative

class Room(models.Model):
    house = models.ForeignKey(House, on_delete=models.CASCADE)
    room_type = models.CharField(max_length=100)
    area = models.CharField(max_length=20)
    floor_no = models.IntegerField()
Here, each House instance represents a house, and each Room instance represents one of the rows you described earlier. Designing the models like this makes filtering and querying very easy. Room instances are linked to the House model through a foreign key relationship, which lets you create any number of rows, according to the required specifications.
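For example, a quick sketch of how creating and querying might look (the data values are just illustrations):

house = House.objects.create(name="Project X")
Room.objects.create(house=house, room_type="Livingspace 1", area="55", floor_no=1)
Room.objects.create(house=house, room_type="Restaurant", area="147", floor_no=0)

# All rooms of this house, lowest floor first:
rooms = house.room_set.order_by('floor_no')

# All houses that have a ground-floor restaurant:
houses = House.objects.filter(room__room_type="Restaurant", room__floor_no=0)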
For more reference, try the docs
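As for the admin part of your question: rather than generating a variable number of fields, the usual approach is an inline, which shows one editable Room row per related object and lets the user add as many as needed. A minimal sketch, assuming the models above:

from django.contrib import admin

class RoomInline(admin.TabularInline):
    model = Room
    extra = 1  # number of empty extra rows shown

class HouseAdmin(admin.ModelAdmin):
    inlines = [RoomInline]

admin.site.register(House, HouseAdmin)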
The kind of db schema depends on how you are going to use that data.
Fields (other than the primary key) that you want to aggregate (sum, count, etc.) or query on should be placed directly as columns, while fields whose shape varies dynamically from row to row can go into a JSON field.
Considering your use case, a JSONField can be a good approach, because you probably won't need to query on each and every piece of that data.
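A rough sketch of the JSON-field approach (assuming a Django version that provides models.JSONField):

from django.db import models

class Project(models.Model):
    name = models.CharField(max_length=100)
    # Each entry is one row of the dynamically sized table, e.g.
    # {"space": "Livingspace 1", "floor": "First floor", "area_m2": 55}
    spaces = models.JSONField(default=list)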
My goal is to create an e-commerce website where customers can see related products on any product page (similar to amazon.com).
I have no idea how to get started with such a daunting task. From my research, my guess is to do the following:
Create a Category kind:
class Category(ndb.Model):
    name = ndb.StringProperty()
Whenever a product is created, associate it with a Category via an ancestral relationship:
parent_category = ndb.Key("Category", "Books")
new_product = Product(
    title="Coding Horrors Book",
    parent=parent_category).put()
Now, on each product page, I can create a query to return a list of books as related products.
I have some concerns with this approach:
Firstly, this doesn't feel like a solid approach.
How do I specify the hierarchical relationship between product categories? For example, if we have two product categories, "AngularJS", "VueJS", how do we specify that these two categories are somehow related?
First, to clarify: entity ancestry is not mandatory for establishing relationships (and it has some disadvantages); see Can you help me understand the ndb Key Class Documentation or rather ancestor relationship? and the related Ancestor relation in datastore.
You'll need to consider Balancing Strong and Eventual Consistency with Google Cloud Datastore.
The rest of the answer assumes no entity ancestry is used.
To associate a product to a category (or several of them, if you want, using repeated properties) you can have:
class Product(ndb.Model):
    name = ndb.StringProperty()
    category = ndb.KeyProperty(kind='Category', repeated=True)

category = ndb.Key("Category", "Books")
new_product = Product(name="Coding Horrors Book",
                      category=[category]).put()
This approach has a scalability issue: if a product falls into many categories, updating the category list becomes increasingly slower (the entire entity, progressively growing, needs to be re-written every time) and, if the property is indexed, it is sensitive to the exploding indexes problem.
This can be avoided by storing product-category relationships as separate entities:
class ProductCategory(ndb.Model):
    product = ndb.KeyProperty(kind='Product')
    category = ndb.KeyProperty(kind='Category')
This scales a lot better, but in this case you'll need a ProductCategory query to determine the relationship entities for a product, followed by key lookups to get the details of the related categories, something along these lines:
pcs = ProductCategory.query(ProductCategory.product == product_key) \
                     .fetch(limit=500)
if pcs:
    categories = ndb.get_multi([pc.category for pc in pcs])
    logging.info('product %s categories: %s'
                 % (product.name, ','.join([c.name for c in categories])))
I want to create a database of disliked items, but depending on the category of an item there are different columns I'd like to show, e.g. when all you're looking at is cars. In fact, I'd like the columns to be dynamic based on the category, so we can easily add an additional property to cars in the future and have that column show up too.
For example, when you filter on car or person, additional rows show up for filtering.
All the examples that I can find about using django models aren't giving me a very clear picture on how I might accomplish this behavior in a clean, simple web interface.
I would probably go for a model describing a "dislike criterion":
class DislikeElement(models.Model):
    item = models.ForeignKey(Item, on_delete=models.CASCADE)  # Item is the model corresponding to your first table
    field_name = models.CharField(max_length=100)  # e.g. "Model", "Year born"...
    value = models.CharField(max_length=100)  # e.g. "Mustang", "1960"...
You would have quite a lot of flexibility in what data you can retrieve. For example, to get all the dislike elements for a given item, you would just have to do something like item.dislikeelement_set.all().
The only problem with this solution is that you would have to store numbers, strings, dates... in value under the same data type. But maybe that's not an issue for you.
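Filtering on the dynamic columns then becomes a lookup through the related model. A sketch, assuming the models above:

# All items whose "Model" attribute is "Mustang"
mustangs = Item.objects.filter(dislikeelement__field_name="Model",
                               dislikeelement__value="Mustang")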
I had the following Cassandra model:
class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(index=True)
    model = columns.Text(index=True)
    price = columns.Decimal(index=True)
I needed the following queries:
q = Automobile.objects.filter(manufacturer='Tesla')
q = Automobile.objects.filter(year='something')
q = Automobile.objects.filter(model='something')
q = Automobile.objects.filter(price='something')
These were all working fine, until I wanted multiple-column filtering, i.e. when I tried
q = Automobile.objects.filter(manufacturer='Tesla', year=2013)
it throws an error saying Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
I rewrote the query with allow_filtering(), but this is not an optimal solution.
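(For reference, that escape hatch looks something like this in cqlengine; it adds an ALLOW FILTERING clause to the CQL and is generally discouraged:)

q = Automobile.objects.filter(manufacturer='Tesla', year=2013).allow_filtering()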
Then, upon reading more, I edited my model as follows:
class Automobile(Model):
    manufacturer = columns.Text(primary_key=True)
    year = columns.Integer(primary_key=True)
    model = columns.Text(primary_key=True)
    price = columns.Decimal()
With this I was able to filter on multiple columns as well, without any warning.
When I did DESCRIBE TABLE automobile, it showed that this creates the composite key PRIMARY KEY ((manufacturer), year, model).
So, my question is: what if I declare every attribute as a primary key? Is there any problem with this, since I'll then be able to filter on multiple columns as well? This is just a small model. What if I had a model such as:
class UserProfile(Model):
    id = columns.UUID(primary_key=True, default=uuid.uuid4)
    model = columns.Text()
    msisdn = columns.Text(index=True)
    gender = columns.Text(index=True)
    imei1 = columns.Set(columns.Text)
    circle = columns.Text(index=True)
    epoch = columns.DateTime(index=True)
    cellid = columns.Text(index=True)
    lacid = columns.Text(index=True)
    mcc = columns.Text(index=True)
    mnc = columns.Text(index=True)
    installed_apps = columns.Set(columns.Text)
    otp = columns.Text(index=True)
    regtype = columns.Text(index=True)
    ctype = columns.Text(index=True)
    operator = columns.Text(index=True)
    dob = columns.DateTime(index=True)
    jsonver = columns.Text(index=True)
and if I declare every attribute as part of the PK, is there any problem with that?
To understand this, you need to understand how Cassandra stores data. The first key in the primary key is called the partition key; it defines the partition the row belongs to. All rows in a partition are stored together, and replicated together. Inside a partition, rows are stored according to the clustering keys - the columns in the PK that are not the partition key. So, if your PK is (a, b, c, d), a defines the partition. Within a particular partition (say, a = a1), the rows are stored sorted by b; for each b, the rows are stored sorted by c... and so on. When querying, you hit one (or a few) partitions, and then need to specify every successive clustering key up to the key you're looking for. These have to be exact equalities, except for the last clustering column specified in your query, which may be a range query.
In the previous example, you could thus do
where a = a1 and b > b1
where a = a1 and b=b1 and c>c1
where a = a1 and b=b1 and c=c1 and d > d1
but can't do this:
where a=a1 and c=c1
To do that, you'd need ALLOW FILTERING (realistically, you should look at changing your model, or denormalizing, at that point).
Now, on to your question about making every column part of the PK. You could do that, but remember: all writes in Cassandra are upserts, and rows are identified by their primary key. If you make every column part of the PK, you'll not be able to edit a row, because you're not allowed to update the value of any column that's in the primary key.
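To illustrate (a hypothetical sketch, assuming price were also made part of the PK):

# With PRIMARY KEY (manufacturer, year, model, price), a "price change"
# is just an insert of a brand-new row; the old row stays untouched.
Automobile.create(manufacturer='Tesla', year=2013, model='S', price=59999)
Automobile.create(manufacturer='Tesla', year=2013, model='S', price=62999)
# The table now holds both rows; there is no way to update price in place.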
The proper way to solve this, is to take a query-based modeling approach. Instead of one table with three secondary indexes, you should solve this with four (maybe three) tables and ZERO secondary indexes.
Your original Automobile table is probably ok, although I'd be curious to see your primary key definition. But to solve your query Automobile.objects.filter(year='something'), I would create an additional query table like this (note: defined in CQL):
CREATE TABLE automobileByYear (
    manufacturer text,
    year bigint,
    model text,
    price decimal,
    PRIMARY KEY ((year), manufacturer, model));
Assuming that you also create a corresponding class on the Python side for this model (AutomobileByYear), you could then serve a query like:
AutomobileByYear.objects.filter(year=2013)
Additionally, having manufacturer as your first clustering key would also allow this query to function:
AutomobileByYear.objects.filter(manufacturer='Tesla', year=2013)
Likewise, to solve for your query by model, I would create an additional query table (automobileByModel), with the PRIMARY KEY definition of the table re-ordered like this:
PRIMARY KEY ((model), manufacturer, year));
The order of your clustering keys (manufacturer and year) would vary by your query requirements, but the point is that model should be your partition key in this case.
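On the Python side, the corresponding cqlengine model might look like this (a sketch; in cqlengine the first primary_key column becomes the partition key and the remaining ones become clustering keys):

class AutomobileByModel(Model):
    model = columns.Text(primary_key=True)         # partition key
    manufacturer = columns.Text(primary_key=True)  # clustering key
    year = columns.Integer(primary_key=True)       # clustering key
    price = columns.Decimal()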
EDIT
...but should it be such that I design my tables per my queries, thereby having a LOT of data redundancy? Let's say I have this same Automobile model with N fields, where N=10. If I want to filter by every one of the N fields, should I create a different model for every different filter-type query?
In this day and age disk is WAY cheaper than it used to be. That being said, I understand that it isn't always easy to just throw more disk at a problem. The bigger problem I see is adjusting the DAO layer of your application to keep 10 tables in sync.
In that case, I would advise integrating with a search tool like Elasticsearch or Solr. In fact, the enterprise version of Cassandra integrates with Solr out-of-the-box. If you really do need to run queries on 10+ columns, a robust search tool would complement your Cassandra cluster nicely.
I'm trying to extract information from a number of denormalized tables, using Django models. The tables are pre-existing, part of a legacy MySQL database.
Schema description
Let's say that each table describes traits about a person, and each person has a name (this essentially identifies the person, but does not correspond to some unifying "Person" table). For example:
class JobInfo(models.Model):
    name = models.CharField(primary_key=True, max_length=100, db_column='name')  # max_length values assumed
    startdate = models.DateField(db_column='startdate')
    ...

class Hobbies(models.Model):
    name = models.CharField(primary_key=True, max_length=100, db_column='name')
    exercise = models.CharField(max_length=100, db_column='exercise')
    ...

class Clothing(models.Model):
    name = models.CharField(primary_key=True, max_length=100, db_column='name')
    shoes = models.CharField(max_length=100, db_column='shoes')
    ...

# Twenty more classes exist, all of the same format
Accessing via SQL
In raw SQL, when I want to access information across all tables, I do a series of ugly OUTER JOINs, refining it with a WHERE clause.
SELECT JobInfo.startdate, JobInfo.employer, JobInfo.salary,
       Hobbies.exercise, Hobbies.fun,
       Clothing.shoes, Clothing.shirt, Clothing.pants
       ...
FROM JobInfo
LEFT OUTER JOIN Hobbies ON Hobbies.name = JobInfo.name
LEFT OUTER JOIN Clothing ON Clothing.name = JobInfo.name
...
WHERE
    Clothing.shoes REGEXP "Nike" AND
    Hobbies.exercise REGEXP "out"
    ...;
Model-based approach
I'm trying to convert this to a Django-based approach, where I can easily get a QuerySet that pulls in information from all tables.
I've looked into using a OneToOneField (example), making one table have a field tying it to each of the others. However, this would mean that one table has to be the "central" table, which all the others reference in reverse. This seems like a mess with twenty-odd fields, and doesn't really make schematic sense (is "job info" the core set of properties? clothes?).
I feel like I'm going about this the wrong way. How should I be building a QuerySet on related tables, where each table has one primary key field common across all tables?
If your DB access allows this, I would probably do it by defining a Person model, then declaring the name DB column to be a foreign key to that model with to_field set to name on the Person model. Then you can use the usual __ syntax in your queries.
Assuming Django doesn't complain about a ForeignKey field with primary_key=True, anyway.
class Person(models.Model):
    name = models.CharField(primary_key=True, max_length=...)

class JobInfo(models.Model):
    person = models.ForeignKey(Person, on_delete=models.CASCADE,
                               primary_key=True, db_column='name',
                               to_field='name')
    startdate = models.DateField(db_column='startdate')
    ...
I don't think to_field is actually required as long as name is declared as the primary key on Person, but I think it's good for clarity - and it would be required if you don't declare name as the PK on Person.
I haven't tested this, though.
An alternative is to use database views, and there you have two options. I think both would work best with an actual table containing all the known user names, maybe with a numeric PK, as Django usually expects. Let's assume that table exists - call it person.
One option is to create a single large view to encompass all information about a user, similar to the big join you use above - something like:
create or replace view person_info as
select person.id, person.name,
       jobinfo.startdate, jobinfo.employer, jobinfo.salary,
       hobbies.exercise, hobbies.fun,
       clothing.shoes, ...
from person
left outer join hobbies on hobbies.name = person.name
left outer join jobinfo on jobinfo.name = person.name
left outer join clothing on clothing.name = person.name
;
That might take a little debugging, but the idea should be clear.
Then declare your model with db_table = 'person_info' and managed = False in the Meta class.
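For instance, a sketch of such a model (the field list is abridged; names follow the view above, while lengths and null settings are assumptions):

class PersonInfo(models.Model):
    name = models.CharField(primary_key=True, max_length=100)
    startdate = models.DateField(null=True)
    exercise = models.CharField(max_length=100, null=True)
    shoes = models.CharField(max_length=100, null=True)
    # ... remaining view columns ...

    class Meta:
        db_table = 'person_info'
        managed = False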
A second option would be to declare a view for each subsidiary table that includes the person_id value matching the name, then just use Django FKs.
create or replace view jobinfo_by_person as
select person.id as person_id, jobinfo.*
from person inner join jobinfo on jobinfo.name = person.name;
create or replace view hobbies_by_person as
select person.id as person_id, hobbies.*
from person inner join hobbies on hobbies.name = person.name;
etc. Again, I'm not totally sure the .* syntax will work - if not, you'd have to list all the fields you're interested in. And check what the column names from the subsidiary tables are.
Then point your models at the by_person versions and use the standard FK setup.
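A sketch of what that might look like for the job info view (again, column names assumed from the views above):

class JobInfo(models.Model):
    person = models.ForeignKey(Person, on_delete=models.DO_NOTHING,
                               db_column='person_id')
    startdate = models.DateField(db_column='startdate')

    class Meta:
        db_table = 'jobinfo_by_person'
        managed = False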
This is a little inelegant and I make no claims for good performance, but it does let you avoid further denormalizing your database.
I've built a product database that is divided into 3 parts, and each part has a "sub" part containing labels. But the more I work with it, the more unstable it feels, and each addition I make takes more and more code to get working.
A product is built of parts, and each part is of a type. Each product, part and type has a label. And there's a label for each language.
A product contains parts in 2 lists: one list for default parts (one of each type) and one for optional parts.
Now I want to add currency to the mix, and I have come to the decision to re-model the entire way I handle this.
The result I want is a list of all product objects containing the name, description, price, all parts and all types that match the parts - and for these, the correct language labels.
Like so:
product
- name
- description (by language)
- price (by currency)
- parts
- part (type name and part name by language)
- partPrice (by currency)
The problem with my current setup is that it is a wild mix of db.ReferenceProperty and db.ListProperty(db.Key), and getting all the data is a bit of a hassle that requires multiple for-loops, matching dicts and datastore calls. Well, it's a bit of a mess.
The re-model (untested) looks like this:
class Products(db.Model):
    name = db.StringProperty()
    imageUrl = db.StringProperty()
    optionalParts = db.ListProperty(db.Key)
    defaultParts = db.ListProperty(db.Key)
    active = db.BooleanProperty(default=True)

    @property
    def itemId(self):
        return self.key().id()

class ProductPartTypes(db.Model):
    name = db.StringProperty()

    @property
    def itemId(self):
        return self.key().id()

class ProductParts(db.Model):
    name = db.StringProperty()
    type = db.ReferenceProperty(ProductPartTypes)
    imageUrl = db.StringProperty()
    parts = db.ListProperty(db.Key)

    @property
    def itemId(self):
        return self.key().id()

class Labels(db.Model):
    key = db.StringProperty()  # want to store a key here (note: 'key' is a reserved name on db.Model, so this would need renaming)
    language = db.StringProperty()
    label = db.StringProperty()

class Price(db.Model):
    key = db.StringProperty()  # want to store a key here (same naming caveat)
    language = db.StringProperty()
    price = db.IntegerProperty()
The major thing here is that I've split the Labels and Price out. So these can contain labels and prices for any products, parts or types.
So what I am curious about: is this a solid solution from an architectural point of view? Will it hold up even if there are thousands of entries in each model? Also, any tips for retrieving data in a good manner are welcome. My current solution of getting all the data first, for-looping over it and sticking it into dicts works, but feels like it could fail any minute.
..fredrik
You need to keep in mind that App Engine's datastore requires you to rethink your usual way of designing databases. It goes against intuition at first but you must denormalize your data as much as possible if you want your application to be scalable. The datastore has been designed this way.
The approach I usually take is to consider first what kind of queries will need to be done in different use cases, e.g. what data do I need to retrieve at the same time? In what order? What properties should be indexed? If I understand correctly, your main goal is to fetch a list of products with complete details. By the way, if you have other query scenarios - i.e. filtering on price, type, etc. - you should take them into account too. In order to fetch all the data you need with only one query, I suggest you create one model which could look like this:
class ProductPart(db.Model):
    product_name = db.StringProperty()
    product_image_url = db.StringProperty()
    product_active = db.BooleanProperty(default=True)
    product_description = db.StringListProperty(indexed=False)  # Contains product description in all languages
    part_name = db.StringProperty()
    part_image_url = db.StringProperty()
    part_type = db.StringListProperty(indexed=False)   # Contains part type in all languages
    part_label = db.StringListProperty(indexed=False)  # Contains part label in all languages
    part_price = db.ListProperty(float, indexed=False) # Contains part price in all currencies
    part_default = db.BooleanProperty()
    part_optional = db.BooleanProperty()
About this solution:
- ListProperties are set to indexed=False in order to avoid exploding indexes if you don't need to filter on them.
- In order to get the right description, label or type, you will have to set list values always in the same order. For example: part_label[0] is English, part_label[1] is Spanish, etc. Same idea for prices and currencies.
- After fetching entities from this model you will have to do some in-memory manipulations in order to get the data nicely structured the way you want, maybe in a new dictionary (see the sketch below).
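A hypothetical sketch of that restructuring step (the language and currency orderings are assumptions you would fix by convention):

LANGUAGES = ['en', 'es']      # assumed fixed ordering convention
CURRENCIES = ['USD', 'EUR']   # assumed fixed ordering convention

def part_to_dict(entity):
    return {
        'product': entity.product_name,
        'part': entity.part_name,
        'labels': dict(zip(LANGUAGES, entity.part_label)),
        'prices': dict(zip(CURRENCIES, entity.part_price)),
    }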
Obviously, there will be a lot of redundancy in the datastore with such a design - but that's okay, since it allows you to query the datastore in a scalable fashion.
Besides, this is not meant as a replacement for the architecture that you had in mind, but rather as an additional model designed specifically for the user-facing kind of queries that you need to do, i.e. retrieving lists of complete product/part information.
These ProductPart entities could be populated by background tasks, replicating data located in your other normalized entities which would be the authoritative data source. Since you have plenty of data storage on App Engine, this should not be a problem.
IMO your design mostly makes sense. I came up with almost the same design after reading your problem statement, with a few differences:
I had prices on Product and ProductPart, not in a separate table.
The other difference was part_types. If there are not many part types, you can simply have them as a Python list/tuple:
part_types = ('wheel', 'brake', 'mirror')
It also depends on the kind of queries you are anticipating. If there are many queries of the price-calculation sort (independent of the rest of the product and part info), then it might make sense to design it the way you have done.
You have mentioned that you will get all the data first. Isn't querying possible? If you fetch the whole dataset into your app and then sort/filter in Python, it will be slow. Which database are you considering? To me, MongoDB looks like a good option here.
Finally, why are you worried about even 1000 records? You can run a few tests on your db beforehand.
Bests