NoSQL. Cassandra. Modelling external ids - python

We have a product model like this:
import uuid

from cqlengine import columns
from cqlengine.models import Model

class Product(Model):
    """
    Base Product Model
    """
    shop_id = columns.UUID(primary_key=True, required=True)
    product_id = columns.UUID(primary_key=True, required=True, default=uuid.uuid4)
    wikimart_id = columns.Integer(index=True)  # Convert to a user defined type?
    yandex_id = columns.Integer(index=True)
Periodically (once a day) we update products from a list.
Currently we have to use constructions like:
if Product.filter(wikimart_id=external_id):
    p = Product.get(shop_id=shop_id, wikimart_id=external_id)
    d['product_id'] = p.product_id  # set the key in the dict the model will be updated from
Is this OK for Cassandra, or should we think about creating models that have the external id as a primary key for updating products?
Like:
class ProductWikimart(Model):
    """
    Wikimart Product Model
    """
    shop_id = columns.UUID(primary_key=True, required=True)
    wikimart_id = columns.Integer(primary_key=True)
    product_id = columns.UUID(index=True)

class ProductYandex(Model):
    """
    Yandex Product Model
    """
    shop_id = columns.UUID(primary_key=True, required=True)
    yandex_id = columns.Integer(primary_key=True)
    product_id = columns.UUID(index=True)
Which way is more preferable?
UPD: This question is about generic modelling for NoSQL, not only about Cassandra :)

Maybe this article would be helpful for you.
I don't think product_id is a good candidate for a clustering key, because it changes relatively frequently. So I think the second version of the product model (with ProductWikimart and ProductYandex) would be better. But then you can run into new problems: for instance, how do you match ProductWikimart and ProductYandex product ids?
Speaking of data modelling for Cassandra in general, there is the "Model Around Your Queries" rule. So, to tell which table structure would be better, we should know how it will be queried.
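For illustration, a minimal sketch (not from the original post) of what the daily update could look like with the ProductWikimart lookup table: the external id is resolved with a single primary-key read instead of a secondary-index filter. The names shop_id, external_id and the update dict d are taken from the question.
# Hedged sketch, assuming the ProductWikimart model above.
try:
    link = ProductWikimart.get(shop_id=shop_id, wikimart_id=external_id)
    d['product_id'] = link.product_id  # existing product: update it
except ProductWikimart.DoesNotExist:
    d['product_id'] = uuid.uuid4()     # new product: remember the mapping
    ProductWikimart.create(shop_id=shop_id,
                           wikimart_id=external_id,
                           product_id=d['product_id'])
The same pattern would apply to ProductYandex; the trade-off is one extra write per external system whenever a product is created.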

Related

Django - Can I add a calculated field that only exists for a particular sub-set of occurrences of my model?

Imagine that you have a model with some date-time fields that can be categorized depending on the date. You make an annotation for the model with different cases that assign a different 'status' depending on the calculation for the date-time fields:
# models.py
from django.db import models

from .managers import MyModelCustomManager

class Status(models.TextChoices):
    status_1 = 'status_1'
    status_2 = 'status_2'
    status_3 = 'status_3'
    special_status = 'special_status'

class MyModel(models.Model):
    important_date_1 = models.DateField(null=True)
    important_date_2 = models.DateField(null=True)
    calculated_status = models.CharField(max_length=32, choices=Status.choices, default=None, null=True, blank=False)
    objects = MyModelCustomManager()
And the manager that does the calculation as annotations:
# managers.py
from django.db import models
from django.db.models import Case, Value, When

class MyModelCustomManager(models.Manager):
    def get_queryset(self):
        queryset = super().get_queryset().annotate(**{
            'status': Case(
                When(**{'important_date_1': foo, 'then':
                    Value(Status.status_1)}),
                When(**{'important_date_2': fii, 'then':
                    Value(Status.status_2)}),
                When(**{'important_date_1': foo, 'important_date_2': whatever, 'then':
                    Value(Status.status_3)}),
                # and so on and so on
            )
        })
        return queryset
Now, here's where it gets tricky. Only one of these sub-sets of occurrences on the model requires an ADDITIONAL CALCULATED FIELD that literally only exists for it, that looks something like this:
special_calculated_field = F('important_date_1') - F('important_date_2')  # only for special_status
So, basically I want to make a calculated field with the condition that the model instance must belong to this specific status. I don't want to make it a plain annotation or field, because other instances of the model would then always have this value set to null or empty, and I feel it would be a waste of a column in the database.
Is there a way, for example, to do this kind of query:
>>> my_model_instance = MyModel.objects.filter(status='special_status')
>>> my_model_instance.special_calculated_field
Thanks a lot in advance if anyone can chime in with some help.
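One possible approach (a hedged sketch, not from the original post): since status already exists as an annotation from the custom manager, the extra value can itself be an annotation added only to the queryset that has been filtered down to the special status, so it never exists for other instances.
from django.db.models import DurationField, ExpressionWrapper, F

# Sketch only: annotate the calculated value on the already-filtered queryset.
special_instances = MyModel.objects.filter(
    status=Status.special_status
).annotate(
    special_calculated_field=ExpressionWrapper(
        F('important_date_1') - F('important_date_2'),
        output_field=DurationField(),
    )
)

for instance in special_instances:
    print(instance.special_calculated_field)  # a timedelta for each special instance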

Django - Using signals to make changes to other models

say I have two models like so...
class Product(models.Model):
    ...
    overall_rating = models.IntegerField()
    ...

class Review(models.Model):
    ...
    product = models.ForeignKey(Product, related_name='review', on_delete=models.CASCADE)
    rating = models.IntegerField()
    ...
I want to use the ratings from all of the child Review objects to build an average overall_rating for the parent Product.
Question: I'm wondering how I may be able to achieve something like this using Django signals?
I am a bit of a newbie to this part of Django; have never really understood the signals part before.
This overall_rating value needs to be stored in the database instead of calculated using a method since I plan on ordering the Product objects based on their overall_rating which is done on a DB level. The method may look something like this if I were to implement it (just for reference):
def overall_rating(self):
    review_count = self.review.count()
    if review_count >= 1:
        ratings = self.review.all().values_list('rating', flat=True)
        rating_sum = 0
        for i in ratings:
            rating_sum += int(i)
        return rating_sum / review_count
    else:
        return 0
Thank you
You want to update your Product after each save of a Review, so the simplest way is to use the post_save signal. For example, after each saved review you can fetch all of the product's reviews, calculate the overall rating, and save it back to the Product.
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Review, dispatch_uid="update_overall_rating")
def update_rating(sender, instance, **kwargs):
    parent = instance.product
    all_reviews = Review.objects.filter(product=parent)
    parent.overall_rating = get_overall_rating(all_reviews)  # helper that averages the ratings
    parent.save(update_fields=['overall_rating'])
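The answer leaves get_overall_rating undefined; here is a hedged sketch of what it could look like, using Django's Avg aggregate rather than the Python loop from the question (rounding to an integer is an assumption, since overall_rating is an IntegerField):
from django.db.models import Avg

def get_overall_rating(reviews):
    # Let the database average the rating column; fall back to 0 when there are no reviews yet.
    avg = reviews.aggregate(avg_rating=Avg('rating'))['avg_rating']
    return round(avg) if avg is not None else 0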

Is it good to use Django ContentType Framework, or to keep it simple with M2M Relationship?

I'm having a dilemma choosing between using ContentType and using ManyToManyField.
Consider the following example:
class Book(Model):
    identifiers = ManyToManyField('Identifier')
    title = CharField(max_length=10)

class Series(Model):
    identifiers = ManyToManyField('Identifier')
    book = ForeignKey('Book')
    name = CharField(max_length=10)

class Author(Model):
    identifiers = ManyToManyField('Identifier')
    name = CharField(max_length=10)

class Identifier(Model):
    id_type = ForeignKey('IdType')
    value = CharField(max_length=10)

class IdType(Model):
    # Sample values:
    # Book: ISBN10, ISBN13, LCCN
    # Serial: ISSN
    # Author: DAI, AIS
    name = CharField(max_length=10)
As you can see, Identifier is used in many places; in fact, it is so generic that many business-related objects require an Identifier, similar to how TagItem is used in the Django examples.
An alternative approach is to generalize this using a generic relation:
class Book(Model):
    identifiers = GenericRelation('Identifier')
    title = CharField(max_length=10)
    authors = ManyToManyField('Author')

class Series(Model):
    identifiers = GenericRelation('Identifier')
    book = ForeignKey('Book')
    name = CharField(max_length=10)

class Author(Model):
    identifiers = GenericRelation('Identifier')
    name = CharField(max_length=10)

class Identifier(Model):
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey('content_type', 'object_id')
    id_type = ForeignKey('IdType')
    value = CharField(max_length=10)

class IdType(Model):
    # Sample values:
    # Book: ISBN10, ISBN13, LCCN, MyLibrary, YourLibrary, XYZLibrary, etc.
    # Serial: ISSN, XYZSerial, etc.
    # Author: DAI, AIS, XYZAuthor, etc.
    name = CharField(max_length=10)
I'm unsure whether I'm raising the right concerns regarding generic relations.
I'm worried about the growth of the Identifier table, as everything ends up in one table that will grow very fast. For example:
100,000 books, average 4 identifiers each (total 400,000 identifier records)
average 2 authors per book (total 200,000 identifier records)
For each record in Book, the Identifier table grows by 4-6 rows, so it will soon hold millions of records. Will queries become very slow in the long run? Moreover, identifier is a field that is queried and used in the application.
Is this generalization done correctly? An Author identifier is completely unrelated to a Book identifier, so arguably each should have its own BookIdentifier and AuthorIdentifier model. Although they share the IdType.name / Identifier.value pattern, the domains are not related at all: one is an author, the other is a book. Should they be generalized? Why or why not?
What problems could there be if I implement this with the GenericRelation model?
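For reference, a minimal sketch (not from the original post) of how identifiers would be queried under the generic-relation version; the lookup values are made up for illustration:
from django.contrib.contenttypes.models import ContentType

# Hedged sketch: reverse access through the GenericRelation...
book = Book.objects.get(title='Some Book')      # hypothetical lookup
book_identifiers = book.identifiers.all()

# ...or filter the Identifier table directly for one content type.
book_ct = ContentType.objects.get_for_model(Book)
isbns = Identifier.objects.filter(content_type=book_ct, id_type__name='ISBN13')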

Django query: Joining two models with two fields

I have the following models:
class AcademicRecord(models.Model):
    record_id = models.PositiveIntegerField(unique=True, primary_key=True)
    subjects = models.ManyToManyField(Subject, through='AcademicRecordSubject')
    ...

class AcademicRecordSubject(models.Model):
    academic_record = models.ForeignKey('AcademicRecord')
    subject = models.ForeignKey('Subject')
    language_group = IntegerCharField(max_length=2)
    ...

class SubjectTime(models.Model):
    time_id = models.CharField(max_length=128, unique=True, primary_key=True)
    subject = models.ForeignKey(Subject)
    language_group = IntegerCharField(max_length=2)
    ...

class Subject(models.Model):
    subject_id = models.PositiveIntegerField(unique=True, primary_key=True)
    ...
Each academic record has a list of subjects, each with a language code, and each subject time has a subject and a language code.
Given an AcademicRecord, how can I get the SubjectTimes that match the AcademicRecordSubjects that the AcademicRecord has?
This is my approach, but it makes more queries than needed:
# record is the given AcademicRecord
times = []
for record_subject in record.academicrecordsubject_set.all():
    matched_times = SubjectTime.objects.filter(subject=record_subject.subject)
    current_times = matched_times.filter(language_group=record_subject.language_group)
    times.append(current_times)
I want to make the query using the Django ORM, not raw SQL.
The SubjectTime language group has to match the AcademicRecordSubject's language group as well.
I got it, in part thanks to @Robert Jørgensgaard Eng.
My problem was how to do the inner join on more than one field, which is where the F object came in handy.
The correct query is:
SubjectTime.objects.filter(subject__academicrecordsubject__academic_record=record,
                           subject__academicrecordsubject__language_group=F('language_group'))
Given an AcademicRecord instance academic_record, it is either
SubjectTime.objects.filter(subject__academicrecordsubject_set__academic_record=academic_record)
or
SubjectTime.objects.filter(subject__academicrecordsubject__academic_record=academic_record)
The results reflect all the rows of the join that these ORM queries become in SQL. To avoid duplicates, just use distinct().
Now, this would be much easier if I had a Django shell to test in :)
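Putting the two answers together, a hedged sketch of the combined query (assuming the models above and an AcademicRecord instance named record):
from django.db.models import F

# Sketch: join on both the academic record and the language group,
# then drop the duplicate rows that the SQL join produces.
times = SubjectTime.objects.filter(
    subject__academicrecordsubject__academic_record=record,
    subject__academicrecordsubject__language_group=F('language_group'),
).distinct()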

Django filter against ForeignKey and by result of manytomany sub query

I've looked at doing a query using an extra and/or annotate but have not been able to get the result I want.
I want to get a list of Products that have active licenses, together with the total number of available licenses. An active license is defined as one that is not obsolete, is in date, and still has seats available (the number of licenses less the number of assigned licensees, as counted on the many-to-many field).
The models I have defined are:
from datetime import date

from django.contrib.auth.models import User
from django.db import models

class Vendor(models.Model):
    name = models.CharField(max_length=200)
    url = models.URLField(blank=True)

class Product(models.Model):
    name = models.CharField(max_length=200)
    vendor = models.ForeignKey(Vendor)
    product_url = models.URLField(blank=True)
    is_obsolete = models.BooleanField(default=False, help_text="Is this product obsolete?")

class License(models.Model):
    product = models.ForeignKey(Product)
    num_licenses = models.IntegerField(default=1, help_text="The number of assignable licenses.")
    licensee_name = models.CharField(max_length=200, blank=True)
    license_key = models.TextField(blank=True)
    license_startdate = models.DateField(default=date.today)  # pass the callable, not date.today()
    license_enddate = models.DateField(null=True, blank=True)
    is_obsolete = models.BooleanField(default=False, help_text="Is this license obsolete?")
    licensees = models.ManyToManyField(User, blank=True)
I have tried filtering by the License model, which works, but I don't know how to then collate / GROUP BY / aggregate the returned data into a single queryset.
When trying to filter by product, I can't quite figure out the query I need. I can get bits and pieces, and have tried using an .extra() select= query to return the number of available licenses (which is all I really need at this point), given that there can be multiple licenses associated with a product.
So the ultimate question is: how can I retrieve a list of available products with the number of available licenses in Django? I'd rather not resort to raw SQL if I can avoid it.
An example queryset that gets all the License details I want, I just can't get the product:
License.objects.annotate(
    used_licenses=Count('licensees')
).extra(
    select={
        'avail_licenses': 'licenses_license.num_licenses - (SELECT count(*) FROM licenses_license_licensees WHERE licenses_license_licensees.license_id = licenses_license.id)'
    }
).filter(
    is_obsolete=False,
    num_licenses__gt=F('used_licenses')
).exclude(
    license_enddate__lte=date.today()
)
Thank you in advance.
EDIT (2014-02-11):
I think I've solved it, in possibly an ugly way. I didn't want to make too many DB calls, so I get all the information with one License query, then filter it in Python and return it from inside a manager class. Maybe an overuse of dicts and lists, but it works, and I can expand it with additional info later without much risk or custom SQL. It also uses some of the properties that I have defined on the model class.
class LicenseManager(models.Manager):
    def get_available_products(self):
        licenses = self.get_queryset().annotate(
            used_licenses=Count('licensees')
        ).extra(
            select={
                'avail_licenses': 'licenses_license.num_licenses - (SELECT count(*) FROM licenses_license_licensees WHERE licenses_license_licensees.license_id = licenses_license.id)'
            }
        ).filter(
            is_obsolete=False,
            num_licenses__gt=F('used_licenses')
        ).exclude(
            license_enddate__lte=date.today()
        ).prefetch_related('product')

        products = {}
        for lic in licenses:
            if lic.product not in products:
                products[lic.product] = lic.product
                products[lic.product].avail_licenses = lic.avail_licenses
            else:
                products[lic.product].avail_licenses += lic.avail_licenses

        avail_products = []
        for prod in products.values():
            if prod.avail_licenses > 0:
                avail_products.append(prod)
        return avail_products
EDIT (2014-02-12):
Okay, this is the final solution I have decided to go with. It uses Python to filter the results, reduces cache calls, and issues a constant number of SQL queries.
The lesson here is that for something with many levels of filtering, it is best to fetch as much as needed in one go and filter in Python once it is returned.
class ProductManager(models.Manager):
    def get_all_available(self, curruser):
        """
        Gets all available Products that are available to the current user.
        """
        q = self.get_queryset().select_related().prefetch_related('license', 'license__licensees').filter(
            is_obsolete=False,
            license__is_obsolete=False
        ).exclude(
            license__license_enddate__lte=date.today()
        ).distinct()

        # Return a curated list; further information is needed first.
        products = []
        for x in q:
            x.avail_licenses = 0
            x.user_assigned = False
            # Check the licenses at the model level, since they are prefetched, to save SQL queries.
            for y in x.license.all():
                if not y.is_active:
                    break
                x.avail_licenses += y.available_licenses
                if curruser in y.licensees.all():
                    x.user_assigned = True
            products.append(x)
        return products
One strategy would be to get all the product ids from your License queryset:
productIDList = list(License.objects.filter(...).values_list(
    'product_id', flat=True))
and then query the products using that list of ids:
Product.objects.filter(id__in=productIDList)
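As a hedged variation on that strategy (not from the original answers), the intermediate list can be skipped by passing the values queryset straight to id__in, letting the database do it in a single statement; the filter conditions are the ones from the question:
from datetime import date
from django.db.models import Count, F

# Sketch: active licenses as a subquery feeding the Product filter.
active_licenses = License.objects.annotate(
    used_licenses=Count('licensees')
).filter(
    is_obsolete=False,
    num_licenses__gt=F('used_licenses')
).exclude(
    license_enddate__lte=date.today()
)

available_products = Product.objects.filter(
    id__in=active_licenses.values_list('product_id', flat=True)
).distinct()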
