In developing a website for indexing system documentation I've come across a tough nut to crack regarding data "matching"/relations across databases in Django.
A simplified model for my local database:
from django.db import models

class Document(models.Model):
    name = models.CharField(max_length=200)
    system_id = models.IntegerField()
    ...
Imagined model; the system details are stored in a remote database:
from django.db import models

class System(models.Model):
    name = models.CharField(max_length=200)
    system_id = models.IntegerField()
    ...
The idea is that when creating a new Document entry at my website the ID of the related system is to be stored in the local database. When presenting the data I would have to use the stored ID to retrieve the system name among other details from the remote database.
I've looked into foreign keys across databases, but this seems very involved and I'm not sure I want actual relations. Rather, I visualize a function inside the Document model/class which is able to retrieve the matching data, for example by importing a custom router/function.
How would I go about solving this?
Note that I won't be able to alter anything on the remote database, and it's read-only. I'm not sure if I should create a model for System as well. Both databases use PostgreSQL, though my impression is that it doesn't really matter for this scenario which database is used.
From the django documentation on multiple databases (manually selecting a database):
# This will run on the 'default' database.
Author.objects.all()
# So will this.
Author.objects.using('default').all()
# This will run on the 'other' database.
Author.objects.using('other').all()
'default' and 'other' are aliases for your databases.
In your case they could be 'default' and 'remote'.
Of course you can replace the .all() with anything you want.
Example: System.objects.using('remote').get(id=123456)
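For reference, those aliases are just the keys of the DATABASES setting. A minimal sketch with placeholder connection details (adjust engine, names, users and hosts to your setup):

# settings.py -- the keys of DATABASES are the aliases passed to .using()
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'local_db',
        # USER, PASSWORD, HOST, PORT ...
    },
    'remote': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'remote_db',
        # USER, PASSWORD, HOST, PORT ...
    },
}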
You are correct that foreign keys across databases are a problem in Django ORM, and to some extent at the db level too.
You already have the answer basically: "I visualize a function inside the Document model/class which is able to retrieve the matching data"
I'd do it like this:
class RemoteObject(object):
    def __init__(self, remote_model, remote_db, field_name):
        # assumes remote db is defined in Django settings and has an
        # associated Django model definition:
        self.remote_model = remote_model
        self.remote_db = remote_db
        # name of id field on model (real db field):
        self.field_name = field_name
        # we will cache the retrieved remote model on the instance
        # the same way that Django does with foreign key fields:
        self.cache_name = '_{}_cache'.format(field_name)

    def __get__(self, instance, cls):
        try:
            rel_obj = getattr(instance, self.cache_name)
        except AttributeError:
            system_id = getattr(instance, self.field_name)
            remote_qs = self.remote_model.objects.using(self.remote_db)
            try:
                rel_obj = remote_qs.get(id=system_id)
            except self.remote_model.DoesNotExist:
                rel_obj = None
            setattr(instance, self.cache_name, rel_obj)
        if rel_obj is None:
            raise self.remote_model.DoesNotExist
        else:
            return rel_obj

    def __set__(self, instance, value):
        setattr(instance, self.field_name, value.id)
        setattr(instance, self.cache_name, value)
class Document(models.Model):
    name = models.CharField(max_length=200)
    system_id = models.IntegerField()

    system = RemoteObject(System, 'system_db_name', 'system_id')
You may recognise that the RemoteObject class above implements Python's descriptor protocol; see here for more info:
https://docs.python.org/2/howto/descriptor.html
Example usage:
>>> doc = Document.objects.get(pk=1)
>>> print doc.system_id
3
>>> print doc.system.id
3
>>> print doc.system.name
'my system'
>>> other_system = System.objects.using('system_db_name').get(pk=5)
>>> doc.system = other_system
>>> print doc.system_id
5
Going further you could write a custom db router:
https://docs.djangoproject.com/en/dev/topics/db/multi-db/#using-routers
This would let you eliminate the using('system_db_name') calls in the code by routing all reads for the System model to the appropriate db.
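A minimal sketch of such a router, assuming the remote connection is defined under the alias 'system_db_name' and the router is registered via DATABASE_ROUTERS in settings (the exact allow_migrate signature depends on your Django version):

class SystemRouter(object):
    """Send all reads of the System model to the remote database."""

    def db_for_read(self, model, **hints):
        if model.__name__ == 'System':
            return 'system_db_name'
        return None  # no opinion for other models

    def db_for_write(self, model, **hints):
        return None  # the remote db is read-only, so never route writes there

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if db == 'system_db_name':
            return False  # never create/alter tables on the read-only remote db
        return None

With that in place, System.objects.get(...) would hit the remote database without an explicit using() call.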
I'd go for a method get_system(). So:
class Document(models.Model):
    ...
    def get_system(self):
        return System.objects.using('remote').get(system_id=self.system_id)
This is the simplest solution. Another possibility is to use PostgreSQL's foreign data wrapper (FDW) feature. By using FDW you can abstract the multi-db handling away from Django and do it inside the database - then you can run queries that need the document -> system relation.
Finally, if your use case allows it, just copying the system data periodically to the local db can be a good solution.
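If you go the copying route, a rough sketch of a periodic sync (assuming a System model definition exists locally and a table for it has been created in the 'default' database) could be run from a cron job or a management command:

def sync_systems():
    # Pull every row from the read-only remote db and upsert it locally.
    for remote in System.objects.using('remote').all():
        System.objects.using('default').update_or_create(
            pk=remote.pk,
            defaults={'name': remote.name, 'system_id': remote.system_id},
        )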
How can I build a Model that stores one field in the database, and then retrieves other fields from an API behind the scenes when necessary?
Details:
I'm trying to build a Model called Interviewer that stores an ID in the database, and then retrieves name from an external API. I want to avoid storing a copy of name in my app's database. I also want the fields to be retrieved in bulk rather than per model instance because these will be displayed in a paginated list.
My first attempt was to create a custom Model Manager called InterviewerManager that overrides get_queryset() in order to set name on the results like so:
class InterviewerManager(models.Manager):
    def get_queryset(self):
        query_set = super().get_queryset()
        for result in query_set:
            result.name = 'Mary'
        return query_set

class Interviewer(models.Model):
    # ID provided by API, stored in database
    id = models.IntegerField(primary_key=True, null=False)

    # Fields provided by API, not in database
    name = 'UNSET'

    # Custom model manager
    interviewers = InterviewerManager()
However, it seems like the hardcoded value of Mary is only present if the QuerySet is not chained with subsequent calls. I'm not sure why. For example, in the django shell:
>>> list(Interviewer.interviewers.all())[0].name
'Mary' # Good :)
>>> Interviewer.interviewers.all().filter(id=1).first().name
'UNSET' # Bad :(
My current workaround is to build a cache layer inside of InterviewerManager that the model accesses like so:
class InterviewerManager(models.Manager):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.api_cache = {}

    def get_queryset(self):
        query_set = super().get_queryset()
        for result in query_set:
            # Mock querying a remote API
            self.api_cache[result.id] = {
                'name': 'Mary',
            }
        return query_set

class Interviewer(models.Model):
    # ID provided by API, stored in database
    id = models.IntegerField(primary_key=True, null=False)

    # Custom model manager
    interviewers = InterviewerManager()

    # Fields provided by API, not in database
    @property
    def name(self):
        return Interviewer.interviewers.api_cache[self.id]['name']
However this doesn't feel like idiomatic Django. Is there a better solution for this situation?
Thanks
Why not just make the API call in the name property?
@property
def name(self):
    name = get_name_from_api(self.id)
    return name
If that isn't possible, check whether the API accepts a GET request with a list of ids so you can receive the data for all of them at once; otherwise the easy way is to do it in a loop.
I would also recommend building a so-called proxy where you load the records into a dataframe/dict, save this variable data (with pickle, for example) and use it when necessary. It reduces load times and is reasonably efficient.
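To keep the list paginated and still avoid one API call per row, you can fetch the names for a whole page in a single request and attach them before rendering. A rough sketch, where get_names_from_api is a hypothetical helper that takes a list of ids and returns an {id: name} dict:

def with_names(interviewers):
    # Materialise the page of instances, then make a single bulk API call.
    interviewers = list(interviewers)
    names = get_names_from_api([i.id for i in interviewers])  # hypothetical helper
    for interviewer in interviewers:
        interviewer.name = names.get(interviewer.id, 'UNSET')
    return interviewers

# e.g. in a view:
# page = with_names(Interviewer.interviewers.all()[:25])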
I'm trying to create an object from the console but I'm not sure how to set that up.
This is my model manager:
class MajorManager(models.Manager):
    def __str__(self):
        return self.name

    def createMajor(self, name):
        try:
            name = name.lower()
            major = self.create(name=name)
        except IntegrityError:
            print("This major has already been created")
And here is the model:
class Majors(models.Model):
    name = models.CharField(max_length=30, unique=True)

    objects = MajorManager()
Any help would be much appreciated.
You can go this route using Django's database API - check out the docs.
First create a shell:
python manage.py shell
Then you can import your models and do basic CRUD on them.
>>> from polls.models import Choice, Question # Import the model classes we just wrote.
# No questions are in the system yet.
>>> Question.objects.all()
<QuerySet []>
# Create a new Question.
# Support for time zones is enabled in the default settings file, so
# Django expects a datetime with tzinfo for pub_date. Use timezone.now()
# instead of datetime.datetime.now() and it will do the right thing.
>>> from django.utils import timezone
>>> q = Question(question_text="What's new?", pub_date=timezone.now())
# Save the object into the database. You have to call save() explicitly.
>>> q.save()
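Applied to your own model, the same shell session could call your custom manager method directly (here 'yourapp' is a placeholder for the app that holds the model):

>>> from yourapp.models import Majors
>>> Majors.objects.createMajor("Computer Science")  # stored lower-cased by the manager
>>> Majors.objects.count()
1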
Or, alternatively, you can try the dbshell route; here's the documentation.
This command assumes the programs are on your PATH so that a simple
call to the program name (psql, mysql, sqlite3, sqlplus) will find the
program in the right place. There’s no way to specify the location of
the program manually.
You can't use Django's ORM there though; it's pure SQL, so it would be statements like:
CREATE TABLE user (
    Id Int,
    Name Varchar
);
Despite numerous recipes and examples in peewee's documentation, I have not been able to find how to accomplish the following:
For finer-grained control, check out the Using context manager / decorator. This allows you to specify the database to use with a given list of models for the duration of the wrapped block.
I assume it would go something like...
db = MySQLDatabase(None)

class BaseModelThing(Model):
    class Meta:
        database = db

class SubModelThing(BaseModelThing):
    '''imagine all the fields'''
    class Meta:
        db_table = 'table_name'

runtime_db = MySQLDatabase('database_name.db', fields={'''imagine field mappings here'''}, **extra_stuff)

@Using(runtime_db, [SubModelThing])
@runtime_db.execution_context()
def some_kind_of_query():
    '''imagine the queries here'''
but I have not found examples, so an example would be the answer to this question.
Yeah, there's not a great example of using Using or the execution_context decorators, so the first thing is: don't use the two together. It doesn't appear to break anything, just seems to be redundant. Logically that makes sense as both of the decorators cause the specified model calls in the block to run in a single connection/transaction.
The only(/biggest) difference between the two is that Using allows you to specify the particular database that the connection will be using - useful for master/slave (though the Read slaves extension is probably a cleaner solution).
If you run with two databases and try using execution_context on the 'second' database (in your example, runtime_db) nothing will happen with the data. A connection will be opened at the start of the block and closed and the end, but no queries will be executed on it because the models are still using their original database.
The code below is an example. Every run should result in only 1 row being added to each database.
from peewee import *

db = SqliteDatabase('other_db')
db.connect()
runtime_db = SqliteDatabase('cmp_v0.db')
runtime_db.connect()

class BaseModelThing(Model):
    class Meta:
        database = db

class SubModelThing(BaseModelThing):
    first_name = CharField()

    class Meta:
        db_table = 'table_name'

db.create_tables([SubModelThing], safe=True)
SubModelThing.delete().where(True).execute()  # Cleaning out previous runs

with Using(runtime_db, [SubModelThing]):
    runtime_db.create_tables([SubModelThing], safe=True)
    SubModelThing.delete().where(True).execute()

@Using(runtime_db, [SubModelThing], with_transaction=True)
def execute_in_runtime(throw):
    SubModelThing(first_name='asdfasdfasdf').save()
    if throw:  # to demo transaction handling in Using
        raise Exception()

# Create an instance in the 'normal' database
SubModelThing.create(first_name='name')

try:  # Try to create but throw during the transaction
    execute_in_runtime(throw=True)
except:
    pass  # Failure is expected, no row should be added

execute_in_runtime(throw=False)  # Create a row in the runtime_db

print 'db row count: {}'.format(len(SubModelThing.select()))
with Using(runtime_db, [SubModelThing]):
    print 'Runtime DB count: {}'.format(len(SubModelThing.select()))
Does django enforce uniqueness for a primary key?
The documentation here seems to suggest so, but when I define a class as:
class Site(models.Model):
    id = models.IntegerField(primary_key=True)
and test this constraint in a test case:
class SiteTestCase(TestCase):
    def setUp(self):
        self.site = Site(id=0, name='Site')
        self.site.save()

    def tearDown(self):
        self.site.delete()

    def test_unique_id(self):
        with self.assertRaises(IntegrityError):
            badSite = Site(id=0, name='Bad Site')
            badSite.save()
            badSite.delete()
the test fails.
If I test on a normal field (primary_key=False, unique=True) then the exception is raised correctly. Setting unique=True on the id field does not change the result.
Is there something about primary_key fields that I'm missing here?
My database backend is MySQL, if that's relevant.
Your test method is wrong. What you're doing here is updating the existing instance since you're supplying an already used primary key. Change the save to a force_insert like so.
def test_unique_id(self):
    with self.assertRaises(IntegrityError):
        badSite = Site(id=0, name='Bad Site')
        badSite.save(force_insert=True)
        badSite.delete()
The django docs explain how django knows whether to UPDATE or INSERT. You should read that section.
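In short, reusing an existing primary key with a plain save() results in an UPDATE, which is why the test never hits the unique constraint. Roughly:

site = Site(id=0, name='Site')
site.save()                    # no row with pk=0 yet, so this is an INSERT

dupe = Site(id=0, name='Bad Site')
dupe.save()                    # pk=0 already exists, so this becomes an UPDATE
                               # (the name is overwritten, no IntegrityError)

dupe.save(force_insert=True)   # forces an INSERT and raises IntegrityError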
Are you aware that django already supports automatic primary keys? See the documentation for more of an explanation.
Is there any plugin or 3rd party backend to manage redis connections in Django, so the methods in view.py don't have to explicitly connect to redis for every request?
If not, how would you start implementing one? A new plugin? a new backend? a new django middleware?
Thank you.
I think the emerging standard for non-relational databases is django-nonrel. I don't know if django-nonrel is production ready or if it supports redis, but they have a guide on writing a custom no-sql backend.
Unfortunately, I don't think that writing support for redis in standard django is as easy as writing a DatabaseBackend. There's a lot in django's model mechanics and workflow that simply assumes an ACID database. What about syncdb? And what about QuerySets?
However, you may try to write a poor man's approach using models.Manager and a lot of tweaking on your model. For example:
# helper
def fill_model_instance(instance, values):
    """ Fills a model instance with the values from dict values """
    attributes = filter(lambda x: not x.startswith('_'), instance.__dict__.keys())
    for a in attributes:
        try:
            setattr(instance, a, values[a.upper()])
            del values[a.upper()]
        except:
            pass
    for v in values.keys():
        setattr(instance, v, values[v])
    return instance

class AuthorManager(models.Manager):
    # You may try to use the default methods.
    # But should be freaking hard...
    def get_query_set(self):
        raise NotImplementedError("Maybe you can write a non-relational QuerySet()!")

    def latest(self, *args, **kwargs):
        # redis latest query
        pass

    def filter(self, *args, **kwargs):
        # redis filter query
        pass

    # Custom methods that you may use, instead of rewriting
    # the default ones.
    def open_connection(self):
        # Open a redis connection
        pass

    def search_author(self, *args, **kwargs):
        self.open_connection()
        # Write your query. I don't know how this shiny non-sql works.
        # Assumes it returns a dict for every matched author.
        authors_list = [{'name': 'Leibniz', 'email': 'iinventedcalculus@gmail.com'},
                        {'name': 'Kurt Godel', 'email': 'self.consistent.error@gmail.com'}]
        return [fill_model_instance(Author(), author) for author in authors_list]

class Author(models.Model):
    name = models.CharField(max_length=255)
    email = models.EmailField(max_length=255)

    def save(self):
        raise NotImplementedError("TODO: write a redis save")

    def delete(self):
        raise NotImplementedError("TODO: write a redis delete")

    class Meta:
        managed = False
Please note that I've only made a sketch of how you can tweak the django models. I have not tested or run this code. I suggest you first investigate django-nonrel.