How do I structure a database cache (memcached/Redis) for a Python web app with many different variables for querying?

For my app, I am using Flask, however the question I am asking is more general and can be applied to any Python web framework.
I am building a comparison website where I can update details about products in the database. I want to structure my app so that 99% of users who visit my website will never need to query the database, where information is instead retrieved from the cache (memcached or Redis).
I require my app to be realtime, so any update I make to the database must be instantly available to any visitor to the site. Therefore I do not want to cache views/routes/html.
I want to cache the entire database. However, because there are so many different variables when it comes to querying, I am not sure how to structure this. For example, if I were to cache every query and then later need to update a product in the database, I would basically need to flush the entire cache, which isn't ideal for a large web app.
What I would prefer is to cache individual rows in the database. The problem is, how do I structure this so I can flush the cache appropriately when an update is made to the database? Also, how can I map all of this together from the cache?
I hope this makes sense.

I had this exact question myself, with a PHP project, though. My solution was to use ElasticSearch as an intermediate cache between the application and database.
The trick to this is the ORM. I designed it so that when Entity.save() is called it is first stored in the database, then the complete object (with all references) is pushed to ElasticSearch and only then the transaction is committed and the flow is returned back to the caller.
This way I maintained full functionality of a relational database (atomic changes, transactions, constraints, triggers, etc.) and still have all entities cached with all their references (parent and child relations) together with the ability to invalidate individual cached objects.
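A minimal sketch of that write-through ordering, using sqlite3 and a plain dict standing in for the relational database and ElasticSearch (the class and field names here are hypothetical, not from the original project):

```python
import json
import sqlite3

class DictCache:
    """Stands in for ElasticSearch: stores complete serialized objects by key."""
    def __init__(self):
        self.store = {}
    def index(self, key, doc):
        self.store[key] = json.dumps(doc)
    def get(self, key):
        raw = self.store.get(key)
        return json.loads(raw) if raw is not None else None
    def invalidate(self, key):
        # Individual cached objects can be dropped without flushing everything.
        self.store.pop(key, None)

class Product:
    def __init__(self, pk, name, price):
        self.pk, self.name, self.price = pk, name, price

    def save(self, db, cache):
        # 1. Write to the database inside the open transaction.
        db.execute(
            "INSERT OR REPLACE INTO product (id, name, price) VALUES (?, ?, ?)",
            (self.pk, self.name, self.price),
        )
        # 2. Push the complete object to the cache.
        cache.index(f"product:{self.pk}",
                    {"id": self.pk, "name": self.name, "price": self.price})
        # 3. Only then commit the transaction and return to the caller.
        db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
cache = DictCache()
Product(1, "Widget", 9.99).save(db, cache)
print(cache.get("product:1")["name"])  # Widget
```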
Hope this helps.

So a free eBook called "Redis in Action" by Josiah Carlson answered all of my questions. It is quite long, but after reading through it, I have a fairly solid understanding of how to structure a caching architecture. It gives real-world examples, such as a social network and a shopping site with tons of traffic. I will need to read through it again once or twice to fully understand it. A great book!
Link: Redis in Action

Related

Handling repetitive content within django apps

I am currently building a tool in Django for managing the design information within an engineering department. The idea is to have a common catalogue of items accessible to all projects. However, the projects would be restricted based on user groups.
For each project, you can import items from the catalogue and change them within the project. There is a requirement that each project must be linked to a different database.
I am not entirely sure how to approach this problem. From what I have read, the solution I came up with is to have multiple Django apps. One represents the common catalogue of items (linked to its own database), and then there is an app for each project (which can write to and read from its own database, but can additionally read from the common catalogue database). In this way, I can restrict which user can access which database/project. However, the problem with this solution is that it is not DRY. All projects look the same: same models, same forms, same templates. They are just linked to different databases, and I do not know how to do this in a smart way (without copy-pasting entire files, because I think managing that would be a pain).
I was thinking that this could be avoided by changing the database label when doing queries (employing the using attribute) depending on the group of the authenticated user. The problem with this is that a user can have access to multiple projects. So, I am again at a loss.
It looks to me like all you need is a single application that manages its access properly.
If the requirement is to have separate DBs then I will not argue with that, but ... there is always a small chance that separate tables in one DB is what they will accept.
Django apps don't segregate objects, they are a way of structuring your code base. The idea is that an app can be re-used in other projects. Having a separate app for your catalogue of items and your projects is a good idea, but having them together in one is not a problem if you have a small codebase.
If I have understood your post correctly, what you want is for the databases of different departments to be separate. This is essentially a multi-tenancy question which is a big topic in itself, there are a few options:
Code separation - all of your projects/departments exist in a single database and schema but are separated in code that filters by department depending on who the end user is (literally by using Django's .filter()). This is easy to do but there is a risk that data could be leaked to the wrong user if you get your code wrong. I would recommend this one for your use-case.
Schema separation - you are still using a single database but each department has its own schema. You would need to use PostgreSQL for this, but once a schema has been set there is far less chance that data is going to be visible to the wrong user. There are some Django libraries such as django-tenants that can do a lot of the heavy lifting.
Database separation - each department has their own database. There is even less of a chance that data will be leaked but you have to manage multi-databases and it is more difficult to scale. You can manage this through django as there is support for multi-databases.
Application separation - each department not only has their own database but their own application instance. The separation is absolute but again you need to manage multiple applications on a host like Heroku, which is even less scalable.
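The code-separation option can be sketched framework-free; in real Django the list comprehension below would be an Item.objects.filter(project__in=...) call, and the item/project names are made up for illustration:

```python
# Option 1 (code separation): every query is scoped to the projects the
# current user belongs to, so rows from other projects are never visible.
items = [
    {"id": 1, "name": "Bolt M6", "project": "bridge"},
    {"id": 2, "name": "Beam 200", "project": "bridge"},
    {"id": 3, "name": "Bolt M6", "project": "tunnel"},
]

def items_for(user_projects):
    # In Django: Item.objects.filter(project__in=user_projects).
    # Forgetting this filter anywhere is exactly the leak risk noted above.
    return [row for row in items if row["project"] in user_projects]

print([row["id"] for row in items_for({"bridge"})])  # [1, 2]
```

Note that this also answers the multi-project concern: a user in several groups simply gets a larger `user_projects` set, with no database switching needed.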

Django + Scrapy multi scrapers architecture

Recently I took over a Django project, one component of which is a set of Scrapy scrapers (a lot of them; they are core functionality). It is worth adding that the scrapers simply feed the database several times a day and the Django web app uses this data.
The scrapers have direct access to the Django models, but in my opinion this is not the best idea (mixed responsibilities: Django should act as a web app, not also run scrapers, shouldn't it?). For example, after such a split the scrapers could run serverless, saving money by being spawned only when needed.
I see the scrapers at least as a separate component in the architecture. But if I separated them from the Django website, I would need to populate the DB there as well, and a change to a model in either the Django web app or the scraping app would require a corresponding change in the other.
I haven't really seen articles about splitting apps like this.
What are the best practices here? Is it worth splitting? How would you organise deployment to a cloud solution (e.g. AWS)?
Thank you
Well, this is a big discussion and I have the same "good problem".
Short answer:
I suggest that if you want to separate it, you can separate the logic from the data using different schemas. I have done it before and it is a good approach.
Long answer:
The questions are:
Once you gather information from the scrapers, are you doing something with it (aggregation, treatment, or anything else)?
If the answer is yes, you can separate it into 2 DBs: one with the raw information and the other with the treated information (which will be the one shared with Django).
If the answer is no, I don't see any reason to separate it. In the end, Django is only the visualizer of the data.
Is the Django website using a lot of stored data that, for the sake of single responsibility, you want to separate from the scraped data?
If the answer is yes, separate it by schemas or even DBs.
If the answer is no, you can store it in the same DB as Django. In the end, the important data will be the extracted data. Django may have a configuration DB or other extra data to manage the web app, but the big percentage of the DB will be the data crawled/treated. It depends how much it will cost you to separate and maintain it. If I were doing it from the beginning, I would do it separately.
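A toy sketch of the raw/treated split described above, using sqlite3 in place of real databases (table names and the price format are hypothetical); the pipeline class mimics the shape of a Scrapy item pipeline's process_item:

```python
import sqlite3

# Two-stage layout: scrapers write raw rows, a treatment step populates
# the table the Django web app actually reads from.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_items (url TEXT, price TEXT)")
db.execute("CREATE TABLE products (url TEXT, price REAL)")

class RawWriterPipeline:
    """Scrapy-style pipeline: process_item only ever touches the raw table."""
    def process_item(self, item, spider=None):
        db.execute("INSERT INTO raw_items VALUES (?, ?)",
                   (item["url"], item["price"]))
        return item

def treat():
    # Aggregation/cleanup step: parse prices, then expose them to the web app.
    for url, price in db.execute("SELECT url, price FROM raw_items"):
        db.execute("INSERT INTO products VALUES (?, ?)",
                   (url, float(price.strip("$"))))

RawWriterPipeline().process_item({"url": "http://example.com/p/1", "price": "$9.99"})
treat()
print(db.execute("SELECT price FROM products").fetchone()[0])  # 9.99
```

With this split, the scrapers only need to agree with the web app on the treated schema, so a model change on the Django side does not ripple into every spider.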

microservices and multiple databases

I have written microservices for auth, location, etc.
All of the microservices have different databases, and some data, e.g. location, is present in all of the databases for these services. When any of my projects needs a user's location, it first looks in the cache; if it is not found, it hits the database. So far so good. Now, when a location is changed in any of my databases, I need to update it in the other databases as well as update my cache.
Currently I made a model (called subscription) with a URL as its field; whenever a location is changed in any database, an object of this subscription is created. A periodic task checks the subscription model, and when it finds such objects it hits the APIs of the other services, updates the location, and updates the cache.
I am wondering if there is any better way to do this?
"I am wondering if there is any better way to do this?"
"Better" is entirely subjective; if it meets your needs, it's fine.
Something to consider, though: don't store the same information in more than one place.
If you need an address, look it up from the service that provides addresses, every time.
This may be a performance hit, but it eliminates the problem of replicating the data everywhere.
Another option would be a more proactive approach, as suggested in the comments.
Instead of creating a task list of changes and processing it periodically, send a message across RabbitMQ immediately when the change happens. Let every service that needs to know get a copy of the message and update its own cache of info.
Just remember, though: every time you have more than one copy of the information, you reduce the "correctness" of the system as a whole. It will always be possible for the information found in one of your apps to be out of date, because it did not get an update from the official source.
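The message-based approach can be sketched with an in-process broker standing in for RabbitMQ (in production you would publish through a client such as pika, and the service/cache names below are made up):

```python
# In-process stand-in for a RabbitMQ fanout exchange: when the location
# service commits a change it publishes immediately, and every subscriber
# updates its own cache instead of polling a subscription table.
class Broker:
    def __init__(self):
        self.subscribers = []
    def subscribe(self, callback):
        self.subscribers.append(callback)
    def publish(self, message):
        for cb in self.subscribers:
            cb(message)

broker = Broker()
auth_cache, billing_cache = {}, {}
broker.subscribe(lambda msg: auth_cache.update({msg["user"]: msg["location"]}))
broker.subscribe(lambda msg: billing_cache.update({msg["user"]: msg["location"]}))

# The location service publishes the moment the change is committed:
broker.publish({"user": 42, "location": "Berlin"})
print(auth_cache[42], billing_cache[42])  # Berlin Berlin
```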

Caching a static Database table in Django

I have a Django web application that is currently live and receiving a lot of queries. I am looking for ways to optimize its performance and one area that can be improved is how it interacts with its database.
In its current state, each request to a particular view loads an entire database table into a pandas dataframe, against which queries are done. This table consists of over 55,000 rows of text data (co-ordinates mostly).
To avoid needless queries, I have been advised to cache the table in memory, having it cached the first time it is loaded. This will remove some overhead on the DB side of things. I've never used this feature of Django before, so I am a bit lost.
The Django manual does not seem to have a concrete implementation of what I want to do. Would it be a good idea to just store the entire table in memory or would storing it in a file be a better idea?
I had a similar problem and django-cache-machine worked like a charm. It uses the Django caching features to cache the results of your queries. It is very easy to set up (assuming you have already configured Django's cache backend):
pip install django-cache-machine
Then in the model you want to cache:
from django.db import models
from caching.base import CachingManager, CachingMixin

class MyModel(CachingMixin, models.Model):
    objects = CachingManager()
And that's it, your queries will be cached.
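If you would rather not add a dependency, the "load once on first use" idea from the question can also be sketched framework-free; in Django the equivalent would be the low-level cache API (cache.get_or_set), and the loader below is a hypothetical stand-in for the 55,000-row query:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_table():
    # Hypothetical loader; in the question this is the 55,000-row query
    # that currently runs on every request.
    return [{"id": i, "x": i * 0.5} for i in range(3)]

rows = load_table()   # first call hits the database
rows2 = load_table()  # subsequent calls return the same cached list
print(rows is rows2)  # True
```

The trade-off versus a shared cache backend is that this copy lives per process, so each worker loads it once and an update requires calling load_table.cache_clear() (or restarting the workers).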

How to cache an (almost) read-only Flask web app?

I have a Flask web app that has no registered users, but its database is updated daily (therefore the content only changes once a day).
It seems to me the best choice would be to cache the entire website once a day and serve everything from the cache.
I tried with Flask Cache, but a dynamic page is created and then cached for every different user-session, which is clearly not ideal since the content is always the same no matter who's browsing the website.
Do you know how can I do better, either with Flask Cache or using something else?
Perhaps use an in-memory SQLite database? Will look and feel like any regular db, but with memory access speeds.
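A minimal sketch of that suggestion with the standard library (the table and column names are made up); check_same_thread=False lets a threaded Flask dev server share the single in-memory connection, which is reasonable here because the data is read-only between the daily rebuilds:

```python
import sqlite3

# Rebuild this in-memory copy once a day, then serve all reads from it
# at memory speed instead of hitting the real database per request.
mem = sqlite3.connect(":memory:", check_same_thread=False)
mem.execute("CREATE TABLE pages (slug TEXT PRIMARY KEY, body TEXT)")
mem.executemany("INSERT INTO pages VALUES (?, ?)",
                [("home", "<h1>Home</h1>"), ("about", "<h1>About</h1>")])
mem.commit()

def get_page(slug):
    # What a Flask view would call instead of querying the real database.
    row = mem.execute("SELECT body FROM pages WHERE slug = ?", (slug,)).fetchone()
    return row[0] if row else None

print(get_page("about"))  # <h1>About</h1>
```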
A couple of years ago, I wrote an in-memory database which I called littletable. Tables are represented as lists of objects. Selects and queries are normally done by simple list scans, but common object properties can be indexed. Tables can be joined or pivoted.
The main difference in the littletable model is that there is no separate concept of a table vs. a results list. The result of any query or join is another table. Tables can also store namedtuples and a littletable-defined type called a DataObject. Tables can be imported/exported to CSV files to persist any updates.
There is at least one website that uses littletable to maintain its mostly-static product catalog. You might also find littletable useful for prototyping before creating actual tables in a more common database. Here's a link to the online docs.
