Caching data on backend side with Django (scalable applications)

Caching data on backend side with Django (scalable applications) - python

Suppose I need to build a web application where each client will be simulating their trading strategy using historical stock data. The data will be provided by a 3rd party vendor over the internet: for example, fetching historical data for a single stock based on stock ticker through HTTP call. Also, I am planning to use Django as a back-end framework.
Here is my question:
I would like to be able to prefetch and cache the data on the server side, so that each client's request would not need to do an HTTP call again, but get it from the shared resource. I guess, storing it in database, like SQL could be one solution. However, is there a way to use memory shared between clients in Django on the backend side? Any pointer or suggestion would be very helpful. Thanks.

This sounds like a fine thing to store in a share cache, like memcache or redis (or, yes, even an SQL database-backed cache).
You should read https://docs.djangoproject.com/en/dev/topics/cache/; this can explain how you can store the result of your HTTP call under a cache key and then retrieve it. The caching works the same regardless of what backend (memcache, redis, local memory, SQL DB) you use, so you can test this out with the local-memory cache or DB cache and, if you like it, move to a better solution like memcache.

There are many caching strategies you could use here, but a nice place to start rather then storing the data in a SQL database you could store the data in something like Memcached https://docs.djangoproject.com/en/dev/topics/cache/#memcached. Without more information I couldn't get more specific than that.

Related

Django and Rails with one common DB

I have earlier worked on Java+Spring to create a web-app.
I have to build a new web-app now.
It will have one centralized db.
There will be two different type of instance of web-app.
Web-App 1:
a) It would have nothing to UI render, no html,js etc.
b) All it need to give is some set of rest API which will
b.1) create some new entries in DB
b.2) modify some entries in DB
b.3) retrieve some of DB records in JSON format.
some frontend code ( doesn't belong to this app) will periodically fetch
this details.
c) it will be used by max by 100,000 people but at a given point of time,
we can expect about 1000 users logged in and doing whats being said in b)
Web-App2 :
a) It will have some dashboards
b) 90% of DB operations would be read operations
c) 10% of DB operations would be write/modify
d) There will be about 1000s of user of this system and at any given point of time
hardly 50-1000 people will be accessing it.
I am thinking of following.
Have Web-App 1 created in python+Django and Web-App 2 created in RoR.
I am planning to use to Dynamo DB and memcache.
Why two different frameworks?
1) So that I get to learn both of them
2) There have been concern about scalability in RoR (and I also know people claim its not there), Web-app 1 may have scaling needs in future.
My questions is Do you see any problem with this combination?
for example active records would want you to use specific namings format for your data base tables? Are there any other concerns similar to this?
Anyone else who have used similar technology stack?
both frameworks are full stack framework and and provide MVC, templating, unit testing, security, db migration, caching, security, ORMs.

For my startup, we also needed to put out a full fleshed website along with an API. We are also using DynamoDB for storing most of the data and are only using MySQL for session info.
I opted to use Ruby on Rails for the Webapp and Sinatra for the API. If you're criteria is simply learning as many new things as possible, then it would make sense to opt for relatively different stacks (django/python and RoR). In our case, we went with sinatra because it's essentially a very lightweight wrapper around Rack and perfect for an API which essentially receives requests, calls one or more services or does some processing and hands out a formatted response. While I don't see any problem with using python/django instead of sinatra, in our case the benefit was having to spend less time working with a different language.
Also, scalability in rails is a bit of an iffy subject. In the end, it's about how you use it. We've had no issues scaling rails with unicorn and nginx. Our business logic is all in the API service and the rails server as well uses the API for most of the work. This means we don't use active record on rails and the website is just another consumer for our API which does all the heavy lifting whether the request comes from an app or the website. Using MySQL for the session store ensures we can route requests to any of the application servers without having to worry about always routing requests from the same client to the same server every time. This allows us to ramp up and down easily only considering the amount of traffic we're getting.
At the time we started working on this, there wasn't an ORM for dynamo db which looked and felt just like active record, so we ended up writing a few high level classes of our own to handle storage and retrieval of models on DynamoDb. Considering DynamoDB is not tailored for scans or joins, this didn't take a lot of effort since we were almost always doing lookups based on keys and ranges. This meant we didn't really need a replacement for active record since the real strength of active record is being able to intuitively do joins, etc. by convention.
DynamoDB does have it's limitations though and you might find yourself in situations where you will need to scan a large number of records. In our case, we also use CloudSearch to index some important info and use it as a fallback for cases when we need to do text based searches which need to scan all our data.

standard way to handle user session in tornado

So, in order to avoid the "no one best answer" problem, I'm going to ask, not for the best way, but the standard or most common way to handle sessions when using the Tornado framework. That is, if we're not using 3rd party authentication (OAuth, etc.), but rather we have want to have our own Users table with secure cookies in the browser but most of the session info stored on the server, what is the most common way of doing this? I have seen some people using Redis, some people using their normal database (MySQL or Postgres or whatever), some people using memcached.
The application I'm working on won't have millions of users at a time, or probably even thousands. It will need to eventually get some moderately complex authorization scheme, though. What I'm looking for is to make sure we don't do something "weird" that goes down a different path than the general Tornado community, since authentication and authorization, while it is something we need, isn't something that is at the core of our product and so isn't where we should be differentiating ourselves. So, we're looking for what most people (who use Tornado) are doing in this respect, hence I think it's a question with (in theory) an objectively true answer.
The ideal answer would point to example code, of course.

Here's how it seems other micro frameworks handle sessions (CherryPy, Flask for example):
Create a table holding session_id and whatever other fields you'll want to track on a per session basis. Some frameworks will allow you to just store this info in a file on a per user basis, or will just store things directly in memory. If your application is small enough, you may consider those options as well, but a database should be simpler to implement on your own.
When a request is received (RequestHandler initialize() function I think?) and there is no session_id cookie, set a secure session-id using a random generator. I don't have much experience with Tornado, but it looks like setting a secure cookie should be useful for this. Store that session_id and associated info in your session table. Note that EVERY user will have a session, even those not logged in. When a user logs in, you'll want to attach their status as logged in (and their username/user_id, etc) to their session.
In your RequestHandler initialize function, if there is a session_id cookie, read in what ever session info you need from the DB and perhaps create your own Session object to populate and store as a member variable of that request handler.
Keep in mind sessions should expire after a certain amount of inactivity, so you'll want to check for that as well. If you want a "remember me" type log in situation, you'll have to use a secure cookie to signal that (read up on this at OWASP to make sure it's as secure as possible, thought again it looks like Tornado's secure_cookie might help with that), and upon receiving a timed out session you can re-authenticate a new user by creating a new session and transferring whatever associated info into it from the old one.

Tornado designed to be stateless and don't have session support out of the box.
Use secure cookies to store sensitive information like user_id.
Use standard cookies to store not critical information.
For storing large objects - use standard scheme - MySQL + memcache.

The key issue with sessions is not where to store them, is to how to expire them intelligently. Regardless of where sessions are stored, as long as the number of stored sessions is reasonable (i.e. only active sessions plus some surplus are stored), all this data is going to fit in RAM and be served fast. If there is a lot of old junk you may expect unpredictable delays (the need to hit the disk to load the session).

There isn't anything built directly into Tornado for this purpose. As others have commented already, Tornado is designed to be a very fast async framework. It is lean by design. However, it is possible to hook in your own session management capability. You need to add a preamble section to each handler that would create or grab a session container. You will need to store the session ID in a cookie. If you are not strictly HTTPS then you will want to use a secure cookie. The session persistence can be any technology of your choosing such as Redis, Postgres, MySQL, a file store, etc...
There is a Github project that provides session management for Tornado. Even if you decide not to use it, it can provide insight into how to structure your own session management. The Github project is called dustdevil. Full disclosure - we created this several years ago but find it very easy to use and have it in active use today.

Django Passing data between views

I was wondering what is the 'best' way of passing data between views. Is it better to create invisible fields and pass it using POST or should I encode it in my URLS? Or is there a better/easier way of doing this? Sorry if this question is stupid, I'm pretty new to web programming :)
Thanks

There are different ways to pass data between views. Actually this is not much different that the problem of passing data between 2 different scripts & of course some concepts of inter-process communication come in as well. Some things that come to mind are -
GET request - First request hits view1->send data to browser -> browser redirects to view2
POST request - (as you suggested) Same flow as above but is suitable when more data is involved
Django session variables - This is the simplest to implement
Client-side cookies - Can be used but there is limitations of how much data can be stored.
Shared memory at web server level- Tricky but can be done.
REST API's - If you can have a stand-alone server, then that server can REST API's to invoke views.
Message queues - Again if a stand-alone server is possible maybe even message queues would work. i.e. first view (API) takes requests and pushes it to a queue and some other process can pop messages off and hit your second view (another API). This would decouple first and second view API's and possibly manage load better.
Cache - Maybe a cache like memcached can act as mediator. But then if one is going this route, its better to use Django sessions as it hides a whole lot of implementation details but if scale is a concern, memcached or redis are good options.
Persistent storage - store data in some persistent storage mechanism like mysql. This decouples your request taking part (probably a client facing API) from processing part by having a DB in the middle.
NoSql storages - if speed of writes are in other order of hundreds of thousands per sec, then MySql performance would become bottleneck (there are ways to get around by tweaking mysql config but its not easy). Then considering NoSql DB's could be an alternative. e.g: dynamoDB, Redis, HBase etc.
Stream Processing - like Storm or AWS Kinesis could be an option if your use-case is real-time computation. In fact you could use AWS Lambda in the middle as a server-less compute module which would read off and call your second view API.
Write data into a file - then the next view can read from that file (real ugly). This probably should never ever be done but putting this point here as something that should not be done.
Cant think of any more. Will update if i get any. Hope this helps in someway.

How can I use redis with Django?

I've heard of redis-cache but how exactly does it work? Is it used as a layer between django and my rdbms, by caching the rdbms queries somehow?
Or is it supposed to be used directly as the database? Which I doubt, since that github page doesn't cover any login details, no setup.. just tells you to set some config property.

This Python module for Redis has a clear usage example in the readme: http://github.com/andymccurdy/redis-py
Redis is designed to be a RAM cache. It supports basic GET and SET of keys plus the storing of collections such as dictionaries. You can cache RDBMS queries by storing their output in Redis. The goal would be to speed up your Django site. Don't start using Redis or any other cache until you need the speed - don't prematurely optimize.

Just because Redis stores things in-memory does not mean that it is meant to be a cache. I have seen people using it as a persistent store for data.
That it can be used as a cache is a hint that it is useful as a high-performance storage. If your Redis system goes down though you might loose data that was not been written back onto the disk again. There are some ways to mitigate such dangers, e.g. a hot-standby replica.
If your data is 'mission-critical', like if you run a bank or a shop, Redis might not be the best pick for you. But if you write a high-traffic game with persistent live data or some social-interaction stuff and manage the probability of data-loss to be quite acceptable, then Redis might be worth a look.
Anyway, the point remains, yes, Redis can be used as a database.

Redis is basically an 'in memory' KV store with loads of bells and whistles. It is extremely flexible. You can use it as a temporary store, like a cache, or a permanent store, like a database (with caveats as mentioned in other answers).
When combined with Django the best/most common use case for Redis is probably to cache 'responses' and sessions.
There's a backend here https://github.com/sebleier/django-redis-cache/ and excellent documentation in the Django docs here: https://docs.djangoproject.com/en/1.3/topics/cache/ .
I've recently started using https://github.com/erussell/django-redis-status to monitor my cache - works a charm. (Configure maxmemory on redis or the results aren't so very useful).

You can also use Redis as a queue for distributed tasks in your Django app. You can use it as a message broker for Celery or Python RQ.

Redis as a Primary database
Yes you can use Redis key-value store as a primary database.
Redis not only store key-value pairs it also support different data structures like
List
Set
Sorted set
Hashes
Bitmaps
Hyperloglogs
Redis Data types Official doc
Redis is in memory key-value store so you must aware of it if Redis server failure occurred your data will be lost.
Redis can also persist data check official doc.
Redis Persistence Official doc
Redis as a Cache
Yes Redis reside between in Django and RDBMS.
How it works
given a URL, try finding that page in the cache if the page is in the cache: return the cached page else: generate the page save the generated page in the cache (for next time) return the generated page
Django’s cache framework Official Doc
How can use Redis with Django
We can use redis python client redis-py for Django application.
Redis python client redis-py Github
We can use Django-redis for django cache backend.
Django-redis build on redis-py and added extra features related to django application.
Django-redis doc Github
Other libraries also exists.
Redis use cases and data types
Some use cases
Session cache
Real time analytics
Web caching
Leaderboards
Top Redis Use Cases by Core Data structure types
Big Tech companies using Redis
Twitter GitHub Weibo Pinterest Snapchat Craigslist Digg StackOverflow Flickr

Best practice: How to persist simple data without a database in django?

I'm building a website that doesn't require a database because a REST API "is the database". (Except you don't want to be putting site-specific things in there, since the API is used by mostly mobile clients)
However there's a few things that normally would be put in a database, for example the "jobs" page. You have master list view, and the detail views for each job, and it should be easy to add new job entries. (not necessarily via a CMS, but that would be awesome)
e.g. example.com/careers/ and example.com/careers/77/
I could just hardcode this stuff in templates, but that's no DRY- you have to update the master template and the detail template every time.
What do you guys think? Maybe a YAML file? Or any better ideas?
Thx

Why not still keep it in a database? Your remote REST store is all well and funky, but if you've got local data, there's nothing (unless there's spec saying so) to stop you storing some stuff in a local db. Doesn't have to be anything v glamorous - could be sqlite, or you could have some fun with redis, etc.

You could use the Memcachedb via the Django cache interface.
For example:
Set the cache backend as memcached in your django settings, but install/use memcachedb instead.
Django can't tell the difference between the two because the provide the same interface (at least last time I checked).
Memcachedb is persistent, safe for multithreaded django servers, and won't lose data during server restarts, but it's just a key value store. not a complete database.

Some alternatives built into the Python library are listed in the Data Persistence chapter of the documentation. Still another is JSON.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.