Efficient multi-user time-series storage for a Django web app - python

I'm developing a Django app. Use-case scenario is this:
50 users, each of whom can store up to 300 time series, each with around 7,000 rows.
Each user can ask at any time to retrieve all of their 300 time series and, for each of them, to perform some advanced data analysis on the last N rows. The data analysis cannot be done in SQL, only in Pandas, where it doesn't take much time... but retrieving 300,000 rows into separate dataframes does!
Users can also ask for the results of some analyses that can be performed in SQL (like aggregation + sum by date), and those are considerably faster (to the point where I wouldn't be writing this post if that was all of it).
Browsing and thinking around, I've figured that storing time series in SQL is not a good solution (read here).
Ideal deploy architecture looks like this (each bucket is a separate server!):
Problem: time series in SQL are too slow to retrieve in a multi-user app.
Researched solutions (from this article):
PyStore: https://github.com/ranaroussi/pystore
Arctic: https://github.com/manahl/arctic
Here are some problems:
1) Although these solutions are massively faster at pulling a time series of millions of rows into a single dataframe, I might need to pull around 500,000 rows into 300 different dataframes. Would that still be as fast?
This is the current db structure I'm using:
class TimeSerie(models.Model):
    ...

class TimeSerieRow(models.Model):
    date = models.DateField()
    timeserie = models.ForeignKey(TimeSerie, on_delete=models.CASCADE)
    number = ...
    another_number = ...
And this is the bottleneck in my application:
for t in TimeSerie.objects.filter(user=user):
    q = TimeSerieRow.objects.filter(timeserie=t).order_by("date")
    q = q.filter( ... time filters ...)
    df = pd.DataFrame(q.values())
    # ... analysis on df
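For what it's worth, one way to attack this bottleneck without changing the storage layer is to issue a single query for all of the user's rows and split the result into per-series dataframes with a pandas groupby, instead of 300 round trips. A minimal sketch (the query result is simulated with a plain list of dicts; the Django lookup shown in the comment follows the models above but is an untested assumption):

```python
import pandas as pd

# Simulated result of ONE query for all of the user's rows, e.g. in Django:
#   TimeSerieRow.objects.filter(timeserie__user=user)
#       .order_by("timeserie_id", "date")
#       .values("timeserie_id", "date", "number")
rows = [
    {"timeserie_id": 1, "date": "2020-01-01", "number": 1.0},
    {"timeserie_id": 1, "date": "2020-01-02", "number": 2.0},
    {"timeserie_id": 2, "date": "2020-01-01", "number": 3.0},
]
df = pd.DataFrame(rows)

# One in-memory groupby split replaces 300 separate queries.
frames = {ts_id: g for ts_id, g in df.groupby("timeserie_id")}
print(len(frames))  # 2
```

This keeps PostgreSQL as the single shared store; whether it is fast enough depends on how large the combined result set is.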
2) Even if PyStore or Arctic can do that faster, it would mean I'd lose the ability to decouple my db from the Django instances, effectively using the resources of one machine much better, but being stuck with a single machine and not being scalable (or having to use as many separate databases as machines). Can PyStore/Arctic avoid this and provide an adapter for remote storage?
Is there a Python/Linux solution that can solve this problem? Which architecture can I use to overcome it? Should I drop the scalability of my app and/or accept that for every N new users I'll have to spawn a separate database?

The article you refer to in your post is probably the best answer to your question. Clearly good research, and a few good solutions are being proposed (don't forget to take a look at InfluxDB).
Regarding the decoupling of the storage solution from your instances, I don't see the problem:
Arctic uses MongoDB as a backing store
PyStore uses a file system as a backing store
InfluxDB is a database server in its own right
So as long as you decouple the backing store from your instances and make it shared among instances, you'll have the same setup as for your PostgreSQL database: MongoDB or InfluxDB can run on a separate centralised instance, and the file storage for PyStore can be shared, e.g. using a shared mounted volume. The Python libraries that access these stores of course run on your Django instances, just like psycopg2 does.

Related

Extracting data continuously from RDS MySQL schemas in parallel

I have a requirement to extract data from an Amazon Aurora RDS instance and load it into S3 to build a data lake for analytics. There are multiple schemas/databases in one instance, and each schema has a similar set of tables. I need to pull selected columns from these tables for all schemas in parallel. This should happen in near real time, capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them, since the plan is to keep the solution cloud-platform independent, as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I learned it doesn't support JDBC as a source in Structured Streaming. I read about multi-threading and multiprocessing techniques in Python but still have to assess whether they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background, running continuously in defined cycles, say every 5 minutes). Data synchronization between the RDS tables and S3 is also a crucial aspect to consider.
To say more about the data in the source tables: they have an auto-increment ID field, but the IDs are not sequential and may be missing a few numbers where rows were removed due to the inactivity of the corresponding entity, say customers. Not all columns of a record need to be pulled, only a few, which would have been predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now I'm quite confused about which approach to use and how to implement the solution once decided. Hence I seek the help of people who have dealt with, or know of, a solution to this problem statement. I'm happy to provide more info if it is required to get to the right solution. Any help would be greatly appreciated.
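One common pattern for the periodic pull described above is a high-water-mark query on the auto-increment ID: each cycle fetches only rows with an ID greater than the largest one seen so far. A rough sketch, using sqlite3 as a stand-in for the Aurora/MySQL connection (the customers table and its columns are invented for illustration):

```python
import sqlite3

def pull_new_rows(conn, last_seen_id, columns=("id", "name", "email")):
    """Fetch only rows added since the previous cycle, using the
    auto-increment id as a high-water mark. Gaps in the id sequence
    (deleted rows) are harmless: we only compare against the maximum."""
    query = "SELECT {} FROM customers WHERE id > ? ORDER BY id".format(
        ", ".join(columns))
    rows = conn.execute(query, (last_seen_id,)).fetchall()
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

# Demo with an in-memory database standing in for one Aurora schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "a", "a@x.com"), (3, "b", "b@x.com"), (7, "c", "c@x.com")])

batch, watermark = pull_new_rows(conn, 0)   # first cycle: everything
print(len(batch), watermark)                # 3 7
```

In the threaded design described above, each daemon thread would keep its own connection and watermark per schema. Note that an ID watermark only captures inserts; picking up updates and deletes requires something like an updated_at column or CDC on the MySQL binlog.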

Recommendation for manipulating data with Python vs. SQL for a Django app

Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean I will need more tables (because I will have grouped the data in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored to the exact queries you want to run) but lose flexibility (if you need to add more queries, the schema might not match so well, or even not match at all, in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema). With the second you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data alongside each other.
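As an illustration of keeping both, a pre-aggregated summary table can be rebuilt from the normalized one whenever the raw data is reloaded. A sketch with sqlite3 (the sales tables and their columns are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized table: one row per individual sale.
conn.execute("CREATE TABLE sales (customer TEXT, part TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("acme", "bolt", 10.0), ("acme", "nut", 5.0), ("globex", "bolt", 7.5)],
)

# Denormalized table: per-customer totals pre-computed for the dashboard.
conn.execute("CREATE TABLE sales_by_customer AS "
             "SELECT customer, SUM(amount) AS total FROM sales GROUP BY customer")

print(dict(conn.execute("SELECT customer, total FROM sales_by_customer")))
```

The summary table can be dropped and recreated whenever the normalized data is reloaded, so the flexibility of the normalized schema is kept while the dashboard reads stay cheap.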
As a side note: the Django ORM is indeed mostly an "80/20" tool - it's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and then it becomes a bit of a PITA indeed - but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
Oh, and yes: your problem is nothing new - you may want to read about OLAP.

Fast database solutions for Python Programming

Situation: Need to deal with large amounts of data (~260MB CSV datafile of about 50B of data per line)
Problem: If I just read from the file every time I need to deal with it, it will take a long time. So I decided to push everything into a database. I need a fast database infrastructure to handle the data as I need to do a lot of reading and writing.
Question: What are the faster choices for database in Python?
Additional information 1: The data comprises 3 columns and I do not see myself needing any more than that. Would this mean that a NoSQL database is preferred?
Additional information 2: However, if in the future I do need more than one database working together, would it be better to go for a SQL database?
Additional information 3: I think it would help to mention that I am looking at a few different DBs (MongoDB, SQLite, TinyDB), but do suggest other DBs that you know to be faster.
I've experienced this same situation many times and wanted something faster than a typical relational database. Redis is a very fast and scalable key/value database. You can get started quickly using the Popoto ORM.
Here is an example:
import popoto

class City(popoto.Model):
    id = popoto.UniqueKeyField()
    name = popoto.KeyField()
    description = popoto.Field()

for line in open("cities.csv"):
    csv_row = line.split('\t')
    City.create(
        id=csv_row[0],
        name=csv_row[1],
        description=csv_row[2]
    )

new_york = City.query.get(name="New York")
This is the absolute fastest way to store and retrieve data without having to learn the nuances of a new database system.
Keep in mind that if your database grows beyond 5GB, an in-memory database like Redis can start to become expensive compared to slower disk-based databases like Postgres or MySQL
full disclosure: I help maintain the open source Popoto project

Django application having in memory a big Panda object shared across all requests?

I have developed a Shiny application. When it starts it loads, ONCE, some datatables - around 4 GB of them. Then people connecting to the app can use the interface and play with those datatables.
This application is nice but has some limitations. That's why I am looking for another solution.
My idea is to have Pandas and Django working together. This way, I could develop an interface and a RESTful API at the same time. But what I would need is that all requests coming to Django can use pandas datatables that have been loaded once. Imagine if for any request 4 GB of memory had to be loaded... It would be horrible.
I have looked everywhere but couldn't find any way of doing this. I found this question: https://stackoverflow.com/questions/28661255/pandas-sharing-same-dataframe-across-the-request But it has no responses.
Why do I need to have the data in RAM? Because I need performance to render the requested results quickly. I can't ask MariaDB to calculate and maintain that data, for example, as it involves some calculations that only R, or a specialized package in Python or another language, can do.
I have a similar use case where I want to load (instantiate) a certain object only once and use it in all requests, since it takes some time (seconds) to load and I couldn't afford the lag that would introduce for every request.
I use a feature in Django>=1.7, the AppConfig.ready() method, to load this only once.
Here is the code:
# apps.py
from django.apps import AppConfig
from sexmachine.detector import Detector
class APIConfig(AppConfig):
name = 'api'
def ready(self):
# Singleton utility
# We load them here to avoid multiple instantiation across other
# modules, that would take too much time.
print("Loading gender detector..."),
global gender_detector
gender_detector = Detector()
print("ok!")
Then when you want to use it:
from api.apps import gender_detector
gender_detector.get_gender('John')
Load your data table in the ready() method and then use it from anywhere. I reckon the table will be loaded once for each WSGI worker, so be careful here.
I may be misunderstanding the problem, but to me having a 4 GB database table that is readily accessible to users shouldn't be too much of a problem. Is there anything wrong with just loading the data once upfront, as you described? 4 GB isn't that much RAM these days.
Personally, I'd recommend you just use the database system itself instead of loading everything into memory and crunching it with Python. If you set up the data properly you can issue many thousands of queries in just seconds. Pandas is actually written to mimic SQL, so most of the code you are using can probably be translated directly into SQL. Just recently I had a situation at work where I set up a big join operation over a couple hundred files (~4GB in total, 600k rows each) using pandas. The total execution time ended up being around 72 hours, and this was an operation that had to be run once an hour. A coworker ended up rewriting the same Python/pandas code as a fairly simple SQL query that finished in 5 minutes instead of 72 hours.
Anyway, I'd recommend looking into storing your pandas dataframe in an actual database table. Django is built on a database (usually MySQL or Postgres), and pandas even has a method to insert your dataframe directly into a database table: DataFrame.to_sql('table_name', connection). From there you can write the Django code so that responses make a single query to the DB, fetch the values and return the data in a timely manner.
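For reference, a minimal to_sql/read_sql round trip looks like this (pandas accepts a plain sqlite3 connection for SQLite; with MySQL or Postgres you would pass an SQLAlchemy engine instead, and the table name here is invented):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"name": ["John", "Jane"], "score": [10, 20]})

conn = sqlite3.connect(":memory:")
# Write the dataframe into a database table...
df.to_sql("scores", conn, index=False)
# ...and fetch it back with a single SQL query.
out = pd.read_sql("SELECT name, score FROM scores WHERE score > 15", conn)
print(out)
```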

Pandas as fast data storage for Flask application

I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but importing the data (from CSV files) is slow, tedious and error-prone, and getting the data out of the database and processing it is not much easier. The data is never going to change once imported (no CRUD operations), so I thought it would be ideal to store it as several pandas DataFrames (stored in HDF5 format and loaded via PyTables).
The question is:
(1) Is this a good idea, and what are the things to watch out for? (For instance I don't expect concurrency problems, as DataFrames are (or should be) stateless and immutable, which is taken care of on the application side.) What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the HDF5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least for the most recent/frequent dataframes)? Flask (or Werkzeug) has a SimpleCache class but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache), and can concurrent (different-process?) requests access the same cache?).
I realise these are many questions, but before I invest more time and build a proof-of-concept, I thought I get some feedback here. Any thoughts are welcome.
Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files, or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled, leading to fairly large files. You can also generate pandas DataFrames directly from SQL queries using read_sql, for example.
If you don't need any relational operations, then I would say ditch the Postgres server; if it's already set up and you might need it in the future, keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, they will be handled automatically for you by (Flask-)SQLAlchemy, causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than to maintain multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either the SQL server or the HDF5 file. Importantly, it also lets you cache templates, which may play a role when displaying large tables.
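The memoization idea can be sketched without Flask at all. Below, functools.lru_cache stands in for Flask-Cache's memoize decorator, and load_frame and the dataset names are invented; the dict return value is a stand-in for an expensive pd.read_hdf or pd.read_sql call:

```python
from functools import lru_cache

CALLS = {"n": 0}  # only here to show the loader runs once per dataset

@lru_cache(maxsize=8)
def load_frame(name):
    """Expensive load (e.g. pd.read_hdf or pd.read_sql), done once per
    name; later requests for the same name are served from the cache."""
    CALLS["n"] += 1
    return {"name": name, "rows": list(range(3))}  # stand-in for a DataFrame

a = load_frame("sales")
b = load_frame("sales")   # cache hit, loader not re-run
print(CALLS["n"], a is b)  # 1 True
```

The difference that matters for Gunicorn: lru_cache is per process, so each worker loads its own copy, while Flask-Cache with a memcached or redis backend shares the cached value across workers.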
You could, of course, also create a global variable where you create the Flask app and just import it wherever it's needed. I have not tried this and would thus not recommend it; it might cause all sorts of concurrency issues.
