I'm writing a Django app (that uses Postgres 8.4 as the backend) that aggregates a large volume of data (154 GB, 150 Tables). I'd like to know if there are any existing Python modules or frameworks that support analysis across multiple tables and columns.
For example:
Table 1 has columns A, B, C
Table 2 has columns A, D
Table 3 has columns F, G, H, I
I'd like to see how B relates/corresponds to D - plotting B vs D in 2 axes or other forms. It would be nice if I could feed it a list of dimensions and it could compare any one to another.
Prewarning: All 3 of the db-based graphing libraries I worked with that do what you want use NOT Postgres (...and I only liked 2 of them anyway...).
If you're still early in development you may want to consider graphite. It does have great graphing functionality and is very clean to work with as well is written in python.
If you want something with more of a kick, OpenTSDB.
The easiest way to use either of these would be to write a shellscript/scraper to query your tables and spit it back to your graphite/opentsdb instance. If you're looking to map directly from your db, you might have better luck recycling graphite's code.
You are going to have to write custom SQL code, and how you plugin that data into an app or a graphing monitoring system depends on you.
+1 for graphite, and with collectd+graphite plugin and a postgresql plugin it's easy to get postgresql data into graphite.
The things you want to monitor are related to a specific database AND probably to your use case, afaik there is nothing in pythonland that will help you with the SQL.
For those who aren't postgresql gurus there is an excellent book with a whole bunch of examples of monitoring/admin queries.
Personally I wouldn't use django itself for these operations, but they can be done easily using rawsql and then you can just define some models to hold the data, and use your visualization tool of choice to display the data.
Related
I'm developing a Django app. Use-case scenario is this:
50 users, each one can store up to 300 time series and each time serie has around 7000 rows.
Each user can ask at any time to retrieve all of their 300 time series and ask, for each of them, to perform some advanced data analysis on the last N rows. The data analysis cannot be done in SQL but in Pandas, where it doesn't take much time... but retrieving 300,000 rows in separate dataframes does!
Users can also ask results of some analysis that can be performed in SQL (like aggregation+sum by date) and that is considerably faster (to the point where I wouldn't be writing this post if that was all of it).
Browsing and thinking around, I've figured storing time series in SQL is not a good solution (read here).
Ideal deploy architecture looks like this (each bucket is a separate server!):
Problem: time series in SQL are too slow to retrieve in a multi-user app.
Researched solutions (from this article):
PyStore: https://github.com/ranaroussi/pystore
Arctic: https://github.com/manahl/arctic
Here are some problems:
1) Although these solutions are massively faster for pulling millions of rows time series into a single dataframe, I might need to pull around 500.000 rows into 300 different dataframes. Would that still be as fast?
This is the current db structure I'm using:
class TimeSerie(models.Model):
...
class TimeSerieRow(models.Model):
date = models.DateField()
timeserie = models.ForeignKey(timeserie)
number = ...
another_number = ...
And this is the bottleneck in my application:
for t in TimeSerie.objects.filter(user=user):
q = TimeSerieRow.objects.filter(timeserie=t).orderby("date")
q = q.filter( ... time filters ...)
df = pd.DataFrame(q.values())
# ... analysis on df
2) Even if PyStore or Arctic can do that faster, that'd mean I'd loose the ability to decouple my db from the Django instances, effictively using resources of one machine way better, but being stuck to use only one and not being scalable (or use as many separate databases as machines). Can PyStore/Arctic avoid this and provide an adapter for remote storage?
Is there a Python/Linux solution that can solve this problem? Which architecture I can use to overcome it? Should I drop the scalability of my app and/or accept that every N new users I'll have to spawn a separate database?
The article you refer to in your post is probably the best answer to your question. Clearly good research and a few good solutions being proposed (don't forget to take a look at InfluxDB).
Regarding the decoupling of the storage solution from your instances, I don't see the problem:
Arctic uses mongoDB as a backing store
pyStore uses a file system as a backing store
InfluxDB is a database server on its own
So as long as you decouple the backing store from your instances and make them shared among instances, you'll have the same setup as for your posgreSQL database: mongoDB or InfluxDB can run on a separate centralised instance; the file storage for pyStore can be shared, e.g. using a shared mounted volume. The python libraries that access these stores of course run on your django instances, like psycopg2 does.
Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will be populating the charts. 1) I can use the same script that exports the data to postgres to arrange the data as I like it before it is exported. For instance, in certain cases I need to group the data by some parameter (by customer for instance), then perform calculations on the groups by columns. I could do this for each different slice I want and then export different tables for each model class to postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
I am much more comfortable doing it up front with python because I have been doing it that way for a while. I also understand that django's queries are little more difficult to implement. However, doing it with python would mean that I will need more tables (because I will have grouped them in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performances and flexibility. With the first approach you get better performances (your schema is taylored for the exact queries you want to run) but lacks flexibility (if you need to add more queries the scheam might not match so well - or even not match at all - in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema), with the second one you (hopefully) have a well normalized schema but one that makes queries much more complex and much more heavy on the database server.
Now the question is: do you really have to choose ? You could also have both the fully normalized data AND the denormalized (pre-processed) data alongside.
As a side note: Django ORM is indeed most of a "80/20" tool - it's designed to make the 80% simple queries super easy (much easier than say SQLAlchemy), and then it becomes a bit of a PITA indeed - but nothing forces you to use django's ORM for everything (you can always drop down to raw sql or use SQLAlchemy alongside).
Oh and yes: your problem is nothing new - you may want to read about OLAP
I am making a database with data in it. That database has two customers: 1) a .NET webserver that makes the data visible to users somehow someway. 2) a python dataminer that creates the data and populates the tables.
I have several options. I can use the .NET Entity Framework to create the database, then reverse engineer it on the python side. I can vice versa that. I can just write raw SQL statements in one or the other systems, or both. What are possible pitfalls of doing this one way or the other? I'm worried, for example, that if I use the python ORM to create the tables, then I'm going to have a hard time in the .NET space...
I love questions like that.
Here is what you have to consider, your web site has to be fast, and the bottleneck of most web sites is a database. The answer to your question would be - make it easy for .NET to work with SQL. That will require little more work with python, like specifying names of the table, maybe row names. I think Django and SQLAlchemy are both good for that.
Another solution could be to have a bridge between database with gathered data and database to display data. On a background you can have a task/job to migrate collected data to your main database. That is also an option and will make your job easier, at least all database-specific and strange code will go to the third component.
I've been working with .NET for quite a long time before I switched to python, and what you should know is that whatever strategy you chose it will be possible to work with data in both languages and ORMs. Do the hardest part of the job in the language your know better. If you are a Python developer - pick python to mess with the right names of tables and rows.
I have a Django-based application with an Oracle backend. I want to do analysis of the application's data in R. What can I do?
I would like to avoid directly querying the database because there are several aspects of our Django models that make it hard to understand the resulting database schema and would make the SQL very complicated.
I would also like to avoid writing a separate Python script to manually export data to a file and then load that file into R because these separate steps would slow down the analysis and iteration process.
My ideal would be some interface that would allow me to write Django queries directly in R. As far as I can tell the only option for this is rPython and this would be tricky to set up with the necessary Django/Python environment variables et al (right?). Are there any other ways this direct interface could be possible?
I want to get the data into R because: 1) there are some statistical R packages that aren't well implemented in Python, 2) I am quicker at transforming data in R than Python, and 3) I need to plot the results and I find it easier to make ggplot2 plots look nice than matplotlib.
My company has decided to implement a datamart using [Greenplum] and I have the task of figuring out how to go on about it. A ballpark figure of the amount of data to be transferred from the existing [DB2] DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla [PostgresSQL]? (I've worked on Postgres AS 8.3)
2) Are there any (free) tools available for this task (extract and import)
3) I have some knowledge of Python. Is it feasible, even easy to do this in a resonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla postgres, but it is similar. It has some new syntax, but in general, is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play to their rules and your data is clean, the loading can be blazing fast.
3) You don't need python to do this, although you could automate it by using python to kick off the gpfdist processes, and then sending a command to psql that creates the external table and loads the data. Depends on what you want to do though.
Many of Greenplum's utilities are written in python and the current DBMS distribution comes with python 2.6.2 installed, including the pygresql module which you can use to work inside the GPDB.
For data transfer into greenplum, I've written python scripts that connect to the source (Oracle) DB using cx_Oracle and then dumping that output either to flat files or named pipes. gpfdist can read from either sort of source and load the data into the system.
Generally, it is really slow if you use SQL insert or merge to import big bulk data.
The recommended way is to use the external tables you define to use file-based, web-based or gpfdist protocol hosted files.
And also greenplum has a utility named gpload, which can be used to define your transferring jobs, like source, output, mode(inert, update or merge).
1) It's not vanilla postgres
2) I have used pentaho data integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
Also I believe Pentaho support Greenplum specifically though I have no experience of this.