How can I get data from Django into R?

I have a Django-based application with an Oracle backend. I want to do analysis of the application's data in R. What can I do?
I would like to avoid directly querying the database because there are several aspects of our Django models that make it hard to understand the resulting database schema and would make the SQL very complicated.
I would also like to avoid writing a separate Python script to manually export data to a file and then load that file into R because these separate steps would slow down the analysis and iteration process.
My ideal would be some interface that would allow me to write Django queries directly in R. As far as I can tell, the only option for this is rPython, and setting it up with the necessary Django/Python environment variables and so on looks tricky (right?). Are there any other ways this direct interface could be made possible?
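The closest thing I have found is bootstrapping Django from a small standalone module that an R-to-Python bridge could call, so that R only has to invoke one plain Python function. A rough sketch (the project name myproject, the app myapp and the Order model are placeholders, not my real names):

# standalone_queries.py - a thin entry point an R/Python bridge could call.
# Assumes DJANGO_SETTINGS_MODULE points at the real settings module.
import os
import django

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
django.setup()

from myapp.models import Order  # hypothetical model

def orders_as_dicts():
    # Plain dicts/lists serialize cleanly across the R/Python boundary.
    return list(Order.objects.values("customer_id", "amount", "created"))

Is something like this reasonable, or is there a cleaner route?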
I want to get the data into R because: 1) there are some statistical R packages that aren't well implemented in Python, 2) I am quicker at transforming data in R than Python, and 3) I need to plot the results and I find it easier to make ggplot2 plots look nice than matplotlib.

Related

Recommendation for manipulating data with python vs. SQL for a django app

Background:
I am developing a Django app for a business application that takes client data and displays charts in a dashboard. I have large databases full of raw information such as part sales by customer, and I will use that to populate the analyses. I have been able to do this very nicely in the past using python with pandas, xlsxwriter, etc., and am now in the process of replicating what I have done in the past in this web app. I am using a PostgreSQL database to store the data, and then using Django to build the app and fusioncharts for the visualization. In order to get the information into Postgres, I am using a python script with sqlalchemy, which does a great job.
The question:
There are two ways I can manipulate the data that will populate the charts. 1) I can use the same script that exports the data to Postgres to arrange the data the way I want before it is exported (a rough sketch of this approach appears right after option 2). For instance, in certain cases I need to group the data by some parameter (by customer, for instance), then perform calculations on the grouped columns. I could do this for each slice I want and then export different tables for each model class to Postgres.
2) I can upload the entire database to postgres and manipulate it later with django commands that produce SQL queries.
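For concreteness, here is a rough sketch of how I do option 1 today with pandas and SQLAlchemy (the connection string, file name and column names are simplified stand-ins for my real data):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, file and column names.
engine = create_engine("postgresql://user:password@localhost:5432/dashboard")

raw = pd.read_csv("part_sales.csv")
by_customer = raw.groupby("customer", as_index=False)["sales"].sum()

# One table with the raw rows, plus one pre-aggregated slice per analysis.
raw.to_sql("part_sales_raw", engine, if_exists="replace", index=False)
by_customer.to_sql("sales_by_customer", engine, if_exists="replace", index=False)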
I am much more comfortable doing it up front with Python because I have been doing it that way for a while. I also understand that Django's queries are a little more difficult to implement. However, doing it with Python would mean that I will need more tables (because I will have grouped the data in different ways), and I don't want to do it the way I know just because it is easier, if uploading a single database and using Django/SQL queries would be more efficient in the long run.
Any thoughts or suggestions are appreciated.
Well, it's the usual tradeoff between performance and flexibility. With the first approach you get better performance (your schema is tailored for the exact queries you want to run) but less flexibility (if you need to add more queries, the schema might not match so well, or not match at all, in which case you'll have to repopulate the database, possibly from raw sources, with an updated schema). With the second one you (hopefully) have a well-normalized schema, but one that makes queries much more complex and much heavier on the database server.
Now the question is: do you really have to choose? You could also keep both the fully normalized data AND the denormalized (pre-processed) data side by side.
As a side note: the Django ORM is indeed very much an "80/20" tool - it's designed to make the 80% of simple queries super easy (much easier than, say, SQLAlchemy), and beyond that it does become a bit of a PITA - but nothing forces you to use Django's ORM for everything (you can always drop down to raw SQL or use SQLAlchemy alongside it).
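To illustrate that last point, a minimal sketch of dropping down to raw SQL inside a Django project for one denormalized slice; the table and column names here are hypothetical:

from django.db import connection

def sales_by_customer():
    # Bypass the ORM for an aggregate that is awkward to express with it.
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT customer_id, SUM(quantity) AS total_qty "
            "FROM myapp_partsale GROUP BY customer_id"
        )
        return cursor.fetchall()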
Oh and yes: your problem is nothing new - you may want to read up on OLAP.

Python, computationally efficient data storage methods

I am retrieving structured numerical data (floats with 2-3 decimal places) via HTTP requests from a server. The data comes in as sets of numbers which are then converted into an array/list. I want to then store each set of data locally on my computer so that I can operate on it further.
Since there are very many of these data sets to collect, simply writing each data set that comes in to a .txt file does not seem very efficient. On the other hand, I am aware that there are various solutions such as MongoDB, Python-to-SQL interfaces, etc., but I'm unsure which one I should use and which would be the most appropriate and efficient for this scenario.
Also the database that is created must be able to interface and be queried from different languages such as MATLAB.
If you just want to store it somewhere so MATLAB can work with it, pick one of the databases supported by MATLAB and then install the appropriate Python drivers for that database.
Database drivers in Python follow a standard API (the DB-API), so there is a uniform way of dealing with databases.
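As a sketch of what that uniform interface looks like, using the built-in sqlite3 driver (any DB-API driver exposes the same connect/cursor/execute pattern; the table layout is made up):

import sqlite3

conn = sqlite3.connect("readings.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, value REAL)")
cur.execute("INSERT INTO readings VALUES (?, ?)", ("2015-01-01T00:00:00", 12.345))
conn.commit()
print(cur.execute("SELECT COUNT(*) FROM readings").fetchone())
conn.close()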
As you haven't told us how you intend to work with this data later on, it is difficult to provide any further specifics.
The idea is that I wish to essentially download all of the data onto my machine so that I can operate on it locally later (run analytics and perform certain mathematical operations on it) instead of having to call it from the server constantly.
For that purpose you can use any storage mechanism from text files to any of the databases supported by MATLAB - as all databases supported by MATLAB are supported by Python.
You can choose to store the data as text and then do the numeric calculations on the application side (i.e., the MATLAB side). Or you can choose to store the data as numbers/float/decimal (depending on the precision you need), which will allow you to do some calculations on the database side.
If you just want to store it as text and do calculations on the application side, then the easiest option is MongoDB, as it is schema-less. You would be storing the data as JSON, which may be the format in which it is being retrieved from the web.
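If you go that schema-less route, a minimal pymongo sketch might look like this (the database, collection and field names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["measurements"]["raw"]  # hypothetical db/collection names

# Store each retrieved JSON document as-is; no schema needs to be declared.
collection.insert_one({"source": "server-a", "values": [1.234, 5.678, 9.012]})
print(collection.count_documents({"source": "server-a"}))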
If you wish to take advantage of some math functions or other capabilities (for example, geospatial calculations), then a better choice is a traditional database that you are familiar with. You'll have to create a schema and define the datatypes for each of your incoming data objects, and then store them appropriately in order to take advantage of the database's query features.
I can recommend using a lightweight ORM like peewee, which can use a number of SQL databases as the storage backend. Then it becomes a matter of choosing the database you want. The simplest database to use is SQLite, but should you decide that's not fast enough, switching to another database like PostgreSQL or MySQL is trivial.
The advantage of the ORM is that you can use Python syntax to interact with the SQL database and don't have to learn any SQL.
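A minimal peewee sketch, assuming SQLite as the initial backend and a made-up Reading model:

import datetime
from peewee import SqliteDatabase, Model, FloatField, DateTimeField

db = SqliteDatabase("readings.db")  # swap for PostgresqlDatabase/MySQLDatabase later

class Reading(Model):
    value = FloatField()
    fetched_at = DateTimeField(default=datetime.datetime.now)

    class Meta:
        database = db

db.connect()
db.create_tables([Reading])
Reading.create(value=12.345)
print(Reading.select().count())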
Have you considered HDF5? It's very efficient for numerical data, and is supported by both Python and Matlab.
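A short sketch of writing one data set with h5py (file and dataset names are arbitrary); MATLAB can read the same file back with h5read:

import h5py
import numpy as np

data = np.random.rand(1000, 3)  # stand-in for one retrieved data set

with h5py.File("datasets.h5", "a") as f:
    f.create_dataset("run_001", data=data, compression="gzip")

# In MATLAB: h5read('datasets.h5', '/run_001')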

Prometheus + simple time series + Python

I'm trying to build a simple time series database in Prometheus. I'm looking at financial time series data, and need somewhere to store the data for quick access via Python. I'm loading the data into the time series via XML or .csv files, so this isn't some crazy "lots of data in and out at the same time" kind of project. I'm the only user, maybe a couple of others will use it in time, and I just want something that's easy to load data into and pull out of.
I was hoping for some guidance on how to do this. Few questions:
1) Is it simple to pull data from a prometheus database via python?
2) I wanted to run this all locally off my windows machine, is that doable?
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, since these are large time series data sets.)
Thanks
Prometheus is intended primarily for operational monitoring. While you may be able to get something working, Prometheus doesn't for example support bulk loading of data.
1) Is it simple to pull data from a prometheus database via python?
The HTTP API should be easy to use.
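For example, a range query against the HTTP API with the requests library might look roughly like this (the metric name and time window are placeholders):

import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={
        "query": "my_price_metric",  # hypothetical metric name
        "start": "2019-01-01T00:00:00Z",
        "end": "2019-01-02T00:00:00Z",
        "step": "60s",
    },
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["values"][:3])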
2) I wanted to run this all locally off my windows machine, is that doable?
That should work.
3) Am I completely overengineering this? (My worry with SQL is that it would be a mess to work with, since these are large time series data sets.)
I'd more say that Prometheus is probably not the right tool for the job here. Up to say 100GB I'd consider a SQL database to be a good starting point.

Visualized data analysis for Django (Postgres) data

I'm writing a Django app (that uses Postgres 8.4 as the backend) that aggregates a large volume of data (154 GB, 150 Tables). I'd like to know if there are any existing Python modules or frameworks that support analysis across multiple tables and columns.
For example:
Table 1 has columns A, B, C
Table 2 has columns A, D
Table 3 has columns F, G, H, I
I'd like to see how B relates/corresponds to D - plotting B vs. D on two axes, or in other forms. It would be nice if I could feed it a list of dimensions and it could compare any one to another.
Fair warning: all three of the DB-based graphing libraries I've worked with that do what you want do NOT use Postgres (...and I only liked two of them anyway...).
If you're still early in development you may want to consider Graphite. It has great graphing functionality, is very clean to work with, and is written in Python.
If you want something with more of a kick, OpenTSDB.
The easiest way to use either of these would be to write a shell script/scraper to query your tables and push the results back to your Graphite/OpenTSDB instance. If you're looking to map directly from your DB, you might have better luck recycling Graphite's code.
You are going to have to write custom SQL code, and how you plug that data into an app or a graphing/monitoring system is up to you.
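If you go the scraper route, pushing a value into Graphite is just one line over its plaintext protocol (Carbon usually listens on port 2003); the metric path and value here are made up:

import socket
import time

def send_to_graphite(path, value, host="localhost", port=2003):
    # Graphite's plaintext format: "<metric.path> <value> <unix timestamp>\n"
    line = "%s %f %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

send_to_graphite("myapp.table1.b_mean", 42.0)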
+1 for Graphite; with collectd, its Graphite plugin, and a PostgreSQL plugin, it's easy to get PostgreSQL data into Graphite.
The things you want to monitor are related to a specific database AND probably to your use case; AFAIK there is nothing in Python-land that will help you with the SQL.
For those who aren't postgresql gurus there is an excellent book with a whole bunch of examples of monitoring/admin queries.
Personally I wouldn't use Django itself for these operations, but they can be done easily using raw SQL; then you can just define some models to hold the results and use your visualization tool of choice to display the data.
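A rough sketch of that approach, assuming a hypothetical Measurement model whose fields match the selected columns (the table and column names are stand-ins for your own):

from myapp.models import Measurement  # hypothetical model holding the joined results

rows = Measurement.objects.raw(
    "SELECT t1.id, t1.b, t2.d "
    "FROM table1 t1 JOIN table2 t2 ON t2.a = t1.a"
)
for row in rows:
    print(row.b, row.d)  # feed these pairs to your plotting tool of choice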

Transferring data from a DB2 DB to a greenplum DB

My company has decided to implement a datamart using Greenplum and I have the task of figuring out how to go about it. A ballpark figure of the amount of data to be transferred from the existing DB2 DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla PostgreSQL? (I've worked on Postgres AS 8.3.)
2) Are there any (free) tools available for this task (extract and import)?
3) I have some knowledge of Python. Is it feasible, or even easy, to do this in a reasonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla Postgres, but it is similar. It has some new syntax, but in general it is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play by their rules and your data is clean, the loading can be blazing fast.
3) You don't need Python to do this, although you could automate it by using Python to kick off the gpfdist processes and then send a command to psql that creates the external table and loads the data. It depends on what you want to do, though.
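If you do automate it from Python, a rough sketch is just process orchestration with subprocess (the paths, port, host and script name below are made up):

import subprocess

# Start gpfdist serving the directory that holds the split files.
gpfdist = subprocess.Popen(["gpfdist", "-d", "/data/export", "-p", "8081"])
try:
    # Run the SQL script that creates the external table and loads from it.
    subprocess.run(
        ["psql", "-h", "gp-master", "-d", "warehouse", "-f", "load_external.sql"],
        check=True,
    )
finally:
    gpfdist.terminate()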
Many of Greenplum's utilities are written in python and the current DBMS distribution comes with python 2.6.2 installed, including the pygresql module which you can use to work inside the GPDB.
For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or named pipes. gpfdist can read from either sort of source and load the data into the system.
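A stripped-down version of that kind of dump script might look like this (the connection string, query and output path are placeholders):

import cx_Oracle

conn = cx_Oracle.connect("user/password@dbhost:1521/service")
cur = conn.cursor()
cur.execute("SELECT customer_id, part_no, qty FROM part_sales")

# Write tab-delimited rows that gpfdist can serve to a readable external table.
with open("/data/export/part_sales.txt", "w") as out:
    for row in cur:
        out.write("\t".join("" if c is None else str(c) for c in row) + "\n")

conn.close()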
Generally, using SQL INSERT or MERGE to import big bulk data is really slow.
The recommended way is to use external tables, defined over file-based, web-based or gpfdist-protocol-hosted files.
Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, mode (insert, update or merge), and so on.
1) It's not vanilla postgres
2) I have used Pentaho Data Integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
I also believe Pentaho supports Greenplum specifically, though I have no experience of this.
