How to generate graphs and statistics from SQLAlchemy tables? - python

After running a bunch of simulations I'm going to be outputting the results into a table created using SQLAlchemy. I plan to use this data to generate statistics - mean and variance being key. These, in turn, will be used to generate some graphs - histograms/line graphs, pie-charts and box-and-whisker plots specifically.
I'm aware of the Python graphing libraries like matplotlib. The thing is, I'm not sure how to integrate them with the information contained in the database tables.
Any suggestions on how to make these two play with each other?
The main problem is that I'm not sure how to supply the information as "data sets" to the graphing library.

It looks like matplotlib takes simple Python data types -- lists of numbers, etc. -- so you'll need to write custom code to massage what you pull out of MySQL/SQLAlchemy for input into the graphing functions...
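For example, here is a minimal sketch of that massaging step, assuming a hypothetical Result model with a numeric value column and an in-memory SQLite engine; your table and column names will differ:

```python
from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.orm import declarative_base, Session
import statistics
import matplotlib.pyplot as plt

Base = declarative_base()

class Result(Base):  # hypothetical simulation-results table
    __tablename__ = "results"
    id = Column(Integer, primary_key=True)
    value = Column(Float)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # ... populate the table from your simulations ...
    # Pull the column out as a plain Python list -- that's all matplotlib needs.
    values = [row.value for row in session.query(Result.value)]

if len(values) > 1:
    print("mean:", statistics.mean(values), "variance:", statistics.variance(values))
    plt.hist(values, bins=20)
    plt.show()
```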

Related

Visualize large and complex dataset

How do I dynamically visualize large and complex data sets (stored in Elasticsearch) with dozens of sub-graphs and drill-downs? I would like some kind of dynamic control over this, but would prefer not to roll my own (in Python). Is it possible to use Kibana for this kind of thing? Or is there some better tool for the task?
Best would be if I could have rudimentary control over layout, and then be able to show a number of bar charts if the user wants to see time series, but pie charts, for instance, if she wants to slice the data the other way. The user should be able to interactively click her way down, generating AND/OR Lucene expressions, etc.
The more I rubber-duck, the more I feel like I need to build this myself in Bokeh or something of the kind. If I need to create all this business logic manually, what would be my best HTML/graphing library? Or are there perhaps plugins for Kibana that do this? If I have to build things manually it does not necessarily need to be in Python, but it needs to be back-end (for Elasticsearch security). (A minimal Bokeh starting point is sketched after this question.)
I wrote my own; it took around 500 LoC and a week or two of dev-ops work.
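For the roll-your-own Bokeh route mentioned above, a minimal static bar chart looks roughly like this; the categories and counts are invented for illustration, and real drill-down behaviour would need Bokeh server widgets and callbacks on top of it:

```python
from bokeh.plotting import figure, show

# Invented aggregation, e.g. the result of an Elasticsearch terms query.
categories = ["error", "warning", "info"]
counts = [42, 117, 503]

p = figure(x_range=categories, height=300, title="Events per level")
p.vbar(x=categories, top=counts, width=0.8)
show(p)  # writes an HTML file and opens it in the browser
```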

Can I translate/duplicate an existing graph in Excel into Python?

I have many graphs in Excel that I would like to convert to Python but am struggling with how to do so using Matplotlib. Is there a package or method that would essentially convert/translate all the formatting and data series selection into Python?
Once I can see a few examples of the correct code I think I could start doing this directly in Python, but I do not have much experience writing chart code by hand (I mostly use Excel's insert-chart feature), so I am looking for a bridge.
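I'm not aware of a package that translates an Excel chart's formatting wholesale, but as a bridge you can read the chart's source data with pandas and replot it in matplotlib. A minimal sketch, assuming a hypothetical workbook report.xlsx with Date and Sales columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Read the same range the Excel chart points at (hypothetical file/sheet names).
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")

fig, ax = plt.subplots()
ax.plot(df["Date"], df["Sales"], marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Sales")
ax.set_title("Sales over time")
fig.autofmt_xdate()  # slanted date labels, similar to Excel's default
plt.show()
```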

How to store multi-dimensional data

I am building a couple of web applications to store data using Django. This data would be generated from lab tests and might have up to 100 parameters being logged against time, leaving me with an N×M matrix of data.
I'm struggling to see how this would fit into a Django model, as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? (A sketch of that approach follows after this question.) If so, what would be the best format for fast access, so the application can quickly render, search through, and graph the data?
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still at the beginning with this site I'm using SQLite, but I plan to move to a full SQL database as it grows.
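To the file-versus-model question above, one common pattern (sketched here as an assumption, not a definitive answer) is to keep the test metadata in a model and attach the variable-width parameter matrix either as a file or as a JSON blob; all model and field names below are hypothetical:

```python
from django.db import models

class LabTest(models.Model):
    name = models.CharField(max_length=200)
    run_at = models.DateTimeField(auto_now_add=True)

class TestResult(models.Model):
    test = models.ForeignKey(LabTest, on_delete=models.CASCADE, related_name="results")
    # Option 1: store the raw matrix as a file (e.g. CSV/HDF5) and parse on demand.
    datafile = models.FileField(upload_to="results/", blank=True)
    # Option 2: store parameter -> series mappings as JSON for quick querying.
    data = models.JSONField(default=dict, blank=True)
```

The JSON option keeps the data queryable (e.g. TestResult.objects.filter(data__has_key="temperature") on PostgreSQL), which suits the trend-analysis requirement better than an opaque file.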

Geospatial Analytics in Python

I have been doing some investigation to find a package to install and use for Geospatial Analytics
The closest I got to was https://github.com/harsha2010/magellan - this, however, has only a Scala interface and no documentation on how to use it with Python.
I was hoping someone knows of a package I can use.
What I am trying to do is analyse Uber's data, map it to the actual postcodes/suburbs, and run it through SGD to predict the number of trips to a particular suburb.
There is already a lot of information here - http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/#comment-606532 - and I am looking for ways to do it in Python.
In Python I'd take a look at GeoPandas. It provides a data structure called GeoDataFrame: it's a list of features, each one having a geometry and some optional attributes. You can join two GeoDataFrames together based on geometry intersection, and you can aggregate the number of rows (say, trips) within a single geometry (say, a postcode).
I'm not familiar with Uber's data, but I'd try to find a way to get it into a GeoPandas GeoDataFrame.
Likewise, postcodes can be downloaded from places like the U.S. Census, OpenStreetMap[1], etc., and coerced into a GeoDataFrame.
Join #1 to #2 based on geometry intersection. You want a new GeoDataFrame with one row per Uber trip, but with the postcode attached to each. Another Stack Overflow post discusses how to do this, and it's currently harder than it ought to be.
Aggregate this by postcode and count the trips in each. The code will look like joined_dataframe.groupby('postcode').count(). (A sketch of these steps follows after this answer.)
My fear with the above process is that if you have hundreds of thousands of very complex trip geometries, it could take forever on one machine. The link you posted uses Spark, and you may end up wanting to parallelize this after all. You can write Python against a Spark cluster(!), but I'm not the person to help you with that component.
Finally, for the prediction component (e.g. SGD), check out scikit-learn: it's a pretty fully featured machine learning package, with a dead simple API.
[1]: There is a separate package called geopandas_osm that grabs OSM data and returns a GeoDataFrame: https://michelleful.github.io/code-blog/2015/04/27/osm-data/
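A rough sketch of those steps with GeoPandas; the file names, column names, and CRS choices below are assumptions rather than Uber's actual schema:

```python
import geopandas as gpd
import pandas as pd

# 1. Uber trips -> GeoDataFrame (hypothetical CSV with pickup lat/lng columns).
trips = pd.read_csv("uber_trips.csv")
trips_gdf = gpd.GeoDataFrame(
    trips,
    geometry=gpd.points_from_xy(trips["pickup_lng"], trips["pickup_lat"]),
    crs="EPSG:4326",
)

# 2. Postcode polygons -> GeoDataFrame (e.g. a Census or OSM shapefile).
postcodes = gpd.read_file("postcodes.shp").to_crs("EPSG:4326")

# 3. Spatial join: attach the containing postcode to each trip.
joined = gpd.sjoin(
    trips_gdf, postcodes[["postcode", "geometry"]], how="inner", predicate="within"
)

# 4. Aggregate: number of trips per postcode.
trips_per_postcode = joined.groupby("postcode").size()
print(trips_per_postcode.sort_values(ascending=False).head())
```

The resulting counts per postcode are then a plain pandas Series that can be handed to scikit-learn for the SGD step.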
I realize this is an old question, but I want to build on Jeff G's answer.
If you arrive at this page looking for help putting together a suite of geospatial analytics tools in Python, I would highly recommend this tutorial.
https://geohackweek.github.io/vector
It really picks up steam in the 3rd section.
It shows how to integrate GeoPandas, PostGIS, Folium, and rasterstats. Add in scikit-learn, NumPy, and SciPy and you can really accomplish a lot. You can grab information from this ndarray tutorial as well.

Big Data Retrieval and Processing Python and PostgreSQL

Just for some background: I am developing a hotel data analytics dashboard much like this one [here](https://my.infocaptor.com/free_data_visualization.php "D3 Builder"), using d3.js and dc.js (with crossfilter). It is a Django project and the database I am using is PostgreSQL. I am currently working on a universal bar graph series; it will eventually allow the user to choose the fields (from the data set provided) that they would like to see plotted against each other in a bar chart format.
My database consists of 10 million entries, with 54 fields each (single table). Retrieving the three fields used to plot the time-based bar chart takes over a minute. Processing the data in Python (altering column key names to match those of the universal bar chart) and putting the data into a JSON format to be used by the graph takes a further few minutes, which is unacceptable for my desired application.
Would it be possible to "parallelise" the querying of the database, and would this be faster than what I am doing currently (a normal query)? I have looked around a bit and not found much. And is there a library or optimized function I might use to parse my data into the desired format quickly?
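On the parsing side, a sketch of what often helps most, under the assumption of hypothetical table and column names: select only the columns the chart needs and let pandas handle the renaming and JSON serialisation instead of pure-Python loops:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/hoteldb")  # placeholder DSN

# Pull only the three fields the chart plots; an index on booking_date helps a lot here.
query = "SELECT booking_date, room_type, revenue FROM bookings"
df = pd.read_sql(query, engine)

# Rename to the keys the universal bar chart expects, then serialise in one call.
df = df.rename(columns={"booking_date": "date", "room_type": "category", "revenue": "value"})
payload = df.to_json(orient="records", date_format="iso")
```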
I have worked with tables of a similar size. For what you are looking for, you would need to switch to something like a distributed Postgres environment, i.e. Greenplum, which has an MPP architecture and supports columnar storage. That is ideal for a table with a large number of columns and this table size.
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
If you do not intend to switch to Greenplum, you can try table partitioning in your current Postgres database. Your dashboard queries should be written so that they hit individual partitions; that way you end up querying much smaller partitions (tables) and the query time will be far faster.
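A hedged sketch of declarative range partitioning by date on a hypothetical bookings table, issued through SQLAlchemy since the project is in Python (PostgreSQL 10+ syntax):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/hoteldb")  # placeholder DSN

ddl = """
CREATE TABLE bookings_partitioned (
    booking_date date NOT NULL,
    room_type    text,
    revenue      numeric
) PARTITION BY RANGE (booking_date);

CREATE TABLE bookings_2016 PARTITION OF bookings_partitioned
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');

CREATE TABLE bookings_2017 PARTITION OF bookings_partitioned
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');
"""

with engine.begin() as conn:
    # Dashboard queries that filter on booking_date will then scan only the
    # matching partition instead of the full 10-million-row table.
    for statement in ddl.split(";"):
        if statement.strip():
            conn.execute(text(statement))
```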
