Using Altair on data aggregated from large datasets - python

I am trying to histogram counts of a large (300,000-record) temporal dataset. For now I am just trying to histogram by month, which is only 6 data points, but doing this with either json or altair_data_server storage makes the page crash. Is this impossible to handle well with pure Altair? I could of course preprocess in pandas, but that ruins the wonderful declarative nature of Altair.
If so, is this a missing feature of Altair or is it out of scope? I'm learning that Vega-Lite stores the entire underlying dataset and applies the transformation at run time, but it seems like Altair could (and maybe does) have a way to store only the data relevant to the chart.
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)

Altair charts work by sending the entire dataset to your browser and processing it in the frontend; for this reason it does not work well for larger datasets, no matter how the dataset is served to the frontend.
In cases like yours, where you are aggregating the data before displaying it, it would in theory be possible to do that aggregation in the backend, and only send aggregated data to the frontend renderer. There are some projects that hope to make this more seamless, including scalable Vega and altair-transform, but neither approach is very mature yet.
In the meantime, I'd suggest doing your aggregations in Pandas, and sending the aggregated data to Altair to plot.
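For example, a minimal sketch of the pandas route for the monthly counts in the question, assuming the df from the question has a datetime timestamp column (the grouping shown is one illustration, not the only way to do it):

import altair as alt
import pandas as pd

# Aggregate in pandas first: one row per month instead of 300,000 rows.
# Note: this groups by calendar month; Vega-Lite's month(timestamp) folds
# months across years, so adjust if your data spans multiple years.
counts = (
    df.assign(month=df['timestamp'].dt.to_period('M').dt.to_timestamp())
      .groupby('month')
      .size()
      .reset_index(name='num_records')
)

alt.Chart(counts).mark_bar().encode(
    x=alt.X('month:T'),
    y='num_records:Q',
)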
Edit 2023-01-25: VegaFusion addresses this problem by automatically pre-aggregating the data on the server and is mature enough for production use. Version 1.0 is available under the same license as Altair.
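With a recent Altair and the VegaFusion packages installed, enabling it looks roughly like the following; the exact entry point has changed between versions, so treat this as a sketch and check the VegaFusion documentation:

import altair as alt

# Requires the VegaFusion packages (e.g. pip install "vegafusion[embed]");
# the exact enable call depends on your Altair/VegaFusion versions.
alt.data_transformers.enable("vegafusion")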

Try the following:
alt.data_transformers.enable('default', max_rows=None)
and then
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)
You will get the chart, but make sure to save all of your work in case the browser crashes.

Using the following works for me:
alt.data_transformers.enable('data_server')

Related

How to visualize a real-time candlestick chart for cryptocurrencies?

I'm trying to code a small program in Python to visualize the current price of a crypto asset with real-time data. I already have the data (historical and current data, updated every second). I just want to find a good Python library (as optimized as possible) to show the candlestick chart and eventually some indicators or lines/curves on the same graph. I did some quick research and it seems like "plotly" (used with "cufflinks") or "bokeh" are good choices. Which one would you advise and why? I'm also open to suggestions of other libraries if they are good and optimized!
Thank you in advance :)
Take a look at https://github.com/highfestiva/finplot, where you can find examples of fetching real-time data from a crypto exchange. The author notes that the library is designed with speed and crypto in mind. It looks very nice.
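For comparison with the libraries mentioned in the question, here is a minimal, non-real-time plotly sketch; the OHLC values and column names are placeholders, and live updating would need something extra (e.g. Dash or a periodic callback) on top:

import pandas as pd
import plotly.graph_objects as go

# Placeholder OHLC data; in practice this would come from your exchange feed.
df = pd.DataFrame({
    'open':  [100, 102, 101],
    'high':  [103, 104, 102],
    'low':   [ 99, 101, 100],
    'close': [102, 101, 101],
}, index=pd.date_range('2021-01-01', periods=3, freq='min'))

fig = go.Figure(data=[go.Candlestick(
    x=df.index,
    open=df['open'], high=df['high'],
    low=df['low'], close=df['close'],
)])
fig.show()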

Do we visualize Big Data?

I began to fall in love with a Python visualization library called Altair, and I use it with every small data science project that I've done.
Now, in terms of industry use cases, does it make sense to visualize Big Data, or should we just take a random sample?
Short answer: no, if you're trying to visualize data with tens of thousands of rows or more, Altair is probably not the right tool. But there are efforts in progress to add support for larger datasets in the vega ecosystem; see https://github.com/vega/scalable-vega.
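If a random sample is acceptable, a minimal sketch (assuming a pandas DataFrame df with placeholder column names) is to downsample before handing the data to Altair:

import altair as alt

# Downsample to at most 10,000 rows before charting; the cap is an
# arbitrary illustration, not a hard limit.
sampled = df.sample(n=min(len(df), 10_000), random_state=0)

alt.Chart(sampled).mark_point().encode(
    x='feature_x:Q',  # placeholder column names
    y='feature_y:Q',
)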

Does Seaborn's sns.load_dataset() function use real data?

I know the datasets that can be loaded with sns.load_dataset() are all example datasets used for Seaborn's documentation, but do these example datasets use actual data?
I'm asking because I want to know if it's useful to pay attention to the results I get as I play around with these datasets, or if I should see them solely as a means of learning the module.
The data does appear to be real. This is not formally documented by Seaborn, but:
Several of the datasets are "real", well-known datasets that can be verified elsewhere, such as the Iris dataset hosted on UCI's Machine Learning Repository.
All of the data are sourced from https://github.com/mwaskom/seaborn-data, and in turn, it appears, from actual CSVs on the local drive of Michael Waskom (a core Seaborn developer). If the data were random/fake, it would more likely have been generated with Python libraries such as NumPy.
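As a quick sanity check, a minimal sketch that loads one of the bundled datasets and inspects it:

import seaborn as sns

# 'iris' is one of the bundled example datasets; its values can be
# cross-checked against the UCI Machine Learning Repository copy.
iris = sns.load_dataset('iris')
print(iris.head())
print(iris['species'].unique())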

Big Data Retrieval and Processing with Python and PostgreSQL

Just for some background: I am developing a hotel data analytics dashboard much like this one [here](https://my.infocaptor.com/free_data_visualization.php "D3 Builder") using d3.js and dc.js (with crossfilter). It is a Django project and the database I am using is PostgreSQL. I am currently working on a universal bar graph series; it will eventually allow the user to choose the fields (from the dataset provided) that they would like to see plotted against each other in a bar chart format.
My database consists of 10 million entries with 54 fields each (a single table). Retrieving the three fields used to plot the time-based bar chart takes over a minute. Processing the data in Python (altering column key names to match those of the universal bar chart) and putting the data into a JSON format for the graph takes a further few minutes, which is unacceptable for my desired application.
Would it be possible to "parallelise" the querying of the database, and would this be faster than what I am doing currently (a normal query)? I have looked around a bit and not found much. Also, is there a library or optimized function I might use to parse my data into the desired format quickly?
I have worked with tables of a similar size. For what you are looking for, you would need to switch to something like a distributed Postgres environment, e.g. Greenplum, which has an MPP architecture and supports columnar storage. That is ideal for a table with a large number of columns and rows.
http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
If you do not intend to switch to Greenplum, you can try table partitioning in your current Postgres database. Your dashboard queries should target individual partitions; that way you end up querying smaller partitions (tables) and the query time will be much faster.
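On the Python side, the reformatting step can also be sped up; a minimal sketch (with hypothetical connection string, table, and column names) that selects only the needed columns and serializes to JSON in one vectorized call rather than row by row:

import pandas as pd
from sqlalchemy import create_engine

# Connection string, table and column names are placeholders.
engine = create_engine('postgresql://user:password@localhost:5432/hotel_db')

# Select only the three columns the chart needs, straight into a DataFrame.
df = pd.read_sql_query(
    'SELECT booking_date, room_type, revenue FROM bookings', engine
)

# Rename columns to what the dc.js chart expects, then serialize in one
# vectorized call instead of looping over rows in Python.
df = df.rename(columns={'booking_date': 'date', 'revenue': 'value'})
payload = df.to_json(orient='records', date_format='iso')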

How to generate graphs and statistics from SQLAlchemy tables?

After running a bunch of simulations I'm going to be outputting the results into a table created using SQLAlchemy. I plan to use this data to generate statistics - mean and variance being key. These, in turn, will be used to generate some graphs - histograms/line graphs, pie-charts and box-and-whisker plots specifically.
I'm aware of the Python graphing libraries like matplotlib. The thing is, I'm not sure how to have this integrate with the information contained within the database tables.
Any suggestions on how to make these two play with each other?
The main problem is that I'm not sure how to supply the information as "data sets" to the graphing library.
It looks like matplotlib takes simple Python data types (lists of numbers, etc.), so you'll need to write custom code to massage what you pull out of MySQL/SQLAlchemy into input for the graphing functions...
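A minimal sketch of one way to bridge the two, assuming a hypothetical results table with a numeric score column; pandas reads straight from the SQLAlchemy engine and hands plain arrays to matplotlib:

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Connection string, table and column names are placeholders.
engine = create_engine('sqlite:///simulations.db')
df = pd.read_sql_query('SELECT run_id, score FROM results', engine)

# Summary statistics
print('mean:', df['score'].mean())
print('variance:', df['score'].var())

# Histogram of the simulated scores
plt.hist(df['score'], bins=30)
plt.xlabel('score')
plt.ylabel('frequency')
plt.show()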
