I'm hoping someone can point me in the right direction. The python datavis landscape has now become huge and there are so many options that I'm a bit lost on what the best way to achieve this is.
I have an xarray dataset (but it could easily be a pandas dataframe or a list of numpy arrays).
I have 3 columns, A, B, and C. They contain 40 data points.
I want to plot a scatter plot of A vs B + scale*C where scale is determined from an interactive slider.
The more advanced version of this would have a dropdown where you can select a different set of 3 columns but I'll worry about that bit later.
The caveat on all of this is that I'd like it to be online and interactive for others to use.
There seem to be so many options:
Jupyter (I don't use notebooks so I'm not that familiar with them but with mybinder I assume this is easy to do)
Plotly
Bokeh Server
pyviz.org (this is the really interesting one but again, there'd seem to be so many options on how to accomplish this)
Any thoughts or advice would be much appreciated.
There are indeed many options, and I'm not sure which is best, but I use Bokeh a lot and am happy with it. The example below can get you started. To launch it, open a command prompt in the directory where you saved the script and run "bokeh serve script.py --show --allow-websocket-origin=*".
from bokeh.plotting import figure
from bokeh.io import curdoc
from bokeh.models.widgets import Slider
from bokeh.models import Row,ColumnDataSource
#create the starting data
x=[0,1,2,3,4,5,6,7,8]
y_noise=[1,2,2.5,3,3.5,6,5,7,8]
slope=1 #set the starting value of the slope
intercept=0 #set the line to go through 0, you can change this later
y= [slope*i + intercept for i in x]#create the y data via a list comprehension
# create a plot
fig=figure() #create a figure
source=ColumnDataSource(dict(x=x, y=y)) #the data destined for the figure
fig.circle(x,y_noise)#add some datapoints to the plot
fig.line('x','y',source=source,color='red')#add a line to the figure
#create a slider and update the graph source data when it changes
def updateSlope(attrname, old, new):
    print(str(new)+" is the new slider value")
    y = [float(new)*i + intercept for i in x]
    source.data = dict(x=x, y=y)
slider = Slider(title="slope", value=slope, start=0.0, end=2.0,step=0.1)
slider.on_change('value', updateSlope)
layout=Row(fig,slider)#put figure and slider next to eachother
curdoc().add_root(layout)#serve it via "bokeh serve slider.py --show --allow-websocket-origin=*"
The --allow-websocket-origin=* option allows other users to reach the server and see the graph. The URL would be http://yourPCservername:5006/ (5006 is the default Bokeh port). If you don't want to serve from your PC, you can subscribe to a cloud service like Heroku: example.
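To adapt the example to the question's plot of A against B + scale*C, only the computation inside the callback needs to change. Here is a minimal sketch of that computation in plain Python. The function name `scaled_source_data` and the sample values are made up for illustration; the returned dict is the kind of thing you would assign to `source.data`.

```python
def scaled_source_data(a, b, c, scale):
    """Return ColumnDataSource-style data for plotting A against B + scale*C."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("A, B and C must have the same length")
    y = [b_i + scale * c_i for b_i, c_i in zip(b, c)]
    return dict(x=list(a), y=y)


# Made-up sample columns; the real data would have 40 points per column.
A = [0.0, 1.0, 2.0]
B = [1.0, 2.0, 3.0]
C = [10.0, 20.0, 30.0]

data = scaled_source_data(A, B, C, scale=0.5)
print(data["y"])  # [6.0, 12.0, 18.0]
```

Inside the slider callback, the assignment would then read something like `source.data = scaled_source_data(A, B, C, float(new))`, with the slider's range chosen to suit the scale you want to explore.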
Related
I have a dataset with millions of Latitude/Longitude points that we are plotting at high resolution using plotly-dash with a Densitymapbox:
data = pandas.DataFrame()
# ...
go.Densitymapbox(
    lat=data['Latitude'],
    lon=data['Longitude'],
    z=data['Count'],
    hoverinfo='skip',
    # ...
)
According to Mapbox, their library should support millions of points without issue, as shown by their demo: https://demos.mapbox.com/100mpoints/
When I try to do this, it does appear that Mapbox is able to handle the requests. However, in my implementation with plotly/dash, unlike the demo above, the browser is overwhelmed. The first load works fine (although it does use a lot of memory), but on a reload of the data, Chrome crashes and Firefox reports an out-of-memory error to the console and does not update the heatmap.
The data set I am using is 1093737 points. Doing back-of-the-napkin math, this should only be about 26 MB of data (1093737 * (8 + 8 + 8) bytes, roughly 25 MiB) for 2 double precision floating point values and 1 (64-bit) integer per point, and the amount of data sent to the browser does show this. However, the browser process balloons in memory to over 3.5 GB, and on subsequent reloads the browser appears to run out of memory.
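The estimate can be checked in a few lines. This sketch counts only the raw binary values (8 bytes each for latitude, longitude and count) and says nothing about how the browser represents them once they are loaded.

```python
# Back-of-the-envelope payload size for the point data described above.
N_POINTS = 1_093_737
BYTES_PER_POINT = 8 + 8 + 8  # latitude (float64) + longitude (float64) + count (int64)

total_bytes = N_POINTS * BYTES_PER_POINT
total_mb = total_bytes / 1e6      # decimal megabytes
total_mib = total_bytes / 2**20   # binary mebibytes

print(f"{total_bytes} bytes = {total_mb:.1f} MB = {total_mib:.1f} MiB")
# 26249688 bytes = 26.2 MB = 25.0 MiB
```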
Are there any facilities in dash/plotly to prevent this from taking down the browser? I do not need to interact with the points of density plot, and have set the hoverinfo='skip' to indicate that, but would like to keep the interactivity of the heatmap recalculating the overlay when the map zoom changes. I am investigating other alternatives such as rasterizing the heatmap server side using datashader, but that would remove this interactivity which I would like to keep.
LensPy was created to solve this exact problem. It is built on top of Plotly Dash to allow you to plot very large datasets while maintaining fluid interactivity. Here is an example of how you can achieve this with a Mapbox.
import pandas as pd
import plotly.express as px
from lenspy import DynamicPlot
df = pd.read_csv(
    'https://raw.githubusercontent.com/plotly/datasets/master/earthquakes-23k.csv')
fig = px.density_mapbox(df,
                        lat='Latitude', lon='Longitude',
                        z='Magnitude',
                        radius=10,
                        center=dict(lat=0, lon=180),
                        zoom=0,
                        mapbox_style="stamen-terrain")
plot = DynamicPlot(fig)
plot.show()
Disclaimer: I am the creator of LensPy.
I am writing in python and have all my functionalities for analyzing datasets. Now, I would like to turn these functions into a user-ready application that, in a sense, works like an .exe app. In bokeh, I saw that you could create a plot, table...etc; however, is it possible to create a GUI where you can:
upload a file for analysis
load functions from my written python to analyze the uploaded file
graph the results onto a graph
click different buttons that can take you to different (so-called) pages so that you can perform different functions.
Basically, it should go from one page to another, like a website where clicking a button takes you to the next page for another purpose and a home button takes you back. Could you potentially do this with Bokeh?
There are several examples of data web applications created using Bokeh at demo.bokeh.org. Here is one modeled after the "Shiny Movie Explorer", but written in pure Python/Bokeh (instead of R/Shiny).
You can find much more details about creating and deploying Bokeh applications in the Running a Bokeh Server chapter of the docs.
Here is a complete (but simpler) example that demonstrates the basic gist and structure:
import numpy as np
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Slider
from bokeh.plotting import figure
# Set up data
x = np.linspace(0, 4*np.pi, 200)
y = np.sin(x)
source = ColumnDataSource(data=dict(x=x, y=y))
# Set up plot
plot = figure(title="my sine wave")
plot.line('x', 'y', source=source, line_width=3, line_alpha=0.6)
# Set up widgets
freq = Slider(title="frequency", value=1.0, start=0.1, end=5.1, step=0.1)
# Set up callbacks
def update_data(attrname, old, new):
    # Get the current slider values and set new data
    k = freq.value
    x = np.linspace(0, 4*np.pi, 200)
    y = np.sin(k*x)
    source.data = dict(x=x, y=y)
freq.on_change('value', update_data)
# Set up layouts and add to document
curdoc().add_root(column(freq, plot))
curdoc().title = "Sliders"
To run this locally you'd execute:
bokeh serve --show myscript.py
For more sophisticated deployments (i.e. with proxies) or to embed directly in e.g. Flask, see the docs.
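For the file-upload part of the question, one possible route is Bokeh's `FileInput` widget, whose `value` property holds the file contents as a base64-encoded string. That is an assumption about how you might build it, not something the examples above cover. The sketch below shows only the decoding step, using the standard library. `parse_uploaded_csv` is a hypothetical helper, and the payload is simulated so the snippet runs without a server.

```python
import base64
import csv
import io


def parse_uploaded_csv(b64_value):
    """Decode base64 file contents and return the CSV rows as dicts."""
    text = base64.b64decode(b64_value).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Simulate what the widget would deliver for a small uploaded file.
raw = "x,y\n1,2\n3,4\n"
payload = base64.b64encode(raw.encode("utf-8")).decode("ascii")

rows = parse_uploaded_csv(payload)
print(rows)
```

In a Bokeh app this would be called from a callback registered with `on_change('value', ...)` on the widget, and the parsed rows could then be passed to your own analysis functions and into a ColumnDataSource for plotting.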
What I'm trying to do is make an interactive scatter plot where I can control which columns of a DataFrame are on X and Y axes and then select a subset of data using lasso or something similar. Because of the dataset size I have to use datashader.
I tried to declare the DynamicMap as:
dmap = hv.DynamicMap(selector.make_view, kdims=[], streams=[selector, RangeX(), RangeY(), Stream.define('Next')()])
and have a custom callback on the lasso which would select desired rows of data, create the visual representation and update the plot with dmap.event().
So that doesn't seem to work. If I select something, the plot gets updated only when I pan or zoom or change axes selection. VIDEO
If I leave only Stream.define('Next')():
dmap = hv.DynamicMap(selector.make_view, kdims=[], streams=[Stream.define('Next')()])
then the lasso updates the plot, but I lose everything else, including the ability to zoom. VIDEO
I hope this question makes sense. If needed, I've pushed the notebook here.
I was hoping to rely on the order of the dataframe to sort by group size of various clusters of data such that the most populous levels of a classification appear early in the data frame and small rare populations appear at the end. The goal I'm pursuing is to ensure that my rare populations always appear on top in the z-order of my scatter plot.
I experimented with a simple example of stacked circles and discovered that the z-order is not what I expected from the order in which I defined them in the original dataframe.
Here's a minimal example to demonstrate:
import pandas
import numpy
from bokeh.application.handlers import FunctionHandler
from bokeh.application import Application
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.server.server import Server
def modify_doc(doc):
    df = pandas.DataFrame()
    theta = numpy.linspace( 0 , 2*numpy.pi , 20 )
    colors = ['yellow' if (c % 2 == 0) else 'blue' for c in range(len(theta))]
    df['X'] = numpy.cos(theta)
    df['Y'] = numpy.sin(theta)
    source = ColumnDataSource(data=df) # does this change the order?
    plot = figure()
    plot.circle('X', 'Y', source=source, radius=0.22 , fill_alpha=1, color=colors)
    plot.add_tools( HoverTool( tooltips=[ ( '(x,y)', '$x,$y') , ( 'index' , "$index" ) ] ) )
    doc.add_root(plot)
bokeh_app = Application(FunctionHandler(modify_doc))
# Setting num_procs here means we can't touch the IOLoop before now, we must
# let Server handle that. If you need to explicitly handle IOLoops then you
# will need to use the lower level BaseServer class.
server = Server({'/': bokeh_app})
server.start()
if __name__ == '__main__':
    print('Opening Bokeh application on http://localhost:5006/')
    server.io_loop.add_callback(server.show, "/")
    server.io_loop.start()
I'm finding two things confusing here. I expected the order to run counterclockwise, with each disc in the first quadrant lying beneath the subsequent disc. Instead I see discs on top with the subsequent disc underneath. Such rendering would be consistent with reverse plotting, where the last data point in the dataframe is plotted first and so on back to the first data point. However, I also see other inconsistencies: two discs are each eclipsed by two other discs, which I can't explain at all, apart from wondering whether a ColumnDataSource rearranges my data so that the renderer obeys the order in the rearranged ColumnDataSource and not my original DataFrame. Is this accurate? How does Bokeh settle on a z-order with respect to a DataFrame's row order? Is there any predictable relationship between the two?
The real problem about the clustering is that we have a full event record with several hundred thousand data points. The algorithms subsample the data to classify with, and then I take that subsample classifications and conditionally color data points by those labels. The bulk of the data is unsampled and I'd like it to essentially play to the background. Both sampled and unsampled data are in the same ColumnDataSource which is convenient instead of plotting two distinct glyphs which I may consider to force a z-order. In this scatter plot below, gray data points represent unsampled data.
The ColumnDataSource does not ever change the order of the data. However, in order to optimize drawing and hit-testing, points are copied from the CDS and put into a spatial index by the glyph views. The order in which points are returned when the index is queried is not specified, which explains the result you are seeing.
It's possible that an option could be added to disable spatial indexing (at least for drawing; it will always be necessary to make hit-testing performant on non-trivial datasets), but this would require new development, so a GitHub issue to request the feature would be the next step. It should not be a hard task, but the core team is overextended, so if you have the ability to collaborate and become a contributor, that would be the quickest path to getting it added.
All that said, if you need to display hundreds of thousands of points, you may want to have a look at Datashader, which is a fast, configurable rendering pipeline for larger data sets that integrates closely with Bokeh. (It has been demonstrated interactively exploring hundreds of millions of points on a laptop on many occasions.)
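Until such an option exists, the two-glyph approach mentioned in the question is one way to get a predictable z-order. As far as I know, renderers are drawn in the order they are added to the figure, so drawing the unsampled points first and the labelled points second should keep the rare populations on top. The sketch below shows only the split, in plain Python. `split_by_label`, the column names and the use of `None` to mark unsampled rows are all assumptions for illustration.

```python
def split_by_label(columns, label_key="label", unsampled=None):
    """Split column-oriented data into (background, foreground) dicts."""
    labels = columns[label_key]
    back_idx = [i for i, lab in enumerate(labels) if lab == unsampled]
    fore_idx = [i for i, lab in enumerate(labels) if lab != unsampled]

    def take(idx):
        # Keep the original row order within each subset.
        return {key: [values[i] for i in idx] for key, values in columns.items()}

    return take(back_idx), take(fore_idx)


# Made-up data: None marks rows that were not part of the classified subsample.
columns = {
    "X": [0.0, 1.0, 2.0, 3.0],
    "Y": [0.0, 1.0, 0.0, 1.0],
    "label": [None, "rare", None, "common"],
}
background, foreground = split_by_label(columns)
print(background["X"], foreground["X"])  # [0.0, 2.0] [1.0, 3.0]
```

Each dict can back its own ColumnDataSource, with the background glyph added to the figure before the foreground one. The order of points within a single glyph is still not guaranteed.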
I am trying to plot something with a huge number of data points (2-3 million) using plotly.
When I run
py.iplot(fig, filename='test plot')
I get the following error:
Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points
If the visualization you're using aggregates points (e.g., box plot, histogram, etc.) you can disregard this warning.
So then I try to save it with this:
py.image.save_as(fig, 'my_plot.png')
But then I get this error:
PlotlyRequestError: Unknown Image Server Error
How do I do this properly? I don't care if it's a still image or an interactive display within my notebook.
Plotly really seems to be very bad at this. I am just trying to create a boxplot with 5 million points, which is no problem with the simple R function "boxplot", but plotly calculates endlessly for this.
Improving this should be treated as a major issue. Not all of the data has to be saved (and shown) in the plotly object; I guess this is the main problem.
One option would be down-sampling your data, though I'm not sure if you'd like that:
https://github.com/devoxi/lttb-py
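To give an idea of what such down-sampling does, here is a minimal pure-Python sketch of the Largest-Triangle-Three-Buckets idea. It is written from the general description of the algorithm and is not the linked library's API. The function name `lttb` and the test data are my own.

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) points to `threshold` points with LTTB."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)  # nothing to do, or too few buckets to work with

    every = (n - 2) / (threshold - 2)  # bucket width, excluding both endpoints
    sampled = [points[0]]              # always keep the first point
    a = 0                              # index of the most recently kept point

    for i in range(threshold - 2):
        # The average of the *next* bucket serves as the third triangle corner.
        nxt_start = int((i + 1) * every) + 1
        nxt_end = min(int((i + 2) * every) + 1, n)
        nxt = points[nxt_start:nxt_end]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)

        # Keep the point in the current bucket that forms the largest triangle.
        start = int(i * every) + 1
        end = int((i + 1) * every) + 1
        ax, ay = points[a]
        best, best_area = start, -1.0
        for j in range(start, end):
            px, py = points[j]
            area = abs((ax - avg_x) * (py - ay) - (ax - px) * (avg_y - ay)) / 2
            if area > best_area:
                best, best_area = j, area
        sampled.append(points[best])
        a = best

    sampled.append(points[-1])  # always keep the last point
    return sampled


# Flat line with a single spike: the spike should survive downsampling.
data = [(float(x), 0.0) for x in range(100)]
data[50] = (50.0, 100.0)
small = lttb(data, 10)
print(len(small), (50.0, 100.0) in small)  # 10 True
```

This assumes the points are already sorted by x, as in a time series or line chart. For an unordered scatter plot it is less meaningful, and random subsampling may be a simpler choice.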
I also have problems with plotly in the browser with large datasets - if anyone has solutions, please write!
Thank you!
You can try the render_mode argument. Example:
import plotly.express as px
import pandas as pd
import numpy as np
N = int(1e6) # Number of points
df = pd.DataFrame(dict(x=np.random.randn(N),
                       y=np.random.randn(N)))
fig = px.scatter(df, x="x", y="y", render_mode='webgl')
fig.update_traces(marker_line=dict(width=1, color='DarkSlateGray'))
fig.show()
On my computer, N=1e6 takes about 5 seconds until the plot is visible, and the interactivity is still very good. With N=10e6 it takes about 1 minute and the plot is no longer responsive (i.e. it is really slow to zoom, pan or do anything else).