How to make a scatter plot with non-numerical column?


If I want to make a scatter plot in matplotlib I can do:
import pandas as pd
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
import matplotlib.pyplot as plt
df = pd.DataFrame({'a': range(1, 6), 'b': list('ABCDE')})
plt.scatter(df['a'], df['b'])
plt.show()
Which gives
How would I get the same output in bokeh?
I tried (same set-up as above):
source = ColumnDataSource(df)
p = figure(
title="Something great",
tools='save,pan,box_zoom,reset,wheel_zoom',
background_fill_color="#fafafa"
)
p.scatter(
'a',
'b',
source=source
)
show(p)
but that does not plot anything. If I plot column a against itself it works fine, suggesting that the code structure is fine, but that it only works for numerical values. Is there a quick fix to this?

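Passing the y_range parameter to figure fixed the issue for me: it tells Bokeh that the y axis is categorical and which factors it contains.
I found it in the Bokeh user guide section "Handling Categorical Data".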
p = figure(
y_range=df['b'], # < -- what I added
title="Something great",
tools='save,pan,box_zoom,reset,wheel_zoom',
background_fill_color="#fafafa"
)
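For reference, here is a minimal end-to-end sketch of that fix, using the same toy DataFrame as the question. The `factors` name is introduced here for illustration only. Passing an explicit list of unique labels instead of the Series itself makes the axis order visible.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

df = pd.DataFrame({'a': range(1, 6), 'b': list('ABCDE')})
source = ColumnDataSource(df)

# Unique category labels in order of first appearance (illustrative name).
factors = df['b'].unique().tolist()

# A list of strings as y_range makes the y axis categorical.
p = figure(y_range=factors, title="Something great")
p.scatter('a', 'b', source=source, size=10)

# show(p)  # uncomment to render in a browser or notebook
```

If a label appears in the data but not in `y_range`, Bokeh cannot place it on the axis, so the list should cover every value in the column.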

Related

Bokeh scatter plot: is it possible to overlay a line colored by category?

I have a dataframe that details sales of various product categories vs. time. I'd like to make a "line and marker" plot of sales vs. time, per category. To my surprise, this appears to be very difficult in Bokeh.
The scatter plot is easy. But then trying to overplot a line of sales vs. date with the same source (so I can update both scatter and line plots in one go when the source updates) and in such a way that the colors of the line match the colors of the scatter plot markers proves near impossible.
Minimal reproducible example with contrived data:
import pandas as pd
df = pd.DataFrame({'Date':['2020-01-01','2020-01-02','2020-01-01','2020-01-02'],\
'Product Category':['shoes','shoes','grocery','grocery'],\
'Sales':[100,180,21,22],'Colors':['red','red','green','green']})
df['Date'] = pd.to_datetime(df['Date'])
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()
source = ColumnDataSource(df)
plot = figure(x_axis_type="datetime", plot_width=800, toolbar_location=None)
plot.scatter(x="Date", y="Sales", size=15, source=source, fill_color="Colors", fill_alpha=0.5,
             line_color="Colors", legend="Product Category")
for cat in list(set(source.data['Product Category'])):
    tmp = source.to_df()
    col = tmp[tmp['Product Category']==cat]['Colors'].values[0]
    # each pass draws one line through ALL rows of the shared source
    plot.line(x="Date", y="Sales", source=source, line_color=col)
show(plot)
Here's what it looks like, which is clearly wrong:
Here's what I want and don't know how to make:
Can Bokeh not make such plots, where scatter markers and lines have the same color per category, with a legend?

With Bokeh it is often helpful to first think about the visualisation you want and then structure the data source accordingly. You want two lines, one per category, with time on the x axis and sales on the y axis. A natural way to structure the data source is then one column per category:
import pandas as pd
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
output_notebook()
# one column per category; dates are parsed so the datetime axis gets real timestamps
df = pd.DataFrame({'Date': pd.to_datetime(['2020-01-01', '2020-01-02']),
                   'Shoe Sales': [100, 180],
                   'Grocery Sales': [21, 22]
                  })
source = ColumnDataSource(df)
plot = figure(x_axis_type="datetime", plot_width=800, toolbar_location=None)
categories = ["Shoe Sales", "Grocery Sales"]
colors = {"Shoe Sales": "red", "Grocery Sales": "green"}
for category in categories:
    # legend_label replaces the deprecated legend argument (Bokeh 1.4 and later)
    plot.scatter(x="Date", y=category, size=15, source=source, fill_color=colors[category], legend_label=category)
    plot.line(x="Date", y=category, source=source, line_color=colors[category])
show(plot)
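If the data arrives in the long format used in the question, it does not have to be retyped by hand. The sketch below is one possible way to reshape it with pandas, assuming each (date, category) pair occurs only once. The `wide` name is illustrative.

```python
import pandas as pd

# Long-format data, as in the question above.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02']),
    'Product Category': ['shoes', 'shoes', 'grocery', 'grocery'],
    'Sales': [100, 180, 21, 22],
})

# One row per date, one column per category.
wide = df.pivot(index='Date', columns='Product Category', values='Sales').reset_index()
wide.columns.name = None  # drop the leftover "Product Category" axis name

print(wide)
```

The resulting columns (`grocery`, `shoes`) can then play the role of the per-category columns in the loop above.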

The solution is to group your data. Then you can plot one line for each group.
Minimal Example
import pandas as pd
from bokeh.plotting import figure, show, output_notebook
output_notebook()
df = pd.DataFrame({'Date':['2020-01-01','2020-01-02','2020-01-01','2020-01-02'],
'Product Category':['shoes','shoes','grocery','grocery'],
'Sales':[100,180,21,22],'Colors':['red','red','green','green']})
df['Date'] = pd.to_datetime(df['Date'])
plot = figure(x_axis_type="datetime",
plot_width=400,
plot_height=400,
toolbar_location=None
)
plot.scatter(x="Date",
y="Sales",
size=15,
source=df,
fill_color="Colors",
fill_alpha=0.5,
line_color="Colors",
legend_field="Product Category"
)
for color in df['Colors'].unique():
    plot.line(x="Date", y="Sales", source=df[df['Colors']==color], line_color=color)
show(plot)
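The same grouping idea can also be sketched with `DataFrame.groupby`, which gives each category its own data source shared by its line and its markers. This is an illustrative variant, not the answerer's code. The `palette` and `sources` names are assumptions, and `legend_label` needs Bokeh 1.4 or later.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02']),
    'Product Category': ['shoes', 'shoes', 'grocery', 'grocery'],
    'Sales': [100, 180, 21, 22],
})
# Illustrative category-to-color mapping (not part of the original data).
palette = {'shoes': 'red', 'grocery': 'green'}

plot = figure(x_axis_type="datetime", toolbar_location=None)
sources = {}
for category, group in df.groupby('Product Category'):
    # One source per category, used by both the line and the markers.
    sources[category] = ColumnDataSource(group.sort_values('Date'))
    color = palette[category]
    plot.line(x="Date", y="Sales", source=sources[category],
              line_color=color, legend_label=category)
    plot.scatter(x="Date", y="Sales", source=sources[category], size=15,
                 fill_color=color, line_color=color, fill_alpha=0.5,
                 legend_label=category)

# show(plot)  # uncomment to render
```

Updating `sources['shoes'].data` later would then refresh both the line and the markers for that category in one step.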

Cannot import name 'Scatter' from 'bokeh.plotting'

I am trying to represent the data using the bokeh scatter.
Here is my code:
from bokeh.plotting import Scatter, output_file, show
import pandas
df=pandas.Dataframe(colume["X","Y"])
df["X"]=[1,2,3,4,5,6,7]
df["Y"]=[23,43,32,12,34,54,33]
p=Scatter(df,x="X",y="Y", title="Day Temperature measurement", xlabel="Tempetature", ylabel="Day")
output_file("File.html")
show(p)
The output should look like this:
(image: Expected Output)
The error is:
ImportError                               Traceback (most recent call last)
<ipython-input-14-1730ac6ad003> in <module>
----> 1 from bokeh.plotting import Scatter, output_file, show
      2 import pandas
      3
      4 df=pandas.Dataframe(colume["X","Y"])
      5
ImportError: cannot import name 'Scatter' from 'bokeh.plotting' (C:\Users\LENOVO\Anaconda3\lib\site-packages\bokeh\plotting\__init__.py)
I also found that Scatter is no longer maintained. Is there any way to use it?
Also, what alternatives do I have for representing the data the same way as Scatter, using any other Python library?
Would using an older version of Bokeh resolve this issue?

Scatter (with a capital S) has never been part of bokeh.plotting. It used to be part of the old bokeh.charts API, which was removed several years ago. However, it is not needed at all to create basic scatter plots, since the glyph methods in bokeh.plotting (e.g. circle, square) are implicitly scatter-type functions to begin with:
from bokeh.plotting import figure, show
import pandas as pd
df = pd.DataFrame({"X" :[1,2,3,4,5,6,7],
"Y": [23,43,32,12,34,54,33]})
p = figure(x_axis_label="Tempetature", y_axis_label="Day",
title="Day Temperature measurement")
p.circle("X", "Y", size=15, source=df)
show(p)
Which yields:
You can also just pass the data directly to circle as in the other answer.
If you want to do fancier things, like mapping the marker type based on a column, there is also a scatter method (lowercase s) on the figure:
from bokeh.plotting import figure, show
from bokeh.sampledata.iris import flowers
from bokeh.transform import factor_cmap, factor_mark
SPECIES = ['setosa', 'versicolor', 'virginica']
MARKERS = ['hex', 'circle_x', 'triangle']
p = figure(title = "Iris Morphology")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Sepal Width'
p.scatter("petal_length", "sepal_width", source=flowers, legend_field="species", fill_alpha=0.4, size=12,
marker=factor_mark('species', MARKERS, SPECIES),
color=factor_cmap('species', 'Category10_3', SPECIES))
show(p)
which yields:

If you look up "scatter" in the docs, you'll find
Scatter Markers
To scatter circle markers on a plot, use the circle() method of Figure:
from bokeh.plotting import figure, output_file, show
# output to static HTML file
output_file("line.html")
p = figure(plot_width=400, plot_height=400)
# add a circle renderer with a size, color, and alpha
p.circle([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], size=20, color="navy", alpha=0.5)
# show the results
show(p)
To work with dataframes, just pass in the columns like df.X and df.Y to the x and y args.
from bokeh.plotting import figure, show, output_file
import pandas as pd
df = pd.DataFrame(columns=["X","Y"])
df["X"] = [1,2,3,4,5,6,7]
df["Y"] = [23,43,32,12,34,54,33]
p = figure()
p.scatter(df.X, df.Y, marker="circle")
#from bokeh.io import output_notebook
#output_notebook()
show(p) # or output to a file...

How to properly handle datetime and categorical axes in bokeh/holoviews heatmap plot?

I'm trying to plot a simple heatmap using bokeh/holoviews. My data (a pandas DataFrame) has categoricals on y and datetimes on x. The problem is that there are more than 3000 categorical elements, so the tick labels on the y axis overlap and the plot becomes useless. Is there currently a reliable way in bokeh to show only a subset of the ticks based on the zoom level?
I've already tried plotly and the result looks perfect, but I need to use bokeh/holoviews and datashader. I also want to avoid replacing the categorical ticks with numerical ones.
I've also tried this solution, but it doesn't work (bokeh 1.2.0).
This is a toy example representing my use case (here #y is 1000, but it gives the idea):
from datetime import datetime
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap
from bokeh.io import output_notebook
output_notebook()
# build sample data
index = pd.date_range(start='1/1/2019', periods=1000, freq='T')
data = np.random.rand(1000,100)
columns = ['col'+ str(n) for n in range(100)]
# initial data format
df = pd.DataFrame(data=data, index=index, columns=columns)
# bokeh
df = df.stack().reset_index()
df.rename(columns={'level_0':'x','level_1':'y', 0:'z'},inplace=True)
df.sort_values(by=['y'],inplace=True)
x = [
date.to_datetime64().astype('M8[ms]').astype('O')
for date in df.x.to_list()
]
data = {
'value': df.z.to_list(),
'x': x,
'y': df.y.to_list(),
'date' : df.x.to_list()
}
p = figure(x_axis_type='datetime', y_range=columns, width=900, tooltips=[("x", "@date"), ("y", "@y"), ("value", "@value")])
p.rect(x='x', y='y', width=60*1000, height=1, line_color=None,
fill_color=linear_cmap('value', 'Viridis256', low=df.z.min(), high=df.z.max()), source=data)
show(p)

In the end, I partially followed the suggestion from James and got it to work using a Python callback for the tick formatter. This solution was hard for me to find; I searched the Bokeh docs, examples and source code for days.
The main problem for me was that the docs do not mention how to use ColumnDataSource objects in the custom callback:
https://docs.bokeh.org/en/1.2.0/docs/reference/models/formatters.html#bokeh.models.formatters.FuncTickFormatter.from_py_func
In the end, this helped a lot:
https://docs.bokeh.org/en/1.2.0/docs/user_guide/interaction/callbacks.html#customjs-with-a-python-function
So I modified the original code as follows, in the hope that it is useful to someone:
from datetime import datetime
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show
from bokeh.transform import linear_cmap
from bokeh.io import output_notebook
from bokeh.models import FuncTickFormatter
from bokeh.models import ColumnDataSource
output_notebook()
# build sample data
index = pd.date_range(start='1/1/2019', periods=1000, freq='T')
data = np.random.rand(1000,100)
columns_labels = ['col'+ str(n) for n in range(100)]
columns = [n for n in range(100)]
# initial data format
df = pd.DataFrame(data=data, index=index, columns=columns)
# bokeh
df = df.stack().reset_index()
df.rename(columns={'level_0':'x','level_1':'y', 0:'z'},inplace=True)
df.sort_values(by=['y'],inplace=True)
x = [
date.to_datetime64().astype('M8[ms]').astype('O')
for date in df.x.to_list()
]
data = {
'value': df.z.to_list(),
'x': x,
'y': df.y.to_list(),
'y_labels_tooltip' : [columns_labels[k] for k in df.y.to_list()],
'y_ticks' : columns_labels*1000,
'date' : df.x.to_list()
}
cd = ColumnDataSource(data=data)
def ticker(source=cd):
    # `tick` is supplied by Bokeh when this function is compiled to JavaScript
    labels = source.data['y_ticks']
    return "{}".format(labels[tick])
#p = figure(x_axis_type='datetime', y_range=columns, width=900, tooltips=[("x", "@date{%F %T}"), ("y", "@y_labels"), ("value", "@value")])
p = figure(x_axis_type='datetime', width=900, tooltips=[("x", "@date{%F %T}"), ("y", "@y_labels_tooltip"), ("value", "@value")])
p.rect(x='x', y='y', width=60*1000, height=1, line_color=None,
fill_color=linear_cmap('value', 'Viridis256', low=df.z.min(), high=df.z.max()), source=cd)
p.hover.formatters = {'date': 'datetime'}
p.yaxis.formatter = FuncTickFormatter.from_py_func(ticker)
p.yaxis[0].ticker.desired_num_ticks = 20
show(p)
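As a hedged alternative to the formatter callback, and not what the answer above uses, the axis property `major_label_overrides` can map numeric tick positions to text labels without any Python-to-JavaScript translation. The sketch below assumes the same integer y positions and `col…` labels as the code above. The `overrides` name is illustrative.

```python
from bokeh.plotting import figure

columns_labels = ['col' + str(n) for n in range(100)]

p = figure(x_axis_type='datetime')

# Map each integer y position to its text label.
overrides = {n: label for n, label in enumerate(columns_labels)}
p.yaxis.major_label_overrides = overrides

# Ask the default ticker for roughly 20 ticks so labels do not overlap.
p.yaxis[0].ticker.desired_num_ticks = 20
```

Only ticks that land exactly on a key are relabelled, so after zooming in far enough to get fractional ticks the plain numbers would show again. Whether that is acceptable depends on the use case.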

Line plot with data points in pandas

Using pandas I can easily make a line plot:
import pandas as pd
import numpy as np
# to use it in jupyter notebooks
%matplotlib inline
df = pd.DataFrame(np.random.randn(50, 4),
index=pd.date_range('1/1/2000', periods=50), columns=list('ABCD'))
df = df.cumsum()
df.plot();
But I can't figure out how to also plot the data as points over the lines, as in this example:
This matplotlib example seems to suggest the direction, but I can't find how to do it using pandas' plotting capabilities. I am especially interested in learning how to do it with pandas because I am always working with DataFrames.
Any clues?

You can use the style kwarg to the df.plot command. From the docs:
style : list or dict
matplotlib line style per column
So, you could either just set one linestyle for all the lines, or a different one for each line.
e.g. this does something similar to what you asked for:
df.plot(style='.-')
To define a different marker and linestyle for each line, you can use a list:
df.plot(style=['+-','o-','.--','s:'])
You can also pass the markevery kwarg onto matplotlib's plot command, to only draw markers at a given interval
df.plot(style='.-', markevery=5)
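Since the docs quoted above say that style may also be a dict, here is a small sketch of per-column styles keyed by column name. It assumes a two-column frame built like the one in the question. The chosen format strings are only examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(50, 2),
                  index=pd.date_range('1/1/2000', periods=50),
                  columns=list('AB')).cumsum()

# Circles on a solid line for A, squares on a dashed line for B;
# markevery=5 draws a marker on every fifth point only.
ax = df.plot(style={'A': 'o-', 'B': 's--'}, markevery=5)
```

`df.plot` returns the matplotlib Axes, so further tweaks (labels, limits) can be applied to `ax` afterwards.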

You can use the markevery argument in df.plot(), like so:
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
df.plot(linestyle='-', markevery=100, marker='o', markerfacecolor='black')
plt.show()
markevery also accepts a list of specific point positions, if that's what you want.
To mark specific dates, you can define a function that finds their positions in the index:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
dates = ["2001-01-01","2002-01-01","2001-06-01","2001-11-11","2001-09-01"]
def find_loc(df, dates):
    marks = []
    for date in dates:
        marks.append(df.index.get_loc(date))
    return marks
df.plot(linestyle='-', markevery=find_loc(df, dates), marker='o', markerfacecolor='black')
plt.show()
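The lookup can also be sketched without an explicit loop, using `Index.get_indexer`. This is an illustrative variant under the same assumptions as above (a daily DatetimeIndex starting on 2000-01-01). The `positions` and `marks` names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4),
                  index=pd.date_range('1/1/2000', periods=1000),
                  columns=list('ABCD')).cumsum()
dates = ["2001-01-01", "2002-01-01", "2001-06-01"]

# Integer positions of the dates in the index; -1 marks a date that is absent.
positions = df.index.get_indexer(pd.to_datetime(dates))
marks = sorted(int(pos) for pos in positions if pos >= 0)

# marks can then be passed on, e.g.:
# df.plot(linestyle='-', markevery=marks, marker='o', markerfacecolor='black')
```

Unlike `get_loc`, which raises a `KeyError` for a missing date, this variant silently skips dates that are not in the index.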

Bokeh plot displaying without data

I am running the following code to render a plot with dates in the x axis and floats in the y axis:
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.models import DatetimeTickFormatter
from bokeh.charts import Bar, Line, show
def datetime(x):
    return pd.DataFrame(x, dtype='datetime64')
openxbids = pd.read_csv('data')
openxbids.sort_values('date')
output_file("lines.html")
p = figure(width=800, height=250, x_axis_type="datetime")
p.line(datetime(openxbids['date']), openxbids['bids'], color = 'navy', alpha=0.5)
show(p)
However, when I run this, I get a graph without any data plotted. The x and y axis ranges seem to be correctly detected. What am I missing?
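The thread as captured here contains no answer, so the following is only a hedged sketch of how such data is commonly prepared, not a confirmed diagnosis. It assumes the `date` column holds parseable date strings and uses hypothetical inline values in place of the CSV file. Two things differ from the code above: the dates are parsed with `pd.to_datetime`, and the result of `sort_values` is assigned back, since that method returns a new DataFrame instead of sorting in place.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Hypothetical sample standing in for pd.read_csv('data').
openxbids = pd.DataFrame({
    'date': ['2017-01-03', '2017-01-01', '2017-01-02'],
    'bids': [0.8, 1.2, 1.0],
})

# Parse the strings into real timestamps and keep the sorted result.
openxbids['date'] = pd.to_datetime(openxbids['date'])
openxbids = openxbids.sort_values('date').reset_index(drop=True)

source = ColumnDataSource(openxbids)
p = figure(width=800, height=250, x_axis_type="datetime")
p.line('date', 'bids', source=source, color='navy', alpha=0.5)

# show(p)  # uncomment to render
```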
