I love how easy it is to set up basic hover feedback with HoverTool, but I'm wrestling with a couple of aspects of the display. I have time-series data, with measurements that represent amounts in US$. This data starts out life as a pandas.Series. Legible plotting is easy (the following assumes a Jupyter notebook):
from bokeh.models import NumeralTickFormatter
from bokeh.plotting import figure, show

p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line(my_data.index, my_data)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
show(p)
This shows me the time-series, with date-formatting on the x-axis and y-axis values that look like "$150,000", "$200,000", "$250,000", etc. I have two questions about HoverTool behavior:
Controlling formatting for $x and $y.
Accessing the name of the dataset under the cursor.
Simply adding a HoverTool allows me to see values, but in unhelpful units:
p.add_tools(HoverTool())
The corresponding tooltip values with these defaults show "1.468e+5" rather than "$146,800" (or even "146800", the underlying Series value); similarly, the date value appears as "1459728000000" rather than (say) "2016-04-04". I can manually work around this display issue by making my pandas.Series into a ColumnDataSource and adding string columns with the desired formatting:
# Make sure Series and its index have `name`, before converting to DataFrame
my_data.name = 'revenue'
my_data.index.name = 'day'
df = my_data.reset_index()
# Add str columns for tooltip display
df['daystr'] = df['day'].dt.strftime('%m %b %Y')
df['revstr'] = df['revenue'].apply(lambda x: '${:,d}'.format(int(x)))
cds = ColumnDataSource(df)
p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line('day', 'revenue', source=cds)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
p.add_tools(HoverTool(tooltips=[('Amount', '@revstr'), ('Day', '@daystr')]))
show(p)
but is there a way to handle the formatting in the HoverTool configuration instead? That seems much more desirable than all the data-set transformation that's required above. I looked through the documentation and (quickly) scanned through the source, and didn't see anything obvious that would save me from building the "output" columns as above.
Related to that, when I have several lines in a single plot, is there any way for me to access the name (or perhaps legend value) of each line within HoverTool.tooltips? It would be extremely helpful to include something in the tooltip to differentiate which dataset values are coming from, rather than needing to rely on (say) line-color in conjunction with the tool-tip display. For now, I've added an additional column to the ColumnDataSource that's just the string value I want to show; that obviously only works for datasets that include a single measurement column. When multiple lines are sharing an underlying ColumnDataSource, it would be sufficient to access the column-name that's provided to y.
Hey, I know it's 2 years late, but this is for other people who come across this:
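p.add_tools(HoverTool(
    tooltips=[
        ('Date', '@Date{%F}'),
        ('Value', '@Value{int}')],
    # Keys name the columns to format; newer Bokeh releases expect
    # them with the same '@' prefix, e.g. '@Date'.
    formatters={
        'Date': 'datetime',
        'Value': 'numeral'},
    mode='vline'
))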
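For readers on a newer Bokeh release, a minimal sketch of the same idea follows. It assumes Bokeh 2.0 or later, where (to my understanding) the `formatters` keys carry the same `@` prefix as the tooltip fields. The column names `day` and `revenue` and the sample values are made up for illustration. The `$name` special variable displays the `name` given to the glyph under the cursor, which is one way to tell several lines apart.

```python
import numpy as np
from bokeh.models import ColumnDataSource, HoverTool, NumeralTickFormatter
from bokeh.plotting import figure

# Illustrative data only: three daily revenue values in US$.
source = ColumnDataSource(data={
    "day": np.array(["2016-04-04", "2016-04-05", "2016-04-06"],
                    dtype="datetime64[ms]"),
    "revenue": [146800, 151250, 149900],
})

p = figure(title="Example currency", x_axis_type="datetime", tools="")
# The glyph's `name` is what `$name` shows in the tooltip.
p.line("day", "revenue", source=source, name="revenue")
p.yaxis[0].formatter = NumeralTickFormatter(format="$0,0")

hover = HoverTool(
    tooltips=[
        ("Series", "$name"),
        ("Day", "@day{%F}"),            # strftime-style date format
        ("Amount", "@revenue{$0,0}"),   # numeral-style currency format
    ],
    # Only the date column needs a non-default formatter.
    formatters={"@day": "datetime"},
    mode="vline",
)
p.add_tools(hover)
# In a notebook: from bokeh.io import show; show(p)
```

With this approach the extra string columns from the question would not be needed, since the formatting lives in the tooltip specification itself.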
I'm trying to output a Pandas dataframe as a pyplot table to be inserted into a presentation, but I'm running into problems with the formatting of the values in the table itself.
This is one part of a much larger project, so I'll try to give as much code and context as I can:
import matplotlib.pyplot as plt
import pandas as pd

values = []
for item in scenarios:
    value = dataframe.loc[dataframe['Scenario'] == item, 'Value'].sum()
    # Formatting for scientific notation with 2 decimal points (so, 3 sig figs)
    values.append("{:.2e}".format(value))
# Create a dataframe to hold this new, visual-ready data and add the generated values to it.
visual_data = pd.DataFrame(scenarios, columns=["Scenario"])
visual_data["Data"] = values
# Formatting the data as a float so Pandas can sort it properly.
visual_data["Data"] = visual_data["Data"].astype(float)
# Sorting. By default, Pandas sorts in ascending order.
visual_data.sort_values(by=['Data'], inplace=True)
# Matplotlib code.
fig, ax = plt.subplots()
# fig.patch.set_visible(False)
ax.axis('off')
ax.axis('tight')
table = ax.table(cellText=visual_data.to_numpy(), colLabels=visual_data.columns, loc='center', cellLoc='center')
fig.tight_layout()
plt.show()
The problem I run into is how everything in the table itself displays. The data I'm working with runs from 1e-8 to 1e-4. When I just have my sorted dataframe, everything is formatted in proper scientific notation from lowest to highest values.
As soon as I insert the data into the table (the '.to_numpy' statement when creating the table) the output looks something like the following:
[['DUJ' 3.964e-08]
['DUE' 4.467e-08]
['DUD' 1.172e-07]
['DUC' 2.098e-07]
['DUG' 2.136e-07]
...
['DUN' 7.356e-05]
['MCC' 0.0001046]
['ALU' 0.0001652]]
The final two entries render as standard floats instead of in scientific notation like all the other entries in the table.
I know that NumPy's "set_printoptions" has a "suppress" option that affects this kind of behavior, but I can't figure out how to enable it (or disable it, as it were).
Any ideas?
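One possible explanation and workaround, offered as a sketch rather than a definitive answer: once the column is converted back to float, the table cells appear to be rendered from the plain float values, and Python's default float formatting only switches to exponent notation below 1e-4, so 0.0001046 prints in fixed-point form. Sorting on the numeric values first and turning them into strings only when building the cell text keeps every entry in the same notation. The helper name `sci_cell_text` and the sample rows are illustrative.

```python
def sci_cell_text(rows, digits=2):
    """Sort (label, value) pairs by value, then format each value as a string."""
    ordered = sorted(rows, key=lambda row: row[1])
    return [[label, f"{value:.{digits}e}"] for label, value in ordered]


# A few of the values from the output above, deliberately out of order.
rows = [("MCC", 0.0001046), ("DUJ", 3.964e-08), ("DUN", 7.356e-05)]
cell_text = sci_cell_text(rows)
for row in cell_text:
    print(row)
```

The resulting list of lists could be passed as `cellText=cell_text` to `ax.table` in place of `visual_data.to_numpy()`.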
I would like to find a way to modify the labels on holoviews sankey diagrams so that they show, in addition to the numerical values, also the percentage values.
For example:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1]}
df = pd.DataFrame(data, columns=['A','B','values'])
sankey = hv.Sankey(df)
For 'From' label 'YY' which is 'YY - 8' change this to 'YY - 8 (13.7%)' - add the additional percentage in there.
I have found ways to change from the absolute value to percentage by using something along the lines of:
value_dim = hv.Dimension('Percentage', unit='%')
But can't find a way to have both values in the label.
Additionally, I tried to modify the hover tag. In my search to find ways to modify this I found ways to reference and display various attributes in the hover information (through the bokeh tooltips) but it does not seem like you can manipulate this information.
This post explains two possible ways to achieve the desired result. Let's start with the example DataFrame and the necessary imports.
import holoviews as hv
from holoviews import opts, dim # only needed for 2. solution
import pandas as pd
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1],
}
df = pd.DataFrame(data)
1. Option
Use hv.Dimension(spec, **params), which lets you apply a formatter to a column name via the keyword value_format. This formatter simply combines the value with its share of the total in percent.
total = df['values'].sum()
def fmt(x):
    return f'{x} ({x / total:.1%})'
hv.Sankey(df, vdims=hv.Dimension('values', value_format=fmt))
2. Option
Extend the DataFrame df by one column which stores the labels you want to use. This column can later be reused inside the Sankey with opts(labels=dim('labels')). To check that the calculations are correct, you can turn show_values on, but this will duplicate the values inside the labels. Therefore, show_values is set to False in the final solution. Finding the correct order of the labels can sometimes be tricky.
labels = []
for item in ['A', 'B']:
    grouper = df.groupby(item, sort=False)['values']
    total_sum = grouper.sum().sum()
    for name, group in grouper:
        _sum = group.sum()
        # Share of the column total, expressed in percent
        _percent = round(100 * _sum / total_sum, 1)
        labels.append(f'{name} - {_sum} ({_percent}%)')
df['labels'] = labels
hv.Sankey(df).opts(show_values=False, labels=dim('labels'))
The downside of this solution is that we apply a groupby for both columns 'A' and 'B'. HoloViews will do this too, so it is not very efficient.
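The percentage arithmetic behind such labels can be checked independently of HoloViews. The following is only a sketch in plain Python: it totals the example values per node of column 'A' and expresses each total as a share of the grand total. The helper name `node_labels` is made up for illustration.

```python
def node_labels(names, values):
    """Return a '<name> - <sum> (<share>%)' label for each distinct name."""
    totals = {}
    for name, value in zip(names, values):
        totals[name] = totals.get(name, 0) + value
    grand_total = sum(totals.values())
    return {
        name: f"{name} - {subtotal} ({subtotal / grand_total:.1%})"
        for name, subtotal in totals.items()
    }


# Column 'A' and the 'values' column from the example DataFrame.
a = ['XX', 'XY', 'YY', 'XY', 'XX', 'XX']
values = [10, 5, 8, 15, 19, 1]
labels_by_node = node_labels(a, values)
print(labels_by_node['YY'])
```

With rounding, 'YY' (8 out of 58) comes out as 13.8%, so the 13.7% quoted in the question looks like a truncated value.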
Output
Comment
Both solutions create nearly the same figure, except that the hover tooltips differ.
This is a direct follow-up to "Sorting based on the alt.Color field in Altair", using the same dataframe (included below for ease of reference). I asked a follow-up in the comments section, but after giving it a whirl on my own and getting close, I am creating a new question.
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636218971,good,3.094046639,0.001584756
The follow-up question was: after grouping by "color", how can I do a subsequent ordering within the groups by "LDA Score" (essentially by bar length) and have the text column sort by LDA as well? I didn't know how to incorporate a second level of ordering in the code I was using, so I opted to turn the groups into facets and then try sorting by LDA Score for both the bar charts and the text column. I am getting the proper sorting by LDA score on the charts, but I can't seem to make it work for the text column. I am pasting the code and the image. As you can see, I am telling it to use LDA Score as the sorting field for the "text" chart (which is the p value), but it is still sorting alphabetically by species. Any thoughts? To be honest, I feel like I'm heading down the rabbit hole where I'm forcing a solution to work in the current code, so if you think a different strategy altogether is the better way to go, let me know.
FYI, there are some formatting issues (like redundant labels on axes) that you can ignore for now.
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort='-x'),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y('Species:N', sort=alt.EncodingSortField(field='LDA Score', op='count', order='descending')),
    row='group:N'
).resolve_scale(
    y='independent'
).properties(width=50)
#bars | text
alt.hconcat(bars, text, spacing=0)
Drop op="count". The count in each row is exactly 1 (there is one data point in each row). It sounds like you want to instead sort by the data value.
It also would make sense in this context to use this same sort expression for both y encodings, since they're meant to match:
y_sort = alt.EncodingSortField(field='LDA Score', order='descending')
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort=y_sort),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y("Species:N", sort=y_sort, axis=None),
    alt.Row('group:N', header=alt.Header(title=None, labelFontSize=0))
).resolve_scale(
    y='independent'
).properties(width=50)
alt.hconcat(bars, text, spacing=0)
(labelFontSize is a workaround because there is a bug with labels=False)
I am attempting to sort a horizontal barchart based on the group to which it belongs. I have included the dataframe, code that I thought would get me to group-wise sorting, and image. The chart is currently sorted according to the species column in alphabetical order, but I would like it sorted by the group so that all "bads" are together, similarly, all "goods" are together. Ideally, I would like to take it one step further so that the goods and bads are subsequently sorted by value of 'LDA Score', but that was the next step.
Dataframe:
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636219,good,3.094047,0.001584756
The code:
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N"),
    alt.Color('group:N', sort=alt.EncodingSortField(field="Clinical group", op='distinct', order='ascending'))
)
bars
The resulting figure:
Two things:
If you want to sort the y-axis, you should put the sort expression in the y encoding. Above, you are sorting the color labels in the legend.
Sorting by field in Vega-Lite only works for numeric data (Edit: this is incorrect; see below), so you can use a calculate transform to map the entries to numbers by which to sort.
The result might look something like this:
alt.Chart(df).transform_calculate(
    order='datum.group == "bad" ? 0 : 1'
).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.SortField('order')),
    alt.Color('group:N')
)
Edit: it turns out the reason sorting by group fails is that the default operation for sort fields is sum, which only works well on quantitative data. If you choose a different operation, you can sort on nominal data directly. For example, this shows the correct output:
alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.EncodingSortField('group', op='min')),
    alt.Color('group:N')
)
See vega/vega-lite#6064 for more information.
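For the second step mentioned in the question (sorting by 'LDA Score' within each group), one alternative, sketched here as an assumption rather than taken from the answer above, is to compute the desired species order outside the chart and hand Altair an explicit list, since the `sort` argument of `alt.Y` also accepts a list of values. The rows below are a subset of the question's data, and the function name `species_order` is hypothetical.

```python
def species_order(rows):
    """Order species by group name, then by descending LDA score."""
    ordered = sorted(rows, key=lambda r: (r["group"], -r["LDA Score"]))
    return [r["Species"] for r in ordered]


# Four rows taken from the dataframe in the question.
rows = [
    {"Species": "a", "group": "bad", "LDA Score": 3.07502591},
    {"Species": "e", "group": "good", "LDA Score": 2.223874347},
    {"Species": "b", "group": "bad", "LDA Score": 2.739744898},
    {"Species": "g", "group": "good", "LDA Score": 2.980960068},
]
order = species_order(rows)
print(order)  # ['a', 'b', 'g', 'e']
```

The list could then be used as `alt.Y("Species:N", sort=order)`, which keeps all "bad" entries together and sorts each group by bar length.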
Note from maintainers: This question is about the obsolete bokeh.charts API removed years ago. For information on plotting with modern Bokeh, including timeseries, see:
https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html
I have a dictionary in which each key is a date in string form and each value is an array of floats.
Dictionary looks like this:
dict = {'2017-03-23': [1.07874, 1.07930, 1.07917, 1.07864],
        '2017-03-27': [1.08382, 1.08392, 1.08410, 1.08454],
        '2017-03-24': [1.07772, 1.07721, 1.07722, 1.07668]}
I want to display each date as a separate line on a Bokeh line_chart. Since the date interval will change over time, I do not want to simply define p1.line, p2.line, p3.line (a static set) for each date, because the number of plotted dates will vary over time.
I have tried to follow tutorials here: http://docs.bokeh.org/en/0.9.3/docs/user_guide/charts.html but I keep struggling and getting errors.
Here is my code:
#input dates at this occasion
dates = ['2017-03-27','2017-03-24', '2017-03-23']
#dataframe is taken from input and contains columns date,time,close and other columns that I am not using
df
#I create a dictionary of dataframe in the structure described above
dict = {k: list(v) for k, v in df.groupby("date")["close"]}
#i want to plot chart
output_file("chart2.html")
p = figure(title="Dates line charts", x_axis_label='Index', y_axis_label='Price')
p = TimeSeries(dict, index='Index', legend=True, title="FX", ylabel='Price Prices')
show(p)
I am getting this error:
AttributeError: unexpected attribute 'index' to Chart, possible attributes are above, background_fill_alpha, background_fill_color, below, border_fill_alpha, border_fill_color, css_classes, disabled, extra_x_ranges, extra_y_ranges, h_symmetry, height, hidpi, inner_height, inner_width, js_callbacks, left, lod_factor, lod_interval, lod_threshold, lod_timeout, min_border, min_border_bottom, min_border_left, min_border_right, min_border_top, name, outline_line_alpha, outline_line_cap, outline_line_color, outline_line_dash, outline_line_dash_offset, outline_line_join, outline_line_width, plot_height, plot_width, renderers, right, sizing_mode, tags, title, title_location, tool_events, toolbar, toolbar_location, toolbar_sticky, v_symmetry, webgl, width, x_mapper_type, x_range, xlabel, xscale, y_mapper_type, y_range, ylabel or yscale
Thank you for the help.
You are looking at very old documentation (0.9.3). The latest documentation (0.12.4) for bokeh Timeseries can be found here.
As you can see, Timeseries no longer accepts an index parameter. The available parameters are
data (list(list), numpy.ndarray, pandas.DataFrame, list(pd.Series)) – a 2d data source with columns of data for each stepped line.
x (str or list(str), optional) – specifies variable(s) to use for x axis
y (str or list(str), optional) – specifies variable(s) to use for y axis
builder_type (str or Builder, optional) – the type of builder to use to produce the renderers. Supported options are ‘line’, ‘step’, or ‘point’.
Just follow the example given in the most recent documentation and you should not run into the same problem.
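Since `bokeh.charts` has since been removed, a rough sketch with the `bokeh.plotting` API may be more useful today. It assumes a reasonably recent Bokeh (the `legend_label` argument is not available in very old releases) and draws one line per date in a loop, so the number of dates can vary freely. The variable name `closes_by_date` and the use of the `Category10` palette are choices made for this illustration.

```python
from bokeh.palettes import Category10
from bokeh.plotting import figure

# Same structure as the dictionary in the question: date string -> closes.
closes_by_date = {
    "2017-03-23": [1.07874, 1.07930, 1.07917, 1.07864],
    "2017-03-27": [1.08382, 1.08392, 1.08410, 1.08454],
    "2017-03-24": [1.07772, 1.07721, 1.07722, 1.07668],
}

p = figure(title="FX", x_axis_label="Index", y_axis_label="Price")
palette = Category10[10]

# One line glyph per date; x values are simply the positions in each list.
for i, (date, closes) in enumerate(sorted(closes_by_date.items())):
    p.line(
        list(range(len(closes))),
        closes,
        legend_label=date,
        color=palette[i % len(palette)],
    )
# To render: from bokeh.plotting import output_file, show
#            output_file("chart2.html"); show(p)
```

Because the loop iterates over whatever keys the dictionary holds, no fixed set of `p.line` calls has to be written out by hand.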