Display additional values in holoviews sankey labels or hover information - python

I would like to find a way to modify the labels on holoviews sankey diagrams so that they show the percentage values in addition to the numerical values.
For example:
import holoviews as hv
import pandas as pd
hv.extension('bokeh')
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1]}
df = pd.DataFrame(data, columns=['A','B','values'])
sankey = hv.Sankey(df)
For 'From' label 'YY' which is 'YY - 8' change this to 'YY - 8 (13.7%)' - add the additional percentage in there.
I have found ways to change from the absolute value to percentage by using something along the lines of:
value_dim = hv.Dimension('Percentage', unit='%')
But can't find a way to have both values in the label.
Additionally, I tried to modify the hover tag. In my search to find ways to modify this I found ways to reference and display various attributes in the hover information (through the bokeh tooltips) but it does not seem like you can manipulate this information.

This post explains two possible ways to achieve the desired result. Let's start with the example DataFrame and the necessary imports.
import holoviews as hv
from holoviews import opts, dim # only needed for 2. solution
import pandas as pd
data = {'A':['XX','XY','YY','XY','XX','XX'],
'B':['RR','KK','KK','RR','RK','KK'],
'values':[10,5,8,15,19,1],
}
df = pd.DataFrame(data)
1. Option
Use hv.Dimension(spec, **params), which lets you attach a formatter to a column name through the keyword value_format. The formatter simply combines the value with its share of the total in percent.
total = df['values'].sum()
def fmt(x):
    # multiply by 100 so the share is a percentage, not a fraction
    return f'{x} ({round(100 * x / total, 1)}%)'
hv.Sankey(df, vdims=hv.Dimension('values', value_format=fmt))
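As a quick check of the arithmetic behind such a formatter, here is a minimal sketch in plain Python with the example values hard-coded, so no holoviews is needed. The helper name fmt_with_percent is made up for this illustration. 8 out of 58 is roughly 13.8%, which the question writes as 13.7%.

```python
# Values from the example DataFrame; their sum is the grand total.
values = [10, 5, 8, 15, 19, 1]
total = sum(values)  # 58

def fmt_with_percent(x, total=total):
    """Return the value followed by its share of the total in percent."""
    return f'{x} ({100 * x / total:.1f}%)'

print(fmt_with_percent(8))  # the 'YY' node carries a value of 8 -> 8 (13.8%)
```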
2. Option
Extend the DataFrame df with one column that stores the labels you want to use. This column can then be reused inside the Sankey with opts(labels=dim('labels')). To check that the calculations are correct, you can turn show_values on, but this duplicates the values inside the labels. Therefore show_values is set to False in the final solution. Finding the correct label order can sometimes be tricky.
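labels = []
for item in ['A', 'B']:
    grouper = df.groupby(item, sort=False)['values']
    total_sum = grouper.sum().sum()
    for name, group in grouper:
        _sum = group.sum()
        # multiply by 100 so the share is a percentage, not a fraction
        _percent = round(100 * _sum / total_sum, 1)
        labels.append(f'{name} - {_sum} ({_percent}%)')
# one label per node; the assignment works here because the example has as many nodes (6) as rows
df['labels'] = labels
hv.Sankey(df).opts(show_values=False, labels=dim('labels'))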
The downside of this solution is that we apply a groupby to both columns 'A' and 'B'. holoviews does the same thing itself, so the work is done twice, which is not very efficient.
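To see which label text each node should end up with, the per-node totals can also be worked out without pandas. This is only a sketch of the calculation with the example rows hard-coded. The function name node_labels is hypothetical, and attaching the labels to the Sankey still has to be done as in the answer.

```python
from collections import defaultdict

# (A, B, values) rows from the example DataFrame
rows = [('XX', 'RR', 10), ('XY', 'KK', 5), ('YY', 'KK', 8),
        ('XY', 'RR', 15), ('XX', 'RK', 19), ('XX', 'KK', 1)]

def node_labels(rows, column):
    """Sum the values per node in the given column and format one label per node."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[2]
    grand_total = sum(totals.values())
    return {name: f'{name} - {value} ({100 * value / grand_total:.1f}%)'
            for name, value in totals.items()}

source_labels = node_labels(rows, 0)  # nodes from column 'A'
target_labels = node_labels(rows, 1)  # nodes from column 'B'
print(source_labels['YY'])  # YY - 8 (13.8%)
print(target_labels['KK'])  # KK - 14 (24.1%)
```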
Output
Comment
Both solutions create nearly the same figure; only the content of the HoverTool differs.

Related

Altair - use nth column as input variable in graph

I would like to refer to the nth column of a dataframe and use this as input for drawing a graph using altair. The code looks like this:
import altair as alt
import numpy as np
import pandas as pd
#create dataframe
df = pd.DataFrame({'date': ['2020-04-03', '2020-04-04', '2020-04-05', '2020-04-06','2020-04-03', '2020-04-04','2020-04-05','2020-04-06'],
                   'ID': ['a','a','a','a','b','b','b','b'],'bar': [np.nan,8,np.nan,np.nan, np.nan, 8,np.nan,np.nan],
                   'line': [8,np.nan,10,8, 4, 5,6,7] })
#define columns to be used
bb = df.columns[2]
ll = df.columns[3]
#make graph
bars = alt.Chart(df).mark_bar(color="grey", size=5).encode(
    alt.X('monthdate(date):O'), alt.Y(bb))
lines = (alt.Chart(df).mark_line(point=True,size=2)
    .transform_filter('isValid(datum.line)')
    .encode(alt.X('monthdate(date):O'), y='line:Q'))
alt.layer(bars + lines,width=350,height=150).facet(facet=alt.Facet('ID:N'),
    ).resolve_axis(y='independent',x='independent')
I managed this for the "bars" part of the graph, but I am not sure how to do it inside the transform_filter function. Instead of specifying the column name "line" I would like to use "ll" or the 3rd column of the dataframe. Is there a way to do this? Thanks for any help
datum is referencing a javascript object, and I don't believe there is any way to access javascript object properties (like line) by index rather than property name. However you could construct the string using Python f-strings which allows you to substitute in the variable value where you need it in the string:
.transform_filter(f'isValid(datum.{ll})')
There is also an expr module in Altair (second example here), but I can't think of a way to use that which is easier than the above.
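If the column name might contain spaces or other characters that are not valid after a dot, the expression string can use bracket access instead. The following is a small sketch of building such a string. The helper is_valid_expr and the hard-coded columns list are assumptions for illustration, not part of Altair.

```python
import json

def is_valid_expr(column):
    """Build a Vega expression string that tests whether `column` holds a valid value."""
    # json.dumps quotes and escapes the name, so spaces or quotes in it are safe.
    return f'isValid(datum[{json.dumps(column)}])'

columns = ['date', 'ID', 'bar', 'line']  # stand-in for list(df.columns)
ll = columns[3]
print(is_valid_expr(ll))            # isValid(datum["line"])
print(is_valid_expr('unit price'))  # isValid(datum["unit price"])
```

The resulting string would then be passed to transform_filter in place of the hard-coded expression.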

Boxplot visualization

So I have to do this boxplot, and I want to limit the values taken from a column in a dataset, but I don't know how to do that. This is what I have for now. I want to pick the top ten nationalities that are in the column, but I cannot figure out how to do it.
If I understand your question correctly, this should work for a dataframe called df with a column called Nationality:
import collections
counts = collections.Counter(df.Nationality)
top10countries = [elem for elem, _ in counts.most_common(10)]
df_top10 = df[df['Nationality'].isin(top10countries)]
and then use df_top10 to make boxplots.
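The counting step can be tried on a plain list first. This is a minimal sketch with made-up sample data and a top 2 instead of a top 10:

```python
from collections import Counter

# Made-up sample column; in the real case this would be df.Nationality
nationalities = ['Brazil', 'Spain', 'Brazil', 'France', 'Spain', 'Brazil', 'Italy']

counts = Counter(nationalities)
# most_common(n) returns (value, count) pairs, most frequent first
top2 = [name for name, _ in counts.most_common(2)]
# keep only the entries whose value is among the most frequent ones
kept = [n for n in nationalities if n in top2]

print(top2)  # ['Brazil', 'Spain']
print(kept)  # ['Brazil', 'Spain', 'Brazil', 'Spain', 'Brazil']
```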

Using Pandas Styling (DataFrame.style property) to iterate through product prices

I have a Pandas dataframe that contains data of prices of various products as taken on different dates, the columns are ‘date’, ‘product’, ‘price’.
My goal is to highlight the price cell where there has been a price reduction for that particular product. Much like this example .csv seen below…
An example .csv showing what I want to achieve using Pandas Styling
I understand that each product will need to be separated and then the prices of that product evaluated in pairs. I have used the following code in another part of the script to successfully achieve this:
integer = 0
for iteration in range(iterations):
    first_price_pair = one_product.iloc[integer, 2]
    integer = integer + 1
    second_price_pair = one_product.iloc[integer, 2]
    # one_product is selected by using .drop_duplicates() on 'product'
    price_dif = first_price_pair - second_price_pair
    if second_price_pair < first_price_pair:
        pass  # highlight cell green - INDICATES PRICE REDUCTION FROM PREV PRICE
    elif second_price_pair == first_price_pair:
        pass  # no change to cell colour
    elif second_price_pair > first_price_pair:
        pass  # highlight cell RED - INDICATES PRICE INCREASE FROM PREV PRICE
My problem is when I attempt to use - DataFrame.style - for applying the highlighting. It appears that once ‘styling’ has been applied to the DF, the DF is then converted to type: pandas.io.formats.style.Styler - and that this can then not be modified.
I’d appreciate it if someone can confirm it is possible to achieve what I’m trying to do and if so, give me some guidance on how to achieve it.
Thank you!
To apply highlights you might want to use either:
Styler.applymap()
Styler.apply()
The difference between the two lies in how the elements are selected: applymap() works elementwise, while apply() works column-, row-, or table-wise. (In pandas 2.1 and later, applymap() is deprecated in favour of the equivalent Styler.map().)
Both methods require a function to generate the CSS attributes you want to change.
In your case if you put it in an if statement it might be something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(-4, 4, size=(5, 5)))
def background_cell(x, row_idx, col_idx, color):
    # build the CSS from the color argument instead of hard-coding green
    b_color = f'background-color: {color}'
    # same shape as the data; an empty string means no styling
    df_styler = pd.DataFrame('', index=x.index, columns=x.columns)
    df_styler.iloc[row_idx, col_idx] = b_color
    return df_styler
df.style.apply(background_cell, row_idx=1, col_idx=1, color='green', axis=None)
This is going to change the background of the cell [1,1]. You can call df.style.apply() with a different colour and the index of the cell you want to change.
I think you overwrote the DataFrame variable with the Styler by typing
df = df.style.apply(...)
which is why you lost the DataFrame and couldn't modify it anymore.
The styling is a method you can use to show the DataFrame, so you should use it whenever you are printing it, although it won't be an attribute of the DataFrame itself.
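For the price comparison itself, the colour of each row can be decided in one pass by remembering the last price seen per product. The sketch below uses plain Python and made-up rows. The function name price_change_styles is hypothetical, and it assumes the rows are already in date order.

```python
def price_change_styles(rows):
    """Return one CSS string per (date, product, price) row.

    Assumes the rows are sorted by date within each product.
    """
    last_price = {}  # most recent price seen for each product
    styles = []
    for _date, product, price in rows:
        prev = last_price.get(product)
        if prev is None or price == prev:
            styles.append('')                         # first price or no change
        elif price < prev:
            styles.append('background-color: green')  # price reduction
        else:
            styles.append('background-color: red')    # price increase
        last_price[product] = price
    return styles

rows = [('2020-01-01', 'apple', 1.0), ('2020-01-01', 'pear', 2.0),
        ('2020-01-02', 'apple', 0.8), ('2020-01-02', 'pear', 2.5),
        ('2020-01-03', 'apple', 0.8)]
print(price_change_styles(rows))
```

A list like this has one entry per row, so it could presumably be returned from a function passed to df.style.apply() with subset=['price'], while the DataFrame itself stays in its own variable.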

Rotating the column name for a pandas DataFrame

I'm trying to make nicely formatted tables from pandas. Some of my column names are far too long. The cells for these columns become so large that the whole table is a mess.
In my example, is it possible to rotate the column names as they are displayed?
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
{'Way too long of a column to be reasonable':5,'Four?':5}]
pd.DataFrame(data)
Something like:
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
{'Way too long of a column to be reasonable':5,'Four?':5}]
dfoo = pd.DataFrame(data)
dfoo.style.set_table_styles(
[dict(selector="th",props=[('max-width', '80px')]),
dict(selector="th.col_heading",
props=[("writing-mode", "vertical-rl"),
('transform', 'rotateZ(-90deg)'),
])]
)
is probably close to what you want:
see result here
Looking at the pybloqs source code for the accepted answer's solution, I was able to find out how to rotate the columns without installing pybloqs. Note that this also rotates the index, but I have added code to remove those.
import pandas as pd
from IPython.display import HTML, display
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
        {'Way too long of a column to be reasonable':5,'Four?':5}]
df = pd.DataFrame(data)
styles = [
    dict(selector="th", props=[("font-size", "125%"),
                               ("text-align", "center"),
                               ("transform", "translate(0%,-30%) rotate(-5deg)")
                               ]),
    dict(selector=".row_heading, .blank", props= [('display', 'none')])
]
# Styler.render() was removed in pandas 2.0; to_html() (pandas >= 1.3) replaces it
html = df.style.set_table_styles(styles).to_html()
display(HTML(html))
I placed @Bobain's nice answer into a function so I can re-use it throughout a notebook.
def format_vertical_headers(df):
    """Display a dataframe with vertical column headers"""
    styles = [dict(selector="th", props=[('width', '40px')]),
              dict(selector="th.col_heading",
                   props=[("writing-mode", "vertical-rl"),
                          ('transform', 'rotateZ(180deg)'),
                          ('height', '290px'),
                          ('vertical-align', 'top')])]
    return (df.fillna('').style.set_table_styles(styles))
format_vertical_headers(pd.DataFrame(data))
Using the Python library 'pybloqs' (http://pybloqs.readthedocs.io/en/latest/), it is possible to rotate the column names as well as add a padding to the top. The only downside (as the documentation mentions) is that the top-padding does not work inline with Jupyter. The table must be exported.
import pandas as pd
from pybloqs import Block
import pybloqs.block.table_formatters as tf
from IPython.core.display import display, HTML
data = [{'Way too long of a column to be reasonable':4,'Four?':4},
{'Way too long of a column to be reasonable':5,'Four?':5}]
dfoo =pd.DataFrame(data)
fmt_header = tf.FmtHeader(fixed_width='5cm',index_width='10%',
top_padding='10cm', rotate_deg=60)
table_block = Block(dfoo, formatters=[fmt_header])
display(HTML(table_block.render_html()))
table_block.save('Table.html')
I can get it so that the text is completely turned around 90 degrees, but can't figure out how to use text-orientation: upright as it just makes the text invisible :(
You were missing the writing-mode property that has to be set for text-orientation to have any effect. Also, I made it only apply to column headings by modifying the selector a little.
dfoo.style.set_table_styles([dict(selector="th.col_heading",props=[("writing-mode", "vertical-lr"),('text-orientation', 'upright')])])
Hopefully this gets you a little closer to your goal!

Formatting in HoverTool

I love how easy it is to set up basic hover feedback with HoverTool, but I'm wrestling with a couple aspects of the display. I have time-series data, with measurements that represent amounts in US$. This data starts out life as a pandas.Series. Legible plotting is easy (following assumes jupyter notebook):
p = figure(title='Example currency', x_axis_type='datetime',
plot_height=200, plot_width=600, tools='')
p.line(my_data.index, my_data)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
show(p)
This shows me the time-series, with date-formatting on the x-axis and y-axis values that look like "$150,000", "$200,000", "$250,000", etc. I have two questions about HoverTool behavior:
Controlling formatting for $x and $y.
Accessing the name of the dataset under the cursor.
Simply adding a HoverTool allows me to see values, but in unhelpful units:
p.add_tools(HoverTool())
The corresponding tooltip values with these defaults show "1.468e+5" rather than "$146,800" (or even "146800", the underlying Series value); similarly, the date value appears as "1459728000000" rather than (say) "2016-04-04". I can manually work around this display issue by making my pandas.Series into a ColumnDataSource and adding string columns with the desired formatting:
# Make sure Series and its index have `name`, before converting to DataFrame
my_data.name = 'revenue'
my_data.index.name = 'day'
df = my_data.reset_index()
# Add str columns for tooltip display
df['daystr'] = df['day'].dt.strftime('%m %b %Y')
df['revstr'] = df['revenue'].apply(lambda x: '${:,d}'.format(int(x)))
cds = ColumnDataSource(df)
p = figure(title='Example currency', x_axis_type='datetime',
plot_height=200, plot_width=600, tools='')
p.line('day', 'revenue', source=cds)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
p.add_tools(HoverTool(tooltips=[('Amount', '@revstr'), ('Day', '@daystr')]))
show(p)
but is there a way to handle the formatting in the HoverTool configuration instead? That seems much more desirable than all the data-set transformation that's required above. I looked through the documentation and (quickly) scanned through the source, and didn't see anything obvious that would save me from building the "output" columns as above.
Related to that, when I have several lines in a single plot, is there any way for me to access the name (or perhaps legend value) of each line within HoverTool.tooltips? It would be extremely helpful to include something in the tooltip to differentiate which dataset values are coming from, rather than needing to rely on (say) line-color in conjunction with the tool-tip display. For now, I've added an additional column to the ColumnDataSource that's just the string value I want to show; that obviously only works for datasets that include a single measurement column. When multiple lines are sharing an underlying ColumnDataSource, it would be sufficient to access the column-name that's provided to y.
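Hey, I know it's 2 years late, but this is for other people who come across this:
p.add_tools(HoverTool(
    tooltips=[
        ('Date', '@Date{%F}'),
        ('Value', '@Value{int}')],
    # recent Bokeh versions expect the '@' prefix in the formatter keys;
    # older versions used the bare column names
    formatters={
        '@Date': 'datetime',
        '@Value': 'numeral'},
    mode='vline'
))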
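For reference, the raw tooltip numbers quoted in the question are an epoch timestamp in milliseconds and an unformatted float. The sketch below shows in plain Python the conversions that the question's string columns perform by hand. The helper names format_day and format_usd are made up for this illustration.

```python
from datetime import datetime, timezone

def format_day(epoch_ms):
    """Turn milliseconds since the Unix epoch into an ISO date string (UTC)."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).strftime('%Y-%m-%d')

def format_usd(amount):
    """Format a number as whole US dollars with thousands separators."""
    return '${:,d}'.format(int(amount))

print(format_day(1459728000000))  # 2016-04-04
print(format_usd(1.468e5))        # $146,800
```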
