Altair: Sorting faceted "text" chart not reflecting expectation - python

This is a direct follow-up to "Sorting based on the alt.Color field in Altair", using the same dataframe (included here for ease of reference). I asked a follow-up in the comments section, but after giving it a whirl on my own and getting close, I am creating a new question.
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636218971,good,3.094046639,0.001584756
The follow-up question was: after grouping by "color", how can I apply a second ordering within the groups by "LDA Score" (essentially by bar length) and have the text column sort by LDA Score as well? I didn't know how to incorporate a second level of ordering in the code I was using, so I opted to turn the groups into facets and then sort by LDA Score for both the bar charts and the text column. The bar charts sort properly by LDA Score, but I can't make it work for the text column. I am pasting the code and the image. As you can see, I am telling it to use LDA Score as the sorting field for the "text" chart (which shows the p value), but it still sorts alphabetically by species. Any thoughts? To be honest, I feel like I'm heading down a rabbit hole, forcing a solution to work in the current code, so if you think a different strategy altogether is the better way to go, let me know.
FYI, there are some formatting issues (like redundant labels on axes) that you can ignore for now.
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort='-x'),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y('Species:N', sort=alt.EncodingSortField(field='LDA Score', op='count', order='descending')),
    row='group:N'
).resolve_scale(
    y='independent'
).properties(width=50)
#bars | text
alt.hconcat(bars, text, spacing=0)

Drop op="count". The count in each row is exactly 1 (there is one data point in each row). It sounds like you want to instead sort by the data value.
It also would make sense in this context to use this same sort expression for both y encodings, since they're meant to match:
import altair as alt

# df is the dataframe from the question.
# One shared sort definition keeps both y axes in the same order.
y_sort = alt.EncodingSortField(field='LDA Score', order='descending')
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort=y_sort),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y("Species:N", sort=y_sort, axis=None),
    alt.Row('group:N', header=alt.Header(title=None, labelFontSize=0))
).resolve_scale(
    y='independent'
).properties(width=50)
alt.hconcat(bars, text, spacing=0)
(labelFontSize is a workaround because there is a bug with labels=False)
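To illustrate why op="count" cannot order the rows, here is a minimal plain-Python sketch (no Altair involved) using three rows from the "good" group in the question. The names `rows`, `by_count`, and `by_value` are made up for this illustration, and Python's stable sort only stands in for the idea; it does not claim to reproduce how Vega-Lite breaks ties.

```python
from collections import Counter

# Three (Species, LDA Score) rows copied from the question's "good" group
rows = [("e", 2.223874347), ("g", 2.980960068), ("p", 3.094046639)]

# Sorting by count: every species occurs exactly once, so all keys equal 1
counts = Counter(species for species, _ in rows)
by_count = sorted(rows, key=lambda r: counts[r[0]], reverse=True)

# Sorting by the data value itself, descending
by_value = sorted(rows, key=lambda r: r[1], reverse=True)

print([s for s, _ in by_count])  # ['e', 'g', 'p'] (all keys tie, order unchanged)
print([s for s, _ in by_value])  # ['p', 'g', 'e']
```

Because every count is 1, the count-based key carries no ordering information, while the value-based key gives the descending LDA Score order the question asks for.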

Related

Altair interactive filter on multiple values of one column

Here is the code of my plot.
input_dropdown = alt.binding_select(options=[None]+all_ids, name='Series ID', labels=["All"]+all_ids)
selection = alt.selection_single(fields=['ID'], bind=input_dropdown)
chart = alt.Chart(source_df).mark_line().encode(
    x=alt.X('Date:T', title='Date'),
    y=alt.Y('Value:Q', title='Value'),
    color=alt.Color('ID:N', title='Series ID'),
    strokeDash='Type:N'
).properties(
    width=700
).add_selection(
    selection
).transform_filter(
    selection
)
st.altair_chart(chart)
Currently I can filter the displayed data by choosing one value of the ID column.
What should I do to filter by multiple ID values?
Something like: show me the data for both IDs '1' and '2'.
This is currently not possible via a widget like a dropdown because it is not implemented in the underlying Vega and Vega-Lite libraries. You could use another chart as a selection element, or maybe use a Streamlit component, since it looks like your code is already using that library.

altair dynamic combination of selection conditions

I'm trying to create a chart where it is possible to select combinations of different columns of the data by toggling checkboxes. However I don't always want to display the checkboxes for all the columns. So I want to add the selections to the chart in a 'dynamic' way.
The thing I want to accomplish is that I want to make a pre-selection of which categories I want to visualize (this is done before the altair chart is created). These categories are then added as checkboxes in altair. However the only way I could find to do this is by adding them in a hardcoded way like the "sel1[0] & sel1[1] & sel1[2] & sel1[3] & sel1[4]" in the code below:
sel1 = [
    alt.selection_single(
        bind=alt.binding_checkbox(name=field),
        fields=[field],
        init={field: False}
    )
    for field in category_selection
]
transform_args = {str(col): f'toBoolean(datum.{col})' for col in category_selection}
alt.Chart(df1).transform_calculate(**transform_args).mark_point(filled=True).encode(
    x='dim1',
    y='dim2',
    opacity=alt.condition(
        sel1[0] & sel1[1] & sel1[2] & sel1[3] & sel1[4],
        alt.value(1), alt.value(0)
    )
).add_selection(
    *sel1
)
I have tried doing it like this:
alt.Chart(df1).transform_calculate(**transform_args).mark_point(filled=True).encode(
    x='dim1',
    y='dim2',
    opacity=alt.condition(
        {'and': sel[:2]},
        alt.value(1), alt.value(0)
    )
).add_selection(
    *sel1[:2]
)
But that does not work.
I can't seem to figure out how to achieve something like this with altair. Could someone provide an example on how to do this with checkboxes or help me find another method to achieve the same thing?
TLDR: I basically want to support a variable number of categories, with the ability to create combinations of those categories.
EDIT: Tried to make it clearer what I'm trying to achieve with the code.
It sounds like you want to write the equivalent of this without knowing the length of sel:
sel = [alt.selection_single() for i in range(3)]
combined = sel[0] & sel[1] & sel[2]
For Python operators in general, you can do so like this:
import operator
import functools
combined = functools.reduce(operator.and_, sel)
In Altair, you can alternatively construct the resulting Vega-Lite specification directly:
combined = {"selection": {"and": [s.name for s in sel]}}
Any of these three approaches should lead to identical results when used within an Altair chart.
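As a quick check of the reduce pattern for a list of unknown length, here is a small sketch that uses Python sets as stand-ins for selections (an assumption made only so the example runs without Altair; sets also support the `&` operator). The variable names are hypothetical.

```python
import functools
import operator

# Stand-ins for selections: any objects that implement `&` behave the same way
sel = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]

# Hardcoded chain versus reduce over the whole list
chained = sel[0] & sel[1] & sel[2]
reduced = functools.reduce(operator.and_, sel)
print(chained == reduced)  # True

# The same call works for any non-empty slice, e.g. only the first two items
reduced_two = functools.reduce(operator.and_, sel[:2])
print(reduced_two)  # {2, 3, 4}
```

Note that `functools.reduce` raises a `TypeError` on an empty list when no initial value is given, so the case where no categories are pre-selected needs to be handled separately.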

Sorting based on the alt.Color field in Altair

I am attempting to sort a horizontal barchart based on the group to which it belongs. I have included the dataframe, code that I thought would get me to group-wise sorting, and image. The chart is currently sorted according to the species column in alphabetical order, but I would like it sorted by the group so that all "bads" are together, similarly, all "goods" are together. Ideally, I would like to take it one step further so that the goods and bads are subsequently sorted by value of 'LDA Score', but that was the next step.
Dataframe:
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636219,good,3.094047,0.001584756
The code:
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N"),
    alt.Color('group:N', sort=alt.EncodingSortField(field="Clinical group", op='distinct', order='ascending'))
)
bars
The resulting figure:
Two things:
If you want to sort the y-axis, you should put the sort expression in the y encoding. Above, you are sorting the color labels in the legend.
Sorting by field in Vega-Lite only works for numeric data (Edit: this is incorrect; see below), so you can use a calculate transform to map the entries to numbers by which to sort.
The result might look something like this:
alt.Chart(df).transform_calculate(
    order='datum.group == "bad" ? 0 : 1'
).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.SortField('order')),
    alt.Color('group:N')
)
Edit: it turns out the reason sorting by group fails is that the default operation for sort fields is sum, which only works well on quantitative data. If you choose a different operation, you can sort on nominal data directly. For example, this shows the correct output:
alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score:Q'),
    alt.Y("Species:N", sort=alt.EncodingSortField('group', op='min')),
    alt.Color('group:N')
)
See vega/vega-lite#6064 for more information.

Seaborn pairplot not displaying in matrix form

I have a DataFrame with the variables below. I am trying to find the relationship by plotting "profit" with other variables excluding "Date".
Date
Billable_Fixed Bid
Billable_Time_Material
Billable_Transaction_Based
Non_Billable
Indirect_Costs
Unbilled_CP_and_AM
Direct_Costs
Profit
Code:
cols = [
    'Billable_Fixed Bid',
    'Billable_Time_Material',
    'Billable_Transaction_Based',
    'Non_Billable',
    'Unbilled_CP_and_AM',
    'Direct_Costs'
]
sns.pairplot(data1, x_vars=cols, y_vars='Profit', size=5, kind='reg')
The problem is that the plots are all displayed in a single row, which makes them hard to read.
I want 2 plots per row so that each one is clearly visible.
Can anyone help?
As per the comment: Using FacetGrid with col_wrap=2 will solve your problem. Check the examples in the documentation.

Formatting in HoverTool

I love how easy it is to set up basic hover feedback with HoverTool, but I'm wrestling with a couple aspects of the display. I have time-series data, with measurements that represent amounts in US$. This data starts out life as a pandas.Series. Legible plotting is easy (following assumes jupyter notebook):
p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line(my_data.index, my_data)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
show(p)
This shows me the time-series, with date-formatting on the x-axis and y-axis values that look like "$150,000", "$200,000", "$250,000", etc. I have two questions about HoverTool behavior:
Controlling formatting for $x and $y.
Accessing the name of the dataset under the cursor.
Simply adding a HoverTool allows me to see values, but in unhelpful units:
p.add_tools(HoverTool())
The corresponding tooltip values with these defaults show "1.468e+5" rather than "$146,800" (or even "146800", the underlying Series value); similarly, the date value appears as "1459728000000" rather than (say) "2016-04-04". I can manually work around this display issue by making my pandas.Series into a ColumnDataSource and adding string columns with the desired formatting:
# Make sure Series and its index have `name`, before converting to DataFrame
my_data.name = 'revenue'
my_data.index.name = 'day'
df = my_data.reset_index()
# Add str columns for tooltip display
df['daystr'] = df['day'].dt.strftime('%m %b %Y')
df['revstr'] = df['revenue'].apply(lambda x: '${:,d}'.format(int(x)))
cds = ColumnDataSource(df)
p = figure(title='Example currency', x_axis_type='datetime',
           plot_height=200, plot_width=600, tools='')
p.line('day', 'revenue', source=cds)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
# Tooltip fields reference data source columns with the '@' prefix
p.add_tools(HoverTool(tooltips=[('Amount', '@revstr'), ('Day', '@daystr')]))
show(p)
but is there a way to handle the formatting in the HoverTool configuration instead? That seems much more desirable than all the data-set transformation that's required above. I looked through the documentation and (quickly) scanned through the source, and didn't see anything obvious that would save me from building the "output" columns as above.
Related to that, when I have several lines in a single plot, is there any way for me to access the name (or perhaps legend value) of each line within HoverTool.tooltips? It would be extremely helpful to include something in the tooltip to differentiate which dataset values are coming from, rather than needing to rely on (say) line-color in conjunction with the tool-tip display. For now, I've added an additional column to the ColumnDataSource that's just the string value I want to show; that obviously only works for datasets that include a single measurement column. When multiple lines are sharing an underlying ColumnDataSource, it would be sufficient to access the column-name that's provided to y.
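I know this is two years late, but this is for other people who come across this:
p.add_tools(HoverTool(
    tooltips=[
        ('Date', '@Date{%F}'),
        ('Value', '@Value{int}'),
    ],
    # Keys as given in this answer; newer Bokeh documentation writes
    # these keys with the '@' prefix as well (e.g. '@Date').
    formatters={
        'Date': 'datetime',
        'Value': 'numeral',
    },
    mode='vline'
))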
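For reference, the raw tooltip values quoted in the question can be checked with the standard library alone. This sketch assumes the date value is milliseconds since the Unix epoch (which is how Bokeh represents datetimes) and only shows the display strings the question is aiming for; it does not involve Bokeh's own formatters.

```python
from datetime import datetime, timezone

raw_ms = 1459728000000  # date value shown by the default tooltip in the question
raw_amount = 1.468e5    # amount shown by the default tooltip in the question

# Milliseconds since the epoch -> ISO date (the same layout as strftime's %F)
day = datetime.fromtimestamp(raw_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")

# Float -> currency string with thousands separators
amount = "${:,d}".format(int(raw_amount))

print(day)     # 2016-04-04
print(amount)  # $146,800
```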
