Problem plotting data in altair, data is swapped in pyramid graph - python

I am trying to plot data from my pandas DataFrame, which contains data about men and women at different function levels. However, when I plot the pyramid chart, the data is swapped: PhD and Assistant Professor are swapped, and so are Associate Professor and Postdoc. I can't find the mistake.
import altair as alt
from vega_datasets import data
import pandas as pd
df_natuur_vrouw = df_natuur[df_natuur['geslacht'] == 'V']
df_natuur_man = df_natuur[df_natuur['geslacht'] == 'M']
df_techniek_vrouw = df_techniek[df_techniek['geslacht'] == 'V']
df_techniek_man = df_techniek[df_techniek['geslacht'] == 'M']
slider = alt.binding_range(min=2011, max=2020, step=1)
select_year = alt.selection_single(name='year', fields=['year'],
                                   bind=slider, init={'year': 2020})
base_vrouw = alt.Chart(df_natuur_vrouw).add_selection(
    select_year
).transform_filter(
    select_year
).properties(
    width=250
)
base_man = alt.Chart(df_natuur_man).add_selection(
    select_year
).transform_filter(
    select_year
).properties(
    width=250
)
color_scale = alt.Scale(domain=['M', 'V'],
                        range=['#003865', '#ee7203'])
left = base_vrouw.encode(
    y=alt.Y('functieniveau:O', axis=None),
    x=alt.X('percentage_afgerond:Q',
            title='percentage',
            scale=alt.Scale(domain=[0, 100], reverse=True)),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Female')
middle = base_vrouw.encode(
    y=alt.Y('functieniveau:O', axis=None,
            sort=['Professor', 'Associate Professor', 'Assistant Professor', 'Postdoc', 'PhD']),
    text=alt.Text('functieniveau:O'),
).mark_text().properties(width=110)
right = base_man.encode(
    y=alt.Y('functieniveau:O', axis=None),
    x=alt.X('percentage_afgerond:Q', title='percentage', scale=alt.Scale(domain=[0, 100])),
    color=alt.Color('geslacht:N', scale=color_scale, legend=None)
).mark_bar().properties(title='Male')
alt.concat(left, middle, right, spacing=5,
           title='Percentage male and female employees per academic level in nature sector 2011-2020')
This is the data I want to show. However, the values for PhD and Assistant Professor are swapped, and so are the values for Associate Professor and Postdoc.

It is a little hard to tell without a sample of the data to run the code, but the problem is likely that you are sorting the middle plot but not the left and right plots. Try applying the same Y sort order to the bar plots as you are using for the text and see if that works.

Related

Using Altair's interval selection as a filter in a multi-view chart

In my concatenated chart, I’m using an interval selection as a filter (see the GIF, Python code, and VL spec below).
Even though my selection appears to be empty, my filtered chart still shows some data. What I'd like to achieve is to show the average temperature, for each station, based on the date range selected in the interval selection.
Is anyone able to have a look and nudge me in the right direction?
Here's a reproducible example in the VL editor
Here's the Python code I'm using:
def make_chart(df, station):
    brush = alt.selection(
        type="interval",
        encodings=["x"],
        on="[mousedown[event.altKey], mouseup] > mousemove",
        translate="[mousedown[event.altKey], mouseup] > mousemove!",
        zoom="wheel![event.altKey]",
    )
    interaction = alt.selection(
        type="interval",
        bind="scales",
        on="[mousedown[event.shiftKey], mouseup] > mousemove",
        translate="[mousedown[event.shiftKey], mouseup] > mousemove!",
        zoom="wheel![event.shiftKey]",
    )
    points = alt.Chart().mark_circle().encode(
        alt.X('yearmonthdate(date):T', title='Date'),
        alt.Y('temp:Q', title='Mean Temperature in 2020 (F)'),
        size=alt.Size('wind:Q', scale=alt.Scale(domain=[1, 20], range=[1, 500])),
        color=alt.Color('temp:Q', scale=alt.Scale(scheme='blueorange', domainMid=32),
                        legend=alt.Legend(title='Mean Temperature')),
        tooltip=['date', 'name', 'temp', 'wind']
    ).properties(
        width=550,
        height=300
    ).transform_filter(
        alt.datum.name == station
    ).add_selection(
        brush
    ).add_selection(interaction)
    bars = alt.Chart().mark_bar().encode(
        x=alt.X('mean(temp)', title='Mean Temperature (F)'),
        y=alt.Y('name:N', title='Station', axis=alt.Axis(labelLimit=90), sort='-x'),
        color=alt.Color('mean(temp):Q', scale=alt.Scale(scheme='blueorange', domainMid=32))
    ).transform_filter(
        brush
    ).properties(
        width=550,
    )
    chart = alt.vconcat(points, bars, data=df, title="Mean Temperature Dashboard for NY")
    return chart
Your chart seems to be working fine to me. The only thing I noticed is that some circles are not showing up which can make it look like there is no data there when you brush. If you set your domain to start at zero, these then appear fine and everything works as expected.

Using interval selection: manipulate what is taken into aggregation of individual encoding channels of altair

I am making an XY-scatter chart, where both axes show aggregated data.
For both variables I want to have an interval selection in two small charts below where I can brush along the x-axis to set a range.
The selection should then be used to filter what is taken into account for each aggregation operation individually.
Taking the cars data set as an example, let's say I want to look at Horsepower over Displacement, but not for every car: instead I aggregate (sum) by Origin. Additionally, I create two plots of the overall mean HP and displacement over time, to which I add interval selections so that I can set two distinct time ranges.
Here is an example of what it should look like, although the selection functionality is not yet as intended.
Below is the code to produce it. Note that I left some commented-out sections in there which show what I have already tried without success. The idea for the transform_calculate came from this GitHub issue, but I don't know how I could use the extracted boundary values to change what is included in the aggregations of the x and y channels. The double transform_window did not get me anywhere either. Could a transform_bin be useful here? How?
Basically, what I want is this: when brush1 reaches, for example, from 1972 to 1975, and brush2 from 1976 to 1979, I want the scatter chart to plot the summed HP of each country in the years 1972, 1973 and 1974 against each country's summed displacement from 1976, 1977 and 1978 (for my case I don't need the exact date format, the Year might as well be integers here).
import altair as alt
from vega_datasets import data
cars = data.cars.url
brush1 = alt.selection(type="interval", encodings=['x'])
brush2 = alt.selection(type="interval", encodings=['x'])
scatter = alt.Chart(cars).mark_point().encode(
    x = 'HP_sum:Q',
    y = 'Dis_sum:Q',
    tooltip = 'Origin:N'
).transform_filter(  # OK, I can filter the whole data set, but that always acts on both variables (HP and displacement) together... -> not what I want.
    brush1 | brush2
).transform_aggregate(
    Dis_sum = 'sum(Displacement)',
    HP_sum = 'sum(Horsepower)',
    groupby = ['Origin']
# ).transform_calculate(  # Can I extract the selection boundaries like that? And if yes: how can I use these extracts to calculate the aggregations of HP and displacement?
#     b1_lower='(isDefined(brush1.x) ? (brush1.x[0]) : 1)',
#     b1_upper='(isDefined(brush1.x) ? (brush1.x[1]) : 1)',
#     b2_lower='(isDefined(brush2.x) ? (brush2.x[0]) : 1)',
#     b2_upper='(isDefined(brush2.x) ? (brush2.x[1]) : 1)',
# ).transform_window(  # Maybe instead of calculate I can use two window transforms...??
#     conc_sum = 'sum(conc)',
#     frame = [brush1.x[0],brush1.x[1]],  # This will not work, as it sets the frame relative (backward and forward) to each datum (i.e. sliding window); I need it to correspond to the entire data set
#     groupby=['sample']
# ).transform_window(
#     freq_sum = 'sum(freq)',
#     frame = [brush2.x[0],brush2.x[1]],  # ...same problem here
#     groupby=['sample']
)
range_sel1 = alt.Chart(cars).mark_line().encode(
    x = 'Year:T',
    y = 'mean(Horsepower):Q'
).add_selection(
    brush1
).properties(
    height = 100
)
range_sel2 = alt.Chart(cars).mark_line().encode(
    x = 'Year:T',
    y = 'mean(Displacement):Q'
).add_selection(
    brush2
).properties(
    height = 100
)
scatter & range_sel1 & range_sel2
Interval selection cannot be used for aggregate charts yet in Vega-Lite. The error behavior has been updated in a recent PR to Vega-Lite to show a helpful message.
Not sure if I understand your requirements correctly, does this look close to what you want? (Just added param selections on top of your vertically concatenated graphs)
Vega Editor

call back from slider change not updating my plot in bokeh in jupyter lab?

I am working on a Bokeh visualisation of datasets across a number of categories. The initial part of the visual is a donut chart of the categories showing the total number of items in each category. I am trying to get the chart to update based on a min-max range using RangeSlider - but the chart does not update.
The input source for the glyphs is the output from a create_cat_df - which is returned as a Pandas DF, then converted into a CDS using ColumnDataSource.from_df().
The chart appears okay when this code is run (with slider alongside) - but moving the slider changes nothing.
There is a similar post here.
The answer here was useful in putting me onto from_df - but even after following this I can't get the code to work.
def create_doc(doc):
    ### INPUT widget
    cat_min_max = RangeSlider(start=0, end=1000, value=[0, 1000], step=1, title="Category min-max items (m)")
    inputs = column(cat_min_max, width=300, height=850)  # in preparation for multiple widgets
    ### Tooltip & tools
    TOOLTIPS_2 = [("Item", "$item")  # a sample
                  ]
    hover_2 = HoverTool(tooltips=TOOLTIPS_2, names=['cat'])
    tools = [hover_2, TapTool(), WheelZoomTool(), PanTool(), ResetTool()]
    ### Create Figure
    p = figure(plot_width=width, plot_height=height, title="",
               x_axis_type=None, y_axis_type=None,
               x_range=(-420, 420), y_range=(-420, 420),
               min_border=0, outline_line_color=None,
               background_fill_color="#f0e1d2",
               tools=tools, toolbar_location="left")
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    # taptool
    url = "https://google.com/"  # dummy URL
    taptool = p.select(type=TapTool)
    taptool.callback = OpenURL(url=url)
    # create cat_source CDS using create_cat_df function (returns pandas df) and 'from_df' method
    cat_source = ColumnDataSource.from_df(create_cat_df(cat_min_max.value[0], cat_min_max.value[1]))
    ## plot category wedges
    p.annular_wedge('centre_x', 'centre_y', 'inner', 'outer', 'start', 'end', color='color',
                    alpha='alpha', direction='clock', source=cat_source, name='cat')
    r = row([inputs, p])
    def callback(attr, old, new):
        cat_source.data = ColumnDataSource.from_df(create_cat_df(cat_min_max.value[0], cat_min_max.value[1]))
    cat_min_max.on_change('value', callback)
    doc.add_root(r)

show(create_doc)
I would like to get the code working and the chart updating. There are a number more glyphs & different data layers to layer in, but I want to get the basics working first.
According to the Bokeh documentation, the ColumnDataSource.from_df() method returns a dictionary, while you need to pass a ColumnDataSource to the source argument in p.annular_wedge(source = cat_source).
So instead of:
cat_source = ColumnDataSource.from_df(create_cat_df(cat_min_max.value[0], cat_min_max.value[1]))
You should do:
cat_source = ColumnDataSource(data = ColumnDataSource.from_df(create_cat_df(cat_min_max.value[0], cat_min_max.value[1])))
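A minimal sketch of the difference, assuming a tiny made-up DataFrame in place of the output of create_cat_df (the column names `category` and `count` are hypothetical). It shows that from_df yields a plain dict, that wrapping it gives a real ColumnDataSource, and that a callback can then replace the source's `.data` with a fresh dict:

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical stand-in for the DataFrame returned by create_cat_df.
df = pd.DataFrame({"category": ["a", "b"], "count": [3, 5]})

# from_df only converts the DataFrame into a dict of columns.
raw = ColumnDataSource.from_df(df)

# Wrapping that dict gives an object that glyphs can use as `source`.
source = ColumnDataSource(data=raw)

# Inside a slider callback, assign a new dict to `.data` to update the glyphs.
filtered = df[df["count"] > 4]
source.data = ColumnDataSource.from_df(filtered)
```

Passing the DataFrame straight to `ColumnDataSource(df)` should also work for the initial construction; the wrapped from_df form is kept here only to mirror the answer.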

Only getting one bar in bqplot chart

I have some data of the form:
Name Score1 Score2 Score3 Score4
Bob -2 3 5 7
and I'm trying to use bqplot to plot a really basic bar chart.
I'm trying:
sc_ord = OrdinalScale()
y_sc_rf = LinearScale()
bar_chart = Bars(x=data6.Name,
                 y=[data6.Score1, data6.Score2, data6.Score3],
                 scales={'x': sc_ord, 'y': y_sc_rf},
                 labels=['Score1', 'Score2', 'Score3'],
                 )
ord_ax = Axis(label='Score', scale=sc_ord, grid_lines='none')
y_ax = Axis(label='Scores', scale=y_sc_rf, orientation='vertical',
            grid_lines='solid')
Figure(axes=[ord_ax, y_ax], marks=[bar_chart])
but all I'm getting is one bar, I assume because Name only has one value. Is there a way to set the column headers as the x data, or some other way to solve this?
I think this is what Doug is getting at. Your length of x and y data should be the same. In this case, x is the column labels, and y is the score values. You should set the 'Name' column of your DataFrame as the index; this will prevent it from being plotted as a value.
PS. Next time, if you make sure your code is a complete example that can be run from scratch without external data (a MCVE, https://stackoverflow.com/help/mcve) you are likely to get a much quicker answer.
BQPlot documentation has lots of good examples using the more complex pyplot interface which are worth reading: https://github.com/bloomberg/bqplot/blob/master/examples/Marks/Object%20Model/Bars.ipynb
from bqplot import *
import pandas as pd
data = pd.DataFrame(
    index = ['Bob'],
    columns = ['score1', 'score2', 'score3', 'score4'],
    data = [[-2, 3, 5, 7]]
)
sc_ord = OrdinalScale()
y_sc_rf = LinearScale()
bar_chart = Bars(x=data.columns, y = data.iloc[0],
                 scales={'x': sc_ord, 'y': y_sc_rf},
                 labels=data.index[1:].tolist(),
                 )
ord_ax = Axis(label='Score', scale=sc_ord, grid_lines='none')
y_ax = Axis(label='Scores', scale=y_sc_rf, orientation='vertical',
            grid_lines='solid')
Figure(axes=[ord_ax, y_ax], marks=[bar_chart])
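If the data already exists in the question's layout, with 'Name' as an ordinary column, the reshaping step can be sketched with pandas alone. This is only an illustration: the DataFrame below is rebuilt from the single row shown in the question, and the variable names `scores`, `x` and `y` are assumptions. The point is that x (the column labels) and y (the score values) end up with the same length:

```python
import pandas as pd

# The question's layout: 'Name' is a regular column next to the scores.
data6 = pd.DataFrame({
    'Name': ['Bob'],
    'Score1': [-2],
    'Score2': [3],
    'Score3': [5],
    'Score4': [7],
})

# Move 'Name' into the index so only the score columns remain as values.
scores = data6.set_index('Name')

x = scores.columns.tolist()     # one label per bar
y = scores.loc['Bob'].tolist()  # one value per bar, same length as x
```

These `x` and `y` values could then be passed to `Bars` in place of `data.columns` and `data.iloc[0]`.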

Bokeh not reading sorted Pandas Dataframe correctly

I'm trying to make a sorted bar graph with a dual y-axis. So far, I have the graph in place but I am trying to sort it from the most to the least enrollment.
Here is the code for how I've sorted my data frame:
#Merging the Two DFs
viz_df = pd.merge(totenrl_df, gpaAvg_df, on = 'Institution')
#Sorting Values based off Enrollment
viz_df = viz_df.sort_values('Total Female Enrollment', ascending = False)
#Reducing DF to exclude those who do not have enrollment numbers
viz_df = viz_df[viz_df['Total Female Enrollment'] != 0]
viz_df.head(10)
This yields me:
The problem I'm running into is getting Bokeh to understand this sorted graph. This is the code I have for generating my graph:
#Creating Output File
output_notebook()
#Setting Dimensions
p = figure(plot_width = 800, plot_height = 400)
#Creating Point Renderer
p.cross(viz_df.index,viz_df['Totals, Female: Cumulative GPA (Tot. F)'].values,line_width = 0.5, color = 'red', y_range_name = 'GPA')
#Creating alternate range for GPA
p.extra_y_ranges = {'GPA': Range1d(start = 0, end = 4.0)}
p.add_layout(LinearAxis(y_range_name = 'GPA'), 'right')
#Creating Bar Values
h = viz_df['Total Female Enrollment'].values
#Adding Bar Renderer
p.vbar(x = viz_df.index, bottom = 0, top = h, width = .5)
show(p)
Which yields me:
I don't know what I'm doing wrong, I think it should be working. Thank you for any help.
I just had this same issue. The problem is, Bokeh is referring to the index column to provide data. When you sort, Pandas doesn't automatically reset the index (you can observe this in the index column of the table you shared) so all you have to do is, after the sort, put .reset_index() and it should run fine.
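A small sketch of the effect, using the column names from the question but made-up institutions and enrollment numbers. After sorting and filtering, the index still carries the original row labels, so plotting against `viz_df.index` places the bars in the old order. Resetting the index renumbers the rows in their sorted order; `drop=True` is a choice made here to discard the old labels rather than keep them as a column:

```python
import pandas as pd

# Invented data; only the column names come from the question.
viz_df = pd.DataFrame({
    'Institution': ['A', 'B', 'C', 'D'],
    'Total Female Enrollment': [120, 0, 300, 45],
})

viz_df = viz_df.sort_values('Total Female Enrollment', ascending=False)
viz_df = viz_df[viz_df['Total Female Enrollment'] != 0]

# The index still holds the pre-sort row labels.
index_before = viz_df.index.tolist()

# Renumber the rows so the index follows the sorted order.
viz_df = viz_df.reset_index(drop=True)
index_after = viz_df.index.tolist()
```

Here `index_before` is `[2, 0, 3]` and `index_after` is `[0, 1, 2]`, so using the index as the x position now matches the sorted enrollment order.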
