Plot both median and mean in Altair plot - python

How can I plot both the mean and median in Altair, distinguished by a color encoding?
Below is my first attempt, but it doesn't include the legend, and does not seem like the most elegant way.
import altair as alt
from vega_datasets import data
source = data.cars()
mean = alt.Chart(source).mark_line(color='red', point=True).encode(
    x='Year',
    y='mean(Miles_per_Gallon)'
)
median = alt.Chart(source).mark_line().encode(
    x='Year',
    y='median(Miles_per_Gallon)'
)
mean + median

This can best be done with an Aggregate Transform to compute the aggregates, followed by a Fold Transform to allow the two columns to be used in a single encoding:
import altair as alt
from vega_datasets import data
source = data.cars()
alt.Chart(source).transform_aggregate(
    mean='mean(Miles_per_Gallon)',
    median='median(Miles_per_Gallon)',
    groupby=['Year']
).transform_fold(
    ['mean', 'median'],
    as_=['aggregate', 'value']
).mark_line().encode(
    x='Year',
    y='value:Q',
    color='aggregate:N',
)
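To see what the two transforms do to the data, the same reshaping can be sketched in pandas. This is only an illustration on a hypothetical toy frame (the values are made up, not taken from the cars dataset); Altair performs the equivalent steps inside the Vega-Lite spec rather than in Python.

```python
import pandas as pd

# Hypothetical toy data standing in for the cars dataset.
df = pd.DataFrame({
    "Year": [1970, 1970, 1970, 1971, 1971, 1971],
    "Miles_per_Gallon": [10.0, 20.0, 60.0, 15.0, 25.0, 26.0],
})

# Step 1 (like transform_aggregate): one row per Year with a mean and a median column.
wide = (
    df.groupby("Year")["Miles_per_Gallon"]
    .agg(["mean", "median"])
    .reset_index()
)

# Step 2 (like transform_fold): stack the two columns into key/value pairs.
long_df = wide.melt(
    id_vars="Year",
    value_vars=["mean", "median"],
    var_name="aggregate",
    value_name="value",
)
print(long_df)
```

Each Year now contributes one row per aggregate, which is the long-form layout that lets a single color encoding distinguish the two lines and produce a legend.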

Related

Is there a way to format tooltip values in Altair boxplot

Is it possible to format the values within a tooltip for a boxplot? From this Vega documentation, it appears so, but I can't quite figure out how to do it with Altair for Python.
from vega_datasets import data
import altair as alt
source = data.population.url
alt.Chart(source).mark_boxplot().encode(
    alt.X("age:O"),
    alt.Y("people:Q"),
    tooltip=[
        alt.Tooltip("people:Q", format=",.2f"),
    ],
)
I believe you need to provide an aggregation for composite marks like mark_boxplot. This works:
from vega_datasets import data
import altair as alt
source = data.population.url
alt.Chart(source).mark_boxplot().encode(
    alt.X("age:O"),
    alt.Y("people:Q"),
    tooltip=alt.Tooltip("mean(people):Q", format=",.2f"),
)
Update: As it is currently impossible to add multiple aggregated tooltips to a boxplot, I combined my answer with "How to change Altair boxplot infobox to display mean rather than median?" to put a transparent box with a custom tooltip on top of the boxplot. I kept the boxplot underneath so that the outliers and whiskers are plotted as a Tukey boxplot instead of min-max. I also added a point for the mean, since this is what I wanted to see in the tooltip:
alt.Chart(source).mark_boxplot(median={'color': '#353535'}).encode(
    alt.X("age:O"),
    alt.Y("people:Q"),
    tooltip=[
        alt.Tooltip("people:Q", format=",.2f"),
    ],
) + alt.Chart(source).mark_circle(color='#353535', size=15).encode(
    x='age:O',
    y='mean(people):Q'
) + alt.Chart(source).transform_aggregate(
    min="min(people)",
    max="max(people)",
    mean="mean(people)",
    median="median(people)",
    q1="q1(people)",
    q3="q3(people)",
    count="count()",
    groupby=['age']
).mark_bar(opacity=0).encode(
    x='age:O',
    y='q1:Q',
    y2='q3:Q',
    tooltip=alt.Tooltip(['min:Q', 'q1:Q', 'mean:Q', 'median:Q', 'q3:Q', 'max:Q', 'count:Q'], format='.1f')
)
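The difference between Tukey whiskers and min-max whiskers mentioned above can be sketched in pandas. The numbers are hypothetical, and the use of pandas' default quantile interpolation is an assumption; the q1 and q3 values computed by Vega-Lite may differ slightly.

```python
import pandas as pd

# Hypothetical values for a single group, with one extreme observation.
people = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 60.0])

q1 = people.quantile(0.25)
q3 = people.quantile(0.75)
iqr = q3 - q1

# Tukey fences: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = people[(people >= lower_fence) & (people <= upper_fence)]
outliers = people[(people < lower_fence) | (people > upper_fence)]

# Tukey whiskers end at the most extreme points that are still inside the fences.
tukey_whiskers = (inside.min(), inside.max())
# Min-max whiskers simply span the full range of the data.
minmax_whiskers = (people.min(), people.max())

print("Tukey whiskers:", tukey_whiskers)
print("Min-max whiskers:", minmax_whiskers)
print("Outliers:", outliers.tolist())
```

With Tukey whiskers the extreme value is drawn as a separate outlier point, whereas min-max whiskers stretch all the way out to it.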
There is a way to add multiple columns to the tooltip. You can pass in multiple columns in square brackets as a list.
import altair as alt
from vega_datasets import data
stocks = data.stocks()
alt.Chart(stocks).mark_point().transform_window(
    previous_price='lag(price)'
).transform_calculate(
    pct_change='(datum.price - datum.previous_price) / datum.previous_price'
).encode(
    x='date',
    y='price',
    color='symbol',
    tooltip=['price', 'symbol', alt.Tooltip('pct_change:Q', format='.1%')]
)
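The format strings used in these tooltips follow d3-format, which is modeled on Python's format specification mini-language. For the simple specifiers used here, Python's built-in format() should give a reasonable preview of what the tooltip will show. The sample numbers below are hypothetical.

```python
# Preview of the tooltip format strings using Python's built-in format().
samples = {
    ",.2f": 1234567.891,  # thousands separator, two decimals
    ".1f": 12.345,        # one decimal
    ".1%": 0.0456,        # percentage with one decimal
}

previews = {spec: format(value, spec) for spec, value in samples.items()}

for spec, text in previews.items():
    print(f"{spec!r:>8} -> {text}")
```

This only covers specifiers that the two mini-languages share. d3-specific ones such as "~s" have no direct Python equivalent.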

Any way to correctly make weekly time series line chart in matplotlib?

I am trying to make a line chart that visualizes product export and sales activity using weekly data. Basically, I want to see how the export volume of different commodities changes from week to week. I was able to aggregate the data to plot the export trends of different commodities for the top-5 countries, but the resulting plot did not match my expected output. Can anyone point out how to make this right? Is there a better way to make a product export trend line chart using matplotlib or seaborn in Python?
My current attempt:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df_grp = df.groupby(['weekEndingDate','country', 'commodity'])['weeklyExports'].sum().unstack().reset_index()
df_grp = df_grp.fillna(0)
for c in df_grp[['FCF_Beef', 'FCF_Pork']]:
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    df_grp_new = df_grp.groupby(['country', 'weekEndingDate'])[c].sum().unstack().fillna(0)
    df_grp_new = df_grp_new.T
    df_grp_new.drop([col for col, val in df_grp_new.sum().iteritems() if val < 1000], axis=1, inplace=True)
    for col in df_grp_new.columns:
        sns.lineplot(x='WeekEndingDate', y='weekly export', ci=None, data=df_grp_new, label=col)
    ax.relim()
    ax.autoscale_view()
    ax.xaxis.label.set_visible(False)
    plt.legend(bbox_to_anchor=(1., 1), loc='upper left')
    plt.ylabel('weekly export')
    plt.margins(x=0)
    plt.title(c)
    plt.tight_layout()
    plt.grid(True)
    plt.show()
    plt.close()
But this attempt did not produce my expected output. Essentially, I want to see how the weekly exports of different commodities, like beef and pork, change over time for different countries. Can anyone suggest what went wrong in my code? How can I get the desired line chart from the above data?
Desired output
Here is an example of the desired plot (style only) that I want to produce:
There are plenty of ways to do it. If you convert your time column to datetime, seaborn will handle formatting the axis for you.
You could use a FacetGrid to split by commodity or, if you want finer control over the individual charts, plot them using lineplot after filtering the df by commodity.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
url = 'https://gist.githubusercontent.com/adamFlyn/e9ad428a266eccb5dc38b4cee7084372/raw/cfcbe9cf0ed19ada6a4ea409644db7414de9c87f/sales_df.csv'
df = pd.read_csv(url)
df.drop(columns=['Unnamed: 0'], inplace=True)
df['weekEndingDate'] = pd.to_datetime(df['weekEndingDate'])
# sns.set(rc={'figure.figsize':(11.7,8.27)})
g = sns.FacetGrid(df, col='commodity', height=8, sharex=False, sharey=False, legend_out=True)
g.map_dataframe(sns.lineplot, x='weekEndingDate',y='weeklyExports', hue='country', ci=None)
g.add_legend()

Using color on bar chart with Altair seems to prevent zero=False on scale from having anticipated effect

The first chart from the code below (based on this: https://altair-viz.github.io/gallery/us_population_over_time_facet.html) makes the y-axis not begin at zero, as anticipated. But in the second chart, which adds a color to the encoding, the zero=False in alt.Scale no longer seems to be respected.
Edit: I forgot to mention that I am using Altair 4.1.0.
import altair as alt
from vega_datasets import data
import pandas as pd
source = data.population.url
df = pd.read_json(source)
df = df[df["age"] <= 40]
alt.Chart(df).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        title="Population",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    facet=alt.Facet("year:O", columns=5),
).resolve_scale(y="independent").properties(
    title="US Age Distribution By Year", width=90, height=80
)
alt.Chart(df).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        title="Population",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    facet=alt.Facet("year:O", columns=5),
    color=alt.Color("year"),
).resolve_scale(y="independent").properties(
    title="US Age Distribution By Year", width=90, height=80
)
This happens because the scales are automatically adjusted to show all the groups in the variable you are coloring by. It is easier to understand if we look at a single barplot with stacked colors:
import altair as alt
from vega_datasets import data
import pandas as pd
source = data.population.url
df = pd.read_json(source)
df = df[df["age"] <= 40]
alt.Chart(df.query('year < 1880')).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    color=alt.Color("year"),
)
You are calculating the sum, which means that all the years are going to be somewhere in that bar, stacked on top of each other. Altair / Vega-Lite expands the axis so that it includes all groups in your colored variable.
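A rough way to picture this is to compute the stacked segment extents by hand. The sketch below uses hypothetical counts and assumes the bars are stacked from a zero baseline (Vega-Lite's default stacking for bar marks, as far as I understand it). It is not what Altair runs internally, only an illustration of why the scale ends up covering zero.

```python
import pandas as pd

# Hypothetical counts for two age groups across three years (not the real data).
df = pd.DataFrame({
    "age": [0, 0, 0, 5, 5, 5],
    "year": [1850, 1860, 1870, 1850, 1860, 1870],
    "people": [100, 150, 200, 90, 140, 190],
})

# Stack the years inside each age bar: every segment starts where the previous one ended.
df = df.sort_values(["age", "year"]).reset_index(drop=True)
df["y_end"] = df.groupby("age")["people"].cumsum()
df["y_start"] = df["y_end"] - df["people"]

# Extent the y scale has to cover in order to show every colored segment.
domain = (int(df["y_start"].min()), int(df["y_end"].max()))
print(df)
print("domain:", domain)
```

The lowest segment of each bar starts at 0, so a scale that has to show every segment includes zero even with zero=False.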
If you instead color by age, the axis again expands to include all the colored groups, but because they are now not at the bottom of each bar, the axis is cut above zero.
import altair as alt
from vega_datasets import data
import pandas as pd
source = data.population.url
df = pd.read_json(source)
df = df[df["age"] <= 40]
alt.Chart(df.query('year < 1880')).mark_bar().encode(
    x="age:O",
    y=alt.Y(
        "sum(people):Q",
        axis=alt.Axis(format="~s"),
        scale=alt.Scale(zero=False),
    ),
    color=alt.Color("age"),
)
The one thing I can't explain is why the first plot doesn't just show the tip of the darkest color and cut the axis around 2M. I am not sure about that off the top of my head.

How to plot y-axis bands in Altair charts?

Can Altair plot bands on the y axis, similar to this Highcharts example?
The docs have an example showing how to draw a line on the y axis, but adapting the example to use mark_rect to draw a band instead doesn't quite work:
import altair as alt
from vega_datasets import data
weather = data.seattle_weather.url
chart = alt.Chart(weather).encode(
    alt.X("date:T")
)
bars = chart.mark_bar().encode(
    y='precipitation:Q'
)
band = chart.mark_rect().encode(
    y=alt.value(20),
    y2=alt.value(50),
    color=alt.value('firebrick')
)
alt.layer(bars, band)
I think the problem is that when you give a value with alt.value, you specify it in pixels starting from the top of the graph: it is not mapped to the data.
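To make the pixel-versus-data distinction concrete, here is a small sketch of how a linear y scale relates the two. The chart height (200 px) and the y domain (0 to 100) are hypothetical, and the two helper functions are made up for illustration; they are not part of Altair.

```python
def data_to_pixel(value, domain, height):
    """Pixel offset from the top of the plot for a data value on a linear y scale."""
    low, high = domain
    return height * (high - value) / (high - low)


def pixel_to_data(pixel, domain, height):
    """Data value that sits at a given pixel offset from the top of the plot."""
    low, high = domain
    return high - pixel * (high - low) / height


# Hypothetical chart: 200 px tall, y domain from 0 to 100.
domain, height = (0, 100), 200

# Where alt.value(20) and alt.value(50) would land in data units.
print(pixel_to_data(20, domain, height))  # 90.0
print(pixel_to_data(50, domain, height))  # 75.0

# Where the data values 20 and 50 would have to be drawn, in pixels from the top.
print(data_to_pixel(20, domain, height))  # 160.0
print(data_to_pixel(50, domain, height))  # 100.0
```

In this hypothetical chart, y=alt.value(20) with y2=alt.value(50) would draw a band near the top of the plot (data values 90 down to 75) rather than between 20 and 50 on the data axis.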
In the initial answer, mark_rule didn't create a clean band but a lot of vertical stripes, so here is a way to correctly plot a band.
The first solution is to create a brand new data frame for the band, and layer that on top of the bars:
import altair as alt
import pandas as pd
from vega_datasets import data
weather = data('seattle_weather')
band_df = pd.DataFrame([{'x_min': weather.date.min(),
                         'x_max': weather.date.max(),
                         'y_min': 20,
                         'y_max': 50}])
bars = alt.Chart(weather).mark_bar().encode(
    x=alt.X('date:T'),
    y=alt.Y('precipitation:Q', title="Precipitation")
)
band_2 = alt.Chart(band_df).mark_rect(color='firebrick', opacity=0.3).encode(
    x='x_min:T',
    x2='x_max:T',
    y='y_min:Q',
    y2='y_max:Q'
)
alt.layer(bars, band_2)
The second option, if you do not want to (or cannot) create a dataframe, is to use transform_calculate and manually specify x and x2 in the band chart:
bars = alt.Chart().mark_bar().encode(
    x=alt.X('date:T', title='Date'),
    y=alt.Y('precipitation:Q', title="Precipitation")
)
band_3 = alt.Chart().mark_rect(color='firebrick', opacity=0.3).encode(
    x='min(date):T',
    x2='max(date):T',
    y='y_min:Q',
    y2='y_max:Q'
).transform_calculate(y_min='20', y_max='50')
alt.layer(bars, band_3, data=data.seattle_weather.url)
Initial answer
I would do two things to mimic the Highcharts example you gave. First, use transform_calculate to set the y_min and y_max values. Second, use mark_rule so that the band spans the x axis wherever there are values. (I also added some opacity and changed the order of the layers so that the band is behind the bars.)
import altair as alt
from vega_datasets import data
weather = data.seattle_weather.url
chart = alt.Chart().encode(
    alt.X("date:T")
)
bars = chart.mark_bar().encode(
    y='precipitation:Q'
)
# y2 is a secondary channel, so it is wrapped in alt.Y2 rather than alt.Y
band = chart.mark_rule(color='firebrick', opacity=0.3).encode(
    y=alt.Y('y_min:Q'),
    y2=alt.Y2('y_max:Q')
).transform_calculate(y_min="20", y_max="50")
alt.layer(band, bars, data=weather)

Change facet title position in Altair?

How can I move the facet titles (in this case, year) to be above each plot? The default seems to be on the side of the chart. Can this be easily changed?
import altair as alt
from vega_datasets import data
df = data.seattle_weather()
alt.Chart(df).mark_rect().encode(
    alt.Y('month(date):O', title='month'),
    alt.X('date(date):O', title='day'),
    color='temp_max:Q'
).facet(
    row='year(date):N',
)
A general solution for this is to use the labelOrient option of the header parameter:
import altair as alt
from vega_datasets import data
df = data.seattle_weather()
df = df[df.date.dt.year < 2014]  # make a smaller chart
alt.Chart(df).mark_rect().encode(
    alt.Y('month(date):O', title='month'),
    alt.X('date(date):O', title='day'),
    color='temp_max:Q'
).facet(
    row=alt.Row('year(date):N', header=alt.Header(labelOrient='top'))
)
One solution is to remove the row specification and set the facet to have a single column:
import altair as alt
from vega_datasets import data
df = data.seattle_weather()
alt.Chart(df).mark_rect().encode(
    alt.Y('month(date):O', title='month'),
    alt.X('date(date):O', title='day'),
    color='temp_max:Q'
).facet('year(date):N', columns=1)
