Formatting Axes of Binned Temporal (Continuous) Bar Graphs - python

I'm having some issues formatting the x-axis of binned temporal bar graphs.
Here is the data:
import pandas as pd
import numpy as np
import altair as alt
c1 = pd.date_range(start="2021-01-01",end="2021-02-15")
c2 = np.random.randint(1,6, size=len(c1))
df = pd.DataFrame({"day": c1, "value": c2})
df = df.drop(np.random.choice(df.index, size=12, replace=False))
Here is one approach, let's call this A. This uses a timeUnit to bin the data, but behind the scenes it uses x/x2 encoding.
c1 = alt.Chart(df).mark_bar().encode\
( alt.X("monthdate(day):T")
, alt.Y("value")
)
t1 = alt.Chart(df).mark_text(baseline="bottom").encode\
( alt.X("monthdate(day):T")
, alt.Y("value")
, alt.Text("value")
)
(c1 + t1).interactive(bind_y=False).properties(width=800)
Here is another approach, let's call this B. This uses x/x2 encoding directly.
c2 = alt.Chart(df).transform_calculate\
( day = "toDate(datum.day)"
, start = "datum.day - 12*60*60*1000"
, end = "datum.day + 12*60*60*1000"
).mark_bar().encode\
( alt.X("start:T")
, alt.X2("end:T")
, alt.Y("value")
)
t2 = alt.Chart(df).mark_text(baseline="bottom").encode\
( alt.X("day:T")
, alt.Y("value")
, alt.Text("value")
)
(c2 + t2).interactive(bind_y=False).properties(width=800)
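For reference, the half-day arithmetic behind approach B can be checked in plain Python. This is only a sketch of the window computation, assuming each `day` value is a midnight timestamp; `bar_window` is a hypothetical helper name, not part of Altair or Vega.

```python
from datetime import datetime, timedelta

HALF_DAY_MS = 12 * 60 * 60 * 1000  # same constant as in the calculate expressions

def bar_window(day):
    """Return (start, end) centred on `day`, half a day to either side."""
    half = timedelta(milliseconds=HALF_DAY_MS)
    return day - half, day + half

start, end = bar_window(datetime(2021, 1, 5))
print(start)  # 2021-01-04 12:00:00
print(end)    # 2021-01-05 12:00:00
```

Each bar therefore spans exactly one day and is centred on the `day` value that the text mark uses.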
Each has its share of formatting issues, with much overlap.
A (timeUnit):
1. Align the text marks, axis labels, and axis ticks to the center of the bar marks. Is it possible to offset either the axis or the data by a data-dependent or zoom-dependent amount? For example axis offset is in pixels and therefore unsuitable. Vega offers xOffset encoding, but that might not be available to Altair. It's a pity that band or bandPosition fields don't work with continuous domains.
2. There should be just one tick, one grid, and one label per bar. I can solve this in vega-lite using "axis": {"tickCount": "day"} but this generates a schema error in Altair 4.1.0.
3. Unique axis label formatting on year and month transitions, including linebreaks within axis labels. I believe a registered custom format type will allow me to use a custom JavaScript function, but I could use an example.
4. Change the axis labels based on how many bars are visible at the current zoom level. For example, to add the weekday at high zoom levels. According to this it is possible. I'm not sure how, but it probably involves an alt.Condition on the axis format field. This format field is then passed to the custom format function (above, A.3).
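As a rough illustration of A.2 and A.3, the axis settings can be written as a plain Vega-Lite spec in a Python dict and pasted into the Vega editor. The `tickCount` value comes from the question. The `labelExpr` is an assumption on my part: it is a different mechanism from a registered custom format type, and whether it renders as intended depends on the Vega-Lite version. The spec omits the data, which would still need to be supplied.

```python
import json

# Axis settings for A.2 and A.3, expressed as raw Vega-Lite rather than Altair.
axis = {
    # From the question: one tick, grid line and label per day.
    "tickCount": "day",
    # Assumption: a two-line label that adds month and year on the 1st of a month.
    "labelExpr": (
        "[timeFormat(datum.value, '%d'), "
        "timeFormat(datum.value, '%d') == '01' "
        "? timeFormat(datum.value, '%b %Y') : '']"
    ),
}

spec = {
    "mark": "bar",
    "encoding": {
        "x": {"field": "day", "timeUnit": "monthdate", "type": "temporal", "axis": axis},
        "y": {"field": "value", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))
```

An expression that returns an array produces one label line per element, which is how the line break within a label is obtained here.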
B (x/x2):
1. Half-sized grid every now and then. See screenshots. This is pretty strange. Edit: apparently also an issue with A, judging by the screenshots below. Never noticed it till now.
2. Missing or inconsistent dividers between bars. A possible solution is to add alt.X(..., bin="binned"), but this removes the grid verticals and I haven't looked into adding them back. I don't want to reduce the x1-x2 width because that leads to an undesirable zoom-dependent gap.
3. Inconsistent axis labels. I've seen four separate label formats at once on the x-axis, and it's even more confusing than it appears: %a %d, %Y, %b %d, %B. It's weird that this isn't an issue with code version A. Hopefully this is fixed with a custom format function as in A.3. I see that vega's scale/encode/labels/update/text ties a timeFormat to the signal handler. Is that the same thing as a custom format function?
4. Same as A.2.
5. Same as A.3.
6. Same as A.4.
Here are the images documenting all these problems:
A (timeUnit):
B (x/x2):

Related

How do I reduce the number of ticks on an Altair graph?

I am using Altair to create a graph, but for some weird reason it seems to be generating a tick for each of the points, creating a graph like this: Altair Graph
If I filter the dataframe, it produces weird axis values. Altair graph
Is there a way to reduce the number of ticks? I tried tickCount in the y axis parameter and it didn't work, since it seems to require integers. I also tried setting the axis values parameter to a list [0,0.2,0.4,0.6,0.8,1] and that didn't work either. Here is my code (sorry it's so lengthy!). Thank you in advance!
a = alt.Chart(df_filtered).mark_point().encode(
    x=alt.X('Process_Time_(mins)', axis=alt.Axis(title='Process Time (mins)')),
    y=alt.Y('Heavy_Phase_%SS', axis=alt.Axis(title='Heavy Phase %SS', tickCount=10), sort='descending'),
    color=alt.Color('DSP_Lot', legend=alt.Legend(title='DSP_Lot')),
    shape=alt.Shape('Strain', scale=alt.Scale(range=["circle", "square", "cross", "diamond", "triangle-up", "triangle-down", "triangle-right", "triangle-left"])),
    tooltip=[
        alt.Tooltip('DSP_Lot', title='Lot'), alt.Tooltip('Heavy_Phase_%SS', title='Heavy Phase %SS'),
        alt.Tooltip('Process_Time_(mins)', title='Process Time (mins)'), alt.Tooltip('Purpose', title='Purpose'),
        alt.Tooltip('Strain', title='Strain'), alt.Tooltip('Trial', title='Trial'),
    ],
).properties(width=1000, height=500)
It's hard to tell without a reproducible example but I suspect the issue is that your y axis is defaulting to a nominal encoding type, in which case you get one tick mark per unique value. If you specify a quantitative type in the Y encoding, it may improve things:
y = alt.Y('Heavy_Phase_%SS:Q', ...)
The reason it defaults to nominal is probably because the associated column in the pandas dataframe has a string type rather than a numerical type.
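A quick way to check that suspicion is to inspect the column and convert it before charting. This is a minimal sketch on a made-up stand-in for the column; the values are hypothetical, and `errors="coerce"` is one possible choice that turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical stand-in for the question's column: numbers stored as strings.
df = pd.DataFrame({"Heavy_Phase_%SS": ["0.2", "0.45", "0.8", "1.0"]})

# A string column is not numeric, so it would not be treated as quantitative.
was_numeric = pd.api.types.is_numeric_dtype(df["Heavy_Phase_%SS"])
print(was_numeric)

# Convert to a numeric dtype; entries that cannot be parsed become NaN.
df["Heavy_Phase_%SS"] = pd.to_numeric(df["Heavy_Phase_%SS"], errors="coerce")
print(pd.api.types.is_numeric_dtype(df["Heavy_Phase_%SS"]))
```

After the conversion, the type should be inferred as quantitative even without the explicit `:Q` suffix, although spelling it out does no harm.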

Using interval selection: manipulate what is taken into aggregation of individual encoding channels of altair

I am making an XY-scatter chart, where both axes show aggregated data.
For both variables I want to have an interval selection in two small charts below where I can brush along the x-axis to set a range.
The selection should then be used to filter what is taken into account for each aggregation operation individually.
Using the cars data set as an example, let's say I want to look at Horsepower over Displacement, but not for every car: instead I aggregate (sum) by Origin. Additionally, I create two plots of the overall mean HP and displacement over time, to which I add interval selections so that I can set two distinct time ranges.
Here is an example of what it should look like, although the selection functionality is not yet as intended.
And below is the code to produce it. Note that I left some commented-out sections in there which show what I already tried but which do not work. The idea for the transform_calculate came from this GitHub issue. But I don't know how I could use the extracted boundary values to change what is included in the aggregations of the x and y channels. The double transform_window didn't get me anywhere either. Could a transform_bin be useful here? How?
Basically, what I want is: when brush1 spans, for example, 1972 to 1975 and brush2 spans 1976 to 1979, I want the scatter chart to plot the summed HP of each country in the years 1972, 1973 and 1974 against each country's summed displacement from 1976, 1977 and 1978 (for my case I don't need the exact date format; the Year might as well be an integer here).
import altair as alt
from vega_datasets import data
cars = data.cars.url
brush1 = alt.selection(type="interval", encodings=['x'])
brush2 = alt.selection(type="interval", encodings=['x'])
scatter = alt.Chart(cars).mark_point().encode(
    x = 'HP_sum:Q',
    y = 'Dis_sum:Q',
    tooltip = 'Origin:N'
).transform_filter( # Ok, I can filter the whole data set, but that always acts on both variables (HP and displacement) together... -> not what I want.
    brush1 | brush2
).transform_aggregate(
    Dis_sum = 'sum(Displacement)',
    HP_sum = 'sum(Horsepower)',
    groupby = ['Origin']
# ).transform_calculate( # Can I extract the selection boundaries like that? And if yes: how can I use these extracts to calculate the aggregations of HP and displacement?
#     b1_lower='(isDefined(brush1.x) ? (brush1.x[0]) : 1)',
#     b1_upper='(isDefined(brush1.x) ? (brush1.x[1]) : 1)',
#     b2_lower='(isDefined(brush2.x) ? (brush2.x[0]) : 1)',
#     b2_upper='(isDefined(brush2.x) ? (brush2.x[1]) : 1)',
# ).transform_window( # Maybe instead of calculate I can use two window transforms...??
#     conc_sum = 'sum(conc)',
#     frame = [brush1.x[0],brush1.x[1]], # This will not work, as it sets the frame relative (backward and forward) to each datum (i.e. sliding window), I need it to correspond to the entire data set
#     groupby=['sample']
# ).transform_window(
#     freq_sum = 'sum(freq)',
#     frame = [brush2.x[0],brush2.x[1]], # ...same problem here
#     groupby=['sample']
)
range_sel1 = alt.Chart(cars).mark_line().encode(
    x = 'Year:T',
    y = 'mean(Horsepower):Q'
).add_selection(
    brush1
).properties(
    height = 100
)
range_sel2 = alt.Chart(cars).mark_line().encode(
    x = 'Year:T',
    y = 'mean(Displacement):Q'
).add_selection(
    brush2
).properties(
    height = 100
)
scatter & range_sel1 & range_sel2
Interval selection cannot be used for aggregate charts yet in Vega-Lite. The error behavior has been updated in a recent PR to Vega-Lite to show a helpful message.
I'm not sure I understand your requirements correctly; does this look close to what you want? (I just added param selections on top of your vertically concatenated graphs.)
Vega Editor
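Outside of an interactive chart, the aggregation the question describes can be checked with pandas. The following is a minimal sketch on a small made-up table. The values and the `sum_in_range` helper are illustrative assumptions, not part of the cars data set or of Altair, and the brush ranges are hard-coded as half-open intervals to match the question's example.

```python
import pandas as pd

# Small made-up table in the shape of the cars data (values are illustrative only).
cars = pd.DataFrame({
    "Origin": ["USA", "USA", "USA", "Japan", "Japan", "Japan"],
    "Year": [1972, 1974, 1977, 1973, 1976, 1978],
    "Horsepower": [150, 100, 90, 88, 70, 65],
    "Displacement": [350, 250, 200, 97, 85, 91],
})

def sum_in_range(frame, column, lower, upper):
    """Sum `column` per Origin over the rows with lower <= Year < upper."""
    in_range = frame[(frame["Year"] >= lower) & (frame["Year"] < upper)]
    return in_range.groupby("Origin")[column].sum()

hp_sum = sum_in_range(cars, "Horsepower", 1972, 1975)     # stands in for brush1
dis_sum = sum_in_range(cars, "Displacement", 1976, 1979)  # stands in for brush2

# One row per Origin: the x and y values the scatter chart would plot.
points = pd.concat({"HP_sum": hp_sum, "Dis_sum": dis_sum}, axis=1)
print(points)
```

The point of the sketch is that each aggregate is filtered by its own year range before grouping, which is what a single shared transform_filter cannot express.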

Colour bars based on values in pandas dataframe when using plotnine

I am trying to build a waterfall chart using plotnine. I would like to colour the starting and ending bars as grey (ideally I want to specify hexadecimal colours), increases as green and decreases as red.
Below is some sample data and my current plot. I am trying to set fill to the pandas column colour, but the bars are all black. I have also tried putting fill in the geom_segment, but this does not work either.
import numpy as np
import pandas as pd
from plotnine import (aes, geom_segment, ggplot, scale_y_continuous,
                      theme_light, ylab)

df = pd.DataFrame({})
df['label'] = ('A','B','C','D','E')
df['percentile'] = (10)*5
df['value'] = (100,80,90,110,110)
df['yStart'] = (0,100,80,90,0)
df['barLabel'] = ('100','-20','+10','+20','110')
df['labelPosition'] = ('105','75','95','115','115')
df['colour'] = ('grey','red','green','green','grey')
# each bar is drawn as a thick segment from yStart to value
p = (ggplot(df, aes(x=np.arange(0,5,1), xend=np.arange(0,5,1), y='yStart',yend='value',fill='colour'))
    + theme_light(6)
    + geom_segment(size=10)
    + ylab('value')
    + scale_y_continuous(breaks=np.arange(0,141,20), limits=[0,140], expand=(0,0))
)
EDIT
Based on teunbrand's comment of changing fill to color, I have the following. How do I specify the actual colour, preferably in hexadecimal format?
Just to close this question off, credit goes to teunbrand in the comments for the solution.
geom_segment() has a colour aesthetic but not a fill aesthetic. Replace fill='colour' with colour='colour'.
Plotnine will use default colours for the bars. Use scale_color_identity() if the contents of the DataFrame column are literal colours, or scale_colour_manual() to manually specify a tuple or list of colours. Both forms accept hexadecimal colours.
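One way to get hexadecimal colours with the identity scale is to store them in the DataFrame itself. The sketch below only prepares that column. The hex values and the `colour_hex` column name are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["A", "B", "C", "D", "E"],
    "colour": ["grey", "red", "green", "green", "grey"],
})

# Hypothetical palette: replace with whatever hex shades you prefer.
hex_for = {"grey": "#808080", "red": "#d62728", "green": "#2ca02c"}

# Literal hex strings, ready to be mapped with colour='colour_hex'.
df["colour_hex"] = df["colour"].map(hex_for)
print(df)
```

With `colour='colour_hex'` in the aes() call and `+ scale_color_identity()` added to the plot, the segments should be drawn in exactly these colours.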

How do I create a strip plot similar to this using vega-lite?

I'm interested in being able to recreate this multidimensional strip plot below, generated by the Missing Numbers python library, using vega-lite, and I'm looking for a few pointers on how I might do this. The code to generate the image below looks a bit like this snippet:
>>> from quilt.data.ResidentMario import missingno_data
>>> collisions = missingno_data.nyc_collision_factors()
>>> collisions = collisions.replace("nan", np.nan)
>>> import missingno as msno
>>> %matplotlib inline
>>> msno.matrix(collisions.sample(250))
For each column, there is a mark shown for a specific combination of the index, and where the data is null, or not null.
When I look through a gallery of charts created by Altair, I see this horizontal strip plot, which seems to be presenting a similar kind of information, but I'm not sure how to express the same idea.
The viz below is showing a mark when there is data that matches a given combination of horse power and cylinder size - the horsepower and cylinder are encoded into the x and y channels.
I'm not sure how I'd express the same for the cool nullity matrix thing, and I think I need some pointers here.
I get that I can reset the index to come up with a y index, but it's not clear to me how the index of the sample is encoded in the Y channel, and I'm not sure how I'd populate the x-axis with a column listing the null/not null results. Is this a thing I'd need to do before it gets to vega-lite, or does vega support it?
Yes, you can do this after reshaping your data with a Fold Transform. It looks something like this using Altair:
import numpy as np
import quilt
quilt.install("ResidentMario/missingno_data")
from quilt.data.ResidentMario import missingno_data
collisions = missingno_data.nyc_collision_factors()
collisions = collisions.replace("nan", np.nan)
collisions = collisions.set_index("Unnamed: 0")
import altair as alt
alt.Chart(collisions.sample(250)).transform_window(
    index='row_number()'
).transform_fold(
    collisions.columns.to_list()
).transform_calculate(
    defined="isValid(datum.value)"
).mark_rect().encode(
    x=alt.X('key:N',
        title=None,
        sort=collisions.columns.to_list(),
        axis=alt.Axis(orient='top', labelAngle=-45)
    ),
    y=alt.Y('index:O', title=None),
    color=alt.Color('defined:N',
        legend=None,
        scale=alt.Scale(domain=["true", "false"], range=["black", "white"])
    )
).properties(
    width=800, height=400
)
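If you would rather reshape before the data reaches Vega-Lite, roughly the same long format can be produced in pandas. This is a sketch on a tiny made-up frame, not the collisions data, and it is only an approximate equivalent of the transforms above: the row index is 0-based here, whereas row_number() starts at 1.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the collisions sample.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

long = (
    df.reset_index(drop=True)
      .reset_index()  # adds an "index" column, similar to row_number()
      .melt(id_vars="index", var_name="key", value_name="value")  # similar to fold
)

# Similar to isValid(datum.value): True where a value is present.
long["defined"] = long["value"].notna()
print(long)
```

Each original cell becomes one row with its row index, column name and nullity flag, which is the shape the mark_rect encoding expects.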

Manually setting xticks with xaxis_date() in Python/matplotlib

I've been looking into how to make plots against time on the x axis and have it pretty much sorted, with one strange quirk that makes me wonder whether I've run into a bug or (admittedly much more likely) am doing something I don't really understand.
Simply put, below is a simplified version of my program. If I put this in a .py file and execute it from an interpreter (ipython) I get a figure with an x axis with the year only, "2012", repeated a number of times, like this.
However, if I comment out the line (40) that sets the xticks manually, namely 'plt.xticks(tk)' and then run that exact command in the interpreter immediately after executing the script, it works great and my figure looks like this.
Similarly it also works if I just move that line to be after the savefig command in the script, that's to say to put it at the very end of the file. Of course in both cases only the figure drawn on screen will have the desired axis, and not the saved file. Why can't I set my x axis earlier?
Grateful for any insights, thanks in advance!
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import date2num  # date2num is used below
# define arrays for x, y and errors
x=[16.7,16.8,17.1,17.4]
y=[15,17,14,16]
e=[0.8,1.2,1.1,0.9]
xtn=[]
# convert x to datetime format
for t in x:
    hours=int(t)
    mins=int((t-int(t))*60)
    secs=int(((t-hours)*60-mins)*60)
    dt=datetime.datetime(2012,1,1,hours,mins,secs)
    xtn.append(date2num(dt))
# set up plot
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
# plot
ax.errorbar(xtn,y,yerr=e,fmt='+',elinewidth=2,capsize=0,color='k',ecolor='k')
# set x axis range
ax.xaxis_date()
t0=date2num(datetime.datetime(2012,1,1,16,35)) # x axis startpoint
t1=date2num(datetime.datetime(2012,1,1,17,35)) # x axis endpoint
plt.xlim(t0,t1)
# manually set xtick values
tk=[]
tk.append(date2num(datetime.datetime(2012,1,1,16,40)))
tk.append(date2num(datetime.datetime(2012,1,1,16,50)))
tk.append(date2num(datetime.datetime(2012,1,1,17,0)))
tk.append(date2num(datetime.datetime(2012,1,1,17,10)))
tk.append(date2num(datetime.datetime(2012,1,1,17,20)))
tk.append(date2num(datetime.datetime(2012,1,1,17,30)))
plt.xticks(tk)
plt.show()
# save to file
plt.savefig('savefile.png')
I don't think you need that call to xaxis_date(), since you are already providing the x-axis data in a format that matplotlib knows how to deal with. I also think there's something slightly wrong with your secs formula.
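One plausible reading of the secs remark (an assumption; the problem is not spelled out here) is that truncating with int() at every step loses a second to floating-point error. The sketch below compares the question's conversion with an alternative that rounds once; `truncating` and `rounding` are hypothetical helper names.

```python
from datetime import datetime, timedelta

def truncating(t):
    """The question's conversion: int() truncates at every step."""
    hours = int(t)
    mins = int((t - int(t)) * 60)
    secs = int(((t - hours) * 60 - mins) * 60)
    return datetime(2012, 1, 1, hours, mins, secs)

def rounding(t):
    """Alternative: round decimal hours to whole seconds once, then add to midnight."""
    return datetime(2012, 1, 1) + timedelta(seconds=round(t * 3600))

print(truncating(16.7).time())  # 16:41:59
print(rounding(16.7).time())    # 16:42:00
```

16.7 hours should be 16:42:00, but 16.7 cannot be represented exactly as a float, so the truncating version lands one second early.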
We can make use of matplotlib's built-in formatters and locators to:
set the major xticks to a regular interval (minutes, hours, days, etc.)
customize the display using a strftime formatting string
It appears that if a formatter is not specified, the default is to display the year, which is what you were seeing.
Try this out:
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MinuteLocator
x = [16.7,16.8,17.1,17.4]
y = [15,17,14,16]
e = [0.8,1.2,1.1,0.9]
xtn = []
for t in x:
    h = int(t)
    m = int((t-int(t))*60)
    xtn.append(dt.datetime.combine(dt.date(2012,1,1), dt.time(h,m)))
def larger_alim( alim ):
    ''' simple utility function to expand axis limits a bit '''
    amin,amax = alim
    arng = amax-amin
    nmin = amin - 0.1 * arng
    nmax = amax + 0.1 * arng
    return nmin,nmax
plt.errorbar(xtn,y,yerr=e,fmt='+',elinewidth=2,capsize=0,color='k',ecolor='k')
plt.gca().xaxis.set_major_locator( MinuteLocator(byminute=range(0,60,10)) )
plt.gca().xaxis.set_major_formatter( DateFormatter('%H:%M:%S') )
plt.gca().set_xlim( larger_alim( plt.gca().get_xlim() ) )
plt.show()
Result:
FWIW the utility function larger_alim was originally written for this other question: Is there a way to tell matplotlib to loosen the zoom on the plotted data?
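To see what larger_alim does in isolation, here it is applied to a plain pair of numbers instead of axis limits. The function body is a condensed version of the one above, and the input values are arbitrary.

```python
def larger_alim(alim):
    """Expand an (amin, amax) pair by 10% of its range on each side."""
    amin, amax = alim
    arng = amax - amin
    return amin - 0.1 * arng, amax + 0.1 * arng

print(larger_alim((0.0, 10.0)))  # (-1.0, 11.0)
```

Applied to get_xlim(), this leaves a margin of a tenth of the plotted range on either side of the data.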
