How do I reduce the number of ticks on an Altair graph? - python

I am using Altair to create a graph, but for some reason it seems to be generating a tick for each of the points, producing a chart like this: Altair graph.
If I filter the dataframe, it produces strange axis values: Altair graph.
Is there a way to reduce the number of ticks? I tried tickCount in the y-axis parameter and it didn't work, since it seems to require integers. I also tried setting the axis values parameter to a list [0, 0.2, 0.4, 0.6, 0.8, 1] and that didn't work either. Here is my code (sorry it's so lengthy!). Thank you in advance!
a = alt.Chart(df_filtered).mark_point().encode(
    x=alt.X('Process_Time_(mins)', axis=alt.Axis(title='Process Time (mins)')),
    y=alt.Y('Heavy_Phase_%SS', axis=alt.Axis(title='Heavy Phase %SS', tickCount=10), sort='descending'),
    color=alt.Color('DSP_Lot', legend=alt.Legend(title='DSP_Lot')),
    shape=alt.Shape('Strain', scale=alt.Scale(range=["circle", "square", "cross", "diamond", "triangle-up", "triangle-down", "triangle-right", "triangle-left"])),
    tooltip=[alt.Tooltip('DSP_Lot', title='Lot'),
             alt.Tooltip('Heavy_Phase_%SS', title='Heavy Phase %SS'),
             alt.Tooltip('Process_Time_(mins)', title='Process Time (mins)'),
             alt.Tooltip('Purpose', title='Purpose'),
             alt.Tooltip('Strain', title='Strain'),
             alt.Tooltip('Trial', title='Trial')]
).properties(width=1000, height=500)

It's hard to tell without a reproducible example but I suspect the issue is that your y axis is defaulting to a nominal encoding type, in which case you get one tick mark per unique value. If you specify a quantitative type in the Y encoding, it may improve things:
y = alt.Y('Heavy_Phase_%SS:Q', ...)
The reason it defaults to nominal is probably because the associated column in the pandas dataframe has a string type rather than a numerical type.
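A minimal sketch of that idea, assuming pandas is imported as pd and the column in df_filtered is currently stored as strings; once the axis is quantitative, tickCount or an explicit values list should behave as expected:
df_filtered['Heavy_Phase_%SS'] = pd.to_numeric(df_filtered['Heavy_Phase_%SS'])

y = alt.Y(
    'Heavy_Phase_%SS:Q',
    axis=alt.Axis(title='Heavy Phase %SS', values=[0, 0.2, 0.4, 0.6, 0.8, 1]),
    sort='descending',
)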

Related

Altair shortened form of x-axis label issues

I'm trying to make the x-axis show a slice of the longspecies string, while keeping the full longspecies in the legend. I've tried a couple of different approaches; without the domain the code below works, but once I add the domain the x-axis at least doubles in length, presumably because it adds the correlated values from both species and longspecies.
I'm not sure how to use just longspecies and slice the tick labels either (value and label don't seem to be what I'm looking for). Any help would be appreciated.
the data setup:
import numpy as np
import pandas as pd

testspecies = ['oak', 'elm', 'willow']
testmean = [np.random.rand()*25 + 75 for _ in range(len(testspecies))]
testlongspecies = [testspecies[i] + ' long form name' for i in range(len(testspecies))]
testset = zip(testmean, testspecies, testlongspecies, strict=True)
testdf = pd.DataFrame(testset, columns=['average', 'species', 'long species'])
The chart:
alt.Chart(testdf).mark_bar().encode(
    alt.X('species', title=None),
    alt.Y('average', title='mean', scale=alt.Scale(domain=(50, 100))),
    color='long species'
)
If I add clip=True inside the mark, I get the following graph, which sounds like what you are looking for (the x-axis labels are shorter than those in the legend):
alt.Chart(testdf).mark_bar(clip=True).encode(
    alt.X('species', title=None),
    alt.Y('average', title='mean', scale=alt.Scale(domain=(50, 100))),
    color='long species'
)
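If you would rather encode long species directly on the x-axis and only shorten the tick labels, a hedged sketch (not part of the original answer) using Vega-Lite's labelExpr could look like this; substring is a Vega expression that slices the label string:
alt.Chart(testdf).mark_bar(clip=True).encode(
    alt.X('long species', title=None,
          axis=alt.Axis(labelExpr="substring(datum.label, 0, 3)")),
    alt.Y('average', title='mean', scale=alt.Scale(domain=(50, 100))),
    color='long species'
)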

Using interval selection: manipulate what is taken into aggregation of individual encoding channels of altair

I am making an XY-scatter chart, where both axes show aggregated data.
For both variables I want to have an interval selection in two small charts below where I can brush along the x-axis to set a range.
The selection should then be used to filter what is taken into account for each aggregation operation individually.
Using the cars data set as an example, let's say I want to look at Horsepower over Displacement. But not for every car: instead I aggregate (sum) by Origin. Additionally, I create two plots of the overall mean HP and displacement over time, to which I add interval selections so that I can set two distinct time ranges.
Here is an example of what it should look like, although the selection does not yet behave as intended.
And below is the code to produce it. Note that I left in some commented sections showing what I already tried but that does not work. The idea for the transform_calculate came from this GitHub issue, but I don't know how I could use the extracted boundary values to change what is included in the aggregations of the x and y channels. Nor did the double transform_window get me anywhere. Could a transform_bin be useful here? How?
Basically, what I want is: when brush1 spans, for example, 1972 to 1975, and brush2 spans 1976 to 1979, the scatter chart should plot each country's summed HP over the years 1972, 1973 and 1974 against each country's summed displacement over 1976, 1977 and 1978 (for my case I don't need the exact date format; the Year might as well be an integer here).
import altair as alt
from vega_datasets import data

cars = data.cars.url

brush1 = alt.selection(type="interval", encodings=['x'])
brush2 = alt.selection(type="interval", encodings=['x'])

scatter = alt.Chart(cars).mark_point().encode(
    x='HP_sum:Q',
    y='Dis_sum:Q',
    tooltip='Origin:N'
).transform_filter(  # Ok, I can filter the whole data set, but that always acts on both variables (HP and displacement) together... -> not what I want.
    brush1 | brush2
).transform_aggregate(
    Dis_sum='sum(Displacement)',
    HP_sum='sum(Horsepower)',
    groupby=['Origin']
# ).transform_calculate(  # Can I extract the selection boundaries like that? And if yes: how can I use these extracts to calculate the aggregations of HP and displacement?
#     b1_lower='(isDefined(brush1.x) ? (brush1.x[0]) : 1)',
#     b1_upper='(isDefined(brush1.x) ? (brush1.x[1]) : 1)',
#     b2_lower='(isDefined(brush2.x) ? (brush2.x[0]) : 1)',
#     b2_upper='(isDefined(brush2.x) ? (brush2.x[1]) : 1)',
# ).transform_window(  # Maybe instead of calculate I can use two window transforms...??
#     conc_sum='sum(conc)',
#     frame=[brush1.x[0], brush1.x[1]],  # This will not work, as it sets the frame relative (backward and forward) to each datum (i.e. a sliding window); I need it to correspond to the entire data set
#     groupby=['sample']
# ).transform_window(
#     freq_sum='sum(freq)',
#     frame=[brush2.x[0], brush2.x[1]],  # ...same problem here
#     groupby=['sample']
)

range_sel1 = alt.Chart(cars).mark_line().encode(
    x='Year:T',
    y='mean(Horsepower):Q'
).add_selection(
    brush1
).properties(
    height=100
)

range_sel2 = alt.Chart(cars).mark_line().encode(
    x='Year:T',
    y='mean(Displacement):Q'
).add_selection(
    brush2
).properties(
    height=100
)

scatter & range_sel1 & range_sel2
Interval selections cannot yet be used with aggregate charts in Vega-Lite. The error behavior has been updated in a recent PR to Vega-Lite to show a helpful message.
Not sure if I understand your requirements correctly, but does this look close to what you want? (I just added param selections on top of your vertically concatenated charts.)
Vega Editor
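For reference, a rough sketch of the newer parameter-based selection syntax (Altair 5), shown here only for the lower brush charts; this is an assumption about what "param selections" would look like in Altair and does not by itself make the aggregation respect two separate ranges:
brush1 = alt.selection_interval(encodings=['x'])
brush2 = alt.selection_interval(encodings=['x'])

range_sel1 = alt.Chart(cars).mark_line().encode(
    x='Year:T',
    y='mean(Horsepower):Q'
).add_params(
    brush1
).properties(
    height=100
)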

Python - Plot a graph with times on x-axis

I have the following dataframe in pandas:
dfClicks = pd.DataFrame({'clicks': [700, 800, 550],
                         'date_of_click': ['10/25/1995 03:30', '10/25/1995 04:30', '10/25/1995 05:30']})
dfClicks['date_of_click'] = pd.to_datetime(dfClicks['date_of_click'])
dfClicks.set_index('date_of_click')
dfClicks.clicks = pd.to_numeric(dfClicks.clicks)
Could you please advise how I can plot the above so that the x-axis shows the date/time and the y-axis the number of clicks? I will also need to plot another dataframe, which includes predicted clicks, on the same graph for comparison. The test data could be a replica of the above with minor changes:
dfClicks2 = pd.DataFrame({'clicks': [750, 850, 500],
                          'date_of_click': ['10/25/1995 03:30', '10/25/1995 04:30', '10/25/1995 05:30']})
dfClicks2['date_of_click'] = pd.to_datetime(dfClicks2['date_of_click'])
dfClicks2.set_index('date_of_click')
dfClicks2.clicks = pd.to_numeric(dfClicks2.clicks)
Convert the clicks column to numeric, then:
ax = dfClicks.plot()
dfClicks2.plot(ax=ax)
ax.legend(["Clicks","Clicks2"])
Output: a line plot with date_of_click on the x-axis and both click series on the y-axis.
UPDATE:
There is an error in how you set the index; replace
dfClicks.set_index('date_of_click')
with:
dfClicks = dfClicks.set_index('date_of_click')
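Putting the steps together, a minimal end-to-end sketch (assuming the two dataframes defined in the question):
import matplotlib.pyplot as plt

dfClicks = dfClicks.set_index('date_of_click')
dfClicks2 = dfClicks2.set_index('date_of_click')

ax = dfClicks.plot()
dfClicks2.plot(ax=ax)
ax.legend(["Clicks", "Clicks2"])
plt.show()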

How do the x and y parameters in the Label object work for bokeh?

I've read the documentation for the Label class in Bokeh but the x and y parameters are quite confusing. Their behavior seems to change if you pass something to the x_units and y_units parameters but I don't understand what the units are supposed to be by default.
More specifically, I have a list of strings that I'm using for my x-axis:
xlab = [
    'COREPCE2',
    'COREPCE3',
    'COREPCE4',
    'COREPCE5',
    'COREPCE6',
    '',
    'T5YIE'
]
p = figure(..., y_range = (0,.04), x_range = xlab)
If I wanted to draw basically anything else on the plot, I could just use those strings. For example I drew some lines like this:
p.line(['COREPCE2', 'T5YIE'], [.02,.02], color = 'black', line_dash = 'dashed')
p.line(['', ''], [0,.04], color = 'black')
And that works fine; this is the full chart.
Here's the issue though. I want to put a text label on the "COREPCE4" location of the x axis. If I try just passing the string for the x parameter in the Label class it just doesn't work:
section = Label(x = 'COREPCE4', y = .03, text = 'Survey of Professional Forecasters: August 9, 2019')
p.add_layout(section)
It throws an error: ValueError: expected a value of type Real, got COREPCE4 of type str. I don't really know what units it's expecting. Is there a way to make Bokeh recognize that I want to use the x-axis category as my x parameter, the same way I've done with the other glyphs?
The properties x_units and y_units refer to screen (pixel) versus data-space (axis) units. As of Bokeh 1.3.4, the x and y properties of Label can only be set from floating-point numbers, so they cannot be used directly with categorical coordinates. For now you should use LabelSet, even if you are only showing a single label, since it can work with categorical coordinates.
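A minimal sketch of that workaround, assuming the figure p from the question; LabelSet reads its coordinates from a ColumnDataSource, so the categorical x value can be passed as data:
from bokeh.models import ColumnDataSource, LabelSet

source = ColumnDataSource(data=dict(
    x=['COREPCE4'],
    y=[0.03],
    text=['Survey of Professional Forecasters: August 9, 2019'],
))
labels = LabelSet(x='x', y='y', text='text', source=source)
p.add_layout(labels)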

Seaborn clustermap fixed cell size

I am using the seaborn clustermap function and I would like to make multiple plots where the cell sizes are exactly identical. The size of the axis labels should also be the same. This means the figure size and aspect ratio will need to change, while the rest stays identical.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

dataFrameA = pd.DataFrame([[1, 2], [3, 4]])
dataFrameB = pd.DataFrame(np.arange(3*6).reshape(3, -1))
Then decide how big the clustermap itself needs to be, something along the lines of:
dpi = 72
cellSizePixels = 150
This decides that the heatmap for dataFrameA should be 300 by 300 pixels. I think those need to be converted to the figure's size units (inches), which works out to cellSizePixels/dpi, about 2.08 inches per cell here, or roughly 4.2 by 4.2 inches for dataFrameA's heatmap. Here I am introducing a problem: there is other stuff around the heatmap, which also takes up space, and I don't know exactly how much space it will take.
I tried to parametrize the heatmap function with a guess of the image size using the formula above:
def fixedWidthClusterMap(dpi, cellSizePixels, dataFrame):
    clustermapParams = {
        'square': False  # Tried to set this to True before. Don't: the dendrograms do not scale well with it.
    }
    figureWidth = (cellSizePixels/dpi)*dataFrame.shape[1]
    figureHeight = (cellSizePixels/dpi)*dataFrame.shape[0]
    return sns.clustermap(dataFrame, figsize=(figureWidth, figureHeight), **clustermapParams)

fixedWidthClusterMap(dpi, cellSizePixels, dataFrameA)
plt.show()
fixedWidthClusterMap(dpi, cellSizePixels, dataFrameB)
plt.show()
This yields:
My question: how do I obtain square cells which are exactly the size I want?
This is a bit tricky, because there are quite a few things to take into consideration, and in the end it depends on how "exact" you need the sizes to be.
Looking at the code for clustermap the heatmap part is designed to have a ratio of 0.8 compared to the axes used for the dendrograms. But we also need to take into account the margins used to place the axes. If one knows the size of the heatmap axes, one should therefore be able to calculate the desired figure size that would produce the right shape.
dpi = matplotlib.rcParams['figure.dpi']
marginWidth = matplotlib.rcParams['figure.subplot.right'] - matplotlib.rcParams['figure.subplot.left']
marginHeight = matplotlib.rcParams['figure.subplot.top'] - matplotlib.rcParams['figure.subplot.bottom']
Ny, Nx = dataFrame.shape
figWidth = (Nx*cellSizePixels/dpi)/0.8/marginWidth
figHeight = (Ny*cellSizePixels/dpi)/0.8/marginHeight
Unfortunately, it seems matplotlib must adjust things a bit during plotting, because that was not enough to get perfectly square heatmap cells. So I chose to resize the various axes created by clustermap after the fact, starting with the heatmap, then the dendrogram axes.
I think the resulting image is pretty close to what you were trying to get, but my tests sometimes show errors of 1-2 px, which I attribute to rounding errors from all the conversions between sizes in inches and pixels.
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

dataFrameA = pd.DataFrame([[1, 2], [3, 4]])
dataFrameB = pd.DataFrame(np.arange(3*6).reshape(3, -1))

def fixedWidthClusterMap(dataFrame, cellSizePixels=50):
    # Calculate the figure size; this gets us close, but not quite to the right place
    dpi = matplotlib.rcParams['figure.dpi']
    marginWidth = matplotlib.rcParams['figure.subplot.right'] - matplotlib.rcParams['figure.subplot.left']
    marginHeight = matplotlib.rcParams['figure.subplot.top'] - matplotlib.rcParams['figure.subplot.bottom']
    Ny, Nx = dataFrame.shape
    figWidth = (Nx*cellSizePixels/dpi)/0.8/marginWidth
    figHeight = (Ny*cellSizePixels/dpi)/0.8/marginHeight

    # do the actual plot
    grid = sns.clustermap(dataFrame, figsize=(figWidth, figHeight))

    # calculate the size of the heatmap axes
    axWidth = (Nx*cellSizePixels)/(figWidth*dpi)
    axHeight = (Ny*cellSizePixels)/(figHeight*dpi)

    # resize heatmap
    ax_heatmap_orig_pos = grid.ax_heatmap.get_position()
    grid.ax_heatmap.set_position([ax_heatmap_orig_pos.x0, ax_heatmap_orig_pos.y0,
                                  axWidth, axHeight])

    # resize dendrograms to match
    ax_row_orig_pos = grid.ax_row_dendrogram.get_position()
    grid.ax_row_dendrogram.set_position([ax_row_orig_pos.x0, ax_row_orig_pos.y0,
                                         ax_row_orig_pos.width, axHeight])
    ax_col_orig_pos = grid.ax_col_dendrogram.get_position()
    grid.ax_col_dendrogram.set_position([ax_col_orig_pos.x0, ax_heatmap_orig_pos.y0 + axHeight,
                                         axWidth, ax_col_orig_pos.height])
    return grid  # return ClusterGrid object

grid = fixedWidthClusterMap(dataFrameA, cellSizePixels=75)
plt.show()
grid = fixedWidthClusterMap(dataFrameB, cellSizePixels=75)
plt.show()
Not a complete answer (it doesn't deal with pixels), but I suspect the OP has moved on after 4 years.
def reshape_clustermap(cmap, cell_width=0.02, cell_height=0.02):
    ny, nx = cmap.data2d.shape
    hmap_width = nx * cell_width
    hmap_height = ny * cell_height
    hmap_orig_pos = cmap.ax_heatmap.get_position()
    cmap.ax_heatmap.set_position(
        [hmap_orig_pos.x0, hmap_orig_pos.y0, hmap_width, hmap_height]
    )
    top_dg_pos = cmap.ax_col_dendrogram.get_position()
    cmap.ax_col_dendrogram.set_position(
        [hmap_orig_pos.x0, hmap_orig_pos.y0 + hmap_height, hmap_width, top_dg_pos.height]
    )
    left_dg_pos = cmap.ax_row_dendrogram.get_position()
    cmap.ax_row_dendrogram.set_position(
        [left_dg_pos.x0, left_dg_pos.y0, left_dg_pos.width, hmap_height]
    )
    if cmap.ax_cbar:
        cbar_pos = cmap.ax_cbar.get_position()
        hmap_pos = cmap.ax_heatmap.get_position()
        cmap.ax_cbar.set_position(
            [cbar_pos.x0, hmap_pos.y1, cbar_pos.width, cbar_pos.height]
        )

cmap = sns.clustermap(dataFrameA)
reshape_clustermap(cmap)
