How to adjust scale ranges in altair? - python

I'm having trouble getting all of the axes onto the same scale when using altair to make a group of plots like so:
class_list = ['c-CS-m','c-CS-s','c-SC-m','c-SC-s','t-CS-m','t-CS-s','t-SC-m','t-SC-s']
list_of_plots = []
for class_name in class_list:
list_of_plots.append(alt.Chart(data[data['class'] == class_name]).mark_bar().encode(
x = alt.X('DYRK1A', bin = True, scale=alt.Scale()),
y = 'count()').resolve_scale(
y='independent'
))
list_of_plots[0] & list_of_plots[1] | list_of_plots[2] & list_of_plots[3] | list_of_plots[4] & list_of_plots[5] | list_of_plots[6] & list_of_plots[7]
I'd like to have the x axis run from 0.0 to 1.4 and the y axis run from 0 to 120 so that all eight plots I'm producing are on the same scale! I've tried to use domain, inside the currently empty Scale() call but it seems to result in the visualisations that have x axis data from say 0.0 to 0.3 being super squished up and I can't understand why?
For context, I'm trying to plot continuous values for protein expression levels. The 8 plots are for different classes of mice that have been exposed to different conditions. The data is available at this link if that helps: https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression
Please let me know if I need to provide some more info in order for you to help me!

First of all, it looks like you're trying to create a wrapped facet chart. Rather than doing that manually with concatenation, it's better to use a wrapped facet encoding.
Second, when you specify resolve_scale(y='independent'), you're specifying that the y-scales should not match between subcharts. If instead you want all scales to be shared, you can use resolve_scale(y='shared'), or equivalently just leave that out, as it is the default.
To specify explicit axis domains, use alt.Scale(domain=[min, max]). Put together, it might look something like this:
alt.Chart(data).mark_bar().encode(
x = alt.X('DYRK1A', bin = True, scale=alt.Scale(domain=[0, 1.4])),
y = alt.Y('count()', scale=alt.Scale(domain=[0, 120]),
facet = alt.Facet('class:N', columns=4),
)

Related

Altair Color Scatter Plot on Condition

I have this df:
x y term s
0 0.000000 0.132653 matlab 0.893072
1 0.000000 0.142857 matrix 0.905120
2 0.012346 0.153061 laboratory 0.902610
3 0.987654 0.989796 be 0.857932
4 0.938272 0.959184 a 0.861948
The variable s tells us the "distance" of the term from the central line (slope 1).
And I need to make a scatterplot that looks like this:
I have this code so far:
chart = alt.Chart(scatterdata_df).mark_circle().encode(
x = alt.X('x:Q', axis = alt.Axis(tickMinStep = 0.05)),
y = alt.Y('y:Q', axis = alt.Axis(tickMinStep = 0.05)),
color=alt.condition('s:Q', alt.value('red'), alt.value('blue')),
tooltip = ['term']
).properties(
width = 500,
height = 500
)
chart
And that gives me an error.
Javascript Error: Expression parse error: (s:Q)?"red":"blue"
This usually means there's a typo in your chart specification. See the javascript console for the full traceback.
When I just do color = 's' I get this, which is closer:
But again I need that double-gradient of colors. I know that the gradient is respective of the s variable, but I'm not sure how to make it have two gradients, one for each side of the central line.
s:Q is not a valid conditional statement. But, for example, you could write a condition like this:
color = alt.condition(alt.datum.s < 0, alt.value('red'), alt.value('blue'))
and points with s < 0 would be colored red, and all others would be colored blue.
Alternatively, if you want to encode a continuous color scale by the value of s (rather than deciding between two colors based on a condition), you could do
color = 's:Q'
If you'd like to use a color scheme in this case that's different from the default, you can specify it this way:
color = alt.Color('s:Q', scale=alt.Scale(scheme='redblue'))
where the string passed to the scheme argument is one of the built-in named color schemes, listed at https://vega.github.io/vega/docs/schemes/#reference
For more information on customizing colors in Altair, see https://altair-viz.github.io/user_guide/customization.html#customizing-colors

How do I reduce the number of ticks on an Altair graph?

I am using Altair to create a graph, but for some weird reason it's seems to be generating a tick for each of the points. Creating a graph like this Altair Graph
If I filter the dataframe, it produces weird axis values. Altair graph
Is there a way to reduce the amount of ticks? I tried tickCount in the y axis paramater and it didn't work since it seems to require integers.I also tried setting the axis value parameter to a list [0,0.2,0.4,0.6,0.8,1] and that didn't work either. Here is my code (sorry it's so lengthy!). Thank you in advance!
a = alt.Chart(df_filtered).mark_point().encode(x =alt.X('Process_Time_(mins)', axis = alt.Axis(title='Process Time (mins)')),
y = alt.Y('Heavy_Phase_%SS',axis=alt.Axis(title='Heavy Phase %SS', tickCount = 10),sort = 'descending'),
color = alt.Color('DSP_Lot', legend = alt.Legend(title = 'DSP_Lot')),
shape = alt.Shape('Strain', scale = alt.Scale(range = ["circle", "square", "cross", "diamond", "triangle-up", "triangle-down", "triangle-right", "triangle-left"])),
tooltip = [alt.Tooltip('DSP_Lot',title = 'Lot'), alt.Tooltip('Heavy_Phase_%SS', title = 'Heavy Phase %SS'),
alt.Tooltip('Process_Time_(mins)', title = 'Process Time (mins)'), alt.Tooltip('Purpose', title = 'Purpose'), alt.Tooltip('Strain', title = 'Strain'),
alt.Tooltip('Trial', title = 'Trial')]).properties(width = 1000, height = 500)
It's hard to tell without a reproducible example but I suspect the issue is that your y axis is defaulting to a nominal encoding type, in which case you get one tick mark per unique value. If you specify a quantitative type in the Y encoding, it may improve things:
y = alt.Y('Heavy_Phase_%SS:Q', ...)
The reason it defaults to nominal is probably because the associated column in the pandas dataframe has a string type rather than a numerical type.

Using interval selection: manipulate what is taken into aggregation of individual encoding channels of altair

I am making an XY-scatter chart, where both axes show aggregated data.
For both variables I want to have an interval selection in two small charts below where I can brush along the x-axis to set a range.
The selection should then be used to filter what is taken into account for each aggregation operation individually.
On the example of the cars data set, let's say I what to look at Horsepower over Displacement. But not of every car: instead I aggregate (sum) by Origin. Additionally I create two plots of totally mean HP and displacement over time, where I add interval selections, as to be able to set two distinct time ranges.
Here is an example of what it should look like, although the selection functionality is not yet as intended.
And here below is the code to produce it. Note, that I left some commented sections in there which show what I already tried, but does not work. The idea for the transform_calculate came from this GitHub issue. But I don't know how I could use the extracted boundary values for changing what is included in the aggregations of x and y channels. Neither the double transform_window took me anywhere. Could a transform_bin be useful here? How?
Basically, what I want is: when brush1 reaches for example from 1972 to 1975, and brush2 from 1976 to 1979, I want the scatter chart to plot the summed HP of each country in the years 1972, 1973 and 1974 against each countries summed displacement from 1976, 1977 and 1978 (for my case I don't need the exact date format, the Year might as well be integers here).
import altair as alt
from vega_datasets import data
cars = data.cars.url
brush1 = alt.selection(type="interval", encodings=['x'])
brush2 = alt.selection(type="interval", encodings=['x'])
scatter = alt.Chart(cars).mark_point().encode(
x = 'HP_sum:Q',
y = 'Dis_sum:Q',
tooltip = 'Origin:N'
).transform_filter( # Ok, I can filter the whole data set, but that always acts on both variables (HP and displacement) together... -> not what I want.
brush1 | brush2
).transform_aggregate(
Dis_sum = 'sum(Displacement)',
HP_sum = 'sum(Horsepower)',
groupby = ['Origin']
# ).transform_calculate( # Can I extract the selection boundaries like that? And if yes: how can I use these extracts to calculate the aggregationsof HP and displacement?
# b1_lower='(isDefined(brush1.x) ? (brush1.x[0]) : 1)',
# b1_upper='(isDefined(brush1.x) ? (brush1.x[1]) : 1)',
# b2_lower='(isDefined(brush2.x) ? (brush2.x[0]) : 1)',
# b2_upper='(isDefined(brush2.x) ? (brush2.x[1]) : 1)',
# ).transform_window( # Maybe instead of calculate I can use two window transforms...??
# conc_sum = 'sum(conc)',
# frame = [brush1.x[0],brush1.x[1]], # This will not work, as it sets the frame relative (back- and foreward) to each datum (i.e. sliding window), I need it to correspond to the entire data set
# groupby=['sample']
# ).transform_window(
# freq_sum = 'sum(freq)',
# frame = [brush2.x[0],brush2.x[1]], # ...same problem here
# groupby=['sample']
)
range_sel1 = alt.Chart(cars).mark_line().encode(
x = 'Year:T',
y = 'mean(Horsepower):Q'
).add_selection(
brush1
).properties(
height = 100
)
range_sel2 = alt.Chart(cars).mark_line().encode(
x = 'Year:T',
y = 'mean(Displacement):Q'
).add_selection(
brush2
).properties(
height = 100
)
scatter & range_sel1 & range_sel2
Interval selection cannot be used for aggregate charts yet in Vega-Lite. The error behavior have been updated in a recent PR to Vega-Lite to show a helpful message.
Not sure if I understand your requirements correctly, does this look close to what you want? (Just added param selections on top of your vertically concatenated graphs)
Vega Editor

Using adjustText to avoid label overlap with Python prince correspondence analysis

I have read about how efficient the package adjustText is with respect to avoiding label overlap and I would like to use to the following diagram created by prince:
Here is the code that created the image:
import pandas as pd
import prince
from adjustText import adjust_text
pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
X=pd.DataFrame(data=[ ... my data ... ],
columns=pd.Series([ ... my data ... ]),
index=pd.Series([ ... my data ...]),
)
ca = prince.CA(n_components=2,n_iter=3,copy=True,check_input=True,engine='auto',random_state=42)
ca = ca.fit(X)
ca.row_coordinates(X)
ca.column_coordinates(X)
ax = ca.plot_coordinates(X=X,ax=None,figsize=(6, 6),x_component=0,y_component=1,show_row_labels=True,show_col_labels=True)
ax.get_figure().savefig('figure.png')
In all examples of adjustText I could find, there was always a direct access to the coordinates of labels. How do I access the coordinates of labels in this case? How can I apply adjust_text to this figure?
First, deactivate label display by plot_coordinates():
ax = ca.plot_coordinates(X=X,ax=None,figsize=(6, 6),x_component=0,y_component=1,show_row_labels=False,show_col_labels=False)
Then, extract coordinates of columns and rows:
COLS=ca.column_coordinates(X).to_dict()
XCOLS=COLS[0]
YCOLS=COLS[1]
ROWS=ca.row_coordinates(X).to_dict()
XROWS=ROWS[0]
YROWS=ROWS[1]
Structures XCOLS, YCOLS, XROWS, YROWS are dictionaries with values that are floats (the coordinates). Let us merge the two x-axis dictionaries in a single x-axis dictionary I will call XGLOBAL, same thing for the y-axis dictionaries, into YGLOBAL:
XGLOBAL={ k : XCOLS.get(k,0)+XROWS.get(k,0) for k in set(XCOLS) | set(XROWS) }
YGLOBAL={ k : YCOLS.get(k,0)+YROWS.get(k,0) for k in set(YCOLS) | set(YROWS) }
Now I just apply adjust_text() as described in the documentation:
fig = ax.get_figure()
texts=[plt.text(XGLOBAL[x],YGLOBAL[x],x,fontsize=7) for x in XGLOBAL.keys()]
adjust_text(texts,arrowprops=dict(arrowstyle='-', color='red'))
fig.savefig('newfigure.png')
And the result is:
Notice that while the image generation was instantaneous without adjust_text, it took around 40 seconds with adjust_text.
You can also put a small angle in texts iteration. I saw on my side that it helps the adjust_text routine.
texts=[plt.text(XGLOBAL[x],YGLOBAL[x],x,fontsize=7,
rotation = -XGLOBAL.keys()+2*x) for x in XGLOBAL.keys()]

How do the x and y parameters in the Label object work for bokeh?

I've read the documentation for the Label class in Bokeh but the x and y parameters are quite confusing. Their behavior seems to change if you pass something to the x_units and y_units parameters but I don't understand what the units are supposed to be by default.
More specifically, I have a list of strings that I'm using for my x-axis:
xlab = [
'COREPCE2',
'COREPCE3',
'COREPCE4',
'COREPCE5',
'COREPCE6',
'',
'T5YIE'
]
p = figure(..., y_range = (0,.04), x_range = xlab)
If I wanted to draw basically anything else on the plot, I could just use those strings. For example I drew some lines like this:
p.line(['COREPCE2', 'T5YIE'], [.02,.02], color = 'black', line_dash = 'dashed')
p.line(['', ''], [0,.04], color = 'black')
And that works fine, this is the full chart.
Here's the issue though. I want to put a text label on the "COREPCE4" location of the x axis. If I try just passing the string for the x parameter in the Label class it just doesn't work:
section = Label(x = 'COREPCE4', y = .03, text = 'Survey of Professional Forecasters: August 9, 2019')
p.add_layout(section)
It throws an error: ValueError: expected a value of type Real, got COREPCE4 of type str. I don't really know what units its expecting. Is there a way to make Bokeh recognize that I want to use the x-axis label as my x parameter in the same way I've done with the other glyphs?
The propertied x_units, y_units, refer to screen (pixel) vs data-space (axis) units. As of Bokeh 1.3.4 the x and y properties of Label can only be set from floating point numbers, so they cannot be used directly with categorical coordinates. For now you should use LabelSet, even if you are only showing a single label, since it can work with categorical coordinates.

Categories

Resources