Colour bars based on values in pandas dataframe when using plotnine - python

I am trying to build a waterfall chart using plotnine. I would like to colour the starting and ending bars as grey (ideally I want to specify hexadecimal colours), increases as green and decreases as red.
Below is some sample data and my current plot. I am trying to set fill to the pandas column colour, but the bars are all black. I have also tried putting fill in the geom_segment, but this does not work either.
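import numpy as np
import pandas as pd
from plotnine import ggplot, aes, theme_light, geom_segment, ylab, scale_y_continuous
df = pd.DataFrame({})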
df['label'] = ('A','B','C','D','E')
df['percentile'] = (10)*5
df['value'] = (100,80,90,110,110)
df['yStart'] = (0,100,80,90,0)
df['barLabel'] = ('100','-20','+10','+20','110')
df['labelPosition'] = ('105','75','95','115','115')
df['colour'] = ('grey','red','green','green','grey')
p = (ggplot(df, aes(x=np.arange(0,5,1), xend=np.arange(0,5,1), y='yStart',yend='value',fill='colour'))
+ theme_light(6)
+ geom_segment(size=10)
+ ylab('value')
+ scale_y_continuous(breaks=np.arange(0,141,20), limits=[0,140], expand=(0,0))
)
EDIT
Based on teunbrand's comment of changing fill to color, I have the following. How do I specify the actual colour, preferably in hexadecimal format?

Just to close this question off, credit goes to teunbrand in the comments for the solution.
geom_segment() has a colour aesthetic but not a fill aesthetic. Replace fill='colour' with colour='colour'.
Plotnine will use default colours for the bars. Use scale_color_identity() if the contents of the DataFrame column are literal colours, or scale_colour_manual() to manually specify a tuple or list of colours. Both forms accept hexadecimal colours.
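A minimal sketch of the identity-scale approach, assuming the first and last bars are the totals and that increases and decreases can be inferred from yStart and value. The hex codes and the helper names bar_colours and waterfall are illustrative assumptions, not part of the original question.

```python
# Hex colours are illustrative assumptions; any valid hex string works.
GREY, GREEN, RED = "#808080", "#2ca02c", "#d62728"


def bar_colours(y_start, y_end):
    """Return one hex colour per bar: grey for the first and last bars
    (the totals), green for increases, red for decreases."""
    last = len(y_start) - 1
    colours = []
    for i, (start, end) in enumerate(zip(y_start, y_end)):
        if i in (0, last):
            colours.append(GREY)
        elif end >= start:
            colours.append(GREEN)
        else:
            colours.append(RED)
    return colours


def waterfall(y_start, y_end):
    """Build the plot. pandas and plotnine are imported here so that
    bar_colours() stays usable without them installed."""
    import pandas as pd
    from plotnine import aes, geom_segment, ggplot, scale_color_identity

    df = pd.DataFrame({
        "x": list(range(len(y_start))),
        "yStart": y_start,
        "value": y_end,
        "colour": bar_colours(y_start, y_end),
    })
    return (
        ggplot(df, aes(x="x", xend="x", y="yStart", yend="value", color="colour"))
        + geom_segment(size=10)
        # Treat the column's values as literal colours instead of categories.
        + scale_color_identity()
    )
```

Calling waterfall([0, 100, 80, 90, 0], [100, 80, 90, 110, 110]) should reproduce the question's bars with the hex colours above.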

Related

Altair Color Scatter Plot on Condition

I have this df:
x y term s
0 0.000000 0.132653 matlab 0.893072
1 0.000000 0.142857 matrix 0.905120
2 0.012346 0.153061 laboratory 0.902610
3 0.987654 0.989796 be 0.857932
4 0.938272 0.959184 a 0.861948
The variable s tells us the "distance" of the term from the central line (slope 1).
And I need to make a scatterplot that looks like this:
I have this code so far:
chart = alt.Chart(scatterdata_df).mark_circle().encode(
x = alt.X('x:Q', axis = alt.Axis(tickMinStep = 0.05)),
y = alt.Y('y:Q', axis = alt.Axis(tickMinStep = 0.05)),
color=alt.condition('s:Q', alt.value('red'), alt.value('blue')),
tooltip = ['term']
).properties(
width = 500,
height = 500
)
chart
And that gives me an error.
Javascript Error: Expression parse error: (s:Q)?"red":"blue"
This usually means there's a typo in your chart specification. See the javascript console for the full traceback.
When I just do color = 's' I get this, which is closer:
But again I need that double gradient of colors. I know that the gradient follows the s variable, but I'm not sure how to make it have two gradients, one for each side of the central line.
s:Q is not a valid conditional statement. But, for example, you could write a condition like this:
color = alt.condition(alt.datum.s < 0, alt.value('red'), alt.value('blue'))
and points with s < 0 would be colored red, and all others would be colored blue.
Alternatively, if you want to encode a continuous color scale by the value of s (rather than deciding between two colors based on a condition), you could do
color = 's:Q'
If you'd like to use a color scheme in this case that's different from the default, you can specify it this way:
color = alt.Color('s:Q', scale=alt.Scale(scheme='redblue'))
where the string passed to the scheme argument is one of the built-in named color schemes, listed at https://vega.github.io/vega/docs/schemes/#reference
For more information on customizing colors in Altair, see https://altair-viz.github.io/user_guide/customization.html#customizing-colors

How do I reduce the number of ticks on an Altair graph?

I am using Altair to create a graph, but for some reason it seems to be generating a tick for each of the points.
If I filter the dataframe, it produces strange axis values.
Is there a way to reduce the number of ticks? I tried tickCount in the y-axis parameter, but it didn't work, since it seems to require integers. I also tried setting the axis value parameter to a list [0,0.2,0.4,0.6,0.8,1], and that didn't work either. Here is my code (sorry it's so lengthy!). Thank you in advance!
a = alt.Chart(df_filtered).mark_point().encode(x =alt.X('Process_Time_(mins)', axis = alt.Axis(title='Process Time (mins)')),
y = alt.Y('Heavy_Phase_%SS',axis=alt.Axis(title='Heavy Phase %SS', tickCount = 10),sort = 'descending'),
color = alt.Color('DSP_Lot', legend = alt.Legend(title = 'DSP_Lot')),
shape = alt.Shape('Strain', scale = alt.Scale(range = ["circle", "square", "cross", "diamond", "triangle-up", "triangle-down", "triangle-right", "triangle-left"])),
tooltip = [alt.Tooltip('DSP_Lot',title = 'Lot'), alt.Tooltip('Heavy_Phase_%SS', title = 'Heavy Phase %SS'),
alt.Tooltip('Process_Time_(mins)', title = 'Process Time (mins)'), alt.Tooltip('Purpose', title = 'Purpose'), alt.Tooltip('Strain', title = 'Strain'),
alt.Tooltip('Trial', title = 'Trial')]).properties(width = 1000, height = 500)
It's hard to tell without a reproducible example but I suspect the issue is that your y axis is defaulting to a nominal encoding type, in which case you get one tick mark per unique value. If you specify a quantitative type in the Y encoding, it may improve things:
y = alt.Y('Heavy_Phase_%SS:Q', ...)
The reason it defaults to nominal is probably because the associated column in the pandas dataframe has a string type rather than a numerical type.
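If the column does hold numbers stored as strings, converting it before charting is one option. The sketch below assumes that situation; the helper name as_quantitative and the sample values are made up for illustration.

```python
import pandas as pd


def as_quantitative(df, column):
    """Return a copy of df with `column` converted to a numeric dtype.
    Values that cannot be parsed become NaN (errors="coerce")."""
    out = df.copy()
    out[column] = pd.to_numeric(out[column], errors="coerce")
    return out


# Hypothetical stand-in for the question's data: numbers stored as strings.
raw = pd.DataFrame({"Heavy_Phase_%SS": ["0.2", "0.4", "n/a"]})
clean = as_quantitative(raw, "Heavy_Phase_%SS")
print(raw["Heavy_Phase_%SS"].dtype, "->", clean["Heavy_Phase_%SS"].dtype)
```

Passing the converted frame to alt.Chart should then let the Y encoding be treated as quantitative, although writing the :Q type explicitly remains the more direct fix.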

Using interval selection: manipulate what is taken into aggregation of individual encoding channels of altair

I am making an XY-scatter chart, where both axes show aggregated data.
For both variables I want to have an interval selection in two small charts below where I can brush along the x-axis to set a range.
The selection should then be used to filter what is taken into account for each aggregation operation individually.
Taking the cars data set as an example, let's say I want to look at Horsepower over Displacement, but not for every car: instead I aggregate (sum) by Origin. Additionally, I create two plots of overall mean HP and displacement over time, to which I add interval selections so that I can set two distinct time ranges.
Here is an example of what it should look like, although the selection functionality is not yet as intended.
And here below is the code to produce it. Note that I left some commented sections in there which show what I already tried but which does not work. The idea for the transform_calculate came from this GitHub issue. But I don't know how I could use the extracted boundary values to change what is included in the aggregations of the x and y channels. The double transform_window did not get me anywhere either. Could a transform_bin be useful here? How?
Basically, what I want is: when brush1 reaches for example from 1972 to 1975, and brush2 from 1976 to 1979, I want the scatter chart to plot the summed HP of each country in the years 1972, 1973 and 1974 against each country's summed displacement from 1976, 1977 and 1978 (for my case I don't need the exact date format; the Year might as well be integers here).
import altair as alt
from vega_datasets import data
cars = data.cars.url
brush1 = alt.selection(type="interval", encodings=['x'])
brush2 = alt.selection(type="interval", encodings=['x'])
scatter = alt.Chart(cars).mark_point().encode(
x = 'HP_sum:Q',
y = 'Dis_sum:Q',
tooltip = 'Origin:N'
).transform_filter( # Ok, I can filter the whole data set, but that always acts on both variables (HP and displacement) together... -> not what I want.
brush1 | brush2
).transform_aggregate(
Dis_sum = 'sum(Displacement)',
HP_sum = 'sum(Horsepower)',
groupby = ['Origin']
# ).transform_calculate( # Can I extract the selection boundaries like that? And if yes: how can I use these extracts to calculate the aggregations of HP and displacement?
# b1_lower='(isDefined(brush1.x) ? (brush1.x[0]) : 1)',
# b1_upper='(isDefined(brush1.x) ? (brush1.x[1]) : 1)',
# b2_lower='(isDefined(brush2.x) ? (brush2.x[0]) : 1)',
# b2_upper='(isDefined(brush2.x) ? (brush2.x[1]) : 1)',
# ).transform_window( # Maybe instead of calculate I can use two window transforms...??
# conc_sum = 'sum(conc)',
# frame = [brush1.x[0],brush1.x[1]], # This will not work, as it sets the frame relative (backward and forward) to each datum (i.e. sliding window), I need it to correspond to the entire data set
# groupby=['sample']
# ).transform_window(
# freq_sum = 'sum(freq)',
# frame = [brush2.x[0],brush2.x[1]], # ...same problem here
# groupby=['sample']
)
range_sel1 = alt.Chart(cars).mark_line().encode(
x = 'Year:T',
y = 'mean(Horsepower):Q'
).add_selection(
brush1
).properties(
height = 100
)
range_sel2 = alt.Chart(cars).mark_line().encode(
x = 'Year:T',
y = 'mean(Displacement):Q'
).add_selection(
brush2
).properties(
height = 100
)
scatter & range_sel1 & range_sel2
Interval selection cannot be used for aggregate charts yet in Vega-Lite. The error behavior has been updated in a recent PR to Vega-Lite to show a helpful message.
Not sure if I understand your requirements correctly, does this look close to what you want? (Just added param selections on top of your vertically concatenated graphs)
Vega Editor

Plotnine's scale fill and axis position

I would like to move the x-axis to the top of my plot and manually fill the colors. However, the usual method in ggplot does not work in plotnine. When I provide the position='top' in my scale_x_continuous() I receive the warning: PlotnineWarning: scale_x_continuous could not recognize parameter 'position'. I understand position is not in plotnine's scale_x_continuous, but what is the replacement? Also, scale_fill_manual() results in an Invalid RGBA argument: 'color' error. Specifically, the value requires an array-like object. Thus I provided the array of colors, but still had an issue. How do I manually set the colors for a scale_fill object?
import numpy as np
import pandas as pd
from plotnine import *
lst = [[1,1,'a'],[2,2,'a'],[3,3,'a'],[4,4,'b'],[5,5,'b']]
df = pd.DataFrame(lst, columns =['xx', 'yy','lbls'])
fill_clrs = {'a': 'goldenrod1',
'b': 'darkslategray3'}
ggplot()+\
geom_tile(aes(x='xx', y='yy', fill = 'lbls'), df) +\
geom_text(aes(x='xx', y='yy', label='lbls'),df, color='white')+\
scale_x_continuous(expand=(0,0), position = "top")+\
scale_fill_manual(values = np.array(list(fill_clrs.values())))
Plotnine does not support changing the position of any axis.
You can pass a list or a dict of colour values to scale_fill_manual, provided they are colour names that matplotlib recognises. 'goldenrod1' and 'darkslategray3' are R colour names and are not among them. To see that it works, try 'red' and 'green'; see https://matplotlib.org/gallery/color/named_colors.html for all the named colours. Otherwise, you can also use hex colours, e.g. #ff00cc.
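A short sketch of the dict form with hex colours. The hex values are intended to match R's goldenrod1 and darkslategray3 and should be treated as an assumption; the names FILL_COLOURS and tile_plot are illustrative. The position argument is left out because it is not supported.

```python
# Hex values assumed to correspond to the R colour names in the question.
FILL_COLOURS = {
    "a": "#FFC125",  # goldenrod1
    "b": "#79CDCD",  # darkslategray3
}


def tile_plot(df):
    """Build the tile plot from a DataFrame with columns xx, yy and lbls.
    plotnine is imported here so FILL_COLOURS can be inspected without it."""
    from plotnine import (aes, geom_text, geom_tile, ggplot,
                          scale_fill_manual, scale_x_continuous)

    return (
        ggplot(df, aes(x="xx", y="yy"))
        + geom_tile(aes(fill="lbls"))
        + geom_text(aes(label="lbls"), color="white")
        + scale_x_continuous(expand=(0, 0))
        # A dict maps each value of the fill column to its colour.
        + scale_fill_manual(values=FILL_COLOURS)
    )
```

With the question's DataFrame, tile_plot(df) should draw the 'a' tiles in gold and the 'b' tiles in pale teal.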

remove overlay text from pandas boxplot

I am trying to remove the overlay text on my boxplot I created using pandas. The code to generate it is as follows (minus a few other modifications):
ax = df.boxplot(column='min2',by=df['geomfull'],ax=axes,grid=False,vert=False, sym='',return_type='dict')
I just want to remove the "boxplot grouped by 0..." etc. and I can't work out what object it is in the plot. I thought it was an overflowing title but I can't find where the text is coming from! Thanks in advance.
EDIT: I found a workaround, which is to construct a new pandas frame with just the relevant list of things I want to box (removing all other variables).
data = {}
maps = ['BA4','BA5','BB4','CA4','CA5','EA4','EA5','EB4','EC4','EX4','EX5']
for mapi in maps:
mask = (df['geomfull'] == mapi)
arr = np.array(df['min2'][mask])
data[mapi] = arr
dfsub = pd.DataFrame(data)
Then I can use the df.plot routines as per examples....
bp = dfsub.plot(kind='box',ax=ax, vert=False,return_type='dict',sym='',grid=False)
This produces the same plot without the overlay.
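The overlay text is most likely the figure-level title ("Boxplot grouped by ...") that pandas adds when by= is used, possibly together with the per-axes title showing the column name. Assuming that is the case, a sketch that keeps the grouped boxplot and clears both titles could look like this. The data is made up, and only the column names follow the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; only the column names follow the question.
df = pd.DataFrame({
    "min2": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geomfull": ["BA4", "BA4", "BA4", "BA5", "BA5", "BA5"],
})

fig, ax = plt.subplots()
df.boxplot(column="min2", by="geomfull", ax=ax, grid=False)

fig.suptitle("")  # clears the "Boxplot grouped by ..." figure title
ax.set_title("")  # clears the per-axes title (the column name)
```

The other keyword arguments from the question can be passed to boxplot as before; they are omitted here only to keep the sketch short.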
