plotnine geom_boxplot ignores required aesthetic and requires unnecessary aesthetic - python

I have data that looks like:
Scenario ymin lower middle upper ymax
One 16362.586379 20911.338893 27121.693254 35219.449009 46406.087619
Two 19779.003240 25390.096116 33108.174561 43545.202225 58464.277060
Rather than use all 50k data points for every Scenario (there are many more than One and Two), I've precomputed the positions I need for the box and whiskers.
I try to plot this via
import pandas
import plotnine as p9
df = pandas.read_excel('boxplot_data.xlsx', sheet_name='Sheet1')
gg = p9.ggplot()
gg += p9.geoms.geom_boxplot(mapping=p9.aes(x='Scenario', ymin='ymin', lower='lower', middle='middle', upper='upper', ymax='ymax'), data=df, color='k', show_legend=False, inherit_aes=False)
gg += p9.themes.theme_seaborn()
gg += p9.labels.xlab('Scenario')
gg.save(filename='scenario_boxplot.png', dpi=300)
The documentation at https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_boxplot.html#plotnine.geoms.geom_boxplot indicates that the geom_boxplot line of code supplies the required aesthetic parameters to define the box and whiskers.
Running this, however, gives
plotnine.exceptions.PlotnineError: 'stat_boxplot requires the
following missing aesthetics: y'
Why is stat_boxplot being called, with its required aesthetics, not geom_boxplot?
And more importantly, does anybody know how to correct this?

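By default, geom_boxplot uses stat_boxplot, which computes the box and whisker positions from a y aesthetic; that is why the error asks for y. Because you have already computed those positions, use stat_identity instead:
geom_boxplot(stat='identity', ...)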

Related

pymc3 multivariate traceplot color coding

I am new to working with pymc3 and I am having trouble generating an easy-to-read traceplot.
I'm fitting a mixture of 4 multivariate gaussians to some (x, y) points in a dataset. The model runs fine. My question is with regard to manipulating the pm.traceplot() command to make the output more user-friendly.
Here's my code:
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
# points is assumed to hold the observed (x, y) data as an array of shape (n, 2)
model = pm.Model()
N_CLUSTERS = 4
with model:
    # cluster prior
    w = pm.Dirichlet('w', np.ones(N_CLUSTERS))
    # latent cluster of each observation
    category = pm.Categorical('category', p=w, shape=len(points))
    # make sure each cluster has some values
    w_min_potential = pm.Potential('w_min_potential', tt.switch(tt.min(w) < 0.1, -np.inf, 0))
    # multivariate normal means
    mu = pm.MvNormal('mu', [0, 0], cov=[[1, 0], [0, 1]], shape=(N_CLUSTERS, 2))
    # break symmetry
    pm.Potential('order_mu_potential', tt.switch(
        tt.all(
            [mu[i, 0] < mu[i+1, 0] for i in range(N_CLUSTERS - 1)]), -np.inf, 0))
    # multivariate centers
    data = pm.MvNormal('data', mu=mu[category], cov=[[1, 0], [0, 1]], observed=points)
with model:
    trace = pm.sample(1000)
A call to pm.traceplot(trace, ['w', 'mu']) produces this image:
As you can see, it is ambiguous which mean peak corresponds to an x or y value, and which ones are paired together. I have managed a workaround as follows:
from cycler import cycler
# plot the x-means and y-means of our data
fig, (ax0, ax1) = plt.subplots(nrows=2)
plt.xlabel(r'$\mu$')
plt.ylabel('frequency')
# set the color cycle before plotting so that it applies to the histograms
ax0.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
ax1.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k']))
for i in range(4):
    ax0.hist(trace['mu'][:, i, 0], bins=100, label='x{}'.format(i), alpha=0.6)
    ax1.hist(trace['mu'][:, i, 1], bins=100, label='y{}'.format(i), alpha=0.6)
ax0.legend()
ax1.legend()
This produces the following, much more legible plot:
I have looked through the pymc3 documentation and recent questions here, but to no avail. My question is this: is it possible to do what I have done here with matplotlib via builtin methods in pymc3, and if so, how?
Better differentiation between multidimensional variables and the different chains was recently added to ArviZ (the library PyMC3 relies on for plotting).
In the latest version of ArviZ, you should be able to do:
az.plot_trace(trace, compact=True, legend=True)
to get the different dimensions of each variable distinguished by color and the different chains distinguished by linestyle. The default setting is using matplotlib's default color cycle and 4 different linestyles, solid, dashed, dotted and dash-dotted. Both properties can be set to custom aesthetics and custom values by using compact_prop to customize dimension representation and chain_prop to customize chain representation. In addition, if using compact, it may also be a good idea to use combined=True to reduce the clutter in the first column. As an example:
az.plot_trace(trace, compact=True, combined=True, legend=True, chain_prop=("ls", "-"))
would plot the KDEs in the first column using the data from all chains, and would plot all chains using a solid linestyle (due to combined arg, only relevant for the second column). Two legends will be shown, one for the chain info and another for the compact info.
At least in recent versions, you can use compact=True as in:
pm.traceplot(trace, var_names = ['parameters'], compact=True)
to get one graph with all your parameters combined.
Docs: https://arviz-devs.github.io/arviz/_modules/arviz/plots/traceplot.html
However, I haven't been able to get the colors to differ between lines.

Using python and matplotlib, fill between two lines not giving expected output

I am trying to plot a straight line with its associated error.
I calculated values for slope (a) and intercepts (b). In addition, I calculated the error associated with these values. So I drew the line given by the typical formula below.
y=ax+b
However, in addition to the line, I also want to draw the associated error. I came up with the idea to draw the lines associated with these formulas and color the space between the lines gray.
y=(a+a_sd)x+(b+b_sd)
y=(a-a_sd)x+(b-b_sd)
Using the following piece of code, I am able to color part of the surface between the lines, but not the whole span (see included output).
I think this may be due to the fact that "distance" is not sorted, and fill_between is using distance[0] and distance[-1] as begin and end for the span, respectively.
As always, any help would be highly appreciated!
import matplotlib.pyplot as plt
distance=[0.35645334340084989, 0.55406894241607718, 0.10201413273193734, 0.13401365724625941, 0.71918808865838735, 0.14151335417722818]
time=[2.4004984846346171, 2.4909766335028447, 1.9852064018125195, 1.9083156734132103, 2.6380396934372863, 1.9114505780323543]
time_SD=[0.062393810960652669, 0.056945715242838917, 0.073960838867327183, 0.084111239062664475, 0.026912957190265499, 0.08595664694840538]
distance_SD=[0.035160608598240162, 0.032976715460514235, 0.02782911002465227, 0.035465701695038584, 0.043009444687382707, 0.038387585107200854]
a=1.17887019041
b=1.83339229489
a_sd=0.159771527859
b_sd=0.0762509747218
plt.errorbar(distance,time,yerr=time_SD, xerr=distance_SD, linestyle="None")
abline_values = [(a)*i + (b) for i in distance]
abline_values_plus = [(a+a_sd)*i + (b+b_sd) for i in distance]
abline_values_minus = [(a-a_sd)*i + (b-b_sd) for i in distance]
plt.plot(distance, abline_values,"r")
plt.fill_between(distance,abline_values_minus,abline_values_plus,facecolor='lightgrey', interpolate=True, edgecolors="None")
leg = plt.legend(loc="lower right", frameon=False, handlelength=0, handletextpad=0)
for item in leg.legendHandles:
    item.set_visible(False)
plt.show()
In order to use pyplot.fill_between() the list to plot the horizontal coordinate should be sorted. Using an unsorted list of x values is possible, but can lead to undesired results.
Sorting a list can be done using sorted(list).
import matplotlib.pyplot as plt
distance=[0.35645334340084989, 0.55406894241607718, 0.10201413273193734, 0.13401365724625941, 0.71918808865838735, 0.14151335417722818]
time=[2.4004984846346171, 2.4909766335028447, 1.9852064018125195, 1.9083156734132103, 2.6380396934372863, 1.9114505780323543]
time_SD=[0.062393810960652669, 0.056945715242838917, 0.073960838867327183, 0.084111239062664475, 0.026912957190265499, 0.08595664694840538]
distance_SD=[0.035160608598240162, 0.032976715460514235, 0.02782911002465227, 0.035465701695038584, 0.043009444687382707, 0.038387585107200854]
a=1.17887019041
b=1.83339229489
a_sd=0.159771527859
b_sd=0.0762509747218
distance_sorted = sorted(distance)
plt.errorbar(distance,time,yerr=time_SD, xerr=distance_SD, linestyle="None")
abline_values = [(a)*i + (b) for i in distance_sorted]
abline_values_plus = [(a+a_sd)*i + (b+b_sd) for i in distance_sorted]
abline_values_minus = [(a-a_sd)*i + (b-b_sd) for i in distance_sorted]
plt.plot(distance_sorted, abline_values,"r")
plt.fill_between(distance_sorted,abline_values_minus,abline_values_plus, facecolor='lightgrey', edgecolors="None")
plt.show()
The documentation does not mention the requirement of x values being sorted. The reason is probably that fill_between actually works even with unsorted lists, just not the way one might expect. Maybe the following animation gives a more intuitive understanding on the issue:
You are right: fill_between seems to expect the values to be sorted. The documentation is not clear about this behaviour, though. The following example shows the same effect:
import matplotlib.pyplot as plt
from numpy import random, array
#x = random.randn(20) #does not work
x = array(sorted(random.randn(20))) #works
a = 2
d = .5
y_h = x*(a+d)
y_l = x*(a-d)
plt.fill_between(x,y_h, y_l)
plt.show()
As a workaround just sort your values before calculating your errorlines using sorted.
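When the error lines have already been computed in the original, unsorted order, sorting x alone would break the pairing between x and y. A minimal sketch of one way around that, assuming NumPy is available: numpy.argsort returns the permutation that sorts x, and the same permutation is then applied to every y array. The distance values are rounded from the question, and the Agg backend is chosen only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# x values in their original, unsorted order (rounded from the question)
distance = np.array([0.356, 0.554, 0.102, 0.134, 0.719, 0.142])
a, b = 1.17887019041, 1.83339229489
a_sd, b_sd = 0.159771527859, 0.0762509747218

# Error lines computed in the original order
lower_unsorted = (a - a_sd) * distance + (b - b_sd)
upper_unsorted = (a + a_sd) * distance + (b + b_sd)

# argsort gives the permutation that puts x in ascending order;
# applying it to every array keeps x and y values paired
order = np.argsort(distance)
x_sorted = distance[order]
lower = lower_unsorted[order]
upper = upper_unsorted[order]

fig, ax = plt.subplots()
ax.plot(x_sorted, a * x_sorted + b, "r")
band = ax.fill_between(x_sorted, lower, upper,
                       facecolor="lightgrey", edgecolor="none")
```

The same order array can be reused for any other per-point data, such as the measured times, if they need to be drawn as a connected line.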

Plot spectroscopic data from pandas dataframe in 3D with different array length

Is it possible to get something like this plot from a pandas DataFrame, in a similar fashion to how I would simply make 2D plots (df.plot())?
More precisely:
I have data that I read from csv files into pandas DataFrames with following structure:
1st level header A B C D E F
2nd level header 2.0 1.0 0.2 0.4 0.6 0.8
Index
126.4348 -467048 -814795 301388 298430 -187654 -1903170
126.4310 -468329 -810060 304366 305343 -192035 -1881625
126.4272 -469209 -804697 305795 312472 -197013 -1854848
126.4234 -469685 -799604 305647 318936 -200957 -1827665
126.4195 -469795 -795708 304101 323922 -202192 -1805153
126.4157 -469610 -793795 301497 326780 -199323 -1791743
126.4119 -469213 -794362 298257 327092 -191547 -1790418
126.4081 -468687 -797499 294817 324717 -178875 -1802122
126.4043 -468097 -802853 291546 319800 -162225 -1825540
126.4005 -467486 -809663 288700 312745 -143334 -1857270
126.3967 -466863 -816878 286401 304170 -124505 -1892389
126.3929 -466210 -823335 284645 294827 -108228 -1925312
126.3890 -465485 -827966 283331 285520 -96733 -1950795
126.3852 -464637 -829997 282315 277018 -91559 -1964894
126.3814 -463617 -829104 281457 269965 -93242 -1965702
126.3776 -462399 -825487 280670 264824 -101170 -1953728
126.3738 -460982 -819857 279942 261819 -113660 -1931820
126.3700 -459408 -813317 279344 260927 -128242 -1904669
126.3662 -457757 -807177 279009 261885 -142112 -1877955
126.3624 -456143 -802715 279090 264233 -152667 -1857303
126.3585 -454700 -800940 279722 267380 -158023 -1847241
126.3547 -453566 -802397 280969 270692 -157406 -1850358
126.3509 -452862 -807050 282792 273579 -151350 -1866803
126.3471 -452672 -814262 285033 275591 -141627 -1894249
126.3433 -453030 -822898 287426 276486 -130942 -1928303
126.3395 -453910 -831501 289627 276273 -122426 -1963297
126.3357 -455223 -838544 291266 275222 -119021 -1993312
126.3319 -456834 -842695 292004 273824 -122882 -2013246
126.3280 -458571 -843048 291599 272725 -134907 -2019718
126.3242 -460252 -839292 289952 272620 -154497 -2011656
... ... ... ... ... ... ...
What I would like to do with that
I would like to plot each of these columns (they are NMR spectra) against the index.
In a 2D overlay, this is simple usage of the pandas wrapper around matplotlib.
However, I would like to plot each spectrum in its own "line", along a third axis that has the second level headers as ticks.
I tried to use matplotlib's 3D plotting functionality, but it seems to only be suitable if you actually have three arrays of equal length, which in the case of my data just does not make sense, because each spectrum is recorded for one of the values from the second level header.
Am I maybe thinking too complicated when I try to make a 3D plot?
Is the figure I would like my plot to look like maybe not an actual 3D plot but rather some special version of overlaid 2D plots?
How I would prefer to do it
Bonus points for:
Using only python
Using only pandas and matplotlib
Already implemented functionality
If there is no obvious Python way to do it, I would also be happy with libraries of other languages that can do the same, such as R or Octave. I am just not as familiar with these, so I would probably not be able to adapt more hacky solutions in these languages to suit my requirements.
This question might be very similar, but as I understand it, it does not necessarily extend to software other than python and doesn't have an example of what the result should look like, so I am not sure if answers to that question might actually be helpful for this specific purpose.
What is wrong with matplotlib's gallery examples
As lanery pointed out, polygon3D from the matplotlib gallery gets close to what I wish for.
However it has some drawbacks some of which are not acceptable for most scientific publications:
With negative values, the whole plot gets shifted to what I would call "the middle of the screen", which looks kind of ugly, makes it hard to extract information from the figure and makes it different from the provided examples
You get that interactive plot window, which requires you to find an angle from which you can see everything you need to see. That might be good for some data exploration tasks, but if you use scripts for your visualization and a minor change to the graphic would force you to do some manual work again, this decreases the advantage you expect from scripting
If you have values that differ strongly and are not linear, something like [0,1,1.7,2.5,6.2], for your third dimension i.e. the second level header in this case, the 2d plots have very different distances from another, which is unacceptable, at least for any non-programming audience reading the publications
It is quite long and technical for a quite common plotting operation in spectroscopy. The amount of code would be fine if I wanted to build software that can make 3D plots in some context. For science it would be preferable to be able to accomplish something like this with a low amount of code.
I gave you an example of plotting with the data from the continuous X and Y, and just hard-coded z based on your second level header.
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions
# In a Jupyter notebook, run "%matplotlib inline" first to show the figure inline.
# Raw string, so that the backslashes in the Windows path are not read as escapes
df = pd.read_csv(r"C:\Users\User\SkyDrive\Documents\import_data.tcsv.txt", header=None)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Plot each spectrum against the first column, offset along z by its second level header value
x = df[0]
ax.plot(x, df[1], zs=2, zdir='z', label='A')
ax.plot(x, df[2], zs=1, zdir='z', label='B')
ax.plot(x, df[3], zs=0.2, zdir='z', label='C')
ax.plot(x, df[4], zs=0.4, zdir='z', label='D')
ax.plot(x, df[5], zs=0.6, zdir='z', label='E')
ax.plot(x, df[6], zs=0.8, zdir='z', label='F')
# Customize the view angle so that all the spectra are visible
ax.view_init(elev=-150., azim=40)
plt.show()
You're going to have to play with the options of view_init to rotate the plot and get the axes where you want them. I'm not really clear on what your end goal was, but this is the resulting plot.
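The hard-coded calls above can also be written as a loop, which additionally allows the spectra to be placed at evenly spaced positions that are labelled with the actual second level values, one of the concerns raised in the question. This is only a sketch under several assumptions: the data is synthetic, the columns are assumed to form a (label, value) MultiIndex, and the even spacing is a suggestion that is not part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the spectra: columns are (label, second level value)
shift = np.linspace(126.43, 126.32, 200)
columns = pd.MultiIndex.from_tuples(
    [("A", 2.0), ("B", 1.0), ("C", 0.2), ("D", 0.4), ("E", 0.6), ("F", 0.8)]
)
rng = np.random.default_rng(0)
values = rng.normal(size=(shift.size, len(columns))).cumsum(axis=0)
df = pd.DataFrame(values, index=shift, columns=columns)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Sort the spectra by their second level value and draw them at positions
# 0, 1, 2, ... so that unevenly spaced values do not bunch the lines together
ordered = sorted(df.columns, key=lambda col: col[1])
for position, col in enumerate(ordered):
    ax.plot(df.index.to_numpy(), df[col].to_numpy(),
            zs=position, zdir="y", label=col[0])

# Label the evenly spaced positions with the actual header values
ax.set_yticks(range(len(ordered)))
ax.set_yticklabels([str(col[1]) for col in ordered])

# A fixed view angle keeps the output reproducible when run from a script
ax.view_init(elev=30, azim=-60)
# fig.savefig("spectra_3d.png", dpi=150)  # uncomment to write the file
```

With zdir="y", each spectrum's intensity is drawn along the vertical axis and the position runs into the page, so the lines appear stacked behind one another.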

How to set the alpha value for matplotlib plots globally

Using Jupyter Notebooks, I often find myself repeatedly writing the following to change the alpha value of the plots:
plot(x,y1, alpha=.6)
plot(x,y2, alpha=.6)
...
I was hoping to find a matching value in rcParams to change the option globally, like:
plt.rcParams['lines.alpha'] = 0.6 #not working
What is a possible workaround to change the alpha value for all plots?
Unfortunately, based on their How to entry:
If you need all the figure elements to be transparent, there is currently no global alpha setting, but you can set the alpha channel on individual elements.
So, via matplotlib there's currently no way to do this.
What I usually do for global values is define an external configuration file, define values and import them to the appropriate scripts.
my_conf.py
# Parameters:
# matplotlib alpha
ALPHA = .6
my_plots.py
import my_conf as CONF
plot(x,y1, alpha=CONF.ALPHA)
plot(x,y2, alpha=CONF.ALPHA)
This usually helps in keeping configuration separated and easy to update.
Answering my own question: with the help of the matplotlib team, the following code does the job by changing the alpha value of the line colors globally:
alpha = 0.6
to_rgba = matplotlib.colors.ColorConverter().to_rgba
for i, col in enumerate(plt.rcParams['axes.color_cycle']):
    plt.rcParams['axes.color_cycle'][i] = to_rgba(col, alpha)
Note: In matplotlib 1.5 color_cycle will be deprecated and replaced by prop_cycle
The ability to set the alpha value via rcParams has also been added to the wishlist for version 2.1.
Updated version (a cleaner one may be possible):
import matplotlib.colors
import matplotlib.pyplot as plt
from cycler import cycler
alpha = 0.5
to_rgba = matplotlib.colors.ColorConverter().to_rgba
# rebuild the color cycle with the alpha channel baked into each color
color_list = []
for col in plt.rcParams['axes.prop_cycle']:
    color_list.append(to_rgba(col['color'], alpha))
plt.rcParams['axes.prop_cycle'] = cycler(color=color_list)
Another way to do it is just to specify the alpha and pass this to the rcParams
plt.rcParams['axes.prop_cycle'] = cycler(alpha=[0.5])
Bear in mind that this can be combined with any other cycling properties such as the color and line style.
cyc_color = cycler(color=['r','b','g'])
cyc_lines = cycler(linestyle=['-', '--', ':'])
cyc_alpha = cycler(alpha=[0.5, 0.3])
cyc = (cyc_alpha * cyc_lines * cyc_color)
Be careful with the order of your cyclers, as the sequence above will cycle through colors, then lines, then alphas.
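The ordering can be checked by listing the combined cycler, as in this small sketch. It uses only the cycler package; the variable styles is introduced here for illustration.

```python
from cycler import cycler

cyc_color = cycler(color=["r", "b", "g"])
cyc_lines = cycler(linestyle=["-", "--", ":"])
cyc_alpha = cycler(alpha=[0.5, 0.3])

# In a product of cyclers, the right-most factor varies fastest
cyc = cyc_alpha * cyc_lines * cyc_color

# Each entry is a dict of properties for one plotted line
styles = list(cyc)
for style in styles[:4]:
    print(style)
```

The first three entries differ only in color; the linestyle changes on the fourth entry, and the alpha changes only after all nine color and linestyle combinations have been used.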

Manually setting xticks with xaxis_date() in Python/matplotlib

I've been looking into how to make plots against time on the x axis and have it pretty much sorted, with one strange quirk that makes me wonder whether I've run into a bug or (admittedly much more likely) am doing something I don't really understand.
Simply put, below is a simplified version of my program. If I put this in a .py file and execute it from an interpreter (ipython) I get a figure with an x axis with the year only, "2012", repeated a number of times, like this.
However, if I comment out the line (40) that sets the xticks manually, namely 'plt.xticks(tk)' and then run that exact command in the interpreter immediately after executing the script, it works great and my figure looks like this.
Similarly it also works if I just move that line to be after the savefig command in the script, that's to say to put it at the very end of the file. Of course in both cases only the figure drawn on screen will have the desired axis, and not the saved file. Why can't I set my x axis earlier?
Grateful for any insights, thanks in advance!
import matplotlib.pyplot as plt
import datetime
from matplotlib.dates import date2num
# define arrays for x, y and errors
x=[16.7,16.8,17.1,17.4]
y=[15,17,14,16]
e=[0.8,1.2,1.1,0.9]
xtn=[]
# convert x to datetime format
for t in x:
    hours=int(t)
    mins=int((t-int(t))*60)
    secs=int(((t-hours)*60-mins)*60)
    dt=datetime.datetime(2012,1,1,hours,mins,secs)
    xtn.append(date2num(dt))
# set up plot
fig=plt.figure()
ax=fig.add_subplot(1,1,1)
# plot
ax.errorbar(xtn,y,yerr=e,fmt='+',elinewidth=2,capsize=0,color='k',ecolor='k')
# set x axis range
ax.xaxis_date()
t0=date2num(datetime.datetime(2012,1,1,16,35)) # x axis startpoint
t1=date2num(datetime.datetime(2012,1,1,17,35)) # x axis endpoint
plt.xlim(t0,t1)
# manually set xtick values
tk=[]
tk.append(date2num(datetime.datetime(2012,1,1,16,40)))
tk.append(date2num(datetime.datetime(2012,1,1,16,50)))
tk.append(date2num(datetime.datetime(2012,1,1,17,0)))
tk.append(date2num(datetime.datetime(2012,1,1,17,10)))
tk.append(date2num(datetime.datetime(2012,1,1,17,20)))
tk.append(date2num(datetime.datetime(2012,1,1,17,30)))
plt.xticks(tk)
plt.show()
# save to file
plt.savefig('savefile.png')
I don't think you need that call to xaxis_date(), since you are already providing the x-axis data in a format that matplotlib knows how to deal with. I also think there's something slightly wrong with your secs formula.
We can make use of matplotlib's built-in formatters and locators to:
set the major xticks to a regular interval (minutes, hours, days, etc.)
customize the display using a strftime formatting string
It appears that if a formatter is not specified, the default is to display the year, which is what you were seeing.
Try this out:
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, MinuteLocator
x = [16.7,16.8,17.1,17.4]
y = [15,17,14,16]
e = [0.8,1.2,1.1,0.9]
xtn = []
for t in x:
    h = int(t)
    m = int((t-int(t))*60)
    xtn.append(dt.datetime.combine(dt.date(2012,1,1), dt.time(h,m)))
def larger_alim( alim ):
    ''' simple utility function to expand axis limits a bit '''
    amin,amax = alim
    arng = amax-amin
    nmin = amin - 0.1 * arng
    nmax = amax + 0.1 * arng
    return nmin,nmax
plt.errorbar(xtn,y,yerr=e,fmt='+',elinewidth=2,capsize=0,color='k',ecolor='k')
plt.gca().xaxis.set_major_locator( MinuteLocator(byminute=range(0,60,10)) )
plt.gca().xaxis.set_major_formatter( DateFormatter('%H:%M:%S') )
plt.gca().set_xlim( larger_alim( plt.gca().get_xlim() ) )
plt.show()
Result:
FWIW the utility function larger_alim was originally written for this other question: Is there a way to tell matplotlib to loosen the zoom on the plotted data?
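On the secs formula mentioned at the start of this answer: the answer does not say what is wrong with it, but one plausible cause is that repeated int() truncation of floating point values can leave a result a minute or a second short. A sketch of an alternative conversion, using only the standard library, rounds to whole seconds once and lets timedelta do the arithmetic. The function name decimal_hours_to_datetime is hypothetical.

```python
import datetime as dt


def decimal_hours_to_datetime(hours, day=dt.date(2012, 1, 1)):
    """Convert decimal hours (e.g. 16.7) to a datetime on the given day."""
    # Round to whole seconds once, instead of truncating hours, minutes
    # and seconds separately with int()
    total_seconds = round(hours * 3600)
    midnight = dt.datetime.combine(day, dt.time())
    return midnight + dt.timedelta(seconds=total_seconds)


x = [16.7, 16.8, 17.1, 17.4]
xtn = [decimal_hours_to_datetime(t) for t in x]
```

The resulting datetime objects can be passed to errorbar directly, as in the code above.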
