I am using python's wonderful plotnine package. I would like to make a plot with dual y-axis, let's say Celsius on the left axis and Fahrenheit on the right.
I have installed the latest version of plotnine, v0.10.1.
This says the feature was added in v0.10.0.
I tried to follow the syntax on how one might do this in R's ggplot (replacing 'dot' notation with underscores) as follows:
import pandas as pd
from plotnine import *
df = pd.DataFrame({
'month':('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'),
'temperature':(26.0,25.8,23.9,20.3,16.7,14.1,13.5,15.0,17.3,19.7,22.0,24.2),
})
df['month'] = pd.Categorical(df['month'], categories=('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'), ordered=True)
p = (ggplot(df, aes(x='month', y='temperature'))
+ theme_light()
+ geom_line(group=1)
+ scale_y_continuous(
name='Celsius',
sec_axis=sec_axis(trans=~.*1.8+32, name='Fahrenheit')
)
)
p
This didn't like the specification of the transformation, so I tried a few different options. Removing this altogether produces the error:
NameError: name 'sec_axis' is not defined
The documentation does not contain a reference for sec_axis, and searching for 'secondary axis' doesn't help either.
How do you implement a secondary axis in plotnine?
The GitHub issue thread mentioned in the question does not actually say that the secondary axis feature has been implemented. It was added to the v0.10.0 milestone before that version was released; a milestone here is essentially a to-do list of what was planned for the release. Upon the actual release, however, the changelog does not mention the secondary axis feature, which means it was only planned and never implemented. Long story short, the planned feature didn't make it into development and release.
So, I'm sorry to say that as of v0.10.0, and now v0.10.1, this feature isn't in plotnine yet.
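Until the feature lands, one workaround is to drop down to the matplotlib figure underneath the plot and add the secondary axis there. Below is a minimal sketch in plain matplotlib (assuming matplotlib >= 3.1, which added secondary_yaxis); it is not plotnine syntax, just the same Celsius/Fahrenheit pairing done at the matplotlib level:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this sketch
import matplotlib.pyplot as plt

# The (forward, inverse) pair of transforms for the secondary axis
def c_to_f(c):
    return c * 1.8 + 32

def f_to_c(f):
    return (f - 32) / 1.8

temps = [26.0, 25.8, 23.9, 20.3, 16.7, 14.1,
         13.5, 15.0, 17.3, 19.7, 22.0, 24.2]

fig, ax = plt.subplots()
ax.plot(range(len(temps)), temps)
ax.set_ylabel("Celsius")

# secondary_yaxis takes the forward and inverse conversion functions
secax = ax.secondary_yaxis("right", functions=(c_to_f, f_to_c))
secax.set_ylabel("Fahrenheit")
```

Since a plotnine ggplot object can be rendered to a matplotlib figure (via its draw() method), the same trick can in principle be applied to the figure's first axes after drawing, though that is a hack rather than a supported plotnine feature.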
Related
I'm trying to understand the pointplot function (Link to pointplot doc) to plot error bars.
Setting the 'errorbar' argument to 'sd' should plot the standard deviation along with the mean. But calculating the standard deviation manually results in a different value.
I used the example provided in the documentation:
import seaborn as sns
df = sns.load_dataset("penguins")
ax = sns.pointplot(data=df, x="island", y="body_mass_g", errorbar="sd")
data = ax.lines[1].get_ydata()
print(data[1] - data[0]) # prints 248.57843137254895
sd = df[df['island'] == 'Torgersen']['body_mass_g'].std()
print(sd) # prints 445.10794020256765
I expected both printed values to be the same, since both data[1] - data[0] and sd should equal the standard deviation of the variable 'body_mass_g' for the category 'Torgersen'. The other standard deviations drawn by sns.pointplot are also not as expected.
I must be missing something obvious here but for the life of me I can't figure it out.
Appreciate any help. I tested the code locally and in google colab with the same results.
My PC had an outdated version of seaborn (0.11.2), where the argument 'errorbar' was named 'ci'. Using the correct argument resolves the problem. Strangely, Google Colab also uses version 0.11.2, contrary to their claim that they auto-update their packages.
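To make the same script behave on both old and new seaborn, a small (hypothetical) helper can pick the right keyword from the version string, since errorbar replaced ci in seaborn 0.12:

```python
def errorbar_kwargs(version: str) -> dict:
    """Return the spread-of-data keyword for a given seaborn version.

    seaborn >= 0.12 uses errorbar="sd"; older versions used ci="sd".
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (0, 12):
        return {"errorbar": "sd"}
    return {"ci": "sd"}
```

Usage would then be sns.pointplot(data=df, x="island", y="body_mass_g", **errorbar_kwargs(sns.__version__)).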
I am working on this Python notebook in Google Colab:
https://github.com/AllenDowney/ModSimPy/blob/master/notebooks/chap01.ipynb
I had to change the configuration line because the one stated in the original was erroring out:
# Configure Jupyter to display the assigned value after an assignment
# Line commented below because errors out
# %config InteractiveShell.ast_node_interactivity='last_expr_or_assign'
# Edit solution given below
%config InteractiveShell.ast_node_interactivity='last_expr'
However, I think the original statement was meant to show values of assignments (if I'm not mistaken), so that when I run the following cell in the notebook, I should see an output:
meter = UNITS.meter
second = UNITS.second
a = 9.8 * meter / second**2
If so, how can I make the notebook on Google Colab show the output of assignments?
The short answer is: you cannot show the output of assignments in Colab.
Your confusion comes from how Google Colab works. The original script is meant to run in IPython, but Colab is not a regular IPython. When you run an IPython shell, your %config InteractiveShell.ast_node_interactivity options are (citing the documentation):
‘all’, ‘last’, ‘last_expr’ , ‘last_expr_or_assign’ or ‘none’,
specifying which nodes should be run interactively (displaying output
from expressions). ‘last_expr’ will run the last node interactively
only if it is an expression (i.e. expressions in loops or other blocks
are not displayed) ‘last_expr_or_assign’ will run the last expression
or the last assignment. Other values for this parameter will raise a
ValueError.
all will display every expression value, but not the assignments, for example
x = 5
x
y = 7
y
Out[]:
5
7
The differences between the options become more significant when you want to display variables in the loop.
In Colab your options are restricted to ['all', 'last', 'last_expr', 'none']. If you select all, the result for the above cell will be
Out[]:
5
7
Summarizing all that, there is no way of showing the result of an assignment in Colab. Your only option (AFAIK) is to add the variable you want to see at the end of the cell where it is assigned (which is similar to a regular print):
meter = UNITS.meter
second = UNITS.second
a = 9.8 * meter / second**2
a
Google Colab has not yet been upgraded to the latest IPython version- if you explicitly upgrade with
!pip install -U ipython
then last_expr_or_assign will work.
Well the easy way is to just wrap your values in print statements like:
print(meter)
print(second)
print(a)
However, if you want to do it the jupyter way, looks like the answer is
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Found the above from this link: https://stackoverflow.com/a/36835741/7086982
As mentioned by others, it doesn't work because last_expr_or_assign was introduced in IPython 6.1, and Colab is using v5.x. Upgrading the IPython version on Colab might cause some instability (warning shown by Colab):
WARNING: Upgrading ipython, ipykernel, tornado, prompt-toolkit or pyzmq can
cause your runtime to repeatedly crash or behave in unexpected ways and is not
recommended
One other solution is to use an extension such as ipydex, which provides magic comments (like ##:, ##:T, ##:S) that display either the return value or the right-hand side of an assignment on that line.
!pip install ipydex
%load_ext ipydex.displaytools
e.g.
a = 4
c = 5 ##:
output
c := 5
---
I would like to get thin error bars using seaborn's regplot (thinner than the regression line).
The code below (adapted from here) sorts this out, but in a rather cumbersome way. Is there a more direct way to reach this, maybe via kws?
with matplotlib.rc_context({"lines.linewidth": 1}):
    sns.regplot(x='A', y='B', data=my_dataframe, x_jitter=10., ci=68,
                ax=ax, x_estimator=np.mean,
                scatter_kws={"s": 150},
                line_kws={"linewidth": 2})
There is an incoming pull request that I am working on which will let you specify the linewidth and choose whether the error bars should have caps. See
https://github.com/mwaskom/seaborn/pull/898
If you clone my repository as a temporary fix, you should be able to specify it right now by adding the keyword conf_lw, but I hope to have this integrated, with unit tests written, shortly.
I am using the rpy2 package to bring some R functionality to python.
The functions I'm using in R need a data.frame object, and by using rlike.TaggedList and then robjects.DataFrame I am able to make this work.
However I'm having performance issues, when comparing to the exact same R functions with the exact same data, which led me to try and use the rpy2 low level interface as mentioned here - http://rpy.sourceforge.net/rpy2/doc-2.3/html/performances.html
So far I have tried:
Using TaggedList with FloatSexpVector objects instead of numpy arrays, and the DataFrame object.
Dumping the TaggedList and DataFrame classes by using a dictionary like this:
d = dict((var_name, var_sexp_vector) for ...)
dataframe = robjects.r('data.frame')(**d)
Both did not get me any noticeable speedup.
I have noticed that DataFrame objects can take a rinterface.SexpVector in their constructor, so I thought of creating such a named vector, but I have no idea how to put in the names (in R I know it's just names(vec) = c('a','b',...)).
How do I do that? Is there another way?
And is there an easy way to profile rpy itself, so I could know where the bottleneck is?
EDIT:
The following code seems to work great (x4 faster) on a newer rpy2 (2.2.3):
data = ro.r('list')([ri.FloatSexpVector(x) for x in vectors])[0]
data.names = ri.StrSexpVector(vector_names)
However, it doesn't on version 2.0.8 (the last one supported on Windows), since R can't seem to use the names: "Error in eval(expr, envir, enclos) : object 'y' not found"
Ideas?
EDIT #2:
Someone did the fine job of building an rpy2.3 binary for Windows (Python 2.7); the code above works great with it (almost x6 faster for my code).
link: https://bitbucket.org/breisfeld/rpy2_w32_fix/issue/1/binary-installer-for-win32
Python can be several times faster than R (even byte-compiled R), and I managed to perform operations on R data structures with rpy2 faster than R would have. Sharing the relevant R and rpy2 code would help make more specific advice (and improve rpy2 if needed).
In the meantime, SexpVector might not be what you want; it is little more than an abstract class for all R vectors (see class diagram for rpy2.rinterface). ListSexpVector might be more appropriate:
import rpy2.rinterface as ri
ri.initr()
l = ri.ListSexpVector([ri.IntSexpVector((1,2,3)),
ri.StrSexpVector(("a","b","c")),])
An important detail is that R lists are recursive data structures, and R avoids a catch-22 type of situation by having the operator "[[" (in addition to "["). Python does not have that, and I have not (yet?) implemented "[[" as a method at the low level.
Profiling in Python can be done with the stdlib module cProfile, for example.
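A minimal sketch of such a profiling run, with a stand-in function where the rpy2 data.frame construction would go (build_frame here is just a placeholder for the code you want to measure):

```python
import cProfile
import io
import pstats

def build_frame():
    # Stand-in for the rpy2 data.frame construction being profiled
    return {name: [float(i) for i in range(1000)] for name in ("a", "b", "c")}

profiler = cProfile.Profile()
profiler.enable()
build_frame()
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The per-function breakdown shows whether the time goes into the Python-side conversion or into the embedded R calls.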
Unhindered by any pre-existing knowledge of R, rpy2 and ggplot2, I would nevertheless like to create a scatterplot of a trivial table from Python.
To set this up I've just installed:
Ubuntu 11.10 64 bit
R version 2.14.2 (from r-cran mirror)
ggplot2 (through R> install.packages('ggplot2'))
rpy2-2.2.5 (through easy_install)
Following this I am able to plot some example dataframes from an interactive R session using ggplot2.
However, when I merely try to import ggplot2 as I've seen in an example I found online, I get the following error:
from rpy2.robjects.lib import ggplot2
File ".../rpy2/robjects/lib/ggplot2.py", line 23, in <module>
class GGPlot(robjects.RObject):
File ".../rpy2/robjects/lib/ggplot2.py", line 26, in GGPlot
_rprint = ggplot2_env['print.ggplot']
File ".../rpy2/robjects/environments.py", line 14, in __getitem__
res = super(Environment, self).__getitem__(item)
LookupError: 'print.ggplot' not found
Can anyone tell me what I am doing wrong? As I said the offending import comes from an online example, so it might well be that there is some other way I should be using gplot2 through rpy2.
For reference, and unrelated to the problem above, here's an example of the dataframe I would like to plot once I get the import to work (which should not be a problem looking at the examples). The idea is to create a scatter plot with the lengths on the x axis, the percentages on the y axis, and the boolean used to color the dots, which I would then like to save to a file (either image or PDF). Given that these requirements are very limited, alternative solutions are welcome as well.
original.length row.retained percentage.retained
1 1875 FALSE 11.00
2 1143 FALSE 23.00
3 960 FALSE 44.00
4 1302 FALSE 66.00
5 2016 TRUE 87.00
There were changes in the R package ggplot2 that broke the rpy2 layer.
Try with a recent (I just fixed this) snapshot of the "default" branch (rpy2-2.3.0-dev) for the rpy2 code on bitbucket.
Edit: rpy2-2.3.0 is a couple of months behind schedule. I just pushed a bugfix release rpy2-2.2.6 that should address the problem.
Although I can't help you with a fix for the import error you're seeing, there is a similar example using lattice here: lattice with rpy2.
Also, the standard R plot function accepts coloring via the factor function (to which you can feed the row.retained column). Example:
plot(original.length, percentage.retained, type="p", col=factor(row.retained))
Based on fucitol's answer I've instead implemented the plot using both the default plot and lattice. Here are both implementations:
from rpy2 import robjects
#Convert to R objects
original_lengths = robjects.IntVector(original_lengths)
percentages_retained = robjects.FloatVector(percentages_retained)
row_retained = robjects.StrVector(row_retained)
#Plot using standard plot
r = robjects.r
r.plot(x=percentages_retained,
y=original_lengths,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
#Plot using lattice
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr
lattice = importr('lattice')
formula = Formula('lengths ~ percentages')
formula.getenvironment()['lengths'] = original_lengths
formula.getenvironment()['percentages'] = percentages_retained
p = lattice.xyplot(formula,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
rprint = robjects.globalenv.get("print")
rprint(p)
It's a shame I can't get ggplot2 to work, as it produces nicer graphs by default and I regard working with dataframes as more explicit. Any help in that direction is still welcome!
If you don't have any experience with R but do have some with Python, you can use numpy or pandas for the data analysis and matplotlib for the plotting.
Here is a small example of how this "feels like":
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'original_length': [1875, 1143, 960, 1302, 2016],
'row_retained': [False, False, False, False, True],
'percentage_retained': [11.0, 23.0, 44.0, 66.0, 87.0]})
fig, ax = plt.subplots()
ax.scatter(df.original_length, df.percentage_retained,
c=np.where(df.row_retained, 'green', 'red'),
s=np.random.randint(50, 500, 5)
)
true_value = df[df.row_retained]
ax.annotate('This one is True',
xy=(true_value.original_length, true_value.percentage_retained),
xytext=(0.1, 0.001), textcoords='figure fraction',
arrowprops=dict(arrowstyle="->"))
ax.grid()
ax.set_xlabel('Original Length')
ax.set_ylabel('Percentage Retained')
ax.margins(0.04)
plt.tight_layout()
plt.savefig('alternative.png')
pandas also has an experimental rpy2 interface.
The problem is caused by the latest ggplot2 version which is 0.9.0. This version doesn't have the function print.ggplot() which is found in ggplot2 version 0.8.9.
I tried to tinker with the rpy2 code to make it work with the newest ggplot2, but the extent of the changes seems to be quite large.
Meanwhile, just downgrade your ggplot2 version to 0.8.9.