Unexpected behavior of pyplot in the seaborn library. Bug? - python

I'm trying to understand seaborn's pointplot function in order to plot error bars.
Setting the 'errorbar' argument to 'sd' should plot the standard deviation along with the mean. But calculating the standard deviation manually results in a different value.
I used the example provided in the documentation:
import seaborn as sns
df = sns.load_dataset("penguins")
ax = sns.pointplot(data=df, x="island", y="body_mass_g", errorbar="sd")
data = ax.lines[1].get_ydata()
print(data[1] - data[0]) # prints 248.57843137254895
sd = df[df['island'] == 'Torgersen']['body_mass_g'].std()
print(sd) # prints 445.10794020256765
I expected both printed values to be the same, since both data[1] - data[0] and sd should be equal to the standard deviation of the variable 'body_mass_g' for the category 'Torgersen'. The other standard deviations provided by sns.pointplot are also not as expected.
I must be missing something obvious here, but for the life of me I can't figure it out.
I'd appreciate any help. I tested the code locally and in Google Colab with the same results.

My PC had an outdated version of seaborn (0.11.2), where the argument 'errorbar' was named 'ci'. Using the correct argument resolves the problem. Strangely, Google Colab also uses version 0.11.2, contrary to their claim that they auto-update their packages.
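Code that has to run on both older and newer installs can pick the argument name from the installed version. The following is a minimal sketch, assuming the 'errorbar' parameter is available from seaborn 0.12 onwards and that earlier releases expect 'ci'; errorbar_kwargs is a hypothetical helper, not part of seaborn.

```python
import re


def errorbar_kwargs(seaborn_version):
    """Return the keyword argument that requests standard-deviation error bars.

    Hypothetical helper: assumes `errorbar` exists from seaborn 0.12 on
    and that older releases expect `ci` instead.
    """
    # Keep only the leading "major.minor" digits, e.g. "0.11.2" -> (0, 11).
    match = re.match(r"(\d+)\.(\d+)", seaborn_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {seaborn_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (0, 12):
        return {"errorbar": "sd"}
    return {"ci": "sd"}


print(errorbar_kwargs("0.11.2"))  # {'ci': 'sd'}
print(errorbar_kwargs("0.12.1"))  # {'errorbar': 'sd'}
```

With seaborn imported as sns, the call from the question would then read sns.pointplot(data=df, x="island", y="body_mass_g", **errorbar_kwargs(sns.__version__)).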

Related

Plotnine : Secondary y-axis (dual axes)

I am using python's wonderful plotnine package. I would like to make a plot with dual y-axis, let's say Celsius on the left axis and Fahrenheit on the right.
I have installed the latest version of plotnine, v0.10.1.
This GitHub issue thread says the feature was added in v0.10.0.
I tried to follow the syntax on how one might do this in R's ggplot (replacing 'dot' notation with underscores) as follows:
import pandas as pd
from plotnine import *
df = pd.DataFrame({
'month':('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'),
'temperature':(26.0,25.8,23.9,20.3,16.7,14.1,13.5,15.0,17.3,19.7,22.0,24.2),
})
df['month'] = pd.Categorical(df['month'], categories=('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'), ordered=True)
p = (ggplot(df, aes(x='month', y='temperature'))
+ theme_light()
+ geom_line(group=1)
+ scale_y_continuous(
name='Celsius',
sec_axis=sec_axis(trans=~.*1.8+32, name='Fahrenheit')
)
)
p
This didn't like the specification of the transformation, so I tried a few different options. Removing this altogether produces the error:
NameError: name 'sec_axis' is not defined
The documentation does not contain a reference for sec_axis, and searching for 'secondary axis' doesn't help either.
How do you implement a secondary axis in plotnine?
The GitHub issue thread mentioned in the question does not say that the secondary axis feature has been implemented. The feature was added to the v0.10.0 milestone, which is a to-do list of what was planned for that release, before the release happened. The changelog for the actual release does not mention a secondary axis, so the feature was planned but did not make it into the release.
So, I'm sorry to say that as of v0.10.0 and v0.10.1 this feature does not seem to be available in plotnine.

Is statsmodel/exponential smoothing working correctly?

I am performing a time series analysis using statsmodels and the exponential smoothing method. I am trying to reproduce the results from
https://www.statsmodels.org/devel/examples/notebooks/generated/exponential_smoothing.html
with a particular dataframe (with the same format as the example, but only one outcome).
Here are the lines of code:
from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt
fit = ExponentialSmoothing(dataframe, seasonal_periods=4, trend='add', seasonal='mul', initialization_method="estimated").fit()
simulations = fit.simulate(5, repetitions=100, error='mul')
fit.fittedvalues.plot(ax=ax, style='--', color='green')
simulations.plot(ax=ax, style='-', alpha=0.05, color='grey', legend=False)
fit.forecast(8).rename('Holt-Winters (add-mul-seasonal)').plot(ax=ax, style='--', marker='o', color='green', legend=True)
However, when I run it, I get the error
TypeError: __init__() got an unexpected keyword argument 'initialization_method'
but when I check the parameters of ExponentialSmoothing in statsmodels, initialization_method is one of them, so I don't know what is happening there.
Moving forward, I removed initialization_method from the parameters of ExponentialSmoothing within the code, and then I get another error on the line below:
AttributeError: 'ExponentialSmoothing' object has no attribute 'simulate'
Again, I go and check if simulate is not deprecated in the latest version of statsmodels and no, it is still an attribute.
I upgraded the statsmodels, I upgraded pip and I still get the same errors.
What is going on there?
Thanks in advance for any help!
Indeed, there was a bug in the previous version that was corrected in the new version of statsmodels. Updating to statsmodels 0.12.0 solves the issue.
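If the errors persist after upgrading, one possible cause (an assumption, not something the thread confirms) is that pip upgraded a different environment than the one running the script. A minimal sketch for checking which version the running interpreter actually sees; installed_version, version_tuple and meets_minimum are hypothetical helper names built on the standard library's importlib.metadata:

```python
import re
import sys
from importlib import metadata


def installed_version(package):
    """Return the version string of `package` as seen by this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '0.12.0' or '0.13dev' into three leading integers, e.g. (0, 12, 0)."""
    # Drop any local build suffix such as "+abc123" before collecting digits.
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(package, minimum):
    """True if `package` is installed at version `minimum` or later."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)


print("interpreter:", sys.executable)
print("statsmodels:", installed_version("statsmodels"))
print("new enough: ", meets_minimum("statsmodels", "0.12.0"))
```

Running this with the same interpreter that raises the TypeError shows whether the upgrade reached that environment.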

skimage - TypeError: peak_local_max() got an unexpected keyword argument 'num_peaks_per_label'

The following code gives me the error present in the title :
from skimage.feature import peak_local_max
local_maxi = peak_local_max(imd,labels=iml,
indices=False,num_peaks_per_label=2)
Where imd is a "distance transformed image" which was obtained with :
from scipy import ndimage
imd = ndimage.distance_transform_edt(im)
im is the input binary image that I would like to later on segment with the watershed function of scikit-image. But to use this function properly, I first need to find the markers which will serve as the starting flooding points : that's what I'm trying to do with the 'peak_local_max' function.
Also, iml is the labeled version of im, that I got with :
from skimage.measure import label
iml = label(im)
I don't know what I've been doing wrong. Also, I've noticed that the function seems to totally ignore its num_peaks argument. For instance, when I do:
local_maxi = peak_local_max(imd,labels=iml,
indices=True,num_peaks=1)
I always get the same number of peaks detected as when I set num_peaks=500 or num_peaks=np.inf. What am I missing here please ?
As @a_guest pointed out, my version of skimage didn't match the version of the documentation I was referring to. The num_peaks_per_label argument is currently only available in the v0.13dev version. Updating to the dev version also fixed my problem with the num_peaks argument.
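A mismatch between the documentation and the installed version can be spotted without calling the function, by inspecting its signature. This is a minimal sketch using the standard library's inspect module; accepts_keyword is a hypothetical helper, and old_peak_local_max is a stand-in with an assumed older signature, used here so the example runs without scikit-image.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword (though it may ignore it).
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-in with an assumed older signature; the real check would pass
# skimage.feature.peak_local_max instead.
def old_peak_local_max(image, min_distance=1, num_peaks=float("inf"), labels=None):
    return []


print(accepts_keyword(old_peak_local_max, "num_peaks"))            # True
print(accepts_keyword(old_peak_local_max, "num_peaks_per_label"))  # False
```

Note that a function taking **kwargs reports True for any name, so a positive result there does not prove the argument is actually used.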

Suppressing Pandas dataframe plot output

I am plotting a dataframe:
ax = df.plot()
fig = ax.get_figure()
fig.savefig("{}/{}ts.png".format(IMGPATH, series[pfxlen:]))
It works fine. But, on the console, I get:
/usr/lib64/python2.7/site-packages/matplotlib/axes.py:2542: UserWarning: Attempting to set identical left==right results in singular transformations; automatically expanding. left=736249.924955, right=736249.924955
  + 'left=%s, right=%s') % (left, right))
Basic searching hasn't shown me how to solve this error. So, I want to suppress these errors, since they clutter the console. How can I do this?
Those aren't errors, but warnings. If you aren't concerned by those and just want to silence them, it's as simple as:
import warnings
warnings.filterwarnings('ignore')
Additionally, pandas and other libraries may trigger NumPy floating-point errors. If you encounter those, you have to silence them as well:
import numpy as np
np.seterr('ignore')
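Ignoring every warning globally can hide unrelated problems. A narrower variant is to silence only the one message, and only around the plotting call. The following is a minimal sketch using the standard warnings module; noisy_plot is a hypothetical stand-in for df.plot() that emits the same kind of UserWarning.

```python
import warnings


def noisy_plot():
    """Stand-in for df.plot(): emits the same kind of UserWarning."""
    warnings.warn(
        "Attempting to set identical left==right results in singular "
        "transformations; automatically expanding.",
        UserWarning,
    )
    return "plotted"


# Silence only this message, and only inside the block; the previous
# filter settings are restored when the block exits.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="Attempting to set identical left==right",
        category=UserWarning,
    )
    result = noisy_plot()

print(result)  # plotted
```

The message argument is a regular expression matched against the start of the warning text, so other UserWarnings still reach the console.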

Rpy2 & ggplot2: LookupError 'print.ggplot'

Unhindered by any pre-existing knowledge of R, Rpy2 and ggplot2, I would nevertheless like to create a scatterplot of a trivial table from Python.
To set this up I've just installed:
Ubuntu 11.10 64 bit
R version 2.14.2 (from r-cran mirror)
ggplot2 (through R> install.packages('ggplot2'))
rpy2-2.2.5 (through easy_install)
Following this I am able to plot some example dataframes from an interactive R session using ggplot2.
However, when I merely try to import ggplot2 as I've seen in an example I found online, I get the following error:
from rpy2.robjects.lib import ggplot2
File ".../rpy2/robjects/lib/ggplot2.py", line 23, in <module>
class GGPlot(robjects.RObject):
File ".../rpy2/robjects/lib/ggplot2.py", line 26, in GGPlot
_rprint = ggplot2_env['print.ggplot']
File ".../rpy2/robjects/environments.py", line 14, in __getitem__
res = super(Environment, self).__getitem__(item)
LookupError: 'print.ggplot' not found
Can anyone tell me what I am doing wrong? As I said, the offending import comes from an online example, so it might well be that there is some other way I should be using ggplot2 through rpy2.
For reference, and unrelated to the problem above, here's an example of the dataframe I would like to plot once I get the import to work (which should not be a problem, looking at the examples). The idea is to create a scatter plot with the lengths on the x axis and the percentages on the y axis, with the boolean used to color the dots, which I would then like to save to a file (either image or PDF). Given that these requirements are very limited, alternative solutions are welcome as well.
  original.length row.retained percentage.retained
1            1875        FALSE               11.00
2            1143        FALSE               23.00
3             960        FALSE               44.00
4            1302        FALSE               66.00
5            2016         TRUE               87.00
There were changes in the R package ggplot2 that broke the rpy2 layer.
Try with a recent (I just fixed this) snapshot of the "default" branch (rpy2-2.3.0-dev) for the rpy2 code on bitbucket.
Edit: rpy2-2.3.0 is a couple of months behind schedule. I just pushed a bugfix release rpy2-2.2.6 that should address the problem.
Although I can't help you with a fix for the import error you're seeing, there is a similar example using lattice here: lattice with rpy2.
Also, the standard R plot function accepts coloring by using the factor function (which you can feed the row.retained column). Example:
plot(original.length, percentage.retained, type="p", col=factor(row.retained))
Based on fucitol's answer I've instead implemented the plot using both the default plot & lattice. Here are both the implementations:
from rpy2 import robjects
#Convert to R objects
original_lengths = robjects.IntVector(original_lengths)
percentages_retained = robjects.FloatVector(percentages_retained)
row_retained = robjects.StrVector(row_retained)
#Plot using standard plot
r = robjects.r
r.plot(x=percentages_retained,
y=original_lengths,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
#Plot using lattice
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr
lattice = importr('lattice')
formula = Formula('lengths ~ percentages')
formula.getenvironment()['lengths'] = original_lengths
formula.getenvironment()['percentages'] = percentages_retained
p = lattice.xyplot(formula,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
rprint = robjects.globalenv.get("print")
rprint(p)
It's a shame I can't get ggplot2 to work, as it produces nicer graphs by default and I regard working with dataframes as more explicit. Any help in that direction is still welcome!
If you have experience with Python but not with R, you can use numpy or pandas for data analysis and matplotlib for plotting.
Here is a small example of what that feels like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'original_length': [1875, 1143, 960, 1302, 2016],
'row_retained': [False, False, False, False, True],
'percentage_retained': [11.0, 23.0, 44.0, 66.0, 87.0]})
fig, ax = plt.subplots()
ax.scatter(df.original_length, df.percentage_retained,
c=np.where(df.row_retained, 'green', 'red'),
s=np.random.randint(50, 500, 5)
)
true_value = df[df.row_retained]
ax.annotate('This one is True',
xy=(true_value.original_length.iloc[0], true_value.percentage_retained.iloc[0]),
xytext=(0.1, 0.001), textcoords='figure fraction',
arrowprops=dict(arrowstyle="->"))
ax.grid()
ax.set_xlabel('Original Length')
ax.set_ylabel('Percentage Retained')
ax.margins(0.04)
plt.tight_layout()
plt.savefig('alternative.png')
pandas also has an experimental rpy2 interface.
The problem is caused by the latest ggplot2 version which is 0.9.0. This version doesn't have the function print.ggplot() which is found in ggplot2 version 0.8.9.
I tried to tinker with the rpy2 code to make it work with the newest ggplot2, but the extent of the changes seems to be quite large.
Meanwhile, just downgrade your ggplot2 version to 0.8.9.
