Getting ggplot for Python to make a bar chart - python

Following this simple example, I am trying to make a dirt simple bar chart using yhat's ggplot python module. Here is the code suggested previously on StackOverflow:
In [1]:
from ggplot import *
import pandas as pd
df = pd.DataFrame({"x":[1,2,3,4], "y":[1,3,4,2]})
ggplot(aes(x="x", weight="y"), df) + geom_bar()
But I get an error:
Out[1]:
<repr(<ggplot.ggplot.ggplot at 0x104b18ad0>) failed: UnboundLocalError: local variable 'ax' referenced before assignment>

This works a newer version of ggplot-python. It's not that pretty (x-axis labels), we really have to work on that :-(

Related

Why am I getting "module 'rpy2.robjects.conversion' has no attribute 'py2rpy'" error?

I'm trying to convert a pandas dataframe into an R dataframe using the guide here. The code I have so far is:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
pd_df = pd.DataFrame({'int_values': [1, 2, 3],
'str_values': ['abc', 'def', 'ghi']})
with localconverter(ro.default_converter + pandas2ri.converter):
r_from_pd_df = ro.conversion.py2rpy(pd_df)
However, this is giving me the following error:
Traceback (most recent call last):
File <my_file_ref>, line 13, in <module>
r_from_pd_df = ro.conversion.py2rpy(pd_df)
AttributeError: module 'rpy2.robjects.conversion' has no attribute 'py2rpy'
I have found this relevant question where the OP refers to function names being changed however doesn't explain what the changes are. I've tried looking at the module but I'm not quite advanced enough to be able to make sense of it.
The accepted answer refers to checking versions which I've done and I'm definitely using rpy2 v3.3.3 which is the same as the guide I'm following.
Has anyone encountered this error and found a solution?
The section of the rpy2 documentation you are pointing out is built by running the code example. This is means that the example did work with the corresponding version of rpy2.
I am not sure you are using that version of rpy2 at runtime? For example, add print(rpy2.__version__) to check that this is the case.
For what is worth, the latest release in the rpy2 3.3.x series is 3.3.6 and there is probably no good reason to stay with 3.3.3. Otherwise rpy2 3.4.0 was just released; if using R 4.0 or greater, or the latest release of the R packages dplyr or ggplot2 together with their rpy2 wrapper, I would recommend to use that release.
[PS: I just tried your example with rpy2-3.4.0 and it runs without error]

Codes in Ipython vs Pycharm

I am a newbie and the following question may be dumb and not well written.
I tried the following block of codes in Ipython:
%pylab qt5
x = randn(100,100)
y = mean(x,0)
import seaborn
plot(y)
And it delivered a plot. Everything was fine.
However, when I copied and pasted those same lines of codes to Pycharm and tried running, syntax error messages appeared.
For instance,
%pylab was not recognized.
Then I tried to import numpy and matplotlib one by one. But then,
randn(.,.) was not recognized.
You can use IPython/Jupyter notebooks in PyCharm by following this guide:
https://www.jetbrains.com/help/pycharm/using-ipython-jupyter-notebook-with-pycharm.html
You may modify code like the snippet below in order to run in PyCharm:
from numpy.random import randn
from numpy import mean
import seaborn
x = randn(10, 10)
y = mean(x, 0)
seaborn.plt.plot(x)
seaborn.plt.show()

DataFrame object has no attribute 'sample'

Simple code like this won't work anymore on my python shell:
import pandas as pd
df=pd.read_csv("K:/01. Personal/04. Models/10. Location/output.csv",index_col=None)
df.sample(3000)
The error I get is:
AttributeError: 'DataFrame' object has no attribute 'sample'
DataFrames definitely have a sample function, and this used to work.
I recently had some trouble installing and then uninstalling another distribution of python. I don't know if this could be related.
I've previously had a similar problem when trying to execute a script which had the same name as a module I was importing, this is not the case here, and pandas.read_csv is actually working.
What could cause this?
As given in the documentation of DataFrame.sample -
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Returns a random sample of items from an axis of object.
New in version 0.16.1.
(Emphasis mine).
DataFrame.sample is added in 0.16.1 , you can either -
Upgrade your pandas version to latest, you can use pip for that, Example -
pip install pandas --upgrade
Or if you don't want to upgrade, and want to sample few rows from the dataframe, you can also use random.sample(), Example -
import random
num = 100 #number of samples
sampleddata = df.loc[random.sample(list(df.index),num)]

Matplotlib crashes

I'm try to learn data visualization with matplotlib, but it keeps crashing whenever I try the plot function:
>>> import pylab
>>> import numpy as np
>>> import pandas as pd
>>> names1880 = pd.read_csv('/root/yob1880.txt', names=['name', 'sex', 'births'])
>>> pieces = []
>>> for year, group in names.groupby(['year', 'sex']):
... pieces.append(group.sort_index(by='births', ascending=False)[:1000])
... top1000 = pd.concat(pieces, ignore_index=True)
>>> table = top1000.pivot_table(rows='year', cols='sex', aggfunc=sum)
>>> table.plot(title='Sum of table1000 by year and sex',
yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
Qt: Session management error: None of the authentication protocols specified are supported
<matplotlib.axes.AxesSubplot object at 0x3063610>
I'm running Kali Linux 1.0.7 The data frame looks fine, my code runs fine until I attempt to plot it, so why am I getting that error every time, I try using the pylab.show() function and it doesn't even plot the line?
Looks like the problem is with Qt not Matplotlib. What happens when you run qtcreator or something similar? If that fails than you probably need to rebuild Qt.

Rpy2 & ggplot2: LookupError 'print.ggplot'

Unhindered by any pre-existing knowledge of R, Rpy2 and ggplot2 I would never the less like to create a scatterplot of a trivial table from Python.
To set this up I've just installed:
Ubuntu 11.10 64 bit
R version 2.14.2 (from r-cran mirror)
ggplot2 (through R> install.packages('ggplot2'))
rpy2-2.2.5 (through easy_install)
Following this I am able to plot some example dataframes from an interactive R session using ggplot2.
However, when I merely try to import ggplot2 as I've seen in an example I found online, I get the following error:
from rpy2.robjects.lib import ggplot2
File ".../rpy2/robjects/lib/ggplot2.py", line 23, in <module>
class GGPlot(robjects.RObject):
File ".../rpy2/robjects/lib/ggplot2.py", line 26, in GGPlot
_rprint = ggplot2_env['print.ggplot']
File ".../rpy2/robjects/environments.py", line 14, in __getitem__
res = super(Environment, self).__getitem__(item)
LookupError: 'print.ggplot' not found
Can anyone tell me what I am doing wrong? As I said the offending import comes from an online example, so it might well be that there is some other way I should be using gplot2 through rpy2.
For reference, and unrelated to the problem above, here's an example of the dataframe I would like to plot, once I get the import to work (should not be a problem looking at the examples). The idea is to create a scatter plot with the lengths on the x axis, the percentages on the Y axis, and the boolean is used to color the dots, whcih I would then like to save to a file (either image or pdf). Given that these requirements are very limited, alternative solutions are welcome as well.
original.length row.retained percentage.retained
1 1875 FALSE 11.00
2 1143 FALSE 23.00
3 960 FALSE 44.00
4 1302 FALSE 66.00
5 2016 TRUE 87.00
There were changes in the R package ggplot2 that broke the rpy2 layer.
Try with a recent (I just fixed this) snapshot of the "default" branch (rpy2-2.3.0-dev) for the rpy2 code on bitbucket.
Edit: rpy2-2.3.0 is a couple of months behind schedule. I just pushed a bugfix release rpy2-2.2.6 that should address the problem.
Although I can't help you with a fix for the import error you're seeing, there is a similar example using lattice here: lattice with rpy2.
Also, the standard R plot function accepts coloring by using the factor function (which you can feed the row.retained column. Example:
plot(original.length, percentage.retained, type="p", col=factor(row.retained))
Based on fucitol's answer I've instead implemented the plot using both the default plot & lattice. Here are both the implementations:
from rpy2 import robjects
#Convert to R objects
original_lengths = robjects.IntVector(original_lengths)
percentages_retained = robjects.FloatVector(percentages_retained)
row_retained = robjects.StrVector(row_retained)
#Plot using standard plot
r = robjects.r
r.plot(x=percentages_retained,
y=original_lengths,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
#Plot using lattice
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr
lattice = importr('lattice')
formula = Formula('lengths ~ percentages')
formula.getenvironment()['lengths'] = original_lengths
formula.getenvironment()['percentages'] = percentages_retained
p = lattice.xyplot(formula,
col=row_retained,
main='Title',
xlab='Percentage retained',
ylab='Original length',
sub='subtitle',
pch=18)
rprint = robjects.globalenv.get("print")
rprint(p)
It's a shame I can't get ggplot2 to work, as it produces nicer graphs by default and I regard working with dataframes as more explicit. Any help in that direction is still welcome!
If you don't have any experience with R but with python, you can use numpy or pandas for data analysis and matplotlib for plotting.
Here is a small example how "this feels like":
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'original_length': [1875, 1143, 960, 1302, 2016],
'row_retained': [False, False, False, False, True],
'percentage_retained': [11.0, 23.0, 44.0, 66.0, 87.0]})
fig, ax = plt.subplots()
ax.scatter(df.original_length, df.percentage_retained,
c=np.where(df.row_retained, 'green', 'red'),
s=np.random.randint(50, 500, 5)
)
true_value = df[df.row_retained]
ax.annotate('This one is True',
xy=(true_value.original_length, true_value.percentage_retained),
xytext=(0.1, 0.001), textcoords='figure fraction',
arrowprops=dict(arrowstyle="->"))
ax.grid()
ax.set_xlabel('Original Length')
ax.set_ylabel('Precentage Retained')
ax.margins(0.04)
plt.tight_layout()
plt.savefig('alternative.png')
pandas also has an experimental rpy2 interface.
The problem is caused by the latest ggplot2 version which is 0.9.0. This version doesn't have the function print.ggplot() which is found in ggplot2 version 0.8.9.
I tried to tinker with the rpy2 code to make it work with the newest ggplot2 but the extend of the changes seem to be quite large.
Meanwhile, just downgrade your ggplot2 version to 0.8.9

Categories

Resources